Topology-Aware Knowledge Propagation in Decentralized Learning

University of Chicago
TOPOLOGY AWARE vs. UNAWARE DECENTRALIZED LEARNING

Topology-(un)aware aggregation for IID vs. OOD knowledge propagation. CIFAR10 is distributed across 64 nodes, with OOD data placed on the node with the fourth-highest degree. The topology-unaware aggregation strategy is Unweighted; the topology-aware strategy is Degree. Green indicates higher test accuracy on the respective dataset after 40 rounds of training; white indicates lower accuracy. Our proposed topology-aware method (right) achieves higher OOD test accuracies without sacrificing IID accuracies.

Abstract

Decentralized learning enables collaborative training of models across naturally distributed data without centralized coordination or maintenance of a global model. Instead, devices are organized in arbitrary communication topologies, in which they can only communicate with neighboring devices. Each device maintains its own local model by training on its local data and integrating new knowledge via model aggregation with neighbors. Therefore, knowledge is propagated across the topology via successive aggregation rounds. We study, in particular, the propagation of out-of-distribution (OOD) knowledge. We find that popular decentralized learning algorithms struggle to propagate OOD knowledge effectively to all devices. Further, we find that both the location of OOD data within a topology, and the topology itself, significantly impact OOD knowledge propagation. We then propose topology-aware aggregation strategies to accelerate (OOD) knowledge propagation across devices. These strategies improve OOD data accuracy, compared to topology-unaware baselines, by 123% on average across models in a topology.

What is decentralized learning?

Most machine learning training data are generated, collected, and sensed from decentralized sources: Internet-of-Things, edge/fog/cloudlet computing systems, sensor networks, smart grids, and smart transportation networks. Because most data are naturally decentralized, a question arises: How do we train models across decentralized data?

Decentralized Learning enables collaborative learning across decentralized data without creating a single global model or requiring that data be centralized. Instead, training devices are located at/near data generation sites. Each device maintains its model by training over local data and integrating additional (non-local) knowledge by periodically receiving neighboring devices’ models and aggregating them with its local model. Devices are organized in a flexible topology in which nodes represent devices and edges/links represent communication channels. Communication channels between devices can be a function of factors like physical locality, administrative connections, and privacy concerns.


Arguments:
    M: set of models in the topology
    S: aggregation strategy

DecentralizedLearning(M, S):
    for m_i ∈ M do:
        Initialize m_i^0                                   # model at device i
        Initialize x_i                                     # data at device i
    for t ∈ Rounds do:
        for m_i ∈ M do:
            m_i^{t+1/2} ← LocalTrain(m_i^t, x_i)           # local training on device i's data
        for m_i ∈ M do:
            N_i ← neighbors(i) ∪ {i}                       # devices in i's neighborhood, including i itself
            C_i ← GetAggregationCoeffs(N_i, S)             # aggregation coefficients for the neighborhood
            m_i^{t+1} ← ∑_{j ∈ N_i} C_{i,j} · m_j^{t+1/2}  # weighted average of neighborhood models
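Below is a minimal Python sketch of this training loop, assuming models are flat NumPy parameter vectors and that LocalTrain and GetAggregationCoeffs are supplied as callables. It illustrates the structure of the pseudocode above rather than reproducing the implementation used in the paper.

import numpy as np

def decentralized_learning(models, data, neighbors, local_train, get_coeffs, rounds=40):
    """Sketch of the pseudocode above.

    models:      dict mapping device id -> flat parameter vector (np.ndarray)
    data:        dict mapping device id -> local dataset
    neighbors:   dict mapping device id -> set of neighboring device ids
    local_train: callable(params, local_data) -> updated params    (LocalTrain)
    get_coeffs:  callable(neighborhood ids) -> {device id: weight} (GetAggregationCoeffs)
    """
    for _ in range(rounds):
        # Local training: each device updates its own model on its own data.
        half_step = {i: local_train(m, data[i]) for i, m in models.items()}

        # Aggregation: each device takes a weighted average of its neighborhood's models.
        new_models = {}
        for i in models:
            neighborhood = neighbors[i] | {i}        # include the device's own model
            coeffs = get_coeffs(neighborhood)        # coefficients should sum to 1
            new_models[i] = sum(coeffs[j] * half_step[j] for j in neighborhood)
        models = new_models
    return models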

What is knowledge propagation?

The flexibility afforded by decentralized learning can come at a cost, in particular slower convergence and regional/hyperpersonalized models that lack knowledge from distant devices. To prevent this from happening, we aim for each device-specific model to be performant over the global data distribution across all devices in a topology; device-specific models must be generalizable beyond their local data distribution so that they are performant on out-of-distribution (OOD) inference requests. This is especially challenging in decentralized learning as the only way for device-specific knowledge to propagate in a topology is by “hopping” between devices via successive aggregation rounds.

Here, we study knowledge propagation in decentralized topologies by asking: How can each device-specific model learn from all data present in a topology, regardless of its location, in as few aggregation rounds as possible? This goal is especially challenging in settings where data are not independently and identically distributed (IID) across devices as devices have no knowledge of how data are distributed globally.

We study the extreme case in which most data in a topology are IID, with the exception of a single device that contains OOD data. In the figure below, we report the average percent difference in test-accuracy AUC between IID and OOD data over 40 rounds of training across all devices in a topology, averaged again over 3 realistic 33-device topologies and 3 seeds. OOD data were placed on the node with the fourth-highest degree. A lower percent difference indicates that the OOD data did not propagate to as many nodes as the IID data.

OOD vs. IID knowledge propagation. Existing decentralized learning strategies (i.e., FL, Unweighted, Weighted, Random) struggle to propagate OOD knowledge.
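For concreteness, one plausible reading of this metric is sketched below: compute the area under each device's per-round test-accuracy curve (trapezoidal rule) for both IID and OOD data, then report the OOD AUC relative to the IID AUC. The exact formula used in the paper may differ.

import numpy as np

def accuracy_auc(acc_per_round):
    """Area under the accuracy-vs-round curve (trapezoidal rule, unit spacing)."""
    acc = np.asarray(acc_per_round, dtype=float)
    return float(np.sum((acc[1:] + acc[:-1]) / 2.0))

def percent_diff_auc(iid_acc, ood_acc):
    """Percent difference between OOD and IID test-accuracy AUCs for one device.

    Negative values mean OOD knowledge propagated less than IID knowledge.
    """
    iid_auc = accuracy_auc(iid_acc)
    ood_auc = accuracy_auc(ood_acc)
    return 100.0 * (ood_auc - iid_auc) / iid_auc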

How can we accelerate knowledge propagation?

Traditional decentralized learning strategies struggle to propagate OOD knowledge. We hypothesize that this may be because they fail to account for a node’s (non)beneficial location in a topology:

  1. Unweighted: Models in a neighborhood are equally weighted.
  2. Weighted: Models in a neighborhood are weighted by the number of training data points.
  3. Random: Models in a neighborhood are assigned random weights drawn from a uniform distribution.
  4. FL: Assumes a fully-connected topology; models are uniformly weighted.
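As a reference point, these baseline coefficient assignments can be sketched as follows (device ids and data counts are placeholders; FL corresponds to uniform weighting over a fully-connected topology):

import numpy as np

def topology_unaware_coeffs(neighborhood, strategy, n_data=None, rng=None):
    """Topology-unaware aggregation coefficients for one device's neighborhood.

    neighborhood: iterable of device ids (the device itself plus its neighbors)
    strategy:     "unweighted" | "weighted" | "random"
    n_data:       dict mapping device id -> number of local training points (for "weighted")
    """
    ids = list(neighborhood)
    if strategy == "unweighted":
        weights = np.ones(len(ids))
    elif strategy == "weighted":
        weights = np.array([n_data[i] for i in ids], dtype=float)
    elif strategy == "random":
        rng = rng or np.random.default_rng()
        weights = rng.uniform(size=len(ids))
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    weights = weights / weights.sum()        # normalize so coefficients sum to 1
    return dict(zip(ids, weights))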
To use variation in topology to our advantage and accelerate knowledge propagation, we propose topology-aware aggregation strategies for decentralized learning. Topology-aware aggregation strategies allow each device to account for its own and its neighbors’ locations in a topology when aggregating models:
  1. Degree: Models in a neighborhood are weighted by the number of edges their respective nodes have.
  2. Betweenness: Models in a neighborhood are weighted by the "betweenness" score of their respective nodes.
Topology Aware vs. Unaware Aggregation. Numerous network science metrics can be used to quantify a node’s location within a topology: some metrics quantify a node’s location with respect to its neighborhood (local), others with respect to the entire topology (global). We study Degree (local) because it measures how many neighbors a node has and, by proxy, how well positioned a node is to spread knowledge to its neighbors. We also study Betweenness (global) because it measures how often a node lies on the shortest path between pairs of nodes in a topology and, by proxy, how well positioned a node is to serve as a bridge that shortens the number of hops knowledge must travel between nodes in the topology.
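Below is a sketch of how these topology-aware coefficients could be computed with NetworkX; normalizing within each neighborhood keeps the coefficients summing to 1, as in the aggregation step of the pseudocode earlier. It mirrors the idea above rather than reproducing the paper's exact code.

import networkx as nx

def topology_aware_coeffs(G, device, metric="degree"):
    """Aggregation coefficients for `device`'s neighborhood, weighted by a centrality metric.

    G:      undirected communication topology (nodes are devices)
    metric: "degree" (local) or "betweenness" (global)
    """
    if metric == "degree":
        scores = dict(G.degree())                  # number of edges per node
    elif metric == "betweenness":
        scores = nx.betweenness_centrality(G)      # fraction of shortest paths passing through each node
    else:
        raise ValueError(f"unknown metric: {metric}")

    neighborhood = set(G.neighbors(device)) | {device}
    total = sum(scores[j] for j in neighborhood)
    if total == 0:                                 # e.g., all betweenness scores in the neighborhood are zero
        return {j: 1.0 / len(neighborhood) for j in neighborhood}
    return {j: scores[j] / total for j in neighborhood}

# Example: degree-weighted coefficients on a 33-node Barabási-Albert topology.
G = nx.barabasi_albert_graph(33, 2, seed=0)
print(topology_aware_coeffs(G, device=0, metric="degree"))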

Topology-Aware vs. Topology-Unaware Learning

Topology-aware aggregation strategies (Degree, Betweenness) lead to higher levels of OOD knowledge propagation for each topology and dataset. In the figure below, OOD data are located on the node with the highest degree. Green indicates the node with OOD data.

Topology Aware vs. Unaware Aggregation.

Impact of Data Placement on Knowledge Propagation

There is a negative relationship between the degree of the device on which OOD data are located and the propagation of the OOD data. While this negative trend holds across all aggregation strategies, the topology-aware strategies (Degree, Betweenness) outperform the topology-unaware aggregation strategies (i.e., Unweighted, Weighted, Random, and even traditional FL). In the figure below, OOD data location is varied across the four highest-degree nodes in each topology (we successively place the OOD data on nodes with lower degree).

Impact of data placement.

Impact of Topology on Knowledge Propagation

We study the impact of network topology on OOD data propagation from the perspective of topology degree, modularity, and number of nodes. Topology-aware aggregation strategies outperform topology-unaware strategies in a diverse set of topologies. In the figure below, we study learning over the CIFAR10 dataset.

Impact of data placement.

To study the impact of degree, we study a set of Barabási-Albert (BA) topologies, each with 33 nodes and varying degree parameters. BA topologies are scale-free networks often used to model real-world networks such as the internet, citation networks, and social networks. We find that as degree increases, OOD data propagates further in a topology.
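Such a sweep can be generated with NetworkX's Barabási-Albert generator, where the attachment parameter m controls how dense (and how high-degree) the resulting topology is. The m values below are illustrative, not necessarily the ones used in our experiments.

import networkx as nx

# 33-node Barabási-Albert topologies with increasing attachment parameter m.
for m in (1, 2, 4):                                   # illustrative values
    G = nx.barabasi_albert_graph(n=33, m=m, seed=0)
    degrees = [d for _, d in G.degree()]
    print(f"m={m}: mean degree={sum(degrees) / len(degrees):.1f}, max degree={max(degrees)}")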

To study the impact of modularity, we study a set of Stochastic-Block (SB) topologies, each with 33 nodes and varying modularity. SB models are commonly used to generate topologies with modular sub-communities in fields such as social network analysis. We find that as modularity decreases, OOD data propagates further in a topology (see below).
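Such topologies can be generated with NetworkX's stochastic block model: raising the inter-community edge probability lowers the resulting modularity. The block sizes and probabilities below are illustrative.

import networkx as nx
from networkx.algorithms.community import modularity

sizes = [11, 11, 11]                                  # three communities, 33 nodes total
communities = [set(range(0, 11)), set(range(11, 22)), set(range(22, 33))]
for p_out in (0.01, 0.05, 0.15):                      # higher inter-community probability -> lower modularity
    probs = [[0.5 if i == j else p_out for j in range(3)] for i in range(3)]
    G = nx.stochastic_block_model(sizes, probs, seed=0)
    print(f"p_out={p_out}: modularity={modularity(G, communities):.2f}")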

Impact of data placement.

To study the impact of node count, we study both Barabási-Albert (BA) and Watts-Strogatz (WS) topologies. While BA topologies more realistically model many real-world phenomena, WS also generates topologies with small-world properties. However, unlike BA, WS topologies do not have the power-law degree distribution observed in many real-world networks. We find that while node count does not seem to impact knowledge propagation for topology-aware strategies in BA topologies, it negatively affects knowledge propagation for topology-unaware strategies. For WS topologies, however, both topology-aware and -unaware strategies are negatively impacted by node count. We explain this as follows: the degree distribution in BA topologies follows a power law, so the two topology-aware metrics we studied (Degree and Betweenness) can successfully disambiguate devices at different locations; WS topologies, however, have a more uniform degree distribution, so topology-aware metrics do not differ significantly in their aggregation coefficient assignments from topology-unaware metrics.
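This intuition can be checked directly: under nominal parameters, a BA topology's node degrees vary widely (so Degree and Betweenness assign markedly different coefficients to different devices), while a WS topology's degrees cluster tightly around the mean. The parameters below are illustrative.

import networkx as nx
import numpy as np

n = 33
ba = nx.barabasi_albert_graph(n, m=2, seed=0)          # power-law-like degree distribution
ws = nx.watts_strogatz_graph(n, k=4, p=0.1, seed=0)    # near-uniform degree distribution

for name, G in [("BA", ba), ("WS", ws)]:
    degrees = np.array([d for _, d in G.degree()])
    print(f"{name}: degree min={degrees.min()}, max={degrees.max()}, std={degrees.std():.2f}")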

Conclusion

Machine learning training data are largely generated, collected, and sensed from decentralized sources. Decentralized learning algorithms enable learning over these naturally decentralized data without centralized coordination; instead, training devices self-organize into communication topologies that arise from real-world constraints (e.g., physical locality, administrative connections, privacy concerns). In decentralized learning, because devices can only communicate with neighboring devices, knowledge propagates via model aggregation between neighbors. We find a critical limitation in existing decentralized learning strategies: they struggle to propagate OOD knowledge to the same extent as IID knowledge. This limitation hurts the performance of models that cannot learn from OOD data present elsewhere in the topology.

We find that the propagation of OOD knowledge is greatly impacted by both the location of OOD data in a topology and the topology itself. To address these challenges, we introduce topology-aware decentralized learning strategies that enable reliable propagation of OOD knowledge in arbitrary communication topologies. We demonstrate that our proposed topology-aware aggregation strategies outperform traditional aggregation strategies. We also study the impact of topology node count, modularity, and degree distribution on topology-aware aggregation strategy performance. We show that regardless of how these values are varied, topology-aware methods perform as well as, or better than, traditional aggregation strategies.

Further details about all experiments and figures discussed in this blog can be found in the main paper. If you have any questions, feel free to email the first author for clarification.

BibTeX


@article{sakarvadia2025topology,
      title={Topology-Aware Knowledge Propagation in Decentralized Learning}, 
      author={Mansi Sakarvadia and Nathaniel Hudson and Tian Li and Ian Foster and Kyle Chard},
      year={2025},
      eprint={2505.11760},
      url={https://arxiv.org/abs/2505.11760}, 
}