**Convergent Learning: Do different neural networks learn the same representations?**

Yixuan Li¹, Jason Yosinski¹, Jeff Clune², Hod Lipson³ and John Hopcroft¹

¹Department of Computer Science, Cornell University

²Department of Computer Science, University of Wyoming

³Department of Mechanical Engineering, Columbia University

MLDG

Dec 2, 2015

Figure: conv1 and conv2 feature maps from Net1 and Net2.

**Convolutional Neural Networks**

An interesting phenomenon: networks with the same architecture trained starting at different random initializations frequently converge to solutions with similar performance (Dauphin et al. 2014).

[2] Jason Yosinski, Jeff Clune, Anh Nguyen, Thomas Fuchs, and Hod Lipson. Understanding Neural Networks Through Deep Visualization. Deep Learning Workshop, International Conference on Machine Learning (ICML). 10 July 2015.

**Motivation**

**To what extent do the learned internal representations differ?**

**Do they learn radically different sets of features that happen to perform similarly? Or do they exhibit convergent learning, meaning that their learned feature representations are largely the same?**

**What do neural nets learn in the middle?**

**Outline**

- One-to-one alignment
  - Correlation
  - Mutual Information
- Many-to-one: Sparse Prediction
- Many-to-many: Spectral Clustering

[3] Picture source: http://devblogs.nvidia.com/parallelforall/accelerate-machine-learning-cudnn-deep-neural-network-library/

**Neuron Activation Statistics**

Mean: μ_i = E[x_i]

Standard deviation: σ_i = sqrt(E[(x_i − μ_i)²])

Within-net correlation: corr(x_i, x_j) = E[(x_i − μ_i)(x_j − μ_j)] / (σ_i σ_j), for units i and j in the same network

Between-net correlation: corr(x_i, y_j) = E[(x_i − μ_i)(y_j − μ_j)] / (σ_{x_i} σ_{y_j}), for unit i in Net1 and unit j in Net2

Figure: Correlation matrices for the conv1 layer, displayed as images with minimum value at black and maximum at white.

_________

[2] For reference, the number of channels for conv1 to fc8 is given by: S = {96, 256, 384, 384, 256, 4096, 4096, 1000}.
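A minimal sketch of how such a between-net correlation matrix could be computed with NumPy. Array names and shapes here are illustrative assumptions, not the paper's code; `acts1` and `acts2` hold activations collected from the two networks on the same inputs.

```python
import numpy as np

def between_net_correlation(acts1, acts2):
    """Pearson correlation between every unit in Net1 and every unit in Net2.

    acts1, acts2: (num_samples, num_units) activations on the same inputs.
    Returns a (num_units, num_units) correlation matrix.
    """
    # Standardize each unit's activations to zero mean and unit std.
    z1 = (acts1 - acts1.mean(axis=0)) / acts1.std(axis=0)
    z2 = (acts2 - acts2.mean(axis=0)) / acts2.std(axis=0)
    # Mean product of standardized activations = Pearson correlation.
    return z1.T @ z2 / acts1.shape[0]

# Toy check: Net2's units are a noisy, reversed copy of Net1's.
rng = np.random.default_rng(0)
a = rng.normal(size=(1000, 4))
b = a[:, ::-1] + 0.1 * rng.normal(size=(1000, 4))
C = between_net_correlation(a, b)
```

On this toy data the anti-diagonal of `C` is near 1, mirroring the bright off-diagonal structure in the real correlation matrices.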

**Is there a one-to-one alignment between features learned by different neural networks?**

greedy matching (semi-matching)

Figure: Eight highest correlation matched features and eight lowest correlation matched features using greedy matching.
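Greedy semi-matching simply pairs each Net1 unit with its highest-correlation Net2 unit, allowing a Net2 unit to be reused. A minimal sketch, assuming a precomputed between-net correlation matrix `corr` (the variable names are illustrative):

```python
import numpy as np

def greedy_semi_matching(corr):
    """Match each Net1 unit (row) to its highest-correlation Net2 unit (column).

    Net2 units may be chosen by multiple Net1 units, so this is a
    semi-matching rather than a one-to-one assignment.
    """
    match = corr.argmax(axis=1)                       # best Net2 partner per row
    scores = corr[np.arange(corr.shape[0]), match]    # correlation of each pair
    return match, scores

corr = np.array([[0.9, 0.2],
                 [0.8, 0.3]])
match, scores = greedy_semi_matching(corr)
# Both Net1 units pick Net2 unit 0 -- reuse is allowed under semi-matching.
```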

**Is there a one-to-one alignment between features learned by different neural networks?**

max bipartite matching

The diagonal of the permuted matrix is bright in some places, which shows that for these units in Net1 it is possible to find a unique, highly correlated unit in Net2.
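The max bipartite matching can be computed with the Hungarian algorithm, which finds the one-to-one assignment maximizing total correlation. A sketch using SciPy, again assuming a precomputed correlation matrix:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def max_bipartite_matching(corr):
    """One-to-one assignment of Net1 units to Net2 units that maximizes
    the summed correlation (Hungarian algorithm on the negated matrix)."""
    rows, cols = linear_sum_assignment(-corr)  # negate to maximize
    return cols, corr[rows, cols]

corr = np.array([[0.9, 0.2],
                 [0.8, 0.3]])
perm, scores = max_bipartite_matching(corr)
# Unlike greedy semi-matching, each Net2 unit is used exactly once:
# unit 0 -> Net2 unit 0 (0.9), unit 1 -> Net2 unit 1 (0.3).
```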

**Is there a one-to-one alignment between features learned by different neural networks?**

Take-aways:

- For many units, a one-to-one alignment is possible.
- Both methods reveal that the average correlation for one-to-one alignments varies from layer to layer, with the highest matches in conv1 and the lowest in conv4.
- The two networks learn different numbers of units to span certain subspaces.
- Some filters in Net1 can be paired up with filters in Net2 with high correlation, but other filters in Net1 and Net2 are network-specific and have no high-correlation pairing in the alternate network, implying that those filters are rare and not always learned.

**Is there a one-to-one alignment between features learned by different neural networks?**

Alignment via mutual information:

These results suggest that correlation is an adequate measure of the similarity between two neurons, and that switching to a mutual-information metric would not qualitatively change the correlation-based conclusions presented on the previous slide.
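As an illustration of the mutual-information alternative, here is a simple histogram (plug-in) estimator for the MI between two units' activations; the paper's exact estimator may differ.

```python
import numpy as np

def mutual_information(x, y, bins=10):
    """Histogram-based mutual information estimate (in nats) between two
    units' activations. A crude plug-in estimator, sufficient to compare
    a strongly dependent pair against an independent one."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint / joint.sum()                 # joint distribution
    p_x = p_xy.sum(axis=1, keepdims=True)      # marginal of x
    p_y = p_xy.sum(axis=0, keepdims=True)      # marginal of y
    nz = p_xy > 0                              # avoid log(0)
    return float((p_xy[nz] * np.log(p_xy[nz] / (p_x @ p_y)[nz])).sum())

rng = np.random.default_rng(0)
x = rng.normal(size=5000)
noisy_copy = x + 0.1 * rng.normal(size=5000)   # strongly dependent unit
independent = rng.normal(size=5000)            # unrelated unit
```

On data like this, the dependent pair yields a much larger MI than the independent pair, in line with the correlation-based picture.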

**Relaxing the one-to-one constraint to find sparse, few-to-one mappings**

Learn mapping layers with an L1 penalty

(known as a LASSO model, (Tibshirani, 1996))

The mapping layer's parameters form a square weight matrix with side length equal to the number of units in the layer; the layer learns to predict each unit in one network as a linear weighted sum of any number of units in the other.

Architecture of training sparse mapping layers
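A hedged sketch of the idea, using scikit-learn's `Lasso` in place of an SGD-trained mapping layer, and synthetic data in place of real activations: the L1 penalty drives most weights to zero, so each target unit ends up predicted by only a few source units.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
acts1 = rng.normal(size=(2000, 8))   # stand-in for Net1 layer activations
# Hypothetical Net2 unit driven by just two of Net1's units, plus noise.
target = 0.7 * acts1[:, 2] - 0.5 * acts1[:, 5] + 0.01 * rng.normal(size=2000)

# L1 penalty (LASSO) encourages a sparse, few-to-one mapping.
model = Lasso(alpha=0.01).fit(acts1, target)
support = np.flatnonzero(np.abs(model.coef_) > 1e-3)
# `support` recovers the two Net1 units that actually drive the target.
```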

**Relaxing the one-to-one constraint to find sparse, few-to-one mappings**

Figure: (a) sparse mapping layer for conv1; (b) permuted sparse mapping matrix; (c) examples of few-to-one prediction.

**Relaxing the one-to-one constraint to find sparse, few-to-one mappings**

local vs. distributed code:

The units that match well one-to-one suggest the presence of a local code. Each of these dimensions is important enough, independent enough, or privileged enough in some other way to be relearned by different networks.

Units that do not match well one-to-one, but are predicted well by a sparse model, suggest the presence, along those dimensions, of slightly distributed codes.

Table: Average squared prediction errors for each layer

For the conv1 and conv2 layers, the prediction errors do not rise with the imposition of a sparsity penalty until the penalty exceeds 10^−3.

That such extreme sparsity does not hurt performance implies that each neuron in one network can be predicted by only one or a few neurons in another network.

The mapping layers for higher layers (conv3 – conv5) showed poor performance even without regularization, for reasons we do not yet fully understand, so further results on those layers are not included here.
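The flat-then-rising error pattern can be reproduced on synthetic data by sweeping the L1 penalty (a toy illustration, not the paper's experiment): when the target truly depends on only a few inputs, error stays near the noise floor until the penalty is large enough to zero out the useful weights.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 16))
# Target depends on only two of the sixteen input units.
y = 0.8 * X[:, 0] - 0.6 * X[:, 1] + 0.01 * rng.normal(size=2000)

errors = {}
for alpha in [1e-4, 1e-3, 1e-2, 1e-1, 1.0]:
    pred = Lasso(alpha=alpha).fit(X, y).predict(X)
    errors[alpha] = float(np.mean((pred - y) ** 2))
# Error is flat for small penalties, then degrades sharply once the
# penalty eliminates the genuinely predictive coefficients.
```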

**How can we identify small groups of units in each network that span similar subspaces?**

Most parents of leaf clusters (the smallest merges shown as two blue lines of length two covering a 2 × 2 region) contain one unit from Net1 and one from Net2. These units can be considered most predictive of each other.

Slightly higher level clusters show small subspaces, comprised of multiple units from each network, where multiple units from one network are useful for predicting activations from the other network.
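A minimal sketch of agglomerative clustering over units pooled from both networks, using one minus absolute correlation as the inter-unit distance (an illustrative choice, not necessarily the paper's): units that encode the same direction merge first, producing the small Net1/Net2 leaf pairs described above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
base = rng.normal(size=(500, 2))
# Four units -- two from each net -- spanning two shared directions.
units = np.stack([base[:, 0],                                  # Net1 unit A
                  base[:, 1],                                  # Net1 unit B
                  base[:, 0] + 0.05 * rng.normal(size=500),    # Net2 unit A'
                  base[:, 1] + 0.05 * rng.normal(size=500)],   # Net2 unit B'
                 axis=1)
# Distance between units = 1 - |correlation|.
dist = 1 - np.abs(np.corrcoef(units.T))
# SciPy wants the condensed (upper-triangle) distance vector.
condensed = dist[np.triu_indices(4, k=1)]
tree = linkage(condensed, method='average')
labels = fcluster(tree, t=2, criterion='maxclust')
# Each leaf cluster pairs one Net1 unit with its Net2 counterpart.
```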

**Conclusion**

1. Some features are learned reliably in multiple networks, yet other features are not consistently learned.

2. The representation codes are a mix between a local (single unit) code and slightly, but not fully, distributed codes across multiple units.

3. Units learn to span low-dimensional subspaces and, while these subspaces are common to multiple networks, the specific basis vectors learned are not.

**Future Work**

1. Model compression. How would removing low-correlation, rare filters affect performance?

2. Model combination: can multiple models be combined by concatenating their features, deleting those with high overlap, and then fine-tuning?

**Thank you :-)**

Yixuan Li

Jeff Clune

Hod Lipson

Jason Yosinski

John Hopcroft

http://www.cs.cornell.edu/~yli/

yli@cs.cornell.edu


**Are deep representations local or distributed?**

- Local
- Distributed
- Slightly distributed (spanning low-dimensional subspaces)

[Krizhevsky et al. 2012] [Zeiler et al. 2014]

________

[1] Szegedy, Christian, Zaremba, Wojciech, Sutskever, Ilya, Bruna, Joan, Erhan, Dumitru, Goodfellow, Ian J., and Fergus, Rob. Intriguing properties of neural networks. CoRR, abs/1312.6199, 2013.

Hierarchical Agglomerative Clustering (HAC)

Key idea: probe representation types by comparing multiple networks.