Convergent Learning: Do different neural networks learn the same representations?

NIPS 2015 Workshop Slides

Yixuan Li

on 24 March 2016



Convergent Learning: Do different neural
networks learn the same representations?

Yixuan Li¹, Jason Yosinski¹, Jeff Clune², Hod Lipson³ and John Hopcroft¹

¹Department of Computer Science, Cornell University
²Department of Computer Science, University of Wyoming
³Department of Mechanical Engineering, Columbia University

Dec 2, 2015
An interesting phenomenon:

Networks with the same architecture, trained from different random initializations, frequently converge to solutions with similar performance (Dauphin et al. 2014).
Convolutional Neural Networks
[2] Jason Yosinski, Jeff Clune, Anh Nguyen, Thomas Fuchs, and Hod Lipson. Understanding Neural Networks Through Deep Visualization. Deep Learning Workshop, International Conference on Machine Learning (ICML). 10 July 2015.
To what extent do the learned internal representations differ?


Do they learn radically different sets of features that happen to perform similarly?

Do they exhibit convergent learning, meaning that their learned feature representations are largely the same?

What do neural nets learn in the middle?
One-to-one alignment
Mutual Information

Many-to-one: Sparse Prediction

Many-to-many: Spectral Clustering

[3] Picture Source:

Neuron Activation Statistics
Standard deviation:
Within-net correlation:
Between-net correlation:
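The formulas for these statistics did not survive in the transcript. A plausible reconstruction uses the standard mean, standard deviation and Pearson correlation. The notation is an assumption rather than something taken from the slides: X_i^{(n)} stands for the activation of unit i in network n, with the expectation taken over the dataset.

```latex
\begin{align}
\mu_i^{(n)} &= \mathbb{E}\left[X_i^{(n)}\right] \\
\sigma_i^{(n)} &= \sqrt{\mathbb{E}\left[\left(X_i^{(n)} - \mu_i^{(n)}\right)^2\right]} \\
c_{i,j}^{(n)} &= \frac{\mathbb{E}\left[\left(X_i^{(n)} - \mu_i^{(n)}\right)\left(X_j^{(n)} - \mu_j^{(n)}\right)\right]}{\sigma_i^{(n)}\,\sigma_j^{(n)}} \\
c_{i,j}^{(n,m)} &= \frac{\mathbb{E}\left[\left(X_i^{(n)} - \mu_i^{(n)}\right)\left(X_j^{(m)} - \mu_j^{(m)}\right)\right]}{\sigma_i^{(n)}\,\sigma_j^{(m)}}
\end{align}
```

Here c_{i,j}^{(n)} is the within-net correlation between units i and j of the same network, and c_{i,j}^{(n,m)} is the between-net correlation between unit i of network n and unit j of network m.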
Figure: Correlation matrices for the conv1 layer, displayed as images with minimum value at black and maximum at white.
[2] For reference, the number of channels for conv1 to fc8 is given by: S = {96, 256, 384, 384, 256, 4096, 4096, 1000}.
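As a rough illustration of how such correlation matrices could be computed, here is a minimal NumPy sketch. The function name, the array layout (one row per input, one column per unit) and the synthetic data are all assumptions made for the example; they are not the authors' code or data.

```python
import numpy as np

def correlation_matrices(acts1, acts2):
    """Return (within-Net1, within-Net2, between-net) Pearson correlation matrices.

    acts1, acts2: hypothetical activation arrays of shape (n_samples, n_units),
    recorded from the same layer of two networks on the same inputs.
    """
    # Standardize each unit: subtract its mean, divide by its standard deviation.
    z1 = (acts1 - acts1.mean(axis=0)) / acts1.std(axis=0)
    z2 = (acts2 - acts2.mean(axis=0)) / acts2.std(axis=0)
    n = acts1.shape[0]
    within1 = z1.T @ z1 / n
    within2 = z2.T @ z2 / n
    between = z1.T @ z2 / n  # entry (i, j): unit i of Net1 vs. unit j of Net2
    return within1, within2, between

# Synthetic stand-in for real activations: Net2 is a shuffled, noisy copy of Net1.
rng = np.random.default_rng(0)
a1 = rng.normal(size=(1000, 5))
perm = np.array([2, 0, 4, 1, 3])
a2 = a1[:, perm] + 0.1 * rng.normal(size=(1000, 5))
w1, w2, b = correlation_matrices(a1, a2)
```

In this toy setup the between-net matrix has one bright entry per column, at the row of the Net1 unit that was copied into that Net2 unit.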
Is there a one-to-one alignment between features learned by different neural networks?
greedy matching (semi-matching)
Figure: Eight highest correlation matched features and eight lowest correlation matched features using greedy matching.
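Greedy (semi-)matching can be sketched as taking the maximum along each row of the between-net correlation matrix. The sketch below assumes a small hand-made matrix `between` (rows are Net1 units, columns are Net2 units); the function name is hypothetical.

```python
import numpy as np

def greedy_semi_matching(between):
    """For each Net1 unit (row), pick the Net2 unit (column) with the highest correlation.

    Several Net1 units may pick the same Net2 unit, hence "semi-matching".
    """
    match = between.argmax(axis=1)
    score = between[np.arange(between.shape[0]), match]
    return match, score

# Toy between-net correlation matrix (illustrative values only).
between = np.array([[0.9, 0.1, 0.2],
                    [0.8, 0.3, 0.1],
                    [0.0, 0.2, 0.7]])
match, score = greedy_semi_matching(between)
order = np.argsort(-score)  # Net1 units sorted from best-matched to worst-matched
```

Note that Net1 units 0 and 1 both select Net2 unit 0 here, which a strict one-to-one matching would not allow.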
max bipartite matching
The diagonal of the permuted matrix is bright in some places, which shows that for these units in Net1 it is possible to find a highly correlated unit in Net2.
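A maximum-weight bipartite matching assigns each Net1 unit to a distinct Net2 unit so that the summed correlation is as large as possible. The sketch below finds it by exhaustive search, which is only feasible for a handful of units; for realistic layer sizes a polynomial-time solver such as the Hungarian algorithm (for example `scipy.optimize.linear_sum_assignment`) would be needed. The function name and the toy matrix are assumptions for illustration.

```python
from itertools import permutations

import numpy as np

def max_bipartite_matching_bruteforce(between):
    """Return (perm, total): Net1 unit i is matched to Net2 unit perm[i],
    chosen to maximize the summed correlation over all one-to-one assignments.
    """
    n = between.shape[0]
    rows = np.arange(n)
    best_perm, best_total = None, -np.inf
    for perm in permutations(range(n)):
        total = between[rows, list(perm)].sum()
        if total > best_total:
            best_perm, best_total = perm, total
    return np.array(best_perm), float(best_total)

# Toy between-net correlation matrix (illustrative values only).
between = np.array([[0.9, 0.1, 0.2],
                    [0.8, 0.3, 0.1],
                    [0.0, 0.2, 0.7]])
perm, total = max_bipartite_matching_bruteforce(between)
permuted = between[:, perm]  # matched pairs now lie on the diagonal
```

Unlike a row-wise maximum, the one-to-one constraint forces Net1 unit 1 onto a weakly correlated partner in this toy matrix, which is the kind of dark diagonal entry the permuted matrix makes visible.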
For many units, a one-to-one alignment is possible.

Both methods reveal that the average correlation for one-to-one alignments varies from layer to layer, with the highest matches in conv1 and the lowest in conv4.

The two networks learn different numbers of units to span certain subspaces.

Some filters in Net1 can be paired up with filters in Net2 with high correlation, but other filters in Net1 and Net2 are network-specific and have no high-correlation pairing in the alternate network, implying that those filters are rare and not always learned.
Alignment via mutual information:
These results suggest that correlation is an adequate measure of the similarity between two neurons, and that switching to a mutual information metric would not qualitatively change the correlation-based conclusions presented in the previous slides.
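The slides do not say how mutual information was estimated. One common option, shown here purely as an assumption, is a plug-in estimate from a 2-D histogram of two units' activations; the bin count and the synthetic data are arbitrary choices for the example.

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Histogram (plug-in) estimate of mutual information, in nats,
    between two 1-D activation vectors of equal length.
    """
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()              # joint probability table
    px = pxy.sum(axis=1, keepdims=True)    # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)    # marginal of y, shape (1, bins)
    nz = pxy > 0                           # skip empty cells to avoid log(0)
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

# Synthetic stand-ins for unit activations.
rng = np.random.default_rng(0)
x = rng.normal(size=20000)
y_dep = x + 0.3 * rng.normal(size=20000)   # strongly related to x
y_ind = rng.normal(size=20000)             # unrelated to x
mi_dep = mutual_information(x, y_dep)
mi_ind = mutual_information(x, y_ind)
```

A pair that is highly correlated also scores high on this estimate, while an independent pair scores close to zero, which is the qualitative agreement the slide describes.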
Relaxing the one-to-one constraint to find sparse, few-to-one mappings
Learn mapping layers with an L1 penalty (known as a LASSO model; Tibshirani, 1996)
The mapping layer’s parameters can be considered as a square weight matrix with side length equal to the number of units in the layer;
The layer learns to predict any unit in one network via a linear weighted sum of any number of units in the other.
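The idea of a square, L1-penalized mapping matrix can be sketched with plain NumPy. This is not the authors' training setup: the sketch swaps in ISTA (proximal gradient descent with soft-thresholding) as the solver, and the function name, penalty value and synthetic activations are assumptions for illustration.

```python
import numpy as np

def fit_sparse_mapping(src, tgt, l1=0.05, n_iter=500):
    """Fit W minimizing (1/2n) * ||src @ W - tgt||_F^2 + l1 * ||W||_1 with ISTA.

    src: (n_samples, n_units) activations of one network (hypothetical input)
    tgt: (n_samples, n_units) activations of the other network
    Column k of W holds the weights used to predict target unit k.
    """
    n, d = src.shape
    W = np.zeros((d, tgt.shape[1]))
    # Step size 1/L, with L the Lipschitz constant of the smooth term's gradient.
    step = n / np.linalg.norm(src, ord=2) ** 2
    for _ in range(n_iter):
        grad = src.T @ (src @ W - tgt) / n
        W = W - step * grad
        # Soft-thresholding: the proximal step for the L1 penalty.
        W = np.sign(W) * np.maximum(np.abs(W) - step * l1, 0.0)
    return W

# Synthetic data: each target unit copies one source unit; target unit 0
# additionally mixes in a second source unit (a "few-to-one" case).
rng = np.random.default_rng(0)
src = rng.normal(size=(2000, 6))
true_W = np.zeros((6, 6))
true_W[[1, 3, 0, 5, 2, 4], np.arange(6)] = 1.0
true_W[2, 0] = 0.5
tgt = src @ true_W + 0.05 * rng.normal(size=(2000, 6))

W = fit_sparse_mapping(src, tgt)
n_nonzero = int((np.abs(W) > 1e-6).sum())
mse = float(np.mean((src @ W - tgt) ** 2))
```

Sweeping `l1` over several values and recording `mse` would give a table of prediction error against sparsity penalty for this toy data.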
Architecture of training sparse mapping layers
Figure: (a) sparse mapping layer for conv1; (b) permuted sparse mapping matrix; (c) examples of few-to-one prediction.
local vs. distributed code:
The units that match well one-to-one suggest the presence of a local code. Each of these dimensions is important enough, independent enough, or privileged enough in some other way to be relearned by different networks.

Units that do not match well one-to-one, but are predicted well by a sparse model, suggest the presence, along those dimensions, of slightly distributed codes.
Table: Average squared prediction errors for each layer
For the conv1 and conv2 layers, the prediction errors do not rise with the imposition of a sparsity penalty until the penalty exceeds 10^-3.

That such extreme sparsity does not hurt performance implies that each neuron in one network can be predicted by only one or a few neurons in another network.
The mapping layers for higher layers (conv3 – conv5) showed poor performance even without regularization, for reasons we do not yet fully understand, so further results on those layers are not included here.
How can we identify small groups of units in each network that span similar subspaces?
Most parents of leaf clusters (the smallest merges shown as two blue lines of length two covering a 2 × 2 region) contain one unit from Net1 and one from Net2. These units can be considered most predictive of each other.

Slightly higher level clusters show small subspaces, comprised of multiple units from each network, where multiple units from one network are useful for predicting activations from the other network.
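To make the clustering step concrete, here is a naive agglomerative clustering sketch over the combined units of both networks. The distance (1 minus absolute correlation), the average-linkage rule, the function name and the synthetic activations are all assumptions; the slides do not specify these details.

```python
import numpy as np

def agglomerative_clusters(dist, n_clusters):
    """Naive average-linkage agglomerative clustering on a symmetric distance matrix.

    Returns a list of clusters, each a sorted list of item indices.
    """
    clusters = [[i] for i in range(dist.shape[0])]
    while len(clusters) > n_clusters:
        best = (np.inf, 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average pairwise distance between members of the two clusters.
                d = dist[np.ix_(clusters[a], clusters[b])].mean()
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters

# Synthetic stand-in: Net2 units are shuffled, noisy copies of Net1 units.
rng = np.random.default_rng(0)
net1 = rng.normal(size=(2000, 3))
net2 = net1[:, [2, 0, 1]] + 0.1 * rng.normal(size=(2000, 3))
combined = np.hstack([net1, net2])  # columns 0-2: Net1 units, 3-5: Net2 units
corr = np.corrcoef(combined, rowvar=False)
dist = 1.0 - np.abs(corr)
clusters = agglomerative_clusters(dist, n_clusters=3)
```

In this toy example every smallest cluster pairs one Net1 unit with one Net2 unit, mirroring the leaf clusters described above.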
1. Some features are learned reliably in multiple networks, yet other features are not consistently learned.
2. The representation codes are a mix between a local (single unit) code and slightly, but not fully, distributed codes across multiple units.
3. Units learn to span low-dimensional subspaces and, while these subspaces are common to multiple networks, the specific basis vectors learned are not.
Future Work
1. Model compression: how would removing low-correlation, rare filters affect performance?
2. Model combination: can multiple models be combined by concatenating their features, deleting those with high overlap, and then fine-tuning?
Thank you :-)
Yixuan Li
Jeff Clune
Hod Lipson
Jason Yosinski
John Hopcroft


Slightly distributed
(spanning low
Are deep representations local or distributed?
[Krizhevsky et al. 2012]
[Zeiler et al. 2014]
[1] Szegedy, Christian, Zaremba, Wojciech, Sutskever, Ilya, Bruna, Joan, Erhan, Dumitru, Goodfellow, Ian J., and Fergus, Rob. Intriguing properties of neural networks. CoRR, abs/1312.6199, 2013.
Hierarchical Agglomerative Clustering (HAC)
Key idea: probe representation types by comparing multiple networks