From deep learning to evolving culture

Lecture given remotely on April 27th, 2012, in the COGS 200 class at UCSD (instructor: Gary Cottrell), and to the GRSNC (U. Montreal) on January 7th, 2014

Yoshua Bengio, 7 January 2014

Transcript of From deep learning to evolving culture

From Deep Learning to Evolving Culture
Yoshua Bengio
U. Montreal
January 7th, 2014

Main message:
Evolving human culture as a means to fight the difficulty of optimizing brains, due to effective local minima in deep nets
Exploring Local Minima
Local Minima in Neural Networks
Pre-2006 practitioners' wisdom:
local minima are not a big deal, basically handled by multiple random initializations
The opposite conclusion should actually be drawn from experiments on unsupervised pre-training of deep networks (Erhan et al. 2009, JMLR):
there is a very significant local-minima issue, especially with deeper nets
an issue somewhat ignored before the advent of deep nets in 2006
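The effect can be illustrated on a toy objective (an illustrative sketch, not the Erhan et al. experiments): plain gradient descent from different random initializations lands in different local minima.

```python
import numpy as np

def f(x):
    # Toy non-convex "loss" with two local minima (illustrative only).
    return x**4 - 4 * x**2 + x

def grad_f(x):
    return 4 * x**3 - 8 * x + 1

def descend(x, lr=0.01, steps=2000):
    # Plain gradient descent: the endpoint depends on the initialization.
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x

rng = np.random.default_rng(0)
endpoints = {round(descend(x0), 2) for x0 in rng.uniform(-3, 3, size=20)}
print(sorted(endpoints))  # two distinct minima, chosen by the initialization
```

Multiple restarts reveal both minima here, but in a deep net the number of basins is vastly larger, so restarts alone stop being enough.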
So... how do humans manage to learn high-level abstractions?
Proposed theory:
Humans get hints from other humans that guide learning of high-level abstractions
A lone human would not be able to learn most of these abstractions w/o education / culture, because of effective local minima

Extends the Curriculum Learning story (Bengio et al. ICML 2009), recently validated by studying human teaching behavior (Khan et al. NIPS 2011)
How is one brain transferring abstractions to another brain?
Two individuals share a similar visual input and one of them uses language to give hints of the relevant high-level abstractions
The linguistic output of one is a probabilistic target for the other one.
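A toy "teacher/student" sketch of this idea (all numbers hypothetical): the student runs gradient descent on the cross-entropy to the teacher's output distribution, treating it as a probabilistic target.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical teacher: its (linguistic) output is a probability
# distribution over concepts, not a hard label.
teacher_logits = np.array([2.0, 0.5, -1.0])
target = softmax(teacher_logits)

# Student: gradient descent on the cross-entropy to the teacher's
# distribution; the gradient w.r.t. the logits is simply p - target.
student_logits = np.zeros(3)
for _ in range(1000):
    p = softmax(student_logits)
    student_logits -= 0.5 * (p - target)

print(np.round(softmax(student_logits), 3))  # matches the teacher's distribution
```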
How could language/education/culture possibly help find the better local minima associated with more useful abstractions?
Culture is a snapshot of a population of ideas (memes) spread across the brains of a population of individuals.
Like genetic populations, cultures evolve.
Ideas or concepts are the units of selection = memes
The power of sexual reproduction:

the cross-over operator to combine sub-solutions

More than random search:
potential exponential speed-up via a combinatorial advantage: solutions to independently solved sub-problems can be combined
Parallel search:
Each brain learns and falls in a local minimum, society can benefit from the best solutions encountered across the population
Ideas as efficient memes
application of genetic algorithms is hampered by / sensitive to the choice of representation (what should each gene represent?)
good units should make offspring viable often enough
ideas & verbalized concepts = by construction the units of recombination for verbalizable thought
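A minimal genetic-algorithm sketch (hypothetical setup): on a separable objective where each bit is an independently solvable sub-problem, one-point crossover combines sub-solutions found by different individuals.

```python
import random

random.seed(0)

def fitness(bits):
    # Separable objective: each bit is an independent sub-problem, so
    # crossover can combine sub-solutions found by different individuals.
    return sum(bits)

def crossover(a, b):
    # One-point crossover: prefix of one parent, suffix of the other.
    point = random.randrange(1, len(a))
    return a[:point] + b[point:]

def mutate(bits, rate=0.02):
    return [1 - b if random.random() < rate else b for b in bits]

def evolve(pop_size=40, n_bits=32, generations=60):
    pop = [[random.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        parents = sorted(pop, key=fitness, reverse=True)[: pop_size // 2]
        pop = [mutate(crossover(random.choice(parents), random.choice(parents)))
               for _ in range(pop_size)]
    return max(pop, key=fitness)

best = evolve()
print(fitness(best))  # close to the optimum of 32
```

The analogy in the talk: verbalizable concepts play the role of the genes, chosen by construction so that recombination usually yields viable "offspring" ideas.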

Evolving societies as dynamical systems
What factors influence the efficiency of the optimization?
exploration of new ideas

rate of spread of good ideas

investing in scientific research
especially high-risk high potential impact ideas
- non-homogeneous education systems
- openness to multiple schools of thought
- openness to marginal beliefs and individual differences
open & free access to information, scientific results
investing in (open) education
internet with every individual able to publish
multiple non-centralised ways of 'rating' ideas
Wrap-up & Next
Better, longer-term, more exploratory science & education & cultural diversity
Deep Learning research
Evolving cultures
Local Minima
good ones occupy lower-dimensional manifolds that are difficult to find by chance, especially when learning higher-level abstractions
to exchange high-level abstractions as 'targets' for deep layers
as units of cultural evolution that help explore solutions to build better brains
as dynamical systems to optimize better models of the world into our brains
Populations of Machine Learners
When the brain of a single biological agent learns, it performs an approximate optimization with respect to some endogenous objective.
Hypothesis 1:
Hypothesis 2:
When the brain of a single biological agent learns, it relies on approximate local descent in order to gradually improve itself.
Higher-level abstractions in brains are represented by deeper computations (going through more areas or more computational steps in sequence over the same areas).
Hypothesis 3:
[Figure: neural-net trajectories in function space, visualized with Isomap (which tries to preserve global geometry) and t-SNE (which tries to preserve local geometry), with vs. without pre-training]
- it gets more difficult to exploit the added power of deeper nets as depth is increased
- wise initialization can reduce that difficulty
- similar performance for different random initializations
- very different performance with vs. without pre-training
Learning of a single human learner is limited by effective local minima.
Hypothesis 4:
Hypothesis 5:
A single human learner is unlikely to discover high-level abstractions by chance because these are represented by a deep sub-network in the brain.
How is one brain transferring abstractions to another brain?
A human brain can learn high-level abstractions if guided by the signals produced by other humans, which act as hints or indirect supervision for these high-level abstractions.
Hypothesis 6:
Fellow humans not only provide supervision; they do it by presenting examples in a pedagogical order!
Easy examples ==> lower-level concepts, on which to build higher-level concepts illustrated by more complex examples.
Curriculum as a continuation method to defeat local minima
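A minimal sketch of the mechanism (hypothetical toy task; the difficulty measure and data are invented for illustration): order examples from easy to hard and feed them to an online learner in that order.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy task: 1-D classification where an example's difficulty
# is its distance to the decision boundary at x = 0.
X = rng.uniform(-3, 3, size=200)
y = (X > 0).astype(float)

# Curriculum: order examples from easy (far from the boundary) to hard
# (near it) -- a continuation method starting from a smoothed task.
order = np.argsort(-np.abs(X))           # easiest first
curriculum = list(zip(X[order], y[order]))

def online_sgd(examples, lr=0.1, w=0.0):
    # plain online logistic-regression updates, in curriculum order
    for x, t in examples:
        p = 1.0 / (1.0 + np.exp(-w * x))
        w += lr * (t - p) * x
    return w

w = online_sgd(curriculum)
print(w > 0)  # prints True: the learned weight separates the two classes
```

This convex toy only shows the ordering mechanism; the claim in the talk is that for non-convex deep learners the easy-to-hard schedule changes which basin the learner ends up in.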
Language and meme recombination provide an efficient evolutionary operator, allowing rapid search in the space of memes, that helps humans build up better high-level internal representations of their world.
Hypothesis 7:

From where do new ideas emerge?
- reconciles all the pieces of evidence: new concepts (memes) emerge that better explain the data
- combining memes
= global + local search à la Baldwin (Hinton & Nowlan '87)
Neural Net Trajectories in Function Space
This is not the usual meaning of "optimization" in ML (we cannot measure the cost function, only estimate it).
Criterion to be optimized: generalization error
Online gradient = stochastic gradient of generalization error
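A toy sketch of this point (illustrative 1-D linear regression, not from the talk): because every example is drawn fresh from the distribution and never reused, each stochastic gradient is an unbiased estimate of the gradient of the generalization error itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Online learning: each step draws a FRESH example from the data
# distribution, so the stochastic gradient is an unbiased estimate of
# the gradient of the generalization error (not of a training-set error).
true_w = 2.0
w = 0.0
for _ in range(5000):
    x = rng.normal()
    y = true_w * x + 0.1 * rng.normal()   # fresh sample, never reused
    grad = (w * x - y) * x                # gradient of the squared error on it
    w -= 0.01 * grad
print(round(w, 2))  # close to true_w = 2.0
```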
- linguistic inputs = extra examples, summarize knowledge
- criterion landscape easier to optimize (e.g. curriculum learning)
- turn difficult unsupervised learning into easy supervised learning
- e.g. providing well-chosen labels to transductive-SVM
How do we escape local minima?
[Figure: online gradient descent trajectory on the generalization-error surface]
Theoretical and experimental results on deep learning suggest:
hypothesis 3
Output: is this Bob?
Highest-level features
Abstract features
Primitive features
Input: raw pixels
Deep Architectures
Deep Motivations
Brains have a deep architecture
Humans' ideas are composed from simpler ones
Insufficient depth can be exponentially inefficient
Distributed (possibly sparse) representations necessary for non-local generalization, exponentially more efficient than 1-of-N enumeration of latent variable values
Multiple levels of latent variables allow combinatorial sharing of statistical strength
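The counting argument behind the distributed-representation claim can be sketched directly: with k binary features, a distributed code distinguishes exponentially many configurations, while a local 1-of-N code of the same size distinguishes only k.

```python
from itertools import product

# With k binary features, a distributed representation distinguishes
# 2**k configurations; a local 1-of-N (one-hot) code of the same size
# (k units) distinguishes only k.
k = 10
distributed = list(product([0, 1], repeat=k))
local = [tuple(1 if i == j else 0 for j in range(k)) for i in range(k)]
print(len(distributed), len(local))  # 1024 vs 10
```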
Deep Architecture in our Mind
Humans organize their ideas and concepts hierarchically
Humans first learn simpler concepts and then compose them to represent more abstract ones
Engineers break up solutions into multiple levels of abstraction and processing
It would be nice to learn / discover these concepts
(knowledge engineering failed because of limits of introspection?)
The Curse of Dimensionality
To generalize locally, need representative examples for all relevant variations!

Classical solution: hope for a smooth enough target function
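The counting behind the curse, as a sketch: local generalization needs examples covering every relevant region, and the number of regions grows exponentially with the number of factors of variation.

```python
# Local generalization needs examples covering every relevant region:
# with n distinguishable values per factor of variation and d factors,
# the number of configurations is n**d -- exponential in d.
def examples_needed(n_per_factor, d):
    return n_per_factor ** d

for d in (1, 2, 5, 10):
    print(d, examples_needed(10, d))
```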
How to Beat the Curse of Many Factors of Variation?

Compositionality: exponential gain in representational power
Distributed representations / embeddings: feature learning
Deep architecture: multiple levels of feature learning
Can generalize to new configurations
Local vs Distributed Latent Variables
Deep Architectures are More Expressive
Theoretical arguments:
2 layers of logic gates / formal neurons / RBF units = universal approximator
Theorems on the advantage of depth:
(Håstad et al. '86 & '91, Bengio et al. 2007, Bengio & Delalleau 2011, Braverman 2011)
Functions compactly represented with k layers may require exponential size with 2 layers
RBMs & auto-encoders = universal approximator
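A standard illustration of such depth separations is parity (a sketch; the example commonly used in these circuit-complexity results): a deep circuit computes d-bit parity with a linear number of gates, while a depth-2 sum-of-products needs exponentially many terms.

```python
from itertools import product
from functools import reduce
from operator import xor

# d-bit parity: a deep circuit is a chain of d-1 XOR gates (size linear
# in d), while a depth-2 sum-of-products needs one AND term per
# odd-parity input pattern, i.e. 2**(d-1) terms.
def parity_deep(bits):
    return reduce(xor, bits)  # depth ~ d, size ~ d

def shallow_terms(d):
    # the AND terms of a 2-layer (DNF) representation of parity
    return [p for p in product([0, 1], repeat=d) if sum(p) % 2 == 1]

d = 8
assert all(parity_deep(p) == 1 for p in shallow_terms(d))
print(len(shallow_terms(d)))  # 2**(d-1) = 128 terms for d = 8
```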