
The wake-sleep algorithm [10 min]

  • 0:00 - 0:06
    In this video, I'll describe the first way
    we discovered for getting Sigmoid Belief
  • 0:06 - 0:13
    Nets to learn efficiently.
    It's called the wake-sleep algorithm and
  • 0:13 - 0:16
    it should not be confused with Boltzmann
    machines.
  • 0:16 - 0:20
    They have two phases, a positive and a
    negative phase, that could plausibly be
  • 0:20 - 0:25
    related to wake and sleep.
    But the wake-sleep algorithm is a very
  • 0:25 - 0:30
    different kind of learning, mainly because
    it's for directed graphical models like
  • 0:30 - 0:34
    Sigmoid Belief Nets, rather than for
    undirected graphical models
  • 0:34 - 0:36
    like Boltzmann machines.
  • 0:36 - 0:38
    The ideas behind the wake-sleep algorithm
  • 0:38 - 0:44
    led to a whole new area of machine
    learning called variational learning,
  • 0:44 - 0:47
    which didn't take off until the late
    1990s,
  • 0:47 - 0:53
    despite early examples like the wake-sleep
    algorithm, and is now one of the main ways
  • 0:53 - 0:58
    of learning complicated graphical models
    in machine learning.
  • 0:58 - 1:02
    The basic idea behind these variational
    methods sounds crazy.
  • 1:02 - 1:07
    The idea is that since it's hard to
    compute the correct posterior distribution,
  • 1:07 - 1:11
    we'll compute some cheap
    approximation to it.
  • 1:11 - 1:14
    And then, we'll do maximum likelihood
    learning anyway.
  • 1:14 - 1:17
    That is, we'll apply the learning rule
    that would be correct,
  • 1:17 - 1:20
    if we'd got a sample from the true posterior,
  • 1:20 - 1:23
    and hope that it works, even though we haven't.
  • 1:23 - 1:29
    Now, you could reasonably expect this to
    be a disaster,
  • 1:29 - 1:32
    but actually the learning comes to your rescue.
  • 1:32 - 1:35
    So, if you look at what's driving the weights during the learning,
  • 1:35 - 1:38
    when you use an approximate posterior,
  • 1:38 - 1:41
    there are actually two terms driving the
    weights.
  • 1:41 - 1:46
    One term is driving them to get a better
    model of the data.
  • 1:46 - 1:50
    That is, to make the Sigmoid Belief Net
    more likely to generate the observed data
  • 1:50 - 1:54
    in the training set.
    But there's another term that's added to that,
  • 1:54 - 1:56
    that's actually driving the weights
  • 1:56 - 2:02
    towards sets of weights for which the
    approximate posterior it's using is a good fit
  • 2:02 - 2:06
    to the real posterior.
    It does this by manipulating the real
  • 2:06 - 2:10
    posterior to try to make it fit the
    approximate posterior.
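In symbols (this is the standard variational decomposition; the notation is not spelled out in the video), for a data vector v, an approximate posterior q(h|v), and generative weights theta:

```latex
\log p(v \mid \theta)
  \;=\;
  \underbrace{\mathbb{E}_{q(h \mid v)}\!\bigl[\log p(v, h \mid \theta)\bigr] + H\bigl(q(h \mid v)\bigr)}_{\text{variational bound the learning actually climbs}}
  \;+\;
  \underbrace{\mathrm{KL}\!\bigl(q(h \mid v)\,\big\|\,p(h \mid v, \theta)\bigr)}_{\ge 0}
```

Increasing the bound with respect to the generative weights therefore either raises log p(v | theta), which is the better-model-of-the-data term, or shrinks the KL term, which is the term that drags the true posterior towards the approximation.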
  • 2:10 - 2:16
    It's because of this effect that
    variational learning of these models works
  • 2:16 - 2:21
    quite nicely.
    Back in the mid-90s, when we first came
  • 2:21 - 2:24
    up with it, we thought this was an
    interesting new theory of how the brain
  • 2:24 - 2:27
    might learn.
    That idea has been taken up since by Karl
  • 2:27 - 2:33
    Friston, who strongly believes this is
    what's going on in real neural learning.
  • 2:33 - 2:39
    So, we're now going to look in more detail
    at how we can use an approximation to the
  • 2:39 - 2:47
    posterior distribution for learning.
    To summarize, it's hard to learn
  • 2:47 - 2:52
    complicated models like Sigmoid Belief
    Nets because it's hard to infer
  • 2:52 - 2:57
    the true posterior distribution over
    hidden configurations when given a data vector.
  • 2:57 - 3:01
    And it's hard even to get a sample from that posterior.
  • 3:01 - 3:05
    That is, it's hard to get an unbiased sample.
  • 3:05 - 3:09
    So, the crazy idea is that we're going to
  • 3:09 - 3:14
    use samples from some other distribution
    and hope that the learning will still work.
  • 3:14 - 3:19
    And as we'll see, that turns out to be true for Sigmoid Belief Nets.
  • 3:19 - 3:23
    So, the distribution that we're going to
  • 3:23 - 3:27
    use is a distribution that ignores
    explaining away.
  • 3:27 - 3:31
    We're going to assume (wrongly) that the
    posterior over hidden configurations
  • 3:31 - 3:37
    factorizes into a product of
    distributions for each separate hidden unit.
  • 3:37 - 3:39
    In other words, we're going to assume that
  • 3:39 - 3:44
    given the data, the units in each hidden
    layer are independent of one another,
  • 3:44 - 3:47
    as they are in a Restricted Boltzmann
    machine.
  • 3:47 - 3:50
    But in a Restricted Boltzmann machine,
    this is correct.
  • 3:50 - 3:53
    Whereas, in a Sigmoid Belief Net, it's
    wrong.
  • 3:53 - 3:57
    So, let's quickly look at what a factorial
    distribution is.
  • 3:57 - 4:01
    In a factorial distribution, the
    probability of a whole vector is just the
  • 4:01 - 4:05
    product of the probabilities of its
    individual terms.
  • 4:05 - 4:10
    So, suppose we have three hidden units in
    the layer and they have probabilities of
  • 4:10 - 4:17
    being on of 0.3, 0.6, and 0.8. If we
    want to compute the probability of the
  • 4:17 - 4:24
    hidden layer having the state (1, 0, 1),
    we compute that by multiplying 0.3
  • 4:24 - 4:33
    by (1 - 0.6), by 0.8.
    So, the probability of a configuration of
  • 4:33 - 4:37
    the hidden layer is just the product of
    the individual probabilities.
  • 4:37 - 4:41
    That's why it's called factorial.
    In general, the distribution of binary
  • 4:41 - 4:45
    vectors of length n will have two to the n
    degrees of freedom.
  • 4:45 - 4:51
    Actually, it's only two to the n minus one
    because the probabilities must add to one.
  • 4:51 - 4:56
    A factorial distribution, by contrast,
    only has n degrees of freedom.
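As a quick sanity check of the arithmetic above, here is a minimal Python sketch (the function name is just for illustration; the numbers are the ones from the example):

```python
def factorial_prob(p_on, state):
    """Probability of a binary vector under a factorial distribution.

    p_on[i] is the probability that unit i is on; because the units are
    treated as independent, the probability of the whole vector is just
    the product of the per-unit terms.
    """
    prob = 1.0
    for p, s in zip(p_on, state):
        prob *= p if s == 1 else 1.0 - p
    return prob

# Three hidden units with probabilities of being on of 0.3, 0.6 and 0.8:
print(factorial_prob([0.3, 0.6, 0.8], (1, 0, 1)))  # 0.3 * (1 - 0.6) * 0.8 = 0.096

# A general distribution over 3-bit vectors needs 2**3 - 1 = 7 numbers;
# the factorial one above is pinned down by just 3.
```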
  • 4:56 - 5:01
    It's a much simpler beast.
    So now, I'm going to describe the
  • 5:01 - 5:05
    wake-sleep algorithm that makes use of the
    idea of using the wrong distribution.
  • 5:05 - 5:12
    And in this algorithm, we have a neural
    net that has two different sets of weights.
  • 5:14 - 5:17
    It's really a generative model and so, the
  • 5:17 - 5:23
    weights shown in green for generative are
    the weights of the model.
  • 5:23 - 5:28
    Those are the weights that define the
    probability distribution over data vectors.
  • 5:29 - 5:35
    We've got some extra weights.
    The weights shown in red are recognition weights,
  • 5:35 - 5:37
    and these are weights that are used for
  • 5:37 - 5:41
    approximately getting the posterior
    distribution.
  • 5:41 - 5:45
    That is, we're going to use these weights
    to get a factorial distribution at each
  • 5:45 - 5:49
    hidden layer that approximates the
    posterior, but not very well.
  • 5:51 - 5:55
    So, in this algorithm, there's a wake
    phase.
  • 5:55 - 6:00
    In the wake phase, you put data in at the
    visible layer, the bottom, and you do a
  • 6:00 - 6:05
    forward pass through the network using the
    recognition weights.
  • 6:05 - 6:10
    And in each hidden layer, you make a
    stochastic binary decision for each hidden
  • 6:10 - 6:14
    unit independently, about whether it
    should be on or off.
  • 6:14 - 6:20
    So, the forward pass gets us stochastic
    binary states for all of the hidden units.
  • 6:21 - 6:27
    Then, once we have those stochastic binary
    states, we treat that as though it was a
  • 6:27 - 6:32
    sample from the true posterior
    distribution given the data and we do
  • 6:32 - 6:35
    maximum likelihood learning.
    But what we're doing maximum likelihood
  • 6:35 - 6:39
    learning for is not the recognition
    weights that we just used to get the
  • 6:39 - 6:44
    approximate sample.
    It's the generative weights that define our model.
  • 6:44 - 6:47
    So, you drive the system in the forward
  • 6:47 - 6:52
    pass with the recognition weights, but you
    learn the generative weights.
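The wake phase maps fairly directly onto code. Here is a rough NumPy sketch for a net with two hidden layers; the names (R1, R2 for the recognition weights, G1, G2 for the generative weights) are illustrative, and biases and the top layer's prior are left out to keep it short:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_bernoulli(p):
    """Independent stochastic binary decision for each unit."""
    return (rng.random(p.shape) < p).astype(float)

def wake_phase(v, R1, R2, G1, G2, lr=0.01):
    """One wake-phase update: drive the net with the recognition weights,
    learn the generative weights."""
    # Forward pass with the recognition weights, sampling binary states.
    h1 = sample_bernoulli(sigmoid(v @ R1))
    h2 = sample_bernoulli(sigmoid(h1 @ R2))

    # Treat (h1, h2) as if it were a sample from the true posterior and do
    # maximum-likelihood learning of the generative weights: each layer
    # should predict the sampled states of the layer below it (delta rule).
    p_h1 = sigmoid(h2 @ G2)                 # top layer predicts h1
    p_v = sigmoid(h1 @ G1)                  # h1 predicts the data
    G2 += lr * np.outer(h2, h1 - p_h1)
    G1 += lr * np.outer(h1, v - p_v)
    return h1, h2
```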
  • 6:53 - 7:00
    In the sleep phase, you do the opposite.
    You drive the system with the generative weights.
  • 7:00 - 7:03
    That is, you start with a random vector of
  • 7:03 - 7:06
    the top hidden layer.
    You generate the binary states of those
  • 7:06 - 7:10
    hidden units from their prior in which
    they're independent.
  • 7:10 - 7:17
    And then, you go down through the system,
    generating states one layer at a time.
  • 7:17 - 7:20
    And here you're using the generative model
    correctly.
  • 7:20 - 7:23
    That's how the generative model says you
    want to generate data.
  • 7:23 - 7:27
    And so, you can generate an unbiased
    sample from the model.
  • 7:29 - 7:35
    Having used the generative weights to
    generate an unbiased sample, you then say,
  • 7:35 - 7:39
    let's see if we can recover the hidden
    states from the data.
  • 7:39 - 7:44
    Well, let's see if we can recover the
    hidden states at layer h2 from the
  • 7:44 - 7:50
    hidden states at layer h1.
    So, you train the recognition weights to try
  • 7:50 - 7:56
    and recover the hidden states that
    actually generated the states in the layer below.
  • 7:58 - 8:00
    So, it's just the opposite of the wake phase.
  • 8:00 - 8:02
    We're now using the generative weights to
  • 8:02 - 8:06
    drive the system and we're learning the
    recognition weights.
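And the mirror-image sleep phase, continuing the sketch above (prior_h2 is a hypothetical vector holding the independent prior probabilities of the top-layer units being on):

```python
def sleep_phase(R1, R2, G1, G2, prior_h2, lr=0.01):
    """One sleep-phase update: drive the net with the generative weights,
    learn the recognition weights."""
    # Generate an unbiased sample from the model, one layer at a time.
    h2 = sample_bernoulli(prior_h2)
    h1 = sample_bernoulli(sigmoid(h2 @ G2))
    v = sample_bernoulli(sigmoid(h1 @ G1))

    # Train the recognition weights to recover the states that actually
    # generated each layer below them (the same delta rule, pointing up).
    q_h1 = sigmoid(v @ R1)                  # try to recover h1 from v
    q_h2 = sigmoid(h1 @ R2)                 # try to recover h2 from h1
    R1 += lr * np.outer(v, h1 - q_h1)
    R2 += lr * np.outer(h1, h2 - q_h2)
    return v, h1, h2
```

In this sketch, alternating wake_phase calls on training vectors with sleep_phase calls is the whole algorithm.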
  • 8:07 - 8:11
    It turns out that if you start with random
    weights and you alternate between wake
  • 8:11 - 8:16
    phases and sleep phases it learns a pretty
    good model.
  • 8:16 - 8:22
    There are flaws in this algorithm.
    The first flaw is a rather minor one
  • 8:22 - 8:27
    which is that the recognition weights are
    learning to invert the generative model.
  • 8:27 - 8:31
    But at the beginning of learning, they're
    learning to invert the generative model in
  • 8:31 - 8:34
    parts of the space where there isn't any
    data.
  • 8:34 - 8:37
    Because when you generate from the model,
    you're generating stuff that looks very
  • 8:37 - 8:41
    different from the real data,
    because the weights aren't any good.
  • 8:41 - 8:45
    That's a waste, but it's not a big
    problem.
  • 8:45 - 8:51
    The serious problem with this algorithm is
    that the recognition weights not only
  • 8:51 - 8:53
    don't follow the gradient of the log
    probability of the data,
  • 8:53 - 8:59
    they don't even follow the gradient of the
    variational bound on this probability.
  • 8:59 - 9:03
    And because they're not following the
    right gradient, we get incorrect mode averaging,
  • 9:03 - 9:06
    which I'll explain in the next slide.
  • 9:06 - 9:09
    A final problem is that we know that the
  • 9:09 - 9:14
    true posterior over the top hidden layer
    is bound to be far from independent
  • 9:14 - 9:19
    because of explaining away effects.
    And yet, we're forced to approximate it
  • 9:19 - 9:22
    with a distribution that assumes
    independence.
  • 9:22 - 9:28
    This independence approximation might not
    be so bad for intermediate hidden layers,
  • 9:28 - 9:34
    because if we're lucky, the explaining
    away effects that come from below will be
  • 9:34 - 9:38
    partially canceled out by prior effects
    that come from above.
  • 9:38 - 9:45
    You'll see that in much more detail later.
    Despite all these problems, Karl Friston
  • 9:45 - 9:48
    thinks this is how the brain works.
    When we initially came up with the
  • 9:48 - 9:51
    algorithm, we thought it was an
    interesting new theory of the brain.
  • 9:51 - 9:59
    I currently believe that it's got too many
    problems to be how the brain works and
  • 9:59 - 10:01
    that we'll find better algorithms.
  • 10:02 - 10:06
    So now let me explain mode averaging, using the little model
  • 10:06 - 10:09
    with the earthquake and the truck that we saw before.
  • 10:09 - 10:15
    Suppose we run the sleep phase, and we generate data from this model.
  • 10:15 - 10:20
    Most of the time, those top two units would be off
  • 10:20 - 10:23
    because they are very unlikely to turn on under their prior,
  • 10:23 - 10:33
    and, because they are off, the visible unit will be firmly off, because its bias is -20.
  • 10:35 - 10:43
    Just occasionally, about one time in e to the 10, one of the two top units will turn on
  • 10:44 - 10:47
    and it will be equally often the left one or the right one.
  • 10:48 - 10:51
    When that unit turns on, there's a probability of a half
  • 10:51 - 10:54
    that the visible unit will turn on.
  • 10:54 - 10:58
    So, if you think about the occasions on which the visible unit turns on,
  • 10:58 - 11:03
    half of those occasions have the left-hand hidden unit on,
  • 11:03 - 11:07
    the other half of those occasions have the right-hand hidden unit on,
  • 11:07 - 11:11
    and almost none of those occasions have neither or both units on.
  • 11:11 - 11:17
    So now think what the learning would do for the recognition weights.
  • 11:17 - 11:21
    Half the time that we have a 1 in the visible layer,
  • 11:21 - 11:25
    the left-hand unit at the top will be on,
  • 11:25 - 11:33
    so the recognition weights will actually learn to predict that it's on with a probability of 0.5, and the same for the right-hand unit.
  • 11:33 - 11:37
    So the recognition weights will learn to produce a factorial distribution
  • 11:37 - 11:41
    over the hidden layer, of (0.5, 0.5)
  • 11:41 - 11:47
    and that factorial distribution puts a quarter of its mass on the configuration (1,1)
  • 11:47 - 11:50
    and another quarter of its mass on the configuration (0,0)
  • 11:50 - 11:56
    and both of those are extremely unlikely configurations given that the visible unit was on.
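A small numeric check of this claim, assuming the earthquake/truck net has hidden biases of -10, a visible bias of -20, and weights of +20 (the numbers used in the earlier video):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

b_h, b_v, w = -10.0, -20.0, 20.0   # hidden biases, visible bias, weights

# True posterior over the four hidden configurations, given visible = 1.
post = {}
for h1 in (0, 1):
    for h2 in (0, 1):
        prior = sigmoid(b_h) ** (h1 + h2) * (1 - sigmoid(b_h)) ** (2 - h1 - h2)
        lik = sigmoid(b_v + w * (h1 + h2))          # P(visible = 1 | h1, h2)
        post[(h1, h2)] = prior * lik
total = sum(post.values())
post = {k: p / total for k, p in post.items()}
print(post)   # essentially all the mass is on (1, 0) and (0, 1)

# The factorial distribution learned in the sleep phase turns each hidden
# unit on with probability 0.5, so it spreads its mass evenly:
q = {(h1, h2): 0.25 for h1 in (0, 1) for h2 in (0, 1)}
print(q)      # a quarter of the mass lands on (0, 0) and a quarter on (1, 1)
```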
  • 11:58 - 12:01
    It would have been better just to pick one mode,
  • 12:01 - 12:06
    that is, it would have been better for the recognition model just to go for truck, or just to go for earthquake.
  • 12:08 - 12:11
    That's the best recognition model you can have
  • 12:11 - 12:17
    if you're forced to use a factorial distribution.
  • 12:18 - 12:21
    So even though the hidden configurations we're dealing with
  • 12:21 - 12:24
    are best represented as the corners of a square,
  • 12:24 - 12:29
    I'll show them here as if they lay along a one-dimensional continuous axis,
  • 12:29 - 12:38
    and the true posterior is bimodal. It's focused on (1,0) and (0,1), and that's shown in black.
  • 12:38 - 12:44
    The approximation you learn, if you use the sleep phase of the wake-sleep algorithm,
  • 12:44 - 12:49
    is the red curve, which gives all four states of the hidden units equal probability
  • 12:51 - 12:57
    and the best solution would be to pick one of these states, and give it all the probability mass.
  • 12:57 - 13:05
    That's the best solution, because in variational learning we're manipulating the true posterior
  • 13:05 - 13:08
    to make it fit the approximation we're using.
  • 13:08 - 13:12
    Normally, in learning we'll manipulate an approximation to fit the true thing, but here it's backwards.
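One compact way to state this last point (standard notation, not from the video): a proper variational fit of the recognition distribution q would minimize KL(q || p), which is mode-seeking, whereas training the recognition weights on fantasies from the model effectively minimizes KL(p || q) averaged over the model's own distribution, and with a factorial q that is exactly what produces mode averaging:

```latex
% what variational learning asks for (picks one mode):
q^{*} \;=\; \arg\min_{q}\; \mathrm{KL}\bigl(q(h \mid v)\,\big\|\,p(h \mid v)\bigr)

% what the sleep phase roughly does instead (averages over the modes):
q^{*} \;=\; \arg\min_{q}\; \mathbb{E}_{v \sim p(v)}\Bigl[\mathrm{KL}\bigl(p(h \mid v)\,\big\|\,q(h \mid v)\bigr)\Bigr]
```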