The wake-sleep algorithm [10 min]

Edit subtitles

0:00 - 0:06

In this video, I'll describe the first way
we discovered for getting Sigmoid Belief
0:06 - 0:13

Nets to learn efficiently.
It's called the wake-sleep algorithm and
0:13 - 0:16

it should not be confused with Boltzmann
machines.
0:16 - 0:20

They have two phases, a positive and a
negative phase, that could plausibly be
0:20 - 0:25

related to wake and sleep.
But the wake-sleep algorithm is a very
0:25 - 0:30

different kind of learning, mainly because
it's for directed graphical models like
0:30 - 0:34

Sigmoid Belief Nets, rather than for
undirected graphical models
0:34 - 0:36

like Boltzmann machines.
0:36 - 0:38

The ideas behind the wake-sleep algorithm
0:38 - 0:44

led to a whole new area of machine
learning called variational learning,
0:44 - 0:47

which didn't take off until the late
1990s,
0:47 - 0:53

despite early examples like the wake-sleep
algorithm, and is now one of the main ways
0:53 - 0:58

of learning complicated graphical models
in machine learning.
0:58 - 1:02

The basic idea behind these variational
methods sounds crazy.
1:02 - 1:07

The idea is that since it's hard to
compute the correct posterior distribution,
1:07 - 1:11

we'll compute some cheap
approximation to it.
1:11 - 1:14

And then, we'll do maximum likelihood
learning anyway.
1:14 - 1:17

That is, we'll apply the learning rule
that would be correct,
1:17 - 1:20

if we'd got a sample from the true posterior,
1:20 - 1:23

and hope that it works, even though we haven't.
1:23 - 1:29

Now, you could reasonably expect this to
be a disaster,
1:29 - 1:32

but actually the learning comes to your rescue.
1:32 - 1:35

So, if you look at what's driving the weights during the learning,
1:35 - 1:38

when you use an approximate posterior,
1:38 - 1:41

there are actually two terms driving the
weights.
1:41 - 1:46

One term is driving them to get a better
model of the data.
1:46 - 1:50

That is, to make the Sigmoid Belief Net
more likely to generate the observed data
1:50 - 1:54

in the training center.
But there's another term that's added to that,
1:54 - 1:56

that's actually driving the weights
1:56 - 2:02

towards sets of weights for which the
approximate posterior it's using is a good fit
2:02 - 2:06

to the real posterior.
It does this by manipulating the real
2:06 - 2:10

posterior to try to make it fit the
approximate posterior.
2:10 - 2:16

It's because of this effect, the
variational learning of these models works
2:16 - 2:21

quite nicely.
Back in the mid 90s,' when we first came
2:21 - 2:24

up with it, we thought this was an
interesting new theory of how the brain
2:24 - 2:27

might learn.
That idea has been taken up since by Karl
2:27 - 2:33

Friston, who strongly believes this is
what's going on in real neural learning.
2:33 - 2:39

So, we're now going to look in more detail
at how we can use an approximation to the
2:39 - 2:47

posterior distribution for learning.
To summarize, it's hard to learn
2:47 - 2:52

complicated models like Sigmoid Belief
Nets because it's hard to get samples from
2:52 - 2:57

the true posterior distribution over
hidden configurations, when given a data vector.
2:57 - 3:01

And it's hard even to get a sample from that posterior.
3:01 - 3:05

That is, it's hard to get an unbiased sample.
3:05 - 3:09

So, the crazy idea is that we're going to
3:09 - 3:14

use samples from some other distribution
and hope that the learning will still work.
3:14 - 3:19

And as we'll see, that turns out to be true for Sigmoid Belief Nets.
3:19 - 3:23

So, the distribution that we're going to
3:23 - 3:27

use is a distribution that ignores
explaining away.
3:27 - 3:31

We're going to assume (wrongly) that the
posterior over hidden configurations
3:31 - 3:37

factorizes into a product of
distributions for each separate hidden unit.
3:37 - 3:39

In other words, we're going to assume that
3:39 - 3:44

given the data, the units in each hidden
layer are independent of one another,
3:44 - 3:47

as they are in a Restricted Boltzmann
machine.
3:47 - 3:50

But in a Restricted Boltzmann machine,
this is correct.
3:50 - 3:53

Whereas, in a Sigmoid Belief Net, it's
wrong.
3:53 - 3:57

So, let's quickly look at what a factorial
distribution is.
3:57 - 4:01

In a factorial distribution, the
probability of a whole vector is just the
4:01 - 4:05

product of the probabilities of its
individual terms.
4:05 - 4:10

So, suppose we have three hidden units in
the layer and they have probabilities of
4:10 - 4:17

being wrong of 0.3, 0.6, and 0.8. If we
want to compute the probability of the
4:17 - 4:24

hidden layer having the state (1, 0, 1),
We compute that by multiplying 0.3
4:24 - 4:33

by (1 - 0.6), by 0.8.
So, the probability of a configuration of
4:33 - 4:37

the hidden layer is just the product of
the individual probabilities.
4:37 - 4:41

That's why it's called factorial.
In general, the distribution of binary
4:41 - 4:45

vectors of length n will have two to the n
degrees of freedom.
4:45 - 4:51

Actually, it's only two to the n minus one
because the probabilities must add to one.
4:51 - 4:56

A factorial distribution, by contrast,
only has n degrees of freedom.
4:56 - 5:01

It's a much simpler beast.
So now, I'm going to describe the
5:01 - 5:05

wake-sleep algorithm that makes use of the
idea of using the wrong distribution.
5:05 - 5:12

And in this algorithm, we have a neural
net that has two different sets of weights.
5:14 - 5:17

It's really a generative model and so, the
5:17 - 5:23

weights shown in green for generative are
the weights of the model.
5:23 - 5:28

Those are the weights that define the
probability distribution over data vectors.
5:29 - 5:35

We've got some extra weights.
The weights shown in red, for recognition weights,
5:35 - 5:37

and these are weights that are used for
5:37 - 5:41

approximately getting the posterior
distribution.
5:41 - 5:45

That is, we're going to use these weights
to get a factorial distribution at each
5:45 - 5:49

hidden layer that approximates the
posterior, but not very well.
5:51 - 5:55

So, in this algorithm, there's a wake
phase.
5:55 - 6:00

In the wake phase, you put data in at the
visible layer, the bottom, and you do a
6:00 - 6:05

forward pass through the network using the
recognition weights.
6:05 - 6:10

And in each hidden layer, you make a
stochastic binary decision for each hidden
6:10 - 6:14

unit independently, about whether it
should be on or off.
6:14 - 6:20

So, the forward pass gets us stochastic
binary states for all of the hidden units.
6:21 - 6:27

Then, once we have those stochastic binary
states, we treat that as though it was a
6:27 - 6:32

sample from the true posterior
distribution given the data and we do
6:32 - 6:35

maximum likelihood learning.
But what we're doing maximum likelihood
6:35 - 6:39

learning for is not the recognition
weights that we just used to get the
6:39 - 6:44

approximate sample.
It's the generative weights that define our models.
6:44 - 6:47

So, you drive the system in the forward
6:47 - 6:52

pass with the recognition weights, but you
learn the generative weights.
6:53 - 7:00

In the sleep phase, you do the opposite.
You drive the system with the generative weights.
7:00 - 7:03

That is, you start with a random vector of
7:03 - 7:06

the top hidden layer.
You generate the binary states of those
7:06 - 7:10

hidden units from their prior in which
they're independent.
7:10 - 7:17

And then, you go down through the system,
generating state for each layer at a time.
7:17 - 7:20

And here you're using the generative model
correctly.
7:20 - 7:23

That's how the generative model says you
want to generate data.
7:23 - 7:27

And so, you can generate an unbiased
sample from the model.
7:29 - 7:35

Having used the generative weights to
generate an unbiased sample, you then say,
7:35 - 7:39

let's see if we can recover the hidden
states from the data.
7:39 - 7:44

Well, let's see if we can recover the
hidden states that layer h2 from the
7:44 - 7:50

hidden states at layer h1.
So, you train recognition weights, to try
7:50 - 7:56

and recover the hidden states that
actually generated the states in the layer below.
7:58 - 8:00

So, it's just the opposite of the weight phase.
8:00 - 8:02

We're now using the generative weights to
8:02 - 8:06

drive the system and we're learning the
recognition weights.
8:07 - 8:11

It turns out that if you start with random
weights and you alternate between wake
8:11 - 8:16

phases and sleep phases it learns a pretty
good model.
8:16 - 8:22

There are flaws in this algorithm.
The first flaw is a rather minor one
8:22 - 8:27

which is, the recognition weights are
learning to invert the generative model.
8:27 - 8:31

But at the beginning of learning, they're
learning to invert the generative model in
8:31 - 8:34

parts of the space where there isn't any
data.
8:34 - 8:37

Because when you generate from the model,
you're generating stuff that looks very
8:37 - 8:41

different from the real data,
because the weights aren't any good.
8:41 - 8:45

That's a waste, but it's not a big
problem.
8:45 - 8:51

The serious problem with this algorithm is
that the recognition weights not only
8:51 - 8:53

don't follow the gradient of the log
probability of the data,
8:53 - 8:59

They don't even follow the gradient of the
variational bound on this probability.
8:59 - 9:03

And because they're not following the
right gradient, we get incorrect mode averaging,
9:03 - 9:06

which I'll explain in the next slide.
9:06 - 9:09

A final problem is that we know that the
9:09 - 9:14

true posterior over the top hidden layer
is bound to be far from independent
9:14 - 9:19

because of explaining away effects.
And yet, we're forced to approximate it
9:19 - 9:22

with a distribution that assumes
independence.
9:22 - 9:28

This independence approximation might not
be so bad for intermediate hidden layers,
9:28 - 9:34

because if we're lucky, the explaining
away effects that come from below will be
9:34 - 9:38

partially canceled out by prior effects
that come from above.
9:38 - 9:45

You'll see that in much more detail later.
Despite all these problems, Karl Friston
9:45 - 9:48

thinks this is how the brain works.
When we initially came up with the
9:48 - 9:51

algorithm, we thought it was an
interesting new theory of the brain.
9:51 - 9:59

I currently believe that it's got too many
problems to be how the brain works and
9:59 - 10:01

that we'll find better algorithms.
10:02 - 10:06

So now let me explain mode averaging, using the little model
10:06 - 10:09

with the earthquake and the truck that we saw before.
10:09 - 10:15

Suppose we run the sleep phase, and we generate data from this model.
10:15 - 10:20

Most of the time, those top two units would be off
10:20 - 10:23

because they are very unlikely to turn on under their prior,
10:23 - 10:33

and, because they are off, the visible unit will be firmly off, because its bias is -20.
10:35 - 10:43

Just occassionally, one time in about e to the -10, one of the two top units will turn on
10:44 - 10:47

and it will be equally often the left one or the right one.
10:48 - 10:51

When that unit turns on, there's a probability of a half
10:51 - 10:54

that the visible unit will turn on.
10:54 - 10:58

So, if you think about the occassions on which the visible unit turns on,
10:58 - 11:03

half those occassions have the left-hand hidden unit on,
11:03 - 11:07

the other half of those occassions have the right-hand hidden unit on
11:07 - 11:11

and almost none of those occassions have neither or both units on.
11:11 - 11:17

So now think what the learning would do for the recognition weights.
11:17 - 11:21

Half the time we'll have a 1 on the visible layer,
11:21 - 11:25

the leftmost unit will be on at the top,
11:25 - 11:33

so we'll actually learn to predict that that's on with a probability of 0.5, and the same for the right unit.
11:33 - 11:37

So the recognition units will learn to produce a factorial distribution
11:37 - 11:41

over the hidden layer, of (0.5, 0.5)
11:41 - 11:47

and that factorial distribution puts a quarter of its mass on the configuration (1,1)
11:47 - 11:50

and another quarter of its mass on the configuration (0,0)
11:50 - 11:56

and both of those are extremely unlikely configurations given that the visible unit was on.
11:58 - 12:01

It would have been better just to pick one mode,
12:01 - 12:06

that is, it would have been better for the visible unit just to go for truck, or just to go for earthquake.
12:08 - 12:11

That's the best recognition model you can have,
12:11 - 12:17

that's the best recognition model you can have if you're forced to have a factorial model.
12:18 - 12:21

So even though the hidden configurations we're dealing with
12:21 - 12:24

are best represented as the corners of a square
12:24 - 12:29

actually show it as if it's a one-dimensional continuous value,
12:29 - 12:38

and the true posterior is bimodal. It's focused on (1,0) or (0,1), that's shown in black.
12:38 - 12:44

The approximation window, if you use the sleep phase of the wake-sleep algorithm,
12:44 - 12:49

is the red curve, which gives all four states of the hidden units equal probability
12:51 - 12:57

and the best solution would be to pick one of these states, and give it all the probability mass.
12:57 - 13:05

That's the best solution, because in variational learning we're manipulating the true posterior
13:05 - 13:08

to make it fit the approximation we're using.
13:08 - 13:12

Normally, in learning we'll manipulate an approximation to fit the true thing, but here it's backwards.

Title:: The wake-sleep algorithm [10 min]
Video Language:: English

	Jesús Quiroga edited English subtitles for The wake-sleep algorithm [10 min]
	Jesús Quiroga edited English subtitles for The wake-sleep algorithm [10 min]
	Jesús Quiroga edited English subtitles for The wake-sleep algorithm [10 min]
	Jesús Quiroga edited English subtitles for The wake-sleep algorithm [10 min]
	Jesús Quiroga edited English subtitles for The wake-sleep algorithm [10 min]
	Jesús Quiroga edited English subtitles for The wake-sleep algorithm [10 min]
	Jesús Quiroga edited English subtitles for The wake-sleep algorithm [10 min]
	Jesús Quiroga edited English subtitles for The wake-sleep algorithm [10 min]

Show all

English subtitles

Revisions

Revision 8

Jesús Quiroga

The wake-sleep algorithm [10 min]

Revisions

Our website uses cookies

Operating cookies (Required)