

We've boiled down our maximum entropy prescription into two steps. The first is that we want the probability distribution to satisfy this constraint on the average value of the waiting time. The second is that we want that particular probability distribution to have the maximum entropy, to maximize the function p log p, and I always forget the negative sign, so in fact we're maximizing the function negative sum over all states of the system of p log p. And remember, states of the system here are how long you've waited for a particular cab, or rather, how long you've waited from a particular time you decided to start waiting.
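To make the objective concrete, here is a minimal sketch of that entropy function in Python (the function name and example distributions are mine, not from the lecture):

```python
import math

def entropy(p):
    # Shannon entropy: negative sum over all states of p * log(p).
    # Terms with p == 0 contribute nothing (the limit of p*log p is 0).
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# The uniform distribution over four states has entropy log(4);
# any skewed distribution over the same states scores lower.
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]
```

The prescription asks for the p that makes entropy(p) as large as possible while still meeting the waiting-time constraint.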

So, this turns out to be a hard problem, or

at least a nontrivial problem. If you have

a little calculus, you're really good at maximizing

functions, so let's start there. Let's imagine

in particular...and I'll use this very simple example

of maximizing a function over a two-dimensional

space. Let's call these the x1 and x2 axes.

So I have some function. I'm just

going to draw the contours for you.

So here we go. And what I'm going to do is

make it such that there's a single maximum over

this space. And what we'll talk about in one of

the appendices, if we have time, is why we can

prove that the entropy function has a unique

maximum even when you subject it to these

constraints. For now, you can take it on faith

that, in fact, this problem has a unique solution.

So, here's a function. I've given it a unique

maximum, and so we'll call this function f.

So, using your amazing calculus skills, you know

that the maximum of this function is defined as

the point where the derivative of f with

respect to x is 0. And remember this is a vector

here so what this means is df/dx1 is 0 and

df/dx2 is zero. Now, yes, you might have

accidentally hit a minimum, so just check to make sure it's actually a maximum. That's

what you would do. So now the problem is that

we're not allowed to range over this entire space.
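That unconstrained recipe can be sketched on a toy hill (my own example, not the lecturer's): compute the gradient by hand, find where it vanishes, and check neighboring points to confirm it is a maximum.

```python
def f(x1, x2):
    # A simple surface with a single peak at (1, 2).
    return -(x1 - 1) ** 2 - (x2 - 2) ** 2

def grad_f(x1, x2):
    # The partial derivatives df/dx1 and df/dx2, computed by hand.
    return (-2 * (x1 - 1), -2 * (x2 - 2))

peak = (1, 2)  # the point where both partials are zero
```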

We're restricted to some subspace. We're restricted

in particular to some constraint here. So how

can we find the maximum of the function, not

the global maximum, but the maximum that

also satisfies a set of constraints? What I'll do is draw those constraints as a line in this space. A point here is a valid argument for the function f, but it doesn't satisfy this constraint, and I'll define the constraint in the following way.

I'll say the constraint is that g(x) = c, where c

is some number. And just to be clear, for us,

it's better to write g(p). We set that equal to

4 minutes. That's just to remind you our particular

constraint is that the function g is equal to 4.
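In our cab problem, the constraint function g takes a whole distribution and returns its average waiting time. A hedged sketch, with names and example numbers of my own:

```python
def mean_wait(p, times):
    # g(p): the average waiting time under the distribution p,
    # i.e. the sum over states of (waiting time) * (probability).
    return sum(t * pt for t, pt in zip(times, p))

times = [2, 4, 6]        # hypothetical waiting times, in minutes
p = [0.25, 0.5, 0.25]    # one distribution that meets g(p) = 4
```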

This is the general case here. So what we want

to do now is not find the maximum point, the top

of the mountain. We want to find the point

that is the maximum along this line, the

line defined by g(x) = c. So let me give you

a little intuition about how you might do this.

Imagine you're on a train going through this

mountainous landscape, and as you're going

through, down here, you'll see you're crossing

contours of the function f. In this case, you're

going uphill, the function is increasing, so you

know a point there is not a maximum of the function

along this line because if you wait a little bit

longer, you'll get to this point here and you've

already crossed the contour line. So here,

you're going up. Notice here, you're going down

the other side of the mountain, crossing contour lines the other way, so you know,

in fact, that the maximum can't be over here

because you were already higher over here.

So, somewhere between here and here is the

maximum; somewhere in the middle you reach the peak. You reach the peak when the contours of the function f are parallel to the track that you are taking,

where there is a tangent point between the contour

and the direction of your motion on this fictitious

train. So, we know how to get the directions of the

contours of the function f  those are, indeed, just the

gradient of the function...this is a vector, remember.

And we are going to say these are equal to the

perpendiculars to the train tracks. So if the

perpendicular to the contour is parallel to the

perpendicular to the train track, that means

that the direction of the contour is parallel

to the direction of the train track. If the two

perpendiculars are parallel, so are the two

original vectors. So then the next question is

how can I get the perpendicular to the train track? What I want you to do is imagine this is the train

track for g(x) = c, and here is the train track

for g(x) = c' and so forth. So here is another set

of contours defined by the function g and we

want the perpendiculars to these contours to

be parallel to the perpendiculars of the contours of f. So what that means is that this gradient here, these arrows here, and in particular these arrows right here,

are equal to some real number lambda times

the gradient of the constraint. When this equation

is satisfied, that means that these contours here

are precisely parallel to these contours here.

So in order to maximize a function f subject

to a set of constraints, don't solve this problem here; solve this one instead.

And now you notice you have this mysterious

value lambda. This is called the Lagrange multiplier.
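Here is the Lagrange condition worked on a toy case (my example, not the lecture's): maximize f(x1, x2) = -(x1 - 1)^2 - (x2 - 2)^2 along the line g(x1, x2) = x1 + x2 = 1. Setting grad f = lambda * grad g gives -2(x1 - 1) = lambda and -2(x2 - 2) = lambda, so x1 - 1 = x2 - 2; combined with the constraint, that picks out the point (0, 1).

```python
def f(x1, x2):
    # The hill from before, peaked at (1, 2).
    return -(x1 - 1) ** 2 - (x2 - 2) ** 2

def grad_f(x1, x2):
    return (-2 * (x1 - 1), -2 * (x2 - 2))

grad_g = (1, 1)  # gradient of g(x1, x2) = x1 + x2, constant everywhere

x1, x2 = 0, 1            # solves grad f = lam * grad g together with g = 1
lam = grad_f(x1, x2)[0]  # lambda comes out to 2 here
```

Note that (0, 1) is not the global peak at (1, 2); it is the highest point you can reach while staying on the line.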

So what we're going to do is try to find a solution

where the gradients are parallel to each other.

In other words, that one can be transformed into

another by scaling by some constant factor

along all axes. So this is the intuitive motivation

for how to solve the problem of maximization

subject to constraints when you have only

a single constraint. What we do is find the point

at which these two gradients align. There's a wrinkle.

The wrinkle is the following. It looks like we have only one constraint here and our constraint is just that this function is equal to 4. But we actually have two constraints. Our second constraint is overall

normalization and it says that we want this

function p here to normalize to 1. If you

sum the probabilities of all the arrival times, they

have to equal 1. Now, of course, p is a probability

so we know that has to be true. We didn't talk about

this explicitly, but when we start wandering over

function space, when these xs here become ps and we start manipulating the probabilities, what we

want to do is be able to relax the constraint

that they sum to 1 when we consider maximizing

this function f. We want to be able to range over

the entire space including, for example, points

where all the probabilities, all the ps, are 0.

And then later what we'll do is we'll impose the

normalization constraint. So we actually have

two constraints, and not just one.
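Putting both constraints together, the Lagrange conditions force the solution into the form p(t) proportional to exp(-lambda * t), with lambda tuned so the mean comes out to 4. A sketch under my own assumptions (discrete waiting times from 0 to 20 minutes, a target mean below the uniform average, and a simple bisection search for lambda):

```python
import math

def maxent_dist(times, target_mean):
    # Maximum-entropy distribution over `times` with a fixed mean:
    # p(t) is proportional to exp(-lam * t); find lam by bisection.
    # Assumes target_mean is below the uniform-distribution mean,
    # so the required lam is positive.
    def mean_for(lam):
        w = [math.exp(-lam * t) for t in times]
        return sum(t * wi for t, wi in zip(times, w)) / sum(w)

    lo, hi = 0.0, 50.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if mean_for(mid) > target_mean:
            lo = mid  # mean still too big: push lam higher
        else:
            hi = mid
    lam = (lo + hi) / 2
    w = [math.exp(-lam * t) for t in times]
    z = sum(w)
    return [wi / z for wi in w]

p = maxent_dist(list(range(21)), 4.0)
```

The result is normalized, hits the target mean, and decays by the same factor at every step, which is exactly the exponential shape the maximum entropy argument predicts.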