This presentation is delivered by the Stanford Center for Professional Development.

Okay, so welcome back. What I want to do today is talk about Newton's method, an algorithm for fitting models like logistic regression, and then we'll talk about exponential family distributions and generalized linear models. It's a very nice class of ideas that ties together the logistic regression and ordinary least squares models that we've already seen. So hopefully I'll get to that today.

Throughout the previous lecture and this lecture, we're starting to use increasingly large amounts of material on probability. So if you'd like a refresher on the foundations of probability - if you're not sure whether you have the prerequisites for this class in terms of a background in probability and statistics - then the discussion section taught this week by the TAs will review probability. At the same discussion section the TAs will also briefly go over Matlab and Octave notation, which you'll need for your problem sets. So if any of you want a review of the probability and statistics prerequisites, or a short tutorial on Matlab and Octave, please come to the next discussion section.

All right. So just to recap briefly, towards the end of the last lecture I talked about the logistic regression model, which is an algorithm for classification. We had that the probability of y equals one given x, parameterized by theta, under this model, is one over one plus e to the minus theta transpose x. And then you can write down the log-likelihood of theta given the training set. By taking derivatives of this, you can derive a gradient ascent rule for finding the maximum likelihood estimate of the parameter theta for this logistic regression model. So last time I wrote down the learning rule for batch gradient ascent, but the version of stochastic gradient ascent, where you look at just one training example at a time, would be like this, okay.
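As a side note for these notes, here is a minimal numpy sketch of that stochastic gradient ascent update on a single training example, assuming the sigmoid hypothesis from last lecture and a learning rate alpha; the variable names are just illustrative, not from the lecture:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic_step(theta, x_i, y_i, alpha=0.1):
    """One stochastic gradient ascent step on a single example (x_i, y_i):
    theta_j := theta_j + alpha * (y_i - h_theta(x_i)) * x_ij, for every j."""
    h = sigmoid(theta @ x_i)               # h_theta(x_i) = 1 / (1 + exp(-theta^T x_i))
    return theta + alpha * (y_i - h) * x_i
```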
So last time I wrote down batch gradient ascent; this is stochastic gradient ascent. So if you want to fit a logistic regression model, meaning find the value of theta that maximizes this log-likelihood, gradient ascent - stochastic or batch - is a perfectly fine algorithm to use. But what I want to do is talk about a different algorithm for fitting models like logistic regression, an algorithm that will often run much faster than gradient ascent. This algorithm is called Newton's method.

To describe Newton's method, let me ask you to consider a different problem first, which is this: let's say you have a function f of theta, and you want to find the value of theta so that f of theta is equal to zero. We'll start with that, and then we'll slowly change it until it becomes an algorithm for fitting maximum likelihood models like logistic regression.

Okay, so let's say that's my function f. This is my horizontal axis of theta, and this is a plot of f of theta, and we're really trying to find the value of theta at which f of theta is equal to zero - where the curve crosses the horizontal axis.

So here's what Newton's method does. I'm going to initialize theta to some value; call it theta superscript zero. Then we're going to evaluate the function f at that value of theta, compute the derivative of f there, and use the linear approximation to the function f at that value of theta. In particular, I'm going to take the tangent to my function - hopefully this drawing works out nicely - the tangent to my function at that point theta zero, and I'm going to extend this tangent line down until it intercepts the horizontal axis. Whatever value that is, I'm going to call theta one, okay. And so that's one iteration of Newton's method.
And then what I'll do is the same thing with this new point: take the tangent down here, and that's two iterations of the algorithm. And then just keep going - that's theta three, and so on, okay.

So let's just go ahead and write down what this algorithm actually does. To go from theta zero to theta one, let me call this horizontal length capital delta. So if you remember the definition of a derivative, the derivative of f evaluated at theta zero - in other words, the slope of this tangent line - is by definition equal to this vertical height divided by this horizontal length; the slope of the function is the ratio between the height and the width of this triangle. So the slope is just equal to f of theta zero divided by delta, which implies that delta is equal to f of theta zero divided by f prime of theta zero, okay. And so theta one is therefore theta zero minus capital delta, which is therefore just theta zero minus f of theta zero over f prime of theta zero, all right. And more generally, one iteration of Newton's method proceeds as

theta^(t+1) = theta^(t) - f(theta^(t)) / f'(theta^(t)).

So that's one iteration of Newton's method.

Now, this is an algorithm for finding a value of theta for which f of theta equals zero. So how do we apply the same idea to maximizing the log-likelihood? We have a function l of theta, and we want to maximize this function. Well, how do you maximize a function? You set the derivative to zero. So we want a theta such that l prime of theta is equal to zero - to maximize this function we want to find the place where the derivative of the function is equal to zero - and so we just apply the same idea.
So we get

theta^(t+1) = theta^(t) - l'(theta^(t)) / l''(theta^(t)),

okay. Because to maximize this function, we just let f be equal to l prime, the derivative of l, and then we want to find the value of theta for which the derivative of l is zero, which must therefore be a local optimum.

Does this make sense? Any questions about this?

[Student asks, roughly, whether this is always guaranteed to work.] The answer to that is fairly complicated. There are conditions on f that would guarantee that this will work. They're fairly complicated, and more complex than I want to go into now. In practice, this works very well for logistic regression, and for the generalized linear models I'll talk about later.

[Student asks about how to initialize theta.] Yeah, it usually doesn't matter. When I implement this, I usually just initialize theta zero to zero - just initialize the parameters to the vector of all zeros - and usually this works fine. It's usually not a huge deal how you initialize theta.

[Student asks whether these algorithms converge to different answers, or just converge differently.] Let me say some things about that that will sort of answer it. All of these algorithms tend not to have convergence problems - all of these algorithms will generally converge, unless you choose too large a learning rate for gradient ascent or something. But the speeds of convergence of these algorithms are very different. So it turns out that Newton's method is an algorithm that enjoys extremely fast convergence. The technical term is that it enjoys a property called quadratic convergence. Don't worry if you don't know what that means, but stated informally, it means that asymptotically, every iteration of Newton's method will double the number of significant digits that your solution is accurate to - up to constant factors. Suppose that on a certain iteration your solution is within 0.01 of the optimum, so you have 0.01 error.
Then after one iteration, your error will be on the order of 0.0001, and after another iteration, your error will be on the order of 0.00000001. So this is called quadratic convergence because you essentially get to square the error on every iteration of Newton's method. Now, this is an asymptotic result that holds only when you are pretty close to the optimum anyway, so it's a theoretical result that, because of constant factors and so on, may paint a slightly rosier picture than is accurate. But the fact is, when I implement Newton's method for logistic regression, it usually converges in a dozen iterations or so for most reasonably sized problems with tens or hundreds of features.

So one thing I should talk about: what I wrote down over there was actually Newton's method for the case of theta being a single real number. The generalization of Newton's method to when theta is a vector rather than just a real number is the following. It's no longer literally the first derivative divided by the second derivative; the appropriate generalization is

theta^(t+1) = theta^(t) - H^(-1) * gradient of l(theta),

where the gradient is the usual gradient of your objective, and H is a matrix called the Hessian, which is just the matrix of second derivatives, where H_ij equals the second partial derivative of l with respect to theta_i and theta_j.

So, just to compare: before you had the first derivative divided by the second derivative; now you have a vector of first derivatives times the inverse of the matrix of second derivatives. So this is just the same thing generalized to multiple dimensions.

So for logistic regression, again, for a reasonable number of features and training examples, when I run this algorithm you usually see convergence in anywhere from a handful to a dozen or so iterations. Compared to gradient ascent, this usually means far fewer iterations to converge.
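To make the vector version concrete, here is a minimal numpy sketch of Newton's method applied to the logistic regression log-likelihood - a sketch under the assumptions above, not a tuned implementation. The gradient X^T (y - h) and Hessian -X^T diag(h(1-h)) X are the standard expressions for this model:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logistic(X, y, num_iters=10):
    """Fit logistic regression by Newton's method, maximizing the log-likelihood.
    X: (m, n) design matrix (include a column of ones yourself if you want an intercept).
    y: (m,) vector of 0/1 labels."""
    m, n = X.shape
    theta = np.zeros(n)                          # initialize parameters to all zeros, as in lecture
    for _ in range(num_iters):
        h = sigmoid(X @ theta)                   # hypotheses h_theta(x^(i))
        grad = X.T @ (y - h)                     # gradient of the log-likelihood
        R = h * (1 - h)                          # diagonal weights
        H = -(X * R[:, None]).T @ X              # Hessian of the log-likelihood
        theta = theta - np.linalg.solve(H, grad) # theta := theta - H^{-1} * gradient
    return theta
```

Solving the linear system with the Hessian on every iteration is exactly the step whose cost comes up next.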
Compared to gradient ascent - say batch gradient ascent - the disadvantage of Newton's method is that on every iteration you need to invert the Hessian. The Hessian will be an n-by-n matrix, or an (n+1)-by-(n+1)-dimensional matrix, where n is the number of features. And so if you have a large number of features in your learning problem - if you have tens of thousands of features - then inverting H could be a somewhat computationally expensive step. But for smaller, more reasonable numbers of features, this is usually very fast.

One other thing: I wrote down this algorithm to find the maximum likelihood estimate of the parameters for logistic regression - I wrote this down for maximizing a function. So I'll leave you to think about this yourself: if I wanted to use Newton's method to minimize a function, how does the algorithm change? In other words, not maximization - how does the algorithm change if you want to use it for minimization? Actually, the answer is that it doesn't change. I'll leave you to work out why yourself, okay. All right.

Let's talk about generalized linear models. Let me just give a recap of both of the algorithms we've talked about so far. We've talked about two different algorithms for modeling p of y given x, parameterized by theta. In one of them, y was a real number, and we assumed that y given x has a Gaussian distribution; then we got linear regression - ordinary least squares. In the other case, we saw a classification problem, where y took on a value of either zero or one. In that case, well, what's the most natural distribution over zeros and ones? It's the Bernoulli.
The Bernoulli distribution models random variables with two values, and in that case we got logistic regression.

So along the way, some of the questions that came up were: for logistic regression, where on earth did I get the sigmoid function from? There were other choices of function you could have used - so where did this particular function come from? And there are other functions I could have plugged in, but the sigmoid function turns out to be a natural default choice that led us to logistic regression. And what I want to do now is take both of these algorithms and show that they are special cases of a broader class of algorithms called generalized linear models, and as part of this, the sigmoid function will fall out very naturally as well.

So, let's see - just looking for a longer piece of chalk. I should warn you, the ideas in generalized linear models are somewhat complex, so what I'm going to do today is try to point out the key ideas and give you the gist of the entire story, and then some of the details in the math and the derivations I'll leave you to work through by yourselves in the lecture notes, which are posted online.

So let's look at these two distributions, the Bernoulli and the Gaussian. Suppose we have data that is zero-one valued, and we want to model it with a Bernoulli random variable parameterized by phi. So the Bernoulli distribution has the probability of y equals one, which just equals phi - the parameter phi in the Bernoulli specifies the probability of y being one. Now, as you vary the parameter phi, you get different Bernoulli distributions; as you vary the value of phi you get different probability distributions on y that have different probabilities of being equal to one. And so I want you to think of this not as one fixed distribution, but as a set - a class of distributions - that you get as you vary phi.
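Written out for reference, the Bernoulli family just described is (this compact form reappears a little later in the lecture):

```latex
p(y;\phi) = \phi^{\,y}(1-\phi)^{1-y}, \qquad y \in \{0,1\}, \qquad p(y{=}1;\phi) = \phi
```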
And in the same way, if you consider a Gaussian distribution, as you vary mu you would get different Gaussian distributions. So think of this again as a class, a set of distributions.

And what I want to do now is show that both of these are special cases of a class of distributions called the exponential family distributions. In particular, we'll say that a class of distributions - like the class of Bernoulli distributions that you get as you vary phi - is in the exponential family if it can be written in the following form: p of y, parameterized by eta, is equal to b of y times the exponential of eta transpose T of y, minus a of eta, okay. Let me just give some of these terms names, and then I'll say a bit more about what this means.

So eta is called the natural parameter of the distribution, and T of y is called the sufficient statistic. Usually, for many of the examples we'll see, including the Bernoulli and the Gaussian, T of y is just equal to y. So for most of this lecture you can mentally replace T of y with y, although this won't be true for the very final example we do today; but mentally, think of T of y as equal to y.

And so, for a given choice of these functions a, b and T - we're going to fix the forms of the functions a, b and T - this formula defines, again, a set of distributions. It defines the class of distributions that is now parameterized by eta. So again: write down specific formulas for a, b and T - specific choices of a, b and T - and then as I vary eta, I get different distributions. And I'm going to show that the Bernoulli and the Gaussian are special cases of exponential family distributions. By that I mean that I can choose specific functions a, b and T so that this formula becomes exactly the formula for the distributions of either a Bernoulli or a Gaussian.
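For reference, here is that definition displayed. The names for eta and T(y) are the lecturer's; calling a(eta) the log partition function and b(y) the base measure is standard usage, not terminology introduced in this lecture:

```latex
p(y;\eta) = b(y)\,\exp\!\big(\eta^{\top} T(y) - a(\eta)\big)
```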
And then again, as I vary eta, I'll get Bernoulli distributions with different means, or as I vary eta, I'll get Gaussian distributions with different means, for my fixed choices of a, b and T.

And for those of you that know what a sufficient statistic in statistics is, T of y actually is a sufficient statistic in the formal sense for these probability distributions - you may have seen that in a statistics class. If you don't know what a sufficient statistic is, don't worry about it; we don't need that property today. Okay.

So - oh, one last comment. Often T of y is equal to y, and in many of these cases eta is also just a real number. So in many cases, the parameter of this distribution is just a real number, and eta transpose T of y is just a product of real numbers. That will be true for our first two examples, but not for the last example I'll do today.

So now we'll show that the Bernoulli and the Gaussian are examples of exponential family distributions. We'll start with the Bernoulli. So the Bernoulli distribution with parameter phi - I guess I wrote this down already - has probability of y equals one, parameterized by phi, equal to phi. So the parameter phi specifies the probability that y equals one. And so my goal now is to choose a, b and T so that my formula for the exponential family becomes identical to my formula for the distribution of a Bernoulli.

So the probability of y parameterized by phi is equal to that, all right. And you already saw a similar exponential notation when we talked about logistic regression. The probability of y being one is phi, the probability of y being zero is one minus phi, so we can write this compactly as phi to the y times one minus phi to the one minus y. So I'm going to take the exponential of the log of this - exponentiation and taking the log cancel each other out - and this is equal to e to the y log phi plus (one minus y) log(one minus phi), which rearranges to

exp( log(phi / (1 - phi)) * y + log(1 - phi) ).
And so, comparing with the exponential family form, log(phi / (1 - phi)) is to be eta, y here is T of y, and this last term, log(1 - phi), will be minus a of eta. And then b of y is just one, so b of y doesn't matter. Just take a second to look through this and make sure it makes sense. I'll clean another board while you do that.

So now let's write down a few more things. Just copying from the previous board, we had that eta was equal to log of phi over one minus phi. So if I take this formula and invert it - if you solve for phi as a function of eta - you find that phi is one over one plus e to the minus eta. And so somehow the logistic function magically falls out of this. We'll take this even further later.

Again, copying definitions from the previous board, a of eta, I said, is minus log of one minus phi. So again, phi and eta are functions of each other - eta depends on phi, and phi depends on eta. So if I plug this definition for phi into that, I'll find that a of eta is therefore equal to log of one plus e to the eta. And again, this is just algebra; it's not terribly interesting. And just to complete the rest of this, T of y is equal to y, and b of y is equal to one, okay.

So just to recap what we've done: we've come up with a certain choice of functions a, T and b, so that my formula for the exponential family distribution now becomes exactly the formula for the probability mass function of the Bernoulli distribution. And the natural parameter eta has a certain relationship to the original parameter phi of the Bernoulli.
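Collected in one place, the choices just derived for the Bernoulli are:

```latex
\eta = \log\frac{\phi}{1-\phi}, \qquad
\phi = \frac{1}{1+e^{-\eta}}, \qquad
T(y) = y, \qquad
a(\eta) = -\log(1-\phi) = \log\!\big(1+e^{\eta}\big), \qquad
b(y) = 1
```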
Question? [Student asks about the second-to-last line of the derivation.] Let's see. Yeah, so if you expand this term out, (one minus y) times log(one minus phi), then one times log(one minus phi) becomes this term, and the other term is minus y times log(one minus phi). And the minus of a log is the log of one over the thing, so minus y times log(one minus phi) becomes y times log of one over one minus phi. Does that make sense? Yeah, cool. Anything else? Yes?

[Student: eta is a scalar, isn't it? But up there it's written eta transpose, so it could be a vector.] Yes. So, let's see - in this and the next example, eta will turn out to be a scalar. And if eta is a scalar and T of y is a scalar, then eta transpose T of y is just a real number times a real number - like a one-dimensional vector transposed times a one-dimensional vector. Towards the end of today's lecture we'll go through one example where both of these are vectors, but for most distributions these will turn out to be scalars.

[Student asks whether this form is a valid distribution, since it should put probability only on y equals zero and one.] I see. So, yeah - for this, let's imagine that we're restricting the domain of the input of the function to be y equals zero or one. Think of that as an implicit constraint: this is a probability mass function for y equals zero or y equals one. So write down y in the set zero, one; let's think of that as implicit. Cool.

So this takes the Bernoulli distribution and writes it in the form of an exponential family distribution. Let me do that very quickly for the Gaussian. I won't do the algebra for the Gaussian; I'll basically just write out the answers.
So, take a normal distribution with mean mu and variance sigma squared. And you remember - it was the previous lecture - when we were deriving the maximum likelihood estimate for the parameters of ordinary least squares, we showed that the value of sigma squared didn't matter: when we derived the model, no matter what sigma squared was, we ended up with the same value of the parameters. So for the purposes of today's lecture, to not have to take sigma squared into account, I'm just going to set sigma squared to be equal to one, okay, so as to not worry about it. The lecture notes talk a little bit more about the general case, but just to make the derivation in class a bit easier and simpler today, let's just say that sigma squared equals one; sigma squared is essentially just a scaling factor on the variable y.

So in that case, the Gaussian density is given by one over root two pi times e to the minus one-half (y minus mu) squared. And by a couple of steps of algebra, which I'm not going to do here but which are written out in the lecture notes that you can download, this is one over root two pi, e to the minus one-half y squared, times e to the mu y minus one-half mu squared, okay. So I'm just not doing the algebra. And so that first factor is b of y, we have eta equal to mu, T of y equals y, and a of eta is equal to one-half mu squared; and because eta is equal to mu, this is just one-half eta squared, okay.
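Displayed, the decomposition just described (with sigma squared set to one) is:

```latex
p(y;\mu) = \underbrace{\frac{1}{\sqrt{2\pi}}\,e^{-y^{2}/2}}_{b(y)}\;
           \exp\!\Big(\underbrace{\mu}_{\eta}\,\underbrace{y}_{T(y)}
           \;-\; \underbrace{\tfrac{1}{2}\mu^{2}}_{a(\eta)}\Big),
\qquad \eta = \mu, \qquad a(\eta) = \tfrac{1}{2}\eta^{2}
```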
And so this would be a specific choice again of a, b and T that expresses the Gaussian density in the form of an exponential family distribution. And in this case, the relationship between eta and mu is that eta is just equal to mu, so the mean of the Gaussian is just equal to the natural parameter of the exponential family distribution. All right.

And so, a result you may have seen if you've taken an undergrad statistics class: it turns out that most of the "textbook distributions" - not all, but most of them - can be written in the form of an exponential family distribution. So you saw the Gaussian, the normal distribution. It turns out the multivariate normal distribution, which is the generalization of Gaussian random variables to high-dimensional vectors, is also in the exponential family. You saw the Bernoulli is an exponential family; it turns out the multinomial distribution is too. So the Bernoulli models outcomes over zero and one - coin tosses with two outcomes - and the multinomial models outcomes over K possible values; that's also an exponential family distribution. You may have heard of the Poisson distribution. The Poisson distribution is often used for modeling counts - things like the number of radioactive decays in a sample, or the number of customers arriving at your website, or the number of visitors arriving at a store. The Poisson distribution is also in the exponential family. So are the gamma and the exponential distributions, if you've heard of them. The gamma and the exponential distributions are distributions over the positive numbers, so they're often used to model intervals: if you're standing at the bus stop and you want to ask, "When is the next bus likely to arrive? How long do I have to wait for my bus to arrive?", you'd often model that with a gamma distribution or an exponential distribution. Those are also in the exponential family.
Even more exotic distributions, like the beta and the Dirichlet distributions - these are probability distributions over fractions, or really probability distributions over probability distributions - and also things like the Wishart distribution, which is a distribution over covariance matrices. So all of these, it turns out, can be written in the form of exponential family distributions. And in the problem set we ask you to take one of these distributions, write it in the form of an exponential family distribution, and derive a generalized linear model for it, okay.

Which brings me to the next topic: having chosen an exponential family distribution, how do you use it to derive a generalized linear model?

So, generalized linear models are often abbreviated GLMs. And I'm going to write down three assumptions. You can think of them as assumptions, or you can think of them as design choices, that will then allow me to sort of turn a crank and come up with a generalized linear model.

So the first one is: I'm going to assume that, given my input x and my parameters theta, the variable y - the output, or the response variable, that I'm trying to predict - is distributed exponential family with some natural parameter eta. And so this means that there is some specific choice of those functions a, b and T so that the conditional distribution of y given x, parameterized by theta, is exponential family with parameter eta, where eta may depend on x in some way. So for example, if you're trying to predict how many customers have arrived at your website, you may choose to model the number of people - the number of hits on your website - with a Poisson distribution, since the Poisson distribution is natural for modeling count data. And so you may choose the exponential family distribution here to be the Poisson distribution.
The second assumption is that, given x, our goal is to output the expected value of T of y given x. So given the features in the website example - say features about whether there were any promotions, whether there were sales, how many people linked to your website, or whatever - I'm going to assume that the goal in our prediction problem is to estimate the expected number of people that will arrive at your website on a given day. So in other words, you're saying that we want h of x to be equal to - oh, excuse me, I actually meant to write T of y here - my goal is to get my learning algorithm's hypothesis to output the expected value of T of y given x. But again, for most of the examples, T of y is just equal to y, and so for most of the examples, our goal is to get our learning algorithm to output the expected value of y given x, because T of y is usually equal to y. Yes? [Student asks whether this is the same T as before.] Yes, same thing, right - T of y, the sufficient statistic, the same T of y.

And lastly, this last one I wrote down - these are assumptions, but this last one you might want to think of more as a design choice. We've assumed that the distribution of y given x is an exponential family distribution with some parameter eta - so the number of visitors to the website on any given day would be Poisson with some parameter, say. And the last decision I need to make is what the relationship is between my input features and this parameter eta parameterizing my Poisson distribution or whatever. And for this last step, I'm going to make the assumption, or really the design choice, that the relationship between eta and my inputs x is linear, and in particular that eta is equal to theta transpose x.
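Collected in one place, the three assumptions, or design choices, just stated are:

```latex
\begin{aligned}
&1.\ \ y \mid x;\theta \;\sim\; \mathrm{ExponentialFamily}(\eta)\\
&2.\ \ h(x) = \mathbb{E}\big[\,T(y)\mid x\,\big] \qquad (\text{usually } T(y)=y,\ \text{so } h(x)=\mathbb{E}[y\mid x])\\
&3.\ \ \eta = \theta^{\top}x \qquad (\text{or } \eta_i = \theta_i^{\top}x \text{ when } \eta \text{ is a vector})
\end{aligned}
```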
And the reason I make this design choice is that it will allow me to turn the crank of the generalized linear model machinery and come up with very nice algorithms for fitting, say, Poisson regression models, or for performing regression with gamma-distributed outputs or exponential-distributed outputs, and so on.

So let's work through an example. One note first: eta equals theta transpose x works for the case where eta is a real number. For the more general case, you would have eta_i equals theta_i transpose x, if eta is a vector rather than a real number. But again, in most of the examples, eta will just be a real number. All right.

So let's work through the Bernoulli example. We assume that y given x, parameterized by theta, is distributed exponential family with natural parameter eta, and for the Bernoulli distribution I'm going to choose a, b and T to be the specific forms that cause that exponential family to become the Bernoulli distribution. This is the example we worked through just now - the first example we worked through just now.

And so, for any fixed value of x and theta, my hypothesis - my learning algorithm - will make a prediction, will output h of x, which, by my assumption, is the expected value of y given x, parameterized by theta. Now, when y can take on only the values zero and one, the expected value of y is just equal to the probability that y is equal to one. So the expected value of a Bernoulli variable is just equal to the probability that it's equal to one. And so the probability that y equals one is just equal to phi, because phi is, by definition, the parameter of my Bernoulli distribution - the probability of the Bernoulli taking the value one.
Which we worked out previously: phi was one over one plus e to the negative eta. We worked this out on the previous board - when we wrote down the Bernoulli distribution in the form of an exponential family, we worked out what the relationship was between phi and eta, and it was this. So we've worked out the relationship between the expected value of y and eta. And lastly, because we made the design choice, or the assumption, that eta and theta are linearly related, this is therefore equal to one over one plus e to the minus theta transpose x.

And so that's how I come up with the logistic regression algorithm: when you have a response variable y that takes on two values, and then you choose to model it with a Bernoulli distribution. Are you sure this makes sense? Raise your hand if this makes sense. Yeah, okay, cool. So I hope you see the ease of use of this, or sort of the power of this. The only decision I made was really this: let's say I have a new machine learning problem and I'm trying to predict the value of a variable y that happens to take on two values. Then the only decision I need to make is to choose the Bernoulli distribution. I say I want to assume that, given x and theta, y is distributed Bernoulli. That's the only decision I made, and then everything else follows automatically from having made the decision to model y given x, parameterized by theta, as Bernoulli.

In the same way you can choose a different distribution - you can choose y as Poisson, or y as gamma, or y as whatever - and follow a similar process and come up with a different model and a different learning algorithm; come up with a different generalized linear model for whatever learning problem you're faced with.
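To illustrate turning that same crank with a different distribution, here is a minimal numpy sketch of Poisson regression under this recipe - not something written out in this lecture, just a sketch assuming the Poisson's canonical response E[y | x] = exp(theta transpose x) and fitting theta by gradient ascent on the log-likelihood:

```python
import numpy as np

def fit_poisson_regression(X, y, alpha=0.01, num_iters=2000):
    """GLM with y | x; theta ~ Poisson(lambda), where lambda = exp(theta^T x).
    The gradient of the log-likelihood has the familiar (y - h) * x form."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        h = np.exp(X @ theta)            # predicted expected counts, E[y | x; theta]
        grad = X.T @ (y - h) / m         # average gradient of the Poisson log-likelihood
        theta = theta + alpha * grad     # gradient ascent step
    return theta
```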
One tiny little piece of notation: the function g that relates the natural parameter eta to the expected value of y - which in this case is one over one plus e to the minus eta - is called the canonical response function. And g inverse is called the canonical link function. These aren't a huge deal; I won't use this terminology a lot. I'm just mentioning them in case you hear people talk about generalized linear models and they mention canonical response functions or canonical link functions, just so you know what that is. Actually, many texts use the names the other way around - they call this g inverse and that g - but this notation turns out to be more consistent with other algorithms in machine learning, so I'm going to use this notation. But I probably won't use the terms canonical response function and canonical link function in lecture a lot; I'm not big on memorizing lots of names of things. I'm just tossing those out there in case you see them elsewhere. Okay.

You know what, I think in the interest of time, I'm going to skip over the Gaussian example. But again, just as assuming y is Bernoulli gives you logistic regression, you can do the same thing with the Gaussian distribution and end up with the ordinary least squares model. The problem with the Gaussian is that it's almost so simple that when you see it for the first time it's sometimes more confusing than the Bernoulli model, because it looks so simple it seems like it has to be more complicated. So let me just skip that and leave you to read about the Gaussian example in the lecture notes. And what I want to do is actually go through a more complex example.
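For the Bernoulli case, then, the canonical response function is the logistic function and the canonical link function is its inverse, the logit. A small illustrative sketch (Python/NumPy; the names are mine, not the lecture's):

    import numpy as np

    def g(eta):
        # Canonical response function for the Bernoulli GLM:
        # maps the natural parameter eta to E[y] = phi.
        return 1.0 / (1.0 + np.exp(-eta))

    def g_inv(phi):
        # Canonical link function (the logit): maps E[y] back to eta.
        return np.log(phi / (1.0 - phi))

    eta = 0.8
    assert np.isclose(g_inv(g(eta)), eta)   # g and g_inv are inverses of each other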
Question? [Inaudible] Okay, right - so the question is how you choose what theta will be. We'll get to that at the end. What you have there is the logistic regression model, which is a probabilistic model that assumes the probability of y given x is given by a certain form. And so what you do is you write down the log likelihood of your training set, and find the value of theta that maximizes the log likelihood of the parameters. Does that make sense? I'll say that again towards the end of today's lecture. But for logistic regression, the way you choose theta is exactly maximum likelihood, as we worked out in the previous lecture, using Newton's Method or gradient ascent or whatever. I'll try to do that again for one more example towards the end of today's lecture.
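Before moving on, here is a minimal sketch of that maximum-likelihood fitting procedure for logistic regression, using plain batch gradient ascent rather than Newton's Method (Python/NumPy; an illustration under my own naming, not code from the course materials):

    import numpy as np

    def log_likelihood(theta, X, y):
        # l(theta) = sum_i [ y_i log h(x_i) + (1 - y_i) log(1 - h(x_i)) ]
        h = 1.0 / (1.0 + np.exp(-X @ theta))
        return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

    def fit_logistic(X, y, alpha=0.1, iters=1000):
        # Batch gradient ascent on the log likelihood;
        # the gradient works out to X^T (y - h).
        theta = np.zeros(X.shape[1])
        for _ in range(iters):
            h = 1.0 / (1.0 + np.exp(-X @ theta))
            theta += alpha * X.T @ (y - h)
        return theta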
So what I want to do is actually use the remaining, I don't know, 19 minutes or so of this class to go through probably the most complex example of a generalized linear model that I'll use. This one I want to go through because it's a little bit trickier than many of the other textbook examples of generalized linear models. So again, what I'm going to do is go through the derivation reasonably quickly and give you the gist of it, and if there are steps I skip or details omitted, I'll leave you to read about them more carefully in the lecture notes.

And what I want to do is talk about the multinomial. The multinomial is the distribution over k possible outcomes. Imagine you're now in a machine-learning problem where the value of y that you're trying to predict can take on k possible values, rather than only two. For example, if you want a learning algorithm to automatically sort emails into the right email folder for you, you may have a dozen email folders you want your algorithm to classify emails into. Or, rather than predicting whether a patient has a disease or does not have a disease - which would be a binary classification problem - you may think that the patient has one of k diseases, and you want a learning algorithm to figure out which of the k diseases your patient has. So there are lots of multi-class classification problems where you have more than two classes, and you model those with the multinomial.

So for logistic regression, I had pictures like these, where you have a training set and you find a decision boundary that separates the two classes. Now we're going to entertain the value we're predicting taking on multiple values, so you might have three classes, and the learning algorithm will learn some way to separate out three classes or more, rather than just two.

So let's write the multinomial in the form of an exponential family distribution. The parameters of a multinomial are phi one, phi two, up to phi k - I'll actually change this in a second - where the probability of y equals i is phi i, because there are k possible outcomes. But if I choose this as my parameterization of the multinomial, then my parameters are actually redundant, because if these are probabilities, then they have to sum to one. And therefore, for example, I can derive the last parameter, phi k, as one minus phi one minus ... minus phi k minus one. So this would be a redundant parameterization of the multinomial; it's over-parameterized. And so for the purposes of this derivation, I'm going to treat the parameters of my multinomial as phi one, phi two, up to phi k minus one, and I won't think of phi k as a parameter. So I just have k minus one parameters parameterizing my multinomial. And sometimes I write phi k in my derivations as well, and you should think of phi k as just a shorthand for one minus the rest of the parameters, okay.

So it turns out the multinomial is one of the few examples where T of y is not equal to y. In this case, y is one of k possible values, and T of y is defined as follows: T of one is going to be a vector with a one in the first position and zeros everywhere else. T of two is going to be zero, one, zero and so on. Except that these are going to be k minus one-dimensional vectors.
And so T of k minus one is going to be zero, zero, ..., zero, one. And T of k is going to be the vector of all zeros. So this is just how I'm choosing to define T of y in order to write down the multinomial in the form of an exponential family distribution. Again, these are k minus one-dimensional vectors.

So this is a good point to introduce one more useful piece of notation, which is called indicator function notation. I'm going to write a one and then curly braces, and if I write a true statement inside, then the indicator of that statement is going to be one. If I write a one and then a false statement inside, then the value of the indicator function is going to be zero. For example, if I write indicator of two equals three, that's false, and so this is equal to zero. Whereas for indicator of one plus one equals two, I wrote down a true statement inside, and so the indicator of the statement is equal to one. So the indicator function is just a very useful notation for indicating the truth or falsehood of the statement inside.

And so - actually, let's do this here. To combine both of these: T of y is a vector; y is one of k values, and so T of y is one of these k vectors. If I use T of y subscript i to denote the i-th element of the vector T of y, then the i-th element of the vector T of y is just equal to the indicator of whether y is equal to i. Take a look at this for a second and make sure you understand all that notation and why this is true. All right. Actually, raise your hand if this equation makes sense to you. Most of you, not all, okay.
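A small sketch of this choice of T(y) and the identity (T(y))_i = 1{y = i}, just as a sanity check (Python/NumPy, hypothetical helper names):

    import numpy as np

    def T(y, k):
        # T(y) is a (k-1)-dimensional vector: T(i) has a 1 in position i
        # for i = 1..k-1, and T(k) is the all-zeros vector.
        t = np.zeros(k - 1)
        if y < k:
            t[y - 1] = 1.0
        return t

    k = 4
    for y in range(1, k + 1):
        for i in range(1, k):
            # (T(y))_i equals the indicator 1{y == i}
            assert T(y, k)[i - 1] == (1.0 if y == i else 0.0)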
Just as one kind of example, suppose y is equal to one. Then T of y is equal to this vector, and therefore the first element of this vector will be one, and the rest of the elements will be equal to zero. And so - let me try that again, I'm sorry. Let's say I want to look at the i-th element of the vector T of y, and I want to know whether it is one or zero. Well, the i-th element of the vector T of y will be equal to one if, and only if, y is equal to i. Because, for example, if y is equal to one, then only the first element of this vector will be one; if y is equal to two, then only the second element of the vector will be one, and so on. So the question of whether the i-th element of this vector T of y is equal to one is answered by just asking whether y is equal to i. Okay. If you're still not quite sure why that's true, go home and think about it a bit more, and take a look at the lecture notes as well - maybe that'll help. At least for now, just take my word for it.

So let's go ahead and write out the distribution for the multinomial in exponential family form. So p of y is equal to phi one to the indicator of y equals one, times phi two to the indicator of y equals two, up to phi k to the indicator of y equals k. And again, phi k is not a parameter of the distribution; phi k is a shorthand for one minus phi one minus phi two minus the rest. And so, using the equation on the left as well, I can also write this as phi one to the T of y sub one, times phi two to the T of y sub two, dot, dot, dot, times phi k minus one to the T of y sub k minus one, times phi k to the one minus the sum of the T of y sub i's.

And it turns out - and it takes some steps of algebra that I don't have time to show - that you can simplify this into the exponential family form, where eta is a vector, a k minus one-dimensional vector.
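Here is an illustrative check that this product-of-indicators form really is just the probability of the observed outcome (Python/NumPy; the function name is made up for this note):

    import numpy as np

    def multinomial_pmf_product_form(y, phi, k):
        # p(y) = prod_i phi_i^{1{y = i}}, where phi holds phi_1..phi_{k-1}
        # and phi_k = 1 - sum of the rest.
        phi_full = np.append(phi, 1.0 - np.sum(phi))
        return np.prod([phi_full[i - 1] ** (1.0 if y == i else 0.0)
                        for i in range(1, k + 1)])

    phi = np.array([0.2, 0.3, 0.4])   # k = 4, so phi_4 = 0.1 implicitly
    k = 4
    for y in range(1, k + 1):
        # The product collapses to just phi_y, the probability of outcome y.
        expected = np.append(phi, 1.0 - phi.sum())[y - 1]
        assert np.isclose(multinomial_pmf_product_form(y, phi, k), expected)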
Deriving this takes a few steps of algebra that you can work out yourself, but I won't do here. And so, using my definition for T of y, and by choosing eta, a and b this way, I can take my multinomial distribution and write it out in the form of an exponential family distribution.

It turns out also that - let's see. For the Bernoulli, one of the things we did was we had eta as a function of phi, and then we inverted that to write out phi as a function of eta. So it turns out you can do that here as well. This defines eta as a function of the multinomial distribution's parameters phi, so you can take this relationship between eta and phi and invert it, and write out phi as a function of eta. And it turns out you get that phi i is equal to e to the eta i, divided by one plus the sum over j of e to the eta j. The way you do this is: this defines eta as a function of phi, so if you take this and solve for phi, you end up with this. Again, there are a couple of steps of algebra that I'm just not showing.

And then lastly, using our assumption that the eta i are a linear function of the input features x, phi i is therefore equal to e to the theta i transpose x, divided by one plus the sum over j equals one to k minus one of e to the theta j transpose x. And this is just using the fact that eta i equals theta i transpose x, which was our earlier design choice from generalized linear models.

So we're just about done. My learning algorithm - I'm going to think of it as outputting the expected value of T of y, given x and parameterized by theta. So T of y was this vector of indicator functions: T one was indicator y equals one, down to indicator y equals k minus one. All right. So I want my learning algorithm to output this: the expected value of this vector of indicator functions.
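That inversion - recovering phi from eta, and then plugging in eta i equals theta i transpose x - can be sketched like this (Python/NumPy, illustrative names only):

    import numpy as np

    def phi_from_eta(eta):
        # phi_i = e^{eta_i} / (1 + sum_j e^{eta_j})  for i = 1..k-1;
        # the implicit phi_k is 1 / (1 + sum_j e^{eta_j}).
        denom = 1.0 + np.sum(np.exp(eta))
        return np.exp(eta) / denom

    def phi_from_x(Theta, x):
        # With the GLM design choice eta_i = theta_i^T x,
        # Theta is a (k-1) x (n+1) array whose rows are theta_1..theta_{k-1}.
        return phi_from_eta(Theta @ x)

    eta = np.array([1.0, -0.5, 0.2])
    phi = phi_from_eta(eta)
    print(phi, 1.0 - phi.sum())   # all k probabilities sum to one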
The expected value of indicator y equals one is just the probability that y equals one, which is given by phi one. So I have a random variable that's one whenever y is equal to one and zero otherwise, and so the expected value of that indicator is just the probability that y equals one, which is given by phi one. And therefore, by what we worked out earlier, this is e to the theta one transpose x, over one plus the sum over j of e to the theta j transpose x. And so my learning algorithm will output the probability that y equals one, y equals two, up to y equals k minus one, and these probabilities are parameterized by functions like these.

And so, just to give this algorithm a name, this algorithm is called softmax regression, and it's widely thought of as the generalization of logistic regression to the case of k classes rather than two classes.

And so, just to be very concrete about what you do: say you have a machine-learning problem and you want to apply softmax regression to it. Generally you'd work through the entire derivation, but I think the question you had is about how to fit the parameters. So let's say you have a machine-learning problem, and y takes on one of k classes. What you do is you sit down and say, "Okay, I want to model y as being multinomial given x, parameterized by theta." So you choose the multinomial as your exponential family. Then you turn the crank, and everything else I wrote down follows automatically from having made the choice of using the multinomial distribution as your choice of exponential family. And then you have your training set, x one, y one, up to x m, y m, where now the value of y takes on one of k possible values. And what you do is you then find the parameters of the model by maximum likelihood. So you write down the likelihood of the parameters, and you maximize the likelihood.
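Concretely, then, the softmax regression hypothesis outputs the vector of probabilities phi one through phi k minus one, with phi k implied. A minimal sketch (Python/NumPy; Theta here is just my own notation for stacking theta one through theta k minus one as rows):

    import numpy as np

    def softmax_hypothesis(Theta, x):
        # h_theta(x) = E[T(y) | x; theta] = [phi_1, ..., phi_{k-1}],
        # where phi_i = e^{theta_i^T x} / (1 + sum_j e^{theta_j^T x}).
        e = np.exp(Theta @ x)
        return e / (1.0 + e.sum())

    def predict(Theta, x):
        # Append the implicit phi_k and return the most probable class (1..k).
        phi = softmax_hypothesis(Theta, x)
        phi_all = np.append(phi, 1.0 - phi.sum())
        return int(np.argmax(phi_all)) + 1

    Theta = np.array([[0.1, 0.5], [-0.3, 0.2]])   # k = 3 classes, 2 input features
    x = np.array([1.0, 2.0])
    print(softmax_hypothesis(Theta, x), predict(Theta, x))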
So what's the likelihood? Well, the likelihood, as usual, is the product over your training set of p of y i given x i, parameterized by theta. That's the likelihood, same as we had before. And that's the product over your training set of - let me write these down now - phi one to the indicator y i equals one, times phi two to the indicator y i equals two, dot, dot, dot, up to phi k to the indicator y i equals k. Where, for example, phi one depends on theta through the formula from before: it is e to the theta one transpose x, over one plus the sum over j of e to the theta j transpose x. And so phi one here is really a shorthand for that formula, and similarly for phi two and so on, up to phi k, where phi k is one minus all of these things. All right.

So this formula looks more complicated than it really is. What you really do is you write this down, then you take logs, compute the derivative of this formula with respect to theta, and apply, say, gradient ascent to maximize the likelihood.
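As a rough sketch of that procedure - write down the log likelihood, differentiate, and run gradient ascent - here is one illustrative version using per-example (stochastic) updates. The gradient formula, d l / d theta i equals the sum over examples of (indicator y equals i minus phi i) times x, follows from the derivation above but isn't written out in the lecture (Python/NumPy, hypothetical names):

    import numpy as np

    def log_likelihood(Theta, X, y, k):
        # l(Theta) = sum_m log phi_{y^(m)}(x^(m)), where the phi's come from
        # the softmax formula with an implicit reference class k.
        ll = 0.0
        for x_m, y_m in zip(X, y):
            e = np.exp(Theta @ x_m)
            phi_all = np.append(e, 1.0) / (1.0 + e.sum())
            ll += np.log(phi_all[y_m - 1])
        return ll

    def fit_softmax(X, y, k, alpha=0.1, iters=500):
        # Stochastic gradient ascent: per example,
        # d l / d theta_i = (1{y = i} - phi_i) * x.
        Theta = np.zeros((k - 1, X.shape[1]))
        for _ in range(iters):
            for x_m, y_m in zip(X, y):
                e = np.exp(Theta @ x_m)
                phi = e / (1.0 + e.sum())
                ind = np.array([1.0 if y_m == i else 0.0 for i in range(1, k)])
                Theta += alpha * np.outer(ind - phi, x_m)
        return Theta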
[Student:] What are the rows of theta? It's just been a vector, right? And now it looks like it's two-dimensional.

Yeah. In the notation of the multinomial, I have theta one through theta k minus one, and I've been thinking of each of these as an n plus one-dimensional vector - if x is n plus one-dimensional, then I have a set of parameters comprising k minus one such vectors. You could group all of these together into a matrix, but I just haven't been doing that; I've been thinking of them as k minus one parameter vectors.

[Student:] [Inaudible], what do they correspond to?

We're sort of out of time, so let me take that offline. It's hard to answer, in the same way that "what does theta correspond to in logistic regression?" is hard to answer - it's kind of like the [inaudible] for each feature. Sort of a similar interpretation, yeah. That's good. I think I'm running a little bit late, so why don't we officially close for the day, but you can come up if you have more questions and I'll take them offline. Thanks.