This presentation is delivered by the Stanford Center for Professional Development.

Okay. Good morning and welcome back to the third lecture of this class. So here's what I want to do today, and some of the topics I cover today may seem a little bit like I'm jumping, sort of, from topic to topic, but here's, sort of, the outline for today and the logical flow of ideas. In the last lecture, we talked about linear regression, and today I want to talk about an adaptation of that called locally weighted regression. It's a very powerful algorithm that's actually probably one of my former mentor's favorite machine learning algorithms. We'll then talk about a probabilistic interpretation of linear regression and use that to move on to our first classification algorithm, which is logistic regression; take a brief digression to tell you about something called the perceptron algorithm, which is something we'll come back to, again, later this quarter; and, time allowing, I hope to get to Newton's method, which is an algorithm for fitting logistic regression models.

So let's just recap what we were talking about in the previous lecture. Remember the notation that I defined: I used X superscript i, Y superscript i to denote the ith training example. And when we're talking about linear regression, or ordinary least squares, we use this to denote the predicted value output by my hypothesis H on the input X^i. My hypothesis was parameterized by the vector of parameters theta, and so we said that this was equal to the sum over j of theta_j times X_j^i, written more simply as theta transpose X. And we had the convention that X subscript zero is equal to one, so this accounts for the intercept term in our linear regression model. And lowercase n here was the notation I was using for the number of features in my training set. Okay? So in the example of trying to predict housing prices, we had two features, the size of the house and the number of bedrooms. We had two features, and little n was equal to two.
So just to finish recapping the previous lecture, we defined this quadratic cost function J of theta equals one-half times the sum from i equals one to m of H theta of X^i minus Y^i, squared, where this is the sum over the m training examples in my training set. So lowercase m was the notation I've been using to denote the number of training examples I have, the size of my training set. And at the end of the last lecture, we derived the value of theta that minimizes this in closed form, which was (X transpose X) inverse, X transpose Y. Okay?

So as we move on in today's lecture, I'll continue to use this notation and, again, I realize this is a fair amount of notation to remember, so if partway through this lecture you forget - if you're having trouble remembering what lowercase m is or what lowercase n is or something - please raise your hand and ask.

When we talked about linear regression last time we used two features. One of the features was the size of the houses in square feet, so the living area of the house, and the other feature was the number of bedrooms in the house. In general, when you apply a machine learning algorithm to some problem that you care about, the choice of the features will very much be up to you, right? And the way you choose the features to give the learning algorithm will often have a large impact on how it actually does.

So just for example, the choice we made last time was X1 equals the size. And let's leave aside the feature of the number of bedrooms for now; let's say we don't have data that tells us how many bedrooms are in these houses. One thing you could do is actually define - oh, let's draw this out. And so, right? So say that was the size of the house and that's the price of the house. So if you use this as a feature maybe you get theta zero plus theta one X1, this, sort of, linear model. If you choose - let me just copy the same data set over, right? You can define the set of features where X1 is equal to the size of the house and X2 is the square of the size of the house. Okay?
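As a transcription aid, here is a minimal sketch of the closed-form least squares fit just recapped, assuming a NumPy design matrix whose first column is all ones. The function name and the toy numbers are my own, not from the lecture.

```python
import numpy as np

def fit_least_squares(X, y):
    """Closed-form ordinary least squares: theta = (X^T X)^{-1} X^T y.

    X is an (m, n+1) design matrix whose first column is all ones
    (the intercept convention x_0 = 1); y is an (m,) vector of targets.
    """
    # Solving the normal equations is preferred over forming the inverse explicitly.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical toy data: living area (sq ft) with an intercept column.
X = np.array([[1.0, 2104.0], [1.0, 1600.0], [1.0, 2400.0]])
y = np.array([400.0, 330.0, 369.0])
theta = fit_least_squares(X, y)
print(theta)  # [theta_0, theta_1]
```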
So X1 is the size of the house in, say, square footage, and X2 is - just take whatever the square footage of the house is and square that number. And this would be another way to come up with a feature, and if you do that then the same algorithm will end up fitting a quadratic function for you: theta zero plus theta one X1 plus theta two X1 squared. Okay? Because this X1 squared is actually X2. And depending on what the data looks like, maybe this is a slightly better fit to the data.

You can actually take this even further, right? Which is - let's see. I have seven training examples here, so you can actually fit up to a sixth-order polynomial. You can actually fit a model theta zero plus theta one X plus theta two X squared plus, up to, theta six X to the power of six, and fit a sixth-order polynomial to these seven data points. And if you do that you find that you come up with a model that fits your data exactly. This is where, I guess, in this example I drew, we have seven data points, so if you fit a sixth-order polynomial you can, sort of, fit a curve that passes through these seven points perfectly. And you'll probably find that the curve you get will look something like that.

On the one hand, this is a great model in the sense that it fits your training data perfectly. On the other hand, this is probably not a very good model in the sense that none of us seriously think that this is a very good predictor of housing prices as a function of the size of the house, right? So we'll actually come back to this later. It turns out, of the models we have here, I feel like maybe the quadratic model fits the data best, whereas the linear model - it looks like there's actually a bit of a quadratic component in this data that the linear function is not capturing.

So we'll actually come back to this a little bit later and talk about the problems associated with fitting models that are either too simple, that use too small a set of features, or models that are too complex and maybe use too large a set of features. Just to give these a name, we call this the problem of underfitting and, very informally, this refers to a setting where there are obvious patterns in the data that the algorithm is just failing to fit.
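As a transcription aid, here is a minimal sketch of the feature-engineering idea just described: adding powers of the size as extra features lets the same least squares machinery fit a polynomial. The helper name and the toy sizes and prices are hypothetical, not from the lecture.

```python
import numpy as np

def polynomial_design_matrix(x, degree):
    """Build columns [1, x, x^2, ..., x^degree] from a 1-D feature x,
    so ordinary least squares ends up fitting a degree-`degree` polynomial."""
    return np.vander(x, N=degree + 1, increasing=True)

# Hypothetical sizes (thousands of sq ft, scaled to keep the fit well conditioned)
# and prices. With 7 points a degree-6 fit passes through them almost exactly
# (the overfitting case), while degree 1 or 2 leaves residual but generalizes better.
x = np.array([1.0, 1.5, 1.8, 2.1, 2.3, 2.7, 3.0])
y = np.array([200., 260., 280., 330., 340., 390., 400.])

for degree in (1, 2, 6):
    X = polynomial_design_matrix(x, degree)
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residual = np.sum((X @ theta - y) ** 2)
    print(degree, residual)
```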
And this problem here we refer to as overfitting and, again, very informally, this is when the algorithm is fitting the idiosyncrasies of this specific data set, right? It just so happens that, of the seven houses we sampled in Portland, or wherever you collect data from, that house happens to be a bit more expensive, that house happened to be a bit less expensive, and by fitting a sixth-order polynomial we're, sort of, fitting the idiosyncratic properties of this data set rather than the true underlying trends of how housing prices vary as a function of the size of the house. Okay?

So these are two very different problems. We'll define them more formally later and talk about how to address each of these problems, but for now I hope you appreciate that there is this issue of selecting features. If you want to address this in your learning problem, there are a few ways to do so. We'll talk about feature selection algorithms later this quarter as well - so, automatic algorithms for choosing what features you use in a regression problem like this. What I want to do today is talk about a class of algorithms called non-parametric learning algorithms that will help to alleviate the need somewhat for you to choose features very carefully. Okay? And this leads us into our discussion of locally weighted regression.

And just to define the term: linear regression, as we've defined it so far, is an example of a parametric learning algorithm. A parametric learning algorithm is one that's defined as an algorithm that has a fixed number of parameters that are fit to the data. Okay? So in linear regression we have a fixed set of parameters theta, right, that must be fit to the data.

In contrast, what I'm gonna talk about now is our first non-parametric learning algorithm. The formal definition is not very intuitive, so I'll also give a second, more intuitive one. The, sort of, formal definition of a non-parametric learning algorithm is that it's an algorithm where the number of parameters grows with m, with the size of the training set. And usually it's defined as the number of parameters growing linearly with the size of the training set. This is the formal definition.
A slightly less formal definition is that the amount of stuff your learning algorithm needs to keep around will grow linearly with the training set size. Or, another way of saying it is that this is an algorithm that needs to keep around the entire training set, even after learning. Okay? So don't worry too much about this definition. But what I want to do now is describe a specific non-parametric learning algorithm called locally weighted regression. It also goes by a couple of other names - it also goes by the name of Loess, for historical reasons. Loess is usually spelled L-O-E-S-S, sometimes spelled like that, too. I just call it locally weighted regression.

So here's the idea. This will be an algorithm that allows us to worry a little bit less about having to choose features very carefully. So for my motivating example, let's say that I have a training set that looks like this, okay? So this is X and that's Y. If you run linear regression on this and you fit maybe a linear function to this, you end up with a more or less flat, straight line, which is not a very good fit to this data. You can sit around and stare at this and try to decide whether you're using the right features. So maybe you want to toss in a quadratic function, but this isn't really quadratic either. So maybe you want to model this as X plus X squared plus maybe some function of sine of X or something. You actually sit around and fiddle with features, and after a while you can probably come up with a set of features for which the model is okay, but let's talk about an algorithm that you can use without needing to do that.

So if - well, suppose you want to evaluate your hypothesis H at a certain point, at a certain query point, lowercase x. Okay? And let's say you want to know what's the predicted value of Y at this position of X, right? So for linear regression, what we were doing was we would fit theta to minimize the sum over i of Y^i minus theta transpose X^i, squared, and return theta transpose X. Okay? So that was linear regression.
In contrast, in locally weighted linear regression you're going to do things slightly differently. You're going to look at this point X, and then I'm going to look in my data set and take into account only the data points that are, sort of, in the little vicinity of X. Okay? So we'll look at where I want to evaluate my hypothesis - I'm going to look only in the vicinity of this point where I want to evaluate my hypothesis - and then I'm going to take, let's say, just these few points, and I will apply linear regression to fit a straight line just to this subset of the data. Okay? I'm using the term subset loosely - well, let's come back to that later. So we take this data set and I fit a straight line to it and maybe I get a straight line like that. And what I'll do is then evaluate this straight line at this particular value of X, and that will be the value my algorithm returns. This would be the predicted value - this would be the value that my hypothesis outputs in locally weighted regression. Okay?

So let me go ahead and formalize that. In locally weighted regression, we're going to fit theta to minimize the sum over i of W^i times Y^i minus theta transpose X^i, squared, where these terms W superscript i are called weights. There are many possible choices for the weights; I'm just gonna write one down: W^i equals e to the minus X^i minus X, squared, over two.

So let's look at what these weights really are, right? So notice that - suppose you have a training example X^i such that X^i is very close to X, so this is small, right? Then if X^i minus X is small, so if X^i minus X is close to zero, then this is e to the minus zero, and e to the zero is one. So if X^i is close to X, then W^i will be close to one. In other words, the weight associated with the ith training example will be close to one if X^i and X are close to each other. Conversely, if X^i minus X is large, then - I don't know, what would W^i be? Zero. Zero, right. Close to zero. Right. So if X^i is very far from X, then this is e to the minus of some large number, and e to the minus of some large number will be close to zero. Okay?
So the picture is: if I'm querying at a certain point X, shown on the X axis, and if my data set, say, looks like that, then I'm going to give the points close to this a large weight and give the points far away a small weight. So for the points that are far away, W^i will be close to zero, and so it's as if the points that are far away will not contribute much at all to this summation, right? So it's as if this is a sum over i of one times this quadratic term for the nearby points, plus zero times this quadratic term for the faraway points. And so the effect of using this weighting is that locally weighted linear regression fits a set of parameters theta paying much more attention to fitting the points close by accurately, while ignoring the contribution from faraway points. Okay? Yeah?

Why is it exponential [inaudible]?

Yeah. Let's see. So it turns out there are many other weighting functions you can use. It turns out that there are definitely different communities of researchers that tend to make different choices by default. There is somewhat of a literature debating exactly what function to use. This, sort of, exponential decay function happens to be a reasonably common one that seems to be a reasonable choice on many problems, but you can actually plug in other functions as well. Did I mention what [inaudible]?

For those of you that are familiar with the normal distribution, or the Gaussian distribution - this formula I've written out here cosmetically looks a bit like a Gaussian distribution. Okay? But this actually has absolutely nothing to do with the Gaussian distribution. This is not saying that X^i is Gaussian or whatever; there is no such interpretation. This is just a convenient function that happens to be bell-shaped, but don't endow it with any Gaussian semantics. Okay?
So, in fact - well, if you remember the familiar bell-shaped Gaussian - again, the way the weights are associated with these points is that if you imagine putting a bell-shaped bump centered around the position where you want to evaluate your hypothesis H, then the weight I give this point here is proportional to the height of the Gaussian - excuse me, to the height of the bell-shaped function - evaluated at that point. And the weight given to this other point, to this training example, will be proportional to that height, and so on. Okay? And so training examples that are really far away get a very small weight.

One last small generalization to this is that normally there's one other parameter to this algorithm, which I'll denote as tau. Again, this looks suspiciously like the variance of a Gaussian, but this is not a Gaussian; this is just a convenient form of function. This parameter tau is called the bandwidth parameter, and informally it controls how fast the weights fall off with distance. Okay? Let me just copy my diagram from the other side, I guess. So if tau is very small, if that's a query X, then you end up choosing a fairly narrow Gaussian - excuse me, a fairly narrow bell shape - so that the weights of the points that are far away fall off rapidly. Whereas if tau is large, then you end up choosing a weighting function that falls off relatively slowly with distance from your query. Okay?

So I hope you can, therefore, see that if you apply locally weighted linear regression to a data set that looks like this, then to ask what your hypothesis output is at a point like this, you end up having a straight line making that prediction. To ask what it outputs [inaudible] at that value, you put a straight line there and you predict that value. It turns out that every time you try to evaluate your hypothesis, every time you ask your learning algorithm to make a prediction for how much a new house costs or whatever, you need to run a new fitting procedure and then evaluate the line that you fit just at the position of the value of X, at the position of the query where you're trying to make a prediction. Okay?
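A minimal sketch of the prediction procedure just described - weight each training example by the bell-shaped function with bandwidth tau, fit a weighted least squares line, and read off the prediction at the query point. This assumes a NumPy design matrix with a leading column of ones; the function name and toy numbers are my own, not from the lecture.

```python
import numpy as np

def lwr_predict(x_query, X, y, tau):
    """Locally weighted linear regression prediction at a single query point.

    X is an (m, n+1) design matrix with a leading column of ones,
    y is an (m,) target vector, x_query is an (n+1,) query (also with the 1),
    and tau is the bandwidth controlling how fast the weights fall off.
    """
    # Bell-shaped weights: w_i = exp(-||x_i - x_query||^2 / (2 tau^2)).
    diffs = X - x_query
    w = np.exp(-np.sum(diffs ** 2, axis=1) / (2.0 * tau ** 2))

    # Solve the weighted normal equations: (X^T W X) theta = X^T W y.
    W = np.diag(w)
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

    # The prediction is theta^T x_query for the locally fit line.
    return x_query @ theta

# Hypothetical 1-D example: sizes (thousands of sq ft) with an intercept column.
X = np.column_stack([np.ones(5), np.array([1.0, 1.5, 2.0, 2.5, 3.0])])
y = np.array([200.0, 260.0, 310.0, 340.0, 400.0])
print(lwr_predict(np.array([1.0, 1.8]), X, y, tau=0.3))
```

Note that theta is refit from scratch for every query, which is exactly why the entire training set has to be kept around, as mentioned earlier.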
But if you do this for every point along the X-axis, then you find that locally weighted regression is able to trace out this, sort of, very non-linear curve for a data set like this. Okay?

So in the problem set we're actually gonna let you play around more with this algorithm, so I won't say too much more about it here. But to finally move on to the next topic, let me check what questions you have. Yeah?

It seems like you still have the same problem of overfitting and underfitting, like when you choose tau. Like you make it too small in your -

Yes, absolutely. Yes. So locally weighted regression can run into - locally weighted regression is not a panacea for the problem of overfitting or underfitting. You can still run into the same problems with locally weighted regression. What you just said about - well, some of these things I'll leave you to discover for yourself in the homework problem. You'll actually see what you just mentioned. Yeah?

It almost seems like you're not even thoroughly [inaudible] with this locally weighted, you had all the data that you originally had anyway. I'm just trying to think of [inaudible] the original data points.

Right. So the question is, sort of, this: it's almost as if you're not building a model, because you need the entire data set. And the other way of saying that is that this is a non-parametric learning algorithm. So - I don't know, I won't debate whether we're really building a model or not. But this is a perfectly fine - I think when you write code implementing locally weighted linear regression on the data set, I think of that code as a whole as building your model. We've actually used this quite successfully to model, sort of, the dynamics of an autonomous helicopter. Yeah?

Can this algorithm learn the weights based on the data? Learn what weights? Oh, the weights W^i, instead of using [inaudible]. I see, yes. So it turns out there are a few things you can do. One thing that is quite common is to choose this bandwidth parameter tau using the data, right? We'll actually talk about that a bit later when we talk about model selection. Yes? One last question.
I used [inaudible] Gaussian, sometimes if you [inaudible] Gaussian and then -

Oh, I guess. Let's see. Boy. The weights are not random variables and, for the purpose of this algorithm, it is not useful to endow them with probabilistic semantics. So you could choose to define things as Gaussian, but it, sort of, doesn't lead anywhere. In fact, it turns out that I happened to choose this, sort of, bell-shaped function to define my weights; it's actually fine to choose a function that doesn't even integrate to one, that integrates to infinity, say, as your weighting function. So in that sense, I mean, you could force in the definition of a Gaussian, but it's, sort of, not useful, especially since you can use other functions that integrate to infinity and don't integrate to one. Okay? That's the last question and then let's move on.

Assume that we have a very huge [inaudible], for example, a very huge set of houses, and we want to predict the price for each house, and so should the end result for each input - I'm seeing this very constantly for -

Yes, you're right. So because locally weighted regression is a non-parametric algorithm, every time you make a prediction you need to fit theta to your entire training set again. So you're actually right: if you have a very large training set then this is a somewhat expensive algorithm to use, because every time you want to make a prediction you need to fit a straight line to a huge data set again. It turns out there are ways to make this much more efficient for large data sets as well. I don't want to talk about that today; if you're interested, look up the work of Andrew Moore on KD-trees. He, sort of, figured out ways to fit these models much more efficiently. That's not something I want to go into today. Okay? Let me move on. Let's take more questions later. So, okay.

So that's locally weighted regression. Remember the outline I had, I guess, at the beginning of this lecture. What I want to do now is talk about a probabilistic interpretation of linear regression, all right? And in particular, it'll be this probabilistic interpretation that lets us move on to talk about logistic regression, which will be our first classification algorithm. So let's put aside locally weighted regression for now.
We'll just talk about ordinary unweighted linear regression. Let's ask the question of why least squares, right? Of all the things we could optimize, how do we come up with this criterion of minimizing the square of the error between the predictions of the hypothesis and the values Y observed? Why not minimize the absolute value of the errors, or the errors to the power of four, or something? What I'm going to do now is present one set of assumptions that will serve to "justify" why we're minimizing the sum of squared errors. Okay?

It turns out that there are many assumptions that are sufficient to justify why we do least squares, and this is just one of them. So I'm presenting one set of assumptions under which least squares regression makes sense, but this is not the only set of assumptions. Even if the assumptions I describe don't hold, least squares actually still makes sense in many circumstances. But this will hopefully give one rationalization, like, one reason, for doing least squares regression. And, in particular, what I'm going to do is endow the least squares model with probabilistic semantics.

So let's assume, in our example of predicting housing prices, that the price a house is sold for is going to be some linear function of the features, plus some term epsilon^i. Okay? And epsilon^i will be our error term. You can think of the error term as capturing unmodeled effects - like maybe there are some other features of a house, like how many fireplaces it has or whether there's a garden or whatever, additional features that we just fail to capture - or you can think of epsilon as random noise. Epsilon is our error term that captures both of these: unmodeled effects, just things we forgot to model, maybe the function isn't quite linear or something, as well as random noise, like maybe that day the seller was in a really bad mood and so he sold it - just refused to go for a reasonable price or something.

And now I will assume that the errors have a probabilistic - have a probability distribution. I'll assume that the errors epsilon^i are distributed - I'll write it like this to denote that epsilon^i is distributed according to a probability distribution.
That's a Gaussian distribution with mean zero and variance sigma squared. Okay? So the script N here stands for normal, right, to denote a normal distribution, also known as the Gaussian distribution, with mean zero and variance sigma squared. Actually, just quickly raise your hand if you've seen a Gaussian distribution before. Okay, cool. Most of you. Great. Almost everyone.

So, in other words, the density for the Gaussian is what you've seen before: the density for epsilon^i would be one over root two pi sigma, times e to the negative epsilon^i squared over two sigma squared, right? And the density of our epsilon^i will be this bell-shaped curve, with one standard deviation being, sort of, sigma. Okay? This is the formula for that bell-shaped curve.

So, let's see. I can erase that. Can I erase the board? So this implies that the probability distribution of the price of a house, given X^i and the parameters theta, is going to be Gaussian with that density. Okay? In other words, this is saying that the price of a house, given the features of the house and my parameters theta, is going to be a random variable that's distributed Gaussian with mean theta transpose X^i and variance sigma squared, right? Because we imagine that the way the housing prices are generated is that the price of a house is equal to theta transpose X^i plus some random Gaussian noise with variance sigma squared. So the price of a house is going to have mean theta transpose X^i and variance sigma squared, right? Does this make sense? Raise your hand if this makes sense. Yeah, okay. Lots of you.

One point of notation - oh, yes? Assuming we don't know anything about the error, why do you assume here that the error is Gaussian?

Right. So, boy. Why do I assume the error is Gaussian? Two reasons, right? One is that it turns out to be mathematically convenient to do so, and the other is, I don't know, I can also mumble about justifications, such as appealing to the central limit theorem.
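As a transcription aid, here are the two densities being read out above, written in the lecture's notation (nothing new is added):

```latex
\epsilon^{(i)} \sim \mathcal{N}(0,\sigma^2), \qquad
p\bigl(\epsilon^{(i)}\bigr) = \frac{1}{\sqrt{2\pi}\,\sigma}
  \exp\!\left(-\frac{\bigl(\epsilon^{(i)}\bigr)^2}{2\sigma^2}\right),

p\bigl(y^{(i)} \mid x^{(i)}; \theta\bigr) = \frac{1}{\sqrt{2\pi}\,\sigma}
  \exp\!\left(-\frac{\bigl(y^{(i)} - \theta^{T}x^{(i)}\bigr)^2}{2\sigma^2}\right),
\qquad\text{i.e.}\qquad
y^{(i)} \mid x^{(i)}; \theta \;\sim\; \mathcal{N}\bigl(\theta^{T}x^{(i)},\,\sigma^2\bigr).
```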
It turns out that, for the vast majority of problems, if you apply a linear regression model like this and try to measure the distribution of the errors - not all the time, but very often - you find that the errors really are Gaussian, that this Gaussian model is a good assumption for the error in regression problems like these. Some of you may have heard of the central limit theorem, which says that the sum of many independent random variables will tend towards a Gaussian. So if the error is caused by many effects, like the mood of the seller, the mood of the buyer, some other features that we missed, whether the place has a garden or not, and if all of these effects are independent, then by the central limit theorem you might be inclined to believe that the sum of all these effects will be approximately Gaussian. But in practice, I guess, the two real answers are: one, in practice this is actually a reasonably accurate assumption, and two, it turns out to be mathematically convenient to do so. Okay? Yeah?

It seems like we're saying that if we assume the error around the model has zero mean, then the error is centered around our model, which seems almost like we're trying to assume what we're trying to prove.

That's the [inaudible], but yes, you are assuming that the error has zero mean. Which is, yeah, right. I think later this quarter we'll get to some of the other things, but for now just think of this as a mathematically - it's actually not an unreasonable assumption. I guess in machine learning all the assumptions we make are almost never true in the absolute sense, right? Because, for instance, housing prices are priced in dollars and cents, so the errors in prices are not continuous-valued random variables, because houses can only be priced at a certain number of dollars and a certain number of cents and you never have fractions of cents in housing prices, whereas a Gaussian random variable would. So in that sense, the assumptions we make are never "absolutely true," but for practical purposes this is an accurate enough assumption that it'll be useful to make. Okay? I think in a week or two, we'll actually come back and say a bit more about the assumptions we make and when they help our learning algorithms and when they hurt our learning algorithms. We'll say a bit more about it when we talk about generative and discriminative learning algorithms, like, in a week or two. Okay?
So let me point out one bit of notation, which is that when I wrote this down I actually wrote P of Y^i given X^i and then semicolon theta, and I'm going to use this notation when we are not thinking of theta as a random variable. So in statistics this is sometimes called the frequentist point of view, where you think of there as being some, sort of, true value of theta out there that's generating the data, say, but we don't know what theta is. Theta is not a random variable, right? So it's not like there's some random value of theta out there; it's that there's some true value of theta out there, and it's just that we don't know what the true value of theta is. So if theta is not a random variable, then I'm going to avoid writing P of Y^i given X^i comma theta, because that would mean the probability of Y^i conditioned on X and theta, and you can only condition on random variables. So in this part of the class, where we're taking the sort of frequentist viewpoint rather than the Bayesian viewpoint, we're thinking of theta not as a random variable, but just as something we're trying to estimate, and we use the semicolon notation. So the way to read this is: this is the probability of Y^i given X^i and parameterized by theta. Okay? You read the semicolon as "parameterized by." And in the same way here, I'll say Y^i given X^i, parameterized by theta, is distributed Gaussian with that. All right.

So we're gonna make one more assumption. Let's assume that the error terms are IID, okay? Which stands for Independently and Identically Distributed. So I'm going to assume that the error terms are independent of each other, right? The identically distributed part just means that I'm assuming they all come from the same Gaussian distribution with the same variance, but the more important part of this is that I'm assuming that the epsilon^i's are independent of each other.

Now, let's talk about how to fit a model. The probability of Y given X, parameterized by theta - I'm actually going to give this another name. I'm going to write this down and we'll call this the likelihood of theta: the likelihood of theta is the probability of Y given X, parameterized by theta. And so this is going to be the product over my training set, like that.
Which is, in turn, going to be a product of those Gaussian densities that I wrote down just now, right? Okay?

Then, as a point of notation, I guess, I define this term here to be the likelihood of theta. And the likelihood of theta is just the probability of the data Y, right, given X and parameterized by theta. The terms likelihood and probability are often confused. The likelihood of theta is the same thing as the probability of the data you saw, so likelihood and probability are, sort of, the same thing, except that when I use the term likelihood I'm trying to emphasize that I'm taking this thing and viewing it as a function of theta. Okay? So likelihood and probability are really the same thing, except that when I want to view this thing as a function of theta, holding X and Y fixed, it's then called the likelihood. Okay? So hopefully you'll hear me say the likelihood of the parameters and the probability of the data, right, rather than the likelihood of the data or the probability of the parameters. Try to be consistent in that terminology.

So given that the probability of the data is this, and this is also the likelihood of the parameters, how do you estimate the parameters theta? So given a training set, what parameters theta do you want to choose for your model? Well, the principle of maximum likelihood estimation says that, right, you should choose the value of theta that makes the data as probable as possible. So choose theta to maximize the likelihood. Or, in other words, choose the parameters that make the data as probable as possible, right? So this is maximum likelihood estimation: choose the parameters that make it as likely, as probable, as possible for me to have seen the data I just did.

So for mathematical convenience, let me define lowercase l of theta. This is called the log likelihood function and it's just the log of capital L of theta. So this is the log of the product over i of one over root two pi sigma, times e to that exponent - I won't bother to write out what's in the exponent for now; it's just the same as on the previous board. The log of a product is the same as the sum of the logs, right?
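As a transcription aid, here is the derivation being spoken here and over the next few lines, written out in the lecture's notation (no new content is added):

```latex
\ell(\theta) = \log L(\theta)
  = \log \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma}
      \exp\!\left(-\frac{\bigl(y^{(i)} - \theta^{T}x^{(i)}\bigr)^2}{2\sigma^2}\right)
  = m \log \frac{1}{\sqrt{2\pi}\,\sigma}
    \;-\; \frac{1}{\sigma^{2}} \cdot \frac{1}{2}
      \sum_{i=1}^{m} \bigl(y^{(i)} - \theta^{T}x^{(i)}\bigr)^{2}.

% Maximizing \ell(\theta) is therefore the same as minimizing
% J(\theta) = \tfrac{1}{2}\sum_{i=1}^{m} \bigl(y^{(i)} - \theta^{T}x^{(i)}\bigr)^{2},
% for any fixed \sigma^{2} > 0.
```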
So it's a sum of the logs, which simplifies to m times the log of one over root two pi sigma, plus - and then the log and the exponential cancel each other, right? Because the log of e to something is just whatever's inside the exponent. So, you know what, let me write this on the next board. Okay.

So maximizing the likelihood, or maximizing the log likelihood, is the same as minimizing that term over there. Well, you get it, right? Because there's a minus sign. So maximizing this, because of the minus sign, is the same as minimizing this as a function of theta. And this is, of course, just the same quadratic cost function that we had last time, J of theta, right? So what we've just shown is that the ordinary least squares algorithm that we worked out in the previous lecture is just maximum likelihood, assuming this probabilistic model, assuming IID Gaussian errors on our data. Okay?

One thing that we'll actually use in the next lecture: notice that the value of sigma squared doesn't matter, right? No matter what the value of sigma squared is - I mean, sigma squared has to be a positive number, it's the variance of a Gaussian - so no matter what sigma squared is, since it's a positive number, the value of theta we end up with will be the same, right? Because in minimizing this you get the same value of theta no matter what sigma squared is. So it's as if, in this model, the value of sigma squared doesn't really matter. Just remember that for the next lecture; we'll come back to this again.

Any questions about this? Actually, let me clean up another couple of boards and then I'll see what questions you have. Okay. Any questions? Yeah?

I think here you try to measure the likelihood of your choice of theta by a fraction of error, but I think that measure depends on the family of theta too, for example. If you have a lot of parameters [inaudible] overfitting?

Yeah, yeah. I mean, you're asking about overfitting, whether this is a good model.
I think - the things you're mentioning are maybe deeper questions about learning algorithms that we'll come back to later, so I don't really want to get into that right now. Any more questions? Okay.

So this endows linear regression with a probabilistic interpretation. I'm actually going to use this, sort of, probabilistic interpretation in order to derive our next learning algorithm, which will be our first classification algorithm. Okay? So you'll recall that I said that regression problems are where the variable Y that you're trying to predict is continuous-valued. Now I'm actually gonna talk about our first classification problem, where the value Y you're trying to predict is a discrete value - it can take on only a small number of discrete values - and in this case I'll talk about binary classification, where Y takes on only two values, right?

So you come up with classification problems if you're trying to do, say, a medical diagnosis, and try to decide, based on some features, whether the patient has a disease or does not have a disease. Or, in the housing example, maybe you're trying to decide will this house sell in the next six months or not, and the answer is either yes or no - it'll either be sold in the next six months or it won't be. Other standard examples: if you want to build a spam filter, is this e-mail spam or not? It's yes or no. Or, you know, some of my colleagues work on predicting whether a computer system will crash, so you have a learning algorithm to predict: will this computing cluster crash over the next 24 hours? And, again, it's a yes or no answer.

So there's X, there's Y, and in a classification problem Y takes on two values, zero and one. That's it in binary classification. So what can you do? Well, one thing you could do is take linear regression, as we've described it so far, and apply it to this problem, right? So, you know, given this data set you can fit a straight line to it. Maybe you get that straight line, right? But this data set I've drawn, right, this is an amazingly easy classification problem. It's pretty obvious to all of us that the relationship between X and Y is - well, you just look at a value around here, and to the right of it Y is one, to the left Y is zero.
So you apply linear regression to this data set, you get a reasonable fit, and you can then maybe take your linear regression hypothesis - this straight line - and threshold it at 0.5. If you do that, you'll certainly get the right answer. You predict that if X is to the right of, sort of, the mid-point here then Y is one, and if it's to the left of that mid-point then Y is zero. So some people actually do this - apply linear regression to classification problems - and sometimes it'll work okay, but in general it's actually a pretty bad idea to apply linear regression to classification problems like these, and here's why. Let's say I change my training set by giving you just one more training example, all the way up there, right? Given this training set it's actually still entirely obvious what the relationship between X and Y is, right? It's just: if the value is greater than some threshold, Y is one, and if it's less, Y is zero. Giving you this additional training example really shouldn't change anything - I didn't really convey much new information; there's no surprise that this corresponds to Y equals one. But if you now fit linear regression to this data set, you end up with a line that, I don't know, maybe looks like that, right? And now the predictions of your hypothesis have changed completely if you threshold your hypothesis at 0.5. Okay? So - [Student:] In between there might be an interval where it's zero, right? For that far-off point? [Instructor:] Oh, you mean, like that? Right. Yeah, yeah, fine. Yeah, sure. A theta set like that, so. So, I guess - yes, you're right, but this is an example, and this example works. [Student:] [Inaudible] that will change it even more if you gave it all - [Instructor:] Yeah, then I think this actually would make it even worse. You would actually get a line that pulls out even further, right? But this is my example; I get to make it whatever I want, right? There's not a deep meaning to this. The point is just that it could be a really bad idea to apply linear regression to a classification problem. Sometimes it works fine, but usually I wouldn't do it. So there are a couple of problems with this. One is that - well, so what do you want to do for classification?
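To make that point concrete, here is a minimal numerical sketch, not from the lecture, using made-up one-dimensional data and the closed-form least-squares fit; the data values and function names are purely illustrative.

```python
import numpy as np

def fit_least_squares(X, y):
    """Closed-form ordinary least squares: theta = (X^T X)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def classify(theta, X):
    """Threshold the linear hypothesis theta^T x at 0.5 to get 0/1 labels."""
    return (X @ theta >= 0.5).astype(int)

# Hypothetical 1-D data: x is the feature, y is the 0/1 class label.
x = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
X = np.column_stack([np.ones_like(x), x])      # add intercept term x0 = 1
print(classify(fit_least_squares(X, y), X))    # [0 0 0 1 1 1]: all correct

# Add one extreme positive example far to the right. The right labels are no
# less obvious, but the fitted line tilts and the 0.5 crossing point moves,
# so an earlier positive example now falls on the wrong side.
x2 = np.append(x, 50.0)
y2 = np.append(y, 1.0)
X2 = np.column_stack([np.ones_like(x2), x2])
print(classify(fit_least_squares(X2, y2), X2)) # the example at x = 6 flips to 0
```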
You know the value of Y lies between zero and one, so to kind of fix this problem, let's just start by changing the form of our hypothesis so that my hypothesis always lies in the unit interval between zero and one. Okay? So if I know Y is either zero or one, then let's at least not have my hypothesis predict values much larger than one or much smaller than zero. And so, instead of choosing a linear function for my hypothesis, I'm going to choose something slightly different. In particular, I'm going to choose this function: H subscript theta of X is going to equal G of theta transpose X, where G is going to be this function, and so this becomes one over one plus e to the negative theta transpose X. And G of Z is called the sigmoid function, and it is often also called the logistic function. It goes by either of these names. And what G of Z looks like is the following: on the horizontal axis I'm going to plot Z, and G of Z will look like this. Okay? I didn't draw that very well. Okay. So G of Z tends towards zero as Z becomes very small, and G of Z ascends towards one as Z becomes large, and it crosses the vertical axis at 0.5. So this is the sigmoid function, also called the logistic function. Yeah? Question? [Student:] Why a sigmoid rather than a step function? [Instructor:] Say that again. [Student:] Why can't we choose a step function instead, since the output is binary? [Instructor:] Yeah, let me come back to that later. So it turns out that - where did I get this function from, right? I just wrote down this function. It actually turns out that there are two reasons for using this function that we'll come to. One is, when we talk about generalized linear models, we'll see that this falls out naturally as part of that broader class of models. And another reason, which we'll talk about next week: it turns out there are a couple of, I think, very beautiful reasons for why we choose logistic functions. We'll see that in a little bit. But for now let me just define it, and just take my word for it for now that this is a reasonable choice. Okay?
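As a small sketch of the function just defined (the NumPy usage and names are illustrative, not part of the lecture):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def h_theta(theta, x):
    """Hypothesis for logistic regression: g(theta^T x), always in (0, 1)."""
    return sigmoid(theta @ x)

# g(z) -> 0 as z -> -infinity, g(z) -> 1 as z -> +infinity, and g(0) = 0.5.
print(sigmoid(np.array([-10.0, 0.0, 10.0])))   # ~[4.5e-05, 0.5, 0.99995]
```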
But notice now that the values output by my hypothesis will always be between zero and one. Furthermore, just like we did for linear regression, I'm going to endow the outputs of my hypothesis with a probabilistic interpretation, right? So I'm going to assume that the probability that Y is equal to one, given X and parameterized by theta, is equal to H subscript theta of X, all right? So in other words, I'm going to imagine that my hypothesis is outputting all these numbers that lie between zero and one, and I'm going to think of my hypothesis as trying to estimate the probability that Y is equal to one. Okay? And because Y has to be either zero or one, the probability that Y equals zero is going to be one minus that. All right? So, more simply, it turns out you can actually take these two equations and write them more compactly. Write P of Y given X, parameterized by theta: this is going to be H subscript theta of X to the power of Y, times one minus H of X to the power of one minus Y. Okay? I know this looks somewhat bizarre, but it actually makes the derivation much nicer. If Y is equal to one, then this equation is H of X to the power of one times something to the power of zero, and anything to the power of zero is just one, right? So if Y equals one, this is just saying P of Y equals one is equal to H subscript theta of X. Okay? And in the same way, if Y is equal to zero, then this is P of Y equals zero equals this thing to the power of zero - so that disappears - times this thing to the power of one. Okay? So this is a compact way of writing both of these equations together in one line. So let's talk about parameter fitting, right? And, again, you can ask: given this model of my data, how do I fit the parameters theta of my model? So the likelihood of the parameters is, as before, just the probability of the data, right? Which is the product over i of P of Y^i given X^i, parameterized by theta - which is just plugging those in. Okay? I dropped the theta subscript just so we can write a little bit less. Oh, excuse me.
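Here is a rough sketch of how that compact form and the resulting likelihood read in code; the function names are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def prob_y_given_x(theta, x, y):
    """Compact form of the model:
    p(y | x; theta) = h(x)^y * (1 - h(x))^(1 - y),  with h(x) = g(theta^T x).
    It reduces to h(x) when y = 1 and to 1 - h(x) when y = 0."""
    h = sigmoid(theta @ x)
    return h**y * (1.0 - h)**(1 - y)

def likelihood(theta, X, Y):
    """L(theta): product of p(y_i | x_i; theta) over the m training examples."""
    return np.prod([prob_y_given_x(theta, x, y) for x, y in zip(X, Y)])
```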
These should be X^i's and Y^i's. Okay? So, as before, let's say we want to find the maximum likelihood estimate of the parameters theta. So we want to find the setting of the parameters theta that maximizes the likelihood L of theta. It turns out that very often, when you work through the derivations, it is much easier to maximize the log of the likelihood rather than the likelihood itself. So the log likelihood, lowercase l of theta, is just the log of capital L, and this will therefore be the sum of this. Okay? And so, to fit the parameters theta of our model, we'll find the value of theta that maximizes this log likelihood. Yeah? [Student:] [Inaudible] [Instructor:] Say that again. [Student:] Y^i is [inaudible]. [Instructor:] Oh, yes, thanks. So to maximize this function - well, it turns out we can actually apply the same gradient descent algorithm that we learned before. You remember, when we talked about least squares, the first algorithm we used to minimize the quadratic error function was gradient descent. So we can actually use exactly the same algorithm to maximize the log likelihood. And you remember, that algorithm was just: repeatedly take the value of theta and replace it with the previous value of theta plus a learning rate alpha times the gradient of the objective - here, the log likelihood with respect to theta. Okay? One small change: previously we were trying to minimize the quadratic error term, whereas today we're trying to maximize rather than minimize. So rather than having a minus sign we have a plus sign. So this is just gradient ascent - the same update, but for maximization rather than minimization. So we actually call this gradient ascent, and it's really the same algorithm. So to figure out what this gradient is - in order to derive gradient ascent, what you need to do is compute the partial derivatives of your objective function with respect to each of your parameters theta j, right?
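Written out, as a sketch of what goes on the board, the log likelihood under the compact Bernoulli form above is:

```latex
\ell(\theta) \;=\; \log L(\theta)
  \;=\; \sum_{i=1}^{m} \Bigl[ \, y^{(i)} \log h_\theta\bigl(x^{(i)}\bigr)
        \;+\; \bigl(1 - y^{(i)}\bigr) \log\bigl(1 - h_\theta\bigl(x^{(i)}\bigr)\bigr) \Bigr]
```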
It turns out that if you actually compute this partial derivative - so you take this formula, this L of theta - oh, I got that wrong too; if you take this lowercase l of theta, the log likelihood of theta, and you take its partial derivative with respect to theta j, you find that this is equal to - let's see. Okay? And, I don't know, the derivation isn't terribly complicated, but in the interest of saving you from watching me write down a couple of blackboards full of math, I'll just write down the final answer. The way you get this is you take those, plug in the definition of H subscript theta as a function of X^i, take derivatives, and work through the algebra, and it turns out it'll simplify down to this formula. Okay? And so what that gives you is that gradient ascent is the following rule: theta j gets updated as theta j plus alpha times this. Okay? Does this look familiar to anyone? Do you remember seeing this formula in the last lecture? Right. So when I worked out batch gradient descent for least squares regression, I actually wrote down exactly the same thing - or maybe there's a minus sign where this has a plus sign - but I had exactly the same learning rule last time for least squares regression, right? So is this the same learning algorithm then? What's different? How come I was making all that noise earlier about least squares regression being a bad idea for classification problems, and then I did a bunch of math, and I skipped some steps, but I'm, sort of, claiming at the end they're really the same learning algorithm? [Student:] [Inaudible] constants? [Instructor:] Say that again. [Student:] [Inaudible] [Instructor:] Oh, right. Okay, cool. No, exactly, right - it looks the same, but this is not the same, right? And the reason is that in logistic regression this is different from before, right? The definition of this H subscript theta of X^i is not the same as the definition I was using in the previous lecture. In particular, this is no longer theta transpose X^i; this is not a linear function anymore. This is a logistic function of theta transpose X^i. Okay?
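Putting the pieces together, here is a rough sketch of batch gradient ascent for logistic regression using the update just described; the data, step size, and iteration count are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradient_ascent(X, y, alpha=0.1, num_iters=2000):
    """Batch gradient ascent on the logistic-regression log likelihood.
    Each pass applies theta_j := theta_j + alpha * sum_i (y_i - h_theta(x_i)) * x_ij."""
    n = X.shape[1]
    theta = np.zeros(n)
    for _ in range(num_iters):
        h = sigmoid(X @ theta)        # hypothesis values for all m examples
        gradient = X.T @ (y - h)      # vector of partial derivatives of l(theta)
        theta += alpha * gradient     # plus sign: ascent, not descent
    return theta

# Hypothetical tiny data set; first column is the intercept term x0 = 1.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.5]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = logistic_gradient_ascent(X, y)
print(sigmoid(X @ theta))             # estimated probabilities that y = 1
```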
So even though this looks cosmetically similar - even though this is similar on the surface to the batch gradient descent rule I derived last time for least squares regression - this is actually a totally different learning algorithm. Okay? And it turns out that it's actually no coincidence that you ended up with the same learning rule. We'll talk a bit more about this later when we talk about generalized linear models; this is one of the most elegant things about generalized linear models that we'll see later - that even though you're using a different model, you end up with what looks like the same learning algorithm, and it's actually no coincidence. Cool. One last comment, as part of, sort of, the learning process: over here I said I take the derivatives and I ended up with this line. I didn't want to make you sit through a long algebraic derivation, but later today or later this week, please do go home and look at our lecture notes, where I wrote out the entirety of this derivation in full, and make sure you can follow every single step of how we take partial derivatives of this log likelihood to get this formula over here. Okay? By the way, for those who are interested in seriously mastering the machine learning material: when you go home and look at the lecture notes, it will actually be very easy for most of you to read through every line and go, yep, that makes sense, that makes sense, that makes sense, and, sort of, say, cool, I see how you get this line. But you want to make sure you really understand the material. My concrete suggestion to you would be to go home, read through the lecture notes, check every line, and then cover up the derivation and see if you can derive it yourself, right? In general, that's usually good advice for studying technical material like machine learning: if you work through a proof and you think you understood every line, the way to make sure you really understood it is to cover it up and see if you can rederive the entire thing yourself. I did this a lot when I was trying to study various pieces of machine learning theory and various proofs, and it's actually a great way to study: cover up the derivations and see if you can do them yourself without looking at the original derivation. All right.
I probably won't get to Newton's Method today. I just want to take one quick digression to talk about one more algorithm, which the discussion was sort of alluding to earlier, which is the perceptron algorithm. So I'm not going to say a whole lot about the perceptron algorithm, but this is something that we'll come back to later - later this quarter, when we talk about learning theory. So in logistic regression we said that G of Z gives, sort of, hypothesis output values that are real numbers between zero and one. The question is, what if you want to force the output to be either zero or one? So the perceptron algorithm defines G of Z to be this. So the picture - or the cartoon - is that, rather than the sigmoid function, G of Z now looks like the step function that you were asking about earlier. As before, we use H subscript theta of X equals G of theta transpose X. Okay? So everything is exactly the same as before, except that G of Z is now the step function. It turns out there's a learning rule, called the perceptron learning rule, that looks just the same as the classic gradient ascent rule for logistic regression. And the learning rule is given by this. Okay? So it looks just like the classic gradient ascent rule for logistic regression. But this is a very different flavor of algorithm than least squares regression and logistic regression, and, in particular, because it outputs only values that are either zero or one, it turns out it's very difficult to endow this algorithm with probabilistic semantics. And, again - oh, excuse me, right there, okay - even though this learning rule looks cosmetically very similar to what we have in logistic regression, this is actually a very different type of learning rule than the others we've seen in this class. But this is such a simple learning algorithm, right? It just computes theta transpose X, you threshold, and your output is zero or one. So this is a simpler algorithm than logistic regression, I think.
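Here is a minimal sketch of the perceptron's threshold and its update on a single example, assuming 0/1 labels and the same update form as above; the example values are hypothetical.

```python
import numpy as np

def step(z):
    """Hard threshold used by the perceptron: g(z) = 1 if z >= 0, else 0."""
    return np.where(z >= 0, 1, 0)

def perceptron_update(theta, x, y, alpha=0.1):
    """One perceptron update on a single example:
    theta_j := theta_j + alpha * (y - h_theta(x)) * x_j,
    where h_theta(x) = step(theta^T x). Same shape as the logistic-regression
    rule, but with the hard threshold in place of the sigmoid."""
    h = step(theta @ x)
    return theta + alpha * (y - h) * x

# Hypothetical single example with intercept term x0 = 1; labels are 0 or 1.
theta = np.zeros(3)
x = np.array([1.0, 2.0, -1.0])
y = 0                                  # currently misclassified: step(0) = 1
theta = perceptron_update(theta, x, y)
print(theta)                           # [-0.1, -0.2, 0.1]: moves away from x
```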
When we talk about learning theory later in this class, the simplicity of this algorithm will let us come back and use it as a building block. Okay? But that's all I want to say about this algorithm for now.