This presentation is delivered by the Stanford Center for Professional Development.

Okay. Good morning and welcome back to the third lecture of this class. So here's what I want to do today, and some of the topics I cover today may seem like I'm jumping from topic to topic, but here's the outline for today and the logical flow of ideas. In the last lecture we talked about linear regression, and today I want to talk about an adaptation of that called locally weighted regression. It's a very powerful algorithm that's actually probably one of my former mentor's favorite machine learning algorithms. We'll then talk about a probabilistic interpretation of linear regression and use that to move on to our first classification algorithm, which is logistic regression; take a brief digression to tell you about something called the perceptron algorithm, which is something we'll come back to again later this quarter; and, time allowing, I hope to get to Newton's method, which is an algorithm for fitting logistic regression models.

So let's just recap what we were talking about in the previous lecture. Remember, the notation I defined was that I used x superscript i, y superscript i to denote the ith training example. And when we're talking about linear regression, or ordinary least squares, we use h of x^(i) to denote the predicted value output by my hypothesis h on the input x^(i). And my hypothesis was parameterized by the vector of parameters theta, and so we said that this was equal to the sum over j of theta_j times x_j^(i), written more simply as theta transpose x. And we had the convention that x subscript zero is equal to one, so this accounts for the intercept term in our linear regression model. And lowercase n here was the notation I was using for the number of features in my training set. Okay? So in the example of trying to predict housing prices, we had two features, the size of the house and the number of bedrooms. We had two features, and so little n was equal to two.
So, just to finish recapping the previous lecture, we defined this quadratic cost function J of theta equals one-half times the sum from i equals one to m of h_theta of x^(i) minus y^(i), squared, where this is the sum over the m training examples in my training set. So lowercase m was the notation I've been using to denote the number of training examples I have, the size of my training set. And at the end of the last lecture we derived, in closed form, the value of theta that minimizes this, which was X transpose X, inverse, times X transpose Y. Okay?

So as we move on in today's lecture, I'll continue to use this notation and, again, I realize this is a fair amount of notation to remember, so if partway through this lecture you're having trouble remembering what lowercase m is or what lowercase n is or something, please raise your hand and ask.

When we talked about linear regression last time we used two features. One of the features was the size of the house in square feet, so the living area of the house, and the other feature was the number of bedrooms in the house. In general, when you apply a machine learning algorithm to some problem that you care about, the choice of the features will very much be up to you, right? And the way you choose the features to give the learning algorithm will often have a large impact on how well it actually does. So, just for example, the choice we made last time was x1 equals the size, and let's leave aside the feature of the number of bedrooms for now; let's say we don't have data that tells us how many bedrooms are in these houses. One thing you could do is actually define - oh, let's draw this out. So say that axis is the size of the house and that axis is the price of the house. So if you use this as a feature, maybe you get theta zero plus theta one x1, this, sort of, linear model. If you choose - let me just copy the same data set over, right? You can define a set of features where x1 is equal to the size of the house and x2 is the square of the size of the house. Okay?
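As a concrete sketch of both the closed-form fit recapped above and the squared-size feature just defined, here is a minimal NumPy example. The house sizes and prices are made up purely for illustration and are not data from the lecture.

```python
import numpy as np

# Made-up training set: living area in thousands of square feet, price in $1000s.
sizes  = np.array([1.416, 1.600, 2.104, 2.400, 3.000])
prices = np.array([232.0, 330.0, 400.0, 369.0, 540.0])

# Features: x0 = 1 (intercept convention), x1 = size, x2 = size squared.
X = np.column_stack([np.ones_like(sizes), sizes, sizes ** 2])
y = prices

# Closed-form least squares: theta = (X^T X)^{-1} X^T y.
# np.linalg.solve is used rather than forming the inverse explicitly.
theta = np.linalg.solve(X.T @ X, X.T @ y)

# Predict h_theta(x) = theta^T x for a hypothetical 2,500 sq ft house.
x_query = np.array([1.0, 2.5, 2.5 ** 2])
print("theta:", theta, "predicted price:", x_query @ theta)
```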
So x1 is the size of the house in, say, square footage, and x2 is - just take whatever the square footage of the house is and square that number, and this would be another way to come up with a feature. And if you do that, then the same algorithm will end up fitting a quadratic function for you: theta zero plus theta one x plus theta two x squared. Okay? Because this x squared is actually x2. And depending on what the data looks like, maybe this is a slightly better fit to the data.

You can actually take this even further, right? Which is - let's see, I have seven training examples here, so you can actually fit up to a sixth-order polynomial. You can actually fit a model theta zero plus theta one x plus theta two x squared, plus up to theta six x to the power of six - a sixth-order polynomial - to these seven data points. And if you do that, you find that you come up with a model that fits your data exactly. This is where, I guess, in this example I drew, we have seven data points, so if you fit a sixth-order polynomial you can, sort of, fit a curve that passes through these seven points perfectly, and you'll probably find that the curve you get will look something like that. And on the one hand, this is a great model in the sense that it fits your training data perfectly. On the other hand, this is probably not a very good model in the sense that none of us seriously think that this is a very good predictor of housing prices as a function of the size of the house, right? So we'll actually come back to this later. It turns out, of the models we have here, I feel like maybe the quadratic model fits the data best. Whereas the linear model - it looks like there's actually a bit of a quadratic component in this data that the linear function is not capturing. So we'll actually come back to this a little bit later and talk about the problems associated with fitting models that are either too simple, that use too small a set of features, or models that are too complex and maybe use too large a set of features.
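Here is a small sketch, again with made-up numbers, of the point just made: with seven training points, a sixth-order polynomial drives the training error to zero by passing through every point, while lower-order fits do not. The helper fit_poly and the data are hypothetical, chosen only to illustrate the behavior described.

```python
import numpy as np

# Seven made-up (size, price) points, standing in for the seven houses drawn on the board.
x = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
y = np.array([1.2, 1.9, 2.1, 2.6, 3.4, 3.1, 4.2])

def fit_poly(x, y, degree):
    """Least-squares fit of theta for features [1, x, x^2, ..., x^degree]."""
    X = np.vander(x, degree + 1, increasing=True)   # columns are x^0 .. x^degree
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X, theta

for degree in (1, 2, 6):
    X, theta = fit_poly(x, y, degree)
    print(f"degree {degree}: max training error = {np.max(np.abs(X @ theta - y)):.4f}")
# With 7 points, the degree-6 fit makes the training error (numerically) zero:
# it interpolates every point, which is the overfitting behavior described above.
```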
Just to give these a name, we call this first one the problem of underfitting and, very informally, this refers to a setting where there are obvious patterns in the data that the algorithm is just failing to fit. And this problem here we refer to as overfitting and, again, very informally, this is when the algorithm is fitting the idiosyncrasies of this specific data set. It just so happens that, of the seven houses we sampled in Portland, or wherever you collected data from, that house happens to be a bit more expensive, that house happened to be a bit less expensive, and by fitting a sixth-order polynomial we're, sort of, fitting the idiosyncratic properties of this data set, rather than the true underlying trend of how housing prices vary as a function of the size of the house. Okay? So these are two very different problems. We'll define them more formally later and talk about how to address each of these problems, but for now I hope you appreciate that there is this issue of selecting features.

So if you want to choose features for a learning problem like this, there are a few ways to do so. We'll talk about feature selection algorithms later this quarter as well - so, automatic algorithms for choosing what features you use in a regression problem like this. What I want to do today is talk about a class of algorithms called non-parametric learning algorithms that will help to alleviate the need somewhat for you to choose features very carefully. Okay? And this leads us into our discussion of locally weighted regression.

And just to define the term: linear regression, as we've defined it so far, is an example of a parametric learning algorithm. A parametric learning algorithm is one that's defined as an algorithm that has a fixed number of parameters that are fit to the data. Okay? So in linear regression we have a fixed set of parameters theta, right, that must be fit to the data. In contrast, what I'm going to talk about now is our first non-parametric learning algorithm.
The formal definition is not very intuitive, so I'll also give a second, say, more intuitive one. The, sort of, formal definition of a non-parametric learning algorithm is that it's an algorithm where the number of parameters grows with m, with the size of the training set. And usually it's defined as: the number of parameters grows linearly with the size of the training set. That's the formal definition. A slightly less formal definition is that the amount of stuff that your learning algorithm needs to keep around will grow linearly with the training set or, another way of saying it, this is an algorithm where we'll need to keep around the entire training set, even after learning. Okay? So don't worry too much about this definition. But what I want to do now is describe a specific non-parametric learning algorithm called locally weighted regression. Which also goes by a couple of other names - which also goes by the name of LOESS, for historical reasons. LOESS is usually spelled L-O-E-S-S, sometimes spelled lowess, too. I just call it locally weighted regression.

So here's the idea. This will be an algorithm that allows us to worry a little bit less about having to choose features very carefully. So, for my motivating example, let's say that I have a training set that looks like this, okay? So this is x and that's y. If you run linear regression on this and you fit maybe a linear function to this, you end up with a more or less flat, straight line, which is not a very good fit to this data. You can sit around and stare at this and try to decide whether the features are right. So maybe you want to toss in a quadratic function, but this isn't really quadratic either. So maybe you want to model this as x plus x squared plus maybe some function of sine of x or something. You can actually sit around and fiddle with features, and after a while you can probably come up with a set of features that models the data okay, but let's talk about an algorithm that you can use without needing to do that.
So - well, suppose you want to evaluate your hypothesis h at a certain query point, lowercase x. Okay? And let's say you want to know what's the predicted value of y at this position of x, right? So for linear regression, what we were doing was: we would fit theta to minimize the sum over i of y^(i) minus theta transpose x^(i), squared, and return theta transpose x. Okay? So that was linear regression.

In contrast, in locally weighted linear regression you're going to do things slightly differently. You're going to look at this point x, and then I'm going to look in my data set and take into account only the data points that are, sort of, in the little vicinity of x. Okay? So I'll look at where I want to evaluate my hypothesis. I'm going to look only in the vicinity of this point where I want to evaluate my hypothesis, and then I'm going to take, let's say, just these few points, and I will apply linear regression to fit a straight line just to this subset of the data. Okay? I'm using this term subset - well, let's come back to that later. So we take this subset of the data and I fit a straight line to it, and maybe I get a straight line like that. And what I'll do is then evaluate this particular straight line at the query point, and that will be the value my algorithm returns. This would be the predicted value - this would be the value that my hypothesis outputs in locally weighted regression. Okay?

So let me go ahead and formalize that. In locally weighted regression, we're going to fit theta to minimize the sum over i of w^(i) times y^(i) minus theta transpose x^(i), squared, where these terms w superscript i are called weights. There are many possible choices for the weights; I'm just going to write one down: w^(i) equals e to the minus, x^(i) minus x, squared, over two. So let's look at what these weights really are, right? So notice that - suppose you have a training example x^(i)
such that x^(i) is very close to x, so this term is small, right? Then if x^(i) minus x is small, so if x^(i) minus x is close to zero, then this is e to the minus zero, and e to the zero is one. So if x^(i) is close to x, then w^(i) will be close to one. In other words, the weight associated with the ith training example will be close to one if x^(i) and x are close to each other. Conversely, if x^(i) minus x is large, then - I don't know, what would w^(i) be?

Student: Zero.

Instructor: Zero, right. Close to zero. Right. So if x^(i) is very far from x, then this is e to the minus of some large number, and e to the minus of some large number will be close to zero. Okay?

So the picture is: if I'm querying at a certain point x, shown on the x-axis, and if my data set, say, looks like that, then I'm going to give the points close to this a large weight and give the points far away a small weight. So for the points that are far away, w^(i) will be close to zero, and so the points that are far away will not contribute much at all to this summation, right? So you can think of this as a sum over i of one times this quadratic term for nearby points, plus zero times this quadratic term for faraway points. And so the effect of using this weighting is that locally weighted linear regression fits a set of parameters theta paying much more attention to fitting the points close by accurately, while ignoring the contribution from faraway points. Okay? Yeah?

Student: Why is the weight exponential [inaudible]?

Instructor: Yeah, let's see. So it turns out there are many other weighting functions you can use. It turns out that there are definitely different communities of researchers that tend to choose different choices by default. There is somewhat of a literature on debating exactly what function to use. This, sort of, exponential decay function happens to be a reasonably common one that seems to be a reasonable choice on many problems, but you can actually plug in other functions as well. Did I mention what [inaudible] is yet?
For those of you that are familiar with the normal distribution, or the Gaussian distribution - this formula I've written out here cosmetically looks a bit like a Gaussian distribution. Okay? But this actually has absolutely nothing to do with the Gaussian distribution. So this is not saying that x^(i) is Gaussian or whatever; there is no such interpretation. This is just a convenient function that happens to be a bell-shaped function, but don't endow this with any Gaussian semantics. Okay?

So, in fact - well, if you remember the familiar bell-shaped Gaussian - again, the way the weights are associated with these points is that if you imagine putting a bell-shaped bump centered around the position where you want to evaluate your hypothesis h, then this point here gets a weight that's proportional to the height of the Gaussian - excuse me, to the height of the bell-shaped function - evaluated at that point. And the weight given to this point, to this training example, will be proportional to that height, and so on. Okay? And so training examples that are really far away get a very small weight.

One last small generalization to this is that normally there's one other parameter to this algorithm, which I'll denote as tau. Again, this looks suspiciously like the variance of a Gaussian, but this is not a Gaussian; this is just a convenient form of function. This parameter tau is called the bandwidth parameter and, informally, it controls how fast the weights fall off with distance. Okay? So let me just copy my diagram from the other side, I guess. So if tau is very small - if that's a query point x - then you end up choosing a fairly narrow Gaussian - excuse me, a fairly narrow bell shape - so that the weights of the points far away fall off rapidly. Whereas if tau is large, then you end up choosing a weighting function that falls off relatively slowly with distance from your query. Okay?
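As a sketch of the procedure described so far, here is one way locally weighted linear regression could be implemented for a single input feature, assuming the bell-shaped weights with bandwidth tau and a weighted least-squares solve. The function name, the made-up data, and the choice tau = 0.8 are illustrative assumptions, not the lecture's own code.

```python
import numpy as np

def lwr_predict(x_query, X, y, tau):
    """Locally weighted linear regression prediction at a single query point.

    X : (m, n+1) design matrix whose first column is the intercept x0 = 1.
    y : (m,) targets.
    tau : bandwidth; smaller tau means the weights fall off faster with distance.
    """
    # Bell-shaped weights w^(i) = exp(-(x^(i) - x)^2 / (2 tau^2)),
    # computed on the non-intercept feature(s).
    dists = np.sum((X[:, 1:] - x_query[1:]) ** 2, axis=1)
    w = np.exp(-dists / (2.0 * tau ** 2))

    # Weighted least squares: theta = (X^T W X)^{-1} X^T W y.
    W = np.diag(w)
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return x_query @ theta

# Made-up, wiggly one-dimensional data set for illustration.
x = np.linspace(0, 10, 50)
y = np.sin(x) + 0.5 * x + 0.1 * np.random.randn(50)
X = np.column_stack([np.ones_like(x), x])

x_q = np.array([1.0, 4.0])                 # query point, with the intercept term
print(lwr_predict(x_q, X, y, tau=0.8))     # a new fit is run for each query point
```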
So I hope you can, therefore, see that if you apply locally weighted linear regression to a data set that looks like this, then to ask what your hypothesis outputs at a point like this, you end up having a straight line making that prediction. To ask what this [inaudible] at that value, you put a straight line there and you predict that value. It turns out that every time you try to evaluate your hypothesis - every time you ask your learning algorithm to make a prediction for how much a new house costs, or whatever - you need to run a new fitting procedure and then evaluate the line that you fit just at the position of the value of x, the position of the query where you're trying to make a prediction. Okay? But if you do this for every point along the x-axis, then you find that locally weighted regression is able to trace out this, sort of, very non-linear curve for a data set like this. Okay?

So in the problem set we're actually going to let you play around more with this algorithm, so I won't say too much more about it here. But to finally move on to the next topic, let me check what questions you have. Yeah?

Student: It seems like you still have the same problem of overfitting and underfitting, like when you choose your tau, like if you make it too small in your -

Instructor: Yes, absolutely. Yes. So locally weighted regression can run into - locally weighted regression is not a panacea for the problem of overfitting or underfitting. You can still run into the same problems with locally weighted regression. What you just said about - some of these things I'll leave you to discover for yourself in the homework problem; you'll actually see what you just mentioned. Yeah?

Student: It almost seems like you're not even thoroughly [inaudible] with this locally weighted, you had all the data that you originally had anyway. I'm just trying to think of [inaudible] the original data points.

Instructor: Right. So the question is, sort of - it's almost as if you're not building a model, because you need the entire data set.
And the other way of saying that is that this is a non-parametric learning algorithm. So this - I don't know, I won't debate whether, you know, we're really building a model or not, but this is perfectly fine. So, I think, when you write code implementing locally weighted linear regression on a data set, I think of that code as a whole as building your model. So it actually works - we've actually used this quite successfully to model, sort of, the dynamics of this autonomous helicopter we have. Yeah?

Student: I wanted to ask, does this algorithm learn the weights based on the data?

Instructor: Learn what weights?

Student: Oh, the weights w^(i), instead of using [inaudible].

Instructor: I see, yes. So it turns out there are a few things you can do. One thing that is quite common is choosing this bandwidth parameter tau, right, using the data. We'll actually talk about that a bit later when we talk about model selection. Yes? One last question.

Student: I used [inaudible] Gaussian sometimes if you [inaudible] Gaussian and then -

Instructor: Oh, I guess. Let's see. Boy. The weights are not random variables and, for the purpose of this algorithm, it is not useful to endow them with probabilistic semantics. So you could choose to define things as Gaussian, but it, sort of, doesn't lead anywhere. In fact, it turns out that I happened to choose this, sort of, bell-shaped function to define my weights, but it's actually fine to choose a function that doesn't even integrate to one, that integrates to infinity, say, as your weighting function. So in that sense, I mean, you could force in the definition of a Gaussian, but it's, sort of, not useful, especially since you can use other functions that integrate to infinity and don't integrate to one. Okay? That's the last question, and let's move on.

Student: Assume that we have a very huge [inaudible], for example, a very huge set of houses, and we want to predict the price for each house, and so for each input - you'd be redoing this constantly for -

Instructor: Yes, you're right. So because locally weighted regression is a non-parametric algorithm, every time you make a prediction you need to fit theta to your entire training set again.
So you're actually right. If you have a very large training set, then this is a somewhat expensive algorithm to use, because every time you want to make a prediction you need to fit a straight line to a huge data set again. It turns out there are ways to make this much more efficient for large data sets as well. I don't want to talk about that today; if you're interested, look up the work of Andrew Moore on KD-trees. He, sort of, figured out ways to fit these models much more efficiently. That's not something I want to go into today. Okay? Let me move on; let's take more questions later.

So, okay. So that's locally weighted regression. Remember the outline I had, I guess, at the beginning of this lecture. What I want to do now is talk about a probabilistic interpretation of linear regression, all right? And, in particular, it'll be this probabilistic interpretation that lets us move on to talk about logistic regression, which will be our first classification algorithm.

So let's put aside locally weighted regression for now; we'll just talk about ordinary, unweighted linear regression. Let's ask the question of why least squares, right? Of all the things we could optimize, how do we come up with this criterion of minimizing the square of the error between the predictions of the hypothesis and the values y? So why not minimize the absolute value of the errors, or the errors to the power of four, or something? What I'm going to do now is present one set of assumptions that will serve to "justify" why we're minimizing the sum of squared errors. Okay? It turns out that there are many sets of assumptions that are sufficient to justify why we do least squares, and this is just one of them. So keep in mind that I'm presenting just one set of assumptions under which least squares regression makes sense, but this is not the only set of assumptions; so even if the assumptions I describe don't hold, least squares actually still makes sense in many circumstances. But this will, sort of, help give one rationalization, like, one reason, for doing least squares regression.
And, in particular, what I'm going to do is endow the least squares model with probabilistic semantics. So let's assume, in our example of predicting housing prices, that the price the house is sold for is some linear function of the features, plus some term epsilon^(i). Okay? And epsilon^(i) will be our error term. You can think of the error term as capturing unmodeled effects - like maybe there are some other features of a house, like how many fireplaces it has or whether there's a garden or whatever, additional features that we just failed to capture - or you can think of epsilon as random noise. Epsilon is our error term that captures both these unmodeled effects, just things we forgot to model, maybe the function isn't quite linear or something, as well as random noise, like maybe that day the seller was in a really bad mood and so just refused to sell for a reasonable price or something.

And now I will assume that the errors have a probability distribution. I'll assume that the errors epsilon^(i) are distributed - I'll write this as: epsilon^(i) is distributed according to a Gaussian distribution with mean zero and variance sigma squared. Okay? So let me just note, the script N here stands for normal, right? To denote a normal distribution, also known as the Gaussian distribution, with mean zero and variance sigma squared.

Actually, just quickly raise your hand if you've seen a Gaussian distribution before. Okay, cool, most of you. Great, almost everyone. So, in other words, the density for the Gaussian is what you've seen before: the density for epsilon^(i) would be one over root two pi sigma, times e to the negative epsilon^(i) squared over two sigma squared, right? And the density of our epsilon^(i) will be this bell-shaped curve with one standard deviation being, sort of, sigma. Okay?
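For reference, the model and the density just described read, in standard notation (a reconstruction of what is presumably on the board, not a verbatim copy):

```latex
y^{(i)} = \theta^{T} x^{(i)} + \epsilon^{(i)}, \qquad
\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^{2}), \qquad
p\!\left(\epsilon^{(i)}\right) = \frac{1}{\sqrt{2\pi}\,\sigma}
  \exp\!\left(-\frac{\left(\epsilon^{(i)}\right)^{2}}{2\sigma^{2}}\right)
```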
This is the form of that bell-shaped curve. So, let's see, I can erase that. Can I erase the board?

So this implies that the probability distribution of the price of a house, given x^(i) and the parameters theta, is going to be Gaussian with that density. Okay? In other words, what this is saying is that the price of a house, given the features of the house and my parameters theta, is going to be a random variable that's distributed Gaussian with mean theta transpose x^(i) and variance sigma squared. Right? Because we imagine that the way the housing prices are generated is that the price of a house is equal to theta transpose x^(i) plus some random Gaussian noise with variance sigma squared. So the price of a house is going to have mean theta transpose x^(i) and variance sigma squared, right? Does this make sense? Raise your hand if this makes sense. Yeah, okay, lots of you.

One point of notation - oh, yes?

Student: Assuming we don't know anything about the error, why do you assume here that the error is Gaussian?

Instructor: Right. So, boy, why do I assume the error is Gaussian? Two reasons, right? One is that it turns out to be mathematically convenient to do so, and the other is, I don't know, I can also mumble about justifications, such as appealing to the central limit theorem. It turns out that, for the vast majority of problems, if you apply a linear regression model like this and try to measure the distribution of the errors - not all the time, but very often - you find that the errors really are Gaussian, that this Gaussian model is a good assumption for the error in regression problems like these. Some of you may have heard of the central limit theorem, which says that the sum of many independent random variables will tend towards a Gaussian.
So if the error is caused by many effects, like the mood of the seller, the mood of the buyer, some other features that we missed, whether the place has a garden or not, and if all of these effects are independent, then by the central limit theorem you might be inclined to believe that the sum of all these effects will be approximately Gaussian. In practice, I guess, the two real answers are that, one, in practice this is actually a reasonably accurate assumption, and, two, it turns out to be mathematically convenient to do so. Okay? Yeah?

Student: It seems like we're saying that if we assume the error around our model has zero mean, then the data is centered around our model, which seems almost like we're trying to assume what we're trying to prove.

Instructor: That's the [inaudible] but, yes, you are assuming that the error has zero mean. Which is, yeah, right. I think later this quarter we'll get to some of the other things, but for now just think of this as a mathematically - it's actually not an unreasonable assumption. I guess, in machine learning, almost none of the assumptions we make are ever true in the absolute sense, right? Because, for instance, housing prices are priced in dollars and cents, so the errors in prices are not continuous-valued random variables, because houses can only be priced at a certain number of dollars and a certain number of cents, and you never have fractions of cents in housing prices, whereas a Gaussian random variable would allow that. So in that sense, the assumptions we make are never "absolutely true," but for practical purposes this is an accurate enough assumption that it'll be useful to make. Okay? I think in a week or two we'll actually come back and say a bit more about the assumptions we make and when they help our learning algorithms and when they hurt our learning algorithms. We'll say a bit more about it when we talk about generative and discriminative learning algorithms, like, in a week or two. Okay?

So let's point out one bit of notation, which is that when I wrote this down I actually wrote P of y^(i) given x^(i) and then semicolon theta, and I'm going to use this notation when we are not thinking of theta as a random variable.
So in statistics this is sometimes called the frequentist point of view, where you think of there as being some, sort of, true value of theta that's out there generating the data, but we don't know what theta is, and theta is not a random variable, right? So it's not like there's some random value of theta out there; it's that there's some true value of theta out there, it's just that we don't know what the true value of theta is. So if theta is not a random variable, then I'm going to avoid writing P of y^(i) given x^(i) comma theta, because this would mean the probability of y^(i) conditioned on x and theta, and you can only condition on random variables. So in this part of the class, where we're taking, sort of, the frequentist viewpoint rather than the Bayesian viewpoint - in this part of the class we're thinking of theta not as a random variable, but just as something we're trying to estimate - we use the semicolon notation. So the way to read this is: this is the probability of y^(i) given x^(i) and parameterized by theta. Okay? So you read the semicolon as "parameterized by." And in the same way here, I'll say y^(i), given x^(i) and parameterized by theta, is distributed Gaussian with that density. All right.

So we're going to make one more assumption. Let's assume that the error terms are IID, okay? Which stands for independently and identically distributed. So I'm going to assume that the error terms are independent of each other, right? The identically distributed part just means that they all come from the same Gaussian distribution with the same variance, but the more important part is that I'm assuming the epsilon^(i)'s are independent of each other.

Now, let's talk about how to fit a model. The probability of Y given X, parameterized by theta - I'm actually going to give this another name. I'm going to write this down and we'll call this the likelihood of theta: the likelihood of theta is the probability of Y given X, parameterized by theta. And so this is going to be the product over my training set, like that.
Which is, in turn, going to be a product of those Gaussian densities that I wrote down just now, right? Okay?

In terms of notation, I guess, I define this term here to be the likelihood of theta. And the likelihood of theta is just the probability of the data Y, right, given X and parameterized by theta. The terms likelihood and probability are often confused. So the likelihood of theta is the same thing as the probability of the data you saw - so likelihood and probability are, sort of, the same thing - except that when I use the term likelihood, I'm trying to emphasize that I'm taking this thing and viewing it as a function of theta. Okay? So likelihood and probability are really the same thing, except that when I want to view this thing as a function of theta, holding X and Y fixed, it's then called the likelihood. Okay? So hopefully you'll hear me say "the likelihood of the parameters" and "the probability of the data," right, rather than "the likelihood of the data" or "the probability of the parameters." So try to be consistent in that terminology.

So, given that the probability of the data is this, and this is also the likelihood of the parameters, how do you estimate the parameters theta? So, given a training set, what parameters theta do you want to choose for your model? Well, the principle of maximum likelihood estimation says, right, that you choose the value of theta that makes the data as probable as possible. So: choose theta to maximize the likelihood. Or, in other words, choose the parameters that make the data as probable as possible, right? So this is maximum likelihood estimation: choose the parameters that make it as likely, as probable, as possible for me to have seen the data I just did.

So, for mathematical convenience, let me define lowercase l of theta. This is called the log likelihood function, and it's just the log of capital L of theta. So this is the log of the product over i of one over root two pi sigma, times e to that exponent; I won't bother to write out what's in the exponent for now, it's just the same expression from the previous board.
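For reference, the board expressions being referred to are presumably the following, written out in standard notation and using the independence assumption across training examples (a reconstruction, not a verbatim copy of the board):

```latex
L(\theta) = p(\vec{y} \mid X; \theta)
          = \prod_{i=1}^{m} p\!\left(y^{(i)} \mid x^{(i)}; \theta\right)
          = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma}
            \exp\!\left(-\frac{\left(y^{(i)} - \theta^{T} x^{(i)}\right)^{2}}{2\sigma^{2}}\right),
```

```latex
\ell(\theta) = \log L(\theta)
             = m \log\frac{1}{\sqrt{2\pi}\,\sigma}
               - \frac{1}{\sigma^{2}} \cdot \frac{1}{2}
                 \sum_{i=1}^{m} \left(y^{(i)} - \theta^{T} x^{(i)}\right)^{2}.
```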
The log of a product is the same as the sum of the logs, right? So it's the sum of the logs of these terms, which simplifies to m times the log of one over root two pi sigma, plus - and then the log and the exponentiation cancel each other, right? So the log of e to something is just whatever is inside the exponent. So, you know what, let me write this on the next board. Okay.

So maximizing the likelihood, or maximizing the log likelihood, is the same as minimizing that term over there. Well, you get it, right? Because there's a minus sign, maximizing this is the same as minimizing this as a function of theta. And this is, of course, just the same quadratic cost function that we had last time, J of theta, right? So what we've just shown is that the ordinary least squares algorithm, that we worked on in the previous lecture, is just maximum likelihood, assuming this probabilistic model, assuming IID Gaussian errors on our data. Okay?

One thing that we'll actually use in the next lecture: notice that the value of sigma squared doesn't matter, right? Somehow, no matter what the value of sigma squared is - I mean, sigma squared has to be a positive number, it's the variance of a Gaussian - so no matter what sigma squared is, since it's a positive number, the value of theta we end up with will be the same, right? Because minimizing this, you get the same value of theta no matter what sigma squared is. So it's as if, in this model, the value of sigma squared doesn't really matter. Just remember that for the next lecture; we'll come back to this again.

Any questions about this? Actually, let me clean up another couple of boards and then I'll see what questions you have. Okay. Any questions? Yeah?
Any questions about this? Actually, let me clean up another couple of boards and then I'll see what questions you have. Okay. Any questions? Yeah?

[Student] I think here you're trying to measure the likelihood of your hypothesis theta by a fraction of error, but I think what you measure depends on the family of theta too - for example, if you have a lot of parameters [inaudible] or overfitting?

[Instructor] Yeah, yeah. I mean, you're asking about overfitting, whether this is a good model. I think the things you're mentioning are maybe deeper questions about learning algorithms that we'll come back to later, so I don't really want to get into that right now. Any more questions? Okay.

So this endows linear regression with a probabilistic interpretation. I'm actually going to use this probabilistic interpretation in order to derive our next learning algorithm, which will be our first classification algorithm. Okay? So you'll recall that I said that regression problems are where the variable Y that you're trying to predict is continuous-valued. Now I'm actually going to talk about our first classification problem, where the value Y you're trying to predict is discrete-valued - it can take on only a small number of discrete values - and in this case I'll talk about binary classification, where Y takes on only two values, right? So you come up with classification problems if you're trying to do, say, a medical diagnosis, and you try to decide based on some features whether the patient has a disease or does not have a disease. Or, in the housing example, maybe you're trying to decide: will this house sell in the next six months or not? And the answer is either yes or no; it'll either be sold in the next six months or it won't be. Other standard examples: if you want to build a spam filter - is this e-mail spam or not? It's yes or no. Or, you know, some of my colleagues work on predicting whether a computer system will crash, so you have a learning algorithm to predict: will this computing cluster crash over the next 24 hours? And, again, it's a yes-or-no answer. So there's X, there's Y, and in a classification problem Y takes on two values, zero and one. That's it in binary classification. So what can you do?
Well, one thing you could do is take linear regression, as we've described it so far, and apply it to this problem, right? So, you know, given this data set, you can fit a straight line to it. Maybe you get that straight line, right? But this data set I've drawn is an amazingly easy classification problem. It's pretty obvious to all of us what the relationship between X and Y is: you just look at a value around here, and to the right Y is one, to the left Y is zero. So you apply linear regression to this data set, you get a reasonable fit, and you can then maybe take your linear regression hypothesis - this straight line - and threshold it at 0.5. If you do that, you'll certainly get the right answer: you predict that if X is to the right of, sort of, the midpoint here then Y is one, and if X is to the left of that midpoint then Y is zero.

So some people actually do this - apply linear regression to classification problems - and sometimes it'll work okay, but in general it's actually a pretty bad idea to apply linear regression to classification problems like these, and here's why. Let's say I change my training set by giving you just one more training example, all the way up there, right? Given this training set, it's actually still entirely obvious what the relationship between X and Y is: if X is greater than this value, Y is one, and if it's less, Y is zero. Giving you this additional training example really shouldn't change anything - I didn't really convey much new information; there's no surprise that this point corresponds to Y equals one. But if you now fit linear regression to this data set, you end up with a line that, I don't know, maybe looks like that, right? And now the predictions of your hypothesis have changed completely, if you threshold your hypothesis at 0.5. Okay?

[Student] In between there might be an interval where it's zero, right? For that far-off point?

[Instructor] Oh, you mean, like that? Right. Yeah, yeah, fine. Yeah, sure - a theta set like that. So, I guess - yes, you're right, but this is an example, and this example works.

[Student] [Inaudible] That would change it even more if you gave it all -

[Instructor] Yeah. Then I think this actually would make it even worse; you would actually get a line that pulls out even further, right? So this is my example - I get to make it whatever I want, right? But there's not a deep meaning to this. The point is just that it could be a really bad idea to apply linear regression to classification problems. Sometimes it works fine, but usually I wouldn't do it.
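As a quick numerical sketch of this point - my own illustration, not something from the lecture, using made-up one-dimensional data - the snippet below fits least squares to binary labels, thresholds the fitted line at 0.5, and shows how one additional far-off positive example shifts the decision point even though it carries no surprise:

    import numpy as np

    def lstsq_decision_point(x, y):
        """Fit y ~ theta0 + theta1 * x by least squares and return the x
        where the fitted line crosses 0.5 (the implied decision boundary)."""
        X = np.column_stack([np.ones_like(x), x])       # x0 = 1 intercept column
        theta, *_ = np.linalg.lstsq(X, y, rcond=None)   # closed-form least squares
        return (0.5 - theta[0]) / theta[1]

    # Hypothetical data: negatives on the left, positives on the right.
    x = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])
    y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
    print(lstsq_decision_point(x, y))                   # about 4.5, between the classes

    # Add one unsurprising but far-away positive example at x = 30.
    x2, y2 = np.append(x, 30.0), np.append(y, 1.0)
    print(lstsq_decision_point(x2, y2))                 # moves right, to roughly 5.8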
So a couple of problems with this. One is that - well, what do you want to do for classification? If you know the value of Y lies between zero and one, then to kind of fix this problem let's just start by changing the form of our hypothesis, so that my hypothesis always lies in the unit interval between zero and one. Okay? So if I know Y is either zero or one, then let's at least not have my hypothesis predict values much larger than one or much smaller than zero.

And so, instead of choosing a linear function for my hypothesis, I'm going to choose something slightly different. In particular, I'm going to choose this function: H subscript theta of X is going to equal G of theta transpose X, where G is going to be this function, and so this becomes one over one plus e to the negative theta transpose X. And G of Z is called the sigmoid function, and it is often also called the logistic function. It goes by either of these names. And what G of Z looks like is the following: on the horizontal axis I'm going to plot Z, and so G of Z will look like this. Okay? I didn't draw that very well. Okay. So G of Z tends towards zero as Z becomes very small, and G of Z will ascend towards one as Z becomes large, and it crosses the vertical axis at 0.5. So this is the sigmoid function, also called the logistic function.
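As a small sketch in code - my own notation, not anything written on the board - the hypothesis just described is:

    import numpy as np

    def sigmoid(z):
        """Sigmoid / logistic function g(z) = 1 / (1 + e^(-z));
        tends to 0 for very negative z, to 1 for large z, and g(0) = 0.5."""
        return 1.0 / (1.0 + np.exp(-z))

    def h(theta, x):
        """Logistic-regression hypothesis h_theta(x) = g(theta^T x),
        which always lies strictly between 0 and 1."""
        return sigmoid(np.dot(theta, x))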
Yeah? Question?

[Student] Why the sigmoid, rather than a step [inaudible]?

[Instructor] Say that again.

[Student] Why can't we choose a step function instead? Like, wouldn't that be better for binary?

[Instructor] Yeah, let me come back to that later. So it turns out - where did I get this function from, right? I just wrote down this function. It actually turns out that there are two reasons for using this function that we'll come to. One is that when we talk about generalized linear models, we'll see that this falls out naturally as part of that broader class of models. And another reason, which we'll talk about next week - it turns out there are a couple of, I think, very beautiful reasons for why we choose logistic functions. We'll see that in a little bit. But for now let me just define it, and just take my word for it for now that this is a reasonable choice. Okay?

But notice now that the values output by my hypothesis will always be between zero and one. Furthermore, just like we did for linear regression, I'm going to endow the outputs of my hypothesis with a probabilistic interpretation, right? So I'm going to assume that the probability that Y is equal to one, given X and parameterized by theta, is equal to H subscript theta of X, all right? So in other words, I'm going to imagine that my hypothesis is outputting all these numbers that lie between zero and one, and I'm going to think of my hypothesis as trying to estimate the probability that Y is equal to one. Okay? And because Y has to be either zero or one, the probability of Y equals zero is going to be one minus that. All right?

So, more simply, it turns out you can actually take these two equations and write them more compactly. Write P of Y given X, parameterized by theta: this is going to be H subscript theta of X to the power of Y, times one minus H of X to the power of one minus Y. Okay? I know this looks somewhat bizarre, but it actually makes the derivation much nicer. If Y is equal to one, then this equation is H of X to the power of one, times something to the power of zero, and anything to the power of zero is just one, right?
So if Y equals one, then this is something to the power of zero, and so it's just one. So if Y equals one, this is just saying P of Y equals one is equal to H subscript theta of X. Okay? And in the same way, if Y is equal to zero, then this is P of Y equals zero equals this thing to the power of zero - so this disappears, it's just one - times this thing to the power of one. Okay? So this is a compact way of writing both of these equations together in one line.

So let's go on to parameter fitting, right? And, again, you can ask: given this model and my data, how do I fit the parameters theta of my model? So the likelihood of the parameters is, as before, just the probability of the data, right? Which is the product over i of P of Y^(i) given X^(i), parameterized by theta. Which is - just plugging those in. Okay? I dropped the theta subscript just so I can write a little bit less. Oh, excuse me - these should be X^(i)'s and Y^(i)'s. Okay?

So, as before, let's say we want to find the maximum likelihood estimate of the parameters theta. So we want to find the setting of the parameters theta that maximizes the likelihood L of theta. It turns out, very often - just when you work with the derivations - it is often much easier to maximize the log of the likelihood rather than maximize the likelihood itself. So the log likelihood, lowercase l of theta, is just the log of capital L. This will, therefore, be the sum of this. Okay? And so, to fit the parameters theta of our model, we'll find the value of theta that maximizes this log likelihood. Yeah?

[Student] [Inaudible]

[Instructor] Say that again.

[Student] Y^(i) is [inaudible].

[Instructor] Oh, yes. Thanks.
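Written out from the description above (a reconstruction of the board, using the compact form from a moment ago), the likelihood and log likelihood are:

\[
L(\theta) = \prod_{i=1}^{m} p\big(y^{(i)} \mid x^{(i)}; \theta\big)
= \prod_{i=1}^{m} h_\theta\big(x^{(i)}\big)^{\,y^{(i)}}\,\big(1 - h_\theta(x^{(i)})\big)^{\,1 - y^{(i)}},
\]
\[
\ell(\theta) = \log L(\theta)
= \sum_{i=1}^{m} \Big[\, y^{(i)} \log h_\theta\big(x^{(i)}\big) + \big(1 - y^{(i)}\big) \log\big(1 - h_\theta(x^{(i)})\big) \Big].
\]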
So, having maximized this function - well, it turns out we can actually apply the same gradient descent algorithm that we learned; that was the first algorithm we used to minimize the quadratic function. And you remember, when we talked about least squares, the first algorithm we used to minimize the quadratic error function was gradient descent. So we can actually use exactly the same algorithm to maximize the log likelihood. And you remember, that algorithm was just: repeatedly take the value of theta and replace it with the previous value of theta, plus a learning rate alpha times the gradient of the cost function - here, of the log likelihood with respect to theta. Okay? One small change is that previously we were trying to minimize the quadratic error term, and today we're trying to maximize rather than minimize, so rather than having a minus sign we have a plus sign. So this is just gradient ascent, for maximization rather than minimization. We actually call this gradient ascent, and it's really the same algorithm.

So to figure out what this gradient is - in order to derive the gradient ascent update, what you need to do is compute the partial derivatives of your objective function with respect to each of your parameters theta j, right? It turns out that if you actually compute this partial derivative - so you take this formula, this L of theta, which is - oh, got that wrong too - if you take this lowercase l of theta, if you take the log likelihood of theta, and you take its partial derivative with respect to theta j, you find that this is equal to - let's see. Okay? And, I don't know, the derivation isn't terribly complicated, but in the interest of saving you watching me write down a couple of blackboards full of math, I'll just write down the final answer. The way you get it is you just take this, plug in the definition of H subscript theta as a function of X^(i), take derivatives, and work through the algebra, and it turns out it'll simplify down to this formula. Okay?
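The final answer referred to here, written for the log likelihood summed over the whole training set, works out to:

\[
\frac{\partial}{\partial \theta_j}\,\ell(\theta)
= \sum_{i=1}^{m} \big( y^{(i)} - h_\theta(x^{(i)}) \big)\, x_j^{(i)}.
\]

(Dropping the sum gives the per-example version used in a stochastic update.)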
And so what that gives you is that gradient ascent is the following rule: theta j gets updated as theta j plus alpha times this. Okay?

Does this look familiar to anyone? Do you remember seeing this formula in the last lecture? Right. So when I worked out batch gradient descent for least squares regression, I actually wrote down exactly the same thing - or maybe there's a minus sign, and that's about it - but I actually had exactly the same learning rule last time for least squares regression, right? So is this the same learning algorithm, then? What's different? How come I was making all that noise earlier about least squares regression being a bad idea for classification problems, and then I did a bunch of math - and I skipped some steps - but I'm, sort of, claiming at the end they're really the same learning algorithm?

[Student] [Inaudible] constants?

[Instructor] Say that again.

[Student] [Inaudible]

[Instructor] Oh, right. Okay, cool. No, exactly, right - this is not the same. And the reason is, in logistic regression this is different from before, right? The definition of this H subscript theta of X^(i) is not the same as the definition I was using in the previous lecture. In particular, this is no longer theta transpose X^(i); this is not a linear function anymore. This is the logistic function of theta transpose X^(i). Okay? So even though this looks cosmetically similar - even though this is similar on the surface to the batch gradient descent rule I derived last time for least squares regression - this is actually a totally different learning algorithm. Okay? And it turns out it's actually no coincidence that you ended up with the same learning rule. We'll talk a bit more about this later, when we talk about generalized linear models; this is one of the most elegant properties of generalized linear models that we'll see later. Even though we're using a different model, you actually end up with what looks like the same learning algorithm, and it's actually no coincidence. Cool.
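A minimal sketch of this update in code - my own illustration under made-up conventions (a design matrix X whose first column is the x0 = 1 intercept feature, labels y in {0, 1}), not the course's code:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def logistic_gradient_ascent(X, y, alpha=0.1, num_iters=1000):
        """Batch gradient ASCENT on the log likelihood l(theta).

        X: (m, d) array with a leading column of ones; y: (m,) array of 0/1 labels.
        """
        m, d = X.shape
        theta = np.zeros(d)
        for _ in range(num_iters):
            h = sigmoid(X @ theta)             # h_theta(x^(i)) for every example
            gradient = X.T @ (y - h)           # sum_i (y^(i) - h_theta(x^(i))) x^(i)
            theta += (alpha / m) * gradient    # plus sign: ascent; dividing by m just rescales alpha
        return theta

Swapping the sigmoid for the identity function gives back exactly the least squares update from the previous lecture - same surface form, different hypothesis - which is the point being made above.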
One last comment, as part of, sort of, the learning process. Over here I said I take the derivatives and I ended up with this line. I didn't want to make you sit through a long algebraic derivation, but later today or later this week, please do go home and look at our lecture notes, where I wrote out the entirety of this derivation in full, and make sure you can follow every single step of how we take partial derivatives of this log likelihood to get this formula over here. Okay?

By the way, for those who are interested in seriously mastering the machine learning material: when you go home and look at the lecture notes, it will actually be very easy for most of you to read through every line and go "yep, that makes sense, that makes sense, that makes sense," and, sort of, say "cool, I see how you get this line." But you want to make sure you really understand the material. My concrete suggestion to you would be to go home, read through the lecture notes, check every line, and then cover up the derivation and see if you can rederive it yourself, right? In general, that's usually good advice for studying technical material like machine learning: if you work through a proof and you think you understood every line, the way to make sure you really understood it is to cover it up and see if you can rederive the entire thing yourself. This is actually a great way to study - I did this a lot when I was trying to study various pieces of machine learning theory and various proofs: cover up the derivations and see if you can do them yourself without looking at the original.

All right. I probably won't get to Newton's Method today. I just want to take one quick digression to talk about one more algorithm, which the discussion was sort of alluding to earlier, which is the perceptron algorithm, right? I'm not gonna say a whole lot about the perceptron algorithm, but this is something that we'll come back to later - later this quarter we'll talk about learning theory.
So in logistic regression, we said that G of Z - sort of, my hypothesis - output values that were numbers between zero and one. The question is: what if you want to force G of Z to output a value that is either zero or one? So the perceptron algorithm defines G of Z to be this. So the picture is - or the cartoon is - rather than the sigmoid function, G of Z now looks like the step function that you were asking about earlier. As before, we use H subscript theta of X equals G of theta transpose X. Okay? So everything is exactly the same as before, except that G of Z is now the step function.

It turns out there's a learning rule, called the perceptron learning rule, that actually looks just the same as the gradient ascent rule for logistic regression. And the learning rule is given by this. Okay? So it looks just like the gradient ascent rule for logistic regression.

But this is a very different flavor of algorithm than least squares regression and logistic regression, and, in particular, because it outputs only values that are either zero or one, it turns out to be very difficult to endow this algorithm with probabilistic semantics. And this is - again, even though - oh, excuse me. Right there. Okay. Even though this learning rule, again, looks cosmetically very similar to what we have in logistic regression, this is actually a very different type of learning rule than the others that we'll see in this class. And this is such a simple learning algorithm, right? It just computes theta transpose X, then you threshold, and then your output is zero or one. So it's a simpler algorithm than logistic regression, I think. When we talk about learning theory later in this class, the simplicity of this algorithm will let us come back and use it as a building block. Okay? But that's all I want to say about this algorithm for now.
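As a sketch - again my own illustration, with the same made-up conventions as the earlier snippets (X has an x0 = 1 column, y holds 0/1 labels):

    import numpy as np

    def perceptron(X, y, alpha=0.1, num_epochs=10):
        """Perceptron learning rule: the update has the same form as before,
        but the hypothesis g(theta^T x) is a hard 0/1 threshold."""
        m, d = X.shape
        theta = np.zeros(d)
        for _ in range(num_epochs):
            for i in range(m):                            # one example at a time
                h_i = 1.0 if X[i] @ theta >= 0 else 0.0   # step-function hypothesis
                theta += alpha * (y[i] - h_i) * X[i]      # theta_j += alpha * (y - h) * x_j
        return theta

Because the output is thresholded to exactly zero or one, there is no natural probabilistic reading of the hypothesis here, which is the contrast with logistic regression noted above.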