(music)

This presentation is delivered by the Stanford Center for Professional Development.

Okay. Good morning, and welcome back to the third lecture of this class. So here's what I want to do today, and some of the topics I cover today may seem a little bit like I'm jumping from topic to topic, but here's the outline for today and the logical flow of ideas. In the last lecture we talked about linear regression, and today I want to talk about an adaptation of that called locally weighted regression. It's a very powerful algorithm that's actually probably my former mentor's favorite machine learning algorithm. We'll then talk about a probabilistic interpretation of linear regression and use that to move on to our first classification algorithm, which is logistic regression; take a brief digression to tell you about something called the perceptron algorithm, which is something we'll come back to again later this quarter; and, time allowing, I hope to get to Newton's method, which is an algorithm for fitting logistic regression models.

So let's just recap what we were talking about in the previous lecture. Remember, the notation I defined was that I used (x^(i), y^(i)) to denote the ith training example.
And when we're talking about linear regression, or ordinary least squares, we use h(x^(i)) to denote the predicted value output by my hypothesis h on the input x^(i). My hypothesis was parameterized by the vector of parameters theta, and so we said that this was equal to the sum over j of theta_j x_j^(i), written more simply as theta transpose x. And we had the convention that x_0 is equal to one, so this accounts for the intercept term in our linear regression model. Lowercase n here was the notation I was using for the number of features in my training set. Okay? So in the example of trying to predict housing prices, we had two features, the size of the house and the number of bedrooms, and so little n was equal to two.

So just to finish recapping the previous lecture, we defined this quadratic cost function J(theta) = 1/2 sum from i = 1 to m of (h_theta(x^(i)) - y^(i))^2, where this is the sum over the m training examples in my training set. So lowercase m was the notation I've been using to denote the number of training examples I have, the size of my training set.
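Just to pin down this notation, here is a minimal sketch in code (illustrative only, not code from the course) of the hypothesis h_theta(x) = theta^T x and the cost J(theta); the names X, y, and theta below are placeholders:

```python
import numpy as np

def hypothesis(theta, x):
    # h_theta(x) = theta^T x, where x already includes the intercept entry x_0 = 1.
    return theta @ x

def cost(theta, X, y):
    # J(theta) = 1/2 * sum_{i=1}^{m} (h_theta(x^(i)) - y^(i))^2
    # X is the m-by-(n+1) design matrix whose first column is all ones,
    # y is the length-m vector of target values.
    residuals = X @ theta - y
    return 0.5 * np.sum(residuals ** 2)
```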
And at the end of the last lecture, we derived the value of theta that minimizes this in closed form, which was theta = (X^T X)^(-1) X^T y. Okay?

So as we move on in today's lecture, I'll continue to use this notation, and, again, I realize this is a fair amount of notation to remember, so if partway through this lecture you're having trouble remembering what lowercase m is or what lowercase n is or something, please raise your hand and ask.

When we talked about linear regression last time we used two features. One of the features was the size of the houses in square feet, so the living area of the house, and the other feature was the number of bedrooms in the house. In general, when you apply a machine learning algorithm to some problem that you care about, the choice of the features will very much be up to you, right? And the way you choose the features to give the learning algorithm will often have a large impact on how it actually does. So just for example, the choice we made last time was x_1 equal to the size, and let's leave aside the feature of the number of bedrooms for now; let's say we don't have data that tells us how many bedrooms are in these houses. One thing you could do is actually define - oh, let's draw this out. And so, right? Say that axis is the size of the house and that axis is the price of the house. So if you use this as a feature, maybe you get theta_0 + theta_1 x_1, this sort of linear model. If you choose - let me just copy the same data set over, right?
You can define a set of features where x_1 is equal to the size of the house and x_2 is the square of the size of the house. Okay? So x_1 is the size of the house in, say, square footage, and x_2 is just whatever the square footage of the house is, squared. This would be another way to come up with a feature, and if you do that then the same algorithm will end up fitting a quadratic function for you, theta_0 + theta_1 x + theta_2 x^2. Okay? Because this x^2 term is actually just x_2. And depending on what the data looks like, maybe this is a slightly better fit to the data.

You can actually take this even further, right? Which is - let's see. I have seven training examples here, so you can actually fit up to a sixth-order polynomial. You can actually fit a model theta_0 + theta_1 x + theta_2 x^2 + ... + theta_6 x^6, a sixth-order polynomial, to these seven data points. And if you do that you find that you come up with a model that fits your data exactly. In this example I drew, we have seven data points, so if you fit a sixth-order polynomial you can, sort of, fit a curve that passes through these seven points perfectly. And you'll probably find that the curve you get will look something like that. And on the one hand, this is a great model in the sense that it fits your training data perfectly.
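As a concrete illustration of this point about feature choice, here is a small sketch that builds polynomial features and fits them with the closed-form theta = (X^T X)^(-1) X^T y from the recap; the seven data points below are made up to stand in for the points on the board, and none of the names or numbers come from the lecture itself:

```python
import numpy as np

def poly_features(x, degree):
    # Columns are x^0 (the intercept term x_0 = 1), x^1, ..., x^degree.
    return np.vander(x, degree + 1, increasing=True)

def fit_normal_equation(X, y):
    # Closed-form fit from the recap: theta = (X^T X)^{-1} X^T y
    # (solved with lstsq, the numerically safer equivalent).
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta

# Seven made-up (size, price) pairs standing in for the seven points on the board.
x = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
y = np.array([1.1, 1.9, 2.7, 3.0, 3.2, 3.3, 3.35])

theta_line = fit_normal_equation(poly_features(x, 1), y)   # h(x) = theta_0 + theta_1 x
theta_six = fit_normal_equation(poly_features(x, 6), y)    # sixth-order polynomial

# With 7 points and 7 parameters, the sixth-order fit reproduces y (almost) exactly.
print(poly_features(x, 6) @ theta_six - y)   # residuals near zero
```

The near-zero residuals of the sixth-order fit are exactly the "fits the training data perfectly" behavior just described, which is where the discussion of overfitting picks up.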
On the other hand, this is probably not a very good model in the sense that none of us seriously think that this is a very good predictor of housing prices as a function of the size of the house, right? So we'll actually come back to this later. It turns out, of the models we have here, I feel like maybe the quadratic model fits the data best. Whereas the linear model - it looks like there's actually a bit of a quadratic component in this data that the linear function is not capturing. So we'll come back to this a little bit later and talk about the problems associated with fitting models that are either too simple, that use too small a set of features, or models that are too complex and maybe use too large a set of features.

Just to give these a name, we call this the problem of underfitting, and, very informally, this refers to a setting where there are obvious patterns in the data that the algorithm is just failing to fit. And this problem here we refer to as overfitting, and, again, very informally, this is when the algorithm is fitting the idiosyncrasies of this specific data set, right?
It just so happens that, of the seven houses we sampled in Portland, or wherever you collect data from, that house happens to be a bit more expensive, that house happens to be a little less expensive, and by fitting a sixth-order polynomial we're, sort of, fitting the idiosyncratic properties of this data set rather than the true underlying trend of how housing prices vary as a function of the size of the house. Okay? So these are two very different problems. We'll define them more formally later and talk about how to address each of these problems, but for now I hope you appreciate that there is this issue of selecting features.

So if you want to address this in a learning problem, there are a few ways to do so. We'll talk about feature selection algorithms later this quarter as well - so, automatic algorithms for choosing what features you use in a regression problem like this. What I want to do today is talk about a class of algorithms called non-parametric learning algorithms that will help to alleviate the need somewhat for you to choose features very carefully. Okay? And this leads us into our discussion of locally weighted regression.

And just to define the term, linear regression, as we've defined it so far, is an example of a parametric learning algorithm. A parametric learning algorithm is one that's defined as an algorithm that has a fixed number of parameters that fit to the data. Okay? So in linear regression we have a fixed set of parameters theta, right, that must fit to the data.
In contrast, what I'm gonna talk about now is our first non-parametric learning algorithm. The formal definition is not very intuitive, so I'll follow it with a second, more intuitive one. The, sort of, formal definition of a non-parametric learning algorithm is that it's an algorithm where the number of parameters grows with m, with the size of the training set. And usually it's defined as the number of parameters growing linearly with the size of the training set. That's the formal definition. A slightly less formal definition is that the amount of stuff that your learning algorithm needs to keep around will grow linearly with the training set or, another way of saying it, this is an algorithm where we'll need to keep around the entire training set, even after learning. Okay? So don't worry too much about this definition. What I want to do now is describe a specific non-parametric learning algorithm called locally weighted regression, which also goes by a couple of other names - which also goes by the name LOESS, for sort of historical reasons. LOESS is usually spelled L-O-E-S-S, sometimes spelled like that, too. I just call it locally weighted regression.

So here's the idea. This will be an algorithm that allows us to worry a little bit less about having to choose features very carefully. So for my motivating example, let's say that I have a training set that looks like this, okay? So this is x and that's y.
If you run linear regression on this and you fit maybe a linear function to this, you end up with a more or less flat, straight line, which is not a very good fit to this data. You can sit around and stare at this and try to decide whether the features are right. So maybe you want to toss in a quadratic function, but this isn't really quadratic either. So maybe you want to model this as x plus x squared plus maybe some function of sin of x or something. You actually sit around and fiddle with features, and after a while you can probably come up with a set of features for which the model is okay, but let's talk about an algorithm that you can use without needing to do that.

So if - well, suppose you want to evaluate your hypothesis h at a certain query point, lowercase x. Okay? And let's say you want to know what the predicted value of y is at this position of x, right? So for linear regression, what we were doing was we would fit theta to minimize the sum over i of (y^(i) - theta^T x^(i))^2, and return theta^T x. Okay? So that was linear regression. In contrast, in locally weighted linear regression you're going to do things slightly differently.
You're going to look at this point x, and then I'm going to look in my data set and take into account only the data points that are, sort of, in the little vicinity of x. Okay? So we'll look at where I want to evaluate my hypothesis; I'm going to look only in the vicinity of this point where I want to evaluate my hypothesis, and then I'm going to take, let's say, just these few points, and I will apply linear regression to fit a straight line just to this subset of the data. Okay? I'm using this term "subset" loosely - well, let's come back to that later. So we take this data set and I fit a straight line to it, and maybe I get a straight line like that. And what I'll do is then evaluate the straight line at this particular value, and that will be the value my algorithm returns. This would be the value that my hypothesis outputs in locally weighted regression. Okay?

So let me go ahead and formalize that. In locally weighted regression, we're going to fit theta to minimize the sum over i of w^(i) (y^(i) - theta^T x^(i))^2, where these terms w^(i) are called weights. There are many possible choices for the weights; I'm just gonna write one down: w^(i) = exp(-(x^(i) - x)^2 / 2). So let's look at what these weights really are, right? So notice that - suppose you have a training example x^(i) such that x^(i) is very close to x, so that this quantity x^(i) - x is small, right?
Then if x^(i) - x is small, so if x^(i) - x is close to zero, then this is e to the minus zero, and e to the zero is one. So if x^(i) is close to x, then w^(i) will be close to one. In other words, the weight associated with the ith training example will be close to one if x^(i) and x are close to each other. Conversely, if x^(i) - x is large, then - I don't know, what would w^(i) be?

Student: Zero.

Instructor: Zero, right. Close to zero. Right. So if x^(i) is very far from x, then this is e to the minus of some large number, and e to the minus of some large number will be close to zero. Okay?

So the picture is, if I'm querying at a certain point x, shown on the x-axis, and if my data set, say, looks like that, then I'm going to give the points close to this a large weight and give the points far away a small weight. So for the points that are far away, w^(i) will be close to zero, and so the points that are far away will not contribute much at all to this summation, right? So I think of this as the sum over i of one times the quadratic term for nearby points plus zero times the quadratic term for faraway points. And so the effect of using this weighting is that locally weighted linear regression fits a set of parameters theta paying much more attention to fitting the points close by accurately, whereas it ignores the contribution from the faraway points. Okay?
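To make the behavior of these weights concrete, a tiny sketch with made-up numbers (the query point and training inputs below are purely illustrative):

```python
import numpy as np

def weight(x_i, x_query):
    # w^(i) = exp(-(x^(i) - x)^2 / 2), as the formula stands so far
    # (the bandwidth parameter tau is introduced a little further down).
    return np.exp(-(x_i - x_query) ** 2 / 2.0)

print(weight(3.05, 3.0))   # nearby training example  -> weight close to 1
print(weight(9.00, 3.0))   # faraway training example -> weight close to 0
```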
Yeah?

Student: Why is it exponentially [inaudible]?

Instructor: Yeah. Let's see. So it turns out there are many other weighting functions you can use. It turns out that there are definitely different communities of researchers that tend to choose different choices by default. There is somewhat of a literature debating exactly what function to use. This, sort of, exponential decay function happens to be a reasonably common one that seems to be a reasonable choice on many problems, but you can actually plug in other functions as well. Did I mention what [inaudible] is? For those of you that are familiar with the normal distribution, or the Gaussian distribution - this formula I've written out here cosmetically looks a bit like a Gaussian distribution. Okay? But this actually has absolutely nothing to do with the Gaussian distribution. So this is not saying that x^(i) is Gaussian or anything like that; there's no such interpretation. This is just a convenient function that happens to be a bell-shaped function, but don't endow it with any Gaussian semantics. Okay?
So, in fact - well, if you remember the familiar bell-shaped Gaussian - again, the way of associating weights with these points is that if you imagine putting this bell-shaped bump centered around the position where you want to evaluate your hypothesis h, then I'm saying that this point here gets a weight that's proportional to the height of the Gaussian - excuse me, to the height of the bell-shaped function - evaluated at that point. And the weight given to this point, to this training example, will be proportional to that height, and so on. Okay? And so training examples that are really far away get a very small weight.

One last small generalization to this is that normally there's one other parameter to this algorithm, which I'll denote as tau, so the weights become w^(i) = exp(-(x^(i) - x)^2 / (2 tau^2)). Again, this looks suspiciously like the variance of a Gaussian, but this is not a Gaussian; this is just a convenient form of function. This parameter tau is called the bandwidth parameter, and informally it controls how fast the weights fall off with distance. Okay? So let me just copy my diagram from the other side, I guess. So if tau is very small, and that's a query x, then you end up choosing a fairly narrow Gaussian - excuse me, a fairly narrow bell shape - so that the weights of the points far away fall off rapidly.
Whereas if tau is large, then you end up choosing a weighting function that falls off relatively slowly with distance from your query. Okay?

So I hope you can, therefore, see that if you apply locally weighted linear regression to a data set that looks like this, then to ask what your hypothesis output is at a point like this, you end up having a straight line making that prediction. To ask what the value of this [inaudible] is at that value, you put a straight line there and you predict that value. It turns out that every time you want to evaluate your hypothesis, every time you ask your learning algorithm to make a prediction for how much a new house costs or whatever, you need to run a new fitting procedure and then evaluate the line that you fit just at the position of the value of x, so the position of the query where you're trying to make a prediction. Okay? But if you do this for every point along the x-axis, then you find that locally weighted regression is able to trace out this, sort of, very non-linear curve for a data set like this. Okay?

So in the problem set we're actually gonna let you play around more with this algorithm, so I won't say too much more about it here.
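Putting the pieces together, here is one way the whole procedure just described might be sketched. This is an illustrative implementation rather than the course's own code: it solves each local fit with the weighted normal equation theta = (X^T W X)^(-1) X^T W y, which is one standard way to do the weighted least-squares step, and the data set and tau value are made up.

```python
import numpy as np

def lwr_predict(x_query, x_train, y_train, tau=0.8):
    # Weights fall off with distance from the query point; smaller tau -> narrower bump.
    w = np.exp(-(x_train - x_query) ** 2 / (2.0 * tau ** 2))
    # Design matrix with the intercept term x_0 = 1.
    X = np.column_stack([np.ones_like(x_train), x_train])
    W = np.diag(w)
    # Weighted least squares fit of a straight line: theta = (X^T W X)^{-1} X^T W y.
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y_train)
    # Evaluate the locally fitted line only at the query point.
    return theta[0] + theta[1] * x_query

# Made-up wiggly data; a single global straight line would fit it poorly.
rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 10.0, 60)
y_train = np.sin(x_train) + 0.1 * rng.standard_normal(60)

# A new fit is run for every query point, tracing out a non-linear curve.
x_grid = np.linspace(0.0, 10.0, 200)
y_grid = np.array([lwr_predict(xq, x_train, y_train, tau=0.8) for xq in x_grid])
```

Note that every call to lwr_predict re-fits theta using the whole training set, which is exactly the non-parametric behavior, and the storage cost, discussed in the questions below.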
But to finally move on to the next topic, let me check what questions you have. Yeah?

Student: It seems like you still have the same problem of overfitting and underfitting, like when you choose tau. Like if you make it too small in your -

Instructor: Yes, absolutely. Yes. So locally weighted regression can run into - locally weighted regression is not a panacea for the problem of overfitting or underfitting. You can still run into the same problems with locally weighted regression. And what you just said - some of these things I'll leave you to discover for yourself in the homework problem. You'll actually see what you just mentioned. Yeah?

Student: It almost seems like you're not even thoroughly [inaudible] with this locally weighted - you still have all the data that you originally had anyway. I'm just trying to think of [inaudible] the original data points.

Instructor: Right. So the question is, sort of, this - it's almost as if you're not building a model, because you need the entire data set. And another way of saying that is that this is a non-parametric learning algorithm. So - I don't know, I won't debate whether we're really building a model or not. But this is perfectly fine - so when you write code implementing locally weighted linear regression on a data set, I think of that code as a whole as building your model. So it actually works - we've actually used this quite successfully to model, sort of, the dynamics of this autonomous helicopter. Yeah?
Student: I was asking if this algorithm can learn the weights based on the data?

Instructor: Learn what weights? Oh, the weights w^(i)?

Student: Instead of using [inaudible].

Instructor: I see, yes. So it turns out there are a few things you can do. One thing that is quite common is to choose this bandwidth parameter tau, right, using the data. We'll actually talk about that a bit later when we talk about model selection. Yes? One last question.

Student: I used [inaudible] Gaussian sometimes if you [inaudible] Gaussian and then -

Instructor: Oh, I guess. Let's see. Boy. The weights are not random variables, and for the purpose of this algorithm it is not useful to endow them with probabilistic semantics. So you could choose to define things as Gaussian, but it, sort of, doesn't lead anywhere. In fact, it turns out that I happened to choose this, sort of, bell-shaped function to define my weights, but it's actually fine to choose a function that doesn't even integrate to one, that integrates to infinity, say, as your weighting function. So in that sense, I mean, you could force in the definition of a Gaussian, but it's, sort of, not useful, especially since you can use other functions that integrate to infinity and don't integrate to one. Okay? That's the last question, and let's move on.

Student: Assume that we have a very huge [inaudible], for example, a very huge set of houses, and want to predict the line for each house, and so should the end result for each input - I'm seeing this very constantly for -

Instructor: Yes, you're right. So because locally weighted regression is a non-parametric algorithm, every time you make a prediction you need to fit theta to your entire training set again.
So you're actually right. If you have a very large training set, then this is a somewhat expensive algorithm to use, because every time you want to make a prediction you need to fit a straight line to a huge data set again. It turns out there are algorithms that - it turns out there are ways to make this much more efficient for large data sets as well. I don't want to talk about that here. If you're interested, look up the work of Andrew Moore on KD-trees. He, sort of, figured out ways to fit these models much more efficiently. That's not something I want to go into today. Okay? Let me move on. Let's take more questions later. So, okay.

So that's locally weighted regression. Remember the outline I had, I guess, at the beginning of this lecture. What I want to do now is talk about a probabilistic interpretation of linear regression, all right? And, in particular, it'll be this probabilistic interpretation that lets us move on to talk about logistic regression, which will be our first classification algorithm.

So let's put aside locally weighted regression for now. We'll just talk about ordinary unweighted linear regression. Let's ask the question of why least squares, right? Of all the things we could optimize, how do we come up with this criterion of minimizing the square of the error between the predictions of the hypothesis and the values y? So why not minimize the absolute value of the errors, or the errors to the power of four, or something?
What I'm going to do now is present one set of assumptions that will serve to "justify" why we're minimizing the sum of squared errors. Okay? It turns out that there are many assumptions that are sufficient to justify why we do least squares, and this is just one of them. So I'm presenting just one set of assumptions under which least squares regression makes sense, but this is not the only set of assumptions; even if the assumptions I describe don't hold, least squares actually still makes sense in many circumstances. But this will, sort of, help give one rationalization, like, one reason for doing least squares regression. And, in particular, what I'm going to do is endow the least squares model with probabilistic semantics.

So let's assume, in our example of predicting housing prices, that the price the house is sold for, y^(i), is going to be some linear function of the features, theta transpose x^(i), plus some term epsilon^(i). Okay? And epsilon^(i) will be our error term. You can think of the error term as capturing unmodeled effects - like, maybe there are some other features of a house, like how many fireplaces it has or whether there's a garden or whatever, that are additional features we just failed to capture - or you can think of epsilon as random noise. Epsilon is our error term that captures both of these: unmodeled effects, just things we forgot to model, maybe the function isn't quite linear or something,
as well as random noise, like maybe that day the seller was in a really bad mood and just refused to go for a reasonable price or something.

And now I will assume that the errors have a probabilistic - have a probability distribution. I'll assume that the errors epsilon^(i) are distributed - I'll write epsilon^(i) ~ N(0, sigma^2) to denote that epsilon^(i) is distributed according to a Gaussian distribution with mean zero and variance sigma squared. Okay? So, the script N here stands for normal, right? It denotes a normal distribution, also known as the Gaussian distribution, with mean zero and variance sigma squared.

Actually, just quickly raise your hand if you've seen a Gaussian distribution before. Okay, cool. Most of you. Great. Almost everyone.

So, in other words, the density for the Gaussian is what you've seen before. The density for epsilon^(i) would be p(epsilon^(i)) = 1 / (sqrt(2 pi) sigma) exp(-(epsilon^(i))^2 / (2 sigma^2)), right? And the density of our epsilon^(i) will be this bell-shaped curve, with one standard deviation being, sort of, sigma. Okay? That's the form of that bell-shaped curve. So, let's see. I can erase that. Can I erase the board?
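As a small illustration of the assumption just written down, here is the Gaussian density for epsilon and the corresponding way data would be generated under this model; the values of theta, sigma, and the feature ranges below are made up, not from the lecture:

```python
import numpy as np

def gaussian_density(eps, sigma):
    # p(eps) = 1 / (sqrt(2*pi) * sigma) * exp(-eps^2 / (2 * sigma^2))
    return np.exp(-eps ** 2 / (2.0 * sigma ** 2)) / (np.sqrt(2.0 * np.pi) * sigma)

# Generative story being assumed: y^(i) = theta^T x^(i) + epsilon^(i),
# with epsilon^(i) ~ N(0, sigma^2).  theta and sigma here are made-up numbers.
rng = np.random.default_rng(1)
theta = np.array([50.0, 0.12])                       # illustrative intercept and slope
sigma = 20.0
sizes = rng.uniform(500.0, 3000.0, size=100)         # x_1 = living area (made up)
X = np.column_stack([np.ones_like(sizes), sizes])    # x_0 = 1 for the intercept
epsilon = rng.normal(0.0, sigma, size=100)
y = X @ theta + epsilon   # so p(y^(i) | x^(i); theta) is N(theta^T x^(i), sigma^2)
```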
Dialogue: 0,0:32:47.04,0:32:54.04,Default,,0000,0000,0000,, Dialogue: 0,0:32:59.16,0:33:04.77,Default,,0000,0000,0000,,So this implies that the Dialogue: 0,0:33:04.77,0:33:10.12,Default,,0000,0000,0000,,probability distribution of a price of a house Dialogue: 0,0:33:10.12,0:33:11.53,Default,,0000,0000,0000,,given in si Dialogue: 0,0:33:11.53,0:33:13.62,Default,,0000,0000,0000,,and the parameters theta, Dialogue: 0,0:33:13.62,0:33:20.62,Default,,0000,0000,0000,,that this is going to be Gaussian Dialogue: 0,0:33:29.53,0:33:34.82,Default,,0000,0000,0000,,with that density. Okay? Dialogue: 0,0:33:34.82,0:33:37.93,Default,,0000,0000,0000,,In other words, saying goes as that the Dialogue: 0,0:33:37.93,0:33:41.44,Default,,0000,0000,0000,,price of Dialogue: 0,0:33:41.44,0:33:46.55,Default,,0000,0000,0000,,a house given the features of the house and my parameters theta, Dialogue: 0,0:33:46.55,0:33:50.53,Default,,0000,0000,0000,,this is going to be a random variable Dialogue: 0,0:33:50.53,0:33:53.82,Default,,0000,0000,0000,,that's distributed Gaussian with Dialogue: 0,0:33:53.82,0:33:56.70,Default,,0000,0000,0000,,mean theta transpose XI Dialogue: 0,0:33:56.70,0:33:58.04,Default,,0000,0000,0000,,and variance sigma squared. Dialogue: 0,0:33:58.04,0:34:01.12,Default,,0000,0000,0000,,Right? Because we imagine that Dialogue: 0,0:34:01.12,0:34:04.99,Default,,0000,0000,0000,,the way the housing prices are generated is that the price of a house Dialogue: 0,0:34:04.99,0:34:08.99,Default,,0000,0000,0000,,is equal to theta transpose XI and then plus some random Gaussian noise with variance sigma Dialogue: 0,0:34:08.99,0:34:11.65,Default,,0000,0000,0000,,squared. So Dialogue: 0,0:34:11.65,0:34:13.59,Default,,0000,0000,0000,,the price of a house is going to Dialogue: 0,0:34:13.59,0:34:16.24,Default,,0000,0000,0000,,have mean theta transpose XI, again, and sigma squared, right? Does Dialogue: 0,0:34:16.24,0:34:20.07,Default,,0000,0000,0000,,this make Dialogue: 0,0:34:20.07,0:34:24.38,Default,,0000,0000,0000,,sense? Raise your hand if this makes sense. Yeah, Dialogue: 0,0:34:24.38,0:34:31.38,Default,,0000,0000,0000,,okay. Lots of you. In Dialogue: 0,0:34:38.37,0:34:43.94,Default,,0000,0000,0000,,point of notation - oh, yes? Assuming we don't know anything about the error, why do you assume Dialogue: 0,0:34:43.94,0:34:47.49,Default,,0000,0000,0000,,here the error is a Dialogue: 0,0:34:47.49,0:34:50.30,Default,,0000,0000,0000,,Gaussian? Dialogue: 0,0:34:50.30,0:34:54.04,Default,,0000,0000,0000,,Right. So, boy. Dialogue: 0,0:34:54.04,0:34:56.43,Default,,0000,0000,0000,,Why do I see the error as Gaussian? Dialogue: 0,0:34:56.43,0:35:00.91,Default,,0000,0000,0000,,Two reasons, right? One is that it turns out to be mathematically convenient to do so Dialogue: 0,0:35:00.91,0:35:03.11,Default,,0000,0000,0000,,and the other is, I don't Dialogue: 0,0:35:03.11,0:35:06.53,Default,,0000,0000,0000,,know, I can also mumble about justifications, such as things to the Dialogue: 0,0:35:06.53,0:35:08.61,Default,,0000,0000,0000,,central limit theorem. It turns out that if you, Dialogue: 0,0:35:08.61,0:35:11.99,Default,,0000,0000,0000,,for the vast majority of problems, if you apply a linear regression model like Dialogue: 0,0:35:11.99,0:35:15.68,Default,,0000,0000,0000,,this and try to measure the distribution of the errors, Dialogue: 0,0:35:15.68,0:35:19.51,Default,,0000,0000,0000,,not all the time, but very often you find that the errors really are Gaussian. 
That Dialogue: 0,0:35:19.51,0:35:21.78,Default,,0000,0000,0000,,this Gaussian model is a good Dialogue: 0,0:35:21.78,0:35:23.52,Default,,0000,0000,0000,,assumption for the error Dialogue: 0,0:35:23.52,0:35:26.08,Default,,0000,0000,0000,,in regression problems like these. Dialogue: 0,0:35:26.08,0:35:29.00,Default,,0000,0000,0000,,Some of you may have heard of the central limit theorem, which says that Dialogue: 0,0:35:29.00,0:35:33.22,Default,,0000,0000,0000,,the sum of many independent random variables will tend towards a Gaussian. Dialogue: 0,0:35:33.22,0:35:37.45,Default,,0000,0000,0000,,So if the error is caused by many effects, like the mood of the Dialogue: 0,0:35:37.45,0:35:39.06,Default,,0000,0000,0000,,seller, the mood of the buyer, Dialogue: 0,0:35:39.06,0:35:42.68,Default,,0000,0000,0000,,some other features that we miss, whether the place has a garden or not, and Dialogue: 0,0:35:42.68,0:35:45.63,Default,,0000,0000,0000,,if all of these effects are independent, then Dialogue: 0,0:35:45.63,0:35:49.26,Default,,0000,0000,0000,,by the central limit theorem you might be inclined to believe that Dialogue: 0,0:35:49.26,0:35:52.26,Default,,0000,0000,0000,,the sum of all these effects will be approximately Gaussian. If Dialogue: 0,0:35:52.26,0:35:53.76,Default,,0000,0000,0000,,in practice, I guess, the Dialogue: 0,0:35:53.76,0:35:57.81,Default,,0000,0000,0000,,two real answers are that, 1.) In practice this is actually a reasonably accurate Dialogue: 0,0:35:57.81,0:36:04.81,Default,,0000,0000,0000,,assumption, and 2.) Is it turns out to be mathematically convenient to do so. Okay? Yeah? It seems like we're Dialogue: 0,0:36:06.51,0:36:07.54,Default,,0000,0000,0000,,saying Dialogue: 0,0:36:07.54,0:36:10.54,Default,,0000,0000,0000,,if we assume that area around model Dialogue: 0,0:36:10.54,0:36:12.85,Default,,0000,0000,0000,,has zero mean, then Dialogue: 0,0:36:12.85,0:36:16.27,Default,,0000,0000,0000,,the area is centered around our model. Which Dialogue: 0,0:36:16.27,0:36:18.06,Default,,0000,0000,0000,,it seems almost like we're trying to assume Dialogue: 0,0:36:18.06,0:36:19.96,Default,,0000,0000,0000,,what we're trying to prove. Instructor? Dialogue: 0,0:36:19.96,0:36:23.64,Default,,0000,0000,0000,,That's the [inaudible] but, yes. You are assuming that Dialogue: 0,0:36:23.64,0:36:28.36,Default,,0000,0000,0000,,the error has zero mean. Which is, yeah, right. Dialogue: 0,0:36:28.36,0:36:30.92,Default,,0000,0000,0000,,I think later this quarter we get to some of the other Dialogue: 0,0:36:30.92,0:36:35.06,Default,,0000,0000,0000,,things, but for now just think of this as a mathematically - it's actually not Dialogue: 0,0:36:35.06,0:36:37.88,Default,,0000,0000,0000,,an unreasonable assumption. Dialogue: 0,0:36:37.88,0:36:40.39,Default,,0000,0000,0000,,I guess, Dialogue: 0,0:36:40.39,0:36:44.98,Default,,0000,0000,0000,,in machine learning all the assumptions we make are almost never Dialogue: 0,0:36:44.98,0:36:48.52,Default,,0000,0000,0000,,true in the absence sense, right? 
Because, for instance, Dialogue: 0,0:36:48.52,0:36:54.36,Default,,0000,0000,0000,,housing prices are priced to dollars and cents, so the error will be - Dialogue: 0,0:36:54.36,0:36:55.20,Default,,0000,0000,0000,,errors Dialogue: 0,0:36:55.20,0:36:58.72,Default,,0000,0000,0000,,in prices are not continued as value random variables, because Dialogue: 0,0:36:58.72,0:37:02.24,Default,,0000,0000,0000,,houses can only be priced at a certain number of dollars and a certain number of Dialogue: 0,0:37:02.24,0:37:05.97,Default,,0000,0000,0000,,cents and you never have fractions of cents in housing prices. Dialogue: 0,0:37:05.97,0:37:10.34,Default,,0000,0000,0000,,Whereas a Gaussian random variable would. So in that sense, assumptions we make are never Dialogue: 0,0:37:10.34,0:37:14.69,Default,,0000,0000,0000,,"absolutely true," but for practical purposes this is a Dialogue: 0,0:37:14.69,0:37:19.46,Default,,0000,0000,0000,,accurate enough assumption that it'll be useful to Dialogue: 0,0:37:19.46,0:37:23.53,Default,,0000,0000,0000,,make. Okay? I think in a week or two, we'll actually come back to Dialogue: 0,0:37:23.53,0:37:27.28,Default,,0000,0000,0000,,selected more about the assumptions we make and when they help our learning Dialogue: 0,0:37:27.28,0:37:29.06,Default,,0000,0000,0000,,algorithms and when they hurt our learning Dialogue: 0,0:37:29.06,0:37:30.84,Default,,0000,0000,0000,,algorithms. We'll say a bit more about it Dialogue: 0,0:37:30.84,0:37:37.84,Default,,0000,0000,0000,,when we talk about generative and discriminative learning algorithms, like, in a week or Dialogue: 0,0:37:40.32,0:37:44.82,Default,,0000,0000,0000,,two. Okay? So let's point out one bit of notation, which is that when I Dialogue: 0,0:37:44.82,0:37:48.40,Default,,0000,0000,0000,,wrote this down I actually wrote P of YI given XI and then semicolon Dialogue: 0,0:37:48.40,0:37:49.72,Default,,0000,0000,0000,,theta Dialogue: 0,0:37:49.72,0:37:54.100,Default,,0000,0000,0000,,and I'm going to use this notation when we are not thinking of theta as a Dialogue: 0,0:37:54.100,0:37:56.80,Default,,0000,0000,0000,,random variable. So Dialogue: 0,0:37:56.80,0:38:00.71,Default,,0000,0000,0000,,in statistics, though, sometimes it's called the frequentist's point of view, Dialogue: 0,0:38:00.71,0:38:02.38,Default,,0000,0000,0000,,where you think of there as being some, Dialogue: 0,0:38:02.38,0:38:06.45,Default,,0000,0000,0000,,sort of, true value of theta that's out there that's generating the data say, Dialogue: 0,0:38:06.45,0:38:07.52,Default,,0000,0000,0000,,but Dialogue: 0,0:38:07.52,0:38:11.30,Default,,0000,0000,0000,,we don't know what theta is, but theta is not a random Dialogue: 0,0:38:11.30,0:38:13.44,Default,,0000,0000,0000,,vehicle, right? So it's not like there's some Dialogue: 0,0:38:13.44,0:38:16.27,Default,,0000,0000,0000,,random value of theta out there. It's that theta is - Dialogue: 0,0:38:16.27,0:38:19.14,Default,,0000,0000,0000,,there's some true value of theta out there. It's just that we don't Dialogue: 0,0:38:19.14,0:38:22.46,Default,,0000,0000,0000,,know what the true value of theta is. 
So Dialogue: 0,0:38:22.46,0:38:25.62,Default,,0000,0000,0000,,if theta is not a random variable, then I'm Dialogue: 0,0:38:25.62,0:38:27.30,Default,,0000,0000,0000,,going to avoid Dialogue: 0,0:38:27.30,0:38:31.36,Default,,0000,0000,0000,,writing P of YI given XI comma theta, because this would mean Dialogue: 0,0:38:31.36,0:38:34.90,Default,,0000,0000,0000,,that probably of YI conditioned on X and theta Dialogue: 0,0:38:34.90,0:38:40.20,Default,,0000,0000,0000,,and you can only condition on random variables. Dialogue: 0,0:38:40.20,0:38:43.01,Default,,0000,0000,0000,,So at this part of the class where we're taking Dialogue: 0,0:38:43.01,0:38:46.49,Default,,0000,0000,0000,,sort of frequentist's viewpoint rather than the Dasian viewpoint, in this part of class Dialogue: 0,0:38:46.49,0:38:49.22,Default,,0000,0000,0000,,we're thinking of theta not as a random variable, but just as something Dialogue: 0,0:38:49.22,0:38:50.47,Default,,0000,0000,0000,,we're trying to estimate Dialogue: 0,0:38:50.47,0:38:52.83,Default,,0000,0000,0000,,and use the semicolon Dialogue: 0,0:38:52.83,0:38:57.56,Default,,0000,0000,0000,,notation. So the way to read this is this is the probability of YI given XI Dialogue: 0,0:38:57.56,0:39:00.41,Default,,0000,0000,0000,,and parameterized by theta. Okay? So Dialogue: 0,0:39:00.41,0:39:03.72,Default,,0000,0000,0000,,you read the semicolon as parameterized by. Dialogue: 0,0:39:03.72,0:39:08.27,Default,,0000,0000,0000,,And in the same way here, I'll say YI given XI parameterized by Dialogue: 0,0:39:08.27,0:39:09.50,Default,,0000,0000,0000,,theta is distributed Dialogue: 0,0:39:09.50,0:39:16.50,Default,,0000,0000,0000,,Gaussian with that. All right. Dialogue: 0,0:39:36.32,0:39:38.33,Default,,0000,0000,0000,,So we're gonna make one more assumption. Dialogue: 0,0:39:38.33,0:39:41.28,Default,,0000,0000,0000,,Let's assume that the Dialogue: 0,0:39:41.28,0:39:43.82,Default,,0000,0000,0000,,error terms are Dialogue: 0,0:39:43.82,0:39:48.27,Default,,0000,0000,0000,, Dialogue: 0,0:39:48.27,0:39:50.61,Default,,0000,0000,0000,,IID, okay? Dialogue: 0,0:39:50.61,0:39:54.28,Default,,0000,0000,0000,,Which stands for Independently and Identically Distributed. So it's Dialogue: 0,0:39:54.28,0:39:57.26,Default,,0000,0000,0000,,going to assume that the error terms are Dialogue: 0,0:39:57.26,0:40:04.26,Default,,0000,0000,0000,,independent of each other, Dialogue: 0,0:40:04.47,0:40:11.28,Default,,0000,0000,0000,,right? Dialogue: 0,0:40:11.28,0:40:14.75,Default,,0000,0000,0000,,The identically distributed part just means that I'm assuming the outcome for the Dialogue: 0,0:40:14.75,0:40:18.01,Default,,0000,0000,0000,,same Gaussian distribution or the same variance, Dialogue: 0,0:40:18.01,0:40:22.10,Default,,0000,0000,0000,,but the more important part of is this is that I'm assuming that the epsilon I's are Dialogue: 0,0:40:22.10,0:40:25.52,Default,,0000,0000,0000,,independent of each other. Dialogue: 0,0:40:25.52,0:40:28.90,Default,,0000,0000,0000,,Now, let's talk about how to fit a model. Dialogue: 0,0:40:28.90,0:40:32.36,Default,,0000,0000,0000,,The probability of Y given Dialogue: 0,0:40:32.36,0:40:36.24,Default,,0000,0000,0000,,X parameterized by theta - I'm actually going to give Dialogue: 0,0:40:36.24,0:40:39.15,Default,,0000,0000,0000,,this another name. 
I'm going to write this down Dialogue: 0,0:40:39.15,0:40:42.75,Default,,0000,0000,0000,,and we'll call this the likelihood of theta Dialogue: 0,0:40:42.75,0:40:46.44,Default,,0000,0000,0000,,as the probability of Y given X parameterized by theta. Dialogue: 0,0:40:46.44,0:40:49.01,Default,,0000,0000,0000,,And so this is going to be Dialogue: 0,0:40:49.01,0:40:50.42,Default,,0000,0000,0000,,the product Dialogue: 0,0:40:50.42,0:40:57.42,Default,,0000,0000,0000,,over my training set like that. Dialogue: 0,0:40:59.94,0:41:04.30,Default,,0000,0000,0000,,Which is, in turn, going to be a product of Dialogue: 0,0:41:04.30,0:41:10.54,Default,,0000,0000,0000,,those Gaussian densities that I wrote down just now, Dialogue: 0,0:41:10.54,0:41:14.55,Default,,0000,0000,0000,,right? Dialogue: 0,0:41:14.55,0:41:20.36,Default,,0000,0000,0000,,Okay? Dialogue: 0,0:41:20.36,0:41:24.63,Default,,0000,0000,0000,,Then in parts of notation, I guess, I define this term here to be the Dialogue: 0,0:41:24.63,0:41:26.18,Default,,0000,0000,0000,,likelihood of theta. Dialogue: 0,0:41:26.18,0:41:30.11,Default,,0000,0000,0000,,And the likely of theta is just the probability of the data Y, right? Given X Dialogue: 0,0:41:30.11,0:41:32.79,Default,,0000,0000,0000,,and prioritized by theta. Dialogue: 0,0:41:32.79,0:41:36.62,Default,,0000,0000,0000,,To test the likelihood and probability are often confused. Dialogue: 0,0:41:36.62,0:41:40.52,Default,,0000,0000,0000,,So the likelihood of theta is the same thing as the Dialogue: 0,0:41:40.52,0:41:45.51,Default,,0000,0000,0000,,probability of the data you saw. So likely and probably are, sort of, the same thing. Dialogue: 0,0:41:45.51,0:41:47.78,Default,,0000,0000,0000,,Except that when I use the term likelihood Dialogue: 0,0:41:47.78,0:41:51.80,Default,,0000,0000,0000,,I'm trying to emphasize that I'm taking this thing Dialogue: 0,0:41:51.80,0:41:55.38,Default,,0000,0000,0000,,and viewing it as a function of theta. Dialogue: 0,0:41:55.38,0:41:57.13,Default,,0000,0000,0000,,Okay? Dialogue: 0,0:41:57.13,0:42:00.61,Default,,0000,0000,0000,,So likelihood and for probability, they're really the same thing except that Dialogue: 0,0:42:00.61,0:42:02.34,Default,,0000,0000,0000,,when I want to view this thing Dialogue: 0,0:42:02.34,0:42:05.66,Default,,0000,0000,0000,,as a function of theta holding X and Y fix are Dialogue: 0,0:42:05.66,0:42:10.09,Default,,0000,0000,0000,,then called likelihood. Okay? So Dialogue: 0,0:42:10.09,0:42:13.30,Default,,0000,0000,0000,,hopefully you hear me say the likelihood of the parameters and the probability Dialogue: 0,0:42:13.30,0:42:15.12,Default,,0000,0000,0000,,of the data, Dialogue: 0,0:42:15.12,0:42:18.47,Default,,0000,0000,0000,,right? Rather than the likelihood of the data or probability of parameters. So try Dialogue: 0,0:42:18.47,0:42:25.47,Default,,0000,0000,0000,,to be consistent in that terminology. Dialogue: 0,0:42:30.51,0:42:31.52,Default,,0000,0000,0000,,So given that Dialogue: 0,0:42:31.52,0:42:33.53,Default,,0000,0000,0000,,the probability of the data is this and this Dialogue: 0,0:42:33.53,0:42:36.86,Default,,0000,0000,0000,,is also the likelihood of the parameters, Dialogue: 0,0:42:36.86,0:42:37.81,Default,,0000,0000,0000,,how do you estimate Dialogue: 0,0:42:37.81,0:42:40.46,Default,,0000,0000,0000,,the parameters theta? So given a training set, Dialogue: 0,0:42:40.46,0:42:46.31,Default,,0000,0000,0000,,what parameters theta do you want to choose for your model? 
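A small sketch of the likelihood just defined, assuming NumPy and a tiny made-up training set: L of theta is the product, over the training examples, of Gaussian densities with mean theta transpose XI and variance sigma squared.

```python
import numpy as np

def likelihood(theta, X, y, sigma):
    """L(theta) = product over i of N(y_i; theta^T x_i, sigma^2), viewed as a function of theta."""
    resid = y - X @ theta
    dens = np.exp(-resid**2 / (2.0 * sigma**2)) / (np.sqrt(2.0 * np.pi) * sigma)
    return np.prod(dens)

# Tiny made-up training set; the first column of X is the intercept feature x_0 = 1.
X = np.array([[1.0, 2104.0], [1.0, 1416.0], [1.0, 1534.0]])
y = np.array([400.0, 232.0, 315.0])
print(likelihood(np.array([0.0, 0.17]), X, y, sigma=30.0))
```

For a realistically sized training set this product of small densities underflows, which is one more practical reason, besides the algebraic convenience mentioned next, to work with its log.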
Dialogue: 0,0:42:46.31,0:42:53.31,Default,,0000,0000,0000,, Dialogue: 0,0:42:58.75,0:43:01.55,Default,,0000,0000,0000,,Well, the principle of maximum likelihood Dialogue: 0,0:43:01.55,0:43:02.71,Default,,0000,0000,0000,,estimation Dialogue: 0,0:43:02.71,0:43:09.33,Default,,0000,0000,0000,,says that, Dialogue: 0,0:43:09.33,0:43:13.30,Default,,0000,0000,0000,,right? You should choose the value of theta that makes the data Dialogue: 0,0:43:13.30,0:43:20.04,Default,,0000,0000,0000,,as probable as possible, right? So choose theta Dialogue: 0,0:43:20.04,0:43:27.04,Default,,0000,0000,0000,,to maximize the likelihood. Or Dialogue: 0,0:43:27.20,0:43:29.77,Default,,0000,0000,0000,,in other words choose the parameters that make Dialogue: 0,0:43:29.77,0:43:33.12,Default,,0000,0000,0000,,the data as probable as possible, right? So this is Dialogue: 0,0:43:33.12,0:43:36.94,Default,,0000,0000,0000,,maximum likelihood estimation. So it's: choose the parameters that make Dialogue: 0,0:43:36.94,0:43:39.54,Default,,0000,0000,0000,,it as likely, as probable, as possible Dialogue: 0,0:43:39.54,0:43:43.31,Default,,0000,0000,0000,,for me to have seen the data I just Dialogue: 0,0:43:43.31,0:43:46.82,Default,,0000,0000,0000,,did. So Dialogue: 0,0:43:46.82,0:43:53.19,Default,,0000,0000,0000,,for mathematical convenience, let me define lower case l of theta. Dialogue: 0,0:43:53.19,0:43:57.96,Default,,0000,0000,0000,,This is called the log likelihood function and it's just log Dialogue: 0,0:43:57.96,0:44:01.43,Default,,0000,0000,0000,,of capital L of theta. Dialogue: 0,0:44:01.43,0:44:06.31,Default,,0000,0000,0000,,So this is log of the product over I Dialogue: 0,0:44:06.31,0:44:09.54,Default,,0000,0000,0000,,of one over root two pi Dialogue: 0,0:44:09.54,0:44:14.30,Default,,0000,0000,0000,,sigma, E to that. I won't bother to write out what's in the exponent for now. It's just saying this Dialogue: 0,0:44:14.30,0:44:16.61,Default,,0000,0000,0000,,from the previous board. Dialogue: 0,0:44:16.61,0:44:23.61,Default,,0000,0000,0000,,Log of a product is the same as the sum of the logs, right? So it's a sum Dialogue: 0,0:44:24.51,0:44:31.51,Default,,0000,0000,0000,,of Dialogue: 0,0:44:35.14,0:44:38.40,Default,,0000,0000,0000,,the logs of - which simplifies to m times Dialogue: 0,0:44:38.40,0:44:39.35,Default,,0000,0000,0000,,log of one over root Dialogue: 0,0:44:39.35,0:44:43.51,Default,,0000,0000,0000,,two pi Dialogue: 0,0:44:43.51,0:44:44.38,Default,,0000,0000,0000,,sigma Dialogue: 0,0:44:44.38,0:44:47.47,Default,,0000,0000,0000,,plus Dialogue: 0,0:44:47.47,0:44:52.10,Default,,0000,0000,0000,,and then log and exponentiation cancel each other, right? So log of E to Dialogue: 0,0:44:52.10,0:44:53.09,Default,,0000,0000,0000,,something is just Dialogue: 0,0:44:53.09,0:45:00.09,Default,,0000,0000,0000,,whatever's inside the exponent. So, you know what, Dialogue: 0,0:45:01.23,0:45:08.23,Default,,0000,0000,0000,,let me write this on the Dialogue: 0,0:45:12.08,0:45:16.13,Default,,0000,0000,0000,,next Dialogue: 0,0:45:16.13,0:45:21.36,Default,,0000,0000,0000,,board. Dialogue: 0,0:45:21.36,0:45:28.36,Default,,0000,0000,0000,,Okay.
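The board work itself isn't captured in the captions; a standard reconstruction of the algebra just described, in the lecture's notation, is:

```latex
\begin{aligned}
\ell(\theta) &= \log L(\theta)
  = \log \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma}
    \exp\!\left( -\frac{(y^{(i)} - \theta^{T} x^{(i)})^{2}}{2\sigma^{2}} \right) \\
 &= \sum_{i=1}^{m} \left[ \log \frac{1}{\sqrt{2\pi}\,\sigma}
    - \frac{(y^{(i)} - \theta^{T} x^{(i)})^{2}}{2\sigma^{2}} \right] \\
 &= m \log \frac{1}{\sqrt{2\pi}\,\sigma}
    - \frac{1}{\sigma^{2}} \cdot \frac{1}{2}
      \sum_{i=1}^{m} \left( y^{(i)} - \theta^{T} x^{(i)} \right)^{2}.
\end{aligned}
```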
Dialogue: 0,0:45:32.76,0:45:39.76,Default,,0000,0000,0000,, Dialogue: 0,0:45:46.24,0:45:50.51,Default,,0000,0000,0000,,So Dialogue: 0,0:45:50.51,0:45:53.35,Default,,0000,0000,0000,,maximizing the likelihood or maximizing the log Dialogue: 0,0:45:53.35,0:45:58.44,Default,,0000,0000,0000,,likelihood is the same Dialogue: 0,0:45:58.44,0:46:02.94,Default,,0000,0000,0000,,as minimizing Dialogue: 0,0:46:02.94,0:46:09.94,Default,,0000,0000,0000,,that term over there. Well, you get it, right? Dialogue: 0,0:46:22.02,0:46:26.04,Default,,0000,0000,0000,,Because there's a minus sign. So maximizing this because of the minus sign is the same as Dialogue: 0,0:46:26.04,0:46:26.83,Default,,0000,0000,0000,,minimizing Dialogue: 0,0:46:26.83,0:46:33.23,Default,,0000,0000,0000,,this as a function of theta. And Dialogue: 0,0:46:33.23,0:46:35.73,Default,,0000,0000,0000,,this is, of course, just Dialogue: 0,0:46:35.73,0:46:42.73,Default,,0000,0000,0000,,the same quadratic cos function that we had last time, J of theta, Dialogue: 0,0:46:43.09,0:46:44.47,Default,,0000,0000,0000,,right? So what Dialogue: 0,0:46:44.47,0:46:48.34,Default,,0000,0000,0000,,we've just shown is that the ordinary least squares algorithm, Dialogue: 0,0:46:48.34,0:46:50.54,Default,,0000,0000,0000,,that we worked on the previous lecture, Dialogue: 0,0:46:50.54,0:46:55.02,Default,,0000,0000,0000,,is just maximum likelihood Dialogue: 0,0:46:55.02,0:46:56.08,Default,,0000,0000,0000,,assuming Dialogue: 0,0:46:56.08,0:46:58.22,Default,,0000,0000,0000,,this probabilistic model, Dialogue: 0,0:46:58.22,0:47:05.22,Default,,0000,0000,0000,,assuming IID Gaussian errors on our data. Dialogue: 0,0:47:06.11,0:47:09.66,Default,,0000,0000,0000,, Dialogue: 0,0:47:09.66,0:47:10.72,Default,,0000,0000,0000,,Okay? One thing that we'll Dialogue: 0,0:47:10.72,0:47:12.48,Default,,0000,0000,0000,,actually leave is that, Dialogue: 0,0:47:12.48,0:47:14.49,Default,,0000,0000,0000,,in the next lecture notice that Dialogue: 0,0:47:14.49,0:47:17.06,Default,,0000,0000,0000,,the value of sigma squared doesn't matter, Dialogue: 0,0:47:17.06,0:47:17.93,Default,,0000,0000,0000,,right? That somehow Dialogue: 0,0:47:17.93,0:47:21.39,Default,,0000,0000,0000,,no matter what the value of sigma squared is, I mean, sigma squared has to be a positive number. It's a Dialogue: 0,0:47:21.39,0:47:21.100,Default,,0000,0000,0000,,variance Dialogue: 0,0:47:21.100,0:47:26.23,Default,,0000,0000,0000,,of a Gaussian. So that no matter what sigma Dialogue: 0,0:47:26.23,0:47:30.16,Default,,0000,0000,0000,,squared is since it's a positive number the value of theta we end up with Dialogue: 0,0:47:30.16,0:47:33.62,Default,,0000,0000,0000,,will be the same, right? So because Dialogue: 0,0:47:33.62,0:47:34.66,Default,,0000,0000,0000,,minimizing this Dialogue: 0,0:47:34.66,0:47:39.29,Default,,0000,0000,0000,,you get the same value of theta no matter what sigma squared is. So it's as if Dialogue: 0,0:47:39.29,0:47:42.69,Default,,0000,0000,0000,,in this model the value of sigma squared doesn't really matter. Dialogue: 0,0:47:42.69,0:47:45.79,Default,,0000,0000,0000,,Just remember that for the next lecture. We'll come back Dialogue: 0,0:47:45.79,0:47:47.51,Default,,0000,0000,0000,,to this again. Dialogue: 0,0:47:47.51,0:47:51.43,Default,,0000,0000,0000,,Any questions about this? Dialogue: 0,0:47:51.43,0:47:52.67,Default,,0000,0000,0000,,Actually, let me clean up Dialogue: 0,0:47:52.67,0:47:59.67,Default,,0000,0000,0000,,another couple of boards and then I'll see what questions you have. Okay. Any questions? 
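A quick numerical illustration of this equivalence, as a sketch assuming NumPy and made-up data: over a small grid of candidate parameters, the theta that maximizes the Gaussian log likelihood is exactly the theta that minimizes J, no matter which sigma squared is assumed.

```python
import numpy as np

def J(theta, X, y):
    """Least-squares cost from the previous lecture: (1/2) * sum_i (theta^T x_i - y_i)^2."""
    return 0.5 * np.sum((X @ theta - y) ** 2)

def log_likelihood(theta, X, y, sigma):
    """Gaussian log likelihood: m * log(1/(sqrt(2*pi)*sigma)) - J(theta) / sigma^2."""
    m = len(y)
    return m * np.log(1.0 / (np.sqrt(2.0 * np.pi) * sigma)) - J(theta, X, y) / sigma**2

# Made-up one-feature data set; X includes the x_0 = 1 intercept column.
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(50), rng.uniform(0.0, 10.0, 50)])
y = X @ np.array([1.0, 2.0]) + rng.normal(0.0, 1.0, 50)

# Scan a small grid of (intercept, slope) candidates.
grid = [np.array([a, b]) for a in np.linspace(0.0, 2.0, 41) for b in np.linspace(1.0, 3.0, 41)]
best_ls = min(grid, key=lambda t: J(t, X, y))
for sigma in (0.5, 1.0, 5.0):   # sigma^2 rescales and shifts l(theta) but never moves its argmax
    best_ml = max(grid, key=lambda t: log_likelihood(t, X, y, sigma))
    print(sigma, best_ls, best_ml)   # the same theta wins every time
```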
Yeah? You are, I think here you try to measure the likelihood of your nice of Dialogue: 0,0:48:43.93,0:48:50.93,Default,,0000,0000,0000,,theta by Dialogue: 0,0:48:51.21,0:48:52.08,Default,,0000,0000,0000,,a fraction Dialogue: 0,0:48:52.08,0:48:53.81,Default,,0000,0000,0000,,of error, Dialogue: 0,0:48:53.81,0:48:55.37,Default,,0000,0000,0000,,but I think it's that you Dialogue: 0,0:48:55.37,0:48:57.32,Default,,0000,0000,0000,,measure Dialogue: 0,0:48:57.32,0:49:01.18,Default,,0000,0000,0000,,because it depends on the family of theta too, for example. If Dialogue: 0,0:49:01.18,0:49:05.27,Default,,0000,0000,0000,, Dialogue: 0,0:49:05.27,0:49:09.07,Default,,0000,0000,0000,,you have a lot of parameters [inaudible] or fitting in? Yeah, yeah. I mean, you're asking about overfitting, whether this is a good model. I think Dialogue: 0,0:49:09.07,0:49:12.61,Default,,0000,0000,0000,,let's - the thing's you're mentioning are Dialogue: 0,0:49:12.61,0:49:14.55,Default,,0000,0000,0000,,maybe deeper questions about Dialogue: 0,0:49:14.55,0:49:17.87,Default,,0000,0000,0000,,learning algorithms that we'll just come back to later, so don't really want to get into Dialogue: 0,0:49:17.87,0:49:19.46,Default,,0000,0000,0000,,that right Dialogue: 0,0:49:19.46,0:49:22.20,Default,,0000,0000,0000,,now. Any more Dialogue: 0,0:49:22.20,0:49:29.20,Default,,0000,0000,0000,,questions? Okay. So Dialogue: 0,0:49:33.22,0:49:38.76,Default,,0000,0000,0000,,this endows linear regression with a probabilistic interpretation. Dialogue: 0,0:49:38.76,0:49:43.21,Default,,0000,0000,0000,,I'm actually going to use this probabil - use this, sort of, probabilistic Dialogue: 0,0:49:43.21,0:49:43.94,Default,,0000,0000,0000,,interpretation Dialogue: 0,0:49:43.94,0:49:46.19,Default,,0000,0000,0000,,in order to derive our next learning algorithm, Dialogue: 0,0:49:46.19,0:49:50.31,Default,,0000,0000,0000,,which will be our first classification algorithm. Okay? Dialogue: 0,0:49:50.31,0:49:53.62,Default,,0000,0000,0000,,So Dialogue: 0,0:49:53.62,0:49:57.93,Default,,0000,0000,0000,,you'll recall that I said that regression problems are where the variable Y Dialogue: 0,0:49:57.93,0:50:01.34,Default,,0000,0000,0000,,that you're trying to predict is continuous values. Dialogue: 0,0:50:01.34,0:50:04.26,Default,,0000,0000,0000,,Now I'm actually gonna talk about our first classification problem, Dialogue: 0,0:50:04.26,0:50:07.86,Default,,0000,0000,0000,,where the value Y you're trying to predict Dialogue: 0,0:50:07.86,0:50:10.94,Default,,0000,0000,0000,,will be discreet value. You can take on only a small number of discrete values Dialogue: 0,0:50:10.94,0:50:14.38,Default,,0000,0000,0000,,and in this case I'll talk about binding classification Dialogue: 0,0:50:14.38,0:50:15.71,Default,,0000,0000,0000,,where Dialogue: 0,0:50:15.71,0:50:18.77,Default,,0000,0000,0000,,Y takes on only two values, right? So Dialogue: 0,0:50:18.77,0:50:21.89,Default,,0000,0000,0000,,you come up with classification problems if you're trying to do, Dialogue: 0,0:50:21.89,0:50:25.90,Default,,0000,0000,0000,,say, a medical diagnosis and try to decide based on some features Dialogue: 0,0:50:25.90,0:50:29.89,Default,,0000,0000,0000,,that the patient has a disease or does not have a disease. Dialogue: 0,0:50:29.89,0:50:34.24,Default,,0000,0000,0000,,Or if in the housing example, maybe you're trying to decide will this house sell in the Dialogue: 0,0:50:34.24,0:50:37.51,Default,,0000,0000,0000,,next six months or not and the answer is either yes or no. 
It'll either be sold in the Dialogue: 0,0:50:37.51,0:50:40.99,Default,,0000,0000,0000,,next six months or it won't be. Dialogue: 0,0:50:40.99,0:50:44.53,Default,,0000,0000,0000,,Other standing examples, if you want to build a spam filter. Is this e-mail spam Dialogue: 0,0:50:44.53,0:50:50.93,Default,,0000,0000,0000,,or not? It's yes or no. Or if you, you know, some of my colleagues sit in whether predicting Dialogue: 0,0:50:50.93,0:50:55.44,Default,,0000,0000,0000,,whether a computer system will crash. So you have a learning algorithm to predict will Dialogue: 0,0:50:55.44,0:50:59.16,Default,,0000,0000,0000,,this computing cluster crash over the next 24 hours? And, again, it's a yes Dialogue: 0,0:50:59.16,0:51:06.04,Default,,0000,0000,0000,,or no answer. So Dialogue: 0,0:51:06.04,0:51:08.15,Default,,0000,0000,0000,,there's X, there's Y. Dialogue: 0,0:51:08.15,0:51:14.12,Default,,0000,0000,0000,,And in a classification problem Dialogue: 0,0:51:14.12,0:51:15.27,Default,,0000,0000,0000,,Y takes on Dialogue: 0,0:51:15.27,0:51:18.57,Default,,0000,0000,0000,,two values, zero and one. That's it in binding the classification. Dialogue: 0,0:51:18.57,0:51:21.63,Default,,0000,0000,0000,,So what can you do? Well, one thing you could do is Dialogue: 0,0:51:21.63,0:51:25.30,Default,,0000,0000,0000,,take linear regression, as we've described it so far, and apply it to this problem, Dialogue: 0,0:51:25.30,0:51:26.40,Default,,0000,0000,0000,,right? So you, Dialogue: 0,0:51:26.40,0:51:30.29,Default,,0000,0000,0000,,you know, given this data set you can fit a straight line to it. Maybe Dialogue: 0,0:51:30.29,0:51:31.83,Default,,0000,0000,0000,,you get that straight line, right? Dialogue: 0,0:51:31.83,0:51:32.43,Default,,0000,0000,0000,,But Dialogue: 0,0:51:32.43,0:51:36.77,Default,,0000,0000,0000,,this data set I've drawn, right? This is an amazingly easy classification problem. It's Dialogue: 0,0:51:36.77,0:51:37.89,Default,,0000,0000,0000,,pretty obvious Dialogue: 0,0:51:37.89,0:51:41.42,Default,,0000,0000,0000,,to all of us that, right? The relationship between X and Y is - Dialogue: 0,0:51:41.42,0:51:47.73,Default,,0000,0000,0000,,well, you just look at a value around here and it's the right is one, it's Dialogue: 0,0:51:47.73,0:51:51.53,Default,,0000,0000,0000,,the left and Y is zero. So you apply linear regressions to this data set and you get a reasonable fit and you can then Dialogue: 0,0:51:51.53,0:51:53.65,Default,,0000,0000,0000,,maybe take your linear regression Dialogue: 0,0:51:53.65,0:51:55.67,Default,,0000,0000,0000,,hypothesis to this straight line Dialogue: 0,0:51:55.67,0:51:58.49,Default,,0000,0000,0000,,and threshold it at 0.5. Dialogue: 0,0:51:58.49,0:52:02.06,Default,,0000,0000,0000,,If you do that you'll certainly get the right answer. You predict that Dialogue: 0,0:52:02.06,0:52:02.66,Default,,0000,0000,0000,, Dialogue: 0,0:52:02.66,0:52:04.27,Default,,0000,0000,0000,,if X is to the right of, sort Dialogue: 0,0:52:04.27,0:52:06.23,Default,,0000,0000,0000,,of, the mid-point here Dialogue: 0,0:52:06.23,0:52:11.83,Default,,0000,0000,0000,,then Y is one and then next to the left of that mid-point then Y is zero. Dialogue: 0,0:52:11.83,0:52:16.10,Default,,0000,0000,0000,,So some people actually do this. 
Apply linear regression to classification problems Dialogue: 0,0:52:16.10,0:52:17.72,Default,,0000,0000,0000,,and sometimes it'll Dialogue: 0,0:52:17.72,0:52:19.32,Default,,0000,0000,0000,,work okay, Dialogue: 0,0:52:19.32,0:52:21.82,Default,,0000,0000,0000,,but in general it's actually a pretty bad idea to Dialogue: 0,0:52:21.82,0:52:26.13,Default,,0000,0000,0000,,apply linear regression to Dialogue: 0,0:52:26.13,0:52:31.85,Default,,0000,0000,0000,,classification problems like these and here's why. Let's say I Dialogue: 0,0:52:31.85,0:52:33.53,Default,,0000,0000,0000,,change my training set Dialogue: 0,0:52:33.53,0:52:39.51,Default,,0000,0000,0000,,by giving you just one more training example all the way up there, right? Dialogue: 0,0:52:39.51,0:52:43.05,Default,,0000,0000,0000,,Imagine if given this training set is actually still entirely obvious what the Dialogue: 0,0:52:43.05,0:52:45.67,Default,,0000,0000,0000,,relationship between X and Y is, right? It's just - Dialogue: 0,0:52:45.67,0:52:50.84,Default,,0000,0000,0000,,take this value as greater than Y is one and it's less then Y Dialogue: 0,0:52:50.84,0:52:52.04,Default,,0000,0000,0000,,is zero. Dialogue: 0,0:52:52.04,0:52:55.06,Default,,0000,0000,0000,,By giving you this additional training example it really shouldn't Dialogue: 0,0:52:55.06,0:52:56.43,Default,,0000,0000,0000,,change anything. I mean, Dialogue: 0,0:52:56.43,0:52:59.29,Default,,0000,0000,0000,,I didn't really convey much new information. There's no surprise that this Dialogue: 0,0:52:59.29,0:53:01.67,Default,,0000,0000,0000,,corresponds to Y equals one. Dialogue: 0,0:53:01.67,0:53:04.59,Default,,0000,0000,0000,,But if you now fit linear regression to this data Dialogue: 0,0:53:04.59,0:53:07.16,Default,,0000,0000,0000,,set you end up with a line that, I Dialogue: 0,0:53:07.16,0:53:10.41,Default,,0000,0000,0000,,don't know, maybe looks like that, right? Dialogue: 0,0:53:10.41,0:53:13.43,Default,,0000,0000,0000,,And now the predictions of your Dialogue: 0,0:53:13.43,0:53:15.74,Default,,0000,0000,0000,,hypothesis have changed completely if Dialogue: 0,0:53:15.74,0:53:22.74,Default,,0000,0000,0000,,your threshold - your hypothesis at Y equal both 0.5. Okay? So - In between there might be an interval where it's zero, right? For that far off point? Oh, you mean, like that? Dialogue: 0,0:53:26.58,0:53:30.08,Default,,0000,0000,0000,,Right. Dialogue: 0,0:53:30.08,0:53:31.28,Default,,0000,0000,0000,,Yeah, yeah, fine. Yeah, sure. A theta Dialogue: 0,0:53:31.28,0:53:36.85,Default,,0000,0000,0000,,set like that so. So, I Dialogue: 0,0:53:36.85,0:53:38.61,Default,,0000,0000,0000,,guess, Dialogue: 0,0:53:38.61,0:53:42.18,Default,,0000,0000,0000,,these just - yes, you're right, but this is an example and this example works. This - [Inaudible] that will Dialogue: 0,0:53:42.18,0:53:47.61,Default,,0000,0000,0000,,change it even more if you gave it Dialogue: 0,0:53:47.61,0:53:49.58,Default,,0000,0000,0000,,all - Dialogue: 0,0:53:49.58,0:53:51.100,Default,,0000,0000,0000,,Yeah. Then I think this actually would make it even worse. You Dialogue: 0,0:53:51.100,0:53:54.35,Default,,0000,0000,0000,,would actually get a line that pulls out even further, right? So Dialogue: 0,0:53:54.35,0:53:57.82,Default,,0000,0000,0000,,this is my example. I get to make it whatever I want, right? But Dialogue: 0,0:53:57.82,0:54:00.69,Default,,0000,0000,0000,,the point of this is that there's not a deep meaning to this. 
The point of this is Dialogue: 0,0:54:00.69,0:54:01.88,Default,,0000,0000,0000,,just that Dialogue: 0,0:54:01.88,0:54:04.86,Default,,0000,0000,0000,,it could be a really bad idea to apply linear regression to classification Dialogue: 0,0:54:04.86,0:54:05.88,Default,,0000,0000,0000,, Dialogue: 0,0:54:05.88,0:54:11.61,Default,,0000,0000,0000,,problems. Sometimes it works fine, but usually I wouldn't do it. Dialogue: 0,0:54:11.61,0:54:13.77,Default,,0000,0000,0000,,So a couple of problems with this. One is that, Dialogue: 0,0:54:13.77,0:54:17.35,Default,,0000,0000,0000,,well - so what do you want to do Dialogue: 0,0:54:17.35,0:54:20.63,Default,,0000,0000,0000,,for classification? If you know the value of Y lies between zero and Dialogue: 0,0:54:20.63,0:54:24.71,Default,,0000,0000,0000,,one then to kind of fix this problem Dialogue: 0,0:54:24.71,0:54:27.08,Default,,0000,0000,0000,,let's just start by Dialogue: 0,0:54:27.08,0:54:29.22,Default,,0000,0000,0000,,changing the form Dialogue: 0,0:54:29.22,0:54:33.87,Default,,0000,0000,0000,,of our hypothesis so that my hypothesis Dialogue: 0,0:54:33.87,0:54:39.28,Default,,0000,0000,0000,,always lies in the unit interval between zero and one. Okay? Dialogue: 0,0:54:39.28,0:54:41.83,Default,,0000,0000,0000,,So if I know Y is either Dialogue: 0,0:54:41.83,0:54:44.04,Default,,0000,0000,0000,,zero or one then Dialogue: 0,0:54:44.04,0:54:47.47,Default,,0000,0000,0000,,let's at least not have my hypothesis predict values much larger than one and much Dialogue: 0,0:54:47.47,0:54:50.94,Default,,0000,0000,0000,,smaller than zero. Dialogue: 0,0:54:50.94,0:54:52.37,Default,,0000,0000,0000,,And so Dialogue: 0,0:54:52.37,0:54:55.75,Default,,0000,0000,0000,,I'm going to - instead of choosing a linear function for my hypothesis I'm going Dialogue: 0,0:54:55.75,0:54:59.80,Default,,0000,0000,0000,,to choose something slightly different. And, Dialogue: 0,0:54:59.80,0:55:03.22,Default,,0000,0000,0000,,in particular, I'm going to choose Dialogue: 0,0:55:03.22,0:55:08.13,Default,,0000,0000,0000,,this function, H subscript theta of X is going to be equal to G of Dialogue: 0,0:55:08.13,0:55:10.84,Default,,0000,0000,0000,,theta transpose X Dialogue: 0,0:55:10.84,0:55:12.93,Default,,0000,0000,0000,,where Dialogue: 0,0:55:12.93,0:55:13.89,Default,,0000,0000,0000,,G Dialogue: 0,0:55:13.89,0:55:17.28,Default,,0000,0000,0000,,is going to be this Dialogue: 0,0:55:17.28,0:55:18.33,Default,,0000,0000,0000,,function and so Dialogue: 0,0:55:18.33,0:55:20.82,Default,,0000,0000,0000,,this becomes one over one plus E to the Dialogue: 0,0:55:20.82,0:55:22.82,Default,,0000,0000,0000,,negative theta Dialogue: 0,0:55:22.82,0:55:25.05,Default,,0000,0000,0000,,transpose X. Dialogue: 0,0:55:25.05,0:55:27.81,Default,,0000,0000,0000,,And G of Z is called the sigmoid Dialogue: 0,0:55:27.81,0:55:32.96,Default,,0000,0000,0000,,function and it Dialogue: 0,0:55:32.96,0:55:38.66,Default,,0000,0000,0000,,is often also called the logistic function. It Dialogue: 0,0:55:38.66,0:55:41.23,Default,,0000,0000,0000,,goes by either of these Dialogue: 0,0:55:41.23,0:55:48.23,Default,,0000,0000,0000,,names. And what G of Z looks like is the following. So when you have your Dialogue: 0,0:55:48.45,0:55:50.53,Default,,0000,0000,0000,,horizontal axis I'm going to plot Z Dialogue: 0,0:55:50.53,0:55:56.55,Default,,0000,0000,0000,,and so G of Z Dialogue: 0,0:55:56.55,0:55:58.88,Default,,0000,0000,0000,,will look like this. Dialogue: 0,0:55:58.88,0:56:04.76,Default,,0000,0000,0000,,Okay? I didn't draw that very well. Okay.
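A minimal sketch of the function just defined, assuming NumPy; it only checks the shape described next: near zero for very negative Z, near one for very positive Z, and exactly 0.5 at Z equals zero. The theta and x values are made up.

```python
import numpy as np

def g(z):
    """Sigmoid (logistic) function: g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    """Logistic-regression hypothesis: h_theta(x) = g(theta^T x), always strictly between 0 and 1."""
    return g(theta @ x)

print(g(-10), g(0.0), g(10))   # ~0.00005, 0.5, ~0.99995 -- tends to 0, crosses 0.5, tends to 1

theta = np.array([0.5, -1.0])  # made-up parameters
x = np.array([1.0, 2.0])       # made-up features, with x_0 = 1
print(h(theta, x))             # a value in (0, 1)
```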
Dialogue: 0,0:56:04.76,0:56:05.85,Default,,0000,0000,0000,,So G of Z Dialogue: 0,0:56:05.85,0:56:07.77,Default,,0000,0000,0000,,tends towards zero Dialogue: 0,0:56:07.77,0:56:09.91,Default,,0000,0000,0000,,as Z becomes very small Dialogue: 0,0:56:09.91,0:56:12.22,Default,,0000,0000,0000,,and G of Z will ascend Dialogue: 0,0:56:12.22,0:56:15.21,Default,,0000,0000,0000,,towards one as Z becomes large and it crosses the Dialogue: 0,0:56:15.21,0:56:17.09,Default,,0000,0000,0000,,vertical Dialogue: 0,0:56:17.09,0:56:19.52,Default,,0000,0000,0000,,axis at 0.5. Dialogue: 0,0:56:19.52,0:56:24.46,Default,,0000,0000,0000,,So this is what sigmoid function, also called the logistic function of. Yeah? Question? What sort of Dialogue: 0,0:56:24.46,0:56:27.48,Default,,0000,0000,0000,,sigmoid in other Dialogue: 0,0:56:27.48,0:56:30.30,Default,,0000,0000,0000,,step five? Say that again. Why we cannot chose this at five for some reason, like, that's Dialogue: 0,0:56:30.30,0:56:35.43,Default,,0000,0000,0000,,better binary. Yeah. Let me come back to that later. So it turns out that Y - where did I get this function from, Dialogue: 0,0:56:35.43,0:56:37.27,Default,,0000,0000,0000,,right? I just Dialogue: 0,0:56:37.27,0:56:38.84,Default,,0000,0000,0000,,wrote down this function. It actually Dialogue: 0,0:56:38.84,0:56:42.57,Default,,0000,0000,0000,,turns out that there are two reasons for using this function that we'll come to. Dialogue: 0,0:56:42.57,0:56:43.62,Default,,0000,0000,0000,,One is - Dialogue: 0,0:56:43.62,0:56:46.81,Default,,0000,0000,0000,,we talked about generalized linear models. We'll see that this falls out naturally Dialogue: 0,0:56:46.81,0:56:49.91,Default,,0000,0000,0000,,as part of the broader class of models. Dialogue: 0,0:56:49.91,0:56:51.09,Default,,0000,0000,0000,,And another reason Dialogue: 0,0:56:51.09,0:56:52.34,Default,,0000,0000,0000,,that we'll talk about Dialogue: 0,0:56:52.34,0:56:54.24,Default,,0000,0000,0000,,next week, it turns out Dialogue: 0,0:56:54.24,0:56:55.09,Default,,0000,0000,0000,,there are a couple of, Dialogue: 0,0:56:55.09,0:56:57.27,Default,,0000,0000,0000,,I think, very beautiful reasons for why Dialogue: 0,0:56:57.27,0:56:58.100,Default,,0000,0000,0000,,we choose logistic Dialogue: 0,0:56:58.100,0:57:00.47,Default,,0000,0000,0000,,functions. We'll see Dialogue: 0,0:57:00.47,0:57:02.09,Default,,0000,0000,0000,,that in a little bit. But for now let me just Dialogue: 0,0:57:02.09,0:57:07.25,Default,,0000,0000,0000,,define it and just take my word for it for now that this is a reasonable choice. Dialogue: 0,0:57:07.25,0:57:08.53,Default,,0000,0000,0000,,Okay? But notice now that Dialogue: 0,0:57:08.53,0:57:15.53,Default,,0000,0000,0000,,my - the values output by my hypothesis will always be between zero Dialogue: 0,0:57:15.88,0:57:17.29,Default,,0000,0000,0000,,and one. Furthermore, Dialogue: 0,0:57:17.29,0:57:20.72,Default,,0000,0000,0000,,just like we did for linear regression, I'm going to endow Dialogue: 0,0:57:20.72,0:57:25.55,Default,,0000,0000,0000,,the outputs and my hypothesis with a probabilistic interpretation, right? So Dialogue: 0,0:57:25.55,0:57:30.62,Default,,0000,0000,0000,,I'm going to assume that the probability that Y is equal to one Dialogue: 0,0:57:30.62,0:57:33.90,Default,,0000,0000,0000,,given X and parameterized by theta Dialogue: 0,0:57:33.90,0:57:35.56,Default,,0000,0000,0000,,that's equal to Dialogue: 0,0:57:35.56,0:57:38.44,Default,,0000,0000,0000,,H subscript theta of X, all right? 
Dialogue: 0,0:57:38.44,0:57:39.64,Default,,0000,0000,0000,,So in other words Dialogue: 0,0:57:39.64,0:57:42.79,Default,,0000,0000,0000,,I'm going to imagine that my hypothesis is outputting all these Dialogue: 0,0:57:42.79,0:57:45.41,Default,,0000,0000,0000,,numbers that lie between zero and one. Dialogue: 0,0:57:45.41,0:57:47.89,Default,,0000,0000,0000,,I'm going to think of my hypothesis Dialogue: 0,0:57:47.89,0:57:54.89,Default,,0000,0000,0000,,as trying to estimate the probability that Y is equal to one. Okay? Dialogue: 0,0:57:56.21,0:57:59.71,Default,,0000,0000,0000,,And because Dialogue: 0,0:57:59.71,0:58:02.12,Default,,0000,0000,0000,,Y has to be either zero or one Dialogue: 0,0:58:02.12,0:58:04.73,Default,,0000,0000,0000,,then the probability of Y equals zero is going Dialogue: 0,0:58:04.73,0:58:11.73,Default,,0000,0000,0000,,to be that. All right? Dialogue: 0,0:58:11.86,0:58:15.35,Default,,0000,0000,0000,,So more simply it turns out - actually, take these two equations Dialogue: 0,0:58:15.35,0:58:19.21,Default,,0000,0000,0000,,and write them more compactly. Dialogue: 0,0:58:19.21,0:58:22.14,Default,,0000,0000,0000,,Write P of Y given X Dialogue: 0,0:58:22.14,0:58:24.21,Default,,0000,0000,0000,,parameterized by theta. Dialogue: 0,0:58:24.21,0:58:25.84,Default,,0000,0000,0000,,This is going to be H Dialogue: 0,0:58:25.84,0:58:28.35,Default,,0000,0000,0000,,subscript theta of X to Dialogue: 0,0:58:28.35,0:58:31.95,Default,,0000,0000,0000,,the power of Y times Dialogue: 0,0:58:31.95,0:58:33.18,Default,,0000,0000,0000,,one minus Dialogue: 0,0:58:33.18,0:58:34.77,Default,,0000,0000,0000,,H of X to the power of Dialogue: 0,0:58:34.77,0:58:37.22,Default,,0000,0000,0000,,one minus Y. Okay? So I know this Dialogue: 0,0:58:37.22,0:58:42.89,Default,,0000,0000,0000,,looks somewhat bizarre, but this actually makes the variation much nicer. Dialogue: 0,0:58:42.89,0:58:44.44,Default,,0000,0000,0000,,So Y is equal to one Dialogue: 0,0:58:44.44,0:58:48.06,Default,,0000,0000,0000,,then this equation is H of X to the power of one Dialogue: 0,0:58:48.06,0:58:51.23,Default,,0000,0000,0000,,times something to the power of zero. Dialogue: 0,0:58:51.23,0:58:54.43,Default,,0000,0000,0000,,So anything to the power of zero is just one, Dialogue: 0,0:58:54.43,0:58:55.95,Default,,0000,0000,0000,,right? So Y equals one then Dialogue: 0,0:58:55.95,0:58:59.39,Default,,0000,0000,0000,,this is something to the power of zero and so this is just one. Dialogue: 0,0:58:59.39,0:59:03.35,Default,,0000,0000,0000,,So if Y equals one this is just saying P of Y equals one is equal to H subscript Dialogue: 0,0:59:03.35,0:59:05.92,Default,,0000,0000,0000,,theta of X. Okay? Dialogue: 0,0:59:05.92,0:59:08.25,Default,,0000,0000,0000,,And in the same way, if Y is equal Dialogue: 0,0:59:08.25,0:59:09.92,Default,,0000,0000,0000,,to zero then this is P Dialogue: 0,0:59:09.92,0:59:14.33,Default,,0000,0000,0000,,of Y equals zero equals this thing to the power of zero and so this disappears. This is Dialogue: 0,0:59:14.33,0:59:15.67,Default,,0000,0000,0000,,just one Dialogue: 0,0:59:15.67,0:59:17.58,Default,,0000,0000,0000,,times this thing power of one. Okay? So this is Dialogue: 0,0:59:17.58,0:59:18.60,Default,,0000,0000,0000,,a Dialogue: 0,0:59:18.60,0:59:20.27,Default,,0000,0000,0000,,compact way of writing Dialogue: 0,0:59:20.27,0:59:22.80,Default,,0000,0000,0000,,both of these equations to Dialogue: 0,0:59:22.80,0:59:29.80,Default,,0000,0000,0000,,gather them to one line. 
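A tiny check, in plain Python with a made-up value for H of X, that the compact expression reduces to the two separate cases:

```python
def p_y_given_x(y, h_val):
    """p(y | x; theta) = h^y * (1 - h)^(1 - y) for y in {0, 1}."""
    return (h_val ** y) * ((1.0 - h_val) ** (1 - y))

h_val = 0.73                      # pretend h_theta(x) = 0.73 for some example x
print(p_y_given_x(1, h_val))      # 0.73 -> P(y = 1 | x; theta) = h_theta(x)
print(p_y_given_x(0, h_val))      # 0.27 -> P(y = 0 | x; theta) = 1 - h_theta(x)
```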
Dialogue: 0,0:59:31.15,0:59:35.65,Default,,0000,0000,0000,,So let's hope our parameter fitting, right? And, again, you can ask - Dialogue: 0,0:59:35.65,0:59:38.39,Default,,0000,0000,0000,,well, given this model by data, how do I fit Dialogue: 0,0:59:38.39,0:59:41.41,Default,,0000,0000,0000,,the parameters theta of my Dialogue: 0,0:59:41.41,0:59:46.17,Default,,0000,0000,0000,,model? So the likelihood of the parameters is, as before, it's just the probability Dialogue: 0,0:59:46.17,0:59:48.64,Default,,0000,0000,0000,, Dialogue: 0,0:59:48.64,0:59:50.50,Default,,0000,0000,0000,, Dialogue: 0,0:59:50.50,0:59:54.47,Default,,0000,0000,0000,,of theta, right? Which is product over I, PFYI Dialogue: 0,0:59:54.47,0:59:56.71,Default,,0000,0000,0000,,given XI Dialogue: 0,0:59:56.71,0:59:59.12,Default,,0000,0000,0000,,parameterized by theta. Dialogue: 0,0:59:59.12,1:00:01.61,Default,,0000,0000,0000,,Which is - just plugging those Dialogue: 0,1:00:01.61,1:00:08.61,Default,,0000,0000,0000,,in. Okay? I Dialogue: 0,1:00:09.40,1:00:16.40,Default,,0000,0000,0000,,dropped this theta subscript just so you can write a little bit less. Oh, Dialogue: 0,1:00:17.48,1:00:19.70,Default,,0000,0000,0000,,excuse me. These Dialogue: 0,1:00:19.70,1:00:26.70,Default,,0000,0000,0000,,should be Dialogue: 0,1:00:29.38,1:00:35.75,Default,,0000,0000,0000,,XI's Dialogue: 0,1:00:35.75,1:00:42.75,Default,,0000,0000,0000,,and YI's. Okay? Dialogue: 0,1:00:51.30,1:00:53.36,Default,,0000,0000,0000,,So, Dialogue: 0,1:00:53.36,1:00:57.03,Default,,0000,0000,0000,,as before, let's say we want to find a maximum likelihood estimate of the parameters theta. So Dialogue: 0,1:00:57.03,1:00:58.76,Default,,0000,0000,0000,,we want Dialogue: 0,1:00:58.76,1:01:03.53,Default,,0000,0000,0000,,to find the - setting the parameters theta that maximizes the likelihood L Dialogue: 0,1:01:03.53,1:01:08.12,Default,,0000,0000,0000,,of theta. It Dialogue: 0,1:01:08.12,1:01:09.62,Default,,0000,0000,0000,,turns out Dialogue: 0,1:01:09.62,1:01:11.44,Default,,0000,0000,0000,,that very often Dialogue: 0,1:01:11.44,1:01:14.48,Default,,0000,0000,0000,,- just when you work with the derivations, it turns out that it is often Dialogue: 0,1:01:14.48,1:01:17.81,Default,,0000,0000,0000,,much easier to maximize the log of the likelihood rather than maximize the Dialogue: 0,1:01:17.81,1:01:19.56,Default,,0000,0000,0000,,likelihood. Dialogue: 0,1:01:19.56,1:01:20.44,Default,,0000,0000,0000,,So Dialogue: 0,1:01:20.44,1:01:22.64,Default,,0000,0000,0000,,the log Dialogue: 0,1:01:22.64,1:01:24.63,Default,,0000,0000,0000,,likelihood L of theta is just log of capital L. Dialogue: 0,1:01:24.63,1:01:27.88,Default,,0000,0000,0000,,This will, therefore, Dialogue: 0,1:01:27.88,1:01:34.88,Default,,0000,0000,0000,,be sum of this. Okay? Dialogue: 0,1:01:58.09,1:01:59.24,Default,,0000,0000,0000,,And so Dialogue: 0,1:01:59.24,1:02:03.69,Default,,0000,0000,0000,,to fit the parameters theta of our model we'll Dialogue: 0,1:02:03.69,1:02:05.93,Default,,0000,0000,0000,,find the value of Dialogue: 0,1:02:05.93,1:02:11.54,Default,,0000,0000,0000,,theta that maximizes this log likelihood. Yeah? [Inaudible] Say that again. YI is [inaudible]. Oh, yes. Dialogue: 0,1:02:11.54,1:02:18.54,Default,,0000,0000,0000,,Thanks. 
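The terms inside the sum aren't spelled out in the captions; plugging the compact expression for P of Y given X into log L of theta gives the usual form below, sketched with NumPy on a small made-up data set.

```python
import numpy as np

def log_likelihood(theta, X, y):
    """l(theta) = sum_i [ y_i * log h(x_i) + (1 - y_i) * log(1 - h(x_i)) ],
    with h(x) = 1 / (1 + exp(-theta^T x))."""
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

# Tiny made-up binary data set; the first column is the intercept feature x_0 = 1.
X = np.array([[1.0, 0.5], [1.0, 2.3], [1.0, -1.7], [1.0, 3.1]])
y = np.array([0.0, 1.0, 0.0, 1.0])
print(log_likelihood(np.array([-1.0, 1.0]), X, y))
```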
Dialogue: 0,1:02:21.52,1:02:27.24,Default,,0000,0000,0000,,So having maximized this function - well, it turns out we can actually apply Dialogue: 0,1:02:27.24,1:02:28.76,Default,,0000,0000,0000,,the same gradient Dialogue: 0,1:02:28.76,1:02:32.75,Default,,0000,0000,0000,,descent algorithm that we learned. That was the first algorithm we used Dialogue: 0,1:02:32.75,1:02:34.98,Default,,0000,0000,0000,,to minimize the quadratic function. Dialogue: 0,1:02:34.98,1:02:37.21,Default,,0000,0000,0000,,And you remember, when we talked about least squares, Dialogue: 0,1:02:37.21,1:02:40.12,Default,,0000,0000,0000,,the first algorithm we used to minimize the quadratic Dialogue: 0,1:02:40.12,1:02:41.06,Default,,0000,0000,0000,,error function Dialogue: 0,1:02:41.06,1:02:42.65,Default,,0000,0000,0000,,was gradient descent. Dialogue: 0,1:02:42.65,1:02:45.41,Default,,0000,0000,0000,,So we can actually use exactly the same algorithm Dialogue: 0,1:02:45.41,1:02:48.04,Default,,0000,0000,0000,,to maximize the log likelihood. Dialogue: 0,1:02:48.04,1:02:48.75,Default,,0000,0000,0000,, Dialogue: 0,1:02:48.75,1:02:51.36,Default,,0000,0000,0000,,And you remember, that algorithm was just Dialogue: 0,1:02:51.36,1:02:54.61,Default,,0000,0000,0000,,repeatedly take the value of theta Dialogue: 0,1:02:54.61,1:02:56.49,Default,,0000,0000,0000,,and you replace it with Dialogue: 0,1:02:56.49,1:02:58.92,Default,,0000,0000,0000,,the previous value of theta plus Dialogue: 0,1:02:58.92,1:03:01.96,Default,,0000,0000,0000,,a learning rate alpha Dialogue: 0,1:03:01.96,1:03:03.94,Default,,0000,0000,0000,,times Dialogue: 0,1:03:03.94,1:03:08.34,Default,,0000,0000,0000,,the gradient of the objective function - the log likelihood - with respect to Dialogue: 0,1:03:08.34,1:03:09.64,Default,,0000,0000,0000,,theta. Okay? Dialogue: 0,1:03:09.64,1:03:13.87,Default,,0000,0000,0000,,One small change is that because previously we were trying to minimize Dialogue: 0,1:03:13.87,1:03:14.65,Default,,0000,0000,0000,, Dialogue: 0,1:03:14.65,1:03:16.100,Default,,0000,0000,0000,,the quadratic error term. Dialogue: 0,1:03:16.100,1:03:20.36,Default,,0000,0000,0000,,Today we're trying to maximize rather than minimize. So rather than having a minus Dialogue: 0,1:03:20.36,1:03:23.19,Default,,0000,0000,0000,,sign we have a plus sign. So this is Dialogue: 0,1:03:23.19,1:03:24.07,Default,,0000,0000,0000,,just gradient ascent, Dialogue: 0,1:03:24.07,1:03:25.49,Default,,0000,0000,0000,,but for the Dialogue: 0,1:03:25.49,1:03:27.56,Default,,0000,0000,0000,,maximization rather than the minimization. Dialogue: 0,1:03:27.56,1:03:32.38,Default,,0000,0000,0000,,So we actually call this gradient ascent and it's really the same Dialogue: 0,1:03:32.38,1:03:35.39,Default,,0000,0000,0000,,algorithm. Dialogue: 0,1:03:35.39,1:03:36.84,Default,,0000,0000,0000,,So to figure out Dialogue: 0,1:03:36.84,1:03:41.62,Default,,0000,0000,0000,,what this gradient is - so in order to derive gradient ascent, Dialogue: 0,1:03:41.62,1:03:43.72,Default,,0000,0000,0000,,what you need to do is Dialogue: 0,1:03:43.72,1:03:48.07,Default,,0000,0000,0000,,compute the partial derivatives of your objective function with respect to Dialogue: 0,1:03:48.07,1:03:52.73,Default,,0000,0000,0000,,each of your parameters theta J, right?
Dialogue: 0,1:03:52.73,1:03:58.22,Default,,0000,0000,0000,,It turns out that Dialogue: 0,1:03:58.22,1:04:00.27,Default,,0000,0000,0000,,if you actually Dialogue: 0,1:04:00.27,1:04:02.62,Default,,0000,0000,0000,,compute this partial derivative - Dialogue: 0,1:04:02.62,1:04:09.54,Default,,0000,0000,0000,,so you take this formula, this L of theta, which is - oh, got that wrong too. If Dialogue: 0,1:04:09.54,1:04:13.88,Default,,0000,0000,0000,,you take this lower case l theta, if you take the log likelihood of theta, Dialogue: 0,1:04:13.88,1:04:17.16,Default,,0000,0000,0000,,and if you take its partial derivative with Dialogue: 0,1:04:17.16,1:04:19.48,Default,,0000,0000,0000,,respect to theta J Dialogue: 0,1:04:19.48,1:04:21.55,Default,,0000,0000,0000,,you find that Dialogue: 0,1:04:21.55,1:04:27.12,Default,,0000,0000,0000,,this is equal to - Dialogue: 0,1:04:27.12,1:04:29.83,Default,,0000,0000,0000,,let's see. Okay? And, Dialogue: 0,1:04:29.83,1:04:36.83,Default,,0000,0000,0000,,I Dialogue: 0,1:04:45.94,1:04:47.31,Default,,0000,0000,0000,,don't Dialogue: 0,1:04:47.31,1:04:50.33,Default,,0000,0000,0000,,know, the derivation isn't terribly complicated, but in Dialogue: 0,1:04:50.33,1:04:53.90,Default,,0000,0000,0000,,the interest of saving you watching me write down a couple of Dialogue: 0,1:04:53.90,1:04:55.79,Default,,0000,0000,0000,,blackboards full of math I'll just write Dialogue: 0,1:04:55.79,1:04:57.20,Default,,0000,0000,0000,,down the final answer. But Dialogue: 0,1:04:57.20,1:04:58.95,Default,,0000,0000,0000,,the way you get this is you Dialogue: 0,1:04:58.95,1:05:00.18,Default,,0000,0000,0000,,just take those, plug Dialogue: 0,1:05:00.18,1:05:04.43,Default,,0000,0000,0000,,in the definition for H subscript theta as a function of XI, and take derivatives, Dialogue: 0,1:05:04.43,1:05:06.39,Default,,0000,0000,0000,,and work through the algebra Dialogue: 0,1:05:06.39,1:05:08.44,Default,,0000,0000,0000,,it turns out it'll simplify Dialogue: 0,1:05:08.44,1:05:13.22,Default,,0000,0000,0000,,down to this formula. Okay? Dialogue: 0,1:05:13.22,1:05:15.07,Default,,0000,0000,0000,,And so Dialogue: 0,1:05:15.07,1:05:19.13,Default,,0000,0000,0000,,what that gives you is that gradient ascent Dialogue: 0,1:05:19.13,1:05:21.67,Default,,0000,0000,0000,,is the following Dialogue: 0,1:05:21.67,1:05:24.15,Default,,0000,0000,0000,,rule. Theta J gets updated as theta Dialogue: 0,1:05:24.15,1:05:25.90,Default,,0000,0000,0000,,J Dialogue: 0,1:05:25.90,1:05:27.84,Default,,0000,0000,0000,,plus alpha Dialogue: 0,1:05:27.84,1:05:34.84,Default,,0000,0000,0000,,times this. Okay? Dialogue: 0,1:05:46.100,1:05:50.23,Default,,0000,0000,0000,,Does this look familiar to anyone? Do you Dialogue: 0,1:05:50.23,1:05:56.37,Default,,0000,0000,0000,,remember seeing this formula at the last lecture? Right. Dialogue: 0,1:05:56.37,1:05:59.35,Default,,0000,0000,0000,,So when I worked out batch gradient descent Dialogue: 0,1:05:59.35,1:06:02.45,Default,,0000,0000,0000,,for least squares regression I, Dialogue: 0,1:06:02.45,1:06:06.27,Default,,0000,0000,0000,,actually, wrote down Dialogue: 0,1:06:06.27,1:06:07.64,Default,,0000,0000,0000,,exactly the same thing, or maybe Dialogue: 0,1:06:07.64,1:06:11.42,Default,,0000,0000,0000,,there's a minus sign, and that's about it. But I, actually, had Dialogue: 0,1:06:11.42,1:06:13.91,Default,,0000,0000,0000,,exactly the same learning rule last time Dialogue: 0,1:06:13.91,1:06:19.92,Default,,0000,0000,0000,,for least squares regression, Dialogue: 0,1:06:19.92,1:06:23.51,Default,,0000,0000,0000,,right?
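The final formula on the board isn't transcribed here; the standard result, written out in full in the lecture notes, is that the partial derivative of the log likelihood with respect to theta J is the sum over I of (YI minus H theta of XI) times XIJ. A sketch of the resulting batch gradient ascent rule, assuming NumPy, with made-up data and a hypothetical learning rate and step count:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_ascent(X, y, alpha=0.1, steps=500):
    """Batch gradient ascent on the log likelihood.
    Update: theta_j := theta_j + alpha * sum_i (y_i - h_theta(x_i)) * x_ij,
    i.e. theta := theta + alpha * X^T (y - h)."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        h = sigmoid(X @ theta)
        theta = theta + alpha * (X.T @ (y - h))
    return theta

# Made-up binary data with an intercept column x_0 = 1.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = gradient_ascent(X, y, alpha=0.1, steps=500)
print(theta, sigmoid(X @ theta))   # predicted probabilities should track the labels
```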
Is this the same learning algorithm then? So what's different? How come I was making Dialogue: 0,1:06:23.51,1:06:25.50,Default,,0000,0000,0000,,all that noise earlier about Dialogue: 0,1:06:25.50,1:06:30.01,Default,,0000,0000,0000,,least squares regression being a bad idea for classification problems and then I did Dialogue: 0,1:06:30.01,1:06:33.77,Default,,0000,0000,0000,,a bunch of math and I skipped some steps, but I'm, sort of, claiming at the end they're really the same learning algorithm? [Inaudible] constants? Dialogue: 0,1:06:33.77,1:06:40.77,Default,,0000,0000,0000,,Say that again. [Inaudible] Oh, Dialogue: 0,1:06:44.18,1:06:46.64,Default,,0000,0000,0000,,right. Okay, cool. It's the lowest it - No, exactly. Right. So zero to the same, Dialogue: 0,1:06:46.64,1:06:48.35,Default,,0000,0000,0000,,this is not the same, right? And the Dialogue: 0,1:06:48.35,1:06:49.27,Default,,0000,0000,0000,,reason is, Dialogue: 0,1:06:49.27,1:06:51.70,Default,,0000,0000,0000,,in logistic regression Dialogue: 0,1:06:51.70,1:06:54.17,Default,,0000,0000,0000,,this is different from before, right? Dialogue: 0,1:06:54.17,1:06:55.70,Default,,0000,0000,0000,,The definition Dialogue: 0,1:06:55.70,1:06:58.38,Default,,0000,0000,0000,,of this H subscript theta of XI Dialogue: 0,1:06:58.38,1:06:59.32,Default,,0000,0000,0000,,is not Dialogue: 0,1:06:59.32,1:07:03.28,Default,,0000,0000,0000,,the same as the definition I was using in the previous lecture. Dialogue: 0,1:07:03.28,1:07:06.52,Default,,0000,0000,0000,,And in particular this is no longer theta transpose XI. This is not Dialogue: 0,1:07:06.52,1:07:09.91,Default,,0000,0000,0000,,a linear function anymore. Dialogue: 0,1:07:09.91,1:07:11.91,Default,,0000,0000,0000,,This is a logistic function of theta Dialogue: 0,1:07:11.91,1:07:14.37,Default,,0000,0000,0000,,transpose XI. Okay? Dialogue: 0,1:07:14.37,1:07:17.92,Default,,0000,0000,0000,,So even though this looks cosmetically similar, Dialogue: 0,1:07:17.92,1:07:20.12,Default,,0000,0000,0000,,even though this is similar on the surface, Dialogue: 0,1:07:20.12,1:07:23.74,Default,,0000,0000,0000,,to the Bastrian descent rule I derived last time for Dialogue: 0,1:07:23.74,1:07:25.21,Default,,0000,0000,0000,,least squares regression Dialogue: 0,1:07:25.21,1:07:29.36,Default,,0000,0000,0000,,this is actually a totally different learning algorithm. Okay? Dialogue: 0,1:07:29.36,1:07:32.50,Default,,0000,0000,0000,,And it turns out that there's actually no coincidence that you ended up with the Dialogue: 0,1:07:32.50,1:07:34.58,Default,,0000,0000,0000,,same learning rule. We'll actually Dialogue: 0,1:07:34.58,1:07:35.45,Default,,0000,0000,0000,, Dialogue: 0,1:07:35.45,1:07:39.52,Default,,0000,0000,0000,,talk a bit more about this later when we talk about generalized linear models. Dialogue: 0,1:07:39.52,1:07:42.66,Default,,0000,0000,0000,,But this is one of the most elegant generalized learning models Dialogue: 0,1:07:42.66,1:07:44.52,Default,,0000,0000,0000,,that we'll see later. That Dialogue: 0,1:07:44.52,1:07:48.20,Default,,0000,0000,0000,,even though we're using a different model, you actually ended up with Dialogue: 0,1:07:48.20,1:07:51.38,Default,,0000,0000,0000,,what looks like the same learning algorithm and it's actually no Dialogue: 0,1:07:51.38,1:07:56.45,Default,,0000,0000,0000,,coincidence. Cool. 
Dialogue: 0,1:07:56.45,1:07:59.03,Default,,0000,0000,0000,,One last comment as Dialogue: 0,1:07:59.03,1:08:01.24,Default,,0000,0000,0000,,part of a sort of learning process, Dialogue: 0,1:08:01.24,1:08:02.34,Default,,0000,0000,0000,,over here Dialogue: 0,1:08:02.34,1:08:05.03,Default,,0000,0000,0000,,I said I take the derivatives and I Dialogue: 0,1:08:05.03,1:08:06.73,Default,,0000,0000,0000,,ended up with this line. Dialogue: 0,1:08:06.73,1:08:09.27,Default,,0000,0000,0000,,I didn't want to Dialogue: 0,1:08:09.27,1:08:13.12,Default,,0000,0000,0000,,make you sit through a long algebraic derivation, but Dialogue: 0,1:08:13.12,1:08:14.64,Default,,0000,0000,0000,,later today or later this week, Dialogue: 0,1:08:14.64,1:08:18.57,Default,,0000,0000,0000,,please do go home and look at our lecture notes, where I wrote out Dialogue: 0,1:08:18.57,1:08:20.83,Default,,0000,0000,0000,,the entirety of this derivation in full, Dialogue: 0,1:08:20.83,1:08:23.42,Default,,0000,0000,0000,,and make sure you can follow every single step of Dialogue: 0,1:08:23.42,1:08:26.71,Default,,0000,0000,0000,,how we take partial derivatives of this log likelihood Dialogue: 0,1:08:26.71,1:08:32.07,Default,,0000,0000,0000,,to get this formula over here. Okay? By the way, for those who are Dialogue: 0,1:08:32.07,1:08:36.38,Default,,0000,0000,0000,,interested in seriously mastering the machine learning material, Dialogue: 0,1:08:36.38,1:08:40.35,Default,,0000,0000,0000,,when you go home and look at the lecture notes it will actually be very easy for most Dialogue: 0,1:08:40.35,1:08:41.55,Default,,0000,0000,0000,,of you to look through Dialogue: 0,1:08:41.55,1:08:44.60,Default,,0000,0000,0000,,the lecture notes and read through every line and go yep, that makes sense, that makes sense, that makes sense, Dialogue: 0,1:08:44.60,1:08:45.09,Default,,0000,0000,0000,,and, Dialogue: 0,1:08:45.09,1:08:49.54,Default,,0000,0000,0000,,sort of, say cool. I see how you get this line. Dialogue: 0,1:08:49.54,1:08:53.16,Default,,0000,0000,0000,,You want to make sure you really understand the material. My concrete Dialogue: 0,1:08:53.16,1:08:55.05,Default,,0000,0000,0000,,suggestion to you would be to go home, Dialogue: 0,1:08:55.05,1:08:57.78,Default,,0000,0000,0000,,read through the lecture notes, check every line, Dialogue: 0,1:08:57.78,1:09:02.16,Default,,0000,0000,0000,,and then to cover up the derivation and see if you can derive it yourself, right? So Dialogue: 0,1:09:02.16,1:09:06.53,Default,,0000,0000,0000,,in general, that's usually good advice for studying technical Dialogue: 0,1:09:06.53,1:09:09.03,Default,,0000,0000,0000,,material like machine learning. Which is, if you work through a proof Dialogue: 0,1:09:09.03,1:09:11.24,Default,,0000,0000,0000,,and you think you understood every line, Dialogue: 0,1:09:11.24,1:09:14.13,Default,,0000,0000,0000,,the way to make sure you really understood it is to cover it up and see Dialogue: 0,1:09:14.13,1:09:17.46,Default,,0000,0000,0000,,if you can rederive the entire thing yourself. This is actually a great way because I Dialogue: 0,1:09:17.46,1:09:19.68,Default,,0000,0000,0000,,did this a lot when I was trying to study Dialogue: 0,1:09:19.68,1:09:22.24,Default,,0000,0000,0000,,various pieces of machine learning Dialogue: 0,1:09:22.24,1:09:26.12,Default,,0000,0000,0000,,theory and various proofs.
And this is actually a great way to study, because you cover up Dialogue: 0,1:09:26.12,1:09:28.42,Default,,0000,0000,0000,,the derivations and see if you can do them yourself Dialogue: 0,1:09:28.42,1:09:32.53,Default,,0000,0000,0000,,without looking at the original derivation. All right. Dialogue: 0,1:09:32.53,1:09:36.69,Default,,0000,0000,0000,, Dialogue: 0,1:09:36.69,1:09:39.95,Default,,0000,0000,0000,,I probably won't get to Newton's Method today. I just Dialogue: 0,1:09:39.95,1:09:46.95,Default,,0000,0000,0000,,want to say Dialogue: 0,1:09:55.25,1:09:58.16,Default,,0000,0000,0000,,- take one quick digression to talk about Dialogue: 0,1:09:58.16,1:09:59.48,Default,,0000,0000,0000,,one more algorithm, Dialogue: 0,1:09:59.48,1:10:01.12,Default,,0000,0000,0000,,which I was sort Dialogue: 0,1:10:01.12,1:10:08.12,Default,,0000,0000,0000,,of alluding to earlier, Dialogue: 0,1:10:09.17,1:10:11.53,Default,,0000,0000,0000,,which is the perceptron Dialogue: 0,1:10:11.53,1:10:13.22,Default,,0000,0000,0000,,algorithm, right? So Dialogue: 0,1:10:13.22,1:10:13.76,Default,,0000,0000,0000,,I'm Dialogue: 0,1:10:13.76,1:10:16.84,Default,,0000,0000,0000,,not gonna say a whole lot about the perceptron algorithm, but this is something that we'll come Dialogue: 0,1:10:16.84,1:10:20.34,Default,,0000,0000,0000,,back to later. Later this quarter Dialogue: 0,1:10:20.34,1:10:23.82,Default,,0000,0000,0000,,we'll talk about learning theory. Dialogue: 0,1:10:23.82,1:10:27.15,Default,,0000,0000,0000,,So in logistic regression we said that G of Z made, sort Dialogue: 0,1:10:27.15,1:10:27.92,Default,,0000,0000,0000,,of, Dialogue: 0,1:10:27.92,1:10:30.15,Default,,0000,0000,0000,,my hypothesis output values Dialogue: 0,1:10:30.15,1:10:32.80,Default,,0000,0000,0000,,that were real numbers between zero and one. Dialogue: 0,1:10:32.80,1:10:37.11,Default,,0000,0000,0000,,The question is, what if you want to force G of Z to output Dialogue: 0,1:10:37.11,1:10:38.89,Default,,0000,0000,0000,,values that are Dialogue: 0,1:10:38.89,1:10:39.19,Default,,0000,0000,0000,,either Dialogue: 0,1:10:39.19,1:10:41.40,Default,,0000,0000,0000,,zero or one? Dialogue: 0,1:10:41.40,1:10:43.86,Default,,0000,0000,0000,,So the Dialogue: 0,1:10:43.86,1:10:46.21,Default,,0000,0000,0000,,perceptron algorithm defines G of Z Dialogue: 0,1:10:46.21,1:10:48.09,Default,,0000,0000,0000,,to be this. Dialogue: 0,1:10:48.09,1:10:52.59,Default,,0000,0000,0000,, Dialogue: 0,1:10:52.59,1:10:54.30,Default,,0000,0000,0000,,So the picture is - or Dialogue: 0,1:10:54.30,1:11:01.30,Default,,0000,0000,0000,,the cartoon is, rather than this sigmoid function, G of Dialogue: 0,1:11:02.35,1:11:07.70,Default,,0000,0000,0000,,Z now looks like this step function that you were asking about earlier. Dialogue: 0,1:11:07.70,1:11:13.58,Default,,0000,0000,0000,,And same as before, we can use H subscript theta of X equals G of theta transpose X. Okay? So Dialogue: 0,1:11:13.58,1:11:14.31,Default,,0000,0000,0000,,this Dialogue: 0,1:11:14.31,1:11:17.57,Default,,0000,0000,0000,,is actually - everything is exactly the same as before, Dialogue: 0,1:11:17.57,1:11:20.39,Default,,0000,0000,0000,,except that G of Z is now the step function.
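As an illustration of the definition just described, here is a minimal sketch of the perceptron's G and the resulting hypothesis, assuming NumPy and the common convention that the step turns on at z greater than or equal to zero; the exact tie-breaking at zero is an assumption, since the board itself is not reproduced in this transcript.

    import numpy as np

    def g_step(z):
        # perceptron: g(z) is 1 when z >= 0 and 0 otherwise, a step function instead of the sigmoid
        return np.where(z >= 0, 1.0, 0.0)

    def h_perceptron(theta, x):
        # same template as before, h_theta(x) = g(theta^T x), but g is now the step function
        return g_step(theta @ x)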
Dialogue: 0,1:11:20.39,1:11:21.54,Default,,0000,0000,0000,,It Dialogue: 0,1:11:21.54,1:11:24.96,Default,,0000,0000,0000,,turns out there's this learning rule called the perceptron learning rule that actually looks Dialogue: 0,1:11:24.96,1:11:28.43,Default,,0000,0000,0000,,just the same as the classic gradient ascent Dialogue: 0,1:11:28.43,1:11:30.61,Default,,0000,0000,0000,,for logistic regression. Dialogue: 0,1:11:30.61,1:11:32.54,Default,,0000,0000,0000,,And the learning rule is Dialogue: 0,1:11:32.54,1:11:39.54,Default,,0000,0000,0000,,given by this. Okay? Dialogue: 0,1:11:44.35,1:11:49.80,Default,,0000,0000,0000,,So it looks just like the Dialogue: 0,1:11:49.80,1:11:51.02,Default,,0000,0000,0000,,classic gradient ascent rule Dialogue: 0,1:11:51.02,1:11:53.66,Default,,0000,0000,0000,, Dialogue: 0,1:11:53.66,1:11:56.21,Default,,0000,0000,0000,,for logistic regression. Dialogue: 0,1:11:56.21,1:12:00.78,Default,,0000,0000,0000,,So this is a very different flavor of algorithm than least squares regression and logistic Dialogue: 0,1:12:00.78,1:12:01.67,Default,,0000,0000,0000,,regression, Dialogue: 0,1:12:01.67,1:12:05.89,Default,,0000,0000,0000,,and, in particular, because it only outputs values that are either zero or one, it Dialogue: 0,1:12:05.89,1:12:09.37,Default,,0000,0000,0000,,turns out it's very difficult to endow this algorithm with Dialogue: 0,1:12:09.37,1:12:11.84,Default,,0000,0000,0000,,probabilistic semantics. And this Dialogue: 0,1:12:11.84,1:12:18.84,Default,,0000,0000,0000,,is, again, even though - oh, excuse me. Right there. Okay. Dialogue: 0,1:12:19.64,1:12:23.66,Default,,0000,0000,0000,,And even though this learning rule, again, looks cosmetically very similar to Dialogue: 0,1:12:23.66,1:12:25.71,Default,,0000,0000,0000,,what we have in logistic regression, this is actually Dialogue: 0,1:12:25.71,1:12:28.32,Default,,0000,0000,0000,,a very different type of learning rule Dialogue: 0,1:12:28.32,1:12:31.42,Default,,0000,0000,0000,,than the others we've seen Dialogue: 0,1:12:31.42,1:12:33.63,Default,,0000,0000,0000,,in this class. So Dialogue: 0,1:12:33.63,1:12:36.97,Default,,0000,0000,0000,,because this is such a simple learning algorithm, right? It just Dialogue: 0,1:12:36.97,1:12:41.03,Default,,0000,0000,0000,,computes theta transpose X and then you threshold and then your output is zero or one. Dialogue: 0,1:12:41.03,1:12:42.91,Default,,0000,0000,0000,,This is - Dialogue: 0,1:12:42.91,1:12:45.99,Default,,0000,0000,0000,,right. So this is a simpler algorithm than logistic regression, I think. Dialogue: 0,1:12:45.99,1:12:49.19,Default,,0000,0000,0000,,When we talk about learning theory later in this class, Dialogue: 0,1:12:49.19,1:12:54.59,Default,,0000,0000,0000,,the simplicity of this algorithm will let us come back and use it as a building block. Okay? Dialogue: 0,1:12:54.59,1:12:56.55,Default,,0000,0000,0000,,But that's all I want to say about this algorithm for now.
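For reference, the perceptron learning rule described above has the same per-example form as the logistic update, theta_j := theta_j + alpha times (y^(i) - h_theta(x^(i))) times x_j^(i), except that h_theta now thresholds to zero or one. A minimal sketch, with illustrative names, not the lecture's own code:

    import numpy as np

    def perceptron_update(theta, x, y, alpha=0.1):
        # One perceptron update on a single example (x, y), with y in {0, 1}.
        # Because h_theta(x) is 0 or 1, each update either leaves theta alone
        # (prediction correct) or moves it by +/- alpha * x (prediction wrong).
        prediction = 1.0 if theta @ x >= 0 else 0.0   # step-function hypothesis h_theta(x)
        return theta + alpha * (y - prediction) * x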