This presentation is delivered by the Stanford Center for Professional Development.

Okay, so welcome back. And what I want to do today is talk about Newton's method, an algorithm for fitting models like logistic regression, and then we'll talk about exponential family distributions and generalized linear models. It's a very nice class of ideas that will tie together the logistic regression and the ordinary least squares models that we've seen. So hopefully I'll get to that today.

So throughout the previous lecture and this lecture, we're starting to use increasingly large amounts of material on probability. So if you'd like a refresher on the foundations of probability - if you're not sure whether you quite have the prerequisites for this class in terms of a background in probability and statistics - then the discussion section taught this week by the TAs will review the prerequisite probability. At the same discussion section, the TAs will also briefly go over the Matlab and Octave notation, which you'll need to use for your problem sets. And so if any of you want to see a review of the probability and statistics prerequisites, or if you want a short tutorial on Matlab and Octave, please come to the next discussion section.

All right. So just to recap briefly, towards the end of the last lecture I talked about the logistic regression model - which was an algorithm for classification - where we had that P of Y equals one given X, parameterized by theta, under this model,
was one over one plus e to the minus theta transpose X. And then you can write down the log-likelihood of the parameters given the training set, which was that. And by taking the derivatives of this, you can derive a gradient ascent rule for finding the maximum likelihood estimate of the parameter theta for this logistic regression model.

And so last time I wrote down the learning rule for batch gradient ascent; the version for stochastic gradient ascent, where you look at just one training example at a time, would be like this, okay. So last time I wrote down batch gradient ascent; this is stochastic gradient ascent.

So if you want to fit a logistic regression model, meaning find the value of theta that maximizes this log likelihood, stochastic gradient ascent or batch gradient ascent is a perfectly fine algorithm to use. But what I want to do is talk about a different algorithm for fitting models like logistic regression, and this will be an algorithm that will, I guess, often run much faster than gradient descent. And this algorithm is called Newton's method.
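Before moving on to Newton's method, here is a minimal sketch of the stochastic gradient ascent rule above, written in Python rather than the Matlab/Octave used in the course; the learning rate alpha, the epoch count, and the data shapes are illustrative assumptions rather than anything from the lecture:

    import numpy as np

    def sigmoid(z):
        # hypothesis h_theta(x) = 1 / (1 + exp(-theta^T x))
        return 1.0 / (1.0 + np.exp(-z))

    def stochastic_gradient_ascent(X, y, alpha=0.1, epochs=100):
        # X is an (m, n) design matrix; y holds m labels in {0, 1}.
        theta = np.zeros(X.shape[1])          # initialize parameters to zero
        for _ in range(epochs):
            for i in range(X.shape[0]):       # one training example at a time
                h = sigmoid(X[i] @ theta)
                # ascend the log-likelihood: theta := theta + alpha * (y_i - h) * x_i
                theta += alpha * (y[i] - h) * X[i]
        return theta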
And before we describe Newton's method, let me ask you to consider a different problem first, which is: let's say you have a function f of theta, and let's say you want to find the value of theta so that f of theta is equal to zero. We'll start with that, and then we'll sort of slowly change this until it becomes an algorithm for fitting maximum likelihood models, like logistic regression. So - let's see. I guess that works. Okay, so let's say that's my function f. This is my horizontal axis of theta, a plot of f of theta, and we're really trying to find the value of theta at which f of theta is equal to zero - this is the horizontal axis.

So here's the algorithm. I'm going to initialize theta to some value, which we'll call theta superscript zero. And then here's what Newton's method does: we're going to evaluate the function f at that value of theta, then we'll compute the derivative of f, and we'll use the linear approximation to the function f at that value of theta. So in particular, I'm going to take the tangent to my function - hope that makes sense - I'm going to take the tangent to my function at that point, theta zero, and I'm going to sort of extend this tangent down until it intercepts the horizontal axis. I want to see what value this is, and I'm going to call this theta one, okay. And then so that's one iteration of Newton's method.
And what I'll do then is the same thing with this point: take the tangent down here, and that's two iterations of the algorithm. And then just sort of keep going - that's theta three, and so on, okay.

So let's just go ahead and write down what this algorithm actually does. To go from theta zero to theta one, let me just call this length capital delta. So if you remember the definition of a derivative, the derivative of f evaluated at theta zero - in other words, the slope of this tangent line - is, by the definition of the derivative, equal to this vertical length divided by this horizontal length. The slope of this function is defined as the ratio between this vertical height and this width of the triangle. So that slope is just equal to f of theta zero divided by delta, which implies that delta is equal to f of theta zero divided by f prime of theta zero, okay. And so theta one is therefore theta zero minus delta - minus capital delta - which is therefore just theta zero minus f of theta zero over f prime of theta zero, all right.
And more generally, one iteration of Newton's method proceeds as follows: theta t plus one equals theta t minus f of theta t divided by f prime of theta t. So that's one iteration of Newton's method.

Now, this is an algorithm for finding a value of theta for which f of theta equals zero. And so we apply the same idea to maximizing the log likelihood. So we have a function l of theta, and we want to maximize this function. Well, how do you maximize a function? You set the derivative to zero. So we want theta such that l prime of theta is equal to zero - to maximize this function, we want to find the place where the derivative of the function is equal to zero - and so we just apply the same idea. So we get theta t plus one equals theta t minus l prime of theta t over l double prime of theta t, okay. Because to maximize this function, we just let f be equal to l prime - let f be the derivative of l - and then we want to find the value of theta for which the derivative of l is zero, and which therefore must be a local optimum.

Does this make sense? Any questions about this?

[Student question, inaudible] The answer to that is fairly complicated. There are conditions on f that would guarantee that this will work; they are fairly complicated, and that's more complex than I want to go into now. In practice, this works very well for logistic regression and for the sort of generalized linear models I'll talk about later.
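As a minimal sketch of this scalar update, theta := theta minus l prime of theta over l double prime of theta, here is a Python illustration; the toy function being maximized is an assumption chosen so the answer is easy to verify:

    def newton_maximize(l_prime, l_double_prime, theta0=0.0, iters=10):
        # Newton's method for maximization: find a zero of the derivative
        # by repeating theta := theta - l'(theta) / l''(theta).
        theta = theta0
        for _ in range(iters):
            theta -= l_prime(theta) / l_double_prime(theta)
        return theta

    # Toy example: l(theta) = -(theta - 3)^2 has its maximum at theta = 3.
    print(newton_maximize(lambda t: -2.0 * (t - 3.0), lambda t: -2.0))  # -> 3.0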
[Student question, inaudible] Yeah, it usually doesn't matter. When I implement this, I usually just initialize theta zero to zero - just initialize the parameters to all zeros - and usually this works fine. It's usually not a huge deal how you initialize theta.

[Student question, inaudible] ...or is it just different convergence? Let me say some things about that that'll sort of answer it. All of these algorithms tend not to have convergence problems - all of these algorithms will generally converge, unless you choose too large a learning rate for gradient ascent or something. But the speeds of convergence of these algorithms are very different.

So it turns out that Newton's method is an algorithm that enjoys extremely fast convergence. The technical term is that it enjoys a property called quadratic convergence. Don't worry too much about what that means, but stated informally, it means that asymptotically, every iteration of Newton's method will double the number of significant digits to which your solution is accurate - up to constant factors. Suppose that on a certain iteration your solution is within 0.01 of the optimum, so you have 0.01 error. Then after one iteration, your error will be on the order of 0.0001, and after another iteration, your error will be on the order of 0.00000001. So this is called quadratic convergence, because you essentially get to square the error on every iteration of Newton's method.
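That digit-doubling behavior is easy to see numerically. As a made-up illustration, not one from the lecture, here is Newton's method finding the root of f of theta equals theta squared minus two; the printed error roughly squares on each iteration:

    import math

    theta = 1.0                                   # initial guess
    for i in range(5):
        theta -= (theta**2 - 2.0) / (2.0 * theta)  # Newton step: f / f'
        print(i + 1, abs(theta - math.sqrt(2.0)))  # error roughly squares each time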
Mind you, this is an asymptotic result that holds only when you are pretty close to the optimum anyway, so this is the theoretical result that says it's true, but because of constant factors and so on, it may paint a slightly rosier picture than might be accurate. But the fact is, when I implement Newton's method for logistic regression, it usually converges in a dozen iterations or so for most reasonably sized problems of tens or hundreds of features.

So one thing I should talk about: what I wrote down over there was actually Newton's method for the case of theta being a single real number. The generalization of Newton's method to when theta is a vector, rather than just a real number, is the following - theta t plus one is theta t plus - where before we had the first derivative divided by the second derivative, the appropriate generalization is this, where this is the usual gradient of your objective, and H is a matrix called the Hessian, which is just the matrix of second derivatives, where H i j equals the second partial derivative of l with respect to theta i and theta j, okay.
So just as before we had the first derivative divided by the second derivative, now you have a vector of first derivatives times the inverse of the matrix of second derivatives. So this is just the same thing generalized to multiple dimensions.

So for logistic regression, again, for a reasonable number of features and training examples, when I run this algorithm you usually see convergence anywhere from a handful to a dozen or so iterations. Compared to gradient ascent, this usually means far fewer iterations to converge. Compared to gradient ascent - let's say batch gradient ascent - the disadvantage of Newton's method is that on every iteration you need to invert the Hessian. So the Hessian will be an N-by-N matrix - or an N-plus-one by N-plus-one dimensional matrix - if N is the number of features. And so if you have a large number of features in your learning problem, if you have tens of thousands of features, then inverting H could be a somewhat computationally expensive step. But for smaller, more reasonable numbers of features, this is usually a very fast algorithm.

Question? [A student points out a sign error on the board] Let's see - I think you're right. That should probably be a minus. Yeah, thanks. Yes, that should be a minus - thank you. And the same problem over there, too.

So I wrote down this algorithm to find the maximum likelihood estimate of the parameters for logistic regression - I wrote this down for maximizing a function.
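Here is a sketch of that vector update, theta := theta minus H inverse times the gradient, for the logistic regression log-likelihood. The gradient and Hessian formulas below are the standard ones for this model, written out as an assumption since the lecture leaves them on the board; note that the linear solve against H is the step that gets expensive as the number of features grows:

    import numpy as np

    def newton_logistic(X, y, iters=10):
        # Maximize the logistic regression log-likelihood with Newton's method:
        # theta := theta - H^{-1} * gradient.
        m, n = X.shape
        theta = np.zeros(n)
        for _ in range(iters):
            h = 1.0 / (1.0 + np.exp(-X @ theta))    # predictions for all m examples
            grad = X.T @ (y - h)                    # gradient of the log-likelihood
            H = -X.T @ np.diag(h * (1.0 - h)) @ X   # Hessian, an n-by-n matrix
            theta -= np.linalg.solve(H, grad)       # one Newton step (solve, don't invert)
        return theta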
So I'll leave you to think about this yourself: if I wanted to use Newton's method to minimize a function, how does the algorithm change? All right. So I'll leave you to think about that - in other words, not maximization, but how does the algorithm change if you want to use it for minimization?

Actually, the answer is that it doesn't change. I'll leave you to work out yourself why, okay. All right.

Let's talk about generalized linear models. Let me just give a recap of both of the algorithms we've talked about so far. We've talked about two different algorithms for modeling P of Y given X, parameterized by theta. In one of them, Y was a real number, and we assumed that Y given X has a Gaussian distribution; then we got ordinary least squares linear regression. In the other case, we saw a classification problem, where Y took on a value of either zero or one. In that case - well, what's the most natural distribution over zeros and ones? It's the Bernoulli. The Bernoulli distribution models random variables with two values, and in that case we got logistic regression.

So along the way, one of the questions that came up was: for logistic regression, where on earth did I get the sigmoid function from?
Right - where did this function come from? There are other functions I could have plugged in, but the sigmoid function turns out to be a natural default choice that led us to logistic regression. And what I want to do now is take both of these algorithms and show that they are special cases of a class of algorithms called generalized linear models, and when we derive them as special cases of that class of algorithms, you'll see that the sigmoid function falls out very naturally as well.

So, let's see - just looking for a longer piece of chalk. I should warn you, the ideas in generalized linear models are somewhat complex, so what I'm going to do today is try to point out the key ideas and give you a gist of the entire story. And then some of the details in the math and the derivations I'll leave you to work through by yourselves in the lecture notes, which are posted online.

So let's start with these two distributions, the Bernoulli and the Gaussian. So suppose we have data that is zero-one valued, and we want to model it with a Bernoulli random variable parameterized by phi. So the Bernoulli distribution has the probability of Y equals one, which just equals phi, right. So the parameter phi in the Bernoulli specifies the probability of Y being one.
Now, as you vary the parameter phi, you get different Bernoulli distributions. As you vary the value of phi, you get different probability distributions on Y that have different probabilities of being equal to one. And so I want you to think of this as not one fixed distribution, but as a set - a class of distributions - that you get as you vary phi. And in the same way, if you consider the Gaussian distribution, as you vary the mean you would get different Gaussian distributions. So think of this again as a class, or as a set, of distributions.

And what I want to do now is show that both of these are special cases of the class of distributions that's called the exponential family distributions. And in particular, we'll say that a class of distributions, like the Bernoulli distributions that you get as you vary phi - we'll say the class of distributions is in the exponential family if it can be written in the following form: P of Y, parameterized by eta, is equal to B of Y times the exponential of eta transpose T of Y, minus A of eta, okay. Let me just give some of these terms names, and then I'll say a bit more about what this means.

So eta is called the natural parameter of the distribution, and T of Y is called the sufficient statistic. Usually, for many of the examples we'll see, including the Bernoulli and the Gaussian, T of Y is just equal to Y.
So for most of this lecture you can mentally replace T of Y with just Y, although this won't be true for the very final example we do today; but mentally, think of T of Y as equal to Y.

And so for a given choice of these functions A, B and T - all right, so we're going to sort of fix the forms of the functions A, B and T - this formula defines, again, a set of distributions. It defines a class of distributions that is now parameterized by eta. So again, if we write down specific formulas for A, B and T - specific choices of A, B and T - then as I vary eta I get different distributions. And I'm going to show that the Bernoulli and the Gaussian are special cases of exponential family distributions. And by that I mean that I can choose specific functions A, B and T so that this becomes the formula for the distributions of either a Bernoulli or a Gaussian. And then again, as I vary eta, I'll get Bernoulli distributions with different means, or as I vary eta, I'll get Gaussian distributions with different means, for my fixed choices of A, B and T.

And for those of you that know what a sufficient statistic in statistics is, T of Y actually is a sufficient statistic in the formal sense of a sufficient statistic for a probability distribution - you may have seen it in a statistics class. If you don't know what a sufficient statistic is, don't worry about it; we sort of don't need that property today. Okay.
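The definition can be read off directly as code. As a small illustration - nothing here beyond the formula itself is from the lecture - a generic density in the family, for given choices of the functions b, T and a:

    import numpy as np

    def exp_family_density(y, eta, b, T, a):
        # p(y; eta) = b(y) * exp(eta^T T(y) - a(eta))
        return b(y) * np.exp(np.dot(eta, T(y)) - a(eta))

For scalar eta and T of Y, the dot product is just an ordinary product of real numbers.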
So - oh, one last comment. Often T of Y is equal to Y, and in many of these cases eta is also just a real number. So in many cases, the parameter of this distribution is just a real number, and eta transpose T of Y is just a product of real numbers. So again, that will be true for our first two examples, but not for the last example I'll do today.

So now we'll show that the Bernoulli and the Gaussian are examples of exponential family distributions. We'll start with the Bernoulli. So the Bernoulli distribution with parameter phi - I guess I wrote this down already - P of Y equals one, parameterized by phi, is equal to phi. So the parameter phi specifies the probability that Y equals one. And so my goal now is to choose T, A and B - or to choose A, B and T - so that my formula for the exponential family becomes identical to my formula for the distribution of a Bernoulli.

So the probability of Y, parameterized by phi, is equal to that, all right. And you already saw sort of a similar exponential notation when we talked about logistic regression: the probability of Y being one is phi, the probability of Y being zero is one minus phi, so we can write this compactly as phi to the Y times one minus phi to the one minus Y. So I'm going to take the exponential of the log of this - an exponentiation and taking a log cancel each other out.
And this is equal to E to the Y log phi, plus one minus Y times log one minus phi. And rearranging, that's E to the - log phi over one minus phi - times Y, plus log one minus phi. And so this quantity, log phi over one minus phi, is going to be eta; Y here is T of Y; and this last term, log one minus phi, will be minus A of eta. And then B of Y is just one, so B of Y doesn't matter. Just take a second to look through this and make sure it makes sense. I'll clean another board while you do that.

So now let's write down a few more things. Just copying from the previous board, we had that eta was therefore equal to log phi over one minus phi. And if I take this formula and invert it - if you solve for phi - excuse me, if you solve for phi as a function of eta, which is really the inverse of eta as a function of phi - just invert this formula - you find that phi is one over one plus E to the minus eta. And so somehow the logistic function magically falls out of this. We'll take this even further later.

Again, copying definitions from the previous board, A of eta, I said, is minus log of one minus phi. So again, phi and eta are functions of each other, all right - eta depends on phi, and phi depends on eta. So if I plug in this definition for eta into this - excuse me, plug in this definition for phi into that - I find that A of eta is therefore equal to log of one plus E to the eta. And again, this is just algebra; this is not terribly interesting.
And just to complete the rest of this: T of Y is equal to Y, and B of Y is equal to one, okay. So just to recap what we've done, we've come up with a certain choice of functions A, T and B so that my formula for the exponential family distribution now becomes exactly the formula for the distribution - for the probability mass function - of the Bernoulli distribution. And the natural parameter eta has a certain relationship to the original parameter phi of the Bernoulli.

Question? [Student question, inaudible] Let's see - the second to last one? Oh, this answer is fine. Okay. Let's see - yeah, so this is - well, if you expand this term out, one minus Y times log one minus phi, then one times log one minus phi becomes this, and the other term is minus Y times log one minus phi. And then the minus of a log is the log of one over whatever's inside. So minus Y times log one minus phi becomes Y times log of one over one minus phi. Does that make sense? Yeah. Yeah, cool. Anything else? Yes?

[Student:] Eta is a scalar, isn't it? Up there it's eta transposed, so it can be a vector or - ? Yes. So let's see - in this and the next example, eta will turn out to be a scalar. And so if eta is a scalar and T of Y is a scalar, then eta transpose T of Y is just a real number times a real number - sort of a one-dimensional vector transposed times a one-dimensional vector. Towards the end of today's lecture we'll do just one example where both of these are vectors, but for many distributions these will turn out to be scalars.
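As a quick numerical check of the Bernoulli derivation above - the value phi equals 0.3 is arbitrary - plugging eta equals log phi over one minus phi, T of Y equals Y, A of eta equals log one plus E to the eta, and B of Y equals one into the general form reproduces phi to the Y times one minus phi to the one minus Y:

    import math

    phi = 0.3                                 # arbitrary Bernoulli parameter
    eta = math.log(phi / (1.0 - phi))         # natural parameter
    a = math.log(1.0 + math.exp(eta))         # a(eta), which equals -log(1 - phi)
    for y in (0, 1):
        p_family = math.exp(eta * y - a)      # b(y) = 1, T(y) = y
        p_bernoulli = phi**y * (1.0 - phi)**(1 - y)
        print(y, p_family, p_bernoulli)       # the two columns agree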
[Student question:] The Bernoulli distribution - I mean, it doesn't have zero probability outside of zero and one? I see. So, yeah - for this, let's imagine that we're restricting the domain of the input of the function to be Y equals zero or one. So think of that as maybe an implicit constraint on it: this is a probability mass function for Y equals zero or Y equals one. So write down Y in the set zero, one - let's think of that as an implicit constraint. So - cool.

So this takes the Bernoulli distribution and writes it in the form of an exponential family distribution. Now let me do that very quickly for the Gaussian. I won't do the algebra for the Gaussian; I'll basically just write out the answers.
So consider a normal distribution with mean mu and variance sigma squared. And you remember - was it two lectures ago? No, just the previous lecture - when we were deriving the maximum likelihood estimate for the parameters of ordinary least squares, we showed that the parameter sigma squared didn't matter. When we derived the probabilistic model for least squares, we said that no matter what sigma squared was, we'd end up with the same values of the parameters.

So for the purposes of just today's lecture, and to not have to take account of sigma squared, I'm just going to set sigma squared to be equal to one, okay, so as to not worry about it. The lecture notes talk a little bit more about this, but just to make the math in class a bit easier and simpler today, let's just say that sigma squared equals one; sigma squared is essentially just a scaling factor on the variable Y.

So in that case, the Gaussian density is given by one over root two pi, times E to the minus one-half, Y minus mu, squared. And - well, by a couple of steps of algebra, which I'm not going to do here but which are written out in the lecture notes you can download, this is one over root two pi, E to the minus one-half Y squared, times E to the mu Y minus one-half mu squared, okay.
So I'm just not doing the algebra. And so that first factor, one over root two pi times e to the minus one-half y squared, is b(y); we have T(y) equal to y; and a(eta) is equal to one-half mu squared. Actually, I had originally written a minus sign there; it should be plus one-half. [A student catches the sign, and after a quick exchange the board is corrected.] And because mu is equal to eta, this is just one-half eta squared, okay.

And so this would be a specific choice, again, of a, b and T that expresses the Gaussian density in the form of an exponential family distribution. And in this case, the relationship between eta and mu is that mu is just equal to eta, so the mean of the Gaussian is just equal to the natural parameter of the exponential family distribution.
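Collecting the Gaussian pieces just identified into one display, as a sketch with sigma squared fixed to one:

```latex
p(y;\mu) = \frac{1}{\sqrt{2\pi}}\,\exp\!\Big(-\tfrac{1}{2}(y-\mu)^2\Big)
         = \underbrace{\tfrac{1}{\sqrt{2\pi}}\,e^{-y^2/2}}_{b(y)}
           \cdot \exp\!\Big( \underbrace{\mu}_{\eta}\,\underbrace{y}_{T(y)}
           - \underbrace{\tfrac{1}{2}\mu^2}_{a(\eta)} \Big),
```

so eta = mu, T(y) = y, a(eta) = one-half eta squared, and b(y) = (1 / root two pi) e^(-y^2 / 2).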
And so here's a result that you may have seen if you've taken an undergrad statistics class: it turns out that most of the "textbook distributions," not all, but most of them, can be written in the form of an exponential family distribution.

So you saw the Gaussian, the normal distribution. It turns out the multivariate normal distribution, which is a generalization of the Gaussian to high-dimensional random vectors, is also in the exponential family. You saw the Bernoulli is an exponential family distribution, and it turns out the multinomial is too. The Bernoulli models outcomes over zero and one, like coin tosses with two outcomes; the multinomial models outcomes over K possible values. That's also an exponential family distribution.

You may have heard of the Poisson distribution. The Poisson distribution is often used for modeling counts, things like the number of radioactive decays in a sample, the number of customers to your website, or the number of visitors arriving at a store. The Poisson distribution is also in the exponential family.

So are the gamma and the exponential distributions, if you've heard of them. The gamma and the exponential distributions are distributions over the positive numbers, so they're often used to model intervals: if you're standing at the bus stop and you want to ask, "When is the next bus likely to arrive? How long do I have to wait for my bus to arrive?", you'd often model that with a gamma distribution or an exponential distribution. Those are also in the exponential family.

Even more esoteric distributions, like the beta and the Dirichlet distributions, which are probability distributions over fractions, or probability distributions over probability distributions, and also things like the Wishart distribution, which is a distribution over covariance matrices: all of these, it turns out, can be written in the form of exponential family distributions.
Well, in the problem sets we'll ask you to take one of these distributions, write it in the form of an exponential family distribution, and derive a generalized linear model for it, okay.

Which brings me to the next topic: having chosen an exponential family distribution, how do you use it to derive a generalized linear model?

So generalized linear models are often abbreviated GLMs. And I'm going to write down three assumptions. You can think of them as assumptions, or you can think of them as design choices, that will then allow me to sort of turn a crank and come up with a generalized linear model.

So the first one is: I'm going to assume that given my input x and my parameters theta, the variable y, the output y, or the response variable y I'm trying to predict, is distributed exponential family with some natural parameter eta. And so this means that there is some specific choice of those functions a, b and T so that the conditional distribution of y given x, parameterized by theta, is exponential family with parameter eta, where eta may depend on x in some way.
So for example, if you're trying to predict how many customers will arrive at your website, you may choose to model the number of hits on your website with a Poisson distribution, since the Poisson distribution is natural for modeling count data. And so you may choose the exponential family distribution here to be the Poisson distribution.

The second assumption is that, given x, our goal is to output the expected value of y given x. So in the website example, given a set of features about, say, whether there were any promotions, whether there were sales, how many people linked to your website, or whatever, I'm going to assume that our goal in our prediction problem is to estimate the expected number of people that will arrive at your website on a given day.

So in other words, we're saying that I want h of x to be equal to, oh, excuse me, I actually meant to write T of y here: my goal is to get my learning algorithm's hypothesis to output the expected value of T(y) given x. But again, for most of the examples, T(y) is just equal to y, and so for most of the examples, our goal is to get our learning algorithm to output the expected value of y given x, because T(y) is usually equal to y. Yes? [Student asks whether this is the same T(y) as before.] Yes, same thing, right; T(y) is a sufficient statistic, the same T(y).

And lastly, there's one more piece I wrote down; the first two are assumptions.
This last one you might maybe want to think of as a design choice rather than an assumption. I've assumed that the distribution of y given x is exponential family with some parameter eta; so the number of visitors to the website on any given day will be Poisson with some parameter. And the last decision I need to make is: what is the relationship between my input features and this parameter eta parameterizing my Poisson distribution or whatever?

In this last step, I'm going to make the assumption, or really the design choice, that the relationship between eta and my inputs x is linear, and in particular that it's governed by eta equals theta transpose x. And the reason I make this design choice is that it will allow me to turn the crank of the generalized linear model machinery and come up with very nice algorithms for fitting, say, Poisson regression models, or for performing regression with gamma-distributed or exponential-distributed outputs and so on.

One note: eta equals theta transpose x works for the case where eta is a real number. For the more general case, you would have eta sub i equals theta sub i transpose x if eta is a vector rather than a real number. But again, for most of the examples, eta will just be a real number.
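To summarize, the three GLM ingredients just laid out are, in the lecture's notation:

```latex
\begin{aligned}
&(1)\;\; y \mid x;\theta \;\sim\; \mathrm{ExponentialFamily}(\eta) \\
&(2)\;\; h_\theta(x) \;=\; \mathbb{E}\,[\,T(y)\mid x;\theta\,] \qquad (\text{and usually } T(y)=y) \\
&(3)\;\; \eta \;=\; \theta^{\top}x \qquad (\text{or } \eta_i = \theta_i^{\top}x \text{ when } \eta \text{ is a vector})
\end{aligned}
```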
All right. So let's work through the Bernoulli example. We'll say y given x, parameterized by theta, is distributed exponential family with natural parameter eta. And for the Bernoulli distribution, I'm going to choose a, b and T to be the specific forms that cause the exponential family to become the Bernoulli distribution. This is the first example we worked through just now.

And we also have: for any fixed value of x and theta, my hypothesis, my learning algorithm, will make a prediction, will output h theta of x, which, by my second assumption, should be the expected value of y given x, parameterized by theta. And where y can take on only the values zero and one, the expected value of y is just equal to the probability that y is equal to one; the expected value of a Bernoulli random variable is just the probability that it's equal to one.

And so the probability that y equals one is just equal to phi, because that's the parameter of my Bernoulli distribution: phi is, by definition, the probability that my Bernoulli variable takes the value one. Which we worked out previously: phi was one over one plus e to the negative eta.
So we worked this out on a previous board. When we wrote down the Bernoulli distribution in the form of an exponential family, we worked out what the relationship was between phi and eta, and it was this. So we worked out that the relationship between the expected value of y and eta was this relationship.

And lastly, because we made the design choice, or the assumption, that eta and theta transpose x are linearly related, eta equals theta transpose x, this is therefore equal to one over one plus e to the minus theta transpose x.

And so that's how I come up with the logistic regression algorithm: when you have a Bernoulli response variable y that takes on two values, and you choose to model it with the Bernoulli distribution. Does this make sense? Raise your hand if this makes sense. Yeah, okay, cool.

So I hope you see the ease of use of this, or sort of the power of this. The only decision I made was really: say I have a new machine-learning problem and I'm trying to predict the value of a variable y that happens to take on two values. Then the only decision I need to make is to choose the Bernoulli distribution. I say I'm going to assume that given x and theta, y is distributed Bernoulli. That's the only decision I made.
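Here is that chain of equalities as a minimal code sketch: the hypothesis equals the expected value of y given x, which equals phi, which equals one over one plus e to the minus theta transpose x. The function name is mine, just for illustration:

```python
import numpy as np

def bernoulli_glm_hypothesis(theta, x):
    """h_theta(x) = E[y | x; theta] for the Bernoulli GLM.

    Design choice (3):  eta = theta^T x.
    Bernoulli response: phi = 1 / (1 + e^(-eta)) = P(y = 1 | x; theta).
    The composition is exactly the logistic regression hypothesis.
    """
    eta = theta @ x                      # natural parameter (a scalar)
    return 1.0 / (1.0 + np.exp(-eta))    # expected value of y given x
```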
And then everything else follows automatically, having made the decision to model y given x, parameterized by theta, as Bernoulli. In the same way, you can choose a different distribution, you can choose y as Poisson or y as gamma or y as whatever, and follow a similar process and come up with a different model and a different learning algorithm: a different generalized linear model for whatever learning problem you're faced with.

One tiny little piece of notation: the function g that relates the natural parameter to the expected value of y, which in this case is one over one plus e to the minus eta, is called the canonical response function. And g inverse is called the canonical link function.

These aren't a huge deal, and I won't use this terminology a lot. I'm just mentioning them in case you hear people talk about generalized linear models and they mention canonical response functions or canonical link functions, just so you know what they are. Actually, many texts use the reverse convention, calling this g inverse and that g, but this notation turns out to be more consistent with other algorithms in machine learning, so I'm going to use this notation. But I probably won't use the terms canonical response function and canonical link function in lecture a lot; I'm not big on memorizing lots of names of things. I'm just tossing those out there in case you see them elsewhere. Okay.
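In symbols, with the convention just chosen:

```latex
g(\eta) = \mathbb{E}\,[\,T(y);\eta\,] \;\;\text{(canonical response function)},
\qquad g^{-1} \;\;\text{(canonical link function)},
```

so for the Bernoulli, g(eta) = 1 / (1 + e^(-eta)), and for the Gaussian, g(eta) = eta.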
You know what, I think in the interest of time, I'm going to skip over the Gaussian example. But again, just as choosing y to be Bernoulli gave us the derivation of logistic regression, you can do the same thing with the Gaussian distribution and end up with the ordinary least squares model. The problem with the Gaussian is that it's almost so simple that, when you see it for the first time, it's sometimes more confusing than the Bernoulli model, because it looks so simple it seems like it has to be more complicated. So let me just skip that and leave you to read about the Gaussian example in the lecture notes. And what I want to do is actually go through a more complex example.

Question? [Student asks how you choose what theta will be.] Okay, right. So how do you choose what theta will be? We'll get to that at the end. What you have there is the logistic regression model, which is a probabilistic model that assumes the probability of y given x is given by a certain form. And so what you do is you write down the log likelihood of your training set, and find the value of theta that maximizes the log likelihood of the parameters. Does that make sense? I'll say that again towards the end of today's lecture. But for logistic regression, the way you choose theta is exactly maximum likelihood, as we worked out in the previous lecture, using Newton's Method or gradient ascent or whatever. I'll try to do that again for one more example towards the end of today's lecture.
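As a sketch of what that fitting step looks like operationally, here is batch gradient ascent on the logistic log-likelihood. The gradient X^T (y - h) is the standard one for this model; the learning rate and iteration count are illustrative choices of mine, not values from the lecture:

```python
import numpy as np

def fit_logistic_regression(X, y, lr=0.1, n_iters=1000):
    """Maximum likelihood for the Bernoulli GLM via batch gradient ascent.

    X: (m, n) matrix of inputs; y: (m,) vector of labels in {0, 1}.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        h = 1.0 / (1.0 + np.exp(-(X @ theta)))  # h_theta(x) for every example
        theta += lr * (X.T @ (y - h))           # gradient of the log-likelihood
    return theta
```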
So what I want to do is actually use the remaining, I don't know, 19 minutes or so of this class to go through what's probably the most complex example of a generalized linear model that I use. I want to go through this one because it's a little bit trickier than many of the other textbook examples of generalized linear models. So again, what I'm going to do is go through the derivation reasonably quickly and give you the gist of it, and if there are steps I skip or details omitted, I'll leave you to read about them more carefully in the lecture notes.

And what I want to do is talk about the multinomial. The multinomial is the distribution over K possible outcomes. Imagine you're now in a machine-learning problem where the value of y that you're trying to predict can take on K possible outcomes, rather than only two.

Some examples: if you want a learning algorithm to automatically sort emails into the right email folder for you, you may have a dozen email folders you want your algorithm to classify emails into. Or take medical diagnosis: predicting whether a patient has a disease or does not have a disease would be a binary classification problem, but if you think the patient may have one of K diseases, you'd want a learning algorithm to figure out which one of the K diseases your patient has. So there are lots of multi-class classification problems where you have more than two classes.
You model that with the multinomial.

So for logistic regression, I had datasets like these, where you have a training set and you find a decision boundary that separates the two classes. With the multinomial, we're going to entertain the value we're predicting taking on multiple values, so you might now have three classes, and the learning algorithm will learn some way to separate out three classes or more, rather than just two.

So let's write the multinomial in the form of an exponential family distribution. The parameters of a multinomial are phi one, phi two, up to phi K (I'll actually change this in a second), where the probability of y equals i is phi i, right, because there are K possible outcomes. But if I choose this as my parameterization of the multinomial, then my parameters are actually redundant, because these are probabilities, so they have to sum to one. And therefore, for example, I can derive the last parameter, phi K, as one minus phi one, minus and so on, down to minus phi K minus one. So this would be a redundant parameterization; the multinomial as written is over-parameterized. And so for the purposes of this derivation, I'm going to treat the parameters of my multinomial as phi one, phi two, up to phi K minus one, and I won't think of phi K as a parameter.
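Written out, the parameterization just chosen is:

```latex
p(y = i) = \phi_i \quad (i = 1,\dots,K), \qquad
\phi_K \;:=\; 1 - \sum_{i=1}^{K-1} \phi_i ,
```

with only phi one through phi K minus one treated as free parameters.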
So my parameters are just that: I have K minus one parameters parameterizing my multinomial. I'll sometimes write phi K in my derivations as well, but you should think of phi K as just a shorthand for one minus the rest of the parameters, okay.

So it turns out the multinomial is one of the few examples where T(y) is not equal to y. In this case, y is one of K possible values, and T(y) is defined as follows: T(1) is going to be a vector with a one in the first position and zeros everywhere else; T(2) is going to be zero, one, zero, and so on; except that these are going to be (K minus one)-dimensional vectors. So T(K minus one) is going to be zero, zero, ..., zero, one, and T(K) is going to be the vector of all zeros. This is just how I'm choosing to define T(y) in order to write down the multinomial in the form of an exponential family distribution. Again, these are (K minus one)-dimensional vectors.

So this is a good point to introduce one more useful piece of notation, called indicator function notation. I'm going to write a one, and then curly braces, and if I write a true statement inside, then the indicator of that statement is going to be one.
If I write a one and then a false statement inside, then the value of the indicator function is going to be zero. For example, if I write the indicator of two equals three, that's false, and so this is equal to zero; whereas for the indicator of one plus one equals two, I wrote down a true statement inside, and so the indicator of the statement is equal to one. So the indicator function is just a very useful notation for denoting the truth or falsehood of the statement inside.

So, to combine both of these, if I carve out a bit of space here: T(y) is a vector; y is one of K values, and so T(y) is one of these K vectors. If I use (T(y)) sub i to denote the i-th element of the vector T(y), then (T(y)) sub i, the i-th element of the vector T(y), is just equal to the indicator for whether y is equal to i.

Let me clean a couple more boards; take a look at this for a second and make sure you understand all that notation and why this is true. All right. Actually, raise your hand if this equation makes sense to you. Most of you, not all, okay. Just as one kind of example, suppose y is equal to one. Let's see.
Suppose y is equal to one. Then T(y) is equal to this first vector, and therefore the first element of the vector will be one, and the rest of the elements will be equal to zero.

Let me try that again, I'm sorry. Let's say I want to look at the i-th element of the vector T(y), and I want to know whether it's one or zero. Well, the i-th element of the vector T(y) will be equal to one if, and only if, y is equal to i. Because, for example, if y is equal to one, then only the first element of this vector will be one; if y is equal to two, then only the second element of the vector will be one; and so on. So the question of whether the i-th element of this vector T(y) is equal to one is answered by just asking: is y equal to i. Okay. If you're still not quite sure why that's true, go home and think about it a bit more, and take a look at the lecture notes as well; maybe that'll help. For now, just take my word for it.

So let's go ahead and write out the distribution for the multinomial in exponential family form. So p(y) is equal to phi one to the power of indicator y equals one, times phi two to the power of indicator y equals two, and so on, up to phi K to the power of indicator y equals K. And again, phi K is not a parameter of the distribution; phi K is a shorthand for one minus phi one minus phi two minus the rest.
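A tiny sketch of the encoding just defined, with T(y) as a (K minus one)-dimensional indicator vector and T(K) the all-zeros vector (the function name is mine):

```python
import numpy as np

def T(y, K):
    """Map y in {1, ..., K} to a (K-1)-dim vector with (T(y))_i = 1{y == i}.

    T(K) is the all-zeros vector, which is what makes K-1 dimensions enough.
    """
    t = np.zeros(K - 1)
    if y < K:            # y == K encodes as all zeros
        t[y - 1] = 1.0   # 1-indexed class -> 0-indexed position
    return t

# For K = 4: T(1) -> [1,0,0], T(2) -> [0,1,0], T(4) -> [0,0,0]
```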
And so, using this equation relating (T(y)) sub i to the indicator, I can also write this as phi one to the (T(y)) sub one, times phi two to the (T(y)) sub two, dot, dot, dot, times phi K minus one to the (T(y)) sub K minus one, times phi K to the power of one minus the sum from i equals one to K minus one of (T(y)) sub i.

And it turns out, with some steps of algebra that I don't have time to show, you can simplify this into the exponential family form, where eta is a vector, a (K minus one)-dimensional vector. Deriving this is a few steps of algebra that you can work out yourself, but I won't do here. And so, using my definition for T(y), and by choosing eta, a and b this way, I can take my multinomial distribution and write it out in the form of an exponential family distribution.

It turns out also that, just as before, where we had eta as a function of phi and then inverted that to write phi as a function of eta, you can do that here as well. This defines eta as a function of the multinomial distribution's parameters phi. So you can take this relationship between eta and phi and invert it, and write out phi as a function of eta.
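The omitted algebra lands at the standard form given in the lecture notes; filling it in here for reference:

```latex
p(y) = \phi_1^{1\{y=1\}} \cdots \phi_K^{1\{y=K\}}
     = b(y)\, \exp\!\big( \eta^{\top} T(y) - a(\eta) \big),
\qquad
\eta_i = \log\frac{\phi_i}{\phi_K}, \quad
a(\eta) = -\log \phi_K, \quad b(y) = 1 .
```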
And you get that phi i is equal to e to the eta i, over one plus the sum of the e to the eta j. The way you do this: the earlier relationship defines eta as a function of the phi, so if you take it and solve for phi, you end up with this. Again, there are a couple of steps of algebra that I'm just not showing.

And then lastly, using our assumption that the eta i are a linear function of the inputs x, phi i is therefore equal to e to the theta i transpose x, divided by one plus the sum over j equals one to K minus one of e to the theta j transpose x. And this is just using the fact that eta i equals theta i transpose x, which was our earlier design choice from generalized linear models.

So we're just about done. My learning algorithm: I'm going to think of it as outputting the expected value of T(y) given x, parameterized by theta. T(y) was this vector of indicator functions, from indicator y equals one down to indicator y equals K minus one. So I want my learning algorithm to output the expected value of this vector of indicator functions. The expected value of indicator y equals one is just the probability that y equals one, which is given by phi one.
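Here is that response function as a short sketch: the softmax probabilities, where the "one plus" in the denominator plays the role of e to the theta K transpose x under the implicit convention theta K equals zero (the names are mine):

```python
import numpy as np

def softmax_probs(thetas, x):
    """phi_i = exp(theta_i^T x) / (1 + sum_j exp(theta_j^T x)).

    thetas: (K-1, n) array, one parameter vector per class 1..K-1.
    Returns all K class probabilities; the K-th is the leftover mass.
    """
    exp_etas = np.exp(thetas @ x)        # e^(theta_i^T x) for i = 1..K-1
    denom = 1.0 + exp_etas.sum()         # the 1 is e^0, i.e. theta_K = 0
    return np.append(exp_etas / denom,   # phi_1 .. phi_{K-1}
                     1.0 / denom)        # phi_K = 1 - sum of the others
```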
So I have a random variable that's one whenever y is equal to one and zero otherwise, and so the expected value of that indicator, indicator y equals one, is just the probability that y equals one, which is given by phi one. And therefore, by what we worked out earlier, this is e to the theta one transpose x over that same denominator, and similarly for the other components.

And so my learning algorithm will output the probability that y equals one, y equals two, up to y equals K minus one, and these probabilities are going to be parameterized by functions like these.

And so, just to give this algorithm a name: this algorithm is called softmax regression, and it's widely thought of as the generalization of logistic regression, which handles two classes, to the case of K classes rather than two.

And just to be very concrete about what you do: say you have a machine-learning problem and you want to apply softmax regression to it. Generally you'd work through this entire derivation; I think the question you had is about how to fit the parameters. So let's say you have a machine-learning problem, and y takes on one of K classes. What you do is you sit down and say, "Okay, I want to model y as being multinomial given x and theta." So you choose the multinomial as your exponential family. Then you sort of turn the crank.
And everything else I wrote down follows automatically from having made the choice of the multinomial distribution as your exponential family.

And then what you do is you take your training set, (x^(1), y^(1)) up to (x^(m), y^(m)), where the value of each y^(i) takes on one of k possible values, and you find the parameters of the model by maximum likelihood. So you write down the likelihood of the parameters, and you maximize the likelihood.

So what's the likelihood? Well, the likelihood, as usual, is the product over your training set of P of y^(i) given x^(i), parameterized by theta. That's the likelihood, same as we had before. And that's the product over your training set of - let me write these out now - phi one to the indicator y^(i) equals one, times phi two to the indicator y^(i) equals two, dot, dot, dot, up to phi k to the indicator y^(i) equals k. Where, for example, phi one depends on theta through the formula we had just now: e to the theta one, transpose x, over one plus the sum over j of e to the theta j, transpose x. And so phi one here is really a shorthand for that formula, and similarly for phi two and so on, up to phi k, where phi k is one minus all of the others. All right.
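The indicator exponents pick out exactly one phi per training example, so the log of the likelihood above is just the sum over i of log phi_{y^(i)}(x^(i)). A sketch of that, reusing the hypothetical softmax_probs from the earlier sketch, with labels numbered 1 through k:

```python
import numpy as np  # assumes softmax_probs from the earlier sketch is in scope

def log_likelihood(theta, X, Y):
    """log L(theta) = sum_i log phi_{y^(i)}(x^(i)).

    X: (m, n+1) array of inputs; Y: length-m array of labels in {1, ..., k}.
    """
    ll = 0.0
    for x_i, y_i in zip(X, Y):
        phi = softmax_probs(theta, x_i)  # class probabilities for this example
        ll += np.log(phi[y_i - 1])       # log phi_{y^(i)}  (1-based labels)
    return ll
```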
So this formula looks more complicated than it really is. What you really do is you write this down, take logs, compute the derivative with respect to theta, and apply, say, gradient ascent to maximize the likelihood.

Student: What are the rows of theta? It's just been a vector, right? And now it looks like it's two-dimensional.

Instructor: Yeah. In the notation of this lecture I have theta one through theta k minus one, and I've been thinking of each of these as an n-plus-one-dimensional vector, if x is n-plus-one-dimensional. So you have a set of parameters comprising k minus one vectors. You could group all of these together into a matrix, but I just haven't been doing that - so you take the derivative with respect to each of the k minus one parameter vectors.

Student: [Inaudible] - what do they correspond to?

Instructor: We're sort of out of time, so let me take that offline. It's hard to answer, in the same way that "what does theta correspond to in logistic regression?" is hard to answer - it's kind of like the weight on each feature, sort of a similar interpretation, yeah. That's good. I think I'm running a little bit late, so why don't we officially close for the day, but you can come up if you have more questions and we'll take them offline. Thanks.
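As a closing sketch of the "take logs, differentiate, and apply gradient ascent" recipe above: differentiating the softmax log-likelihood gives the gradient sum over i of (1{y^(i) = j} - phi_j(x^(i))) x^(i) for each theta_j, a step the lecture leaves unshown but which is the standard result. The learning rate alpha is a hypothetical choice, and softmax_probs is the earlier sketch.

```python
import numpy as np  # assumes softmax_probs from the earlier sketch is in scope

def batch_gradient_ascent_step(theta, X, Y, alpha=0.01):
    """One batch gradient-ascent update on the softmax log-likelihood."""
    grad = np.zeros_like(theta)
    for x_i, y_i in zip(X, Y):
        phi = softmax_probs(theta, x_i)
        for j in range(theta.shape[0]):            # parameter vectors 1 .. k-1
            indicator = 1.0 if y_i == j + 1 else 0.0
            grad[j] += (indicator - phi[j]) * x_i  # (1{y=j} - phi_j) * x
    return theta + alpha * grad                    # ascent: step uphill
```

Repeating this step until convergence maximizes the likelihood; updating on one training example at a time instead would give the stochastic variant.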