This presentation is delivered by the Stanford Center for Professional Development.

Welcome back. If you haven't given me the homework yet, you can just give it to me at the end of class. That's fine. Let's see. And also just a quick reminder -- I've actually seen project proposals start to trickle in already, which is great. As a reminder, project proposals are due this Friday, and if any of you want to meet and chat more about project ideas, I also have office hours immediately after lecture today. Are there any questions about any of that before I get started today? Great. Okay. Welcome back.

What I want to do today is wrap up our discussion on support vector machines, and in particular we'll also talk about the idea of kernels, and then talk about [inaudible], and then I'll talk about the SMO algorithm, which is an algorithm for solving the optimization problem that we posed last time.

To recap, we wrote down the following convex optimization problem. All this is assuming that the data is linearly separable, which is an assumption that I'll fix later. And so with this optimization problem, given a training set, this will find the optimal margin classifier for the data set -- the classifier that maximizes the geometric margin on your training examples. And so in the previous lecture, we also derived the dual of this problem, which was to maximize this. And this is the dual of our primal [inaudible] optimization problem.
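For reference, the problem being recapped on the board is the standard optimal margin formulation from the previous lecture; this is a sketch in the usual notation (training examples x^(i) with labels y^(i) in {-1, +1}), not a verbatim transcription of the board work:

    \min_{w,b}\ \tfrac{1}{2}\|w\|^2 \quad \text{s.t.}\quad y^{(i)}\big(w^T x^{(i)} + b\big) \ge 1,\quad i = 1,\dots,m

and its dual:

    \max_{\alpha}\ W(\alpha) = \sum_{i=1}^{m} \alpha_i - \tfrac{1}{2}\sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j \langle x^{(i)}, x^{(j)} \rangle \quad \text{s.t.}\quad \alpha_i \ge 0,\ \ \sum_{i=1}^{m} \alpha_i y^{(i)} = 0.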
Here, I'm using these angle brackets to denote the inner product, so this is just XI transpose XJ for vectors XI and XJ. We also worked out that W would be given by the sum over I of alpha I times YI times XI. Therefore, when you need to make a prediction at classification time, you need to compute the value of the hypothesis applied to an [inaudible], which is G of W transpose X plus B, where G is that threshold function that outputs plus one and minus one. And so this is G of a sum over I of alpha I terms, so that can also be written in terms of inner products between input vectors X.

So what I want to do now is talk about the idea of kernels, which will make use of this property, because it turns out that the only dependence of the algorithm on X is through these inner products. In fact, you can write the entire algorithm without ever explicitly referring to an X vector, [inaudible] between input feature vectors.

And the idea of a kernel is as follows. Let's say that you have an input attribute. Let's just say for now it's a real number. Maybe this is the living area of a house that you're trying to make a prediction on, like whether it will be sold in the next six months. Quite often, we'll take this feature X and we'll map it to a richer set of features.
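To make that earlier point concrete -- that the input enters the algorithm only through inner products -- here is a small illustrative sketch of the dual-form prediction; this is not code from the lecture, the names alphas, ys, X_train and b are hypothetical, and this is exactly the expression that the feature mapping discussed next will plug into.

    import numpy as np

    def predict(x, X_train, ys, alphas, b):
        """Dual-form SVM prediction: g(sum_i alpha_i * y_i * <x_i, x> + b)."""
        # The new input x appears only inside the inner products <x_i, x>.
        score = np.sum(alphas * ys * (X_train @ x)) + b
        return 1 if score >= 0 else -1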
So for example, we will take X and map it to these four polynomial features, and let me call this mapping Phi. So we'll let Phi of X denote the mapping from your original features to some higher dimensional set of features. So if you do this and you want to use the features Phi of X, then all you need to do is go back to the learning algorithm, and everywhere you see XI, XJ, we'll replace it with the inner product between Phi of XI and Phi of XJ. So this corresponds to running a support vector machine with the features given by Phi of X rather than with your original one-dimensional input feature X.

And in the scenario that I want to consider, sometimes Phi of X will be very high dimensional. In fact, sometimes Phi of X may contain very high degree polynomial features, and sometimes Phi of X will actually even be an infinite dimensional vector of features. And the question is, if Phi of X is extremely high dimensional, then it seems you can't actually compute these inner products very efficiently, because the computer needs to represent an extremely high dimensional feature vector and then take [inaudible] inefficient.

It turns out that in many important special cases, we can write down what we'll call the kernel function, denoted by K, which will be this, which is the inner product between those feature vectors.
It turns out there will be important special cases where computing Phi of X is computationally very expensive -- maybe it's impossible. Maybe it's an infinite dimensional vector, and you can't compute infinite dimensional vectors. There will be important special cases where Phi of X is very expensive to represent because it is so high dimensional, but nonetheless, you can actually compute a kernel between XI and XJ -- you can compute the inner product between these two vectors very inexpensively.

And so the idea of the support vector machine is that everywhere in the algorithm that you see these inner products, we're going to replace them with a kernel function that you can compute efficiently, and that lets you work in feature spaces Phi of X even if Phi of X is very high dimensional.

Let me now say how that's done. A little bit later today, we'll actually see some concrete examples of Phi of X and of kernels. For now, let's just think about constructing kernels explicitly. This is best illustrated by an example.

Let's say you have two inputs, X and Z. Normally I should write those as XI and XJ, but I'm just going to write X and Z to save on writing. Let's say my kernel is K of X, Z equals X transpose Z, squared. And so this is -- right? This thing here is X transpose Z and this thing is X transpose Z, so this is X transpose Z squared. And that's equal to that.
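The board work at this point is the standard expansion of this kernel; written out as a worked sketch rather than a verbatim copy of the board, it is

    K(x,z) = (x^T z)^2 = \left(\sum_{i=1}^{n} x_i z_i\right)\left(\sum_{j=1}^{n} x_j z_j\right) = \sum_{i=1}^{n}\sum_{j=1}^{n} (x_i x_j)(z_i z_j),

which is an inner product between vectors whose entries are all the pairwise products x_i x_j and z_i z_j respectively.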
And so this kernel corresponds to the feature mapping where Phi of X is equal to -- and I'll write this down for the case of N equals three, I guess. And so with this definition of Phi of X, you can verify for yourself that this thing becomes the inner product between Phi of X and Phi of Z, because to get an inner product between two vectors, you just take the sum of the corresponding elements of the vectors multiplied together. So if this is Phi of X, then the inner product between Phi of X and Phi of Z will be the sum over all the elements of this vector times the corresponding elements of Phi of Z, and what you get is this one.

And so the cool thing about this is that in order to compute Phi of X, you need [inaudible] just to compute Phi of X: if N is the dimension of X and Z, then Phi of X is a vector of all pairs XI, XJ multiplied with each other, and so the length of Phi of X is N squared. You need order N squared time just to compute Phi of X.
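Here is a small numerical check of that claim for N equals three; this is an illustrative sketch rather than code from the lecture, and the ordering of the entries of Phi is just one valid choice.

    import numpy as np

    def phi(x):
        """Explicit feature map for K(x, z) = (x^T z)^2: all pairwise products x_i * x_j."""
        n = len(x)
        return np.array([x[i] * x[j] for i in range(n) for j in range(n)])

    x = np.array([1.0, 2.0, 3.0])
    z = np.array([0.5, -1.0, 2.0])

    # O(n) kernel evaluation agrees with the O(n^2) explicit inner product.
    assert np.isclose((x @ z) ** 2, phi(x) @ phi(z))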
But to compute K -- to compute the kernel function -- all you need is order N time, because the kernel function is defined as X transpose Z squared, so you just take the inner product between X and Z, which is order N time, and you square that, and you've computed this kernel function. And so you've just computed the inner product between two vectors where each vector has N squared elements, but you did it in order N time.

Student: For any kernel you find for X and Z, does Phi exist for X and Z?

Instructor (Andrew Ng): Let me talk about that later. We'll talk about what is a valid kernel later. Please raise your hand if this makes sense.

So let me just describe a couple of quick generalizations to this. One is that if you define K of X, Z to be equal to X transpose Z plus C, squared -- so again, you can compute this kernel in order N time -- then that turns out to correspond to a feature vector where I'm just going to add a few more elements at the bottom, where you add root 2C. Let me read that. That was root 2C times X1, root 2C times X2, root 2C times X3, and C.
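For concreteness, the feature vector being read off the board would, for N equals three, look like the following; this is a sketch using the standard expansion of (x^T z + c)^2, not a verbatim copy of the board:

    \phi(x) = \big(x_1 x_1,\ x_1 x_2,\ x_1 x_3,\ x_2 x_1,\ \dots,\ x_3 x_3,\ \sqrt{2c}\,x_1,\ \sqrt{2c}\,x_2,\ \sqrt{2c}\,x_3,\ c\big)^T,

    \text{since}\quad (x^T z + c)^2 = \sum_{i,j} (x_i x_j)(z_i z_j) + \sum_i \big(\sqrt{2c}\,x_i\big)\big(\sqrt{2c}\,z_i\big) + c \cdot c.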
And so this is a way of creating a feature vector with both the monomials, meaning the first order terms, as well as the quadratic terms -- the products between XI and XJ -- and the parameter C here allows you to control the relative weighting between the monomial terms, so the first order terms, and the quadratic terms. Again, this is still an inner product between vectors of length N squared, [inaudible] in order N time.

More generally, here are some other examples of kernels. Actually, a generalization of the one I just derived right now would be the following kernel. And so this corresponds to using all N plus D choose D features of all monomials. Monomials just mean the products of XI, XJ, XK -- just all the polynomial terms up to degree D, plus [inaudible] -- so on the order of N plus D to the power of D, so this grows exponentially in D. This is a very high dimensional feature vector, but again, you can implicitly construct the feature vector and take inner products between them. It's very computationally efficient, because you just compute the inner product between X and Z, add C, and you take that real number to the power of D, and by plugging this in as a kernel, you're implicitly working in an extremely high dimensional computing space.

So what I've given is just a few specific examples of how to create kernels.
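As a quick illustration of that computational point, here is a sketch (illustrative only, not code from the lecture) of evaluating the degree-D polynomial kernel in order N time, even though the implicit feature space has on the order of (N + D choose D) monomial features:

    import numpy as np
    from math import comb

    def poly_kernel(x, z, c=1.0, d=3):
        """K(x, z) = (x^T z + c)^d, computed in O(n) time."""
        return (x @ z + c) ** d

    x = np.random.randn(100)
    z = np.random.randn(100)
    print(poly_kernel(x, z))   # O(n) work
    print(comb(100 + 3, 3))    # size of the implicit monomial feature space: 176,851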
So let me ask you more generally: if you're faced with a new machine-learning problem, how do you come up with a kernel? There are many ways to think about it, but here's one intuition that's sort of useful.

Given the set of attributes X, you're going to use a feature vector Phi of X, and given the set of attributes Z, you're going to use a feature vector Phi of Z, and so the kernel is computing the inner product between Phi of X and Phi of Z. And so one intuition -- this is a partial intuition; it isn't a rigorous one -- is that if X and Z are very similar, then Phi of X and Phi of Z will be pointing in the same direction, and therefore the inner product will be large. Whereas in contrast, if X and Z are very dissimilar, then Phi of X and Phi of Z may be pointing in different directions, and so the inner product may be small. That intuition is not a rigorous one, but it's sort of a useful one to think about.

If you're faced with a new learning problem -- if I give you some random thing to classify and you want to decide how to come up with a kernel -- one way is to try to come up with a function K of X, Z that is large if you want the learning algorithm to think of X and Z as similar, and small if you want it to think of them as dissimilar.
Again, this isn't always true, but this is one of several intuitions. So if you're trying to classify some brand new thing -- you're trying to classify [inaudible] or DNA sequences or something, some strange thing you want to classify -- one thing you could do is try to come up with a kernel that's large when you want the algorithm to think these are similar things, and small when these are dissimilar.

And so this addresses the question of: let's say I have something I want to classify, and let's say I write down the function that I think is a good measure of how similar or dissimilar X and Z are for my specific problem. Let's say I write down K of X, Z equals E to the minus -- let's say I write down this function, and I think this is a good measure of how similar X and Z are.

The question, then, is: is this really a valid kernel? In other words, to understand how we can construct kernels -- if I write down a function like that, the question is, does there really exist some Phi such that K of X, Z is equal to the inner product? And that's the question of whether K is a valid kernel.

It turns out that there is a result that characterizes necessary and sufficient conditions for when a function K that you might choose is a valid kernel. I should go ahead and show part of that result now.
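The function being written on the board here is, judging from the later reference to it as the Gaussian kernel, presumably the standard Gaussian (RBF) kernel; as a sketch:

    K(x,z) = \exp\!\left(-\frac{\|x - z\|^2}{2\sigma^2}\right).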
Suppose K is a valid kernel, and when I say K is a kernel, what I mean is that there does indeed exist some function Phi for which this holds true. Then let any set of points X1 up to XM be given, and let me define a matrix K. I apologize for overloading notation: K I'm going to use to denote both the kernel function, which is the function of X and Z, as well as a matrix. Unfortunately, there aren't enough alphabets. Well, that's not true. We define the kernel matrix to be an M-by-M matrix such that K subscript IJ is equal to the kernel function applied to two of my examples.

Then it turns out that for any vector Z that's M-dimensional, I want you to consider Z transpose K Z. By the definition of matrix multiplication, this is that, and KIJ is the kernel function between XI and XJ, so that must be equal to this. I assumed that K is a valid kernel function, and so there must exist such a Phi. This is the inner product between two feature vectors, so let me just make that inner product explicit. I'm going to sum over the elements of this vector, and I'm going to use Phi of XI subscript K just to denote the Kth element of this vector. Just rearrange sums.
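The chain of equalities being worked through on the board is the standard one; as a worked sketch (not a verbatim transcription of the board), with z_i denoting the entries of the vector Z:

    z^T K z = \sum_i \sum_j z_i K_{ij} z_j
            = \sum_i \sum_j z_i \,\phi(x_i)^T \phi(x_j)\, z_j
            = \sum_k \sum_i \sum_j z_i \,\phi_k(x_i)\, \phi_k(x_j)\, z_j
            = \sum_k \Big(\sum_i z_i \,\phi_k(x_i)\Big)^2 \;\ge\; 0.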
You get the sum over K. This next step may look familiar to some of you, which is just -- right? Therefore, this is a sum of squares, and it must therefore be greater than or equal to zero. Do you want to take a minute to look over all the steps and just make sure you buy them all?

Oh, this is the inner product between the vectors Phi of XI and Phi of XJ, so the inner product between two vectors is the sum over all the elements of the vectors of the corresponding elements.

Student: [Inaudible].

Instructor (Andrew Ng): Oh, yes it is. This is just A transpose B equals the sum over K of AK times BK, so that's just this. This is the sum over K of the Kth elements of these vectors. Take a look at this and make sure it makes sense. Questions about this?

So just to summarize, what we showed was that for any vector Z, Z transpose K Z is greater than or equal to zero, and this is one of the standard definitions of the matrix K being positive semidefinite, which is sometimes written K greater than or equal to zero. So just to summarize, what was shown is that this holds whenever K is a valid kernel -- in other words, whenever K is a function for which there exists some Phi such that K of XI, XJ is the inner product between Phi of XI and Phi of XJ.
So if K is a valid kernel, we showed, then, that the kernel matrix must be positive semidefinite. It turns out that the converse [inaudible], and so this gives you a test for whether a function K is a valid kernel. This is a theorem due to Mercer, and so kernels are also sometimes called Mercer kernels. It means the same thing -- it just means it's a valid kernel.

Let K of X, Z be given. Then K is a valid kernel -- in other words, it's a Mercer kernel, i.e., there exists a Phi such that K of X, Z equals Phi of X transpose Phi of Z -- if and only if for any set of M examples, and this really means for any set of M points (it's not necessarily a training set; it's just any set of M points you may choose), it holds true that the kernel matrix, capital K, that I defined just now is symmetric positive semidefinite.

And so I proved only one direction of this result. I proved that if it's a valid kernel, then K is symmetric positive semidefinite, but the converse I didn't show. It turns out that this is a necessary and sufficient condition. And so this gives you a useful test for whether any function that you might want to choose is a kernel.
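As a practical aside (not something done in the lecture), this criterion suggests a quick numerical sanity check for a candidate kernel: build the kernel matrix on a handful of points and confirm it is symmetric with no significantly negative eigenvalues. The kernel and sample points below are illustrative only.

    import numpy as np

    def kernel_matrix(kernel, points):
        """K_ij = kernel(x_i, x_j) for a list of points."""
        m = len(points)
        return np.array([[kernel(points[i], points[j]) for j in range(m)] for i in range(m)])

    gaussian = lambda x, z, sigma=1.0: np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

    pts = [np.random.randn(5) for _ in range(20)]
    K = kernel_matrix(gaussian, pts)

    # Mercer's condition (the direction shown above): K should be symmetric positive semidefinite.
    eigvals = np.linalg.eigvalsh(K)
    print(np.allclose(K, K.T), eigvals.min() >= -1e-10)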
A concrete example of something that's clearly not a valid kernel would be if you find an input X such that K of X, X is minus one, for example. Then this is an example of something that's clearly not a valid kernel, because minus one cannot possibly be equal to Phi of X transpose Phi of X, and so this would be one of many examples of functions that fail to meet the conditions of this theorem, because the inner product of a vector with itself is always greater than or equal to zero.

So just to tie this back explicitly to an SVM: let's say you want to use a support vector machine with a kernel. What you do is you choose some function K of X, Z -- and it turns out that the function I wrote down just now is, indeed, a valid kernel; it is called the Gaussian kernel because of the similarity to Gaussians. So you choose some kernel function like this, or you may choose X transpose Z plus C to the power of D. To apply a support vector machine with a kernel, you choose one of these functions, and the choice would depend on your problem -- it depends on what is a good measure of whether two examples are similar or two examples are different for your problem. Then you go back to our formulation of the support vector machine, and you use the dual formulation, and you then replace everywhere you see these things -- you replace them with K of XI, XJ.
And you then run exactly the same support vector machine algorithm, only everywhere you see these inner products, you replace them with that, and what you've just done is you've taken a support vector machine and you've taken each of your feature vectors X and replaced it, implicitly, with a very high dimensional feature vector. It turns out that the Gaussian kernel corresponds to a feature vector that's infinite dimensional. Nonetheless, you can run a support vector machine in a finite amount of time, even though you're working with infinite dimensional feature vectors, because all you ever need to do is compute these things, and you don't ever need to represent these infinite dimensional feature vectors explicitly.

Why is this a good idea? It turns out -- I think when I started off talking about support vector machines, I started by saying that we wanted to develop non-linear learning algorithms. So here's one useful picture to keep in mind. Let's say your original data -- I didn't mean to draw that slanted. Let's say you have one-dimensional input data; you just have one feature, X, in R. What a kernel does is the following: it takes your original input data and maps it to a very high dimensional feature space -- in the case of Gaussian kernels, an infinite dimensional feature space; for pedagogical reasons, I'll draw two dimensions here. So say [inaudible] very high dimensional feature space where -- like so.
So it takes all your data in R1 and maps it to R infinity, and then you run a support vector machine in this infinite dimensional, or exponentially high dimensional, space, and you'll find the optimal margin classifier -- in other words, the classifier that separates your data in this very high dimensional space with the largest possible geometric margin. In this example that I just drew, whereas your data was not linearly separable in your original one-dimensional space, when you map it to this much higher dimensional space, it becomes linearly separable, so you can use your linear classifier to [inaudible] data that is not really separable in your original space. This is how support vector machines output nonlinear decision boundaries, and in the entire process, all you ever need to do is solve convex optimization problems.

Questions about any of this?

Student: [Inaudible] sigma?

Instructor (Andrew Ng): Yeah, so sigma is -- let's see. Well, I was going to talk about [inaudible] later. One way to choose sigma is to save aside a small amount of your data and try different values of sigma: train an SVM using, say, two thirds of your data, try different values of sigma, then see what works best on a separate hold-out cross validation set -- on a separate set that you're testing on.
It's something like other learning algorithms we've talked about -- locally [inaudible] linear regression's [inaudible] bandwidth parameter -- so there are a number of parameters in some of these algorithms that you can choose by saving aside some data to test on. I'll talk more about model selection [inaudible] explicitly. Are there other questions?

Student: So how do you know that moving it up to a high dimensional space is going to give you that kind of separation?

Instructor (Andrew Ng): Good question. Usually, you don't know [inaudible]. Sometimes you can know, but in most cases, you won't know [inaudible] actually going to be linearly separable, so the next topic will be [inaudible], which is what [inaudible] SVMs that work even though the data is not linearly separable.

Student: If you can't get it linearly separable by mapping to a higher dimension, couldn't you also just use [inaudible] higher dimension?

Instructor (Andrew Ng): Very right. This is a question about what to do if you can't separate it even in the higher dimensional space. Let me try to address that with the discussion of [inaudible] soft margin SVMs. Okay.

Student: What if you run an SVM algorithm that assumes the data are linearly separable on data that is not actually linearly separable?

Instructor (Andrew Ng): You guys are really giving me a hard time about whether the data's linearly separable. It turns out this algorithm won't work if the data is not linearly separable, but I'll change that in a second and make it work.
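Regarding the earlier question about choosing sigma, here is a rough sketch of the hold-out procedure just described; this is illustrative rather than code from the lecture, and train_svm and accuracy stand in for whatever SVM trainer and evaluation routine you are using (X and y are assumed to be numpy arrays).

    import numpy as np

    def choose_sigma(X, y, sigmas, train_svm, accuracy):
        """Pick sigma by training on roughly 2/3 of the data and scoring on the held-out 1/3."""
        m = len(X)
        idx = np.random.permutation(m)
        train, hold = idx[: 2 * m // 3], idx[2 * m // 3 :]
        scores = {}
        for sigma in sigmas:
            model = train_svm(X[train], y[train], sigma=sigma)   # hypothetical trainer
            scores[sigma] = accuracy(model, X[hold], y[hold])    # hypothetical metric
        return max(scores, key=scores.get)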
Before I move on to talk about that, let me just say one final word about kernels. I talked about kernels in the context of support vector machines, and the idea of kernels was what really made support vector machines a very powerful learning algorithm. Actually, towards the end of today's lecture, if I have time, I'll give a couple more [inaudible] examples of how to choose kernels as well.

It turns out that the idea of kernels is actually more general than support vector machines. In particular, we took this SVM algorithm and we derived a dual, and that was what let us write the entire algorithm in terms of inner products of these. It turns out that you can take many of the other algorithms that you've seen in this class -- in fact, you can take most of the linear algorithms we talked about, such as linear regression, logistic regression, [inaudible] -- and rewrite them entirely in terms of these inner products.

So if you have any algorithm that you can rewrite in terms of inner products, then that means you can replace them with K of XI, XJ, and that means you can take any of these algorithms and implicitly map the feature vectors into very high dimensional feature spaces and have the algorithm still work.
The idea of kernels is perhaps most widely used with support vector machines, but it is actually more general than that, and you can take many of the other algorithms that you've seen, and many of the algorithms that we'll see later this quarter as well, and write them in terms of inner products, and thereby kernelize them and apply them to infinite dimensional feature spaces. You'll actually get to play with many of these ideas more in the next problem set.

Let's talk about non-linear decision boundaries. This is the idea of what's called the L1 norm soft margin SVM. Machine learning people sometimes aren't great at coming up with good names, but here's the idea.

Let's say I have a data set. This is a linearly separable data set, but what do I do if I have a couple of other examples there that make the data nonlinearly separable? And in fact, sometimes even if the data is linearly separable, maybe you might not want to separate it exactly. So for example, this is a very nice data set. It looks like there's a great decision boundary that separates the two [inaudible]. Well, what if I had just one outlier down here? I could still linearly separate this data set with something like that, but I'm somehow letting one slightly suspicious example skew my entire decision boundary by a lot. And so what I'm going to talk about now is the L1 norm soft margin SVM, which is a slightly modified formulation of the SVM optimization problem.
Dialogue: 0,0:37:05.34,0:37:09.07,Default,,0000,0000,0000,,They will let us deal with both of these cases " one where one of the data's Dialogue: 0,0:37:09.07,0:37:10.81,Default,,0000,0000,0000,,just not linearly separable Dialogue: 0,0:37:10.81,0:37:14.15,Default,,0000,0000,0000,,and two, what if you have some examples that you'd rather Dialogue: 0,0:37:14.15,0:37:16.60,Default,,0000,0000,0000,,not get [inaudible] in a training set. Maybe Dialogue: 0,0:37:16.60,0:37:19.39,Default,,0000,0000,0000,,with an outlier here, maybe you actually prefer to Dialogue: 0,0:37:19.39,0:37:23.12,Default,,0000,0000,0000,,choose that original decision boundary and not try so hard to get Dialogue: 0,0:37:23.12,0:37:26.62,Default,,0000,0000,0000,,that training example. Here's the formulation. Dialogue: 0,0:37:26.62,0:37:33.62,Default,,0000,0000,0000,,Our Dialogue: 0,0:37:44.48,0:37:45.43,Default,,0000,0000,0000,, Dialogue: 0,0:37:45.43,0:37:46.43,Default,,0000,0000,0000,,SVM Dialogue: 0,0:37:46.43,0:37:51.11,Default,,0000,0000,0000,,primal problem was to minimize one-half [inaudible] W Dialogue: 0,0:37:51.11,0:37:51.92,Default,,0000,0000,0000,,squared. Dialogue: 0,0:37:51.92,0:37:58.92,Default,,0000,0000,0000,, Dialogue: 0,0:38:02.72,0:38:05.19,Default,,0000,0000,0000,,So this is our original problem, and I'm Dialogue: 0,0:38:05.19,0:38:09.74,Default,,0000,0000,0000,,going to modify this by adding the Dialogue: 0,0:38:09.74,0:38:16.74,Default,,0000,0000,0000,,following. Dialogue: 0,0:38:23.45,0:38:27.37,Default,,0000,0000,0000,,In other words, I'm gonna add these penalty terms, CIs, Dialogue: 0,0:38:27.37,0:38:32.41,Default,,0000,0000,0000,,and I'm going to demand that each of my training examples is separated with Dialogue: 0,0:38:32.41,0:38:35.100,Default,,0000,0000,0000,,functional margin greater than or equal to one minus CI, Dialogue: 0,0:38:35.100,0:38:39.64,Default,,0000,0000,0000,,and you remember Dialogue: 0,0:38:39.64,0:38:45.48,Default,,0000,0000,0000,, Dialogue: 0,0:38:45.48,0:38:47.43,Default,,0000,0000,0000,,if this is Dialogue: 0,0:38:47.43,0:38:50.01,Default,,0000,0000,0000,,greater than zero " was it Dialogue: 0,0:38:50.01,0:38:52.75,Default,,0000,0000,0000,,two lectures ago that I said that if the Dialogue: 0,0:38:52.75,0:38:54.69,Default,,0000,0000,0000,,function margin is greater than zero, Dialogue: 0,0:38:54.69,0:39:00.81,Default,,0000,0000,0000,,that implies you classified it correctly. Dialogue: 0,0:39:00.81,0:39:04.59,Default,,0000,0000,0000,,If it's less than zero, then you misclassified it. Dialogue: 0,0:39:04.59,0:39:08.23,Default,,0000,0000,0000,,By setting some of the CIs to be larger than one, I can Dialogue: 0,0:39:08.23,0:39:10.66,Default,,0000,0000,0000,,actually have some examples with Dialogue: 0,0:39:10.66,0:39:12.88,Default,,0000,0000,0000,,functional margin negative, Dialogue: 0,0:39:12.88,0:39:16.44,Default,,0000,0000,0000,,and therefore I'm allowing my algorithm to misclassify some of the examples of the Dialogue: 0,0:39:16.44,0:39:19.09,Default,,0000,0000,0000,,training set. Dialogue: 0,0:39:19.09,0:39:23.13,Default,,0000,0000,0000,,However, I'll encourage the algorithm not to do that by adding to the Dialogue: 0,0:39:23.13,0:39:24.38,Default,,0000,0000,0000,,optimization objective, Dialogue: 0,0:39:24.38,0:39:29.61,Default,,0000,0000,0000,,this sort of penalty term that penalizes setting CIs to be large. 
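For reference, the modified primal problem being written on the board can be stated as follows. This is the standard form of the L1 norm soft margin SVM, reconstructed here rather than copied from the board; the penalty terms referred to above as "CI" are the slack variables, written ξi below, and C > 0 is the tradeoff parameter.

\begin{aligned}
\min_{w,\,b,\,\xi}\quad & \tfrac{1}{2}\|w\|^{2} + C\sum_{i=1}^{m}\xi_{i} \\
\text{s.t.}\quad & y^{(i)}\left(w^{T}x^{(i)} + b\right) \ge 1 - \xi_{i}, \qquad i = 1,\dots,m \\
& \xi_{i} \ge 0, \qquad i = 1,\dots,m
\end{aligned}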
This is an optimization problem where the parameters are w, b and all of the ξi's, and this is also a convex optimization problem. It turns out that, similar to how we worked out the dual of the support vector machine, we can also work out the dual for this optimization problem. I won't actually do it, but just to show you the steps, what you do is construct the Lagrangian (it's written out just after this paragraph), and I'm going to use alpha and r to denote the Lagrange multipliers corresponding to the set of constraints we had previously and to this new set of constraints that the ξi's be greater than or equal to zero. That gives us two sets of Lagrange multipliers. The Lagrangian will be the optimization objective, minus alpha i times each of the margin constraints, which are greater than or equal to zero, minus ri times each of the ξi constraints. I won't re-derive the entire dual again, but it's really the same, and when you derive the dual of this optimization problem and simplify, you find that you get the following: you have to maximize an objective which is actually the same as before.
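The Lagrangian referred to in the previous paragraph, with multipliers alpha i for the margin constraints and ri for the ξi ≥ 0 constraints, has the standard form below; this is a reconstruction of the board work, not a transcription of it.

\mathcal{L}(w,b,\xi,\alpha,r)
= \tfrac{1}{2}\|w\|^{2} + C\sum_{i=1}^{m}\xi_{i}
- \sum_{i=1}^{m}\alpha_{i}\left[y^{(i)}\left(w^{T}x^{(i)}+b\right) - 1 + \xi_{i}\right]
- \sum_{i=1}^{m} r_{i}\,\xi_{i},
\qquad \alpha_{i}\ge 0,\; r_{i}\ge 0.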
So it turns out that when you derive the dual and simplify, the only way the dual changes compared to the previous one is that rather than the constraint that the alpha i's are greater than or equal to zero, we now have a constraint that the alphas are between zero and C. This derivation isn't very hard, and you're encouraged to go home and try to do it yourself. It's really essentially the same math, and when you simplify, it turns out you can simplify the r Lagrange multipliers away and you end up with just these constraints on the alphas. Just as an aside, I won't derive these either. Remember, I wrote down the KKT (Karush-Kuhn-Tucker) conditions in the last lecture, the necessary conditions for something to be an optimal solution to a constrained optimization problem. If you use the KKT conditions, it turns out you can actually derive convergence conditions. We want to solve this optimization problem, so when do we know that the alphas have converged to the global optimum? It turns out you can use the conditions written out below; I don't want to say a lot about these.
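In their standard form, the resulting dual and the KKT-derived convergence conditions are the following; this is a reconstruction, since the transcript does not capture the board, and g(x) = wᵀx + b denotes the SVM's raw output.

\begin{aligned}
\max_{\alpha}\quad & W(\alpha) = \sum_{i=1}^{m}\alpha_{i}
 - \tfrac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} y^{(i)}y^{(j)}\alpha_{i}\alpha_{j}\left\langle x^{(i)},x^{(j)}\right\rangle \\
\text{s.t.}\quad & 0 \le \alpha_{i} \le C, \qquad \sum_{i=1}^{m}\alpha_{i}y^{(i)} = 0
\end{aligned}

\alpha_{i}=0 \;\Rightarrow\; y^{(i)}g(x^{(i)}) \ge 1, \qquad
\alpha_{i}=C \;\Rightarrow\; y^{(i)}g(x^{(i)}) \le 1, \qquad
0<\alpha_{i}<C \;\Rightarrow\; y^{(i)}g(x^{(i)}) = 1.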
It turns out that from the KKT conditions you can derive these as the convergence conditions for an algorithm that you might choose to use to try to solve the optimization problem in terms of the alphas. So that's the L1 norm soft margin SVM, and this is the change to the algorithm that lets us handle non-linearly separable data sets as well as single outliers in a data set that may still be linearly separable but that you may choose not to separate perfectly. Questions about this? Raise your hand if this stuff makes sense at all. Great. So the last thing I want to do is talk about an algorithm for actually solving this optimization problem. We wrote down this dual optimization problem with convergence criteria, so let's come up with an efficient algorithm to actually solve it. I want to do this partly to give me an excuse to talk about an algorithm called coordinate ascent, which is a useful algorithm to know about. It'll turn out that coordinate ascent won't apply in its simplest form to this problem, but we'll then be able to modify it slightly and it'll give us a very efficient algorithm for solving the SVM dual optimization problem. That was the other reason that I had to derive the dual: not only so that we could use kernels, but also so that we can apply an algorithm like the SMO algorithm.
First, let's talk about coordinate ascent, which is another [inaudible] optimization algorithm. To describe coordinate ascent, I just want you to consider the problem where we want to maximize some function W, which is a function of alpha 1 through alpha m, with no constraints. So for now, forget about the constraint that the alpha i's must be between zero and C, and forget about the constraint that the sum of yi alpha i must be equal to zero. Then this is the coordinate ascent algorithm: repeat until convergence, and inside that, for i equals one to m, the inner loop of coordinate ascent essentially holds all the parameters except alpha i fixed and then just maximizes this function with respect to that one parameter. Let me write that as alpha i gets updated as the arg max over alpha i hat of W of alpha 1 through alpha i minus one, then alpha i hat, then alpha i plus one up through alpha m. This is really just a fancy way of saying: hold everything except alpha i fixed, and optimize the objective W with respect to only alpha i. That's coordinate ascent. One picture that's kind of useful for coordinate ascent is if you imagine you're trying to optimize a quadratic function; it really looks like that.
These are the contours of the quadratic function and the minimums Dialogue: 0,0:49:30.80,0:49:31.92,Default,,0000,0000,0000,,here. Dialogue: 0,0:49:31.92,0:49:37.15,Default,,0000,0000,0000,,This is what Dialogue: 0,0:49:37.15,0:49:39.84,Default,,0000,0000,0000,,coordinate assent would look like. These are my [inaudible] call this Alpha two and I'll call this Alpha one. Dialogue: 0,0:49:39.84,0:49:44.86,Default,,0000,0000,0000,,My Alpha one Alpha two axis, and so let's say I start down here. Then I'm going Dialogue: 0,0:49:44.86,0:49:48.97,Default,,0000,0000,0000,,to begin by minimizing this with respect to Alpha one. I Dialogue: 0,0:49:48.97,0:49:51.03,Default,,0000,0000,0000,,go there. Dialogue: 0,0:49:51.03,0:49:54.80,Default,,0000,0000,0000,,And then at my new point, I'll minimize with respect to Alpha two, and so I might go to someplace like Dialogue: 0,0:49:54.80,0:49:56.25,Default,,0000,0000,0000,,that. Dialogue: 0,0:49:56.25,0:49:58.36,Default,,0000,0000,0000,,Then, I'll minimize Dialogue: 0,0:49:58.36,0:50:02.24,Default,,0000,0000,0000,,with respect to Alpha one goes back to Alpha two Dialogue: 0,0:50:02.24,0:50:04.47,Default,,0000,0000,0000,,and so on. Dialogue: 0,0:50:04.47,0:50:08.05,Default,,0000,0000,0000,,You're always taking these axis-aligned steps Dialogue: 0,0:50:08.05,0:50:11.71,Default,,0000,0000,0000,,to get to the minimum. Dialogue: 0,0:50:11.71,0:50:16.04,Default,,0000,0000,0000,,It turns out that there's a modification to this. There are Dialogue: 0,0:50:16.04,0:50:18.15,Default,,0000,0000,0000,,variations of this as well. The way I describe Dialogue: 0,0:50:18.15,0:50:20.73,Default,,0000,0000,0000,,the algorithm, we're always doing this Dialogue: 0,0:50:20.73,0:50:24.17,Default,,0000,0000,0000,,in alternating order. We always optimize with respect to Alpha Dialogue: 0,0:50:24.17,0:50:28.52,Default,,0000,0000,0000,,one then Alpha two, then Alpha one, then Alpha two. Dialogue: 0,0:50:28.52,0:50:32.16,Default,,0000,0000,0000,,What I'm about to say applies only in higher dimensions, but it turns out if you have Dialogue: 0,0:50:32.16,0:50:35.24,Default,,0000,0000,0000,,a lot of parameters, Alpha one through Alpha M, Dialogue: 0,0:50:35.24,0:50:39.29,Default,,0000,0000,0000,,you may not choose to always visit them in a fixed order. You may choose Dialogue: 0,0:50:39.29,0:50:43.24,Default,,0000,0000,0000,,which Alphas update next depending on what you Dialogue: 0,0:50:43.24,0:50:46.08,Default,,0000,0000,0000,,think will allow you to make the most progress. If you have only two parameters, it makes Dialogue: 0,0:50:46.08,0:50:50.00,Default,,0000,0000,0000,,sense to alternate between them. Dialogue: 0,0:50:50.00,0:50:51.10,Default,,0000,0000,0000,,If Dialogue: 0,0:50:51.10,0:50:52.73,Default,,0000,0000,0000,,you have Dialogue: 0,0:50:52.73,0:50:57.02,Default,,0000,0000,0000,,higher dimensional parameters, sometimes you may choose to update them in a Dialogue: 0,0:50:57.02,0:51:00.65,Default,,0000,0000,0000,,different order if you think doing so would let you make faster progress towards the Dialogue: 0,0:51:00.65,0:51:01.53,Default,,0000,0000,0000,, Dialogue: 0,0:51:01.53,0:51:04.96,Default,,0000,0000,0000,, Dialogue: 0,0:51:04.96,0:51:08.52,Default,,0000,0000,0000,,maximum. 
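Here is a small, self-contained sketch of coordinate ascent on a concave quadratic in two variables; it is my own illustration of the axis-aligned steps in that picture, not code from the lecture, and each inner step maximizes W over a single coordinate in closed form.

import numpy as np

# Maximize W(a) = b^T a - 0.5 * a^T Q a, which is concave since Q is positive definite.
Q = np.array([[2.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, 1.0])

def coordinate_ascent(Q, b, num_iters=50):
    a = np.zeros(len(b))
    for _ in range(num_iters):          # repeat until convergence
        for i in range(len(b)):         # for i = 1, ..., m
            # Hold every coordinate except a[i] fixed and maximize over a[i]:
            # dW/da_i = b_i - Q[i, i] * a_i - sum_{j != i} Q[i, j] * a_j = 0
            rest = np.dot(Q[i], a) - Q[i, i] * a[i]
            a[i] = (b[i] - rest) / Q[i, i]
    return a

a_star = coordinate_ascent(Q, b)
# The true maximizer solves Q a = b, so a_star should approach np.linalg.solve(Q, b).

With only two parameters the updates simply alternate; with more parameters you could pick the next coordinate by whatever rule seems to make the most progress, as discussed above.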
It turns out that coordinate ascent, compared to some of the algorithms we saw previously, compared to, say, Newton's method, will usually take a lot more steps. But the chief advantage of coordinate ascent, when it works well, is that sometimes it is very inexpensive to optimize the objective W with respect to any one of your parameters, and so even though coordinate ascent may need many more iterations than, say, Newton's method in order to converge, each of those iterations can be very cheap. It turns out that there are many optimization problems for which it's particularly easy to fix all but one of the parameters and optimize with respect to just that one parameter, and if that's true, then the inner loop of coordinate ascent, the step that optimizes with respect to alpha i, can be done very quickly, and the whole procedure can be efficient. It turns out that this will be true when we modify this algorithm to solve the SVM optimization problem. Questions about this? Okay. Let's go ahead and apply this to our support vector machine dual optimization problem. It turns out that coordinate ascent in its basic form does not work here, for the following reason. The reason is that we have constraints on the alpha i's. Namely, as we can recall from what we worked out previously, we have a constraint that the sum of yi alpha i must be equal to zero, and so if you fix all the alphas except for one, then you can't change that one alpha without violating the constraint.
If Dialogue: 0,0:53:23.92,0:53:24.67,Default,,0000,0000,0000,, Dialogue: 0,0:53:24.67,0:53:26.46,Default,,0000,0000,0000,,I just try to change Alpha one, Dialogue: 0,0:53:26.46,0:53:27.24,Default,,0000,0000,0000,, Dialogue: 0,0:53:27.24,0:53:30.56,Default,,0000,0000,0000,,Alpha one is actually exactly determined as a function of the other Dialogue: 0,0:53:30.56,0:53:33.40,Default,,0000,0000,0000,,Alphas because this was sum to zero. Dialogue: 0,0:53:33.40,0:53:35.73,Default,,0000,0000,0000,,The Dialogue: 0,0:53:35.73,0:53:39.39,Default,,0000,0000,0000,,SMO algorithm, by the way, is due to John Platt, a colleague at Dialogue: 0,0:53:39.39,0:53:42.61,Default,,0000,0000,0000,,Microsoft. The SMO Dialogue: 0,0:53:42.61,0:53:46.29,Default,,0000,0000,0000,,algorithm, therefore, instead of trying to change one Alpha at a time, Dialogue: 0,0:53:46.29,0:53:48.98,Default,,0000,0000,0000,,we will try to change Dialogue: 0,0:53:48.98,0:53:50.49,Default,,0000,0000,0000,,two Dialogue: 0,0:53:50.49,0:53:54.20,Default,,0000,0000,0000,,Alphas at a time. Dialogue: 0,0:53:54.20,0:53:55.27,Default,,0000,0000,0000,,This is Dialogue: 0,0:53:55.27,0:53:58.19,Default,,0000,0000,0000,,called the SMO algorithm, in a Dialogue: 0,0:53:58.19,0:54:00.60,Default,,0000,0000,0000,,sense the sequential minimal optimization Dialogue: 0,0:54:00.60,0:54:02.99,Default,,0000,0000,0000,,and the term minimal refers to the fact that Dialogue: 0,0:54:02.99,0:54:06.35,Default,,0000,0000,0000,,we're choosing the smallest number of Alpha Is to change at a time, which Dialogue: 0,0:54:06.35,0:54:10.68,Default,,0000,0000,0000,,in this case, we need to change at least two at a time. Dialogue: 0,0:54:10.68,0:54:16.85,Default,,0000,0000,0000,,So then go ahead and outline the algorithm. Dialogue: 0,0:54:16.85,0:54:19.66,Default,,0000,0000,0000,,We will Dialogue: 0,0:54:19.66,0:54:23.63,Default,,0000,0000,0000,,select Dialogue: 0,0:54:23.63,0:54:28.61,Default,,0000,0000,0000,,two Alphas to change, some Alpha I and Alpha J [inaudible] " it Dialogue: 0,0:54:28.61,0:54:33.98,Default,,0000,0000,0000,,just means a rule of thumb. Dialogue: 0,0:54:33.98,0:54:40.93,Default,,0000,0000,0000,,We'll hold all the Alpha Ks fixed Dialogue: 0,0:54:40.93,0:54:44.68,Default,,0000,0000,0000,,except Dialogue: 0,0:54:44.68,0:54:49.94,Default,,0000,0000,0000,,Alpha I and Alpha J Dialogue: 0,0:54:49.94,0:54:54.20,Default,,0000,0000,0000,,and optimize W [inaudible] Alpha Dialogue: 0,0:54:54.20,0:54:56.64,Default,,0000,0000,0000,,with respect Dialogue: 0,0:54:56.64,0:54:59.44,Default,,0000,0000,0000,,to Alpha I Dialogue: 0,0:54:59.44,0:55:06.44,Default,,0000,0000,0000,,and Alpha J subject to all the constraints. Dialogue: 0,0:55:14.33,0:55:17.65,Default,,0000,0000,0000,,It turns out the Dialogue: 0,0:55:17.65,0:55:19.88,Default,,0000,0000,0000,,key step Dialogue: 0,0:55:19.88,0:55:21.44,Default,,0000,0000,0000,,which I'm going to Dialogue: 0,0:55:21.44,0:55:22.10,Default,,0000,0000,0000,,work out Dialogue: 0,0:55:22.10,0:55:25.58,Default,,0000,0000,0000,,is this one, which Dialogue: 0,0:55:25.58,0:55:28.45,Default,,0000,0000,0000,,is how do you optimize W of Alpha Dialogue: 0,0:55:28.45,0:55:32.20,Default,,0000,0000,0000,,with respect to the two parameters that you just chose to update Dialogue: 0,0:55:32.20,0:55:34.73,Default,,0000,0000,0000,,and subject to the constraints? I'll do Dialogue: 0,0:55:34.73,0:55:37.46,Default,,0000,0000,0000,,that in a second. 
Dialogue: 0,0:55:37.46,0:55:38.90,Default,,0000,0000,0000,, Dialogue: 0,0:55:38.90,0:55:41.13,Default,,0000,0000,0000,,You would keep running this algorithm Dialogue: 0,0:55:41.13,0:55:43.37,Default,,0000,0000,0000,,until you have satisfied Dialogue: 0,0:55:43.37,0:55:44.69,Default,,0000,0000,0000,,these Dialogue: 0,0:55:44.69,0:55:49.47,Default,,0000,0000,0000,,convergence criteria up to Epsilon. Dialogue: 0,0:55:49.47,0:55:53.51,Default,,0000,0000,0000,,What I want to do now is describe how to do this [inaudible] " Dialogue: 0,0:55:53.51,0:55:55.76,Default,,0000,0000,0000,,how to optimize W of Alpha Dialogue: 0,0:55:55.76,0:55:57.76,Default,,0000,0000,0000,,with respect to Dialogue: 0,0:55:57.76,0:56:00.50,Default,,0000,0000,0000,,Alpha I and Alpha J, Dialogue: 0,0:56:00.50,0:56:05.49,Default,,0000,0000,0000,,and it turns out that it's because you can do this extremely efficiently Dialogue: 0,0:56:05.49,0:56:08.45,Default,,0000,0000,0000,,that the SMO algorithm works well. The [inaudible] for the SMO Dialogue: 0,0:56:08.45,0:56:11.55,Default,,0000,0000,0000,,algorithm can be done extremely efficiently, so it may take a large number of Dialogue: 0,0:56:11.55,0:56:18.55,Default,,0000,0000,0000,,iterations to converge, but each iteration is very cheap. Let's talk about that. Dialogue: 0,0:56:33.45,0:56:40.45,Default,,0000,0000,0000,, Dialogue: 0,0:56:41.75,0:56:45.68,Default,,0000,0000,0000,,So in order to derive that step where we update in respect to Alpha I and Dialogue: 0,0:56:45.68,0:56:46.53,Default,,0000,0000,0000,,Alpha J, Dialogue: 0,0:56:46.53,0:56:50.85,Default,,0000,0000,0000,,I'm actually going to use Alpha one and Alpha two as my example. I'm gonna Dialogue: 0,0:56:50.85,0:56:51.88,Default,,0000,0000,0000,,update Dialogue: 0,0:56:51.88,0:56:56.61,Default,,0000,0000,0000,,Alpha one and Alpha two. In general, this could be any Alpha I and Alpha J, but Dialogue: 0,0:56:56.61,0:56:59.66,Default,,0000,0000,0000,,just to make my notation on the board easier, I'm going to derive the Dialogue: 0,0:56:59.66,0:57:02.21,Default,,0000,0000,0000,,derivation for Alpha one and Alpha two Dialogue: 0,0:57:02.21,0:57:07.27,Default,,0000,0000,0000,,and the general [inaudible] completely analogous. Dialogue: 0,0:57:07.27,0:57:08.87,Default,,0000,0000,0000,, Dialogue: 0,0:57:08.87,0:57:13.26,Default,,0000,0000,0000,,On every step of the algorithm with respect to constraint, that Dialogue: 0,0:57:13.26,0:57:17.13,Default,,0000,0000,0000,,sum over I Alpha I YI is equal to zero. This is one of the Dialogue: 0,0:57:17.13,0:57:18.96,Default,,0000,0000,0000,,constraints we had previously for Dialogue: 0,0:57:18.96,0:57:22.57,Default,,0000,0000,0000,,our dual optimization problem. Dialogue: 0,0:57:22.57,0:57:25.75,Default,,0000,0000,0000,,This means that Alpha one Y1 plus Alpha Dialogue: 0,0:57:25.75,0:57:28.16,Default,,0000,0000,0000,,two Y2 must be equal to this, Dialogue: 0,0:57:28.16,0:57:31.32,Default,,0000,0000,0000,,to Dialogue: 0,0:57:31.32,0:57:38.32,Default,,0000,0000,0000,, Dialogue: 0,0:57:41.18,0:57:43.03,Default,,0000,0000,0000,,which I'm going to denote Dialogue: 0,0:57:43.03,0:57:46.26,Default,,0000,0000,0000,,by Zeta. Dialogue: 0,0:57:46.26,0:57:48.12,Default,,0000,0000,0000,, Dialogue: 0,0:57:48.12,0:57:49.45,Default,,0000,0000,0000,,So Dialogue: 0,0:57:49.45,0:57:51.95,Default,,0000,0000,0000,,we also have the constraint Dialogue: 0,0:57:51.95,0:57:53.44,Default,,0000,0000,0000,,that the Alpha Is Dialogue: 0,0:57:53.44,0:57:57.29,Default,,0000,0000,0000,,must be between zero and C. 
We had two constraints on our dual. This was one of the constraints, and this was the other one. In pictures, the constraint that the alpha i's be between zero and C is often called the box constraint, and so if I draw alpha 1 and alpha 2, then I have a constraint that the values of alpha 1 and alpha 2 must lie within this box that ranges from zero to C. And so the picture of the algorithm is as follows. I have some constraint that alpha 1 y1 plus alpha 2 y2 must be equal to zeta, and so this implies that alpha 1 must be equal to zeta minus alpha 2 y2, divided by y1, and so what I want to do is optimize the objective with respect to this.
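In symbols, the picture just described is the following; since y(1) is plus or minus one, dividing by y(1) is the same as multiplying by it.

\alpha_{1}y^{(1)} + \alpha_{2}y^{(2)} = \zeta, \qquad
0 \le \alpha_{1},\alpha_{2} \le C
\quad\Longrightarrow\quad
\alpha_{1} = \left(\zeta - \alpha_{2}y^{(2)}\right)y^{(1)},

so along the feasible segment, the line intersected with the box, W can be treated as a function of alpha 2 alone.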
Dialogue: 0,0:59:45.21,0:59:48.59,Default,,0000,0000,0000,,What I can do is plug in my definition for Alpha one as a function of Alpha two Dialogue: 0,0:59:48.59,0:59:51.35,Default,,0000,0000,0000,,and so this becomes W of Dialogue: 0,0:59:51.35,0:59:55.47,Default,,0000,0000,0000,,Alpha one must be equal to Zeta minus Alpha two Y2 Dialogue: 0,0:59:55.47,0:59:56.53,Default,,0000,0000,0000,,over Dialogue: 0,0:59:56.53,1:00:01.65,Default,,0000,0000,0000,,Y1, Alpha two, Alpha three Dialogue: 0,1:00:01.65,1:00:03.14,Default,,0000,0000,0000,,and so on, Dialogue: 0,1:00:03.14,1:00:06.72,Default,,0000,0000,0000,,and it turns out that because W is a quadratic function " if you look back to our Dialogue: 0,1:00:06.72,1:00:10.67,Default,,0000,0000,0000,,earlier definition of W, you find it's a quadratic function in all the Alphas " Dialogue: 0,1:00:10.67,1:00:13.41,Default,,0000,0000,0000,,it turns out that if you look at this expression for W Dialogue: 0,1:00:13.41,1:00:16.71,Default,,0000,0000,0000,,and if you view it as just a function of Alpha two, Dialogue: 0,1:00:16.71,1:00:21.53,Default,,0000,0000,0000,,you find that this is a one dimensional quadratic function of Alpha two Dialogue: 0,1:00:21.53,1:00:25.05,Default,,0000,0000,0000,,if you hold Alpha three, Alpha four and so on fixed, Dialogue: 0,1:00:25.05,1:00:26.44,Default,,0000,0000,0000,,and so Dialogue: 0,1:00:26.44,1:00:30.37,Default,,0000,0000,0000,,this can be simplified to some expression of the form A Alpha two squared Dialogue: 0,1:00:30.37,1:00:32.36,Default,,0000,0000,0000,,plus B Alpha two plus Dialogue: 0,1:00:32.36,1:00:33.23,Default,,0000,0000,0000,,C. This Dialogue: 0,1:00:33.23,1:00:37.51,Default,,0000,0000,0000,,is a standard quadratic function. Dialogue: 0,1:00:37.51,1:00:40.47,Default,,0000,0000,0000,,This is really easy to optimize. We know how to optimize Dialogue: 0,1:00:40.47,1:00:41.63,Default,,0000,0000,0000,," Dialogue: 0,1:00:41.63,1:00:43.82,Default,,0000,0000,0000,,when did we learn this? This was high school Dialogue: 0,1:00:43.82,1:00:47.97,Default,,0000,0000,0000,,or undergrad or something. You know how to optimize quadratic functions like these. Dialogue: 0,1:00:47.97,1:00:49.93,Default,,0000,0000,0000,,You just do that and that gives you the Dialogue: 0,1:00:49.93,1:00:52.45,Default,,0000,0000,0000,,optimal value for Alpha two. Dialogue: 0,1:00:52.45,1:00:54.74,Default,,0000,0000,0000,,The last step with a Dialogue: 0,1:00:54.74,1:00:59.09,Default,,0000,0000,0000,,Bosk constraint like this " just in pictures, Dialogue: 0,1:00:59.09,1:01:02.47,Default,,0000,0000,0000,,you know your solution must lie on this line, and so there'll be some Dialogue: 0,1:01:02.47,1:01:04.83,Default,,0000,0000,0000,,sort of quadratic function Dialogue: 0,1:01:04.83,1:01:07.74,Default,,0000,0000,0000,,over this line, say, Dialogue: 0,1:01:07.74,1:01:10.08,Default,,0000,0000,0000,,and so if you minimize the quadratic function, Dialogue: 0,1:01:10.08,1:01:13.97,Default,,0000,0000,0000,,maybe you get a value that lies in the box, and if so, then you're done. Dialogue: 0,1:01:13.97,1:01:17.23,Default,,0000,0000,0000,,Or if your quadratic function looks like this, Dialogue: 0,1:01:17.23,1:01:20.98,Default,,0000,0000,0000,,maybe when you optimize your quadratic function, you may end up with a value outside, so you end up with Dialogue: 0,1:01:20.98,1:01:22.72,Default,,0000,0000,0000,,a solution like that. 
If Dialogue: 0,1:01:22.72,1:01:25.94,Default,,0000,0000,0000,,that happens, you clip your solution Dialogue: 0,1:01:25.94,1:01:29.59,Default,,0000,0000,0000,,just to map it back inside the box. Dialogue: 0,1:01:29.59,1:01:34.77,Default,,0000,0000,0000,,That'll give you the optimal solution of this quadratic optimization problem Dialogue: 0,1:01:34.77,1:01:35.68,Default,,0000,0000,0000,,subject to Dialogue: 0,1:01:35.68,1:01:37.08,Default,,0000,0000,0000,,your solution Dialogue: 0,1:01:37.08,1:01:39.54,Default,,0000,0000,0000,,satisfying this box constraint Dialogue: 0,1:01:39.54,1:01:42.59,Default,,0000,0000,0000,,and lying on this straight line " in other words, subject to Dialogue: 0,1:01:42.59,1:01:48.74,Default,,0000,0000,0000,,the solution lying on this line segment within the box. Dialogue: 0,1:01:48.74,1:01:53.18,Default,,0000,0000,0000,, Dialogue: 0,1:01:53.18,1:01:57.63,Default,,0000,0000,0000,,Having solved the Alpha two this way, you can clip it if necessary Dialogue: 0,1:01:57.63,1:02:00.43,Default,,0000,0000,0000,,to get it back within the box constraint Dialogue: 0,1:02:00.43,1:02:01.07,Default,,0000,0000,0000,,and then Dialogue: 0,1:02:01.07,1:02:04.13,Default,,0000,0000,0000,,we have Alpha one as a function of Alpha two Dialogue: 0,1:02:04.13,1:02:06.49,Default,,0000,0000,0000,,and this allows you to Dialogue: 0,1:02:06.49,1:02:09.84,Default,,0000,0000,0000,,optimize W with respect to Alpha one and Alpha two quickly, Dialogue: 0,1:02:09.84,1:02:11.42,Default,,0000,0000,0000,,subject to all the constraints, Dialogue: 0,1:02:11.42,1:02:14.77,Default,,0000,0000,0000,,and the key step is really just sort of one dequadratic optimization, which Dialogue: 0,1:02:14.77,1:02:16.44,Default,,0000,0000,0000,,we do very quickly, Dialogue: 0,1:02:16.44,1:02:23.44,Default,,0000,0000,0000,,which is what makes the inner loop of the SMO algorithm very efficient. Student:You mentioned here that Dialogue: 0,1:02:24.96,1:02:31.96,Default,,0000,0000,0000,,we can Dialogue: 0,1:02:32.53,1:02:34.33,Default,,0000,0000,0000,,change Dialogue: 0,1:02:34.33,1:02:37.75,Default,,0000,0000,0000,,whatever, but the SMO Dialogue: 0,1:02:37.75,1:02:42.88,Default,,0000,0000,0000,,algorithm, we can Dialogue: 0,1:02:42.88,1:02:44.74,Default,,0000,0000,0000,,change two at a Dialogue: 0,1:02:44.74,1:02:46.02,Default,,0000,0000,0000,,time, Dialogue: 0,1:02:46.02,1:02:49.06,Default,,0000,0000,0000,,so how is that [inaudible] understand that. Instructor (Andrew Ng):Right. Let's say I want to change Dialogue: 0,1:02:49.06,1:02:52.47,Default,,0000,0000,0000,," as I run optimization algorithm, I Dialogue: 0,1:02:52.47,1:02:54.83,Default,,0000,0000,0000,,have to respect the constraint that Dialogue: 0,1:02:54.83,1:02:56.06,Default,,0000,0000,0000,,sum over Dialogue: 0,1:02:56.06,1:02:57.90,Default,,0000,0000,0000,,I Alpha I YI Dialogue: 0,1:02:57.90,1:03:00.34,Default,,0000,0000,0000,,must be equal to zero, Dialogue: 0,1:03:00.34,1:03:07.34,Default,,0000,0000,0000,,so this is a linear constraint that I didn't have when I was talking about [inaudible] ascent. Dialogue: 0,1:03:07.49,1:03:09.25,Default,,0000,0000,0000,,Suppose I tried Dialogue: 0,1:03:09.25,1:03:13.26,Default,,0000,0000,0000,,to change just Alpha one. 
Then I know that alpha 1 must be equal to minus the sum from i equals two to m of alpha i yi, divided by y1, right, and so alpha 1 can actually be written exactly as a function of alpha 2, alpha 3 and so on through alpha m. And so if I hold alpha 2, alpha 3, alpha 4 through alpha m fixed, then I can't change alpha 1, because alpha 1 is completely determined by the other alphas. Whereas in contrast, if I choose to change alpha 1 and alpha 2 at the same time, then I still have the constraint, and so I know alpha 1 and alpha 2 must satisfy that linear constraint, but at least this way I can change alpha 1 if I also change alpha 2 accordingly to make sure the constraint stays satisfied. Student:[Inaudible]. Instructor (Andrew Ng):So zeta was defined [inaudible]. On each iteration, I have some setting of the parameters alpha 1, alpha 2, alpha 3 and so on, and I want to change alpha 1 and alpha 2, say. So from the previous iteration, let's say I had not violated the constraint, so that still holds true, and so I'm just defining zeta to be equal to this, because alpha 1 y1 plus alpha 2 y2 must be equal to minus the sum from i equals three to m of alpha i yi, and so I'm just defining that quantity to be zeta. Student:[Inaudible]. Instructor (Andrew Ng):On every iteration, you change maybe a different pair of alphas to update. The way you do this is something I don't want to talk about.
I'll say a couple more words about Dialogue: 0,1:05:10.98,1:05:11.73,Default,,0000,0000,0000,,that, but Dialogue: 0,1:05:11.73,1:05:14.47,Default,,0000,0000,0000,,the basic outline of the algorithm is Dialogue: 0,1:05:14.47,1:05:18.16,Default,,0000,0000,0000,,on every iteration of the algorithm, you're going to select some Alpha I and Dialogue: 0,1:05:18.16,1:05:19.39,Default,,0000,0000,0000,,Alpha J to update Dialogue: 0,1:05:19.39,1:05:23.13,Default,,0000,0000,0000,,like on this board. So that's an Alpha I and an Alpha J to update Dialogue: 0,1:05:23.13,1:05:24.74,Default,,0000,0000,0000,,via some [inaudible] Dialogue: 0,1:05:24.74,1:05:27.84,Default,,0000,0000,0000,,and then you use the procedure I just described to Dialogue: 0,1:05:27.84,1:05:31.05,Default,,0000,0000,0000,,actually update Dialogue: 0,1:05:31.05,1:05:33.88,Default,,0000,0000,0000,,Alpha I and Alpha J. What I actually just talked about Dialogue: 0,1:05:33.88,1:05:35.39,Default,,0000,0000,0000,,was the procedure Dialogue: 0,1:05:35.39,1:05:39.70,Default,,0000,0000,0000,,to optimize W with respect to Alpha I and Alpha J. I didn't actually talk about the Dialogue: 0,1:05:39.70,1:05:46.70,Default,,0000,0000,0000,,[inaudible] for choosing Alpha Dialogue: 0,1:05:48.70,1:05:52.29,Default,,0000,0000,0000,,I and Alpha J. Student:What is the function MW? Instructor (Andrew Ng):MW is way up Dialogue: 0,1:05:52.29,1:05:53.84,Default,,0000,0000,0000,,there. I'll just write it again. W of Dialogue: 0,1:05:53.84,1:06:00.20,Default,,0000,0000,0000,,Alpha is that function we had previously. W of Alpha was the sum over I Dialogue: 0,1:06:00.20,1:06:02.95,Default,,0000,0000,0000,," this is about solving the " Dialogue: 0,1:06:02.95,1:06:09.60,Default,,0000,0000,0000,,it was that thing. Dialogue: 0,1:06:09.60,1:06:12.87,Default,,0000,0000,0000,,All of this is about solving the optimization problem for the SVM, so this Dialogue: 0,1:06:12.87,1:06:19.47,Default,,0000,0000,0000,,is the objective function we had, so that's W of Alpha. Student:[Inaudible]? Exchanging one of the Alphas " optimizing Dialogue: 0,1:06:19.47,1:06:26.03,Default,,0000,0000,0000,,that one, you Dialogue: 0,1:06:26.03,1:06:27.19,Default,,0000,0000,0000,,can Dialogue: 0,1:06:27.19,1:06:28.64,Default,,0000,0000,0000,,make the Dialogue: 0,1:06:28.64,1:06:34.34,Default,,0000,0000,0000,,other one that you have to change work, Dialogue: 0,1:06:34.34,1:06:35.62,Default,,0000,0000,0000,,right? Instructor (Andrew Ng):What Dialogue: 0,1:06:35.62,1:06:39.93,Default,,0000,0000,0000,,do Dialogue: 0,1:06:39.93,1:06:43.98,Default,,0000,0000,0000,,you mean works? Student:It will get farther from its optimal. Instructor (Andrew Ng):Let me translate it differently. Dialogue: 0,1:06:43.98,1:06:46.89,Default,,0000,0000,0000,,What we're trying to do is we're trying to optimize the objective function W Dialogue: 0,1:06:46.89,1:06:48.24,Default,,0000,0000,0000,,of Alpha, Dialogue: 0,1:06:48.24,1:06:52.05,Default,,0000,0000,0000,,so the metric of progress that we care about is whether W of Alpha is getting Dialogue: 0,1:06:52.05,1:06:54.96,Default,,0000,0000,0000,,better on every iteration, Dialogue: 0,1:06:54.96,1:06:58.85,Default,,0000,0000,0000,,and so what is true for coordinate assent and for SMO is on every iteration; Dialogue: 0,1:06:58.85,1:07:02.93,Default,,0000,0000,0000,,W of Alpha can only increase. It may stay the same Dialogue: 0,1:07:02.93,1:07:04.90,Default,,0000,0000,0000,,or it may increase. It can't get worse. 
Dialogue: 0,1:07:04.90,1:07:08.83,Default,,0000,0000,0000,,It's true that eventually, the Alphas will converge at some value. It's true that Dialogue: 0,1:07:08.83,1:07:12.91,Default,,0000,0000,0000,,in intervening iterations, one of the Alphas may move further Dialogue: 0,1:07:12.91,1:07:16.53,Default,,0000,0000,0000,,away and then closer and further and closer to its final value, Dialogue: 0,1:07:16.53,1:07:20.09,Default,,0000,0000,0000,,but what we really care about is that W of Alpha is getting better Dialogue: 0,1:07:20.09,1:07:27.09,Default,,0000,0000,0000,,every time, which is true. Dialogue: 0,1:07:28.49,1:07:30.84,Default,,0000,0000,0000,,Just a couple more words on Dialogue: 0,1:07:30.84,1:07:33.72,Default,,0000,0000,0000,,SMO before I wrap up on this. Dialogue: 0,1:07:33.72,1:07:38.18,Default,,0000,0000,0000,,One is that John Platt's original Dialogue: 0,1:07:38.18,1:07:39.85,Default,,0000,0000,0000,,algorithm talked about a [inaudible] Dialogue: 0,1:07:39.85,1:07:43.87,Default,,0000,0000,0000,,for choosing which values or pairs, Alpha I and Alpha J, to update Dialogue: 0,1:07:43.87,1:07:46.65,Default,,0000,0000,0000,,next Dialogue: 0,1:07:46.65,1:07:50.17,Default,,0000,0000,0000,,is one of those things that's not conceptually complicated but it's very Dialogue: 0,1:07:50.17,1:07:51.01,Default,,0000,0000,0000,,complicated Dialogue: 0,1:07:51.01,1:07:53.33,Default,,0000,0000,0000,,to explain in words. I won't talk about that here. Dialogue: 0,1:07:53.33,1:07:56.37,Default,,0000,0000,0000,,If you want to learn about it, Dialogue: 0,1:07:56.37,1:07:58.54,Default,,0000,0000,0000,,go ahead and look up John Platt's paper on the Dialogue: 0,1:07:58.54,1:08:00.18,Default,,0000,0000,0000,, Dialogue: 0,1:08:00.18,1:08:06.24,Default,,0000,0000,0000,,SMO algorithm. The [inaudible] is pretty easy to read, Dialogue: 0,1:08:06.24,1:08:11.33,Default,,0000,0000,0000,,and later on, we'll also posting a handout on the Dialogue: 0,1:08:11.33,1:08:13.39,Default,,0000,0000,0000,,course homepage Dialogue: 0,1:08:13.39,1:08:14.20,Default,,0000,0000,0000,,with Dialogue: 0,1:08:14.20,1:08:18.12,Default,,0000,0000,0000,,some of a simplified version of this [inaudible] that you can use in Dialogue: 0,1:08:18.12,1:08:20.17,Default,,0000,0000,0000,,problems. Dialogue: 0,1:08:20.17,1:08:24.23,Default,,0000,0000,0000,,You can see some of the process readings in more details. Dialogue: 0,1:08:24.23,1:08:28.58,Default,,0000,0000,0000,,One other thing that I didn't talk about was how to update the parameter B. So Dialogue: 0,1:08:28.58,1:08:30.90,Default,,0000,0000,0000,,this is solving all your Alphas. Dialogue: 0,1:08:30.90,1:08:34.02,Default,,0000,0000,0000,,This is also the Alpha that allows us to get W. Dialogue: 0,1:08:34.02,1:08:38.06,Default,,0000,0000,0000,,The other thing I didn't talk about was how to compute the parameter B, and it turns out that again is actually not Dialogue: 0,1:08:38.06,1:08:39.93,Default,,0000,0000,0000,,very difficult. Dialogue: 0,1:08:39.93,1:08:42.51,Default,,0000,0000,0000,,I'll let you read about that yourself with the Dialogue: 0,1:08:42.51,1:08:45.05,Default,,0000,0000,0000,,notes that we'll post Dialogue: 0,1:08:45.05,1:08:52.05,Default,,0000,0000,0000,,along with the next Dialogue: 0,1:08:52.23,1:08:55.59,Default,,0000,0000,0000,,problems. 
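As a summary of the inner step described above, here is a simplified sketch of the two-variable update, in the spirit of the simplified SMO handout mentioned a moment ago rather than Platt's full algorithm. It is my own sketch: eta and the error terms E1, E2 (the current prediction errors, f(x) minus y, on the two chosen examples) come from writing W along the feasible segment as a one-dimensional quadratic; the pair-selection heuristic and the update of the parameter b are omitted.

def smo_pair_update(alpha1, alpha2, y1, y2, E1, E2, K11, K12, K22, C):
    # One SMO inner step: maximize W along the feasible segment for (alpha1, alpha2),
    # holding all other alphas fixed. K11, K12, K22 are kernel values on the pair.
    # Endpoints of the feasible segment: the line intersected with the box [0, C]^2.
    if y1 != y2:
        L, H = max(0.0, alpha2 - alpha1), min(C, C + alpha2 - alpha1)
    else:
        L, H = max(0.0, alpha1 + alpha2 - C), min(C, alpha1 + alpha2)
    eta = 2.0 * K12 - K11 - K22      # second derivative of W along the segment (normally < 0)
    if eta >= 0 or L == H:
        return alpha1, alpha2        # degenerate case: skip this pair
    alpha2_new = alpha2 - y2 * (E1 - E2) / eta     # unconstrained maximizer of the 1-D quadratic
    alpha2_new = min(H, max(L, alpha2_new))        # clip back onto the segment inside the box
    alpha1_new = alpha1 + y1 * y2 * (alpha2 - alpha2_new)   # keep alpha1*y1 + alpha2*y2 = zeta
    return alpha1_new, alpha2_new

An outer loop would repeatedly pick pairs by some heuristic, apply this update, and stop once the convergence conditions above hold to within a small tolerance.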
To wrap up today's lecture, what I want to do is just tell you briefly about a couple of examples of applications of SVMs. Let's consider the problem of handwritten digit recognition. In handwritten digit recognition, you're given a pixel array with a scanned image of, say, a zip code that somebody has written. This is an array of pixels, and some of these pixels will be on and other pixels will be off. This combination of pixels being on maybe represents the character one. The question is, given an input feature vector like this, can you classify the digit? If you have, say, ten pixels by ten pixels, then you have a hundred dimensional feature vector, so you have 100 features, and maybe these are binary features where each x is zero or one, or maybe the x's are grayscale values corresponding to how dark each of these pixels was. It turns out that for many years, there was a neural network that was the champion algorithm for handwritten digit recognition. And it turns out that you can apply an SVM with the following kernel: either the polynomial kernel or the Gaussian kernel works fine for this problem, and just by writing down this kernel and throwing an SVM at it, an SVM gave performance comparable to the very best neural networks. A small sketch of this kind of setup appears just below.
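For a rough sense of what this looks like in practice, here is a short sketch using scikit-learn; it is my own illustration rather than something from the lecture, and the data set, kernel settings and hyperparameter values are arbitrary.

from sklearn import datasets, model_selection, svm

digits = datasets.load_digits()      # 8x8 grayscale digit images, flattened to 64 features
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    digits.data, digits.target, test_size=0.3, random_state=0)

for name, clf in [("polynomial", svm.SVC(kernel="poly", degree=3, C=1.0)),
                  ("Gaussian (RBF)", svm.SVC(kernel="rbf", gamma=0.001, C=10.0))]:
    clf.fit(X_train, y_train)                        # train on the pixel vectors directly
    print(name, "test accuracy:", clf.score(X_test, y_test))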
This is surprising because the support vector machine doesn't take into account any knowledge about the layout of the pixels; in particular, it doesn't know that this pixel is next to that pixel, because it's just representing the pixel intensity values as a vector. And so this means the performance of the SVM would be the same even if you were to shuffle all the pixels around. [Inaudible] let's say comparable to the very best neural networks, which had been under very careful development for many years. I want to tell you about one other cool example, which is that SVMs are also used to classify other fairly esoteric objects. So for example, let's say you want to classify protein sequences into different classes of proteins. Every time I do this, I suspect that the biologists in the room cringe, so I apologize for that. There are 20 amino acids, and proteins in our bodies are made up of sequences of amino acids. Even though there are 20 amino acids and 26 letters in the alphabet, I'm going to denote amino acids by the letters A through Z, with apologies to the biologists. Here's an amino acid sequence, represented by a series of letters. So suppose I want to assign this protein to one of a few classes depending on what type of protein it is.
The Dialogue: 0,1:13:27.79,1:13:29.78,Default,,0000,0000,0000,,question is how Dialogue: 0,1:13:29.78,1:13:32.64,Default,,0000,0000,0000,,do I construct my feature Dialogue: 0,1:13:32.64,1:13:36.59,Default,,0000,0000,0000,,vector? This is challenging for many reasons, one of which is that protein Dialogue: 0,1:13:36.59,1:13:39.89,Default,,0000,0000,0000,,sequences can be of different lengths. There are some very long protein Dialogue: 0,1:13:39.89,1:13:41.90,Default,,0000,0000,0000,,sequences and some very short ones, Dialogue: 0,1:13:41.90,1:13:44.59,Default,,0000,0000,0000,,and so you can't have a feature saying what is Dialogue: 0,1:13:44.59,1:13:48.87,Default,,0000,0000,0000,,the amino acid in the 100th position, because maybe there is no 100th Dialogue: 0,1:13:48.87,1:13:53.05,Default,,0000,0000,0000,,position in this protein sequence. Some are very long. Some are very short. Dialogue: 0,1:13:53.05,1:13:55.44,Default,,0000,0000,0000,,Here's my feature representation, Dialogue: 0,1:13:55.44,1:13:56.93,Default,,0000,0000,0000,,which is Dialogue: 0,1:13:56.93,1:13:59.72,Default,,0000,0000,0000,,I'm going to write down Dialogue: 0,1:13:59.72,1:14:05.27,Default,,0000,0000,0000,,all possible combinations of four alphabets. I'm going to write down AAAA, AAAB, AAAC Dialogue: 0,1:14:05.27,1:14:07.22,Default,,0000,0000,0000,,down Dialogue: 0,1:14:07.22,1:14:09.42,Default,,0000,0000,0000,,to Dialogue: 0,1:14:09.42,1:14:11.47,Default,,0000,0000,0000,,AAAZ and then Dialogue: 0,1:14:11.47,1:14:13.98,Default,,0000,0000,0000,,AABA Dialogue: 0,1:14:13.98,1:14:20.17,Default,,0000,0000,0000,,and so on. You get the idea. Write down Dialogue: 0,1:14:20.17,1:14:23.76,Default,,0000,0000,0000,,all Dialogue: 0,1:14:23.76,1:14:27.07,Default,,0000,0000,0000,,possible combinations of Dialogue: 0,1:14:27.07,1:14:28.33,Default,,0000,0000,0000,,four alphabets Dialogue: 0,1:14:28.33,1:14:30.56,Default,,0000,0000,0000,,and my feature representation will be Dialogue: 0,1:14:30.56,1:14:33.57,Default,,0000,0000,0000,,I'm going to scan through this sequence of amino acids Dialogue: 0,1:14:33.57,1:14:38.60,Default,,0000,0000,0000,,and count how often each of these subsequences occur. So for example, Dialogue: 0,1:14:38.60,1:14:40.34,Default,,0000,0000,0000,,BAJT occurs twice and so I'll put Dialogue: 0,1:14:40.34,1:14:42.75,Default,,0000,0000,0000,,a two there, and Dialogue: 0,1:14:42.75,1:14:44.53,Default,,0000,0000,0000,,none of these sequences occur, so I'll put Dialogue: 0,1:14:44.53,1:14:46.17,Default,,0000,0000,0000,,a zero there. I guess Dialogue: 0,1:14:46.17,1:14:50.01,Default,,0000,0000,0000,,I have a one here Dialogue: 0,1:14:50.01,1:14:54.43,Default,,0000,0000,0000,,and a zero there. This very long vector Dialogue: 0,1:14:54.43,1:14:57.30,Default,,0000,0000,0000,,will be my feature representation for protein. Dialogue: 0,1:14:57.30,1:15:00.47,Default,,0000,0000,0000,,This representation applies no matter how long my protein sequence is. Dialogue: 0,1:15:00.47,1:15:02.73,Default,,0000,0000,0000,,How Dialogue: 0,1:15:02.73,1:15:05.25,Default,,0000,0000,0000,,large is this? 
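Here is a small sketch of this feature construction; it is my own illustration, with made-up example sequences, storing only the length-four substrings that actually occur and computing the inner product between two sequences directly from those counts.

from collections import Counter

def kmer_counts(seq, k=4):
    # Count how often each length-k substring (e.g. "BAJT") occurs in the sequence.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def kmer_kernel(seq_a, seq_b, k=4):
    # Inner product of the two implicit count vectors, computed using only the
    # substrings that actually appear, never the full high dimensional vectors.
    ca, cb = kmer_counts(seq_a, k), kmer_counts(seq_b, k)
    return sum(ca[s] * cb[s] for s in ca if s in cb)

# e.g. kmer_kernel("BAJTSRQBAJT", "BAJTKLM") counts the shared four-letter substrings.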
Dialogue: 0,1:15:05.25,1:15:08.11,Default,,0000,0000,0000,,Well, it turns out this is going to be in R to the 20
Dialogue: 0,1:15:08.11,1:15:12.07,Default,,0000,0000,0000,,to the fourth,
Dialogue: 0,1:15:12.07,1:15:16.09,Default,,0000,0000,0000,,and so you have a 160,000
Dialogue: 0,1:15:16.09,1:15:18.36,Default,,0000,0000,0000,,dimensional feature vector,
Dialogue: 0,1:15:18.36,1:15:21.43,Default,,0000,0000,0000,,which is reasonably large, even by modern computer standards.
Dialogue: 0,1:15:21.43,1:15:23.91,Default,,0000,0000,0000,,Clearly, we don't want to explicitly represent
Dialogue: 0,1:15:23.91,1:15:26.07,Default,,0000,0000,0000,,these high
Dialogue: 0,1:15:26.07,1:15:30.03,Default,,0000,0000,0000,,dimensional feature vectors. Imagine you have 1,000 examples and you store this as double
Dialogue: 0,1:15:30.03,1:15:34.28,Default,,0000,0000,0000,,[inaudible]. Even on modern day computers, this is big.
Dialogue: 0,1:15:34.28,1:15:38.43,Default,,0000,0000,0000,,It turns out that there's a
Dialogue: 0,1:15:38.43,1:15:42.09,Default,,0000,0000,0000,,dynamic programming algorithm that can efficiently
Dialogue: 0,1:15:42.09,1:15:45.26,Default,,0000,0000,0000,,compute inner products between these feature vectors, and so we can apply this
Dialogue: 0,1:15:45.26,1:15:48.21,Default,,0000,0000,0000,,feature representation, even though it's a ridiculously high
Dialogue: 0,1:15:48.21,1:15:49.63,Default,,0000,0000,0000,,dimensional feature vector,
Dialogue: 0,1:15:49.63,1:15:52.50,Default,,0000,0000,0000,,to classify protein sequences.
Dialogue: 0,1:15:52.50,1:15:56.58,Default,,0000,0000,0000,,I won't talk about the [inaudible] algorithm. If any of you have seen
Dialogue: 0,1:15:56.58,1:16:00.85,Default,,0000,0000,0000,,the [inaudible] algorithm for finding subsequences, it's kind of
Dialogue: 0,1:16:00.85,1:16:02.31,Default,,0000,0000,0000,,reminiscent of that. You
Dialogue: 0,1:16:02.31,1:16:05.29,Default,,0000,0000,0000,,can look those up if you're interested.
Dialogue: 0,1:16:05.29,1:16:08.18,Default,,0000,0000,0000,,This is just another example of a cool kernel, and
Dialogue: 0,1:16:08.18,1:16:13.23,Default,,0000,0000,0000,,more generally, if you're faced with some new machine-learning problem, sometimes you can
Dialogue: 0,1:16:13.23,1:16:15.75,Default,,0000,0000,0000,,choose a standard kernel like a Gaussian kernel,
Dialogue: 0,1:16:15.75,1:16:19.58,Default,,0000,0000,0000,,and sometimes there are research papers written on how to
Dialogue: 0,1:16:19.58,1:16:21.41,Default,,0000,0000,0000,,come up with a new kernel
Dialogue: 0,1:16:21.41,1:16:27.52,Default,,0000,0000,0000,,for a new problem.
Dialogue: 0,1:16:27.52,1:16:29.59,Default,,0000,0000,0000,,Two last sentences I want
Dialogue: 0,1:16:29.59,1:16:33.71,Default,,0000,0000,0000,,to say. Where are we now? That wraps up SVMs, which many
Dialogue: 0,1:16:33.71,1:16:38.38,Default,,0000,0000,0000,,people would consider one of the most effective off-the-shelf learning algorithms,
Dialogue: 0,1:16:38.38,1:16:42.23,Default,,0000,0000,0000,,and so as of today, you've actually seen a lot of learning algorithms. I want to
Dialogue: 0,1:16:42.23,1:16:45.69,Default,,0000,0000,0000,,close this class by saying congrats. You're now well qualified to actually go and apply learning algorithms
Dialogue: 0,1:16:45.69,1:16:48.17,Default,,0000,0000,0000,,to a lot of problems.
Dialogue: 0,1:16:48.17,1:16:52.51,Default,,0000,0000,0000,,We're still in week four of the quarter, so there's more to come.
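Before moving on, one more illustrative sketch, not from the lecture and not the dynamic programming algorithm it alludes to: it only shows the narrower point that the inner product of two of these count vectors can be computed from the four-letter substrings that actually occur, so the 160,000-dimensional vectors are never built explicitly. The helper names are assumptions for this example.

from collections import Counter

def kmer_counts(sequence, k=4):
    # Count how often each contiguous length-k window occurs in the sequence.
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

def kmer_kernel(seq_a, seq_b, k=4):
    # Inner product of the two count vectors, computed sparsely: only k-mers
    # that appear in both sequences contribute, so the 20**k-dimensional
    # feature vectors are never materialized.
    counts_a, counts_b = kmer_counts(seq_a, k), kmer_counts(seq_b, k)
    if len(counts_b) < len(counts_a):  # iterate over the smaller dictionary
        counts_a, counts_b = counts_b, counts_a
    return sum(c * counts_b[kmer] for kmer, c in counts_a.items())

# 'CDEF' occurs twice in each of these toy sequences, so this prints 2 * 2 = 4,
# which equals the dot product of the corresponding dense feature vectors.
print(kmer_kernel("ACDEFACDEF", "CDEFGHCDEF"))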
Dialogue: 0,1:16:52.51,1:16:56.89,Default,,0000,0000,0000,,In particular, what I want to do next is talk about how to really understand the learning algorithms and
Dialogue: 0,1:16:56.89,1:16:58.86,Default,,0000,0000,0000,,when they work well and when they work poorly
Dialogue: 0,1:16:58.86,1:17:01.89,Default,,0000,0000,0000,,and to take the tools which you now have and really talk about how you can use
Dialogue: 0,1:17:01.89,1:17:04.45,Default,,0000,0000,0000,,them really well. We'll start to do that in the next lecture. Thanks.