This presentation is delivered by the Stanford Center for Professional Development.

Instructor (Andrew Ng): What I want to do today is continue our discussion of Naive Bayes, the learning algorithm that I started to discuss in the previous lecture, and talk about a couple of different event models for Naive Bayes. Then I'll take a brief digression to talk about neural networks, which is something that I actually won't spend a lot of time on, and then I want to start to talk about support vector machines. Support vector machines is the supervised learning algorithm that many people consider the most effective off-the-shelf supervised learning algorithm. That point of view is debatable, but there are many people that hold that point of view, and we'll start discussing that today; this will actually take us a few lectures to complete.

So let's talk about Naive Bayes. To recap from the previous lecture, I started off describing spam classification as the most [inaudible] example for Naive Bayes, in which we would create feature vectors like these that correspond to words in a dictionary.
And so, based on what words appear in a piece of email, the email was represented as a feature vector with ones and zeros in the corresponding places. Naive Bayes was a generative learning algorithm, and by that I mean it's an algorithm in which we model P(x|y), and for Naive Bayes specifically, we modeled it as the product from i equals one to n of P(xi|y), and we also model P(y). Then we use Bayes' rule to combine these two together, and so for our predictions, when you give it a new piece of email and you want to tell if it's spam or not spam, you predict the arg max over y of P(y|x), which by Bayes' rule is the arg max over y of P(x|y) times P(y), okay?

So this is Naive Bayes, and just to draw attention to two things: one is that in this model, each of our features was zero or one, indicating whether different words appear, and the length n of the feature vector was the number of words in the dictionary. So it might be on the order of 50,000 words, say.

What I want to do now is describe two variations on this algorithm. The first one is the simpler one, which is just a generalization to the case where xi takes on more values.
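[A minimal Python sketch, not from the lecture, of the binary-feature prediction rule just recapped - arg max over y of P(y) times the product of P(xi|y); the tiny three-word dictionary and parameter values are made-up assumptions for illustration.]

```python
import numpy as np

# Illustrative (made-up) parameters for a 3-word dictionary.
# phi_x1[j] = P(x_j = 1 | y = 1), phi_x0[j] = P(x_j = 1 | y = 0), phi_y = P(y = 1).
phi_x1 = np.array([0.8, 0.1, 0.5])
phi_x0 = np.array([0.05, 0.3, 0.5])
phi_y = 0.3

def predict(x):
    """Return arg max over y of P(x|y) P(y) for a binary feature vector x."""
    # Work in log space to avoid underflow when multiplying many probabilities.
    log_p1 = np.log(phi_y) + np.sum(x * np.log(phi_x1) + (1 - x) * np.log(1 - phi_x1))
    log_p0 = np.log(1 - phi_y) + np.sum(x * np.log(phi_x0) + (1 - x) * np.log(1 - phi_x0))
    return 1 if log_p1 > log_p0 else 0

print(predict(np.array([1, 0, 1])))  # classify one example email
```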
So one thing that's commonly done is to apply Naive Bayes to problems where some of these features xi take on K values rather than just two values, and in that case you build a very similar model where P(x|y) is really the same thing, except that now these are going to be multinomial probabilities rather than Bernoullis, because the xi's can take on up to K values. It turns out that one situation where this arises very commonly is if you have a feature that's actually continuous-valued, and you choose to discretize it - you take a continuous-valued feature and discretize it into a finite set of K values. A perfect example, if you remember our very first supervised learning problem of predicting the price of houses, is the classification version of that problem: based on features of a house, you want to predict whether or not the house will be sold in the next six months, say.
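[A minimal Python sketch of this kind of discretization, applied to a living-area feature as in the example he describes next; the exact square-footage thresholds here are illustrative assumptions, not the values written on the board.]

```python
import numpy as np

# Illustrative bucket boundaries in square feet (the board uses slightly different cutoffs).
bins = np.array([500, 1500, 2000])

def discretize(living_area):
    """Map a continuous living area to a discrete feature value in {1, 2, 3, 4}."""
    # np.digitize returns 0..len(bins); shift by 1 so the values start at 1.
    return int(np.digitize(living_area, bins)) + 1

print([discretize(a) for a in (400, 1200, 1800, 2600)])  # -> [1, 2, 3, 4]
```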
That's a classification problem, and if you want to use Naive Bayes, then given a continuous-valued feature like the living area, one pretty common thing to do is to take the continuous-valued living area and just discretize it into a few discrete buckets. So depending on whether the living area of the house is less than 500 square feet, or between 1,000 and 1,500 square feet, and so on, or greater than 2,000 square feet, you choose the value of the corresponding feature xi to be one, two, three, or four, okay? So that was the first variation, or generalization, of Naive Bayes I wanted to talk about. I should just check; are there questions about this? Okay. Cool. And it turns out that in practice it's fairly common to use about ten buckets to discretize a continuous-valued feature; I drew four here only to save on writing.

The second and final variation that I want to talk about for Naive Bayes is a variation that's specific to classifying text documents or, more generally, to classifying sequences. A text document, like a piece of email, you can think of as a sequence of words, and you can apply the model I'm about to describe to classifying other sequences as well, but let me just focus on text. And here's the idea.
So in Naive Bayes, given a piece of email, we were representing it using this binary-valued vector representation, and one of the things this loses, for instance, is the number of times that different words appear, all right? So, for example, if some word appears a lot of times - say you see the word "buy" a lot of times, or you see the word "Viagra," which seems to be a common email example, a ton of times in the email - it is more likely to be spam than if it appears only once, although even once, I guess, is enough. So let me describe a different, what's called an event model, for Naive Bayes that will take into account the number of times a word appears in the email. And to give the previous model a name as well: that particular model for text classification is called the Multivariate Bernoulli Event Model. It's not a great name; don't worry about what the name means. It refers to the fact that there are multiple Bernoulli random variables, but really, don't worry about what the name means. In contrast, what I want to do now is describe a different representation for email in terms of the feature vector, and this is called the Multinomial Event Model. Again, there is a rationale behind the name, but it's slightly cryptic, so don't worry about why it's called the Multinomial Event Model; it's just called that.
And here's what we're gonna do: given a piece of email, I'm going to represent my email as a feature vector, and so my ith training example x(i) will be a feature vector x(i) subscript one, x(i) subscript two, up to x(i) subscript ni, where ni is equal to the number of words in this email, right? So if one of my training examples is an email with 300 words in it, then I represent this email via a feature vector with 300 elements, and each of these elements of the feature vector - let's see, let me just write this as x subscript j - will be an index into my dictionary, okay? And so if my dictionary has 50,000 words, then each position in my feature vector will be a variable that takes on one of 50,000 possible values, corresponding to what word appeared in the jth position of my email, okay? So, in other words, I'm gonna take all the words in my email, and you have a feature vector that just says which word in my dictionary each word in the email was, okay? So there's a different definition for ni now: ni now varies and is different for every training example, and this xj is now an index into the dictionary. The components of the feature vector are no longer binary random variables; they're these indices into the dictionary that take on a much larger set of values.
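[A minimal Python sketch of turning an email into this sequence-of-dictionary-indices representation; the tiny dictionary and the choice to silently drop unknown words are illustrative assumptions, not details from the lecture.]

```python
# Tiny illustrative dictionary: word -> index (a real one might have ~50,000 entries).
dictionary = {"buy": 1, "viagra": 2, "meeting": 3, "tomorrow": 4, "discount": 5}

def email_to_indices(email_text):
    """Represent an email as (x_1, ..., x_n), where each x_j is an index into the dictionary."""
    words = email_text.lower().split()
    # For this sketch, drop any word that isn't in the dictionary.
    return [dictionary[w] for w in words if w in dictionary]

x = email_to_indices("Buy Viagra at a discount")
print(x, "n_i =", len(x))  # e.g. [1, 2, 5] n_i = 3
```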
And so our generative model for this will be that the joint distribution over x and y is p(x, y) = p(y) times the product from j equals one to n of p(xj | y), where again n is now the length of the email, all right? So the way to think about this formula is that you imagine there's some probability distribution over emails - some random process that generates the emails - and that process proceeds as follows. First y is chosen, first the class label: whether someone is going to send you spam email or non-spam email is chosen for us. So first the random variable y, the class label of spam or not spam, is generated, and then, having decided whether they'll send you spam or not spam, someone iterates over all 300 positions of the email, or however many words are going to compose the email, and generates words from some distribution that depends on whether they chose to send you spam or not spam. So if they're sending you spam, they'll tend to generate words like buy, and Viagra, and whatever, discounts, sale, whatever. And if somebody chose to send you non-spam, then they'll send you, sort of, the more normal words you get in an email, okay? So just be careful, right? xi here has a very different definition than in the previous event model, and n has a very different definition than in the previous event model.

And so the parameters of this model are - let's see.
Phi subscript k given y equals one, which is the probability that, conditioned on someone deciding to send you spam, the next word they choose to put in the spam email is going to be word k; and similarly phi subscript k given y equals zero - well, I'll just write it out, I guess - and phi y, which is just the same as before, okay? So these are the parameters of the model, and given a training set, you can work out the maximum likelihood estimates of the parameters. So the maximum likelihood estimate of phi k given y equals one will be equal to - and now I'm gonna write one of these big indicator-function things again - a sum over your training set of the indicator that email i was spam, times the sum over all the words in that email - where n subscript i is the number of words in email i in your training set - of the indicator that the jth word of email i equals k, all divided by the term I'll write next, okay? So the numerator says: sum over all your emails and take into account only the emails that had class label one, only the emails that were spam, because if y equals zero then this indicator is zero and the term goes away; and then sum over all the words in your spam email, counting up the number of times you observed the word k in your spam emails. So, in other words, the numerator says: look at all the spam emails in your training set and count up the total number of times the word k appeared in those emails.
The denominator then is a sum over i in your training set where, whenever one of your examples is spam, you sum up the length of that spam email, and so the denominator is the total length of all of the spam emails in your training set. And so the ratio is just: out of all your spam emails, what is the fraction of words that were word k? And that's your estimate for the probability of the next piece of spam mail generating the word k in any given position, okay?

At the end of the previous lecture, I talked about Laplace smoothing, and when you do that here as well, you add one to the numerator and K to the denominator, and this is the Laplace-smoothed estimate of this parameter, okay? And similarly, you can work out the estimates for the other parameters yourself, okay? So it's very similar. Yeah?

Student: I'm sorry, on the right on the top, I was just wondering what the x of i is, and what the n of -

Instructor (Andrew Ng): Right. So in this second event model, the definition for xi and the definition for n are different, right? So here - well, this is for one example (x, y). So here, n is the number of words in a given email, right?
And if it's the ith email, I subscript it, so this is n subscript i, and so n will be different for different training examples. And here xi will take on these values from 1 to 50,000, and xi is essentially the identity of the ith word in a given piece of email, okay? So that's why this is a product over all the different words of your email of the probability of the ith word in your email, conditioned on y. Yeah?

Student: [Off mic].

Instructor (Andrew Ng): Oh, no, actually, you know what, I apologize. I just realized that I overloaded the notation, right; I shouldn't have used K here. Let me use a different alphabet and see if that makes sense; does that make sense? Oh, you know what, I'm sorry, you're absolutely right. Thank you. All right. So in Laplace smoothing, that shouldn't be K; this should be 50,000, if you have 50,000 words in your dictionary. Yeah, thanks. Great. I stole notation from the previous lecture and didn't translate it properly. So in Laplace smoothing, this is the number of possible values that the random variable xi can take on. Cool. Raise your hand if this makes sense? Okay. Some of you - are there more questions about this? Yeah.

Student: On Laplace smoothing, the thing you add in the denominator is the number of values that y could take?

Instructor (Andrew Ng): Yeah, let's see.
So Laplace smoothing is a method to give you, hopefully, better estimates of the probability distribution over a multinomial. And so - was I using X or Y in the previous lecture? So in trying to estimate the probability over a multinomial - I think X and Y are different. I think - was it X or Y? I think it was X, actually. Well - oh, I see, right, right, I think I was using a different definition for the random variable Y. Because suppose you have a multinomial random variable X which takes on - let's use a different alphabet - suppose you have a multinomial random variable X which takes on L different values. Then the maximum likelihood estimate for the probability of X, P of X equals k, will be equal to the number of observations of X equals k divided by the total number of observations of X, okay? So that's the maximum likelihood estimate. And to add Laplace smoothing to this, you add one to the numerator, and you add L to the denominator, where L is the number of possible values that X can take on. So, in this case, this is the probability that X equals k, and X can take on 50,000 values if 50,000 is the length of your dictionary - it may be something else - but that's why I add 50,000 to the denominator.
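[A minimal Python sketch of that Laplace-smoothed multinomial estimate; the observation counts are made up for illustration.]

```python
import numpy as np

def laplace_smoothed_estimate(observations, num_values):
    """Estimate P(X = k) for a multinomial X taking values 1..num_values.

    The maximum likelihood estimate would be count(X = k) / len(observations);
    Laplace smoothing adds 1 to each numerator and num_values to the denominator.
    """
    counts = np.bincount(observations, minlength=num_values + 1)[1:]  # counts for k = 1..num_values
    return (counts + 1) / (len(observations) + num_values)

# Made-up observations of a variable that can take 5 values; value 4 never occurs,
# but its smoothed probability is still nonzero.
obs = [1, 2, 2, 3, 3, 3, 5]
print(laplace_smoothed_estimate(obs, 5))
```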
Are there other questions? Yeah.

Student: Is there a specific definition for the maximum likelihood estimate of a parameter? We've talked about it a couple of times, and all the examples make sense, but I don't know what the general formula for it is.

Instructor (Andrew Ng): I see, yeah, right. So the question is, what's the definition of the maximum likelihood estimate? So actually, in today's lecture and the previous lecture, when I talked about Gaussian Discriminant Analysis, I was, sort of, throwing out the maximum likelihood estimates on the board without proving them. The way to actually work these out is to actually write down the likelihood. So the way to figure out all of these maximum likelihood estimates is to write down the likelihood of the parameters - phi k given y equals zero, phi k given y equals one, and phi y, right? And so given a training set, the likelihood - I guess I should be writing the log likelihood - will be the log of the product over your training examples i of p of x(i), y(i), parameterized by these things, okay? Where p of x(i), y(i) is given by the product from j equals one to ni of p of xj(i) given y(i) - they're parameterized by, well, I'll just drop the parameters to write this more simply - oh, I'll just put them in - times p of y(i), okay?
So this is my log likelihood, and the way you get the maximum likelihood estimate of the parameters is: given a fixed training set, given a fixed set of x(i), y(i)'s, you maximize this in terms of these parameters, and then you get the maximum likelihood estimates that I've been writing out. So in a previous section of today's lecture I wrote out some maximum likelihood estimates for the Gaussian Discriminant Analysis model and for Naive Bayes, and I didn't prove them, but you get to play with that yourself in a homework problem as well for one of these models, and you'll be able to verify that when you maximize the likelihood - maximize the log likelihood - you do, hopefully, get the same formulas as what I was writing up on the board. But the way these are derived is by maximizing this, okay? Cool. All right.

So that, more or less, wraps up what I wanted to say about Naive Bayes, and it turns out that for text classification, this last event model, the multinomial event model, almost always does better than the first Naive Bayes model I talked about when you're applying it to the specific case of text classification, and one of the reasons hypothesized for this is that this second model, the multinomial event model, takes into account the number of times a word appears in a document, whereas the former model doesn't.
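[A minimal, self-contained Python sketch of the multinomial event model as described: fit phi subscript k given y by counting word occurrences with Laplace smoothing, then predict by the arg max of log P(y) plus the sum of log P(word | y). The toy training set and tiny vocabulary size are illustrative assumptions, not data from the lecture.]

```python
import numpy as np

V = 6  # vocabulary size (50,000 in the lecture; tiny here for illustration)

# Toy training set: each email is a list of word indices in 1..V, with label 1 = spam.
emails = [[1, 2, 2, 5], [1, 5, 5], [3, 4, 6], [3, 3, 4]]
labels = [1, 1, 0, 0]

def fit(emails, labels):
    """Laplace-smoothed estimates of phi_{k|y=1}, phi_{k|y=0}, and phi_y."""
    counts = {0: np.zeros(V), 1: np.zeros(V)}
    lengths = {0: 0, 1: 0}
    for words, y in zip(emails, labels):
        for k in words:
            counts[y][k - 1] += 1      # numerator: occurrences of word k in class-y emails
        lengths[y] += len(words)       # denominator: total words in class-y emails
    phi = {y: (counts[y] + 1) / (lengths[y] + V) for y in (0, 1)}  # add 1, add V (Laplace)
    phi_y = np.mean(labels)
    return phi, phi_y

def predict(words, phi, phi_y):
    """arg max over y of log P(y) + sum_j log P(x_j | y)."""
    scores = {}
    for y, prior in ((1, phi_y), (0, 1 - phi_y)):
        scores[y] = np.log(prior) + sum(np.log(phi[y][k - 1]) for k in words)
    return max(scores, key=scores.get)

phi, phi_y = fit(emails, labels)
print(predict([1, 5, 5, 5], phi, phi_y))  # likely 1: spam-associated words, repeated
```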
I should say that, in truth, it actually turns out not to be completely understood why the latter model does better than the former one for text classification, and researchers are still debating about it a little bit. But if you ever have a text classification problem, Naive Bayes is probably not, by far, the best learning algorithm out there, but it is relatively straightforward to implement, and it's a very good algorithm to try if you have a text classification problem, okay? Still a question? Yeah.

Student: So the second model is still position-invariant, right? It doesn't actually care where the words are.

Instructor (Andrew Ng): Yes, all right.

Student: And, I mean, would a model that [did take word position into account] usually do better if you have enough data?

Instructor (Andrew Ng): Yeah, so the question is about the second model, right? The second model, the multinomial event model, actually doesn't care about the ordering of the words; you can shuffle all the words in the email, and it does exactly the same thing. So in natural language processing there's actually another name for it; it's called a unigram model in natural language processing, and there are some other models, like, say, higher-order Markov models, that take into account some of the ordering of the words. It turns out that for text classification, models like the bigram models or trigram models, I believe, do only very slightly better, if at all - but that's when you're applying them to text classification, okay? All right.
So the next thing I want to talk about is to start our discussion of non-linear classifiers. So it turns out - well, the very first classification algorithm we talked about was logistic regression, which had the following form for its hypothesis, h subscript theta of x equals g of theta transpose x, and you can think of this as predicting one when this estimated probability is greater than or equal to 0.5, and predicting zero when this is less than 0.5, right? And given a training set, logistic regression will maybe run gradient descent or something, or use Newton's method, to find a straight line that reasonably separates the positive and negative classes. But sometimes a data set just can't be separated by a straight line, so is there an algorithm that will let you start to learn these sorts of non-linear decision boundaries? And so how do you go about getting a non-linear classifier? And, by the way, one cool result is - remember, when we talked about generative learning algorithms, I said that if you assume x given y equals one is exponential family with natural parameter eta one, and x given y equals zero is exponential family with natural parameter eta zero, and you build a generative learning algorithm using this, right,
I think when we talked about Gaussian Discriminant Analysis, I said that if this holds true, then you end up with a logistic posterior. It actually turns out that a Naive Bayes model falls into this as well - the Naive Bayes model actually falls into this exponential family form as well - and, therefore, under the Naive Bayes model, you're really also using a linear classifier, okay? So the question is, how can you start to get non-linear classifiers? And I'm going to talk about one method today - we started to talk about it very briefly - which is taking a simpler algorithm, like logistic regression, and using it to build up to more complex non-linear classifiers, okay?

So to motivate this discussion, I'm going to use a little picture - let's see. Suppose you have features x1, x2, and x3, and, by convention, I'm gonna follow our earlier convention that x0 is set to one, and I'm gonna use a little diagram like this to denote our logistic regression unit, okay? So think of this little circle as denoting a computation node that takes as input several features and then outputs another number, h subscript theta of x, given by a sigmoid function, and this little computational unit will have parameters theta.
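[A tiny Python sketch of one such logistic regression unit; the particular feature values and weights are made up for illustration.]

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_unit(x, theta):
    """One computation node: output g(theta^T x), with x[0] = 1 as the intercept term."""
    return sigmoid(theta @ x)

x = np.array([1.0, 0.5, -1.2, 2.0])      # x0 = 1, then features x1, x2, x3
theta = np.array([0.1, 0.8, -0.3, 0.4])  # illustrative parameters
print(logistic_unit(x, theta))           # a number in (0, 1)
```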
Now, in order to get non-linear decision boundaries, all we need to do - well, at least one thing to do - is just come up with a way to represent hypotheses that can output non-linear decision boundaries, right? And so, when you put together a bunch of those little pictures that I drew on the previous board, you can then get what's called a neural network, in which you think of having my features here, and then I feed them to, say, a few of these little sigmoidal units, and these together feed into yet another sigmoidal unit, say, which outputs my final output, h subscript theta of x, okay? And just to give these things names, let me call the values output by these three intermediate sigmoidal units a1, a2, and a3. And let me be completely concrete about what this formula represents, right? Each of these units in the middle will have its own associated set of parameters, and so the value a1 will be computed as g of x transpose times some set of parameters, which I'll write as theta one; similarly, a2 will be computed as g of x transpose theta two, and a3 will be g of x transpose theta three, where g is the sigmoid function, all right?
So g of z equals one over one plus e to the negative z, and then, finally, our hypothesis will output g of a transpose theta four, right? Where this a vector is a vector of a1, a2, a3; we can append another one to it at the front if you want, okay? Let me just draw this up here - I'm sorry about the cluttered board. And so h subscript theta of x is a function of all the parameters theta one through theta four, and so one way to learn parameters for this model is to write down a cost function, say J of theta equals one-half the sum from i equals one to m of y(i) minus h subscript theta of x(i), squared. Okay, so that's our familiar quadratic cost function, and one way to learn the parameters of an algorithm like this is to just use gradient descent to minimize J of theta as a function of theta, okay? So you use gradient descent to minimize this squared error, which, stated differently, means you use gradient descent to make the predictions of your neural network as close as possible to what you observed as the labels in your training set, okay?
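[A minimal Python sketch of the forward pass and quadratic cost just described - three sigmoidal hidden units feeding one output unit. The random parameter initialization and toy data are illustrative assumptions, and the gradient step itself (backpropagation, discussed next) is omitted.]

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, thetas):
    """Forward pass: a_k = g(theta_k^T x) for the hidden units, then h = g(theta_4^T [1, a])."""
    a = np.array([sigmoid(th @ x) for th in thetas[:3]])  # hidden units a1, a2, a3
    a = np.concatenate(([1.0], a))                         # append a one at the front
    return sigmoid(thetas[3] @ a)                          # output h_theta(x)

def cost(X, y, thetas):
    """J(theta) = 1/2 * sum_i (y_i - h_theta(x_i))^2, the quadratic cost from the board."""
    preds = np.array([forward(x, thetas) for x in X])
    return 0.5 * np.sum((y - preds) ** 2)

rng = np.random.default_rng(0)
thetas = [rng.normal(size=4) for _ in range(4)]               # theta1..theta4, illustrative
X = np.array([[1.0, 0.2, -0.5, 1.3], [1.0, -1.0, 0.7, 0.1]])  # each row: x0 = 1, then x1..x3
y = np.array([1.0, 0.0])
print(cost(X, y, thetas))
```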
So it turns out gradient descent on this neural network has a specific name; the Dialogue: 0,0:32:34.52,0:32:35.06,Default,,0000,0000,0000,,algorithm Dialogue: 0,0:32:35.06,0:32:38.79,Default,,0000,0000,0000,,that implements gradient descent is called back propagation, and so if you ever hear that, all that Dialogue: 0,0:32:38.79,0:32:41.24,Default,,0000,0000,0000,,means is - it just means gradient descent on Dialogue: 0,0:32:41.24,0:32:45.17,Default,,0000,0000,0000,,a cost function like this or a variation of this on the neural network Dialogue: 0,0:32:45.17,0:32:47.91,Default,,0000,0000,0000,,that looks like that, Dialogue: 0,0:32:47.91,0:32:50.22,Default,,0000,0000,0000,,and - Dialogue: 0,0:32:50.22,0:32:54.36,Default,,0000,0000,0000,,well, this algorithm actually has some advantages and disadvantages, but let me Dialogue: 0,0:32:54.36,0:32:57.83,Default,,0000,0000,0000,,actually show you. So, let's see. Dialogue: 0,0:32:57.83,0:33:00.73,Default,,0000,0000,0000,,One of the interesting things about the neural network is that Dialogue: 0,0:33:00.73,0:33:05.43,Default,,0000,0000,0000,,you can look at what these intermediate nodes are computing, right? So Dialogue: 0,0:33:05.43,0:33:08.27,Default,,0000,0000,0000,,this neural network has what's called Dialogue: 0,0:33:08.27,0:33:11.61,Default,,0000,0000,0000,,a hidden layer before you then have the output layer, Dialogue: 0,0:33:11.61,0:33:14.41,Default,,0000,0000,0000,,and, more generally, you can actually have inputs feed Dialogue: 0,0:33:14.41,0:33:16.28,Default,,0000,0000,0000,,into these Dialogue: 0,0:33:16.28,0:33:19.69,Default,,0000,0000,0000,,computation units, feed into more layers of computation units, to even more layers, to Dialogue: 0,0:33:19.69,0:33:23.37,Default,,0000,0000,0000,,more layers, and then finally you have an output layer at the end. Dialogue: 0,0:33:23.37,0:33:26.97,Default,,0000,0000,0000,,And one cool thing you can do is look at all of these intermediate units, look at Dialogue: 0,0:33:26.97,0:33:28.28,Default,,0000,0000,0000,,these Dialogue: 0,0:33:28.28,0:33:30.98,Default,,0000,0000,0000,,units in what's called a hidden layer of the neural network. Don't Dialogue: 0,0:33:30.98,0:33:32.58,Default,,0000,0000,0000,,worry about why it's called that. Look Dialogue: 0,0:33:32.58,0:33:34.49,Default,,0000,0000,0000,,at computations of the hidden unit Dialogue: 0,0:33:34.49,0:33:39.57,Default,,0000,0000,0000,,and ask what is the hidden unit of the neural network computing? So Dialogue: 0,0:33:39.57,0:33:42.90,Default,,0000,0000,0000,,to, maybe, get a better sense of what neural networks might be doing, let me show you a Dialogue: 0,0:33:42.90,0:33:44.58,Default,,0000,0000,0000,,video - I'm gonna switch to the laptop - this is made Dialogue: 0,0:33:44.58,0:33:46.25,Default,,0000,0000,0000,,by a Dialogue: 0,0:33:46.25,0:33:49.36,Default,,0000,0000,0000,,friend, Yann LeCun Dialogue: 0,0:33:49.36,0:33:52.42,Default,,0000,0000,0000,,who's currently a professor at New York University. Dialogue: 0,0:33:52.42,0:33:56.29,Default,,0000,0000,0000,,Can I show a video on the laptop? Dialogue: 0,0:33:56.29,0:33:58.60,Default,,0000,0000,0000,,So let me show you a video Dialogue: 0,0:33:58.60,0:34:00.44,Default,,0000,0000,0000,,from Yann Dialogue: 0,0:34:00.44,0:34:06.33,Default,,0000,0000,0000,,LeCun on a neural network that he developed for handwritten digit recognition.
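The back propagation equations themselves are not spelled out in the lecture, so the sketch below only illustrates the outer loop being described: repeatedly nudging theta downhill on J(theta). It approximates the gradient numerically, which is slow but avoids guessing at details; treat it purely as an illustration of "gradient descent on this cost function," with the names numerical_grad and gradient_descent being mine.

```python
import numpy as np

def numerical_grad(J, theta, eps=1e-5):
    # Finite-difference approximation of the gradient of J at theta.
    # Back propagation computes the same quantity efficiently via the chain rule.
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        e = np.zeros_like(theta)
        e[j] = eps
        grad[j] = (J(theta + e) - J(theta - e)) / (2.0 * eps)
    return grad

def gradient_descent(J, theta0, alpha=0.1, iters=1000):
    # The update rule: theta := theta - alpha * dJ/dtheta, repeated many times.
    theta = theta0.copy()
    for _ in range(iters):
        theta = theta - alpha * numerical_grad(J, theta)
    return theta
```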
Dialogue: 0,0:34:06.33,0:34:10.11,Default,,0000,0000,0000,,There was one other thing he did in this neural network that I'm not gonna talk about called Dialogue: 0,0:34:10.11,0:34:12.39,Default,,0000,0000,0000,,a Convolutional Neural Network Dialogue: 0,0:34:12.39,0:34:16.01,Default,,0000,0000,0000,,that - well, Dialogue: 0,0:34:16.01,0:34:18.81,Default,,0000,0000,0000,,his system is called LeNet, Dialogue: 0,0:34:18.81,0:34:22.57,Default,,0000,0000,0000,,and let's see. Would you put Dialogue: 0,0:34:22.57,0:34:23.36,Default,,0000,0000,0000,,on the Dialogue: 0,0:34:23.36,0:34:30.36,Default,,0000,0000,0000,,laptop display? Hum, Dialogue: 0,0:34:37.78,0:34:38.79,Default,,0000,0000,0000,,actually maybe if - Dialogue: 0,0:34:38.79,0:34:42.34,Default,,0000,0000,0000,,or you can just put on the screen on the side; that would work too Dialogue: 0,0:34:42.34,0:34:49.34,Default,,0000,0000,0000,,if the big screen isn't working. Dialogue: 0,0:35:12.25,0:35:13.73,Default,,0000,0000,0000,,Let's see. I'm Dialogue: 0,0:35:13.73,0:35:19.30,Default,,0000,0000,0000,,just trying to think, okay, how do I keep you guys entertained while we're waiting for the video to come Dialogue: 0,0:35:19.30,0:35:24.21,Default,,0000,0000,0000,,on? Well, let me say a few more things about neural networks. Dialogue: 0,0:35:24.21,0:35:25.53,Default,,0000,0000,0000,,So it turns out that Dialogue: 0,0:35:25.53,0:35:29.04,Default,,0000,0000,0000,,when you write a quadratic cost function like I Dialogue: 0,0:35:29.04,0:35:31.69,Default,,0000,0000,0000,,wrote down on the chalkboard just now, Dialogue: 0,0:35:31.69,0:35:34.64,Default,,0000,0000,0000,,it turns out that unlike logistic regression, Dialogue: 0,0:35:34.64,0:35:39.03,Default,,0000,0000,0000,,that will almost always correspond to a non-convex optimization problem, Dialogue: 0,0:35:39.03,0:35:42.71,Default,,0000,0000,0000,,and so whereas for logistic regression if you run gradient descent Dialogue: 0,0:35:42.71,0:35:44.28,Default,,0000,0000,0000,,or Newton's method or whatever, Dialogue: 0,0:35:44.28,0:35:46.00,Default,,0000,0000,0000,,you converge to the global optimum. Dialogue: 0,0:35:46.00,0:35:50.50,Default,,0000,0000,0000,,This is not true for neural networks. In general, there are lots of local optima Dialogue: 0,0:35:50.50,0:35:55.21,Default,,0000,0000,0000,,and it's, sort of, a much harder optimization problem. Dialogue: 0,0:35:55.21,0:35:56.16,Default,,0000,0000,0000,,So Dialogue: 0,0:35:56.16,0:35:59.70,Default,,0000,0000,0000,,neural networks, if you're, sort of, familiar with them, and you're good at making design choices Dialogue: 0,0:35:59.70,0:36:01.60,Default,,0000,0000,0000,,like what learning rate to use, and Dialogue: 0,0:36:01.60,0:36:04.32,Default,,0000,0000,0000,,how many hidden units to use, and so on, you can, Dialogue: 0,0:36:04.32,0:36:07.70,Default,,0000,0000,0000,,sort of, get them to be fairly effective, Dialogue: 0,0:36:07.70,0:36:09.11,Default,,0000,0000,0000,,and Dialogue: 0,0:36:09.11,0:36:12.91,Default,,0000,0000,0000,,there's, sort of, often ongoing debates about, you know, is this learning algorithm Dialogue: 0,0:36:12.91,0:36:13.82,Default,,0000,0000,0000,,better, or is that learning Dialogue: 0,0:36:13.82,0:36:17.87,Default,,0000,0000,0000,,algorithm better?
The vast majority of machine learning researchers today seem to perceive Dialogue: 0,0:36:17.87,0:36:20.18,Default,,0000,0000,0000,,support vector machines, which is what I'll talk about later, Dialogue: 0,0:36:20.18,0:36:24.35,Default,,0000,0000,0000,,to be a much more effective off-the-shelf learning algorithm than neural networks. Dialogue: 0,0:36:24.35,0:36:26.51,Default,,0000,0000,0000,,This point of view is contested Dialogue: 0,0:36:26.51,0:36:32.04,Default,,0000,0000,0000,,a bit, but so neural networks are not something that I personally use a lot right Dialogue: 0,0:36:32.04,0:36:35.67,Default,,0000,0000,0000,,now because it's a hard optimization problem and it doesn't always converge, Dialogue: 0,0:36:35.67,0:36:39.97,Default,,0000,0000,0000,,and it actually, sort of, works. It, sort of, works reasonably Dialogue: 0,0:36:39.97,0:36:41.74,Default,,0000,0000,0000,,well. It's just Dialogue: 0,0:36:41.74,0:36:43.70,Default,,0000,0000,0000,,because this is fairly complicated, Dialogue: 0,0:36:43.70,0:36:45.27,Default,,0000,0000,0000,,there's not an algorithm that Dialogue: 0,0:36:45.27,0:36:52.01,Default,,0000,0000,0000,,I use commonly or that my friends use all Dialogue: 0,0:36:52.01,0:36:56.02,Default,,0000,0000,0000,,the time. Oh, cool. So let me just go and show you an example of a neural network, which was for many years, Dialogue: 0,0:36:56.02,0:36:56.82,Default,,0000,0000,0000,, Dialogue: 0,0:36:56.82,0:37:00.40,Default,,0000,0000,0000,,you know, the most effective learning algorithm before support vector Dialogue: 0,0:37:00.40,0:37:02.03,Default,,0000,0000,0000,,machines were invented. Dialogue: 0,0:37:02.03,0:37:06.35,Default,,0000,0000,0000,,So here's Yann LeCun's video, and - Dialogue: 0,0:37:06.35,0:37:09.35,Default,,0000,0000,0000,,well, there's actually audio on this too, but I'll just tell you Dialogue: 0,0:37:09.35,0:37:11.21,Default,,0000,0000,0000,,what's happening. Dialogue: 0,0:37:11.21,0:37:11.88,Default,,0000,0000,0000,, Dialogue: 0,0:37:11.88,0:37:14.63,Default,,0000,0000,0000,,What you're seeing is a trained neural network, Dialogue: 0,0:37:14.63,0:37:18.55,Default,,0000,0000,0000,,and this display where my mouse pointer is pointing, right, this big three Dialogue: 0,0:37:18.55,0:37:19.43,Default,,0000,0000,0000,,there Dialogue: 0,0:37:19.43,0:37:21.19,Default,,0000,0000,0000,,is Dialogue: 0,0:37:21.19,0:37:24.80,Default,,0000,0000,0000,,the input to the neural network. So you're showing the neural network this image, and it's Dialogue: 0,0:37:24.80,0:37:26.86,Default,,0000,0000,0000,,trying to recognize what this is. Dialogue: 0,0:37:26.86,0:37:30.77,Default,,0000,0000,0000,,The final answer output by the neural network is this number up here, right Dialogue: 0,0:37:30.77,0:37:33.19,Default,,0000,0000,0000,,below where it says LeNet-5, Dialogue: 0,0:37:33.19,0:37:34.23,Default,,0000,0000,0000,,and the Dialogue: 0,0:37:34.23,0:37:37.76,Default,,0000,0000,0000,,neural network correctly recognizes this image as a three, Dialogue: 0,0:37:37.76,0:37:39.89,Default,,0000,0000,0000,,and if you look to the left of this image, Dialogue: 0,0:37:39.89,0:37:42.31,Default,,0000,0000,0000,,what's interesting about this is Dialogue: 0,0:37:42.31,0:37:46.64,Default,,0000,0000,0000,,the display on the left portion of this is actually showing the Dialogue: 0,0:37:46.64,0:37:49.100,Default,,0000,0000,0000,,intermediate computations of the neural network.
In other words, it's showing you Dialogue: 0,0:37:49.100,0:37:51.90,Default,,0000,0000,0000,,what are the hidden layers of the Dialogue: 0,0:37:51.90,0:37:53.83,Default,,0000,0000,0000,,neural network computing. Dialogue: 0,0:37:53.83,0:37:57.53,Default,,0000,0000,0000,,And so, for example, if you look at this one, the third image down from the top, Dialogue: 0,0:37:57.53,0:38:01.57,Default,,0000,0000,0000,,this seems to be computing, you know, certain edges in the digits, Dialogue: 0,0:38:01.57,0:38:05.63,Default,,0000,0000,0000,,right? It's just computing edges on the Dialogue: 0,0:38:05.63,0:38:07.49,Default,,0000,0000,0000,,right-hand side or the bottom or something Dialogue: 0,0:38:07.49,0:38:09.89,Default,,0000,0000,0000,,of the input display Dialogue: 0,0:38:09.89,0:38:11.48,Default,,0000,0000,0000,,of the input image, okay? Dialogue: 0,0:38:11.48,0:38:15.13,Default,,0000,0000,0000,,So let me just play this video, and you can see Dialogue: 0,0:38:15.13,0:38:17.02,Default,,0000,0000,0000,,some Dialogue: 0,0:38:17.02,0:38:20.62,Default,,0000,0000,0000,,of the inputs and outputs of the neural network, and Dialogue: 0,0:38:20.62,0:38:21.53,Default,,0000,0000,0000,,those Dialogue: 0,0:38:21.53,0:38:26.31,Default,,0000,0000,0000,,are very different fonts. There's this robustness to Dialogue: 0,0:38:26.31,0:38:30.67,Default,,0000,0000,0000,,noise. Dialogue: 0,0:38:30.67,0:38:36.93,Default,,0000,0000,0000,,All Dialogue: 0,0:38:36.93,0:38:37.88,Default,,0000,0000,0000,, Dialogue: 0,0:38:37.88,0:38:41.95,Default,,0000,0000,0000,, Dialogue: 0,0:38:41.95,0:38:43.00,Default,,0000,0000,0000,,right. Multiple Dialogue: 0,0:38:43.00,0:38:49.09,Default,,0000,0000,0000,,digits, Dialogue: 0,0:38:49.09,0:38:50.74,Default,,0000,0000,0000,, Dialogue: 0,0:38:50.74,0:38:57.74,Default,,0000,0000,0000,,that's, kind of, cool. All Dialogue: 0,0:38:59.26,0:39:06.26,Default,,0000,0000,0000,,right. Dialogue: 0,0:39:12.14,0:39:19.14,Default,,0000,0000,0000,, Dialogue: 0,0:39:30.78,0:39:35.79,Default,,0000,0000,0000,, Dialogue: 0,0:39:35.79,0:39:38.69,Default,,0000,0000,0000,, Dialogue: 0,0:39:38.69,0:39:40.46,Default,,0000,0000,0000,, Dialogue: 0,0:39:40.46,0:39:41.66,Default,,0000,0000,0000,, Dialogue: 0,0:39:41.66,0:39:48.66,Default,,0000,0000,0000,, Dialogue: 0,0:39:52.36,0:39:54.81,Default,,0000,0000,0000,,So, Dialogue: 0,0:39:54.81,0:39:58.22,Default,,0000,0000,0000,,just for fun, let me show you one more video, Dialogue: 0,0:39:58.22,0:40:00.58,Default,,0000,0000,0000,,which was - let's Dialogue: 0,0:40:00.58,0:40:06.09,Default,,0000,0000,0000,,see. This is another video from a series called The Machine That Changed the World, which Dialogue: 0,0:40:06.09,0:40:07.49,Default,,0000,0000,0000,,was produced by WGBH Dialogue: 0,0:40:07.49,0:40:09.06,Default,,0000,0000,0000,,Television Dialogue: 0,0:40:09.06,0:40:10.80,Default,,0000,0000,0000,,in Dialogue: 0,0:40:10.80,0:40:12.60,Default,,0000,0000,0000,,cooperation with the British Broadcasting Corporation, Dialogue: 0,0:40:12.60,0:40:16.16,Default,,0000,0000,0000,,and it was aired on PBS a few years ago, I think. Dialogue: 0,0:40:16.16,0:40:17.63,Default,,0000,0000,0000,,I want to show you a Dialogue: 0,0:40:17.63,0:40:19.59,Default,,0000,0000,0000,,video describing the NETtalk Dialogue: 0,0:40:19.59,0:40:23.63,Default,,0000,0000,0000,,Neural Network, which was developed by Terry Sejnowski; he's a Dialogue: 0,0:40:23.63,0:40:25.74,Default,,0000,0000,0000,,researcher.
Dialogue: 0,0:40:25.74,0:40:29.59,Default,,0000,0000,0000,,And so NETtalk was actually one of the major milestones in the Dialogue: 0,0:40:29.59,0:40:31.75,Default,,0000,0000,0000,,history of neural network, Dialogue: 0,0:40:31.75,0:40:33.49,Default,,0000,0000,0000,,and this specific application Dialogue: 0,0:40:33.49,0:40:36.40,Default,,0000,0000,0000,,is getting the neural network to read text. Dialogue: 0,0:40:36.40,0:40:38.78,Default,,0000,0000,0000,,So, in other words, can you show a Dialogue: 0,0:40:38.78,0:40:41.66,Default,,0000,0000,0000,,piece of English to a computer Dialogue: 0,0:40:41.66,0:40:43.42,Default,,0000,0000,0000,,and have the computer read, Dialogue: 0,0:40:43.42,0:40:46.41,Default,,0000,0000,0000,,sort of, verbally produce sounds that could respond Dialogue: 0,0:40:46.41,0:40:48.32,Default,,0000,0000,0000,,to the Dialogue: 0,0:40:48.32,0:40:50.63,Default,,0000,0000,0000,,reading of the text. Dialogue: 0,0:40:50.63,0:40:55.28,Default,,0000,0000,0000,,And it turns out that in the history of AI and the history of machine learning, Dialogue: 0,0:40:55.28,0:40:56.51,Default,,0000,0000,0000,,this video Dialogue: 0,0:40:56.51,0:41:00.85,Default,,0000,0000,0000,,created a lot of excitement about neural networks and about machine learning. Part of the Dialogue: 0,0:41:00.85,0:41:05.43,Default,,0000,0000,0000,,reason was that Terry Sejnowski had the foresight to choose Dialogue: 0,0:41:05.43,0:41:06.70,Default,,0000,0000,0000,,to use, Dialogue: 0,0:41:06.70,0:41:08.04,Default,,0000,0000,0000,,in his video, Dialogue: 0,0:41:08.04,0:41:12.94,Default,,0000,0000,0000,,a child-like voice talking about visiting your grandmother's house and so on. You'll Dialogue: 0,0:41:12.94,0:41:14.29,Default,,0000,0000,0000,,see it in a second, Dialogue: 0,0:41:14.29,0:41:16.03,Default,,0000,0000,0000,,and so Dialogue: 0,0:41:16.03,0:41:19.40,Default,,0000,0000,0000,,this really created the perception of - created Dialogue: 0,0:41:19.40,0:41:23.13,Default,,0000,0000,0000,,the impression of the neural network being like a young child learning how to speak, Dialogue: 0,0:41:23.13,0:41:25.26,Default,,0000,0000,0000,,and talking about going to your grandmothers, and so Dialogue: 0,0:41:25.26,0:41:27.38,Default,,0000,0000,0000,,on. So this actually Dialogue: 0,0:41:27.38,0:41:29.29,Default,,0000,0000,0000,,helped generate a lot of excitement Dialogue: 0,0:41:29.29,0:41:32.45,Default,,0000,0000,0000,,within academia and outside academia on neural networks, sort of, early in the history Dialogue: 0,0:41:32.45,0:41:36.11,Default,,0000,0000,0000,,of neural networks. I'm just gonna show you the video. Dialogue: 0,0:41:36.11,0:41:37.91,Default,,0000,0000,0000,,[Begin Video] You're going to hear first Dialogue: 0,0:41:37.91,0:41:41.56,Default,,0000,0000,0000,,what the network sounds like at the very beginning of the training, and it won't Dialogue: 0,0:41:41.56,0:41:43.59,Default,,0000,0000,0000,,sound like words, but it'll sound Dialogue: 0,0:41:43.59,0:41:48.01,Default,,0000,0000,0000,,like attempts that will get better and better with Dialogue: 0,0:41:48.01,0:41:54.24,Default,,0000,0000,0000,,time. [Computer's voice] The network Dialogue: 0,0:41:54.24,0:41:58.25,Default,,0000,0000,0000,,takes the letters, say the phrase, grandmother's house, and Dialogue: 0,0:41:58.25,0:42:02.68,Default,,0000,0000,0000,,makes a random attempt at pronouncing it. [Computer's Dialogue: 0,0:42:02.68,0:42:07.22,Default,,0000,0000,0000,,voice] Dialogue: 0,0:42:07.22,0:42:09.27,Default,,0000,0000,0000,,Grandmother's house. 
The Dialogue: 0,0:42:09.27,0:42:12.53,Default,,0000,0000,0000,,phonetic difference between the guess and the right pronunciation Dialogue: 0,0:42:12.53,0:42:16.54,Default,,0000,0000,0000,,is sent back through the network. [Computer's Dialogue: 0,0:42:16.54,0:42:18.99,Default,,0000,0000,0000,,voice] Dialogue: 0,0:42:18.99,0:42:20.33,Default,,0000,0000,0000,,Grandmother's house. Dialogue: 0,0:42:20.33,0:42:23.92,Default,,0000,0000,0000,,By adjusting the connection strengths after each attempt, Dialogue: 0,0:42:23.92,0:42:26.69,Default,,0000,0000,0000,,the net slowly improves. And, Dialogue: 0,0:42:26.69,0:42:30.03,Default,,0000,0000,0000,,finally, after letting it train overnight, Dialogue: 0,0:42:30.03,0:42:32.70,Default,,0000,0000,0000,,the next morning it sounds like this: Grandmother's house, I'd like to go to my grandmother's Dialogue: 0,0:42:32.70,0:42:37.76,Default,,0000,0000,0000,,house. Dialogue: 0,0:42:37.76,0:42:41.15,Default,,0000,0000,0000,,Well, because she gives us candy. Dialogue: 0,0:42:41.15,0:42:46.79,Default,,0000,0000,0000,,Well, and we - NETtalk understands nothing about the language. It is simply associating letters Dialogue: 0,0:42:46.79,0:42:49.81,Default,,0000,0000,0000,,with sounds. Dialogue: 0,0:42:49.81,0:42:52.36,Default,,0000,0000,0000,,[End Video] All right. So Dialogue: 0,0:42:52.36,0:42:54.38,Default,,0000,0000,0000,,at the time this was done, I mean, this is Dialogue: 0,0:42:54.38,0:42:57.38,Default,,0000,0000,0000,,an amazing piece of work. I should say Dialogue: 0,0:42:57.38,0:42:59.14,Default,,0000,0000,0000,,today there are other Dialogue: 0,0:42:59.14,0:43:03.39,Default,,0000,0000,0000,,text-to-speech systems that work better than what you just saw, Dialogue: 0,0:43:03.39,0:43:08.36,Default,,0000,0000,0000,,and you'll also appreciate that getting candy from your grandmother's house is a little bit Dialogue: 0,0:43:08.36,0:43:11.45,Default,,0000,0000,0000,,less impressive than talking about the Dow Jones falling 15 points, Dialogue: 0,0:43:11.45,0:43:16.15,Default,,0000,0000,0000,,and profit taking, whatever. So Dialogue: 0,0:43:16.15,0:43:18.79,Default,,0000,0000,0000,,but I wanted to show that just because that was another cool, Dialogue: 0,0:43:18.79,0:43:25.79,Default,,0000,0000,0000,,major landmark in the history of neural networks. Okay. Dialogue: 0,0:43:27.26,0:43:32.95,Default,,0000,0000,0000,,So let's switch back to the chalkboard, Dialogue: 0,0:43:32.95,0:43:35.86,Default,,0000,0000,0000,,and what I want to do next Dialogue: 0,0:43:35.86,0:43:39.48,Default,,0000,0000,0000,,is Dialogue: 0,0:43:39.48,0:43:42.37,Default,,0000,0000,0000,,tell you about Support Vector Machines, Dialogue: 0,0:43:42.37,0:43:49.37,Default,,0000,0000,0000,,okay? That, sort of, wraps up our discussion on neural Dialogue: 0,0:44:08.84,0:44:10.85,Default,,0000,0000,0000,,networks. So I started off talking about neural networks Dialogue: 0,0:44:10.85,0:44:13.27,Default,,0000,0000,0000,,by motivating it as a way to get Dialogue: 0,0:44:13.27,0:44:17.92,Default,,0000,0000,0000,,us to output non-linear classifiers, right? I didn't really prove it, but it turns out Dialogue: 0,0:44:17.92,0:44:18.42,Default,,0000,0000,0000,,that Dialogue: 0,0:44:18.42,0:44:21.80,Default,,0000,0000,0000,,you'd be able to come up with non-linear division boundaries using a neural Dialogue: 0,0:44:21.80,0:44:22.48,Default,,0000,0000,0000,,network Dialogue: 0,0:44:22.48,0:44:27.31,Default,,0000,0000,0000,,like what I drew on the chalkboard earlier.
Dialogue: 0,0:44:27.31,0:44:30.47,Default,,0000,0000,0000,,Support Vector Machines will be another learning algorithm that will give us a way Dialogue: 0,0:44:30.47,0:44:34.53,Default,,0000,0000,0000,,to come up with non-linear classifiers. There's a very effective, off-the-shelf Dialogue: 0,0:44:34.53,0:44:36.02,Default,,0000,0000,0000,,learning algorithm, Dialogue: 0,0:44:36.02,0:44:39.40,Default,,0000,0000,0000,,but it turns out that in the discussion I'm gonna - in Dialogue: 0,0:44:39.40,0:44:43.38,Default,,0000,0000,0000,,the progression and development I'm gonna pursue, I'm actually going to start Dialogue: 0,0:44:43.38,0:44:47.80,Default,,0000,0000,0000,,off by describing yet another class of linear classifiers with linear division Dialogue: 0,0:44:47.80,0:44:49.27,Default,,0000,0000,0000,,boundaries, Dialogue: 0,0:44:49.27,0:44:54.18,Default,,0000,0000,0000,,and only be later, sort of, in probably the next lecture or the one after that, Dialogue: 0,0:44:54.18,0:44:55.25,Default,,0000,0000,0000,,that we'll then Dialogue: 0,0:44:55.25,0:44:58.70,Default,,0000,0000,0000,,take the support vector machine idea and, sort of, do some clever things to it Dialogue: 0,0:44:58.70,0:45:02.96,Default,,0000,0000,0000,,to make it work very well to generate non-linear division boundaries as well, okay? Dialogue: 0,0:45:02.96,0:45:07.86,Default,,0000,0000,0000,,But we'll actually start by talking about linear classifiers a little bit more. Dialogue: 0,0:45:07.86,0:45:12.62,Default,,0000,0000,0000,,And to do that, I want to Dialogue: 0,0:45:12.62,0:45:14.56,Default,,0000,0000,0000,,convey two Dialogue: 0,0:45:14.56,0:45:16.69,Default,,0000,0000,0000,,intuitions about classification. Dialogue: 0,0:45:16.69,0:45:20.70,Default,,0000,0000,0000,,One is Dialogue: 0,0:45:20.70,0:45:24.45,Default,,0000,0000,0000,,you think about logistic regression; we have this logistic function that was outputting Dialogue: 0,0:45:24.45,0:45:28.67,Default,,0000,0000,0000,,the probability that Y equals one, Dialogue: 0,0:45:28.67,0:45:31.05,Default,,0000,0000,0000,,and it crosses Dialogue: 0,0:45:31.05,0:45:32.96,Default,,0000,0000,0000,,this line at zero. Dialogue: 0,0:45:32.96,0:45:37.82,Default,,0000,0000,0000,,So when you run logistic regression, I want you to think of it as Dialogue: 0,0:45:37.82,0:45:41.18,Default,,0000,0000,0000,,an algorithm that computes Dialogue: 0,0:45:41.18,0:45:43.24,Default,,0000,0000,0000,,theta transpose X, Dialogue: 0,0:45:43.24,0:45:46.79,Default,,0000,0000,0000,,and then it predicts Dialogue: 0,0:45:46.79,0:45:51.17,Default,,0000,0000,0000,,one, right, Dialogue: 0,0:45:51.17,0:45:56.36,Default,,0000,0000,0000,,if and only if, theta transpose X is greater than zero, right? IFF stands for if Dialogue: 0,0:45:56.36,0:45:59.54,Default,,0000,0000,0000,,and only if. It means the same thing as a double implication, Dialogue: 0,0:45:59.54,0:46:04.63,Default,,0000,0000,0000,,and it predicts zero, Dialogue: 0,0:46:04.63,0:46:11.63,Default,,0000,0000,0000,,if and only if, theta transpose X is less than zero, okay? So if Dialogue: 0,0:46:12.50,0:46:15.02,Default,,0000,0000,0000,,it's the case that Dialogue: 0,0:46:15.02,0:46:18.07,Default,,0000,0000,0000,,theta transpose X is much greater than zero, Dialogue: 0,0:46:18.07,0:46:22.31,Default,,0000,0000,0000,,the double greater than sign means these are much greater than, all right. 
Dialogue: 0,0:46:22.31,0:46:25.59,Default,,0000,0000,0000,,So if theta transpose X is much greater than zero, Dialogue: 0,0:46:25.59,0:46:26.74,Default,,0000,0000,0000,,then, Dialogue: 0,0:46:26.74,0:46:33.07,Default,,0000,0000,0000,,you know, think of that as a very confident Dialogue: 0,0:46:33.07,0:46:38.78,Default,,0000,0000,0000,,prediction Dialogue: 0,0:46:38.78,0:46:40.78,Default,,0000,0000,0000,,that Y is equal to one, Dialogue: 0,0:46:40.78,0:46:43.46,Default,,0000,0000,0000,,right? If theta transpose X is much greater than zero, then we're Dialogue: 0,0:46:43.46,0:46:47.05,Default,,0000,0000,0000,,gonna predict one then moreover we're very confident it's one, and the picture for Dialogue: 0,0:46:47.05,0:46:48.48,Default,,0000,0000,0000,,that is if theta Dialogue: 0,0:46:48.48,0:46:51.18,Default,,0000,0000,0000,,transpose X is way out here, then Dialogue: 0,0:46:51.18,0:46:54.33,Default,,0000,0000,0000,,we're estimating that the probability of Y being equal to one Dialogue: 0,0:46:54.33,0:46:56.09,Default,,0000,0000,0000,,on the sigmoid function, it will Dialogue: 0,0:46:56.09,0:46:58.70,Default,,0000,0000,0000,,be very close to one. Dialogue: 0,0:46:58.70,0:47:02.78,Default,,0000,0000,0000,,And, in the same way, if theta transpose X Dialogue: 0,0:47:02.78,0:47:05.19,Default,,0000,0000,0000,,is much less than zero, Dialogue: 0,0:47:05.19,0:47:12.19,Default,,0000,0000,0000,,then we're very confident that Dialogue: 0,0:47:12.81,0:47:15.85,Default,,0000,0000,0000,,Y is equal to zero. Dialogue: 0,0:47:15.85,0:47:19.60,Default,,0000,0000,0000,,So Dialogue: 0,0:47:19.60,0:47:23.86,Default,,0000,0000,0000,,wouldn't it be nice - so when we fit logistic regression of some of the Dialogue: 0,0:47:23.86,0:47:25.31,Default,,0000,0000,0000,,classifiers is your training set, Dialogue: 0,0:47:25.31,0:47:29.57,Default,,0000,0000,0000,,then so wouldn't it be nice if, Dialogue: 0,0:47:29.57,0:47:30.43,Default,,0000,0000,0000,,right, Dialogue: 0,0:47:30.43,0:47:34.35,Default,,0000,0000,0000,,for all I Dialogue: 0,0:47:34.35,0:47:38.74,Default,,0000,0000,0000,,such that Y is equal to one. Dialogue: 0,0:47:38.74,0:47:39.70,Default,,0000,0000,0000,,We have Dialogue: 0,0:47:39.70,0:47:41.28,Default,,0000,0000,0000,,theta Dialogue: 0,0:47:41.28,0:47:45.01,Default,,0000,0000,0000,,transpose XI is much greater than zero, Dialogue: 0,0:47:45.01,0:47:47.93,Default,,0000,0000,0000,,and for all I such that Y is equal Dialogue: 0,0:47:47.93,0:47:49.87,Default,,0000,0000,0000,,to Dialogue: 0,0:47:49.87,0:47:56.87,Default,,0000,0000,0000,,zero, Dialogue: 0,0:47:57.59,0:47:59.91,Default,,0000,0000,0000,,we have theta transpose XI is Dialogue: 0,0:47:59.91,0:48:00.97,Default,,0000,0000,0000,,much less than zero, Dialogue: 0,0:48:00.97,0:48:05.66,Default,,0000,0000,0000,,okay? So wouldn't it be nice if this is true? That, Dialogue: 0,0:48:05.66,0:48:06.38,Default,,0000,0000,0000,, Dialogue: 0,0:48:06.38,0:48:09.93,Default,,0000,0000,0000,,essentially, if our training set, we can find parameters theta Dialogue: 0,0:48:09.93,0:48:12.75,Default,,0000,0000,0000,,so that our learning algorithm not only Dialogue: 0,0:48:12.75,0:48:16.84,Default,,0000,0000,0000,,makes correct classifications on all the examples in a training set, but further it's, sort Dialogue: 0,0:48:16.84,0:48:19.88,Default,,0000,0000,0000,,of, is very confident about all of those correct Dialogue: 0,0:48:19.88,0:48:22.30,Default,,0000,0000,0000,,classifications. 
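As a tiny illustration of the decision rule and the "wouldn't it be nice" condition just stated (still using the 0/1 labels in effect at this point), something like the following sketch; the function names and the margin threshold of 1.0 are arbitrary choices of mine standing in for "much greater than zero."

```python
import numpy as np

def predict(theta, x):
    # Logistic regression's prediction: 1 if theta^T x > 0, else 0.
    return 1 if theta @ x > 0 else 0

def confidently_correct(theta, X, y, margin=1.0):
    # Checks that every y=1 example has theta^T x well above zero and
    # every y=0 example has theta^T x well below zero.
    z = X @ theta
    return bool(np.all(np.where(y == 1, z > margin, z < -margin)))
```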
This is Dialogue: 0,0:48:22.30,0:48:26.86,Default,,0000,0000,0000,,the first intuition that I want you to have, and we'll come back to this first intuition Dialogue: 0,0:48:26.86,0:48:28.89,Default,,0000,0000,0000,,in a second Dialogue: 0,0:48:28.89,0:48:33.21,Default,,0000,0000,0000,,when we talk about functional margins, okay? Dialogue: 0,0:48:33.21,0:48:40.21,Default,,0000,0000,0000,,We'll define this later. The second Dialogue: 0,0:48:43.54,0:48:48.12,Default,,0000,0000,0000,,intuition that I want to Dialogue: 0,0:48:48.12,0:48:55.12,Default,,0000,0000,0000,,convey, Dialogue: 0,0:48:56.77,0:49:00.21,Default,,0000,0000,0000,,and it turns out for the rest of today's lecture I'm going to assume Dialogue: 0,0:49:00.21,0:49:04.05,Default,,0000,0000,0000,,that a training set is linearly separable, okay? So by that I mean for the rest of Dialogue: 0,0:49:04.05,0:49:05.62,Default,,0000,0000,0000,, Dialogue: 0,0:49:05.62,0:49:08.69,Default,,0000,0000,0000,,today's lecture, I'm going to assume that Dialogue: 0,0:49:08.69,0:49:10.90,Default,,0000,0000,0000,,there is indeed a straight line that can separate Dialogue: 0,0:49:10.90,0:49:15.03,Default,,0000,0000,0000,,your training set, and we'll remove this assumption later, but Dialogue: 0,0:49:15.03,0:49:17.09,Default,,0000,0000,0000,,just to develop the algorithm, let's assume a Dialogue: 0,0:49:17.09,0:49:19.41,Default,,0000,0000,0000,,linearly separable training set. Dialogue: 0,0:49:19.41,0:49:20.44,Default,,0000,0000,0000,,And so Dialogue: 0,0:49:20.44,0:49:24.40,Default,,0000,0000,0000,,there's a sense that out of all the straight lines that separate the training Dialogue: 0,0:49:24.40,0:49:28.57,Default,,0000,0000,0000,,set, you know, maybe that straight line isn't such a good one, Dialogue: 0,0:49:28.57,0:49:33.14,Default,,0000,0000,0000,,and that one actually isn't such a great one either, but Dialogue: 0,0:49:33.14,0:49:35.02,Default,,0000,0000,0000,,maybe that line in the Dialogue: 0,0:49:35.02,0:49:38.87,Default,,0000,0000,0000,,middle is a much better linear separator than the others, right? Dialogue: 0,0:49:38.87,0:49:41.18,Default,,0000,0000,0000,,And one reason that Dialogue: 0,0:49:41.18,0:49:45.76,Default,,0000,0000,0000,,when you and I look at it this one seems best Dialogue: 0,0:49:45.76,0:49:48.33,Default,,0000,0000,0000,,is because this line is just further from the data, all right? Dialogue: 0,0:49:48.33,0:49:52.47,Default,,0000,0000,0000,,That is, it separates the data with a greater distance between your positive and your negative Dialogue: 0,0:49:52.47,0:49:55.78,Default,,0000,0000,0000,,examples and the division boundary, okay? Dialogue: 0,0:49:55.78,0:49:59.39,Default,,0000,0000,0000,,And this second intuition, we'll come back to this shortly, Dialogue: 0,0:49:59.39,0:50:02.83,Default,,0000,0000,0000,,about this final line Dialogue: 0,0:50:02.83,0:50:06.10,Default,,0000,0000,0000,,that I drew being, maybe, the best line - Dialogue: 0,0:50:06.10,0:50:09.100,Default,,0000,0000,0000,,this notion of distance from the training examples. This is the second intuition I want Dialogue: 0,0:50:09.100,0:50:11.24,Default,,0000,0000,0000,,to convey, Dialogue: 0,0:50:11.24,0:50:13.66,Default,,0000,0000,0000,,and we'll formalize it later when Dialogue: 0,0:50:13.66,0:50:17.85,Default,,0000,0000,0000,,we talk about Dialogue: 0,0:50:17.85,0:50:24.85,Default,,0000,0000,0000,,geometric margins of our classifiers, okay?
Dialogue: 0,0:50:41.55,0:50:44.89,Default,,0000,0000,0000,,So in order to describe support vector machine, unfortunately, I'm gonna Dialogue: 0,0:50:44.89,0:50:48.24,Default,,0000,0000,0000,,have to pull a notation change, Dialogue: 0,0:50:48.24,0:50:50.11,Default,,0000,0000,0000,,and, sort Dialogue: 0,0:50:50.11,0:50:53.99,Default,,0000,0000,0000,,of, unfortunately, it, sort of, was impossible to do logistic regression, Dialogue: 0,0:50:53.99,0:50:55.81,Default,,0000,0000,0000,,and support vector machines, Dialogue: 0,0:50:55.81,0:51:01.03,Default,,0000,0000,0000,,and all the other algorithms using one completely consistent notation, Dialogue: 0,0:51:01.03,0:51:04.18,Default,,0000,0000,0000,,and so I'm actually gonna change notations slightly Dialogue: 0,0:51:04.18,0:51:07.100,Default,,0000,0000,0000,,for linear classifiers, and that will actually make it much easier for us - Dialogue: 0,0:51:07.100,0:51:10.02,Default,,0000,0000,0000,,that'll make it much easier later today Dialogue: 0,0:51:10.02,0:51:15.15,Default,,0000,0000,0000,,and in next week's lectures to actually talk about support vector machine. But the Dialogue: 0,0:51:15.15,0:51:19.55,Default,,0000,0000,0000,,notation that I'm gonna use for the rest Dialogue: 0,0:51:19.55,0:51:21.76,Default,,0000,0000,0000,,of today and for most of next week Dialogue: 0,0:51:21.76,0:51:25.77,Default,,0000,0000,0000,,will be that my labels Y, Dialogue: 0,0:51:25.77,0:51:30.22,Default,,0000,0000,0000,,instead of being zero, one, will be minus one and plus one, Dialogue: 0,0:51:30.22,0:51:33.78,Default,,0000,0000,0000,, Dialogue: 0,0:51:33.78,0:51:36.76,Default,,0000,0000,0000,,and in the development of the support vector machine Dialogue: 0,0:51:36.76,0:51:38.47,Default,,0000,0000,0000,,we will have H, Dialogue: 0,0:51:38.47,0:51:43.94,Default,,0000,0000,0000,,have a hypothesis, Dialogue: 0,0:51:43.94,0:51:49.37,Default,,0000,0000,0000,,output values that are Dialogue: 0,0:51:49.37,0:51:53.18,Default,,0000,0000,0000,,either plus one or minus one, Dialogue: 0,0:51:53.18,0:51:54.62,Default,,0000,0000,0000,,and so Dialogue: 0,0:51:54.62,0:51:58.04,Default,,0000,0000,0000,,we'll let G of Z be equal to Dialogue: 0,0:51:58.04,0:52:00.35,Default,,0000,0000,0000,,one if Z is Dialogue: 0,0:52:00.35,0:52:04.46,Default,,0000,0000,0000,,greater or equal to zero, and minus one otherwise, right? So just rather than zero and Dialogue: 0,0:52:04.46,0:52:07.81,Default,,0000,0000,0000,,one, we change everything to plus one and minus one. Dialogue: 0,0:52:07.81,0:52:10.15,Default,,0000,0000,0000,,And, finally, Dialogue: 0,0:52:10.15,0:52:13.77,Default,,0000,0000,0000,,whereas previously I wrote H subscript Dialogue: 0,0:52:13.77,0:52:16.78,Default,,0000,0000,0000,,theta of X equals Dialogue: 0,0:52:16.78,0:52:18.43,Default,,0000,0000,0000,,G of theta transpose X Dialogue: 0,0:52:18.43,0:52:20.88,Default,,0000,0000,0000,,and we had the convention Dialogue: 0,0:52:20.88,0:52:23.41,Default,,0000,0000,0000,,that X zero is equal to one, Dialogue: 0,0:52:23.41,0:52:25.79,Default,,0000,0000,0000,,right? And so X is in RN plus Dialogue: 0,0:52:25.79,0:52:28.67,Default,,0000,0000,0000,,one.
Dialogue: 0,0:52:28.67,0:52:34.50,Default,,0000,0000,0000,,I'm gonna drop this convention of Dialogue: 0,0:52:34.50,0:52:38.40,Default,,0000,0000,0000,,letting X zero equal one, and letting X be in RN plus one, and instead I'm Dialogue: 0,0:52:38.40,0:52:44.05,Default,,0000,0000,0000,,going to parameterize my linear classifier as H subscript W, B of X Dialogue: 0,0:52:44.05,0:52:46.16,Default,,0000,0000,0000,,equals G of Dialogue: 0,0:52:46.16,0:52:50.63,Default,,0000,0000,0000,,W transpose X plus B, okay? Dialogue: 0,0:52:50.63,0:52:51.35,Default,,0000,0000,0000,,And so Dialogue: 0,0:52:51.35,0:52:53.36,Default,,0000,0000,0000,,B Dialogue: 0,0:52:53.36,0:52:54.99,Default,,0000,0000,0000,,just now plays the role of Dialogue: 0,0:52:54.99,0:52:56.29,Default,,0000,0000,0000,,theta zero, Dialogue: 0,0:52:56.29,0:53:03.29,Default,,0000,0000,0000,,and W now plays the role of the rest of the parameters, theta one Dialogue: 0,0:53:04.18,0:53:07.06,Default,,0000,0000,0000,,through theta N, okay? So just by separating out Dialogue: 0,0:53:07.06,0:53:10.57,Default,,0000,0000,0000,,the intercept term B rather than lumping it together, it'll make it easier for Dialogue: 0,0:53:10.57,0:53:11.23,Default,,0000,0000,0000,,us Dialogue: 0,0:53:11.23,0:53:17.57,Default,,0000,0000,0000,,to develop support vector machines. Dialogue: 0,0:53:17.57,0:53:24.57,Default,,0000,0000,0000,,So - yes. Student:[Off mic]. Instructor Dialogue: 0,0:53:27.89,0:53:31.83,Default,,0000,0000,0000,,(Andrew Ng):Oh, yes. Right, yes. So W is - Dialogue: 0,0:53:31.83,0:53:33.05,Default,,0000,0000,0000,,right. So W Dialogue: 0,0:53:33.05,0:53:35.50,Default,,0000,0000,0000,,is a vector in RN, and Dialogue: 0,0:53:35.50,0:53:37.07,Default,,0000,0000,0000,,X Dialogue: 0,0:53:37.07,0:53:42.71,Default,,0000,0000,0000,,is now a vector in RN rather than N plus one, Dialogue: 0,0:53:42.71,0:53:49.71,Default,,0000,0000,0000,,and a lowercase b is a real number. Okay. Dialogue: 0,0:53:56.43,0:54:02.65,Default,,0000,0000,0000,,Now, let's formalize the notion of functional margin and geometric margin. Let me make a Dialogue: 0,0:54:02.65,0:54:03.66,Default,,0000,0000,0000,,definition. Dialogue: 0,0:54:03.66,0:54:10.66,Default,,0000,0000,0000,,I'm going to say that the functional margin Dialogue: 0,0:54:13.16,0:54:19.68,Default,,0000,0000,0000,,of the hyper plane Dialogue: 0,0:54:19.68,0:54:21.88,Default,,0000,0000,0000,,WB Dialogue: 0,0:54:21.88,0:54:25.23,Default,,0000,0000,0000,,with respect to a specific training example, Dialogue: 0,0:54:25.23,0:54:27.43,Default,,0000,0000,0000,,XIYI is - WRT Dialogue: 0,0:54:27.43,0:54:32.07,Default,,0000,0000,0000,,stands for with respect to - Dialogue: 0,0:54:32.07,0:54:35.49,Default,,0000,0000,0000,,the functional margin of a hyper plane WB with Dialogue: 0,0:54:35.49,0:54:38.10,Default,,0000,0000,0000,,respect to Dialogue: 0,0:54:38.10,0:54:43.10,Default,,0000,0000,0000,,a certain training example, XIYI, is defined as Gamma Dialogue: 0,0:54:43.10,0:54:45.32,Default,,0000,0000,0000,,Hat I equals YI Dialogue: 0,0:54:45.32,0:54:46.92,Default,,0000,0000,0000,,times W transpose XI Dialogue: 0,0:54:46.92,0:54:51.16,Default,,0000,0000,0000,,plus B, okay?
Dialogue: 0,0:54:51.16,0:54:54.58,Default,,0000,0000,0000,,And so a set of parameters, W, B defines Dialogue: 0,0:54:54.58,0:54:55.96,Default,,0000,0000,0000,,a Dialogue: 0,0:54:55.96,0:55:00.24,Default,,0000,0000,0000,,classifier - it, sort of, defines a linear separating boundary, Dialogue: 0,0:55:00.24,0:55:01.16,Default,,0000,0000,0000,,and so Dialogue: 0,0:55:01.16,0:55:03.99,Default,,0000,0000,0000,,when I say hyper plane, I just mean Dialogue: 0,0:55:03.99,0:55:07.39,Default,,0000,0000,0000,,the decision boundary that's Dialogue: 0,0:55:07.39,0:55:11.17,Default,,0000,0000,0000,,defined by the parameters W, B. Dialogue: 0,0:55:11.17,0:55:12.97,Default,,0000,0000,0000,,You know what, Dialogue: 0,0:55:12.97,0:55:17.70,Default,,0000,0000,0000,,if you're confused by the hyper plane term, just ignore it. The functional margin of a classifier with Dialogue: 0,0:55:17.70,0:55:18.87,Default,,0000,0000,0000,,parameters W, B Dialogue: 0,0:55:18.87,0:55:21.02,Default,,0000,0000,0000,,with respect to a training example Dialogue: 0,0:55:21.02,0:55:23.54,Default,,0000,0000,0000,,is given by this formula, okay? Dialogue: 0,0:55:23.54,0:55:25.53,Default,,0000,0000,0000,,And the interpretation of this is that Dialogue: 0,0:55:25.53,0:55:28.37,Default,,0000,0000,0000,,if Dialogue: 0,0:55:28.37,0:55:29.80,Default,,0000,0000,0000,,YI is equal to one, Dialogue: 0,0:55:29.80,0:55:33.24,Default,,0000,0000,0000,,then in order to have a large functional margin, you Dialogue: 0,0:55:33.24,0:55:36.52,Default,,0000,0000,0000,,want W transpose XI plus B to Dialogue: 0,0:55:36.52,0:55:37.85,Default,,0000,0000,0000,,be large, Dialogue: 0,0:55:37.85,0:55:39.08,Default,,0000,0000,0000,,right? Dialogue: 0,0:55:39.08,0:55:41.39,Default,,0000,0000,0000,,And if YI Dialogue: 0,0:55:41.39,0:55:43.71,Default,,0000,0000,0000,,is equal to minus one, Dialogue: 0,0:55:43.71,0:55:47.48,Default,,0000,0000,0000,,then in order for the functional margin to be large - we, sort of, want the functional margins Dialogue: 0,0:55:47.48,0:55:50.23,Default,,0000,0000,0000,,to be large, but in order for the functional margins to be large, Dialogue: 0,0:55:50.23,0:55:55.23,Default,,0000,0000,0000,,if YI is equal to minus one, then the only way for this to be big is if W Dialogue: 0,0:55:55.23,0:55:57.03,Default,,0000,0000,0000,,transpose XI Dialogue: 0,0:55:57.03,0:55:58.22,Default,,0000,0000,0000,,plus B Dialogue: 0,0:55:58.22,0:56:03.85,Default,,0000,0000,0000,,is much less than zero, okay? So this Dialogue: 0,0:56:03.85,0:56:06.78,Default,,0000,0000,0000,,captures the intuition that we had earlier about functional Dialogue: 0,0:56:06.78,0:56:08.69,Default,,0000,0000,0000,,margins - the Dialogue: 0,0:56:08.69,0:56:12.84,Default,,0000,0000,0000,,intuition we had earlier that if YI is equal to one, we want this to be big, and if YI Dialogue: 0,0:56:12.84,0:56:14.67,Default,,0000,0000,0000,,is equal to minus one, we Dialogue: 0,0:56:14.67,0:56:15.99,Default,,0000,0000,0000,,want this to be small, Dialogue: 0,0:56:15.99,0:56:17.01,Default,,0000,0000,0000,,and this, sort of, Dialogue: 0,0:56:17.01,0:56:18.70,Default,,0000,0000,0000,,collapses the two cases Dialogue: 0,0:56:18.70,0:56:22.60,Default,,0000,0000,0000,,into one statement that we'd like the functional margin to be large.
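A small numeric illustration of the functional margin definition above, with the new minus one / plus one labels; the vectors and numbers below are made up purely for illustration.

```python
import numpy as np

def functional_margin(w, b, x, y):
    # gamma_hat_i = y_i * (w^T x_i + b), with y_i in {-1, +1}.
    # Large and positive means a confident, correct classification.
    return y * (w @ x + b)

w = np.array([2.0, -1.0]); b = 0.5
print(functional_margin(w, b, np.array([3.0, 1.0]), +1))   # 5.5: correct and confident
print(functional_margin(w, b, np.array([0.0, 2.0]), -1))   # 1.5: correct (w^T x + b = -1.5 < 0)
```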
Dialogue: 0,0:56:22.60,0:56:26.59,Default,,0000,0000,0000,,And notice this is also that Dialogue: 0,0:56:26.59,0:56:31.25,Default,,0000,0000,0000,,so long as YI times W transpose XY Dialogue: 0,0:56:31.25,0:56:34.05,Default,,0000,0000,0000,,plus B, so long as this is greater than zero, Dialogue: 0,0:56:34.05,0:56:36.65,Default,,0000,0000,0000,,that means we Dialogue: 0,0:56:36.65,0:56:43.65,Default,,0000,0000,0000,,classified it correctly, okay? Dialogue: 0,0:57:14.36,0:57:18.07,Default,,0000,0000,0000,,And one more definition, I'm going to say that Dialogue: 0,0:57:18.07,0:57:19.90,Default,,0000,0000,0000,,the functional margin of Dialogue: 0,0:57:19.90,0:57:23.68,Default,,0000,0000,0000,,a hyper plane with respect to an entire training set Dialogue: 0,0:57:23.68,0:57:30.68,Default,,0000,0000,0000,,is Dialogue: 0,0:57:31.58,0:57:33.61,Default,,0000,0000,0000,,going to define gamma hat to Dialogue: 0,0:57:33.61,0:57:35.51,Default,,0000,0000,0000,,be equal to Dialogue: 0,0:57:35.51,0:57:35.76,Default,,0000,0000,0000,,min Dialogue: 0,0:57:35.65,0:57:39.57,Default,,0000,0000,0000,,over all your training examples of gamma hat, I, right? Dialogue: 0,0:57:39.57,0:57:40.54,Default,,0000,0000,0000,,So if you have Dialogue: 0,0:57:40.54,0:57:44.44,Default,,0000,0000,0000,,a training set, if you have just more than one training example, Dialogue: 0,0:57:44.44,0:57:46.95,Default,,0000,0000,0000,,I'm going to define the functional margin Dialogue: 0,0:57:46.95,0:57:49.78,Default,,0000,0000,0000,,with respect to the entire training set as Dialogue: 0,0:57:49.78,0:57:51.80,Default,,0000,0000,0000,,the worst case of all of your Dialogue: 0,0:57:51.80,0:57:54.44,Default,,0000,0000,0000,,functional margins of the entire training set. Dialogue: 0,0:57:54.44,0:57:57.85,Default,,0000,0000,0000,,And so for now we should think of the Dialogue: 0,0:57:57.85,0:58:02.18,Default,,0000,0000,0000,,first function like an intuition of saying that we would like the function margin Dialogue: 0,0:58:02.18,0:58:03.45,Default,,0000,0000,0000,,to be large, Dialogue: 0,0:58:03.45,0:58:05.91,Default,,0000,0000,0000,,and for our purposes, for now, Dialogue: 0,0:58:05.91,0:58:07.31,Default,,0000,0000,0000,,let's just say we would like Dialogue: 0,0:58:07.31,0:58:10.39,Default,,0000,0000,0000,,the worst-case functional margin to be large, okay? And Dialogue: 0,0:58:10.39,0:58:16.30,Default,,0000,0000,0000,,we'll change this a little bit later as well. Now, it turns out that Dialogue: 0,0:58:16.30,0:58:20.15,Default,,0000,0000,0000,,there's one little problem with this intuition that will, sort of, edge Dialogue: 0,0:58:20.15,0:58:22.64,Default,,0000,0000,0000,,us later, Dialogue: 0,0:58:22.64,0:58:26.19,Default,,0000,0000,0000,,which it actually turns out to be very easy to make the functional margin large, all Dialogue: 0,0:58:26.19,0:58:28.60,Default,,0000,0000,0000,,right? So, for example, Dialogue: 0,0:58:28.60,0:58:31.90,Default,,0000,0000,0000,,so as I have a classifiable parameters W and Dialogue: 0,0:58:31.90,0:58:37.37,Default,,0000,0000,0000,,B. If I take W and multiply it by two and take B and multiply it by two, Dialogue: 0,0:58:37.37,0:58:42.05,Default,,0000,0000,0000,,then if you refer to the definition of the functional margin, I guess that was what? 
Dialogue: 0,0:58:42.05,0:58:42.96,Default,,0000,0000,0000,,Gamma I, Dialogue: 0,0:58:42.96,0:58:44.41,Default,,0000,0000,0000,,gamma hat I Dialogue: 0,0:58:44.41,0:58:47.93,Default,,0000,0000,0000,,equals YI Dialogue: 0,0:58:47.93,0:58:51.06,Default,,0000,0000,0000,,times W transpose XI plus B. If Dialogue: 0,0:58:51.06,0:58:52.86,Default,,0000,0000,0000,,I double Dialogue: 0,0:58:52.86,0:58:54.61,Default,,0000,0000,0000,,W and B, Dialogue: 0,0:58:54.61,0:58:55.01,Default,,0000,0000,0000,,then Dialogue: 0,0:58:55.01,0:58:59.42,Default,,0000,0000,0000,,I can easily double my functional margin. So this goal Dialogue: 0,0:58:59.42,0:59:02.80,Default,,0000,0000,0000,,of making the functional margin large, in and of itself, isn't so Dialogue: 0,0:59:02.80,0:59:05.96,Default,,0000,0000,0000,,useful because it's easy to make the functional margin arbitrarily large Dialogue: 0,0:59:05.96,0:59:08.73,Default,,0000,0000,0000,,just by scaling up the parameters. And so Dialogue: 0,0:59:08.73,0:59:13.26,Default,,0000,0000,0000,,maybe one thing we need to do later is add a normalization Dialogue: 0,0:59:13.26,0:59:14.88,Default,,0000,0000,0000,,condition. Dialogue: 0,0:59:14.88,0:59:16.85,Default,,0000,0000,0000,,For example, maybe Dialogue: 0,0:59:16.85,0:59:20.18,Default,,0000,0000,0000,,we want to add a normalization condition that the norm, the Dialogue: 0,0:59:20.18,0:59:21.79,Default,,0000,0000,0000,,L2 norm of Dialogue: 0,0:59:21.79,0:59:23.11,Default,,0000,0000,0000,,the parameter W Dialogue: 0,0:59:23.11,0:59:24.01,Default,,0000,0000,0000,,is equal to one, Dialogue: 0,0:59:24.01,0:59:28.15,Default,,0000,0000,0000,,and we'll come back to this in a second. All Dialogue: 0,0:59:28.15,0:59:34.28,Default,,0000,0000,0000,,right. And then so - Okay. Dialogue: 0,0:59:34.28,0:59:39.87,Default,,0000,0000,0000,,Now, let's talk about - see how much Dialogue: 0,0:59:39.87,0:59:46.87,Default,,0000,0000,0000,,time we have, 15 minutes. Well, see, I'm trying to Dialogue: 0,0:59:52.91,0:59:59.91,Default,,0000,0000,0000,,decide how much to try to do in the last 15 minutes. Okay. So let's talk Dialogue: 0,1:00:05.34,1:00:10.89,Default,,0000,0000,0000,,about the geometric margin, Dialogue: 0,1:00:10.89,1:00:13.38,Default,,0000,0000,0000,,and so Dialogue: 0,1:00:13.38,1:00:16.94,Default,,0000,0000,0000,,the geometric margin of a training example - Dialogue: 0,1:00:16.94,1:00:23.94,Default,,0000,0000,0000,,[inaudible], right? Dialogue: 0,1:00:27.28,1:00:30.65,Default,,0000,0000,0000,,So the division boundary of my classifier Dialogue: 0,1:00:30.65,1:00:33.44,Default,,0000,0000,0000,,is going to be given by the plane W transpose X Dialogue: 0,1:00:33.44,1:00:34.89,Default,,0000,0000,0000,,plus B is equal Dialogue: 0,1:00:34.89,1:00:37.65,Default,,0000,0000,0000,,to zero, okay? Right, and these are the Dialogue: 0,1:00:37.65,1:00:41.39,Default,,0000,0000,0000,,X1, X2 axes, say, Dialogue: 0,1:00:41.39,1:00:42.58,Default,,0000,0000,0000,,and Dialogue: 0,1:00:42.58,1:00:48.30,Default,,0000,0000,0000,,we're going to draw relatively few training examples here. Let's say I'm drawing Dialogue: 0,1:00:48.30,1:00:52.93,Default,,0000,0000,0000,,deliberately few training examples so that I can add things to this, okay?
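To see the scaling problem concretely, the sketch below computes the worst-case functional margin over a toy training set and then doubles W and B; the minimum doubles with it, which is exactly why a normalization condition such as the norm of W equal to one comes up. The data and parameter values are made up for illustration.

```python
import numpy as np

def functional_margins(w, b, X, y):
    # gamma_hat_i = y_i * (w^T x_i + b) for every training example.
    return y * (X @ w + b)

X = np.array([[3.0, 1.0], [0.0, 2.0]])
y = np.array([+1.0, -1.0])
w = np.array([2.0, -1.0]); b = 0.5

gamma_hat = functional_margins(w, b, X, y).min()              # worst case over the set
gamma_hat_doubled = functional_margins(2 * w, 2 * b, X, y).min()
print(gamma_hat, gamma_hat_doubled)   # the second is exactly twice the first
```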
Dialogue: 0,1:00:52.93,1:00:54.66,Default,,0000,0000,0000,,And so Dialogue: 0,1:00:54.66,1:00:57.20,Default,,0000,0000,0000,,assuming we classified an example correctly, I'm Dialogue: 0,1:00:57.20,1:01:01.21,Default,,0000,0000,0000,,going to define the geometric margin Dialogue: 0,1:01:01.21,1:01:05.56,Default,,0000,0000,0000,,as just a geometric distance between a point between the Dialogue: 0,1:01:05.56,1:01:07.24,Default,,0000,0000,0000,,training example - Dialogue: 0,1:01:07.24,1:01:09.65,Default,,0000,0000,0000,,yeah, between the training example XI, Dialogue: 0,1:01:09.65,1:01:11.68,Default,,0000,0000,0000,,YI Dialogue: 0,1:01:11.68,1:01:13.26,Default,,0000,0000,0000,,and the distance Dialogue: 0,1:01:13.26,1:01:15.49,Default,,0000,0000,0000,,given by this separating Dialogue: 0,1:01:15.49,1:01:18.25,Default,,0000,0000,0000,,line, given by this separating hyper plane, okay? Dialogue: 0,1:01:18.25,1:01:20.33,Default,,0000,0000,0000,,That's what I'm going to define the Dialogue: 0,1:01:20.33,1:01:23.81,Default,,0000,0000,0000,,geometric margin to be. Dialogue: 0,1:01:23.81,1:01:26.30,Default,,0000,0000,0000,,And so I'm gonna do Dialogue: 0,1:01:26.30,1:01:29.92,Default,,0000,0000,0000,,some algebra fairly quickly. In case it doesn't make sense, and read Dialogue: 0,1:01:29.92,1:01:33.97,Default,,0000,0000,0000,,through the lecture notes more carefully for details. Dialogue: 0,1:01:33.97,1:01:36.48,Default,,0000,0000,0000,,Sort of, by standard geometry, Dialogue: 0,1:01:36.48,1:01:40.59,Default,,0000,0000,0000,,the normal, or in other words, the vector that's 90 degrees to the Dialogue: 0,1:01:40.59,1:01:41.87,Default,,0000,0000,0000,,separating hyper plane Dialogue: 0,1:01:41.87,1:01:45.07,Default,,0000,0000,0000,,is going to be given by W divided by Dialogue: 0,1:01:45.07,1:01:47.48,Default,,0000,0000,0000,,the norm of W; that's just how Dialogue: 0,1:01:47.48,1:01:49.62,Default,,0000,0000,0000,,planes and high dimensions work. If this Dialogue: 0,1:01:49.62,1:01:53.38,Default,,0000,0000,0000,,stuff - some of this you have to use, take a look t the lecture notes on the Dialogue: 0,1:01:53.38,1:01:55.59,Default,,0000,0000,0000,,website. Dialogue: 0,1:01:55.59,1:01:58.20,Default,,0000,0000,0000,,And so let's say this distance Dialogue: 0,1:01:58.20,1:01:59.79,Default,,0000,0000,0000,,is Dialogue: 0,1:01:59.79,1:02:01.22,Default,,0000,0000,0000,, Dialogue: 0,1:02:01.22,1:02:02.08,Default,,0000,0000,0000,,gamma I, Dialogue: 0,1:02:02.08,1:02:04.87,Default,,0000,0000,0000,,okay? And so I'm going to use the convention that Dialogue: 0,1:02:04.87,1:02:08.55,Default,,0000,0000,0000,,I'll put a hat on top where I'm referring to functional margins, Dialogue: 0,1:02:08.55,1:02:11.62,Default,,0000,0000,0000,,and no hat on top for geometric margins. So let's say geometric margin, Dialogue: 0,1:02:11.62,1:02:14.12,Default,,0000,0000,0000,,as this example, is Dialogue: 0,1:02:14.12,1:02:18.01,Default,,0000,0000,0000,,gamma I. Dialogue: 0,1:02:18.01,1:02:21.79,Default,,0000,0000,0000,,That means that Dialogue: 0,1:02:21.79,1:02:26.46,Default,,0000,0000,0000,,this point here, right, Dialogue: 0,1:02:26.46,1:02:28.90,Default,,0000,0000,0000,,is going to be Dialogue: 0,1:02:28.90,1:02:32.15,Default,,0000,0000,0000,,XI minus Dialogue: 0,1:02:32.15,1:02:36.12,Default,,0000,0000,0000,,gamma I times W Dialogue: 0,1:02:36.12,1:02:39.48,Default,,0000,0000,0000,,over normal W, okay? 
Dialogue: 0,1:02:39.48,1:02:41.28,Default,,0000,0000,0000,,Because Dialogue: 0,1:02:41.28,1:02:45.40,Default,,0000,0000,0000,,W over normal W is the unit vector, is the length one vector that Dialogue: 0,1:02:45.40,1:02:48.46,Default,,0000,0000,0000,,is normal to the separating hyper plane, Dialogue: 0,1:02:48.46,1:02:52.07,Default,,0000,0000,0000,,and so when we subtract gamma I times the unit vector Dialogue: 0,1:02:52.07,1:02:55.19,Default,,0000,0000,0000,,from this point, XI, or at this point here is XI. Dialogue: 0,1:02:55.19,1:02:57.04,Default,,0000,0000,0000,,So XI minus, Dialogue: 0,1:02:57.04,1:02:59.08,Default,,0000,0000,0000,,you know, this little vector here Dialogue: 0,1:02:59.08,1:03:01.54,Default,,0000,0000,0000,,is going to be this point that I've drawn as Dialogue: 0,1:03:01.54,1:03:03.08,Default,,0000,0000,0000,,a heavy circle, Dialogue: 0,1:03:03.08,1:03:05.66,Default,,0000,0000,0000,,okay? So this heavy point here is Dialogue: 0,1:03:05.66,1:03:07.43,Default,,0000,0000,0000,,XI minus this vector, Dialogue: 0,1:03:07.43,1:03:14.43,Default,,0000,0000,0000,,and this vector is gamma I time W over norm of W, okay? Dialogue: 0,1:03:16.05,1:03:17.65,Default,,0000,0000,0000,,And so Dialogue: 0,1:03:17.65,1:03:21.16,Default,,0000,0000,0000,,because this heavy point is on the separating hyper plane, Dialogue: 0,1:03:21.16,1:03:22.69,Default,,0000,0000,0000,,right, this point Dialogue: 0,1:03:22.69,1:03:24.95,Default,,0000,0000,0000,,must satisfy Dialogue: 0,1:03:24.95,1:03:28.66,Default,,0000,0000,0000,,W transpose times Dialogue: 0,1:03:28.66,1:03:34.66,Default,,0000,0000,0000,,that point Dialogue: 0,1:03:34.66,1:03:35.57,Default,,0000,0000,0000,,equals zero, Dialogue: 0,1:03:35.57,1:03:37.19,Default,,0000,0000,0000,,right? Because Dialogue: 0,1:03:37.19,1:03:38.03,Default,,0000,0000,0000,,all Dialogue: 0,1:03:38.03,1:03:41.48,Default,,0000,0000,0000,,points X on the separating hyper plane satisfy the equation W transpose X plus B Dialogue: 0,1:03:41.48,1:03:42.94,Default,,0000,0000,0000,,equals zero, Dialogue: 0,1:03:42.94,1:03:45.99,Default,,0000,0000,0000,,and so this point is on the separating hyper plane, therefore, Dialogue: 0,1:03:45.99,1:03:49.20,Default,,0000,0000,0000,,it must satisfy W transpose this point - oh, Dialogue: 0,1:03:49.20,1:03:53.81,Default,,0000,0000,0000,,excuse me. Plus B is equal Dialogue: 0,1:03:53.81,1:03:57.58,Default,,0000,0000,0000,,to zero, okay? Dialogue: 0,1:03:57.58,1:04:01.29,Default,,0000,0000,0000,,Raise your hand if this makes sense so far? Oh, Dialogue: 0,1:04:01.29,1:04:03.67,Default,,0000,0000,0000,,okay. Cool, most of you, but, again, Dialogue: 0,1:04:03.67,1:04:07.10,Default,,0000,0000,0000,,I'm, sort of, being slightly fast in this geometry. So if you're not quite sure Dialogue: 0,1:04:07.10,1:04:10.09,Default,,0000,0000,0000,,why this is a normal vector, or how I subtracted Dialogue: 0,1:04:10.09,1:04:17.09,Default,,0000,0000,0000,,this, or whatever, take a look at the details in the lecture notes. Dialogue: 0,1:04:21.30,1:04:24.32,Default,,0000,0000,0000,,And so Dialogue: 0,1:04:24.32,1:04:28.46,Default,,0000,0000,0000,,what I'm going to do is I'll just take this equation, and I'll solve for gamma, right? 
So Dialogue: 0,1:04:28.46,1:04:30.27,Default,,0000,0000,0000,,this equation I just wrote down, Dialogue: 0,1:04:30.27,1:04:32.20,Default,,0000,0000,0000,,solve this equation for gamma Dialogue: 0,1:04:32.20,1:04:33.94,Default,,0000,0000,0000,,or gamma I, Dialogue: 0,1:04:33.94,1:04:40.94,Default,,0000,0000,0000,,and you find that - Dialogue: 0,1:04:44.13,1:04:47.63,Default,,0000,0000,0000,,you solve that previous equation for gamma I - Dialogue: 0,1:04:47.63,1:04:49.77,Default,,0000,0000,0000,,well, why don't I just do it? Dialogue: 0,1:04:49.77,1:04:52.82,Default,,0000,0000,0000,,You have W transpose XI plus Dialogue: 0,1:04:52.82,1:04:54.66,Default,,0000,0000,0000,,B Dialogue: 0,1:04:54.66,1:04:55.49,Default,,0000,0000,0000,,equals Dialogue: 0,1:04:55.49,1:04:57.05,Default,,0000,0000,0000,,gamma Dialogue: 0,1:04:57.05,1:05:01.49,Default,,0000,0000,0000,,I times W transpose W over norm of W; Dialogue: 0,1:05:01.49,1:05:05.17,Default,,0000,0000,0000,,that's just equal to gamma Dialogue: 0,1:05:05.17,1:05:06.75,Default,,0000,0000,0000,,times the norm of W Dialogue: 0,1:05:06.75,1:05:07.66,Default,,0000,0000,0000,,because Dialogue: 0,1:05:07.66,1:05:09.53,Default,,0000,0000,0000,,W transpose W Dialogue: 0,1:05:09.53,1:05:12.47,Default,,0000,0000,0000,,is the norm Dialogue: 0,1:05:12.47,1:05:14.83,Default,,0000,0000,0000,,of W squared, and, therefore, Dialogue: 0,1:05:14.83,1:05:17.59,Default,,0000,0000,0000,,gamma I Dialogue: 0,1:05:17.59,1:05:24.59,Default,,0000,0000,0000,,is just - well, Dialogue: 0,1:05:26.58,1:05:33.58,Default,,0000,0000,0000,,W over the norm of W, transpose XI, plus B over the norm of W, okay? Dialogue: 0,1:05:33.100,1:05:38.09,Default,,0000,0000,0000,,And, in other words, this little calculation just showed us that if you have a Dialogue: 0,1:05:38.09,1:05:41.30,Default,,0000,0000,0000,,training example Dialogue: 0,1:05:41.30,1:05:42.60,Default,,0000,0000,0000,,XI, Dialogue: 0,1:05:42.60,1:05:46.75,Default,,0000,0000,0000,,then the distance between XI and the separating hyper plane defined by the Dialogue: 0,1:05:46.75,1:05:49.45,Default,,0000,0000,0000,,parameters W and B Dialogue: 0,1:05:49.45,1:05:56.45,Default,,0000,0000,0000,,can be computed by this formula, okay? So Dialogue: 0,1:06:02.98,1:06:05.46,Default,,0000,0000,0000,,the last thing I want to do is actually Dialogue: 0,1:06:05.46,1:06:06.73,Default,,0000,0000,0000,,take into account Dialogue: 0,1:06:06.73,1:06:11.93,Default,,0000,0000,0000,,the sign of the - the correct classification of the training example. So I've Dialogue: 0,1:06:11.93,1:06:13.13,Default,,0000,0000,0000,,been assuming that Dialogue: 0,1:06:13.13,1:06:16.34,Default,,0000,0000,0000,,we've been classifying an example correctly. Dialogue: 0,1:06:16.34,1:06:20.100,Default,,0000,0000,0000,,So, more generally, we define Dialogue: 0,1:06:20.100,1:06:27.44,Default,,0000,0000,0000,,the geometric Dialogue: 0,1:06:27.44,1:06:30.03,Default,,0000,0000,0000,,margin of a training example to be Dialogue: 0,1:06:30.03,1:06:32.98,Default,,0000,0000,0000,,gamma I equals YI Dialogue: 0,1:06:32.98,1:06:39.98,Default,,0000,0000,0000,,times that thing on top, okay?
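Restating the board algebra above in one place, under the same assumption that the example is classified correctly and using the lecture's notation for the separating hyper plane:

```latex
% For a correctly classified example, the point x^{(i)} - gamma^{(i)} w/||w|| lies on the hyperplane:
w^\top\!\left(x^{(i)} - \gamma^{(i)}\,\frac{w}{\|w\|}\right) + b = 0
\;\Longrightarrow\;
w^\top x^{(i)} + b = \gamma^{(i)}\,\|w\|
\;\Longrightarrow\;
\gamma^{(i)} = \left(\frac{w}{\|w\|}\right)^{\!\top} x^{(i)} + \frac{b}{\|w\|},
% and, taking the label into account as just described, the general definition is
\gamma^{(i)} = y^{(i)}\!\left(\left(\frac{w}{\|w\|}\right)^{\!\top} x^{(i)} + \frac{b}{\|w\|}\right).
```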
Dialogue: 0,1:06:45.39,1:06:47.28,Default,,0000,0000,0000,,And so Dialogue: 0,1:06:47.28,1:06:51.17,Default,,0000,0000,0000,,this is very similar to the functional margin, except for the normalization by the Dialogue: 0,1:06:51.17,1:06:52.73,Default,,0000,0000,0000,,norm of W, Dialogue: 0,1:06:52.73,1:06:57.76,Default,,0000,0000,0000,,and so as before, you know, this says that so long as - Dialogue: 0,1:06:57.76,1:07:00.89,Default,,0000,0000,0000,,we would like the geometric margin to be large, and all that means is that so Dialogue: 0,1:07:00.89,1:07:01.85,Default,,0000,0000,0000,,long as Dialogue: 0,1:07:01.85,1:07:03.92,Default,,0000,0000,0000,,we're classifying the example correctly, Dialogue: 0,1:07:03.92,1:07:08.21,Default,,0000,0000,0000,,we would ideally hope of the example to be as far as possible from the separating Dialogue: 0,1:07:08.21,1:07:11.08,Default,,0000,0000,0000,,hyper plane, so long as it's on the right side of Dialogue: 0,1:07:11.08,1:07:12.28,Default,,0000,0000,0000,,the separating hyper plane, and that's what YI Dialogue: 0,1:07:12.28,1:07:19.28,Default,,0000,0000,0000,,multiplied into this does. Dialogue: 0,1:07:22.69,1:07:27.84,Default,,0000,0000,0000,,And so a couple of easy facts, one is if the Dialogue: 0,1:07:27.84,1:07:31.77,Default,,0000,0000,0000,,norm of W is equal to one, Dialogue: 0,1:07:31.77,1:07:34.29,Default,,0000,0000,0000,,then Dialogue: 0,1:07:34.29,1:07:37.44,Default,,0000,0000,0000,,the functional margin is equal to the geometric margin, and you see Dialogue: 0,1:07:37.44,1:07:39.37,Default,,0000,0000,0000,,that quite easily, Dialogue: 0,1:07:39.37,1:07:41.48,Default,,0000,0000,0000,,and, more generally, Dialogue: 0,1:07:41.48,1:07:43.02,Default,,0000,0000,0000,, Dialogue: 0,1:07:43.02,1:07:47.78,Default,,0000,0000,0000,,the geometric margin is just equal Dialogue: 0,1:07:47.78,1:07:54.78,Default,,0000,0000,0000,,to the functional margin divided by the norm of W, okay? Let's see, okay. Dialogue: 0,1:08:11.96,1:08:18.96,Default,,0000,0000,0000,,And so one final definition is Dialogue: 0,1:08:22.09,1:08:26.77,Default,,0000,0000,0000,,so far I've defined the geometric margin with respect to a single training example, Dialogue: 0,1:08:26.77,1:08:32.19,Default,,0000,0000,0000,,and so as before, I'll define the geometric margin with respect to an entire training set Dialogue: 0,1:08:32.19,1:08:34.50,Default,,0000,0000,0000,,as gamma equals Dialogue: 0,1:08:34.50,1:08:37.60,Default,,0000,0000,0000,,min over I Dialogue: 0,1:08:37.60,1:08:44.60,Default,,0000,0000,0000,,of gamma I, all right? Dialogue: 0,1:08:45.25,1:08:47.31,Default,,0000,0000,0000,,And so Dialogue: 0,1:08:47.31,1:08:54.31,Default,,0000,0000,0000,,the maximum margin classifier, which is a precursor to the support vector machine, Dialogue: 0,1:08:59.54,1:09:04.88,Default,,0000,0000,0000,,is the learning algorithm that chooses the parameters W and B Dialogue: 0,1:09:04.88,1:09:06.73,Default,,0000,0000,0000,,so as to maximize Dialogue: 0,1:09:06.73,1:09:08.90,Default,,0000,0000,0000,,the geometric margin, Dialogue: 0,1:09:08.90,1:09:11.92,Default,,0000,0000,0000,,and so I just write that down. The Dialogue: 0,1:09:11.92,1:09:16.49,Default,,0000,0000,0000,,maximum margin classified poses the following optimization problem. 
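A quick numeric check of the two facts just stated: the geometric margin is the functional margin divided by the norm of W, and it is unchanged when W and B are rescaled together. The numbers below are arbitrary and only meant as a sketch.

```python
import numpy as np

def geometric_margin(w, b, x, y):
    # gamma_i = y_i * ((w / ||w||)^T x_i + b / ||w||)
    #         = functional margin divided by the norm of w.
    return y * (w @ x + b) / np.linalg.norm(w)

w = np.array([2.0, -1.0]); b = 0.5
x = np.array([3.0, 1.0]); y = +1

print(geometric_margin(w, b, x, y))            # about 2.46 for these numbers
print(geometric_margin(10 * w, 10 * b, x, y))  # identical: rescaling (w, b) changes nothing
```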
It says Dialogue: 0,1:09:16.49,1:09:19.96,Default,,0000,0000,0000,,choose gamma, W, and B Dialogue: 0,1:09:19.96,1:09:23.53,Default,,0000,0000,0000,,so as to maximize the geometric margin, Dialogue: 0,1:09:23.53,1:09:25.50,Default,,0000,0000,0000,,subject to that YI times W transpose XI plus B is at least gamma - Dialogue: 0,1:09:25.50,1:09:32.50,Default,,0000,0000,0000,,well, Dialogue: 0,1:09:33.42,1:09:37.56,Default,,0000,0000,0000,,this is just one way to write it, Dialogue: 0,1:09:37.56,1:09:41.09,Default,,0000,0000,0000,,subject to - actually, do I write it like that? Yeah, Dialogue: 0,1:09:41.09,1:09:43.82,Default,,0000,0000,0000,,fine. Dialogue: 0,1:09:43.82,1:09:46.21,Default,,0000,0000,0000,,There are several ways to write this, and one of the things we'll Dialogue: 0,1:09:46.21,1:09:48.95,Default,,0000,0000,0000,,do next time is actually - I'm trying Dialogue: 0,1:09:48.95,1:09:52.73,Default,,0000,0000,0000,,to figure out if I can do this in five minutes. I'm guessing this could Dialogue: 0,1:09:52.73,1:09:55.42,Default,,0000,0000,0000,,be difficult. Well, Dialogue: 0,1:09:55.42,1:09:59.43,Default,,0000,0000,0000,,so this maximum margin classifier is the maximization problem Dialogue: 0,1:09:59.43,1:10:00.93,Default,,0000,0000,0000,,over parameters gamma, Dialogue: 0,1:10:00.93,1:10:02.52,Default,,0000,0000,0000,,W, Dialogue: 0,1:10:02.52,1:10:04.92,Default,,0000,0000,0000,,and B, and for now, it turns out Dialogue: 0,1:10:04.92,1:10:08.87,Default,,0000,0000,0000,,that the geometric margin doesn't change depending on the norm of W, right? Because Dialogue: 0,1:10:08.87,1:10:11.48,Default,,0000,0000,0000,,in the definition of the geometric margin, Dialogue: 0,1:10:11.48,1:10:14.43,Default,,0000,0000,0000,,notice that we're dividing by the norm of W anyway. Dialogue: 0,1:10:14.43,1:10:18.24,Default,,0000,0000,0000,,So you can actually set the norm of W to be anything you want, and you can multiply Dialogue: 0,1:10:18.24,1:10:22.53,Default,,0000,0000,0000,,W and B by any constant; it doesn't change the geometric margin. Dialogue: 0,1:10:22.53,1:10:26.81,Default,,0000,0000,0000,,This will actually be important, and we'll come back to this later. Notice that you Dialogue: 0,1:10:26.81,1:10:29.83,Default,,0000,0000,0000,,can take the parameters W and B, Dialogue: 0,1:10:29.83,1:10:33.76,Default,,0000,0000,0000,,and you can impose any normalization constraint on them, or you can change W and B by Dialogue: 0,1:10:33.76,1:10:35.76,Default,,0000,0000,0000,,any scaling factor and Dialogue: 0,1:10:35.76,1:10:37.97,Default,,0000,0000,0000,,replace them by ten W and ten B, Dialogue: 0,1:10:37.97,1:10:38.73,Default,,0000,0000,0000,,whatever, Dialogue: 0,1:10:38.73,1:10:42.81,Default,,0000,0000,0000,,and it does not change the geometric margin, Dialogue: 0,1:10:42.81,1:10:44.33,Default,,0000,0000,0000,,okay?
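To make these last two points concrete, here is a small sketch in Python (NumPy assumed; the data and parameters are made up for illustration, not from the lecture). It computes the geometric margin of a training set as the minimum of the per-example margins, checks that it equals the functional margin divided by the norm of W, and shows that rescaling W and B by the same positive constant leaves the geometric margin unchanged.

import numpy as np

def functional_margin(w, b, X, y):
    # min_i y_i * (w^T x_i + b)
    return np.min(y * (X @ w + b))

def geometric_margin(w, b, X, y):
    # min_i y_i * (w^T x_i + b) / ||w||
    return functional_margin(w, b, X, y) / np.linalg.norm(w)

# Made-up, linearly separable toy data.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, 0.5]])
y = np.array([1, 1, -1, -1])
w = np.array([1.0, 1.0])
b = -0.5

geo = geometric_margin(w, b, X, y)
# The geometric margin is the functional margin divided by the norm of W ...
assert np.isclose(geo, functional_margin(w, b, X, y) / np.linalg.norm(w))
# ... and it is unchanged when W and B are rescaled by the same positive constant.
assert np.isclose(geo, geometric_margin(10 * w, 10 * b, X, y))
print(geo)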
And so Dialogue: 0,1:10:44.33,1:10:47.80,Default,,0000,0000,0000,,in this first formulation, I'm just gonna impose a constraint and say that the norm of W is Dialogue: 0,1:10:47.80,1:10:48.99,Default,,0000,0000,0000,,one, Dialogue: 0,1:10:48.99,1:10:51.97,Default,,0000,0000,0000,,and so the functional and geometric margins will be the same, Dialogue: 0,1:10:51.97,1:10:52.70,Default,,0000,0000,0000,,and then we'll say Dialogue: 0,1:10:52.70,1:10:55.12,Default,,0000,0000,0000,,maximize the geometric margin subject to - Dialogue: 0,1:10:55.12,1:10:56.80,Default,,0000,0000,0000,,you maximize gamma Dialogue: 0,1:10:56.80,1:11:00.49,Default,,0000,0000,0000,,subject to that every training example must have geometric margin Dialogue: 0,1:11:00.49,1:11:02.07,Default,,0000,0000,0000,,at least gamma, Dialogue: 0,1:11:02.07,1:11:04.22,Default,,0000,0000,0000,,and this is a geometric margin because Dialogue: 0,1:11:04.22,1:11:05.78,Default,,0000,0000,0000,,when the norm of W is equal to one, then Dialogue: 0,1:11:05.78,1:11:08.16,Default,,0000,0000,0000,,the functional and the geometric margin Dialogue: 0,1:11:08.16,1:11:11.41,Default,,0000,0000,0000,,are identical, okay? Dialogue: 0,1:11:11.41,1:11:15.13,Default,,0000,0000,0000,,So this is the maximum margin classifier, and it turns out that if you do this, Dialogue: 0,1:11:15.13,1:11:19.08,Default,,0000,0000,0000,,it'll run, you know, maybe about comparably to logistic regression, but Dialogue: 0,1:11:19.08,1:11:19.66,Default,,0000,0000,0000,,it Dialogue: 0,1:11:19.66,1:11:22.71,Default,,0000,0000,0000,, Dialogue: 0,1:11:22.71,1:11:24.14,Default,,0000,0000,0000,,turns out that Dialogue: 0,1:11:24.14,1:11:26.45,Default,,0000,0000,0000,,as we develop this algorithm further, there Dialogue: 0,1:11:26.45,1:11:30.52,Default,,0000,0000,0000,,will be a clever way to allow us to change this algorithm to let it work Dialogue: 0,1:11:30.52,1:11:33.15,Default,,0000,0000,0000,,in infinite dimensional feature spaces Dialogue: 0,1:11:33.15,1:11:37.01,Default,,0000,0000,0000,,and come up with very efficient non-linear classifiers. Dialogue: 0,1:11:37.01,1:11:38.02,Default,,0000,0000,0000,,So Dialogue: 0,1:11:38.02,1:11:41.28,Default,,0000,0000,0000,,there's a ways to go before we turn this into a support vector machine, Dialogue: 0,1:11:41.28,1:11:43.41,Default,,0000,0000,0000,,but this is the first step. Dialogue: 0,1:11:43.41,1:11:50.41,Default,,0000,0000,0000,,So are there questions about this? Yeah. Dialogue: 0,1:12:03.61,1:12:08.18,Default,,0000,0000,0000,,Student:[Off mic]. Instructor (Andrew Ng):For now, let's just say you're given a fixed training set, and you can't - yeah, for now, let's just say you're given a Dialogue: 0,1:12:08.18,1:12:09.90,Default,,0000,0000,0000,,fixed training set, and the Dialogue: 0,1:12:09.90,1:12:12.88,Default,,0000,0000,0000,,scaling of the training set is not something you get to play with, right? Dialogue: 0,1:12:12.88,1:12:17.19,Default,,0000,0000,0000,,So everything I've said is for a fixed training set, so that you can't change the X's, and you can't change the Dialogue: 0,1:12:17.19,1:12:20.87,Default,,0000,0000,0000,,Y's. Are there other questions? Okay. So all right.
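As a rough numerical illustration of this first formulation (not the more efficient reformulation developed in later lectures), the sketch below hands the problem to SciPy's general-purpose SLSQP solver on made-up toy data: maximize gamma over (gamma, W, B) subject to YI times (W transpose XI plus B) being at least gamma for every I, with the norm of W constrained to one. Using a generic solver is purely an assumption for illustration; nothing in the lecture prescribes it, and convergence is not guaranteed for this non-convex norm constraint.

import numpy as np
from scipy.optimize import minimize

# Made-up, linearly separable toy data.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, 0.5]])
y = np.array([1, 1, -1, -1])
n = X.shape[1]

# Decision variables packed as v = [gamma, w_1, ..., w_n, b]; maximizing gamma = minimizing -gamma.
def neg_gamma(v):
    return -v[0]

constraints = [
    # Every training example's functional margin must be at least gamma.
    {"type": "ineq", "fun": lambda v: y * (X @ v[1:1 + n] + v[-1]) - v[0]},
    # Fix ||w|| = 1 so the functional and geometric margins coincide.
    {"type": "eq", "fun": lambda v: np.linalg.norm(v[1:1 + n]) - 1.0},
]

v0 = np.concatenate(([0.0], np.ones(n) / np.sqrt(n), [0.0]))  # feasible starting point for this toy data
result = minimize(neg_gamma, v0, method="SLSQP", constraints=constraints)
gamma_opt, w_opt, b_opt = result.x[0], result.x[1:1 + n], result.x[-1]
print(gamma_opt, w_opt, b_opt)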
Dialogue: 0,1:12:20.87,1:12:26.15,Default,,0000,0000,0000,, Dialogue: 0,1:12:26.15,1:12:30.13,Default,,0000,0000,0000,, Dialogue: 0,1:12:30.13,1:12:34.78,Default,,0000,0000,0000,,Next week we will take this, and we'll talk about optimization algorithms, Dialogue: 0,1:12:34.78,1:12:35.84,Default,,0000,0000,0000,,and Dialogue: 0,1:12:35.84,1:12:39.95,Default,,0000,0000,0000,,work our way towards turning this into one of the most effective off-the-shelf Dialogue: 0,1:12:39.95,1:12:41.92,Default,,0000,0000,0000,,learning algorithms, Dialogue: 0,1:12:41.92,1:12:44.37,Default,,0000,0000,0000,,and just a final reminder again, this next Dialogue: 0,1:12:44.37,1:12:48.17,Default,,0000,0000,0000,,discussion session will be on Matlab and Octave. So show up for that if you want Dialogue: 0,1:12:48.17,1:12:49.68,Default,,0000,0000,0000,,to see a tutorial. Dialogue: 0,1:12:49.68,1:12:51.53,Default,,0000,0000,0000,,Okay. See you guys in the next class.