[Script Info]
Title:
[Events]
Format: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text
Dialogue: 0,0:00:10.04,0:00:12.18,Default,,0000,0000,0000,,
Dialogue: 0,0:00:12.18,0:00:15.45,Default,,0000,0000,0000,,This presentation is delivered by the Stanford Center for Professional
Dialogue: 0,0:00:15.45,0:00:22.45,Default,,0000,0000,0000,,Development.
Dialogue: 0,0:00:25.12,0:00:29.82,Default,,0000,0000,0000,,So what I want to do today is talk about a different type of learning algorithm, and, in particular,
Dialogue: 0,0:00:29.82,0:00:32.79,Default,,0000,0000,0000,,start to talk about generative learning algorithms
Dialogue: 0,0:00:32.79,0:00:37.84,Default,,0000,0000,0000,,and the specific algorithm called Gaussian Discriminant Analysis.
Dialogue: 0,0:00:37.84,0:00:43.34,Default,,0000,0000,0000,,We'll take a slight digression, talk about Gaussians, and I'll briefly discuss
Dialogue: 0,0:00:43.34,0:00:45.70,Default,,0000,0000,0000,,generative versus discriminative learning algorithms,
Dialogue: 0,0:00:45.70,0:00:49.78,Default,,0000,0000,0000,,and then hopefully wrap up today's lecture with a discussion of Naive Bayes and
Dialogue: 0,0:00:49.78,0:00:52.14,Default,,0000,0000,0000,,Laplace smoothing.
Dialogue: 0,0:00:52.14,0:00:55.02,Default,,0000,0000,0000,,So just to motivate our
Dialogue: 0,0:00:55.02,0:00:58.59,Default,,0000,0000,0000,,discussion on generative learning algorithms, right, so by way of contrast,
Dialogue: 0,0:00:58.59,0:01:02.10,Default,,0000,0000,0000,,the sort of classification algorithms we've been talking about,
Dialogue: 0,0:01:02.10,0:01:08.92,Default,,0000,0000,0000,,I think of as algorithms that do this. So you're given a training set,
Dialogue: 0,0:01:08.92,0:01:11.64,Default,,0000,0000,0000,,and
Dialogue: 0,0:01:11.64,0:01:15.37,Default,,0000,0000,0000,,you run an algorithm like logistic regression on that training set.
Dialogue: 0,0:01:15.37,0:01:19.83,Default,,0000,0000,0000,,The way I think of logistic regression is that it's looking at the data and is
Dialogue: 0,0:01:19.83,0:01:24.37,Default,,0000,0000,0000,,trying to find a straight line to divide the crosses and O's, right? So it's, sort of,
Dialogue: 0,0:01:24.37,0:01:27.38,Default,,0000,0000,0000,,trying to find a straight line. Let
Dialogue: 0,0:01:27.38,0:01:29.40,Default,,0000,0000,0000,,me just make the data a bit noisier.
Dialogue: 0,0:01:29.40,0:01:33.06,Default,,0000,0000,0000,,Trying to find a straight line
Dialogue: 0,0:01:33.06,0:01:34.43,Default,,0000,0000,0000,,that separates out
Dialogue: 0,0:01:34.43,0:01:38.63,Default,,0000,0000,0000,,the positive and the negative classes as well as possible, right? And,
Dialogue: 0,0:01:38.63,0:01:42.57,Default,,0000,0000,0000,,in fact, I'll show it on the laptop. Maybe just use the screens or the small
Dialogue: 0,0:01:42.57,0:01:46.25,Default,,0000,0000,0000,,monitors for this.
Dialogue: 0,0:01:46.25,0:01:47.23,Default,,0000,0000,0000,,In fact,
Dialogue: 0,0:01:47.23,0:01:49.72,Default,,0000,0000,0000,,you can see there's the data
Dialogue: 0,0:01:49.72,0:01:51.10,Default,,0000,0000,0000,,set with
Dialogue: 0,0:01:51.10,0:01:52.71,Default,,0000,0000,0000,,logistic regression,
Dialogue: 0,0:01:52.71,0:01:54.08,Default,,0000,0000,0000,,and so
Dialogue: 0,0:01:54.08,0:01:57.93,Default,,0000,0000,0000,,I've initialized the parameters randomly, and so logistic regression is, kind
Dialogue: 0,0:01:57.93,0:01:59.35,Default,,0000,0000,0000,,of, outputting, you know,
Dialogue: 0,0:01:59.35,0:02:01.53,Default,,0000,0000,0000,,the, kind of, hypothesis at
Dialogue: 0,0:02:01.53,0:02:04.75,Default,,0000,0000,0000,,iteration zero, which is that straight line shown at the bottom right.
Dialogue: 0,0:02:04.75,0:02:08.72,Default,,0000,0000,0000,,And so after one iteration of gradient descent, the straight line changes a bit.
Dialogue: 0,0:02:08.72,0:02:11.00,Default,,0000,0000,0000,,After two iterations, three,
Dialogue: 0,0:02:11.00,0:02:12.36,Default,,0000,0000,0000,,four,
Dialogue: 0,0:02:12.36,0:02:15.02,Default,,0000,0000,0000,,until logistic regression converges
Dialogue: 0,0:02:15.02,0:02:18.97,Default,,0000,0000,0000,,and has found the straight line that, more or less, separates the positive and negative class, okay? So
Dialogue: 0,0:02:18.97,0:02:20.17,Default,,0000,0000,0000,,you can think of this
Dialogue: 0,0:02:20.17,0:02:21.49,Default,,0000,0000,0000,,as logistic regression,
Dialogue: 0,0:02:21.49,0:02:28.01,Default,,0000,0000,0000,,sort of, searching for a line that separates the positive and the negative classes.
Dialogue: 0,0:02:28.01,0:02:30.98,Default,,0000,0000,0000,,What I want to do today is talk about an algorithm that does something slightly
Dialogue: 0,0:02:30.98,0:02:32.76,Default,,0000,0000,0000,,different,
Dialogue: 0,0:02:32.76,0:02:36.31,Default,,0000,0000,0000,,and to motivate us, let's use our old example of trying to classify tumors into
Dialogue: 0,0:02:36.31,0:02:40.69,Default,,0000,0000,0000,,malignant cancer and benign cancer, right? So a patient comes in,
Dialogue: 0,0:02:40.69,0:02:44.76,Default,,0000,0000,0000,,and they have a cancer; you want to know if it's a malignant or a harmful cancer,
Dialogue: 0,0:02:44.76,0:02:48.49,Default,,0000,0000,0000,,or if it's benign, meaning a harmless cancer.
Dialogue: 0,0:02:48.49,0:02:51.58,Default,,0000,0000,0000,,So rather than trying to find the straight line to separate the two classes, here's something else
Dialogue: 0,0:02:51.58,0:02:53.58,Default,,0000,0000,0000,,we could do.
Dialogue: 0,0:02:53.58,0:02:55.76,Default,,0000,0000,0000,,We can go through our training set
Dialogue: 0,0:02:55.76,0:03:00.41,Default,,0000,0000,0000,,and look at all the cases of malignant cancers; go through, you know, look through our training set for all the
Dialogue: 0,0:03:00.41,0:03:04.13,Default,,0000,0000,0000,,positive examples of malignant cancers,
Dialogue: 0,0:03:04.13,0:03:08.80,Default,,0000,0000,0000,,and we can then build a model for what malignant cancer looks like.
Dialogue: 0,0:03:08.80,0:03:13.40,Default,,0000,0000,0000,,Then we'll go through our training set again and take out all of the examples of benign cancers,
Dialogue: 0,0:03:13.40,0:03:17.86,Default,,0000,0000,0000,,and then we'll build a model for what benign cancers look like, okay?
Dialogue: 0,0:03:17.86,0:03:19.21,Default,,0000,0000,0000,,And then
Dialogue: 0,0:03:19.21,0:03:20.27,Default,,0000,0000,0000,,when you need to
Dialogue: 0,0:03:20.27,0:03:24.62,Default,,0000,0000,0000,,classify a new example, when you have a new patient and you want to decide, is this cancer malignant
Dialogue: 0,0:03:24.62,0:03:25.65,Default,,0000,0000,0000,,or benign,
Dialogue: 0,0:03:25.65,0:03:27.92,Default,,0000,0000,0000,,you then take your new cancer, and you
Dialogue: 0,0:03:27.92,0:03:30.72,Default,,0000,0000,0000,,match it to your model of malignant cancers,
Dialogue: 0,0:03:30.72,0:03:35.97,Default,,0000,0000,0000,,and you match it to your model of benign cancers, and you see which model it matches better, and
Dialogue: 0,0:03:35.97,0:03:39.39,Default,,0000,0000,0000,,depending on which model it matches better to, you then
Dialogue: 0,0:03:39.39,0:03:43.68,Default,,0000,0000,0000,,predict whether the new cancer is malignant or benign,
Dialogue: 0,0:03:43.68,0:03:45.81,Default,,0000,0000,0000,,okay?
Dialogue: 0,0:03:45.81,0:03:46.90,Default,,0000,0000,0000,,So
Dialogue: 0,0:03:46.90,0:03:49.66,Default,,0000,0000,0000,,what I just described, just this class
Dialogue: 0,0:03:49.66,0:03:52.93,Default,,0000,0000,0000,,of methods where you build a separate model for malignant cancers and
Dialogue: 0,0:03:52.93,0:03:54.95,Default,,0000,0000,0000,,a separate model for benign cancers,
Dialogue: 0,0:03:54.95,0:03:58.38,Default,,0000,0000,0000,,is called a generative learning algorithm,
Dialogue: 0,0:03:58.38,0:04:01.23,Default,,0000,0000,0000,,and let me just, kind of, formalize this.
Dialogue: 0,0:04:01.23,0:04:03.09,Default,,0000,0000,0000,,
Dialogue: 0,0:04:03.09,0:04:05.46,Default,,0000,0000,0000,,So in the models that we've
Dialogue: 0,0:04:05.46,0:04:07.56,Default,,0000,0000,0000,,been talking about previously, those were actually
Dialogue: 0,0:04:07.56,0:04:13.90,Default,,0000,0000,0000,,all discriminative learning algorithms,
Dialogue: 0,0:04:13.90,0:04:18.96,Default,,0000,0000,0000,,and stated more formally, a discriminative learning algorithm is one
Dialogue: 0,0:04:18.96,0:04:22.73,Default,,0000,0000,0000,,that either learns P of Y
Dialogue: 0,0:04:22.73,0:04:25.19,Default,,0000,0000,0000,,given X directly,
Dialogue: 0,0:04:25.19,0:04:26.89,Default,,0000,0000,0000,,
Dialogue: 0,0:04:26.89,0:04:29.62,Default,,0000,0000,0000,,or even
Dialogue: 0,0:04:29.62,0:04:32.51,Default,,0000,0000,0000,,learns a hypothesis
Dialogue: 0,0:04:32.51,0:04:39.51,Default,,0000,0000,0000,,that
Dialogue: 0,0:04:39.70,0:04:43.24,Default,,0000,0000,0000,,outputs values 0 or 1 directly,
Dialogue: 0,0:04:43.24,0:04:46.51,Default,,0000,0000,0000,,okay? So logistic regression is an example of
Dialogue: 0,0:04:46.51,0:04:49.35,Default,,0000,0000,0000,,a discriminative learning algorithm.
Dialogue: 0,0:04:49.35,0:04:56.35,Default,,0000,0000,0000,,In contrast, a generative learning algorithm
Dialogue: 0,0:04:58.15,0:04:59.94,Default,,0000,0000,0000,,models P of X given Y,
Dialogue: 0,0:04:59.94,0:05:03.27,Default,,0000,0000,0000,,the probability of the features given the class label,
Dialogue: 0,0:05:03.27,0:05:04.33,Default,,0000,0000,0000,,
Dialogue: 0,0:05:04.33,0:05:11.33,Default,,0000,0000,0000,,and as a technical detail, it also models P of Y, but that's a less important thing, and the
Dialogue: 0,0:05:11.94,0:05:15.32,Default,,0000,0000,0000,,interpretation of this is that a generative model
Dialogue: 0,0:05:15.32,0:05:17.50,Default,,0000,0000,0000,,builds a probabilistic model for
Dialogue: 0,0:05:17.50,0:05:21.96,Default,,0000,0000,0000,,what the features look like,
Dialogue: 0,0:05:21.96,0:05:27.31,Default,,0000,0000,0000,,conditioned on the
Dialogue: 0,0:05:27.31,0:05:30.73,Default,,0000,0000,0000,,class label, okay? In other words, conditioned on whether a cancer is
Dialogue: 0,0:05:30.73,0:05:35.06,Default,,0000,0000,0000,,malignant or benign, it models a probability distribution over what the features
Dialogue: 0,0:05:35.06,0:05:37.09,Default,,0000,0000,0000,,of the cancer look like.
Dialogue: 0,0:05:37.09,0:05:40.47,Default,,0000,0000,0000,,Then having built this model - having built a model for P of X
Dialogue: 0,0:05:40.47,0:05:42.13,Default,,0000,0000,0000,,given Y and P of Y -
Dialogue: 0,0:05:42.13,0:05:45.95,Default,,0000,0000,0000,,then by Bayes' rule, obviously, you can compute P of Y = 1
Dialogue: 0,0:05:45.95,0:05:47.44,Default,,0000,0000,0000,,conditioned on X.
Dialogue: 0,0:05:47.44,0:05:49.37,Default,,0000,0000,0000,,This is just
Dialogue: 0,0:05:49.37,0:05:50.96,Default,,0000,0000,0000,,P of X
Dialogue: 0,0:05:50.96,0:05:52.67,Default,,0000,0000,0000,,given Y = 1
Dialogue: 0,0:05:52.67,0:05:56.43,Default,,0000,0000,0000,,times P of Y = 1,
Dialogue: 0,0:05:56.43,0:05:58.92,Default,,0000,0000,0000,,divided by P of X,
Dialogue: 0,0:05:58.92,0:06:01.35,Default,,0000,0000,0000,,and, if necessary,
Dialogue: 0,0:06:01.35,0:06:05.60,Default,,0000,0000,0000,,you can calculate the denominator
Dialogue: 0,0:06:05.60,0:06:12.60,Default,,0000,0000,0000,,using this, right?
Dialogue: 0,0:06:18.97,0:06:19.89,Default,,0000,0000,0000,,
Dialogue: 0,0:06:19.89,0:06:22.92,Default,,0000,0000,0000,,And so by modeling P of X given Y
Dialogue: 0,0:06:22.92,0:06:26.50,Default,,0000,0000,0000,,and modeling P of Y, you can actually use Bayes' rule to get back to P of Y given
Dialogue: 0,0:06:26.50,0:06:27.57,Default,,0000,0000,0000,,X,
Dialogue: 0,0:06:27.57,0:06:29.82,Default,,0000,0000,0000,,but a generative model -
Dialogue: 0,0:06:29.82,0:06:33.12,Default,,0000,0000,0000,,a generative learning algorithm - starts by modeling P of X given Y, rather than P of Y
Dialogue: 0,0:06:33.12,0:06:34.98,Default,,0000,0000,0000,,given X, okay?
Dialogue: 0,0:06:34.98,0:06:37.81,Default,,0000,0000,0000,,We'll talk about some of the tradeoffs, and why this may be a
Dialogue: 0,0:06:37.81,0:06:42.00,Default,,0000,0000,0000,,better or worse idea than a discriminative model, a bit later.
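As an editorial aside (not part of the lecture): the Bayes' rule computation just described can be sketched in a few lines of Python. The function name and the example density values here are hypothetical, chosen only to illustrate the arithmetic.

```python
# Hedged sketch of the Bayes' rule step described above: given the two
# class-conditional densities p(x|y=1) and p(x|y=0) evaluated at a new x,
# and a class prior p(y=1) = phi, compute the posterior p(y=1|x).
def posterior_y1(px_given_y1, px_given_y0, phi):
    """p(y=1|x) = p(x|y=1) p(y=1) / p(x), where the denominator is
    p(x) = p(x|y=1) p(y=1) + p(x|y=0) p(y=0)."""
    numerator = px_given_y1 * phi
    denominator = numerator + px_given_y0 * (1.0 - phi)
    return numerator / denominator

# Hypothetical example: the two class models assign densities 0.3 and 0.1
# to a new point, and the prior p(y=1) is 0.5.
print(posterior_y1(0.3, 0.1, 0.5))  # close to 0.75
```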
Dialogue: 0,0:06:42.00,0:06:46.68,Default,,0000,0000,0000,,Let's go through a specific example of a generative learning algorithm,
Dialogue: 0,0:06:46.68,0:06:47.91,Default,,0000,0000,0000,,and for
Dialogue: 0,0:06:47.91,0:06:51.33,Default,,0000,0000,0000,,this specific motivating example, I'm
Dialogue: 0,0:06:51.33,0:06:53.58,Default,,0000,0000,0000,,going to assume
Dialogue: 0,0:06:53.58,0:06:55.60,Default,,0000,0000,0000,,that your input features X
Dialogue: 0,0:06:55.60,0:06:58.20,Default,,0000,0000,0000,,are in R^N
Dialogue: 0,0:06:58.20,0:07:00.49,Default,,0000,0000,0000,,
Dialogue: 0,0:07:00.49,0:07:03.11,Default,,0000,0000,0000,,and are
Dialogue: 0,0:07:03.11,0:07:05.71,Default,,0000,0000,0000,,continuous values, okay?
Dialogue: 0,0:07:05.71,0:07:08.19,Default,,0000,0000,0000,,And under this assumption, let
Dialogue: 0,0:07:08.19,0:07:15.19,Default,,0000,0000,0000,,me describe to you a specific algorithm called Gaussian Discriminant Analysis,
Dialogue: 0,0:07:20.18,0:07:23.06,Default,,0000,0000,0000,,
Dialogue: 0,0:07:23.06,0:07:25.96,Default,,0000,0000,0000,,and the, I
Dialogue: 0,0:07:25.96,0:07:30.08,Default,,0000,0000,0000,,guess, core assumption is that we're going to assume in the Gaussian discriminant analysis
Dialogue: 0,0:07:30.08,0:07:32.28,Default,,0000,0000,0000,,model that P of X given Y
Dialogue: 0,0:07:32.28,0:07:39.28,Default,,0000,0000,0000,,is Gaussian, okay? So
Dialogue: 0,0:07:39.78,0:07:43.45,Default,,0000,0000,0000,,actually, just raise your hand: how many of you have seen a multivariate Gaussian before -
Dialogue: 0,0:07:43.45,0:07:45.64,Default,,0000,0000,0000,,not a 1-D Gaussian, but the higher-dimensional ones?
Dialogue: 0,0:07:45.64,0:07:51.03,Default,,0000,0000,0000,,Okay, cool, like maybe half of you, two-thirds of
Dialogue: 0,0:07:51.03,0:07:54.60,Default,,0000,0000,0000,,you.
So let me just say a few words about Gaussians, and for those of you that have seen
Dialogue: 0,0:07:54.60,0:07:57.12,Default,,0000,0000,0000,,it before, it'll be a refresher.
Dialogue: 0,0:07:57.12,0:08:03.25,Default,,0000,0000,0000,,So we say that a random variable Z is distributed Gaussian - multivariate Gaussian - as, and that's the
Dialogue: 0,0:08:03.25,0:08:05.31,Default,,0000,0000,0000,,script N for normal,
Dialogue: 0,0:08:05.31,0:08:08.15,Default,,0000,0000,0000,,with
Dialogue: 0,0:08:08.15,0:08:13.38,Default,,0000,0000,0000,,parameters mean mu and covariance matrix sigma, if
Dialogue: 0,0:08:13.38,0:08:16.76,Default,,0000,0000,0000,,Z has density P of Z equals 1
Dialogue: 0,0:08:16.76,0:08:19.54,Default,,0000,0000,0000,,over 2 Pi to the N over 2, determinant of sigma to the one-half,
Dialogue: 0,0:08:19.54,0:08:23.94,Default,,0000,0000,0000,,times E to the minus one-half, Z minus mu transpose, sigma inverse, Z minus mu,
Dialogue: 0,0:08:23.94,0:08:30.94,Default,,0000,0000,0000,,okay?
Dialogue: 0,0:08:31.45,0:08:35.95,Default,,0000,0000,0000,,That's the formula for the density as a generalization of the one-dimensional Gaussian. Instead of
Dialogue: 0,0:08:35.95,0:08:38.19,Default,,0000,0000,0000,,the familiar bell-shaped curve, it's
Dialogue: 0,0:08:38.19,0:08:42.22,Default,,0000,0000,0000,,a higher-dimensional, vector-valued random variable
Dialogue: 0,0:08:42.22,0:08:42.99,Default,,0000,0000,0000,,Z.
Dialogue: 0,0:08:42.99,0:08:47.21,Default,,0000,0000,0000,,Don't worry too much about this formula for the density.
You rarely end up needing to
Dialogue: 0,0:08:47.21,0:08:47.96,Default,,0000,0000,0000,,use it,
Dialogue: 0,0:08:47.96,0:08:51.62,Default,,0000,0000,0000,,but the two key quantities are: this
Dialogue: 0,0:08:51.62,0:08:55.31,Default,,0000,0000,0000,,vector mu is the mean of the Gaussian, and this
Dialogue: 0,0:08:55.31,0:08:57.00,Default,,0000,0000,0000,,matrix sigma is
Dialogue: 0,0:08:57.00,0:09:00.06,Default,,0000,0000,0000,,the covariance matrix -
Dialogue: 0,0:09:00.06,0:09:02.47,Default,,0000,0000,0000,,covariance,
Dialogue: 0,0:09:02.47,0:09:04.67,Default,,0000,0000,0000,,and so
Dialogue: 0,0:09:04.67,0:09:07.41,Default,,0000,0000,0000,,sigma will be equal to,
Dialogue: 0,0:09:07.41,0:09:11.81,Default,,0000,0000,0000,,right, the definition of the covariance of a vector-valued random variable:
Dialogue: 0,0:09:11.81,0:09:13.71,Default,,0000,0000,0000,,the expected value of Z minus mu times Z minus mu transpose,
Dialogue: 0,0:09:13.71,0:09:17.26,Default,,0000,0000,0000,,okay?
Dialogue: 0,0:09:17.26,0:09:20.32,Default,,0000,0000,0000,,And, actually, if this
Dialogue: 0,0:09:20.32,0:09:22.93,Default,,0000,0000,0000,,doesn't look familiar to you,
Dialogue: 0,0:09:22.93,0:09:25.72,Default,,0000,0000,0000,,you might re-watch the
Dialogue: 0,0:09:25.72,0:09:28.74,Default,,0000,0000,0000,,discussion section that the TAs held last Friday
Dialogue: 0,0:09:28.74,0:09:33.69,Default,,0000,0000,0000,,or the one that they'll be holding later this week on, sort of, a recap of
Dialogue: 0,0:09:33.69,0:09:34.29,Default,,0000,0000,0000,,probability, okay?
Dialogue: 0,0:09:34.29,0:09:37.67,Default,,0000,0000,0000,,So
Dialogue: 0,0:09:37.67,0:09:41.37,Default,,0000,0000,0000,,a multivariate Gaussian is parameterized by a mean and a covariance, and let me just -
Dialogue: 0,0:09:41.37,0:09:42.42,Default,,0000,0000,0000,,can I
Dialogue: 0,0:09:42.42,0:09:45.86,Default,,0000,0000,0000,,have the laptop displayed, please?
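As an editorial aside (not part of the lecture): the density formula just mentioned is easy to state in code. This is a minimal sketch assuming the standard multivariate normal density; the function name is my own.

```python
import numpy as np

# Sketch of the multivariate Gaussian density discussed above:
# p(z) = (2*pi)^(-n/2) * |Sigma|^(-1/2) * exp(-1/2 (z-mu)^T Sigma^{-1} (z-mu))
def gaussian_density(z, mu, Sigma):
    n = len(mu)
    d = z - mu
    # Solve Sigma^{-1} d rather than forming the inverse explicitly.
    quad = d @ np.linalg.solve(Sigma, d)
    norm_const = np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm_const

# At the mean of a standard 2-D normal (mu = 0, Sigma = I), the density
# is 1 / (2*pi), roughly 0.159.
print(gaussian_density(np.zeros(2), np.zeros(2), np.eye(2)))
```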
Dialogue: 0,0:09:45.86,0:09:48.57,Default,,0000,0000,0000,,I'll just go ahead and actually show you,
Dialogue: 0,0:09:48.57,0:09:52.09,Default,,0000,0000,0000,,you know, graphically, the effects of
Dialogue: 0,0:09:52.09,0:09:53.93,Default,,0000,0000,0000,,varying a Gaussian -
Dialogue: 0,0:09:53.93,0:09:56.31,Default,,0000,0000,0000,,varying the parameters of a Gaussian.
Dialogue: 0,0:09:56.31,0:09:59.13,Default,,0000,0000,0000,,So what I have up here
Dialogue: 0,0:09:59.13,0:10:02.96,Default,,0000,0000,0000,,is the density of a zero-mean Gaussian with
Dialogue: 0,0:10:02.96,0:10:05.88,Default,,0000,0000,0000,,covariance matrix equal to the identity. The covariance matrix is shown in the upper right-hand
Dialogue: 0,0:10:05.88,0:10:08.24,Default,,0000,0000,0000,,corner of the slide, and
Dialogue: 0,0:10:08.24,0:10:11.60,Default,,0000,0000,0000,,there's the familiar bell-shaped curve in two dimensions.
Dialogue: 0,0:10:11.60,0:10:17.32,Default,,0000,0000,0000,,And so if I shrink the covariance matrix - instead of covariance equal to the identity, if
Dialogue: 0,0:10:17.32,0:10:21.60,Default,,0000,0000,0000,,I shrink the covariance matrix - then the Gaussian becomes more peaked,
Dialogue: 0,0:10:21.60,0:10:25.09,Default,,0000,0000,0000,,and if I widen the covariance, so like sigma equals 2, 2 on the diagonal,
Dialogue: 0,0:10:25.09,0:10:25.75,Default,,0000,0000,0000,,then
Dialogue: 0,0:10:25.75,0:10:30.81,Default,,0000,0000,0000,,the distribution - well, the density becomes more spread out, okay?
Dialogue: 0,0:10:30.81,0:10:32.51,Default,,0000,0000,0000,,That was the standard normal: zero mean, identity
Dialogue: 0,0:10:32.51,0:10:34.57,Default,,0000,0000,0000,,covariance.
Dialogue: 0,0:10:34.57,0:10:36.52,Default,,0000,0000,0000,,If I increase
Dialogue: 0,0:10:36.52,0:10:39.42,Default,,0000,0000,0000,,the off-diagonals of the covariance matrix, right,
Dialogue: 0,0:10:39.42,0:10:43.95,Default,,0000,0000,0000,,if I make the variables correlated, then the
Dialogue: 0,0:10:43.95,0:10:45.35,Default,,0000,0000,0000,,Gaussian becomes flattened out in this X = Y direction, and if I
Dialogue: 0,0:10:45.35,0:10:47.17,Default,,0000,0000,0000,,increase it even further,
Dialogue: 0,0:10:47.17,0:10:52.46,Default,,0000,0000,0000,,then my variables, X and Y - excuse me, it was Z1 and Z2 -
Dialogue: 0,0:10:52.46,0:10:57.02,Default,,0000,0000,0000,,my two variables on the horizontal axes, become even more correlated. I'll just show the same thing in
Dialogue: 0,0:10:57.02,0:10:58.94,Default,,0000,0000,0000,,contours.
Dialogue: 0,0:10:58.94,0:11:03.31,Default,,0000,0000,0000,,The standard normal distribution has contours that are - they're actually
Dialogue: 0,0:11:03.31,0:11:06.96,Default,,0000,0000,0000,,circles. Because of the aspect ratio, these look like ellipses.
Dialogue: 0,0:11:06.96,0:11:09.05,Default,,0000,0000,0000,,These should actually be circles,
Dialogue: 0,0:11:09.05,0:11:12.77,Default,,0000,0000,0000,,and if you increase the off-diagonals of the Gaussian covariance matrix,
Dialogue: 0,0:11:12.77,0:11:14.89,Default,,0000,0000,0000,,then it becomes
Dialogue: 0,0:11:14.89,0:11:16.40,Default,,0000,0000,0000,,ellipses aligned along the, sort of,
Dialogue: 0,0:11:16.40,0:11:18.09,Default,,0000,0000,0000,,45-degree angle
Dialogue: 0,0:11:18.09,0:11:21.12,Default,,0000,0000,0000,,in this example.
Dialogue: 0,0:11:21.12,0:11:26.43,Default,,0000,0000,0000,,This is the same thing. Here's an example of a Gaussian density with negative covariances.
Dialogue: 0,0:11:26.43,0:11:28.61,Default,,0000,0000,0000,,So now the correlation
Dialogue: 0,0:11:28.61,0:11:32.70,Default,,0000,0000,0000,,goes the other way, so that's an even stronger [inaudible] of covariance, and the same thing in
Dialogue: 0,0:11:32.70,0:11:36.40,Default,,0000,0000,0000,,contours. This is a Gaussian with negative entries on the off-diagonals, and even
Dialogue: 0,0:11:36.40,0:11:41.13,Default,,0000,0000,0000,,larger entries on the off-diagonals, okay?
Dialogue: 0,0:11:41.13,0:11:45.48,Default,,0000,0000,0000,,And the other parameter for the Gaussian is the mean parameter, so if this is - with mu = 0,
Dialogue: 0,0:11:45.48,0:11:47.84,Default,,0000,0000,0000,,and as you change the mean parameter -
Dialogue: 0,0:11:47.84,0:11:50.06,Default,,0000,0000,0000,,this is mu equals 0.15 -
Dialogue: 0,0:11:50.06,0:11:56.23,Default,,0000,0000,0000,,the location of the Gaussian just moves around, okay?
Dialogue: 0,0:11:56.23,0:12:01.50,Default,,0000,0000,0000,,All right. So that was a quick primer on what Gaussians look like, and here's,
Dialogue: 0,0:12:01.50,0:12:02.30,Default,,0000,0000,0000,,
Dialogue: 0,0:12:02.30,0:12:06.23,Default,,0000,0000,0000,,as a roadmap or as a picture to keep in mind, when we describe the Gaussian discriminant
Dialogue: 0,0:12:06.23,0:12:09.40,Default,,0000,0000,0000,,analysis algorithm, this is what we're going to do.
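As an editorial aside (not part of the lecture): the effect of the off-diagonal covariance entries described in the demo can be checked numerically. This is my own illustration, assuming NumPy's random generator.

```python
import numpy as np

# Illustration of the slides above: a positive off-diagonal entry in Sigma
# makes the two coordinates Z1 and Z2 correlated, and the empirical
# covariance of many samples recovers the Sigma used to generate them.
rng = np.random.default_rng(0)
mu = np.zeros(2)
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])  # off-diagonals = correlation between Z1, Z2

Z = rng.multivariate_normal(mu, Sigma, size=50_000)
empirical_cov = np.cov(Z.T)
print(np.round(empirical_cov, 2))  # close to Sigma
```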
Here's
Dialogue: 0,0:12:09.40,0:12:11.64,Default,,0000,0000,0000,,the training set,
Dialogue: 0,0:12:11.64,0:12:14.79,Default,,0000,0000,0000,,and in the Gaussian discriminant analysis algorithm,
Dialogue: 0,0:12:14.79,0:12:19.48,Default,,0000,0000,0000,,what I'm going to do is I'm going to look at the positive examples, say the crosses,
Dialogue: 0,0:12:19.48,0:12:23.52,Default,,0000,0000,0000,,and just looking at only the positive examples, I'm gonna fit a Gaussian distribution to the
Dialogue: 0,0:12:23.52,0:12:25.25,Default,,0000,0000,0000,,positive examples, and so
Dialogue: 0,0:12:25.25,0:12:27.58,Default,,0000,0000,0000,,maybe I end up with a Gaussian distribution like that,
Dialogue: 0,0:12:27.58,0:12:30.96,Default,,0000,0000,0000,,okay? So there's P of X given Y = 1.
Dialogue: 0,0:12:30.96,0:12:34.34,Default,,0000,0000,0000,,And then I'll look at the negative examples, the O's in this figure,
Dialogue: 0,0:12:34.34,0:12:36.99,Default,,0000,0000,0000,,and I'll fit a Gaussian to that, and maybe I get a
Dialogue: 0,0:12:36.99,0:12:37.74,Default,,0000,0000,0000,,Gaussian
Dialogue: 0,0:12:37.74,0:12:40.70,Default,,0000,0000,0000,,centered over there. These are the contours of my second Gaussian,
Dialogue: 0,0:12:40.70,0:12:41.91,Default,,0000,0000,0000,,and together -
Dialogue: 0,0:12:41.91,0:12:44.77,Default,,0000,0000,0000,,we'll say how later -
Dialogue: 0,0:12:44.77,0:12:51.17,Default,,0000,0000,0000,,together these two Gaussian densities will define a separator for these two classes, okay?
Dialogue: 0,0:12:51.17,0:12:55.30,Default,,0000,0000,0000,,And it'll turn out that the separator will turn out to be a little bit
Dialogue: 0,0:12:55.30,0:12:55.87,Default,,0000,0000,0000,,different
Dialogue: 0,0:12:55.87,0:12:57.34,Default,,0000,0000,0000,,from what logistic regression
Dialogue: 0,0:12:57.34,0:12:58.72,Default,,0000,0000,0000,,gives you.
Dialogue: 0,0:12:58.72,0:13:00.39,Default,,0000,0000,0000,,If you run logistic regression,
Dialogue: 0,0:13:00.39,0:13:03.72,Default,,0000,0000,0000,,you actually get the decision boundary shown as the green line, whereas Gaussian discriminant
Dialogue: 0,0:13:03.72,0:13:07.34,Default,,0000,0000,0000,,analysis gives you the blue line, okay? Switch back to chalkboard, please. All right. Here's the
Dialogue: 0,0:13:07.34,0:13:14.34,Default,,0000,0000,0000,,
Dialogue: 0,0:13:29.46,0:13:34.44,Default,,0000,0000,0000,,Gaussian discriminant analysis model. We're going
Dialogue: 0,0:13:34.44,0:13:37.16,Default,,0000,0000,0000,,to model P of Y
Dialogue: 0,0:13:37.16,0:13:40.94,Default,,0000,0000,0000,,as a Bernoulli random variable as usual -
Dialogue: 0,0:13:40.94,0:13:45.54,Default,,0000,0000,0000,,as a Bernoulli random variable parameterized by a parameter phi; you've
Dialogue: 0,0:13:45.54,0:13:47.30,Default,,0000,0000,0000,,seen this before.
Dialogue: 0,0:13:47.30,0:13:48.90,Default,,0000,0000,0000,,
Dialogue: 0,0:13:48.90,0:13:55.90,Default,,0000,0000,0000,,Model P of X given Y = 0 as a Gaussian -
Dialogue: 0,0:13:58.09,0:14:01.06,Default,,0000,0000,0000,,oh, you know what? Yeah,
Dialogue: 0,0:14:01.06,0:14:04.87,Default,,0000,0000,0000,,yes, excuse me. I
Dialogue: 0,0:14:04.87,0:14:06.05,Default,,0000,0000,0000,,thought this looked strange.
Dialogue: 0,0:14:06.05,0:14:11.73,Default,,0000,0000,0000,,This
Dialogue: 0,0:14:11.73,0:14:13.04,Default,,0000,0000,0000,,should be a sigma -
Dialogue: 0,0:14:13.04,0:14:16.23,Default,,0000,0000,0000,,determinant of sigma to the one-half in the denominator there.
Dialogue: 0,0:14:16.23,0:14:18.11,Default,,0000,0000,0000,,It's no big deal. It was - yeah,
Dialogue: 0,0:14:18.11,0:14:20.35,Default,,0000,0000,0000,,well, okay. Right.
Dialogue: 0,0:14:20.35,0:14:23.12,Default,,0000,0000,0000,,I
Dialogue: 0,0:14:23.12,0:14:26.69,Default,,0000,0000,0000,,was missing the determinant of sigma to the one-half on a previous
Dialogue: 0,0:14:26.69,0:14:33.69,Default,,0000,0000,0000,,board, excuse me. Okay,
Dialogue: 0,0:14:40.18,0:14:44.13,Default,,0000,0000,0000,,and so I model P of X given Y = 0 as a Gaussian
Dialogue: 0,0:14:44.13,0:14:47.70,Default,,0000,0000,0000,,with mean mu0 and covariance sigma,
Dialogue: 0,0:14:47.70,0:14:54.70,Default,,0000,0000,0000,,and P of X given Y = 1 as a Gaussian with mean mu1 and that same covariance sigma,
Dialogue: 0,0:14:54.80,0:15:01.21,Default,,0000,0000,0000,,
Dialogue: 0,0:15:01.21,0:15:02.00,Default,,0000,0000,0000,,and
Dialogue: 0,0:15:02.00,0:15:09.00,Default,,0000,0000,0000,,-
Dialogue: 0,0:15:11.20,0:15:12.84,Default,,0000,0000,0000,,okay?
Dialogue: 0,0:15:12.84,0:15:16.84,Default,,0000,0000,0000,,And so the parameters of this model are
Dialogue: 0,0:15:16.84,0:15:18.31,Default,,0000,0000,0000,,phi,
Dialogue: 0,0:15:18.31,0:15:20.64,Default,,0000,0000,0000,,mu0,
Dialogue: 0,0:15:20.64,0:15:23.90,Default,,0000,0000,0000,,mu1, and sigma,
Dialogue: 0,0:15:23.90,0:15:25.10,Default,,0000,0000,0000,,and so
Dialogue: 0,0:15:25.10,0:15:29.39,Default,,0000,0000,0000,,I can now write down the likelihood of the parameters
Dialogue: 0,0:15:29.39,0:15:30.38,Default,,0000,0000,0000,,as - oh, excuse
Dialogue: 0,0:15:30.38,0:15:37.38,Default,,0000,0000,0000,,me, actually, the log likelihood of the parameters as the log of
Dialogue: 0,0:15:40.50,0:15:41.44,Default,,0000,0000,0000,,that,
Dialogue: 0,0:15:41.44,0:15:45.08,Default,,0000,0000,0000,,right?
Dialogue: 0,0:15:45.08,0:15:48.26,Default,,0000,0000,0000,,So, in other words, if I'm given the training set, then
Dialogue: 0,0:15:48.26,0:15:52.69,Default,,0000,0000,0000,,I can write down the log likelihood of the parameters as the log of, you know,
Dialogue: 0,0:15:52.69,0:15:58.06,Default,,0000,0000,0000,,the product of the probabilities P of XI, YI, right?
Dialogue: 0,0:15:58.06,0:16:03.83,Default,,0000,0000,0000,,
Dialogue: 0,0:16:03.83,0:16:10.03,Default,,0000,0000,0000,,
Dialogue: 0,0:16:10.03,0:16:11.71,Default,,0000,0000,0000,,And
Dialogue: 0,0:16:11.71,0:16:14.65,Default,,0000,0000,0000,,this is just equal to that, where each of these terms, P of XI given YI
Dialogue: 0,0:16:14.65,0:16:16.98,Default,,0000,0000,0000,,or P of YI, is
Dialogue: 0,0:16:16.98,0:16:18.72,Default,,0000,0000,0000,,then given
Dialogue: 0,0:16:18.72,0:16:19.60,Default,,0000,0000,0000,,
Dialogue: 0,0:16:19.60,0:16:24.25,Default,,0000,0000,0000,,by one of these three equations on top, okay?
Dialogue: 0,0:16:24.25,0:16:26.42,Default,,0000,0000,0000,,And I just want
Dialogue: 0,0:16:26.42,0:16:30.00,Default,,0000,0000,0000,,to contrast this again with discriminative learning algorithms, right?
Dialogue: 0,0:16:30.00,0:16:31.79,Default,,0000,0000,0000,,So
Dialogue: 0,0:16:31.79,0:16:34.78,Default,,0000,0000,0000,,to give this a name, I guess, this is sometimes actually called
Dialogue: 0,0:16:34.78,0:16:39.25,Default,,0000,0000,0000,,the joint data likelihood - the joint likelihood -
Dialogue: 0,0:16:39.25,0:16:41.79,Default,,0000,0000,0000,,and
Dialogue: 0,0:16:41.79,0:16:44.93,Default,,0000,0000,0000,,let me just contrast this with what we had previously
Dialogue: 0,0:16:44.93,0:16:47.71,Default,,0000,0000,0000,,when we were talking about logistic
Dialogue: 0,0:16:47.71,0:16:51.86,Default,,0000,0000,0000,,regression.
There I said the log likelihood of the parameters theta
Dialogue: 0,0:16:51.86,0:16:53.09,Default,,0000,0000,0000,,was the log
Dialogue: 0,0:16:53.09,0:16:57.00,Default,,0000,0000,0000,,of a product over I = 1 to M of P of YI
Dialogue: 0,0:16:57.00,0:16:59.03,Default,,0000,0000,0000,,given XI,
Dialogue: 0,0:16:59.03,0:17:01.99,Default,,0000,0000,0000,,parameterized
Dialogue: 0,0:17:01.99,0:17:03.96,Default,,0000,0000,0000,,by theta, right?
Dialogue: 0,0:17:03.96,0:17:05.15,Default,,0000,0000,0000,,So
Dialogue: 0,0:17:05.15,0:17:08.04,Default,,0000,0000,0000,,back when we were fitting logistic regression models or generalized linear
Dialogue: 0,0:17:08.04,0:17:08.67,Default,,0000,0000,0000,,models,
Dialogue: 0,0:17:08.67,0:17:14.36,Default,,0000,0000,0000,,we were always modeling P of YI given XI, parameterized by theta, and that was the
Dialogue: 0,0:17:14.36,0:17:16.42,Default,,0000,0000,0000,,conditional
Dialogue: 0,0:17:16.42,0:17:17.46,Default,,0000,0000,0000,,
Dialogue: 0,0:17:17.46,0:17:19.58,Default,,0000,0000,0000,,likelihood, okay,
Dialogue: 0,0:17:19.58,0:17:23.91,Default,,0000,0000,0000,,in
Dialogue: 0,0:17:23.91,0:17:26.70,Default,,0000,0000,0000,,which we model P of YI given XI,
Dialogue: 0,0:17:26.70,0:17:27.84,Default,,0000,0000,0000,,whereas, now,
Dialogue: 0,0:17:27.84,0:17:30.93,Default,,0000,0000,0000,,with generative learning algorithms, we're going to look at the joint likelihood, which
Dialogue: 0,0:17:30.93,0:17:35.44,Default,,0000,0000,0000,,is P of XI, YI, okay?
Dialogue: 0,0:17:35.44,0:17:36.27,Default,,0000,0000,0000,,So let's
Dialogue: 0,0:17:36.27,0:17:43.27,Default,,0000,0000,0000,,see.
Dialogue: 0,0:17:46.24,0:17:48.05,Default,,0000,0000,0000,,So given the training set,
Dialogue: 0,0:17:48.05,0:17:51.26,Default,,0000,0000,0000,,and using the Gaussian discriminant analysis model,
Dialogue: 0,0:17:51.26,0:17:55.55,Default,,0000,0000,0000,,to fit the parameters of the model, we'll do maximum likelihood estimation as usual,
Dialogue: 0,0:17:55.55,0:17:58.78,Default,,0000,0000,0000,,and so you maximize your
Dialogue: 0,0:17:58.78,0:18:02.27,Default,,0000,0000,0000,,L
Dialogue: 0,0:18:02.27,0:18:05.91,Default,,0000,0000,0000,,with respect to the parameters phi, mu0,
Dialogue: 0,0:18:05.91,0:18:07.85,Default,,0000,0000,0000,,mu1, sigma,
Dialogue: 0,0:18:07.85,0:18:10.15,Default,,0000,0000,0000,,and so
Dialogue: 0,0:18:10.15,0:18:15.80,Default,,0000,0000,0000,,if we find the maximum likelihood estimate of the parameters, you find that phi is
Dialogue: 0,0:18:15.80,0:18:17.40,Default,,0000,0000,0000,,-
Dialogue: 0,0:18:17.40,0:18:22.04,Default,,0000,0000,0000,,the maximum likelihood estimate is actually no surprise, and I'm writing this down mainly as practice with
Dialogue: 0,0:18:22.04,0:18:23.75,Default,,0000,0000,0000,,indicator notation,
Dialogue: 0,0:18:23.75,0:18:27.22,Default,,0000,0000,0000,,all right? So the maximum likelihood estimate for phi would be the sum over
Dialogue: 0,0:18:27.22,0:18:29.52,Default,,0000,0000,0000,,I of YI, divided by M,
Dialogue: 0,0:18:29.52,0:18:32.04,Default,,0000,0000,0000,,
Dialogue: 0,0:18:32.04,0:18:35.49,Default,,0000,0000,0000,,or written alternatively as the sum over
Dialogue: 0,0:18:35.49,0:18:41.62,Default,,0000,0000,0000,,all your training examples of indicator YI = 1, divided by M, okay?
Dialogue: 0,0:18:41.62,0:18:42.87,Default,,0000,0000,0000,, Dialogue: 0,0:18:42.87,0:18:45.88,Default,,0000,0000,0000,,In other words, the maximum likelihood estimate for the Dialogue: 0,0:18:45.88,0:18:47.39,Default,,0000,0000,0000,, Dialogue: 0,0:18:47.39,0:18:52.60,Default,,0000,0000,0000,,Bernoulli parameter phi is just the fraction of training examples with Dialogue: 0,0:18:52.60,0:18:56.88,Default,,0000,0000,0000,,label one, with Y equals 1. Dialogue: 0,0:18:56.88,0:19:02.80,Default,,0000,0000,0000,,The maximum likelihood estimate for mu0 is this, okay? Dialogue: 0,0:19:02.80,0:19:09.80,Default,,0000,0000,0000,,You Dialogue: 0,0:19:22.01,0:19:24.04,Default,,0000,0000,0000,,should stare at this for a second and Dialogue: 0,0:19:24.04,0:19:26.20,Default,,0000,0000,0000,,see if it makes sense. Dialogue: 0,0:19:26.20,0:19:33.20,Default,,0000,0000,0000,,Actually, I'll just write the next one for mu1 while you do that. Dialogue: 0,0:19:34.31,0:19:41.31,Default,,0000,0000,0000,,Okay? Dialogue: 0,0:19:49.88,0:19:51.90,Default,,0000,0000,0000,,So what this is - Dialogue: 0,0:19:51.90,0:19:57.06,Default,,0000,0000,0000,,the denominator is a sum over your training set of indicator YI = 0. Dialogue: 0,0:19:57.06,0:20:02.09,Default,,0000,0000,0000,,So for every training example for which YI = 0, this will Dialogue: 0,0:20:02.09,0:20:05.67,Default,,0000,0000,0000,,increment the count by one, all right? So the Dialogue: 0,0:20:05.67,0:20:08.03,Default,,0000,0000,0000,,denominator is just Dialogue: 0,0:20:08.03,0:20:10.88,Default,,0000,0000,0000,,the number of examples Dialogue: 0,0:20:10.88,0:20:14.55,Default,,0000,0000,0000,, Dialogue: 0,0:20:14.55,0:20:15.37,Default,,0000,0000,0000,,with Dialogue: 0,0:20:15.37,0:20:20.70,Default,,0000,0000,0000,,label zero, all right?
Dialogue: 0,0:20:20.70,0:20:22.68,Default,,0000,0000,0000,,And then the numerator will be, let's Dialogue: 0,0:20:22.68,0:20:25.66,Default,,0000,0000,0000,,see, Sum from I = 1 to M - every time Dialogue: 0,0:20:25.66,0:20:30.35,Default,,0000,0000,0000,,YI is equal to 0, this will be a one, and otherwise, this thing will be zero, Dialogue: 0,0:20:30.35,0:20:31.48,Default,,0000,0000,0000,,and so Dialogue: 0,0:20:31.48,0:20:34.33,Default,,0000,0000,0000,,this indicator function means that you're including Dialogue: 0,0:20:34.33,0:20:37.63,Default,,0000,0000,0000,,only the terms for Dialogue: 0,0:20:37.63,0:20:40.27,Default,,0000,0000,0000,,which YI is equal to zero Dialogue: 0,0:20:40.27,0:20:45.02,Default,,0000,0000,0000,,because for all the times where YI is equal to one, Dialogue: 0,0:20:45.02,0:20:47.99,Default,,0000,0000,0000,,this summand will be equal to zero, Dialogue: 0,0:20:47.99,0:20:52.89,Default,,0000,0000,0000,,and then you multiply that by XI, and so the numerator is Dialogue: 0,0:20:52.89,0:20:55.77,Default,,0000,0000,0000,,really the sum Dialogue: 0,0:20:55.77,0:21:00.91,Default,,0000,0000,0000,,of XI's corresponding to Dialogue: 0,0:21:00.91,0:21:02.97,Default,,0000,0000,0000,, Dialogue: 0,0:21:02.97,0:21:08.65,Default,,0000,0000,0000,,examples where the class labels were zero, okay? Dialogue: 0,0:21:08.65,0:21:14.53,Default,,0000,0000,0000,,Raise your hand if this makes sense. Okay, cool. Dialogue: 0,0:21:14.53,0:21:17.35,Default,,0000,0000,0000,,So just to say this in words, Dialogue: 0,0:21:17.35,0:21:19.63,Default,,0000,0000,0000,,this just means look through your training set, Dialogue: 0,0:21:19.63,0:21:22.90,Default,,0000,0000,0000,,find all the examples for which Y = 0, Dialogue: 0,0:21:22.90,0:21:25.16,Default,,0000,0000,0000,,and take the average Dialogue: 0,0:21:25.16,0:21:29.57,Default,,0000,0000,0000,,of the value of X for all your examples for which Y = 0.
So take all your negative training Dialogue: 0,0:21:29.57,0:21:30.47,Default,,0000,0000,0000,,examples Dialogue: 0,0:21:30.47,0:21:32.95,Default,,0000,0000,0000,,and average the values for X Dialogue: 0,0:21:32.95,0:21:37.73,Default,,0000,0000,0000,,and Dialogue: 0,0:21:37.73,0:21:41.35,Default,,0000,0000,0000,,that's mu0, okay? If this Dialogue: 0,0:21:41.35,0:21:45.14,Default,,0000,0000,0000,,notation is still a little bit cryptic - if you're still not sure why this Dialogue: 0,0:21:45.14,0:21:48.00,Default,,0000,0000,0000,,equation translates into Dialogue: 0,0:21:48.00,0:21:52.94,Default,,0000,0000,0000,,what I just said, do go home and stare at it for a while until it just makes sense. This is, sort Dialogue: 0,0:21:52.94,0:21:57.04,Default,,0000,0000,0000,,of, no surprise. It just says to estimate the mean for the negative examples, Dialogue: 0,0:21:57.04,0:21:59.86,Default,,0000,0000,0000,,take all your negative examples, and average them. So Dialogue: 0,0:21:59.86,0:22:06.83,Default,,0000,0000,0000,,no surprise, but this is useful practice with the notation. Dialogue: 0,0:22:06.83,0:22:07.50,Default,,0000,0000,0000,,[Inaudible] Dialogue: 0,0:22:07.50,0:22:11.33,Default,,0000,0000,0000,,derive the maximum likelihood estimate for sigma. I won't do that. You can read that in Dialogue: 0,0:22:11.33,0:22:16.63,Default,,0000,0000,0000,,the notes yourself. Dialogue: 0,0:22:16.63,0:22:21.39,Default,,0000,0000,0000,, Dialogue: 0,0:22:21.39,0:22:24.34,Default,,0000,0000,0000,,And so having fit Dialogue: 0,0:22:24.34,0:22:28.94,Default,,0000,0000,0000,,the parameters phi, mu0, mu1, and sigma Dialogue: 0,0:22:28.94,0:22:30.47,Default,,0000,0000,0000,,to your data, Dialogue: 0,0:22:30.47,0:22:35.59,Default,,0000,0000,0000,,well, you now need to make a prediction.
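The maximum likelihood estimates described above are straightforward to write down in code. The sketch below is only an illustration of those formulas, not anything from the lecture itself: the tiny arrays X and y are made up, and the shared-covariance estimate is the one the lecture defers to the notes.

```python
import numpy as np

# Hypothetical 1D training set: rows of X are examples, y holds the 0/1 labels.
X = np.array([[1.0], [1.2], [0.8], [3.0], [3.3], [2.9]])
y = np.array([0, 0, 0, 1, 1, 1])
m = len(y)

# phi: fraction of training examples with label one.
phi = np.mean(y == 1)

# mu0, mu1: average of X over the negative and positive examples respectively.
mu0 = X[y == 0].mean(axis=0)
mu1 = X[y == 1].mean(axis=0)

# Shared covariance: average outer product of each example minus its class mean.
class_means = np.where((y == 1)[:, None], mu1, mu0)
diff = X - class_means
Sigma = diff.T @ diff / m
```

With these six made-up examples, phi comes out to the fraction of positive labels and mu0 to the average of the three negative X's, matching the "average the negative examples" description above.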
You Dialogue: 0,0:22:35.59,0:22:39.71,Default,,0000,0000,0000,,know, when you're given a new value of X, when you're given a new tumor, you need to predict whether Dialogue: 0,0:22:39.71,0:22:41.63,Default,,0000,0000,0000,,it's malignant or benign. Dialogue: 0,0:22:41.63,0:22:44.61,Default,,0000,0000,0000,,Your prediction is then going to be, Dialogue: 0,0:22:44.61,0:22:46.71,Default,,0000,0000,0000,,let's say, Dialogue: 0,0:22:46.71,0:22:50.40,Default,,0000,0000,0000,,the most likely value of Y given X. I should Dialogue: 0,0:22:50.40,0:22:54.72,Default,,0000,0000,0000,,write semicolon the parameters there. I'll just omit that Dialogue: 0,0:22:54.72,0:22:57.23,Default,,0000,0000,0000,,- which is the arg max over Y Dialogue: 0,0:22:57.23,0:23:04.23,Default,,0000,0000,0000,,by Bayes rule, all right? Dialogue: 0,0:23:05.87,0:23:12.87,Default,,0000,0000,0000,,And that is, in turn, Dialogue: 0,0:23:14.54,0:23:15.48,Default,,0000,0000,0000,,just that Dialogue: 0,0:23:15.48,0:23:19.79,Default,,0000,0000,0000,,because the denominator P of X doesn't depend on Y, Dialogue: 0,0:23:19.79,0:23:21.53,Default,,0000,0000,0000,,and Dialogue: 0,0:23:21.53,0:23:23.54,Default,,0000,0000,0000,,if P of Y Dialogue: 0,0:23:23.54,0:23:27.45,Default,,0000,0000,0000,,is uniform. Dialogue: 0,0:23:27.45,0:23:30.95,Default,,0000,0000,0000,,In other words, if each of your classes is equally likely, Dialogue: 0,0:23:30.95,0:23:33.04,Default,,0000,0000,0000,,so if P of Y Dialogue: 0,0:23:33.04,0:23:36.84,Default,,0000,0000,0000,,takes the same value for all values Dialogue: 0,0:23:36.84,0:23:38.01,Default,,0000,0000,0000,,of Y, Dialogue: 0,0:23:38.01,0:23:42.25,Default,,0000,0000,0000,,then this is just arg max over Y of P of X Dialogue: 0,0:23:42.25,0:23:47.34,Default,,0000,0000,0000,,given Y, okay?
Dialogue: 0,0:23:47.34,0:23:50.97,Default,,0000,0000,0000,,This happens sometimes, maybe not very often, so usually you end up using this Dialogue: 0,0:23:50.97,0:23:53.05,Default,,0000,0000,0000,,formula where you Dialogue: 0,0:23:53.05,0:23:56.47,Default,,0000,0000,0000,,compute P of X given Y and P of Y using Dialogue: 0,0:23:56.47,0:23:59.91,Default,,0000,0000,0000,,your model, okay? Student:Can Dialogue: 0,0:23:59.91,0:24:00.78,Default,,0000,0000,0000,,you explain Dialogue: 0,0:24:00.78,0:24:05.48,Default,,0000,0000,0000,,arg max? Instructor (Andrew Ng):Oh, let's see. So if you take - actually Dialogue: 0,0:24:05.48,0:24:08.75,Default,,0000,0000,0000,,let me. Dialogue: 0,0:24:08.75,0:24:09.77,Default,,0000,0000,0000,,So Dialogue: 0,0:24:09.77,0:24:15.47,Default,,0000,0000,0000,,arg max means the value of Y that maximizes this. Student:Oh, okay. Instructor (Andrew Ng):So just Dialogue: 0,0:24:15.47,0:24:18.60,Default,,0000,0000,0000,,for an example, the min over X of X - 5 Dialogue: 0,0:24:18.60,0:24:22.61,Default,,0000,0000,0000,,squared is 0 because by choosing X equals 5, you can get this to be zero, Dialogue: 0,0:24:22.61,0:24:24.47,Default,,0000,0000,0000,,and the arg min over X Dialogue: 0,0:24:24.47,0:24:30.87,Default,,0000,0000,0000,,of X - 5 squared is equal to 5 because 5 is the value of X that minimizes this, okay? Dialogue: 0,0:24:30.87,0:24:37.87,Default,,0000,0000,0000,,Cool. Thanks Dialogue: 0,0:24:38.10,0:24:45.10,Default,,0000,0000,0000,,for Dialogue: 0,0:24:45.14,0:24:52.14,Default,,0000,0000,0000,,asking that. Instructor (Andrew Ng):Okay. Actually any other questions about this? Yeah? Student:Why is Dialogue: 0,0:25:00.91,0:25:04.32,Default,,0000,0000,0000,,the distribution uniform? Why isn't [inaudible] - Instructor (Andrew Ng):Oh, I see. By uniform I meant - I was being loose Dialogue: 0,0:25:04.32,0:25:05.65,Default,,0000,0000,0000,,here.
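To make the min versus arg min/arg max distinction concrete, here is a tiny Python sketch over a made-up grid of candidate values, mirroring the (X - 5) squared example from the lecture:

```python
# min returns the smallest value; arg min returns the input that achieves it.
xs = [3, 4, 5, 6, 7]
values = [(x - 5) ** 2 for x in xs]

min_value = min(values)                # the smallest value of (x - 5)^2: 0
arg_min = xs[values.index(min_value)]  # the x achieving it: 5
# arg max works the same way with the sign flipped.
arg_max = max(xs, key=lambda x: -((x - 5) ** 2))
```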
Dialogue: 0,0:25:05.65,0:25:07.24,Default,,0000,0000,0000,,I meant if Dialogue: 0,0:25:07.24,0:25:10.89,Default,,0000,0000,0000,,P of Y = 0 is equal to P of Y = 1, or if Y is the Dialogue: 0,0:25:10.89,0:25:12.78,Default,,0000,0000,0000,,uniform distribution over Dialogue: 0,0:25:12.78,0:25:13.86,Default,,0000,0000,0000,,the set 0 and 1. Student:Oh. Instructor (Andrew Ng):I just meant - yeah, if P of Y = Dialogue: 0,0:25:13.86,0:25:15.45,Default,,0000,0000,0000,,0 equals P of Y = 1. That's all I mean, okay? Anything else? Dialogue: 0,0:25:15.45,0:25:17.08,Default,,0000,0000,0000,,All Dialogue: 0,0:25:17.08,0:25:20.03,Default,,0000,0000,0000,,right. Okay. So Dialogue: 0,0:25:20.03,0:25:22.04,Default,,0000,0000,0000,, Dialogue: 0,0:25:22.04,0:25:28.04,Default,,0000,0000,0000,, Dialogue: 0,0:25:28.04,0:25:35.04,Default,,0000,0000,0000,,it Dialogue: 0,0:25:42.13,0:25:46.98,Default,,0000,0000,0000,,turns out Gaussian discriminant analysis has an interesting relationship Dialogue: 0,0:25:46.98,0:25:48.59,Default,,0000,0000,0000,,to logistic Dialogue: 0,0:25:48.59,0:25:51.64,Default,,0000,0000,0000,,regression. Let me illustrate that. Dialogue: 0,0:25:51.64,0:25:54.54,Default,,0000,0000,0000,, Dialogue: 0,0:25:54.54,0:25:59.20,Default,,0000,0000,0000,,So let's say you have a training set Dialogue: 0,0:25:59.20,0:26:06.20,Default,,0000,0000,0000,,- actually let me just go ahead and draw a 1D training set, and that will Dialogue: 0,0:26:11.82,0:26:14.58,Default,,0000,0000,0000,,kind of work, yes, okay. Dialogue: 0,0:26:14.58,0:26:18.47,Default,,0000,0000,0000,,So let's say we have a training set comprising a few negative and a few positive examples, Dialogue: 0,0:26:18.47,0:26:23.07,Default,,0000,0000,0000,,and let's say I run Gaussian discriminant analysis.
So I'll fit Gaussians to each of these two Dialogue: 0,0:26:23.07,0:26:24.68,Default,,0000,0000,0000,,densities - a Gaussian density to each of these two - to my positive Dialogue: 0,0:26:24.68,0:26:27.44,Default,,0000,0000,0000,,and negative training Dialogue: 0,0:26:27.44,0:26:30.09,Default,,0000,0000,0000,,examples, Dialogue: 0,0:26:30.09,0:26:33.47,Default,,0000,0000,0000,,and so maybe my Dialogue: 0,0:26:33.47,0:26:37.96,Default,,0000,0000,0000,,positive examples, the X's, are fit with a Gaussian like this, Dialogue: 0,0:26:37.96,0:26:39.52,Default,,0000,0000,0000,, Dialogue: 0,0:26:39.52,0:26:44.56,Default,,0000,0000,0000,, Dialogue: 0,0:26:44.56,0:26:47.84,Default,,0000,0000,0000,,and my negative examples I will fit with a Dialogue: 0,0:26:47.84,0:26:54.84,Default,,0000,0000,0000,,Gaussian that looks like that, okay? Dialogue: 0,0:26:57.03,0:27:00.18,Default,,0000,0000,0000,,Now, I Dialogue: 0,0:27:00.18,0:27:04.15,Default,,0000,0000,0000,,hope this [inaudible]. Now, let's Dialogue: 0,0:27:04.15,0:27:06.79,Default,,0000,0000,0000,,vary along the X axis, Dialogue: 0,0:27:06.79,0:27:10.26,Default,,0000,0000,0000,,and what I want to do is I'll Dialogue: 0,0:27:10.26,0:27:14.47,Default,,0000,0000,0000,,overlay on top of this plot. I'm going to plot Dialogue: 0,0:27:14.47,0:27:21.47,Default,,0000,0000,0000,,P of Y = 1 - no, actually, Dialogue: 0,0:27:24.42,0:27:25.82,Default,,0000,0000,0000,,given X Dialogue: 0,0:27:25.82,0:27:31.57,Default,,0000,0000,0000,,for a variety of values X, okay? Dialogue: 0,0:27:31.57,0:27:32.93,Default,,0000,0000,0000,,So I actually realize what I should have done. Dialogue: 0,0:27:32.93,0:27:37.25,Default,,0000,0000,0000,,I'm gonna call the X's the negative examples, and I'm gonna call the O's the positive examples. It just Dialogue: 0,0:27:37.25,0:27:39.72,Default,,0000,0000,0000,,makes this part come in better. Dialogue: 0,0:27:39.72,0:27:44.27,Default,,0000,0000,0000,,So let's take a value of X that's fairly small.
Let's say X is this value here on a horizontal Dialogue: 0,0:27:44.27,0:27:45.20,Default,,0000,0000,0000,,axis. Dialogue: 0,0:27:45.20,0:27:48.81,Default,,0000,0000,0000,,Then what's the probability of Y being equal to one conditioned on X? Well, Dialogue: 0,0:27:48.81,0:27:52.77,Default,,0000,0000,0000,,the way you calculate that is you write P of Y Dialogue: 0,0:27:52.77,0:27:56.90,Default,,0000,0000,0000,,= 1 given X, and then you plug in all these formulas as usual, right? It's P of X Dialogue: 0,0:27:56.90,0:27:59.14,Default,,0000,0000,0000,,given Y = 1, which is Dialogue: 0,0:27:59.14,0:28:00.93,Default,,0000,0000,0000,,your Gaussian density, Dialogue: 0,0:28:00.93,0:28:04.07,Default,,0000,0000,0000,,times P of Y = 1, you know, which is Dialogue: 0,0:28:04.07,0:28:08.06,Default,,0000,0000,0000,,essentially - this is just going to be equal to phi, Dialogue: 0,0:28:08.06,0:28:09.51,Default,,0000,0000,0000,,and then divided by, Dialogue: 0,0:28:09.51,0:28:12.21,Default,,0000,0000,0000,,right, P of X, and then this shows you how you can calculate this. Dialogue: 0,0:28:12.21,0:28:13.66,Default,,0000,0000,0000,,By using Dialogue: 0,0:28:13.66,0:28:18.73,Default,,0000,0000,0000,,these two Gaussians and my phi on P of Y, I actually compute what P of Y Dialogue: 0,0:28:18.73,0:28:21.34,Default,,0000,0000,0000,,= 1 given X is, and Dialogue: 0,0:28:21.34,0:28:24.53,Default,,0000,0000,0000,,in this case, Dialogue: 0,0:28:24.53,0:28:29.23,Default,,0000,0000,0000,,if X is this small, clearly it belongs to the left Gaussian. It's very unlikely to belong to Dialogue: 0,0:28:29.23,0:28:31.33,Default,,0000,0000,0000,,a positive class, and so Dialogue: 0,0:28:31.33,0:28:36.37,Default,,0000,0000,0000,,it'll be very small; it'll be very close to zero say, okay? 
Dialogue: 0,0:28:36.37,0:28:40.77,Default,,0000,0000,0000,,And then we can increment the value of X a bit, and study a different value of X, and Dialogue: 0,0:28:40.77,0:28:44.23,Default,,0000,0000,0000,,plot what is the P of Y given X - P of Y Dialogue: 0,0:28:44.23,0:28:49.66,Default,,0000,0000,0000,,= 1 given X, and, again, it'll be pretty small. Let's Dialogue: 0,0:28:49.66,0:28:52.20,Default,,0000,0000,0000,,use a point like that, right? At this point, Dialogue: 0,0:28:52.20,0:28:54.48,Default,,0000,0000,0000,,the two Gaussian densities Dialogue: 0,0:28:54.48,0:28:56.49,Default,,0000,0000,0000,,have equal value, Dialogue: 0,0:28:56.49,0:28:57.54,Default,,0000,0000,0000,,and if Dialogue: 0,0:28:57.54,0:28:58.80,Default,,0000,0000,0000,,I ask Dialogue: 0,0:28:58.80,0:29:01.78,Default,,0000,0000,0000,,if X is this value, right, shown by the arrow, Dialogue: 0,0:29:01.78,0:29:03.08,Default,,0000,0000,0000,, Dialogue: 0,0:29:03.08,0:29:06.77,Default,,0000,0000,0000,,what's the probability of Y being equal to one for that value of X? Well, you really can't tell, so maybe it's about 0.5, okay? And if Dialogue: 0,0:29:06.77,0:29:11.27,Default,,0000,0000,0000,,you fill Dialogue: 0,0:29:11.27,0:29:13.93,Default,,0000,0000,0000,,in a bunch more points, you get a Dialogue: 0,0:29:13.93,0:29:16.47,Default,,0000,0000,0000,,curve like that, Dialogue: 0,0:29:16.47,0:29:20.55,Default,,0000,0000,0000,,and then you can keep going. Let's say for a point like that, you can ask what's the probability of Y Dialogue: 0,0:29:20.55,0:29:21.92,Default,,0000,0000,0000,,being one? Well, if Dialogue: 0,0:29:21.92,0:29:25.32,Default,,0000,0000,0000,,it's that far out, then clearly, it belongs to this Dialogue: 0,0:29:25.32,0:29:27.54,Default,,0000,0000,0000,,rightmost Gaussian, and so Dialogue: 0,0:29:27.54,0:29:31.51,Default,,0000,0000,0000,,the probability of Y being a one would be very high; it would be almost one, okay?
Dialogue: 0,0:29:31.51,0:29:33.83,Default,,0000,0000,0000,,And so you Dialogue: 0,0:29:33.83,0:29:36.43,Default,,0000,0000,0000,,can repeat this exercise Dialogue: 0,0:29:36.43,0:29:38.51,Default,,0000,0000,0000,,for a bunch of points. All right, Dialogue: 0,0:29:38.51,0:29:41.62,Default,,0000,0000,0000,,compute P of Y equals one given X for a bunch of points, Dialogue: 0,0:29:41.62,0:29:46.49,Default,,0000,0000,0000,,and if you connect up these points, Dialogue: 0,0:29:46.49,0:29:48.15,Default,,0000,0000,0000,,you find that the Dialogue: 0,0:29:48.15,0:29:50.04,Default,,0000,0000,0000,,curve you get [Pause] plotted Dialogue: 0,0:29:50.04,0:29:56.41,Default,,0000,0000,0000,,takes the form of a sigmoid function, okay? So, Dialogue: 0,0:29:56.41,0:30:00.03,Default,,0000,0000,0000,,in other words, when you make the assumptions under the Gaussian Dialogue: 0,0:30:00.03,0:30:02.35,Default,,0000,0000,0000,, Dialogue: 0,0:30:02.35,0:30:05.55,Default,,0000,0000,0000,,discriminant analysis model, Dialogue: 0,0:30:05.55,0:30:08.13,Default,,0000,0000,0000,,that P of X given Y is Gaussian, Dialogue: 0,0:30:08.13,0:30:13.33,Default,,0000,0000,0000,,when you go back and compute what P of Y given X is, you actually get back Dialogue: 0,0:30:13.33,0:30:19.29,Default,,0000,0000,0000,,exactly the same sigmoid function Dialogue: 0,0:30:19.29,0:30:22.96,Default,,0000,0000,0000,,that we were using in logistic regression, okay? But it turns out the key difference is that Dialogue: 0,0:30:22.96,0:30:27.36,Default,,0000,0000,0000,,Gaussian discriminant analysis will end up choosing a different Dialogue: 0,0:30:27.36,0:30:28.90,Default,,0000,0000,0000,,position Dialogue: 0,0:30:28.90,0:30:31.66,Default,,0000,0000,0000,,and steepness of the sigmoid Dialogue: 0,0:30:31.66,0:30:35.73,Default,,0000,0000,0000,,than would logistic regression. Is there a question?
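This equivalence is easy to check numerically. In the sketch below, the class means, shared variance, and prior are all made-up numbers: P(Y = 1 | x) is computed by Bayes rule from the two Gaussians, and compared against a logistic function whose coefficients fall out of expanding the log posterior odds.

```python
import math

# Made-up GDA parameters: two class means, one shared variance, class prior phi.
mu0, mu1, sigma2, phi = -1.0, 2.0, 1.0, 0.5

def gaussian(x, mu):
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def posterior(x):
    # Bayes rule: P(y=1|x) = P(x|y=1) P(y=1) / P(x).
    num = gaussian(x, mu1) * phi
    return num / (num + gaussian(x, mu0) * (1 - phi))

# With a shared variance, the log posterior odds are linear in x, so the
# posterior is exactly logistic: 1 / (1 + exp(-(a*x + b))).
a = (mu1 - mu0) / sigma2
b = (mu0 ** 2 - mu1 ** 2) / (2 * sigma2) + math.log(phi / (1 - phi))

for x in [-3.0, -1.0, 0.5, 2.0, 4.0]:
    assert abs(posterior(x) - 1 / (1 + math.exp(-(a * x + b)))) < 1e-9
```

Note that the sigmoid's position and steepness (b and a here) are determined by the fitted Gaussian parameters, which is why GDA ends up with a different sigmoid than logistic regression fit directly to the same data.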
Student:I'm just Dialogue: 0,0:30:35.73,0:30:39.57,Default,,0000,0000,0000,,wondering, Dialogue: 0,0:30:39.57,0:30:44.60,Default,,0000,0000,0000,,is the Gaussian P of Y [inaudible]? Instructor (Andrew Ng):No, let's see. The Gaussian - so this Gaussian is Dialogue: 0,0:30:44.60,0:30:47.50,Default,,0000,0000,0000,,P of X given Y = 1, and Dialogue: 0,0:30:47.50,0:30:50.07,Default,,0000,0000,0000,,this Gaussian is P of X Dialogue: 0,0:30:50.07,0:30:57.07,Default,,0000,0000,0000,,given Y = 0; does that make sense? Anything else? Student:Okay. Instructor (Andrew Ng):Yeah? Student:When you were drawing all the dots, how did you Dialogue: 0,0:31:01.92,0:31:04.04,Default,,0000,0000,0000,,decide what P Dialogue: 0,0:31:04.04,0:31:05.76,Default,,0000,0000,0000,,of Y Dialogue: 0,0:31:05.76,0:31:09.77,Default,,0000,0000,0000,,given X was? Instructor (Andrew Ng):What - say that again. Student:I'm sorry. Could you go over how you Dialogue: 0,0:31:09.77,0:31:12.33,Default,,0000,0000,0000,,figured out where Dialogue: 0,0:31:12.33,0:31:13.75,Default,,0000,0000,0000,,to draw each dot? Instructor (Andrew Ng):Let's see, Dialogue: 0,0:31:13.75,0:31:15.92,Default,,0000,0000,0000,,okay. So the Dialogue: 0,0:31:15.92,0:31:18.60,Default,,0000,0000,0000,,computation is as follows, right? The steps Dialogue: 0,0:31:18.60,0:31:22.02,Default,,0000,0000,0000,,are I have the training set, and so given my training set, I'm going to fit Dialogue: 0,0:31:22.02,0:31:25.16,Default,,0000,0000,0000,,a Gaussian discriminant analysis model to it, Dialogue: 0,0:31:25.16,0:31:28.94,Default,,0000,0000,0000,,and what that means is I'll build a model for P of X given Y = 1. I'll Dialogue: 0,0:31:28.94,0:31:30.16,Default,,0000,0000,0000,,build Dialogue: 0,0:31:30.16,0:31:33.06,Default,,0000,0000,0000,,a model for P of X given Y = 0, Dialogue: 0,0:31:33.06,0:31:37.95,Default,,0000,0000,0000,,and I'll also fit a Bernoulli distribution to Dialogue: 0,0:31:37.95,0:31:39.29,Default,,0000,0000,0000,,P of Y, okay?
Dialogue: 0,0:31:39.29,0:31:43.27,Default,,0000,0000,0000,,So, in other words, given my training set, I'll fit P of X given Y and P of Y Dialogue: 0,0:31:43.27,0:31:46.54,Default,,0000,0000,0000,,to my data, and now I've chosen my parameters Dialogue: 0,0:31:46.54,0:31:48.63,Default,,0000,0000,0000,,phi, mu0, Dialogue: 0,0:31:48.63,0:31:49.60,Default,,0000,0000,0000,,mu1, Dialogue: 0,0:31:49.60,0:31:52.31,Default,,0000,0000,0000,,and sigma, okay? Then Dialogue: 0,0:31:52.31,0:31:54.93,Default,,0000,0000,0000,,this is the process I went through Dialogue: 0,0:31:54.93,0:31:58.71,Default,,0000,0000,0000,,to plot all these dots, right? It's just I pick a point on the X axis, Dialogue: 0,0:31:58.71,0:32:00.95,Default,,0000,0000,0000,,and then I compute Dialogue: 0,0:32:00.95,0:32:02.97,Default,,0000,0000,0000,,P of Y given X Dialogue: 0,0:32:02.97,0:32:05.20,Default,,0000,0000,0000,,for that value of X, Dialogue: 0,0:32:05.20,0:32:09.38,Default,,0000,0000,0000,,and P of Y = 1 conditioned on X will be some value between zero and one. It'll Dialogue: 0,0:32:09.38,0:32:12.50,Default,,0000,0000,0000,,be some real number, and whatever that real number is, I then plot it on the vertical Dialogue: 0,0:32:12.50,0:32:14.49,Default,,0000,0000,0000,,axis, Dialogue: 0,0:32:14.49,0:32:19.93,Default,,0000,0000,0000,,okay? And the way I compute P of Y = 1 conditioned on X is Dialogue: 0,0:32:19.93,0:32:21.38,Default,,0000,0000,0000,,I would Dialogue: 0,0:32:21.38,0:32:23.47,Default,,0000,0000,0000,,use these quantities.
I would use Dialogue: 0,0:32:23.47,0:32:25.08,Default,,0000,0000,0000,,P of X given Y Dialogue: 0,0:32:25.08,0:32:30.05,Default,,0000,0000,0000,,and P of Y, and, sort of, plug them into Bayes rule, and that allows me Dialogue: 0,0:32:30.05,0:32:30.76,Default,,0000,0000,0000,,to Dialogue: 0,0:32:30.76,0:32:31.69,Default,,0000,0000,0000,,compute P of Y given X Dialogue: 0,0:32:31.69,0:32:34.21,Default,,0000,0000,0000,,from these three quantities; does that make Dialogue: 0,0:32:34.21,0:32:35.24,Default,,0000,0000,0000,,sense? Student:Yeah. Instructor (Andrew Ng):Was there something more that - Dialogue: 0,0:32:35.24,0:32:39.08,Default,,0000,0000,0000,,Student:And how did you model P of X; is that - Instructor (Andrew Ng):Oh, okay. Yeah, so Dialogue: 0,0:32:39.08,0:32:42.27,Default,,0000,0000,0000,,- Dialogue: 0,0:32:42.27,0:32:43.17,Default,,0000,0000,0000,,well, Dialogue: 0,0:32:43.17,0:32:45.34,Default,,0000,0000,0000,,we've got this right Dialogue: 0,0:32:45.34,0:32:49.09,Default,,0000,0000,0000,,here. So P of X can be written as, Dialogue: 0,0:32:49.09,0:32:50.45,Default,,0000,0000,0000,, Dialogue: 0,0:32:50.45,0:32:52.81,Default,,0000,0000,0000,,right, Dialogue: 0,0:32:52.81,0:32:55.03,Default,,0000,0000,0000,,so Dialogue: 0,0:32:55.03,0:32:58.66,Default,,0000,0000,0000,,P of X given Y = 0 times P of Y = 0 plus P of X given Y = 1 times P of Y = Dialogue: 0,0:32:58.66,0:33:03.14,Default,,0000,0000,0000,,1, right? Dialogue: 0,0:33:03.14,0:33:06.74,Default,,0000,0000,0000,,And so each of these terms, P of X given Y Dialogue: 0,0:33:06.74,0:33:11.17,Default,,0000,0000,0000,,and P of Y, these are terms I can get directly from my Gaussian discriminant Dialogue: 0,0:33:11.17,0:33:13.89,Default,,0000,0000,0000,,analysis model.
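That expansion of the denominator can be sketched directly. The parameter values below are invented for illustration; the check at the end confirms that dividing by P(x) makes the two posteriors sum to one, which is exactly the role the denominator plays.

```python
import math

# Made-up class-conditional Gaussians and class prior.
mu0, mu1, sigma2, phi = 0.0, 3.0, 1.0, 0.3

def p_x_given_y(x, mu):
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def p_x(x):
    # P(x) = P(x|y=0) P(y=0) + P(x|y=1) P(y=1); every term comes from the model.
    return p_x_given_y(x, mu0) * (1 - phi) + p_x_given_y(x, mu1) * phi

x = 1.7
p1 = p_x_given_y(x, mu1) * phi / p_x(x)        # P(y=1|x) by Bayes rule
p0 = p_x_given_y(x, mu0) * (1 - phi) / p_x(x)  # P(y=0|x)
assert abs(p0 + p1 - 1.0) < 1e-12
```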
Each of these terms is something that Dialogue: 0,0:33:13.89,0:33:16.30,Default,,0000,0000,0000,,my model gives me directly, Dialogue: 0,0:33:16.30,0:33:18.35,Default,,0000,0000,0000,,so plugged into the denominator, Dialogue: 0,0:33:18.35,0:33:25.35,Default,,0000,0000,0000,,and by doing that, that's how I compute P of Y = 1 given X, make sense? Student:Thank you. Instructor (Andrew Ng):Okay. Cool. Dialogue: 0,0:33:30.57,0:33:37.57,Default,,0000,0000,0000,, Dialogue: 0,0:33:53.17,0:33:57.86,Default,,0000,0000,0000,,So let's talk a little bit about the advantages and disadvantages of using a Dialogue: 0,0:33:57.86,0:34:00.36,Default,,0000,0000,0000,,generative Dialogue: 0,0:34:00.36,0:34:04.76,Default,,0000,0000,0000,,learning algorithm, okay? So in the particular case of Gaussian discriminant analysis, we Dialogue: 0,0:34:04.76,0:34:05.96,Default,,0000,0000,0000,,assume that Dialogue: 0,0:34:05.96,0:34:08.08,Default,,0000,0000,0000,,X conditioned on Y Dialogue: 0,0:34:08.08,0:34:13.83,Default,,0000,0000,0000,,is Gaussian, Dialogue: 0,0:34:13.83,0:34:17.09,Default,,0000,0000,0000,,and the argument I showed on the previous chalkboard - I didn't prove it formally, Dialogue: 0,0:34:17.09,0:34:20.29,Default,,0000,0000,0000,,but you can actually go back and prove it yourself - Dialogue: 0,0:34:20.29,0:34:24.27,Default,,0000,0000,0000,,is that if you assume X given Y is Gaussian, Dialogue: 0,0:34:24.27,0:34:27.42,Default,,0000,0000,0000,,then that implies that Dialogue: 0,0:34:27.42,0:34:28.59,Default,,0000,0000,0000,,when you plot Y Dialogue: 0,0:34:28.59,0:34:30.64,Default,,0000,0000,0000,,given X, Dialogue: 0,0:34:30.64,0:34:37.64,Default,,0000,0000,0000,,you find that - well, let me just write logistic posterior, okay?
Dialogue: 0,0:34:40.69,0:34:43.52,Default,,0000,0000,0000,,And the argument I showed just now, which I didn't prove; you can go home and prove it Dialogue: 0,0:34:43.52,0:34:44.49,Default,,0000,0000,0000,,yourself, Dialogue: 0,0:34:44.49,0:34:49.33,Default,,0000,0000,0000,,is that if you assume X given Y is Gaussian, then that implies that the posterior Dialogue: 0,0:34:49.33,0:34:53.93,Default,,0000,0000,0000,,distribution or the form of Dialogue: 0,0:34:53.93,0:34:57.23,Default,,0000,0000,0000,,P of Y = 1 given X Dialogue: 0,0:34:57.23,0:35:00.57,Default,,0000,0000,0000,,is going to be a logistic function, Dialogue: 0,0:35:00.57,0:35:02.09,Default,,0000,0000,0000,, Dialogue: 0,0:35:02.09,0:35:05.81,Default,,0000,0000,0000,,and it turns out this Dialogue: 0,0:35:05.81,0:35:08.32,Default,,0000,0000,0000,,implication in the opposite direction Dialogue: 0,0:35:08.32,0:35:11.80,Default,,0000,0000,0000,,does not hold true, Dialogue: 0,0:35:11.80,0:35:16.35,Default,,0000,0000,0000,,okay? In particular, it actually turns out - this is actually, kind of, cool. It Dialogue: 0,0:35:16.35,0:35:19.36,Default,,0000,0000,0000,,turns out that if you Dialogue: 0,0:35:19.36,0:35:22.74,Default,,0000,0000,0000,,assume that X given Y = 1 is Dialogue: 0,0:35:22.74,0:35:24.39,Default,,0000,0000,0000,, Dialogue: 0,0:35:24.39,0:35:26.17,Default,,0000,0000,0000,,Poisson with Dialogue: 0,0:35:26.17,0:35:28.61,Default,,0000,0000,0000,,parameter lambda 1, Dialogue: 0,0:35:28.61,0:35:32.04,Default,,0000,0000,0000,,and X given Y = 0 Dialogue: 0,0:35:32.04,0:35:36.51,Default,,0000,0000,0000,,is Poisson Dialogue: 0,0:35:36.51,0:35:39.20,Default,,0000,0000,0000,,with parameter lambda 0.
Dialogue: 0,0:35:39.20,0:35:41.42,Default,,0000,0000,0000,,It turns out if you assumed this, Dialogue: 0,0:35:41.42,0:35:42.64,Default,,0000,0000,0000,,then Dialogue: 0,0:35:42.64,0:35:43.85,Default,,0000,0000,0000,,that also Dialogue: 0,0:35:43.85,0:35:46.84,Default,,0000,0000,0000,,implies that P of Y Dialogue: 0,0:35:46.84,0:35:51.13,Default,,0000,0000,0000,,given X Dialogue: 0,0:35:51.13,0:35:52.44,Default,,0000,0000,0000,, Dialogue: 0,0:35:52.44,0:35:54.43,Default,,0000,0000,0000,,is logistic, okay? Dialogue: 0,0:35:54.43,0:35:57.26,Default,,0000,0000,0000,,So there are lots of assumptions on X given Y Dialogue: 0,0:35:57.26,0:36:04.26,Default,,0000,0000,0000,,that will lead to P of Y given X being logistic, and, Dialogue: 0,0:36:04.62,0:36:06.06,Default,,0000,0000,0000,,therefore, Dialogue: 0,0:36:06.06,0:36:11.61,Default,,0000,0000,0000,,the assumption that X given Y is Gaussian is a stronger assumption Dialogue: 0,0:36:11.61,0:36:15.02,Default,,0000,0000,0000,,than the assumption that Y given X is logistic, Dialogue: 0,0:36:15.02,0:36:17.40,Default,,0000,0000,0000,,okay? Because this implies this, Dialogue: 0,0:36:17.40,0:36:22.83,Default,,0000,0000,0000,,right? That means that this is a stronger assumption than this because Dialogue: 0,0:36:22.83,0:36:29.83,Default,,0000,0000,0000,,the logistic posterior holds whenever X given Y is Gaussian but not vice versa. Dialogue: 0,0:36:29.94,0:36:31.10,Default,,0000,0000,0000,,And so this leads to some Dialogue: 0,0:36:31.10,0:36:35.75,Default,,0000,0000,0000,,of the tradeoffs between Gaussian discriminant analysis and logistic Dialogue: 0,0:36:35.75,0:36:36.56,Default,,0000,0000,0000,,regression, Dialogue: 0,0:36:36.56,0:36:40.08,Default,,0000,0000,0000,,right?
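The Poisson case can also be checked numerically. The lambda values and prior below are made up; the sketch computes P(Y = 1 | x) by Bayes rule from two Poisson class-conditionals and verifies it matches a logistic function of x (the factorials cancel in the log odds, leaving a function linear in x):

```python
import math

# Made-up parameters: x | y=1 ~ Poisson(l1), x | y=0 ~ Poisson(l0), prior phi.
l0, l1, phi = 2.0, 5.0, 0.4

def poisson_pmf(x, lam):
    return math.exp(-lam) * lam ** x / math.factorial(x)

def posterior(x):
    num = poisson_pmf(x, l1) * phi
    return num / (num + poisson_pmf(x, l0) * (1 - phi))

# Log posterior odds: x*log(l1/l0) - (l1 - l0) + log(phi/(1-phi)), linear in x.
a = math.log(l1 / l0)
b = -(l1 - l0) + math.log(phi / (1 - phi))

for x in range(10):
    assert abs(posterior(x) - 1 / (1 + math.exp(-(a * x + b)))) < 1e-9
```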
Gaussian discriminant analysis makes a much stronger assumption Dialogue: 0,0:36:40.08,0:36:43.05,Default,,0000,0000,0000,,that X given Y is Gaussian, Dialogue: 0,0:36:43.05,0:36:47.10,Default,,0000,0000,0000,,and so when this assumption is true, when this assumption approximately holds, if you plot the Dialogue: 0,0:36:47.10,0:36:48.57,Default,,0000,0000,0000,,data, Dialogue: 0,0:36:48.57,0:36:52.04,Default,,0000,0000,0000,,and if X given Y is, indeed, approximately Gaussian, Dialogue: 0,0:36:52.04,0:36:56.20,Default,,0000,0000,0000,,then if you make this assumption, explicit to the algorithm, then the Dialogue: 0,0:36:56.20,0:36:57.50,Default,,0000,0000,0000,,algorithm will do better Dialogue: 0,0:36:57.50,0:36:59.24,Default,,0000,0000,0000,,because it's as if the Dialogue: 0,0:36:59.24,0:37:02.61,Default,,0000,0000,0000,,algorithm is making use of more information about the data. The algorithm knows that Dialogue: 0,0:37:02.61,0:37:04.38,Default,,0000,0000,0000,,the data is Gaussian, Dialogue: 0,0:37:04.38,0:37:05.29,Default,,0000,0000,0000,,right? And so Dialogue: 0,0:37:05.29,0:37:07.61,Default,,0000,0000,0000,,if the Gaussian assumption, you know, Dialogue: 0,0:37:07.61,0:37:09.53,Default,,0000,0000,0000,,holds or roughly holds, Dialogue: 0,0:37:09.53,0:37:10.22,Default,,0000,0000,0000,,then Gaussian Dialogue: 0,0:37:10.22,0:37:13.57,Default,,0000,0000,0000,,discriminant analysis may do better than logistic regression. 
Dialogue: 0,0:37:13.57,0:37:19.30,Default,,0000,0000,0000,,If, conversely, you're actually not sure what X given Y is, then Dialogue: 0,0:37:19.30,0:37:22.82,Default,,0000,0000,0000,,logistic regression, the discriminative algorithm, may do better, Dialogue: 0,0:37:22.82,0:37:25.58,Default,,0000,0000,0000,,and, in particular, suppose you use logistic regression, Dialogue: 0,0:37:25.58,0:37:26.11,Default,,0000,0000,0000,,and Dialogue: 0,0:37:26.11,0:37:29.69,Default,,0000,0000,0000,,maybe you assumed [inaudible] the data was Gaussian, but it turns out the data Dialogue: 0,0:37:29.69,0:37:31.43,Default,,0000,0000,0000,,was actually Poisson, right? Dialogue: 0,0:37:31.43,0:37:33.98,Default,,0000,0000,0000,,Then logistic regression will still do perfectly fine because if Dialogue: 0,0:37:33.98,0:37:36.87,Default,,0000,0000,0000,,the data were actually Poisson, Dialogue: 0,0:37:36.87,0:37:40.21,Default,,0000,0000,0000,,then P of Y = 1 given X will be logistic, and it'll do perfectly Dialogue: 0,0:37:40.21,0:37:44.52,Default,,0000,0000,0000,,fine, but if you assumed it was Gaussian, then the algorithm may go off and do something Dialogue: 0,0:37:44.52,0:37:51.52,Default,,0000,0000,0000,,that's not as good, okay? Dialogue: 0,0:37:52.11,0:37:56.14,Default,,0000,0000,0000,,So it turns out that - right. Dialogue: 0,0:37:56.14,0:37:58.80,Default,,0000,0000,0000,,So it's slightly different. Dialogue: 0,0:37:58.80,0:38:01.09,Default,,0000,0000,0000,,It Dialogue: 0,0:38:01.09,0:38:05.36,Default,,0000,0000,0000,,turns out the real advantage of generative learning algorithms is often that they Dialogue: 0,0:38:05.36,0:38:07.84,Default,,0000,0000,0000,,require less data, Dialogue: 0,0:38:07.84,0:38:09.05,Default,,0000,0000,0000,,and, in particular, Dialogue: 0,0:38:09.05,0:38:13.47,Default,,0000,0000,0000,,data is never really exactly Gaussian, right? Because data is often Dialogue: 0,0:38:13.47,0:38:15.89,Default,,0000,0000,0000,,approximately Gaussian; it's never exactly Gaussian.
Dialogue: 0,0:38:15.89,0:38:19.99,Default,,0000,0000,0000,,And it turns out, generative learning algorithms often do surprisingly well Dialogue: 0,0:38:19.99,0:38:20.75,Default,,0000,0000,0000,,even when Dialogue: 0,0:38:20.75,0:38:24.37,Default,,0000,0000,0000,,these modeling assumptions are not met, but Dialogue: 0,0:38:24.37,0:38:26.02,Default,,0000,0000,0000,,one other tradeoff Dialogue: 0,0:38:26.02,0:38:26.88,Default,,0000,0000,0000,,is that Dialogue: 0,0:38:26.88,0:38:29.93,Default,,0000,0000,0000,,by making stronger assumptions Dialogue: 0,0:38:29.93,0:38:31.90,Default,,0000,0000,0000,,about the data, Dialogue: 0,0:38:31.90,0:38:34.04,Default,,0000,0000,0000,,Gaussian discriminant analysis Dialogue: 0,0:38:34.04,0:38:36.23,Default,,0000,0000,0000,,often needs less data Dialogue: 0,0:38:36.23,0:38:40.44,Default,,0000,0000,0000,,in order to fit, like, an okay model, even if there's less training data. Whereas, in Dialogue: 0,0:38:40.44,0:38:41.87,Default,,0000,0000,0000,,contrast, Dialogue: 0,0:38:41.87,0:38:45.96,Default,,0000,0000,0000,,logistic regression, by making fewer Dialogue: 0,0:38:45.96,0:38:48.96,Default,,0000,0000,0000,,assumptions, is more robust to modeling errors because you're making a weaker assumption; you're Dialogue: 0,0:38:48.96,0:38:50.73,Default,,0000,0000,0000,,making fewer assumptions, Dialogue: 0,0:38:50.73,0:38:54.15,Default,,0000,0000,0000,,but sometimes it takes a slightly larger training set to fit than Gaussian discriminant Dialogue: 0,0:38:54.15,0:38:55.42,Default,,0000,0000,0000,,analysis. Question? 
Student:In order Dialogue: 0,0:38:55.42,0:39:00.02,Default,,0000,0000,0000,,to meet any assumption about the number [inaudible], plus Dialogue: 0,0:39:00.02,0:39:01.03,Default,,0000,0000,0000,, Dialogue: 0,0:39:01.03,0:39:03.14,Default,,0000,0000,0000,,here Dialogue: 0,0:39:03.14,0:39:09.09,Default,,0000,0000,0000,,we assume that P of Y = 1 is equal Dialogue: 0,0:39:09.09,0:39:11.16,Default,,0000,0000,0000,,to Dialogue: 0,0:39:11.16,0:39:14.85,Default,,0000,0000,0000,,the number of. Dialogue: 0,0:39:14.85,0:39:16.32,Default,,0000,0000,0000,,[Inaudible]. Is true when Dialogue: 0,0:39:16.32,0:39:18.07,Default,,0000,0000,0000,,the number of samples is marginal? Instructor (Andrew Ng):Okay. So let's see. Dialogue: 0,0:39:18.07,0:39:21.80,Default,,0000,0000,0000,,So there's a question of is this true - what Dialogue: 0,0:39:21.80,0:39:23.38,Default,,0000,0000,0000,,was that? Let me translate that Dialogue: 0,0:39:23.38,0:39:25.43,Default,,0000,0000,0000,,differently. So Dialogue: 0,0:39:25.43,0:39:28.83,Default,,0000,0000,0000,,the modeling assumptions are made independently of the size Dialogue: 0,0:39:28.83,0:39:29.88,Default,,0000,0000,0000,,of Dialogue: 0,0:39:29.88,0:39:32.52,Default,,0000,0000,0000,,your training set, right? So, like, in logistic regression - well, Dialogue: 0,0:39:32.52,0:39:36.65,Default,,0000,0000,0000,,in all of these models I'm assuming that these are random variables Dialogue: 0,0:39:36.65,0:39:41.38,Default,,0000,0000,0000,,drawn from some distribution, and then, finally, I'm given a single training set Dialogue: 0,0:39:41.38,0:39:43.64,Default,,0000,0000,0000,,and use that to estimate the parameters of the Dialogue: 0,0:39:43.64,0:39:44.45,Default,,0000,0000,0000,,distribution, right? Dialogue: 0,0:39:44.45,0:39:51.45,Default,,0000,0000,0000,,Student:So Dialogue: 0,0:39:51.89,0:39:54.17,Default,,0000,0000,0000,,what's the probability of Y = 1? Dialogue: 0,0:39:54.17,0:39:55.07,Default,,0000,0000,0000,,Instructor (Andrew Ng):Probability of Y = 1? 
Dialogue: 0,0:39:55.07,0:39:56.50,Default,,0000,0000,0000,,Student:Yeah, you used the - Instructor (Andrew Ng):Sort Dialogue: 0,0:39:56.50,0:39:57.91,Default,,0000,0000,0000,,of, this goes Dialogue: 0,0:39:57.91,0:39:59.59,Default,,0000,0000,0000,,- Dialogue: 0,0:39:59.59,0:40:02.83,Default,,0000,0000,0000,,back to the philosophy of maximum likelihood estimation, Dialogue: 0,0:40:02.83,0:40:04.70,Default,,0000,0000,0000,,right? I'm assuming that Dialogue: 0,0:40:04.70,0:40:08.33,Default,,0000,0000,0000,,P of Y is equal to phi to the Y Dialogue: 0,0:40:08.33,0:40:13.39,Default,,0000,0000,0000,,times 1 - phi to the 1 - Y. So I'm assuming that there's some true value of phi Dialogue: 0,0:40:13.39,0:40:14.01,Default,,0000,0000,0000,,generating Dialogue: 0,0:40:14.01,0:40:15.88,Default,,0000,0000,0000,,all my data, Dialogue: 0,0:40:15.88,0:40:18.85,Default,,0000,0000,0000,,and then Dialogue: 0,0:40:18.85,0:40:19.55,Default,,0000,0000,0000,,- Dialogue: 0,0:40:19.55,0:40:25.59,Default,,0000,0000,0000,,well, when I write this, I guess, maybe what I should write isn't - Dialogue: 0,0:40:25.59,0:40:27.61,Default,,0000,0000,0000,,so when I write this, I Dialogue: 0,0:40:27.61,0:40:29.90,Default,,0000,0000,0000,,guess there are already two values of phi. 
One is Dialogue: 0,0:40:29.90,0:40:32.20,Default,,0000,0000,0000,,there's a true underlying value of phi Dialogue: 0,0:40:32.20,0:40:35.16,Default,,0000,0000,0000,,that was used to generate the data, Dialogue: 0,0:40:35.16,0:40:39.63,Default,,0000,0000,0000,,and then there's the maximum likelihood estimate of the value of phi, and so when I was writing Dialogue: 0,0:40:39.63,0:40:41.51,Default,,0000,0000,0000,,those formulas earlier, Dialogue: 0,0:40:41.51,0:40:45.21,Default,,0000,0000,0000,,the formulas I was writing for phi, and mu0, and mu1 Dialogue: 0,0:40:45.21,0:40:49.36,Default,,0000,0000,0000,,were really the maximum likelihood estimates for phi, mu0, and mu1, and that's different from the true Dialogue: 0,0:40:49.36,0:40:55.82,Default,,0000,0000,0000,,underlying values of phi, mu0, and mu1, but - Student:[Off mic]. Instructor (Andrew Ng):Yeah, right. So the maximum Dialogue: 0,0:40:55.82,0:40:57.56,Default,,0000,0000,0000,,likelihood estimate comes from the data, Dialogue: 0,0:40:57.56,0:41:01.53,Default,,0000,0000,0000,,and there's some, sort of, true underlying value of phi that I'm trying to estimate, Dialogue: 0,0:41:01.53,0:41:05.32,Default,,0000,0000,0000,,and my maximum likelihood estimate is my attempt to estimate the true value, Dialogue: 0,0:41:05.32,0:41:08.15,Default,,0000,0000,0000,,but, you know, by notational convention Dialogue: 0,0:41:08.15,0:41:11.78,Default,,0000,0000,0000,,we often just write it as that as well without bothering to distinguish between Dialogue: 0,0:41:11.78,0:41:15.47,Default,,0000,0000,0000,,the maximum likelihood value and the true underlying value that I'm assuming is out Dialogue: 0,0:41:15.47,0:41:16.34,Default,,0000,0000,0000,,there, and that I'm Dialogue: 0,0:41:16.34,0:41:23.34,Default,,0000,0000,0000,,only hoping to estimate. 
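To make the distinction above concrete, here is a minimal Python sketch (the true parameter 0.3 and the sample size are made-up values for illustration): for a Bernoulli random variable, the maximum likelihood estimate is just the fraction of positive labels in the training set, which approaches, but rarely exactly equals, the true underlying value.

```python
# Sketch: true underlying phi vs. its maximum likelihood estimate.
# The value 0.3 is hypothetical, chosen only for illustration.
import random

random.seed(0)
true_phi = 0.3
m = 10_000

# Generate m labels from a Bernoulli(true_phi) distribution.
ys = [1 if random.random() < true_phi else 0 for _ in range(m)]

# MLE for a Bernoulli parameter: the sample mean of the labels.
phi_hat = sum(ys) / m
print(true_phi, phi_hat)  # phi_hat is close to, but not exactly, 0.3
```

As the number of samples m grows, phi_hat converges to true_phi, which is the sense in which the maximum likelihood estimate "attempts to estimate the true value."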
Dialogue: 0,0:41:23.98,0:41:25.58,Default,,0000,0000,0000,,Actually, yeah, Dialogue: 0,0:41:25.58,0:41:29.37,Default,,0000,0000,0000,,so for this sort of question about maximum likelihood and so on, I'd point Dialogue: 0,0:41:29.37,0:41:34.06,Default,,0000,0000,0000,,to the Friday discussion section Dialogue: 0,0:41:34.06,0:41:36.28,Default,,0000,0000,0000,,as a good time to Dialogue: 0,0:41:36.28,0:41:37.89,Default,,0000,0000,0000,,ask questions about, sort of, Dialogue: 0,0:41:37.89,0:41:40.88,Default,,0000,0000,0000,,probabilistic definitions like these as well. Are there any Dialogue: 0,0:41:40.88,0:41:47.88,Default,,0000,0000,0000,,other questions? No, great. Okay. Dialogue: 0,0:41:52.90,0:41:59.68,Default,,0000,0000,0000,,So, Dialogue: 0,0:41:59.68,0:42:01.19,Default,,0000,0000,0000,,great. Oh, it Dialogue: 0,0:42:01.19,0:42:07.59,Default,,0000,0000,0000,,turns out, just to mention one more thing that's, kind of, cool. Dialogue: 0,0:42:07.59,0:42:09.30,Default,,0000,0000,0000,,I said that Dialogue: 0,0:42:09.30,0:42:13.51,Default,,0000,0000,0000,,if X given Y is Poisson, then you also get a logistic posterior, Dialogue: 0,0:42:13.51,0:42:17.21,Default,,0000,0000,0000,,it actually turns out there's a more general version of this. If you assume Dialogue: 0,0:42:17.21,0:42:18.93,Default,,0000,0000,0000,,X Dialogue: 0,0:42:18.93,0:42:23.90,Default,,0000,0000,0000,,given Y = 1 is exponential family Dialogue: 0,0:42:23.90,0:42:26.28,Default,,0000,0000,0000,,with natural parameter eta 1, and then you assume Dialogue: 0,0:42:26.28,0:42:33.28,Default,,0000,0000,0000,,X given Y = 0 is exponential family Dialogue: 0,0:42:33.86,0:42:38.89,Default,,0000,0000,0000,,with natural parameter eta 0, then Dialogue: 0,0:42:38.89,0:42:43.27,Default,,0000,0000,0000,,this implies that P of Y = 1 given X is also logistic, okay? And Dialogue: 0,0:42:43.27,0:42:44.68,Default,,0000,0000,0000,,that's, kind of, cool. 
Dialogue: 0,0:42:44.68,0:42:47.59,Default,,0000,0000,0000,,It means that X given Y could be - I don't Dialogue: 0,0:42:47.59,0:42:50.18,Default,,0000,0000,0000,,know, some strange thing. It could be gamma because Dialogue: 0,0:42:50.18,0:42:52.51,Default,,0000,0000,0000,,we've seen Gaussian already; Dialogue: 0,0:42:52.51,0:42:53.61,Default,,0000,0000,0000,,next to that - I Dialogue: 0,0:42:53.61,0:42:57.03,Default,,0000,0000,0000,,don't know, gamma, exponential. Dialogue: 0,0:42:57.03,0:42:59.10,Default,,0000,0000,0000,,There's actually a beta. I'm Dialogue: 0,0:42:59.10,0:43:02.09,Default,,0000,0000,0000,,just rattling off my mental list of exponential family distributions. It could be any one Dialogue: 0,0:43:02.09,0:43:03.69,Default,,0000,0000,0000,,of those things, Dialogue: 0,0:43:03.69,0:43:07.52,Default,,0000,0000,0000,,so if you assume the same exponential family distribution for the two classes Dialogue: 0,0:43:07.52,0:43:09.44,Default,,0000,0000,0000,,with different natural parameters, Dialogue: 0,0:43:09.44,0:43:12.24,Default,,0000,0000,0000,,then the Dialogue: 0,0:43:12.24,0:43:13.28,Default,,0000,0000,0000,,posterior Dialogue: 0,0:43:13.28,0:43:16.85,Default,,0000,0000,0000,,P of Y = 1 given X would be logistic, and so this shows Dialogue: 0,0:43:16.85,0:43:19.30,Default,,0000,0000,0000,,the robustness of logistic regression Dialogue: 0,0:43:19.30,0:43:22.48,Default,,0000,0000,0000,,to the choice of modeling assumptions because it could be that Dialogue: 0,0:43:22.48,0:43:24.99,Default,,0000,0000,0000,,the data was actually, you know, gamma distributed, Dialogue: 0,0:43:24.99,0:43:28.74,Default,,0000,0000,0000,,and it just still turns out to be logistic. So it's the Dialogue: 0,0:43:28.74,0:43:35.74,Default,,0000,0000,0000,,robustness of logistic regression to modeling Dialogue: 0,0:43:38.81,0:43:40.96,Default,,0000,0000,0000,,assumptions. Dialogue: 0,0:43:40.96,0:43:42.90,Default,,0000,0000,0000,,And this is the density. 
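The claim just stated can be checked directly with Bayes' rule. A sketch of the derivation, using the standard exponential-family notation (b, T, a, and the natural parameters eta 0, eta 1, with phi denoting the class prior P(y = 1); these symbols follow the usual convention rather than anything written on the board):

```latex
p(x \mid y) = b(x)\,\exp\!\big(\eta_y^{\top} T(x) - a(\eta_y)\big), \qquad \phi = p(y=1)
\;\Longrightarrow\;
p(y=1 \mid x)
  = \frac{p(x \mid y=1)\,\phi}{p(x \mid y=1)\,\phi + p(x \mid y=0)\,(1-\phi)}
  = \frac{1}{1 + \exp\!\big(-(\eta_1 - \eta_0)^{\top} T(x) + a(\eta_1) - a(\eta_0) - \log\tfrac{\phi}{1-\phi}\big)}
```

The b(x) factors cancel in the likelihood ratio, which is why the posterior comes out logistic in T(x) no matter which member of the exponential family the class-conditionals are.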
I think, Dialogue: 0,0:43:42.90,0:43:44.40,Default,,0000,0000,0000,,early on I promised Dialogue: 0,0:43:44.40,0:43:48.62,Default,,0000,0000,0000,,two justifications for where I pulled the logistic function out of the hat, right? So Dialogue: 0,0:43:48.62,0:43:53.30,Default,,0000,0000,0000,,one was the exponential family derivation we went through last time, and this is, sort of, the second one. Dialogue: 0,0:43:53.30,0:44:00.30,Default,,0000,0000,0000,,That all of these modeling assumptions also lead to the logistic function. Yeah? Student:[Off Dialogue: 0,0:44:05.11,0:44:09.97,Default,,0000,0000,0000,,mic]. Instructor (Andrew Ng):Oh, whether P of Y = 1 given X being logistic implies this? No. The converse is Dialogue: 0,0:44:09.97,0:44:11.06,Default,,0000,0000,0000,,not true, right? Dialogue: 0,0:44:11.06,0:44:12.50,Default,,0000,0000,0000,,Yeah, so this exponential Dialogue: 0,0:44:12.50,0:44:14.20,Default,,0000,0000,0000,,family distribution Dialogue: 0,0:44:14.20,0:44:18.27,Default,,0000,0000,0000,,implies P of Y = 1 given X is logistic, but the reverse implication is not true. Dialogue: 0,0:44:18.27,0:44:22.60,Default,,0000,0000,0000,,There are actually all sorts of really bizarre distributions Dialogue: 0,0:44:22.60,0:44:29.60,Default,,0000,0000,0000,,for X that would give rise to a logistic function, okay? Okay. So Dialogue: 0,0:44:29.89,0:44:34.72,Default,,0000,0000,0000,,let's talk about - that was our first generative learning algorithm. Now I'll talk about the second Dialogue: 0,0:44:34.72,0:44:37.61,Default,,0000,0000,0000,,generative learning algorithm, Dialogue: 0,0:44:37.61,0:44:44.61,Default,,0000,0000,0000,,which is called the Naive Bayes algorithm, Dialogue: 0,0:44:44.85,0:44:49.69,Default,,0000,0000,0000,,and the motivating example that I'm gonna use will be spam classification. All right. 
So let's Dialogue: 0,0:44:49.69,0:44:54.11,Default,,0000,0000,0000,,say that you want to build a spam classifier to take your incoming stream of email and decide if Dialogue: 0,0:44:54.11,0:44:56.53,Default,,0000,0000,0000,,it's spam or Dialogue: 0,0:44:56.53,0:44:57.49,Default,,0000,0000,0000,,not. Dialogue: 0,0:44:57.49,0:45:02.30,Default,,0000,0000,0000,,So let's Dialogue: 0,0:45:02.30,0:45:04.76,Default,,0000,0000,0000,,see. Y will be 0 Dialogue: 0,0:45:04.76,0:45:06.85,Default,,0000,0000,0000,,or Dialogue: 0,0:45:06.85,0:45:11.68,Default,,0000,0000,0000,,1, with 1 being spam email Dialogue: 0,0:45:11.68,0:45:12.59,Default,,0000,0000,0000,,and 0 being non-spam, and Dialogue: 0,0:45:12.59,0:45:16.65,Default,,0000,0000,0000,,the first decision we need to make is, given a piece of email, Dialogue: 0,0:45:16.65,0:45:20.79,Default,,0000,0000,0000,,how do you represent a piece of email using a feature vector X, Dialogue: 0,0:45:20.79,0:45:24.45,Default,,0000,0000,0000,,right? So email is just a piece of text, right? Email Dialogue: 0,0:45:24.45,0:45:27.89,Default,,0000,0000,0000,,is like a list of words or a list of ASCII characters. Dialogue: 0,0:45:27.89,0:45:30.84,Default,,0000,0000,0000,,So I can represent email as a feature vector X. Dialogue: 0,0:45:30.84,0:45:32.37,Default,,0000,0000,0000,,So we'll use a couple of different Dialogue: 0,0:45:32.37,0:45:34.10,Default,,0000,0000,0000,,representations, Dialogue: 0,0:45:34.10,0:45:36.58,Default,,0000,0000,0000,,but the one I'll use today is Dialogue: 0,0:45:36.58,0:45:38.05,Default,,0000,0000,0000,,we will Dialogue: 0,0:45:38.05,0:45:42.18,Default,,0000,0000,0000,,construct the vector X as follows. I'm gonna go through my dictionary, and, sort of, make a listing of Dialogue: 0,0:45:42.18,0:45:44.40,Default,,0000,0000,0000,,all the words in my dictionary, okay? So Dialogue: 0,0:45:44.40,0:45:46.47,Default,,0000,0000,0000,,the first word is Dialogue: 0,0:45:46.47,0:45:52.29,Default,,0000,0000,0000,,a. 
The second word in my dictionary is aardvark, then aardwolf, Dialogue: 0,0:45:52.29,0:45:55.57,Default,,0000,0000,0000,,okay? Dialogue: 0,0:45:55.57,0:45:59.42,Default,,0000,0000,0000,,You know, and somewhere along the way you see the word buy in the spam email telling you to buy Dialogue: 0,0:45:59.42,0:46:01.38,Default,,0000,0000,0000,,stuff. Dialogue: 0,0:46:01.38,0:46:04.14,Default,,0000,0000,0000,,Depending on how you collect your list of words, Dialogue: 0,0:46:04.14,0:46:08.58,Default,,0000,0000,0000,,you know, you won't find CS229, right, the course number, in a dictionary, but Dialogue: 0,0:46:08.58,0:46:09.56,Default,,0000,0000,0000,,if you Dialogue: 0,0:46:09.56,0:46:14.25,Default,,0000,0000,0000,,collect a list of words via other emails you've gotten, you have this word somewhere Dialogue: 0,0:46:14.25,0:46:17.98,Default,,0000,0000,0000,,as well, and then the last word in my dictionary was Dialogue: 0,0:46:17.98,0:46:21.97,Default,,0000,0000,0000,,zygmurgy, which Dialogue: 0,0:46:21.97,0:46:24.74,Default,,0000,0000,0000,,pertains to the technological chemistry that deals with Dialogue: 0,0:46:24.74,0:46:27.90,Default,,0000,0000,0000,,the fermentation process in Dialogue: 0,0:46:27.90,0:46:30.83,Default,,0000,0000,0000,,brewing. Dialogue: 0,0:46:30.83,0:46:35.18,Default,,0000,0000,0000,,So say I get a piece of email, and what I'll do is I'll then Dialogue: 0,0:46:35.18,0:46:37.86,Default,,0000,0000,0000,,scan through this list of words, and wherever Dialogue: 0,0:46:37.86,0:46:41.25,Default,,0000,0000,0000,,a certain word appears in my email, I'll put a 1 there. So if a particular Dialogue: 0,0:46:41.25,0:46:44.32,Default,,0000,0000,0000,,email has the word a then that's a 1. Dialogue: 0,0:46:44.32,0:46:46.56,Default,,0000,0000,0000,,You know, my email doesn't have the words aardvark Dialogue: 0,0:46:46.56,0:46:49.21,Default,,0000,0000,0000,,or aardwolf, so it gets zeros. 
And again, Dialogue: 0,0:46:49.21,0:46:52.94,Default,,0000,0000,0000,,a piece of email - they want me to buy something, so buy appears; CS229 doesn't occur, and so on, okay? Dialogue: 0,0:46:52.94,0:46:56.91,Default,,0000,0000,0000,,So Dialogue: 0,0:46:56.91,0:46:59.29,Default,,0000,0000,0000,,this would be Dialogue: 0,0:46:59.29,0:47:01.94,Default,,0000,0000,0000,,one way of creating a feature vector Dialogue: 0,0:47:01.94,0:47:03.62,Default,,0000,0000,0000,,to represent a Dialogue: 0,0:47:03.62,0:47:05.74,Default,,0000,0000,0000,,piece of email. Dialogue: 0,0:47:05.74,0:47:09.90,Default,,0000,0000,0000,,Now, let's write Dialogue: 0,0:47:09.90,0:47:16.90,Default,,0000,0000,0000,,the generative model out for this. Actually, Dialogue: 0,0:47:18.06,0:47:19.29,Default,,0000,0000,0000,,let's use Dialogue: 0,0:47:19.29,0:47:20.85,Default,,0000,0000,0000,,this. Dialogue: 0,0:47:20.85,0:47:24.40,Default,,0000,0000,0000,,In other words, I want to model P of X given Y, Dialogue: 0,0:47:24.40,0:47:27.96,Default,,0000,0000,0000,,for Y = 0 or Y = 1, all right? Dialogue: 0,0:47:27.96,0:47:31.39,Default,,0000,0000,0000,,And my feature vectors are going to be in {0, 1} Dialogue: 0,0:47:31.39,0:47:35.71,Default,,0000,0000,0000,,to the N. They're going to be these bit vectors, binary valued vectors. They're N Dialogue: 0,0:47:35.71,0:47:36.40,Default,,0000,0000,0000,,dimensional. Dialogue: 0,0:47:36.40,0:47:38.13,Default,,0000,0000,0000,,Where N Dialogue: 0,0:47:38.13,0:47:38.83,Default,,0000,0000,0000,,may Dialogue: 0,0:47:38.83,0:47:41.95,Default,,0000,0000,0000,,be on the order of, say, 50,000, if you have 50,000 Dialogue: 0,0:47:41.95,0:47:44.09,Default,,0000,0000,0000,,words in your dictionary, Dialogue: 0,0:47:44.09,0:47:46.11,Default,,0000,0000,0000,,which is not atypical. 
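The feature-vector construction just described can be sketched in a few lines of Python. The six-word dictionary here is a hypothetical stand-in for the 50,000-word one in the lecture:

```python
# Minimal sketch of the binary bag-of-words representation: x[i] = 1 iff
# dictionary word i appears in the email. The tiny vocabulary is illustrative.
vocab = ["a", "aardvark", "aardwolf", "buy", "cs229", "zygmurgy"]

def email_to_feature_vector(email_text):
    """Return x in {0, 1}^n for the given email text."""
    words = set(email_text.lower().split())
    return [1 if w in words else 0 for w in vocab]

x = email_to_feature_vector("Buy a watch now")
print(x)  # [1, 0, 0, 1, 0, 0]: "a" and "buy" appear, the other words don't
```

Note the vector only records presence or absence of each word, not counts or positions, which matches the representation used for the rest of this lecture.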
So Dialogue: 0,0:47:46.11,0:47:47.46,Default,,0000,0000,0000,,values from - I don't Dialogue: 0,0:47:47.46,0:47:48.45,Default,,0000,0000,0000,,know, Dialogue: 0,0:47:48.45,0:47:52.22,Default,,0000,0000,0000,,mid-thousands to tens of thousands is very typical Dialogue: 0,0:47:52.22,0:47:53.94,Default,,0000,0000,0000,,for problems like Dialogue: 0,0:47:53.94,0:47:56.62,Default,,0000,0000,0000,,these. And, therefore, Dialogue: 0,0:47:56.62,0:48:00.34,Default,,0000,0000,0000,,there are two to the 50,000 possible values for X, right? So two to Dialogue: 0,0:48:00.34,0:48:01.50,Default,,0000,0000,0000,,the 50,000 Dialogue: 0,0:48:01.50,0:48:02.73,Default,,0000,0000,0000,,possible bit vectors Dialogue: 0,0:48:02.73,0:48:03.82,Default,,0000,0000,0000,,of length Dialogue: 0,0:48:03.82,0:48:05.20,Default,,0000,0000,0000,, Dialogue: 0,0:48:05.20,0:48:07.24,Default,,0000,0000,0000,,50,000, and so Dialogue: 0,0:48:07.24,0:48:08.83,Default,,0000,0000,0000,,one way to model this is Dialogue: 0,0:48:08.83,0:48:10.75,Default,,0000,0000,0000,,the multinomial distribution, Dialogue: 0,0:48:10.75,0:48:13.90,Default,,0000,0000,0000,,but because there are two to the 50,000 possible values for X, Dialogue: 0,0:48:13.90,0:48:20.34,Default,,0000,0000,0000,,I would need two to the 50,000 minus 1 parameters, Dialogue: 0,0:48:20.34,0:48:21.06,Default,,0000,0000,0000,,right? Because the probabilities have Dialogue: 0,0:48:21.06,0:48:23.70,Default,,0000,0000,0000,,to sum to 1, right? So it's Dialogue: 0,0:48:23.70,0:48:27.09,Default,,0000,0000,0000,,minus 1. And this is clearly way too many parameters Dialogue: 0,0:48:27.09,0:48:28.48,Default,,0000,0000,0000,,to model Dialogue: 0,0:48:28.48,0:48:30.27,Default,,0000,0000,0000,,using the multinomial distribution Dialogue: 0,0:48:30.27,0:48:34.44,Default,,0000,0000,0000,,over all two to the 50,000 possibilities. 
Dialogue: 0,0:48:34.44,0:48:35.84,Default,,0000,0000,0000,,So Dialogue: 0,0:48:35.84,0:48:42.29,Default,,0000,0000,0000,,in the Naive Bayes algorithm, we're Dialogue: 0,0:48:42.29,0:48:46.58,Default,,0000,0000,0000,,going to make a very strong assumption on P of X given Y, Dialogue: 0,0:48:46.58,0:48:49.37,Default,,0000,0000,0000,,and, in particular, I'm going to assume - let Dialogue: 0,0:48:49.37,0:48:53.61,Default,,0000,0000,0000,,me just say what it's called; then I'll write out what it means. I'm going to assume that the Dialogue: 0,0:48:53.61,0:48:54.63,Default,,0000,0000,0000,,XI's Dialogue: 0,0:48:54.63,0:49:01.63,Default,,0000,0000,0000,,are conditionally independent Dialogue: 0,0:49:06.49,0:49:07.95,Default,,0000,0000,0000,,given Y, okay? Dialogue: 0,0:49:07.95,0:49:11.21,Default,,0000,0000,0000,,Let me say what this means. Dialogue: 0,0:49:11.21,0:49:18.21,Default,,0000,0000,0000,,So I have that P of X1, X2, up to X50,000, Dialogue: 0,0:49:18.73,0:49:20.56,Default,,0000,0000,0000,,right, given the Dialogue: 0,0:49:20.56,0:49:25.91,Default,,0000,0000,0000,,Y. By the chain rule of probability, this is P of X1 given Y Dialogue: 0,0:49:25.91,0:49:27.76,Default,,0000,0000,0000,,times P of X2 Dialogue: 0,0:49:27.76,0:49:30.37,Default,,0000,0000,0000,,given Dialogue: 0,0:49:30.37,0:49:32.47,Default,,0000,0000,0000,, Dialogue: 0,0:49:32.47,0:49:33.99,Default,,0000,0000,0000,,Y, Dialogue: 0,0:49:33.99,0:49:39.41,Default,,0000,0000,0000,,X1 Dialogue: 0,0:49:39.41,0:49:42.92,Default,,0000,0000,0000,,times P of - I'll just put dot, dot, dot up to, you know, well - Dialogue: 0,0:49:42.92,0:49:46.37,Default,,0000,0000,0000,,whatever. You get the idea, up to P of X50,000, okay? Dialogue: 0,0:49:46.37,0:49:49.53,Default,,0000,0000,0000,,So this is the chain rule of probability. This always holds. 
I've not Dialogue: 0,0:49:49.53,0:49:52.85,Default,,0000,0000,0000,,made any assumption yet, and now, we're Dialogue: 0,0:49:52.85,0:49:54.53,Default,,0000,0000,0000,,gonna Dialogue: 0,0:49:54.53,0:49:58.09,Default,,0000,0000,0000,,make what's called the Naive Bayes assumption, or this assumption that the Dialogue: 0,0:49:58.09,0:50:01.87,Default,,0000,0000,0000,,XI's are conditionally independent given Y. I'm going Dialogue: 0,0:50:01.87,0:50:03.71,Default,,0000,0000,0000,,to assume that - Dialogue: 0,0:50:03.71,0:50:06.62,Default,,0000,0000,0000,,well, nothing changes for the first term, Dialogue: 0,0:50:06.62,0:50:11.72,Default,,0000,0000,0000,,but I'm gonna assume that P of X2 given Y, X1 is equal to P of X2 given the Y. I'm gonna assume that that term's equal to P of X3 Dialogue: 0,0:50:11.72,0:50:13.36,Default,,0000,0000,0000,,given Dialogue: 0,0:50:13.36,0:50:16.98,Default,,0000,0000,0000,,the Dialogue: 0,0:50:16.98,0:50:17.76,Default,,0000,0000,0000,,Y, Dialogue: 0,0:50:17.76,0:50:20.10,Default,,0000,0000,0000,,and so on, up Dialogue: 0,0:50:20.10,0:50:24.49,Default,,0000,0000,0000,,to P of X50,000 given Y, okay? Dialogue: 0,0:50:24.49,0:50:28.85,Default,,0000,0000,0000,,Or just written more compactly, Dialogue: 0,0:50:28.85,0:50:32.12,Default,,0000,0000,0000,, Dialogue: 0,0:50:32.12,0:50:35.99,Default,,0000,0000,0000,,this means we assume that P of X1 up to X50,000 given Y is Dialogue: 0,0:50:35.99,0:50:41.15,Default,,0000,0000,0000,,the product from I = 1 to 50,000 of P of XI Dialogue: 0,0:50:41.15,0:50:44.17,Default,,0000,0000,0000,,given the Y, Dialogue: 0,0:50:44.17,0:50:45.30,Default,,0000,0000,0000,,okay? 
Dialogue: 0,0:50:45.30,0:50:49.13,Default,,0000,0000,0000,,And stating informally what this means is that I'm, sort of, assuming that - Dialogue: 0,0:50:49.13,0:50:53.01,Default,,0000,0000,0000,,so once you know the class label Y, so long as you know whether this is spam or not Dialogue: 0,0:50:53.01,0:50:54.96,Default,,0000,0000,0000,,spam, Dialogue: 0,0:50:54.96,0:50:58.58,Default,,0000,0000,0000,,then knowing whether the word a appears in the email Dialogue: 0,0:50:58.58,0:50:59.81,Default,,0000,0000,0000,,does not affect Dialogue: 0,0:50:59.81,0:51:01.36,Default,,0000,0000,0000,,the probability Dialogue: 0,0:51:01.36,0:51:02.85,Default,,0000,0000,0000,,of whether the word Dialogue: 0,0:51:02.85,0:51:06.83,Default,,0000,0000,0000,,aardvark appears in the email, all right? Dialogue: 0,0:51:06.83,0:51:10.70,Default,,0000,0000,0000,,In other words, this is assuming - once you know whether an email is spam Dialogue: 0,0:51:10.70,0:51:12.62,Default,,0000,0000,0000,,or not spam, Dialogue: 0,0:51:12.62,0:51:15.52,Default,,0000,0000,0000,,then knowing whether other words appear in the email won't help Dialogue: 0,0:51:15.52,0:51:19.05,Default,,0000,0000,0000,,you predict whether any other word appears in the email, okay? Dialogue: 0,0:51:19.05,0:51:20.29,Default,,0000,0000,0000,,And, Dialogue: 0,0:51:20.29,0:51:23.49,Default,,0000,0000,0000,,obviously, this assumption is false, right? This Dialogue: 0,0:51:23.49,0:51:25.98,Default,,0000,0000,0000,,assumption can't possibly be Dialogue: 0,0:51:25.98,0:51:28.19,Default,,0000,0000,0000,,true. I mean, if you see the word Dialogue: 0,0:51:28.19,0:51:31.41,Default,,0000,0000,0000,,- I don't know, CS229 in an email, you're much more likely to see my name in the email, or Dialogue: 0,0:51:31.41,0:51:37.05,Default,,0000,0000,0000,,the TA's names, or whatever. 
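The conditional-independence assumption can be sketched directly in code: under Naive Bayes, P(x | y) collapses to a product of one Bernoulli term per word. The four per-word probabilities below are hypothetical numbers chosen purely for illustration:

```python
# Sketch of the Naive Bayes factorization P(x | y) = product_i P(x_i | y)
# for a hypothetical 4-word dictionary. phi_given_spam[i] = p(x_i = 1 | y = 1).
phi_given_spam = [0.6, 0.05, 0.05, 0.7]  # illustrative values, not real data

def p_x_given_y(x, phi):
    """Joint conditional under the Naive Bayes assumption."""
    p = 1.0
    for xi, phi_i in zip(x, phi):
        # Each coordinate contributes phi_i if the word appears, else 1 - phi_i.
        p *= phi_i if xi == 1 else (1.0 - phi_i)
    return p

print(p_x_given_y([1, 0, 0, 1], phi_given_spam))  # 0.6 * 0.95 * 0.95 * 0.7
```

The point of the assumption is visible in the parameter count: instead of the 2^n - 1 parameters of a full multinomial over bit vectors, the model needs only n numbers per class.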
So this assumption is normally just false Dialogue: 0,0:51:37.05,0:51:37.47,Default,,0000,0000,0000,, Dialogue: 0,0:51:37.47,0:51:39.35,Default,,0000,0000,0000,,under English, right, Dialogue: 0,0:51:39.35,0:51:41.24,Default,,0000,0000,0000,,for normal written English, Dialogue: 0,0:51:41.24,0:51:42.52,Default,,0000,0000,0000,,but it Dialogue: 0,0:51:42.52,0:51:44.27,Default,,0000,0000,0000,,turns out that despite Dialogue: 0,0:51:44.27,0:51:46.60,Default,,0000,0000,0000,,this assumption being, sort of, Dialogue: 0,0:51:46.60,0:51:48.18,Default,,0000,0000,0000,,false in the literal sense, Dialogue: 0,0:51:48.18,0:51:50.83,Default,,0000,0000,0000,,the Naive Bayes algorithm is, sort of, Dialogue: 0,0:51:50.83,0:51:53.47,Default,,0000,0000,0000,,an extremely effective Dialogue: 0,0:51:53.47,0:51:57.87,Default,,0000,0000,0000,,algorithm for classifying text documents into spam or not spam, for Dialogue: 0,0:51:57.87,0:52:01.52,Default,,0000,0000,0000,,classifying your emails into different folders automatically, for Dialogue: 0,0:52:01.52,0:52:03.59,Default,,0000,0000,0000,,looking at web pages and classifying Dialogue: 0,0:52:03.59,0:52:06.29,Default,,0000,0000,0000,,whether this webpage is trying to sell something or whatever. It Dialogue: 0,0:52:06.29,0:52:07.96,Default,,0000,0000,0000,,turns out, this assumption Dialogue: 0,0:52:07.96,0:52:11.82,Default,,0000,0000,0000,,works very well for classifying text documents and for other applications too that I'll Dialogue: 0,0:52:11.82,0:52:16.42,Default,,0000,0000,0000,,talk a bit about later. Dialogue: 0,0:52:16.42,0:52:20.69,Default,,0000,0000,0000,,Here's a digression that'll make sense only to some of you. 
Dialogue: 0,0:52:20.69,0:52:23.02,Default,,0000,0000,0000,,Let me just say that Dialogue: 0,0:52:23.02,0:52:27.96,Default,,0000,0000,0000,,if you're familiar with Bayesian networks, say Dialogue: 0,0:52:27.96,0:52:29.84,Default,,0000,0000,0000,, Dialogue: 0,0:52:29.84,0:52:33.100,Default,,0000,0000,0000,,graphical models, the Bayesian network associated with this model looks like this, and you're assuming Dialogue: 0,0:52:33.100,0:52:34.51,Default,,0000,0000,0000,,that Dialogue: 0,0:52:34.51,0:52:36.77,Default,,0000,0000,0000,,there's a random variable Y Dialogue: 0,0:52:36.77,0:52:38.78,Default,,0000,0000,0000,,that then generates X1, X2, through Dialogue: 0,0:52:38.78,0:52:41.09,Default,,0000,0000,0000,,X50,000, okay? If you've not seen a Dialogue: 0,0:52:41.09,0:52:43.07,Default,,0000,0000,0000,,Bayes Net before, if Dialogue: 0,0:52:43.07,0:52:47.55,Default,,0000,0000,0000,,you don't know graphical models, just ignore this. It's not important to our purposes, but Dialogue: 0,0:52:47.55,0:52:54.55,Default,,0000,0000,0000,,if you've seen it before, that's what it will look like. Okay. Dialogue: 0,0:52:58.02,0:53:03.35,Default,,0000,0000,0000,,So Dialogue: 0,0:53:03.35,0:53:10.35,Default,,0000,0000,0000,, Dialogue: 0,0:53:11.66,0:53:16.24,Default,,0000,0000,0000,,the parameters of the model Dialogue: 0,0:53:16.24,0:53:18.07,Default,,0000,0000,0000,,are as follows Dialogue: 0,0:53:18.07,0:53:21.66,Default,,0000,0000,0000,,with phi I given Y = 1, which Dialogue: 0,0:53:21.66,0:53:23.19,Default,,0000,0000,0000,, Dialogue: 0,0:53:23.19,0:53:28.02,Default,,0000,0000,0000,,is the probability that XI = 1 Dialogue: 0,0:53:28.02,0:53:28.84,Default,,0000,0000,0000,,given Y = 1, Dialogue: 0,0:53:28.84,0:53:35.53,Default,,0000,0000,0000,,phi I Dialogue: 0,0:53:35.53,0:53:39.96,Default,,0000,0000,0000,,given Y = 0, and phi Y, okay? 
Dialogue: 0,0:53:39.96,0:53:43.46,Default,,0000,0000,0000,,So these are the parameters of the model, Dialogue: 0,0:53:43.46,0:53:45.42,Default,,0000,0000,0000,, Dialogue: 0,0:53:45.42,0:53:47.25,Default,,0000,0000,0000,,and, therefore, Dialogue: 0,0:53:47.25,0:53:50.14,Default,,0000,0000,0000,,to fit the parameters of the model, you Dialogue: 0,0:53:50.14,0:53:57.14,Default,,0000,0000,0000,,can write down the joint likelihood, right, Dialogue: 0,0:54:01.85,0:54:07.33,Default,,0000,0000,0000,,is Dialogue: 0,0:54:07.33,0:54:14.33,Default,,0000,0000,0000,,equal to, as usual, okay? Dialogue: 0,0:54:18.67,0:54:20.78,Default,,0000,0000,0000,,So given the training set, Dialogue: 0,0:54:20.78,0:54:27.78,Default,,0000,0000,0000,,you can write down the joint Dialogue: 0,0:54:28.98,0:54:32.58,Default,,0000,0000,0000,,likelihood of the parameters, and Dialogue: 0,0:54:32.58,0:54:39.58,Default,,0000,0000,0000,,then when Dialogue: 0,0:54:43.73,0:54:44.64,Default,,0000,0000,0000,,you Dialogue: 0,0:54:44.64,0:54:46.87,Default,,0000,0000,0000,,do maximum likelihood estimation, Dialogue: 0,0:54:46.87,0:54:50.86,Default,,0000,0000,0000,,you find that the maximum likelihood estimates of the parameters are Dialogue: 0,0:54:50.86,0:54:53.24,Default,,0000,0000,0000,,- they're really, pretty much, what you'd expect. Dialogue: 0,0:54:53.24,0:54:55.63,Default,,0000,0000,0000,,Maximum likelihood estimate for phi J given Y = 1 is Dialogue: 0,0:54:55.63,0:54:57.88,Default,,0000,0000,0000,,sum from I = 1 to Dialogue: 0,0:54:57.88,0:55:01.76,Default,,0000,0000,0000,,M, Dialogue: 0,0:55:01.76,0:55:03.46,Default,,0000,0000,0000,,indicator Dialogue: 0,0:55:03.46,0:55:07.13,Default,,0000,0000,0000,,XIJ = Dialogue: 0,0:55:07.13,0:55:14.13,Default,,0000,0000,0000,,1, YI = 1, divided by sum from I = 1 to M, indicator YI = 1, okay? 
Dialogue: 0,0:55:14.70,0:55:18.48,Default,,0000,0000,0000,, Dialogue: 0,0:55:18.48,0:55:19.36,Default,,0000,0000,0000,, Dialogue: 0,0:55:19.36,0:55:20.72,Default,,0000,0000,0000,,And this is just, Dialogue: 0,0:55:20.72,0:55:22.83,Default,,0000,0000,0000,,I guess, stated more simply: Dialogue: 0,0:55:22.83,0:55:24.97,Default,,0000,0000,0000,,the numerator just says, run through Dialogue: 0,0:55:24.97,0:55:27.98,Default,,0000,0000,0000,,your entire training set, your M training examples, Dialogue: 0,0:55:27.98,0:55:30.92,Default,,0000,0000,0000,,and count up the number of times you saw word J Dialogue: 0,0:55:30.92,0:55:32.55,Default,,0000,0000,0000,,in a piece of email Dialogue: 0,0:55:32.55,0:55:34.65,Default,,0000,0000,0000,,for which the label Y was equal to 1. So, in other words, look Dialogue: 0,0:55:34.65,0:55:36.80,Default,,0000,0000,0000,,through all your spam emails Dialogue: 0,0:55:36.80,0:55:39.32,Default,,0000,0000,0000,,and count the number of emails in which the word Dialogue: 0,0:55:39.32,0:55:40.56,Default,,0000,0000,0000,,J appeared out of Dialogue: 0,0:55:40.56,0:55:42.42,Default,,0000,0000,0000,,all your spam emails, Dialogue: 0,0:55:42.42,0:55:44.71,Default,,0000,0000,0000,,and the denominator is, you know, Dialogue: 0,0:55:44.71,0:55:45.99,Default,,0000,0000,0000,,sum from I = 1 to M, Dialogue: 0,0:55:45.99,0:55:47.95,Default,,0000,0000,0000,,indicator YI = 1. The Dialogue: 0,0:55:47.95,0:55:51.44,Default,,0000,0000,0000,,denominator is just the number of spam emails you got. 
Dialogue: 0,0:55:51.44,0:55:54.33,Default,,0000,0000,0000,,And so this ratio is: Dialogue: 0,0:55:54.33,0:55:57.77,Default,,0000,0000,0000,,of all your spam emails in your training set, Dialogue: 0,0:55:57.77,0:56:00.40,Default,,0000,0000,0000,,what fraction of these emails Dialogue: 0,0:56:00.40,0:56:03.04,Default,,0000,0000,0000,,did the word J Dialogue: 0,0:56:03.04,0:56:03.66,Default,,0000,0000,0000,,appear in - Dialogue: 0,0:56:03.66,0:56:06.64,Default,,0000,0000,0000,,did word J in your dictionary appear in? Dialogue: 0,0:56:06.64,0:56:10.24,Default,,0000,0000,0000,,And that's the maximum likelihood estimate Dialogue: 0,0:56:10.24,0:56:13.82,Default,,0000,0000,0000,,for the probability of seeing the word J conditioned on the piece of email being spam, okay? And similarly, the Dialogue: 0,0:56:13.82,0:56:17.50,Default,,0000,0000,0000,, Dialogue: 0,0:56:17.50,0:56:20.50,Default,,0000,0000,0000,, Dialogue: 0,0:56:20.50,0:56:23.16,Default,,0000,0000,0000,,maximum likelihood estimate for phi Dialogue: 0,0:56:23.16,0:56:25.00,Default,,0000,0000,0000,,Y Dialogue: 0,0:56:25.00,0:56:28.88,Default,,0000,0000,0000,,is pretty much what you'd expect, right? Dialogue: 0,0:56:28.88,0:56:35.88,Default,,0000,0000,0000,,Okay? Dialogue: 0,0:56:41.41,0:56:43.39,Default,,0000,0000,0000,,And so Dialogue: 0,0:56:43.39,0:56:47.67,Default,,0000,0000,0000,,having estimated all these parameters, Dialogue: 0,0:56:47.67,0:56:50.96,Default,,0000,0000,0000,,when you're given a new piece of email that you want to classify, Dialogue: 0,0:56:50.96,0:56:53.43,Default,,0000,0000,0000,,you can then compute P of Y given X Dialogue: 0,0:56:53.43,0:56:56.08,Default,,0000,0000,0000,,using Bayes rule, right?
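As a quick illustration of the counting described above (this code and its names are my own sketch, not part of the lecture), the maximum likelihood estimates are just frequencies: the fraction of spam emails containing each word, the same fraction over non-spam emails, and the overall fraction of spam.

```python
import numpy as np

def fit_naive_bayes_mle(X, y):
    """Maximum likelihood estimates for Bernoulli Naive Bayes.

    X: (m, n) binary matrix; X[i, j] = 1 if word j appears in email i.
    y: (m,) labels; 1 = spam, 0 = not spam.
    Returns P(y = 1), P(x_j = 1 | y = 1), and P(x_j = 1 | y = 0).
    """
    phi_y = y.mean()                          # fraction of emails that are spam
    phi_j_given_1 = X[y == 1].mean(axis=0)    # word frequencies among spam
    phi_j_given_0 = X[y == 0].mean(axis=0)    # word frequencies among non-spam
    return phi_y, phi_j_given_1, phi_j_given_0
```

Each entry of `phi_j_given_1` is exactly the ratio in the lecture: count of spam emails containing word j, divided by the count of spam emails.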
Dialogue: 0,0:56:56.08,0:56:57.58,Default,,0000,0000,0000,,Same as before, because Dialogue: 0,0:56:57.58,0:57:04.23,Default,,0000,0000,0000,,together these parameters give you a model for P of X given Y and for P of Y, Dialogue: 0,0:57:04.23,0:57:07.85,Default,,0000,0000,0000,,and by using Bayes rule, given these two terms, you can compute Dialogue: 0,0:57:07.85,0:57:10.71,Default,,0000,0000,0000,,P of Y given X, and Dialogue: 0,0:57:10.71,0:57:15.65,Default,,0000,0000,0000,,there's your spam classifier, okay? Dialogue: 0,0:57:15.65,0:57:19.00,Default,,0000,0000,0000,,Turns out we need one more elaboration to this idea, but let me check if there are Dialogue: 0,0:57:19.00,0:57:25.11,Default,,0000,0000,0000,,questions about this so far. Dialogue: 0,0:57:25.11,0:57:29.27,Default,,0000,0000,0000,,Student:So does this model depend Dialogue: 0,0:57:29.27,0:57:29.66,Default,,0000,0000,0000,,on Dialogue: 0,0:57:29.66,0:57:34.38,Default,,0000,0000,0000,,the number of inputs? Instructor (Andrew Ng):What do Dialogue: 0,0:57:34.38,0:57:35.15,Default,,0000,0000,0000,,you Dialogue: 0,0:57:35.15,0:57:38.17,Default,,0000,0000,0000,,mean, number of inputs, the number of features? Student:No, number of samples. Instructor (Andrew Ng):Well, M is the number of training examples, so Dialogue: 0,0:57:38.17,0:57:45.17,Default,,0000,0000,0000,,given M training examples, this is the formula for the maximum likelihood estimate of the parameters, right? So, other questions - does it make Dialogue: 0,0:57:48.96,0:57:52.63,Default,,0000,0000,0000,,sense? M is the number of training examples, so when you have M training examples, you plug them Dialogue: 0,0:57:52.63,0:57:54.00,Default,,0000,0000,0000,,into this formula, Dialogue: 0,0:57:54.00,0:58:01.00,Default,,0000,0000,0000,,and that's how you compute the maximum likelihood estimates. Student:By training examples, you mean M is the Dialogue: 0,0:58:01.29,0:58:04.09,Default,,0000,0000,0000,,number of emails?
Instructor (Andrew Ng):Yeah, right. So, right. Dialogue: 0,0:58:04.09,0:58:07.13,Default,,0000,0000,0000,,So it's, kind of, your training set. I would go through all the email I've gotten Dialogue: 0,0:58:07.13,0:58:08.87,Default,,0000,0000,0000,,in the last two months Dialogue: 0,0:58:08.87,0:58:11.13,Default,,0000,0000,0000,,and label them as spam or not spam, Dialogue: 0,0:58:11.13,0:58:14.45,Default,,0000,0000,0000,,and so you have - I don't Dialogue: 0,0:58:14.45,0:58:16.81,Default,,0000,0000,0000,,know, like, a few hundred emails Dialogue: 0,0:58:16.81,0:58:18.63,Default,,0000,0000,0000,,labeled as spam or not spam, Dialogue: 0,0:58:18.63,0:58:25.63,Default,,0000,0000,0000,,and that will comprise your training set X1, Y1 through XM, Dialogue: 0,0:58:28.05,0:58:28.62,Default,,0000,0000,0000,,YM, Dialogue: 0,0:58:28.62,0:58:32.82,Default,,0000,0000,0000,,where X is one of those vectors representing which words appeared in the email and Y Dialogue: 0,0:58:32.82,0:58:39.82,Default,,0000,0000,0000,,is 0 or 1 depending on whether each email is spam or not spam, okay? Student:So you are saying that this model depends on the number Dialogue: 0,0:58:41.31,0:58:42.02,Default,,0000,0000,0000,, Dialogue: 0,0:58:42.02,0:58:49.02,Default,,0000,0000,0000,,of examples, but the last model doesn't depend on the number of examples, but your phi is the Dialogue: 0,0:58:49.83,0:58:53.63,Default,,0000,0000,0000,,same for either one. Instructor (Andrew Ng):They're Dialogue: 0,0:58:53.63,0:58:58.02,Default,,0000,0000,0000,,different things, right? There's the model, which is Dialogue: 0,0:58:58.02,0:59:00.66,Default,,0000,0000,0000,,the set of modeling assumptions I'm making. Dialogue: 0,0:59:00.66,0:59:04.07,Default,,0000,0000,0000,,I'm assuming that - I'm making the Naive Bayes assumption.
Dialogue: 0,0:59:04.07,0:59:08.90,Default,,0000,0000,0000,,So the probabilistic model is an assumption on the joint distribution Dialogue: 0,0:59:08.90,0:59:10.32,Default,,0000,0000,0000,,of X and Y. Dialogue: 0,0:59:10.32,0:59:12.39,Default,,0000,0000,0000,,That's what the model is, Dialogue: 0,0:59:12.39,0:59:16.65,Default,,0000,0000,0000,,and then I'm given a fixed number of training examples. I'm given M training examples, and Dialogue: 0,0:59:16.65,0:59:20.31,Default,,0000,0000,0000,,then it's, like, after I'm given the training set, I'll then go ahead and write down the maximum Dialogue: 0,0:59:20.31,0:59:22.30,Default,,0000,0000,0000,,likelihood estimate of the parameters, right? So that's, sort Dialogue: 0,0:59:22.30,0:59:24.53,Default,,0000,0000,0000,,of, Dialogue: 0,0:59:24.53,0:59:31.53,Default,,0000,0000,0000,,maybe we should take that offline for - yeah, ask a question? Student:Then how would you do this, like, Dialogue: 0,0:59:38.39,0:59:41.08,Default,,0000,0000,0000,,if this [inaudible] didn't work? Instructor (Andrew Ng):Say that again. Student:How would you do it, say, like the 50,000 words - Instructor (Andrew Ng):Oh, okay. How to do this with the 50,000 words, yeah. So Dialogue: 0,0:59:41.08,0:59:42.36,Default,,0000,0000,0000,,it turns out Dialogue: 0,0:59:42.36,0:59:45.86,Default,,0000,0000,0000,,this is, sort of, a very practical question, really. How do I come up with this list of Dialogue: 0,0:59:45.86,0:59:49.63,Default,,0000,0000,0000,,words? One common way to do this is to actually Dialogue: 0,0:59:49.63,0:59:52.90,Default,,0000,0000,0000,,find some way to come up with a list of words, like go through all your emails, go through Dialogue: 0,0:59:52.90,0:59:53.91,Default,,0000,0000,0000,,all the - Dialogue: 0,0:59:53.91,0:59:56.93,Default,,0000,0000,0000,,in practice, one common way to come up with a list of words Dialogue: 0,0:59:56.93,1:00:00.34,Default,,0000,0000,0000,,is to just take all the words that appear in your training set.
That's one fairly common way Dialogue: 0,1:00:00.34,1:00:02.19,Default,,0000,0000,0000,,to do it, Dialogue: 0,1:00:02.19,1:00:06.13,Default,,0000,0000,0000,,or if that turns out to be too many words, you can take all words that appear Dialogue: 0,1:00:06.13,1:00:07.03,Default,,0000,0000,0000,,at least Dialogue: 0,1:00:07.03,1:00:08.08,Default,,0000,0000,0000,,three times Dialogue: 0,1:00:08.08,1:00:09.35,Default,,0000,0000,0000,,in your training set. So Dialogue: 0,1:00:09.35,1:00:10.66,Default,,0000,0000,0000,,words that Dialogue: 0,1:00:10.66,1:00:14.75,Default,,0000,0000,0000,,you didn't even see three times in the emails you got in the last Dialogue: 0,1:00:14.75,1:00:17.21,Default,,0000,0000,0000,,two months, you discard. So those are - I Dialogue: 0,1:00:17.21,1:00:20.26,Default,,0000,0000,0000,,was talking about going through a dictionary, which is a nice way of thinking about it, but in Dialogue: 0,1:00:20.26,1:00:22.05,Default,,0000,0000,0000,,practice, you might go through Dialogue: 0,1:00:22.05,1:00:26.41,Default,,0000,0000,0000,,your training set and then just take the union of all the words that appear in Dialogue: 0,1:00:26.41,1:00:29.76,Default,,0000,0000,0000,,it. Some texts, by the way, even call this selecting your features, but this is one Dialogue: 0,1:00:29.76,1:00:32.31,Default,,0000,0000,0000,,way to think about Dialogue: 0,1:00:32.31,1:00:34.14,Default,,0000,0000,0000,,creating your feature vector, Dialogue: 0,1:00:34.14,1:00:41.14,Default,,0000,0000,0000,,right, as zero and one values, okay? Moving on, yeah. Okay. You had a question? Student:I'm getting, kind of, confused on how you compute all those parameters. Instructor (Andrew Ng):On Dialogue: 0,1:00:49.21,1:00:51.61,Default,,0000,0000,0000,,how I came up with the parameters? Dialogue: 0,1:00:51.61,1:00:53.93,Default,,0000,0000,0000,,Student:Correct. Instructor (Andrew Ng):Let's see.
Dialogue: 0,1:00:53.93,1:00:58.25,Default,,0000,0000,0000,,So in Naive Bayes, what I need to do - the question was how did I come up with the parameters, right? Dialogue: 0,1:00:58.25,1:00:59.28,Default,,0000,0000,0000,,In Naive Bayes, Dialogue: 0,1:00:59.28,1:01:01.50,Default,,0000,0000,0000,,I need to build a model Dialogue: 0,1:01:01.50,1:01:03.55,Default,,0000,0000,0000,,for P of X given Y and for Dialogue: 0,1:01:03.55,1:01:05.36,Default,,0000,0000,0000,,P of Y, Dialogue: 0,1:01:05.36,1:01:09.75,Default,,0000,0000,0000,,right? So this is, I mean, in generative learning algorithms, I need to come up with Dialogue: 0,1:01:09.75,1:01:11.08,Default,,0000,0000,0000,,models for these. Dialogue: 0,1:01:11.08,1:01:15.36,Default,,0000,0000,0000,,So how'd I model P of Y? Well, I just chose to model it using a Bernoulli Dialogue: 0,1:01:15.36,1:01:16.39,Default,,0000,0000,0000,,distribution, Dialogue: 0,1:01:16.39,1:01:17.15,Default,,0000,0000,0000,,and so Dialogue: 0,1:01:17.15,1:01:19.71,Default,,0000,0000,0000,,P of Y will be Dialogue: 0,1:01:19.71,1:01:22.11,Default,,0000,0000,0000,,parameterized by that, all right? Student:Okay. Dialogue: 0,1:01:22.11,1:01:26.52,Default,,0000,0000,0000,,Instructor (Andrew Ng):And then how'd I model P of X given Y? Well, Dialogue: 0,1:01:26.52,1:01:28.53,Default,,0000,0000,0000,,let me keep changing colors.
Dialogue: 0,1:01:28.53,1:01:32.73,Default,,0000,0000,0000,,My model for P of X given Y, under the Naive Bayes assumption, is that I assume Dialogue: 0,1:01:32.73,1:01:34.54,Default,,0000,0000,0000,,that P of X given Y Dialogue: 0,1:01:34.54,1:01:37.55,Default,,0000,0000,0000,,is the product of these probabilities, Dialogue: 0,1:01:37.55,1:01:40.90,Default,,0000,0000,0000,,and so I'm going to need parameters to tell me Dialogue: 0,1:01:40.90,1:01:43.51,Default,,0000,0000,0000,,what's the probability of each word occurring, Dialogue: 0,1:01:43.51,1:01:46.31,Default,,0000,0000,0000,,you know, of each word occurring or not occurring, Dialogue: 0,1:01:46.31,1:01:53.31,Default,,0000,0000,0000,,conditioned on the email being spam or not spam, okay? Student:How is that Dialogue: 0,1:01:54.64,1:01:58.83,Default,,0000,0000,0000,,Bernoulli? Instructor (Andrew Ng):Oh, because X is either zero or one, right? By the way I defined the feature Dialogue: 0,1:01:58.83,1:02:00.80,Default,,0000,0000,0000,,vectors, XI Dialogue: 0,1:02:00.80,1:02:05.50,Default,,0000,0000,0000,,is either one or zero, depending on whether word I appears in the email, Dialogue: 0,1:02:05.50,1:02:08.86,Default,,0000,0000,0000,,right? So by the way I define the Dialogue: 0,1:02:08.86,1:02:11.42,Default,,0000,0000,0000,,feature vectors, XI - Dialogue: 0,1:02:11.42,1:02:16.64,Default,,0000,0000,0000,,the XI is always zero or one. So by definition, if XI, you know, is either zero or Dialogue: 0,1:02:16.64,1:02:19.97,Default,,0000,0000,0000,,one, then it has to be a Bernoulli distribution, right? Dialogue: 0,1:02:19.97,1:02:22.42,Default,,0000,0000,0000,,If XI were continuous, then Dialogue: 0,1:02:22.42,1:02:24.82,Default,,0000,0000,0000,,you might model it as a Gaussian, and you'd end up Dialogue: 0,1:02:24.82,1:02:27.30,Default,,0000,0000,0000,,like we did in Gaussian discriminant analysis.
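The Naive Bayes assumption just described says P(x|y) factors into a product of per-word Bernoulli terms. A minimal sketch of that product (my own notation, not from the lecture):

```python
import numpy as np

def bernoulli_likelihood(x, phi):
    """P(x | y) under the Naive Bayes assumption:
    the product over words j of phi[j] if x[j] == 1, else (1 - phi[j]),
    where phi[j] = P(x_j = 1 | y) for one fixed class y.

    x: (n,) binary feature vector; phi: (n,) per-word probabilities.
    """
    return float(np.prod(np.where(x == 1, phi, 1.0 - phi)))
```

For example, with two words, phi = [0.5, 0.25], and an email containing only the first word, the likelihood is 0.5 * 0.75.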
It's Dialogue: 0,1:02:27.30,1:02:30.72,Default,,0000,0000,0000,,just that the way I constructed my features for email, XI is always binary Dialogue: 0,1:02:30.72,1:02:32.13,Default,,0000,0000,0000,,valued, and so you end up Dialogue: 0,1:02:32.13,1:02:32.99,Default,,0000,0000,0000,,with a Dialogue: 0,1:02:32.99,1:02:36.06,Default,,0000,0000,0000,,Bernoulli here, okay? All right. I Dialogue: 0,1:02:36.06,1:02:40.39,Default,,0000,0000,0000,,should move on. So Dialogue: 0,1:02:40.39,1:02:42.39,Default,,0000,0000,0000,,it turns out that Dialogue: 0,1:02:42.39,1:02:44.34,Default,,0000,0000,0000,,this idea Dialogue: 0,1:02:44.34,1:02:46.37,Default,,0000,0000,0000,,almost works. Dialogue: 0,1:02:46.37,1:02:47.72,Default,,0000,0000,0000,,Now, here's the problem. Dialogue: 0,1:02:47.72,1:02:49.96,Default,,0000,0000,0000,,So let's say you Dialogue: 0,1:02:49.96,1:02:54.44,Default,,0000,0000,0000,,complete this class and you start to do, maybe do the class project, and you Dialogue: 0,1:02:54.44,1:02:56.76,Default,,0000,0000,0000,,keep working on your class project for a bit, and it Dialogue: 0,1:02:56.76,1:03:01.44,Default,,0000,0000,0000,,becomes really good, and you want to submit your class project to a conference, right? So, Dialogue: 0,1:03:01.44,1:03:03.48,Default,,0000,0000,0000,,you know, around - I don't know, Dialogue: 0,1:03:03.48,1:03:08.04,Default,,0000,0000,0000,,June every year is the deadline for the NIPS conference. Dialogue: 0,1:03:08.04,1:03:10.89,Default,,0000,0000,0000,,NIPS is just the name of a conference; it's an acronym. Dialogue: 0,1:03:10.89,1:03:12.36,Default,,0000,0000,0000,,And so maybe Dialogue: 0,1:03:12.36,1:03:16.08,Default,,0000,0000,0000,,you email your project partners, or senior friends even, and say, Hey, let's Dialogue: 0,1:03:16.08,1:03:20.07,Default,,0000,0000,0000,,work on a project and submit it to the NIPS conference.
And so you're getting these emails Dialogue: 0,1:03:20.07,1:03:21.49,Default,,0000,0000,0000,,with the word NIPS in them, Dialogue: 0,1:03:21.49,1:03:24.33,Default,,0000,0000,0000,,which you've probably never seen before, Dialogue: 0,1:03:24.33,1:03:28.18,Default,,0000,0000,0000,,and so a Dialogue: 0,1:03:28.18,1:03:32.52,Default,,0000,0000,0000,,piece of email comes from your project partner, and it Dialogue: 0,1:03:32.52,1:03:35.64,Default,,0000,0000,0000,,says, Let's send a paper to the NIPS conference. Dialogue: 0,1:03:35.64,1:03:37.31,Default,,0000,0000,0000,,And then your spam classifier Dialogue: 0,1:03:37.31,1:03:38.76,Default,,0000,0000,0000,,will say Dialogue: 0,1:03:38.76,1:03:39.93,Default,,0000,0000,0000,,P of X - Dialogue: 0,1:03:39.93,1:03:44.63,Default,,0000,0000,0000,,let's say NIPS is the 30,000th word in your dictionary, okay? Dialogue: 0,1:03:44.63,1:03:46.67,Default,,0000,0000,0000,,So P of X30,000 Dialogue: 0,1:03:46.67,1:03:50.19,Default,,0000,0000,0000,,= 1, given Dialogue: 0,1:03:50.19,1:03:51.62,Default,,0000,0000,0000,,Y = Dialogue: 0,1:03:51.62,1:03:52.80,Default,,0000,0000,0000,,1, Dialogue: 0,1:03:52.80,1:03:54.03,Default,,0000,0000,0000,,will be equal to 0. Dialogue: 0,1:03:54.03,1:03:57.48,Default,,0000,0000,0000,,That's the maximum likelihood estimate, right?
Because you've never seen the word NIPS before in Dialogue: 0,1:03:57.48,1:04:01.20,Default,,0000,0000,0000,,your training set, so the maximum likelihood estimate of the probability of seeing the word Dialogue: 0,1:04:01.20,1:04:02.46,Default,,0000,0000,0000,,NIPS is zero, Dialogue: 0,1:04:02.46,1:04:06.79,Default,,0000,0000,0000,,and, similarly, Dialogue: 0,1:04:06.79,1:04:08.38,Default,,0000,0000,0000,,you Dialogue: 0,1:04:08.38,1:04:12.46,Default,,0000,0000,0000,,know, in, I guess, non-spam mail, the chance of seeing the word NIPS is also Dialogue: 0,1:04:12.46,1:04:14.73,Default,,0000,0000,0000,,estimated Dialogue: 0,1:04:14.73,1:04:21.21,Default,,0000,0000,0000,,as zero. Dialogue: 0,1:04:21.21,1:04:28.21,Default,,0000,0000,0000,,So Dialogue: 0,1:04:34.79,1:04:40.71,Default,,0000,0000,0000,,when your spam classifier goes to compute P of Y = 1 given X, it will Dialogue: 0,1:04:40.71,1:04:44.53,Default,,0000,0000,0000,,compute this Dialogue: 0,1:04:44.53,1:04:47.21,Default,,0000,0000,0000,,right here, P of Y Dialogue: 0,1:04:47.21,1:04:54.21,Default,,0000,0000,0000,,over - well, Dialogue: 0,1:04:56.18,1:05:03.18,Default,,0000,0000,0000,,all Dialogue: 0,1:05:05.09,1:05:09.51,Default,,0000,0000,0000,,right. Dialogue: 0,1:05:09.51,1:05:11.33,Default,,0000,0000,0000,,And so Dialogue: 0,1:05:11.33,1:05:15.76,Default,,0000,0000,0000,,if you look at this term - this will be the product from I = Dialogue: 0,1:05:15.76,1:05:18.38,Default,,0000,0000,0000,,1 to 50,000 of Dialogue: 0,1:05:18.38,1:05:22.36,Default,,0000,0000,0000,,P of XI given Y - Dialogue: 0,1:05:22.36,1:05:26.14,Default,,0000,0000,0000,,one of those probabilities will be equal to Dialogue: 0,1:05:26.14,1:05:28.17,Default,,0000,0000,0000,, Dialogue: 0,1:05:28.17,1:05:32.34,Default,,0000,0000,0000,,zero because P of X30,000 = 1 given Y = 1 is equal to zero.
So you have a Dialogue: 0,1:05:32.34,1:05:36.70,Default,,0000,0000,0000,,zero in this product, and so the numerator is zero, Dialogue: 0,1:05:36.70,1:05:40.25,Default,,0000,0000,0000,,and in the same way, it turns out the denominator will also be zero, and so you end Dialogue: 0,1:05:40.25,1:05:42.72,Default,,0000,0000,0000,,up with - Dialogue: 0,1:05:42.72,1:05:45.84,Default,,0000,0000,0000,,actually all of these terms end up being zero. So you end up with P of Y = 1 Dialogue: 0,1:05:45.84,1:05:48.78,Default,,0000,0000,0000,,given X is 0 over 0 + 0, okay, which is Dialogue: 0,1:05:48.78,1:05:51.41,Default,,0000,0000,0000,,undefined. And the Dialogue: 0,1:05:51.41,1:05:53.96,Default,,0000,0000,0000,,problem with this is that it's Dialogue: 0,1:05:53.96,1:05:57.24,Default,,0000,0000,0000,,just statistically a bad idea Dialogue: 0,1:05:57.24,1:06:00.21,Default,,0000,0000,0000,,to say that P of X30,000 Dialogue: 0,1:06:00.21,1:06:02.95,Default,,0000,0000,0000,,given Y is Dialogue: 0,1:06:02.95,1:06:03.49,Default,,0000,0000,0000,,0, Dialogue: 0,1:06:03.49,1:06:06.27,Default,,0000,0000,0000,,right? Just because you haven't seen the word NIPS in your last Dialogue: 0,1:06:06.27,1:06:10.42,Default,,0000,0000,0000,,two months worth of email, it's also statistically not sound to say that, Dialogue: 0,1:06:10.42,1:06:16.32,Default,,0000,0000,0000,,therefore, the chance of ever seeing this word is zero, right? 
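This 0/0 failure is easy to reproduce numerically. In the toy sketch below (all numbers invented for illustration), word index 2 plays the role of "NIPS": it was never seen in training, so its MLE probability is zero under both classes, and both the numerator and the denominator of Bayes rule vanish.

```python
import numpy as np

phi_spam = np.array([0.8, 0.3, 0.0])  # P(word j appears | spam); word 2 unseen
phi_ham  = np.array([0.1, 0.4, 0.0])  # P(word j appears | not spam)
phi_y = 0.5                           # P(spam)
x = np.array([1, 0, 1])               # new email containing the unseen word

def lk(x, phi):
    # P(x | y) as a product of per-word Bernoulli terms
    return float(np.prod(np.where(x == 1, phi, 1.0 - phi)))

numerator = lk(x, phi_spam) * phi_y
denominator = numerator + lk(x, phi_ham) * (1 - phi_y)
# Both are zero, so P(y = 1 | x) = 0/0 is undefined, as in the lecture.
```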
Dialogue: 0,1:06:16.32,1:06:19.17,Default,,0000,0000,0000,,And so Dialogue: 0,1:06:19.17,1:06:20.48,Default,,0000,0000,0000,, Dialogue: 0,1:06:20.48,1:06:22.72,Default,,0000,0000,0000,, Dialogue: 0,1:06:22.72,1:06:25.82,Default,,0000,0000,0000,,there's this idea that just because you haven't seen something Dialogue: 0,1:06:25.82,1:06:30.40,Default,,0000,0000,0000,,before, that may mean that the event is unlikely, but it doesn't mean that Dialogue: 0,1:06:30.40,1:06:33.92,Default,,0000,0000,0000,,it's impossible. It's just wrong to say that if you've never seen the word NIPS before, Dialogue: 0,1:06:33.92,1:06:40.76,Default,,0000,0000,0000,,then it is impossible to ever see the word NIPS in future emails - that the chance of that is just zero. Dialogue: 0,1:06:40.76,1:06:47.76,Default,,0000,0000,0000,,So we're gonna fix this, Dialogue: 0,1:06:48.34,1:06:50.08,Default,,0000,0000,0000,,and Dialogue: 0,1:06:50.08,1:06:54.12,Default,,0000,0000,0000,,to motivate the fix I'll talk about Dialogue: 0,1:06:54.12,1:06:58.94,Default,,0000,0000,0000,,- the example we're gonna use is, let's say that you've been following the Stanford basketball Dialogue: 0,1:06:58.94,1:07:04.10,Default,,0000,0000,0000,,team for all of their away games, and been, sort of, tracking their wins and losses Dialogue: 0,1:07:04.10,1:07:08.17,Default,,0000,0000,0000,,to gather statistics, and, maybe - I don't know, form a betting pool about Dialogue: 0,1:07:08.17,1:07:11.06,Default,,0000,0000,0000,,whether they're likely to win or lose the next game, okay? Dialogue: 0,1:07:11.06,1:07:12.63,Default,,0000,0000,0000,,So Dialogue: 0,1:07:12.63,1:07:15.41,Default,,0000,0000,0000,,these are some of the statistics.
Dialogue: 0,1:07:15.41,1:07:17.59,Default,,0000,0000,0000,, Dialogue: 0,1:07:17.59,1:07:19.43,Default,,0000,0000,0000,, Dialogue: 0,1:07:19.43,1:07:24.82,Default,,0000,0000,0000,,So on, I guess, the 8th of February Dialogue: 0,1:07:24.82,1:07:29.30,Default,,0000,0000,0000,,last season they played Washington State, and they Dialogue: 0,1:07:29.30,1:07:31.78,Default,,0000,0000,0000,,did not win. Dialogue: 0,1:07:31.78,1:07:36.24,Default,,0000,0000,0000,,On Dialogue: 0,1:07:36.24,1:07:38.21,Default,,0000,0000,0000,,the 11th of February, Dialogue: 0,1:07:38.21,1:07:42.35,Default,,0000,0000,0000,,they played Washington; on the 22nd Dialogue: 0,1:07:42.35,1:07:47.26,Default,,0000,0000,0000,,they played USC, Dialogue: 0,1:07:47.26,1:07:49.77,Default,,0000,0000,0000,, Dialogue: 0,1:07:49.77,1:07:54.93,Default,,0000,0000,0000,,played UCLA, Dialogue: 0,1:07:54.93,1:07:57.63,Default,,0000,0000,0000,,played USC again, Dialogue: 0,1:07:57.63,1:08:03.74,Default,,0000,0000,0000,, Dialogue: 0,1:08:03.74,1:08:05.82,Default,,0000,0000,0000,,and now you want to estimate Dialogue: 0,1:08:05.82,1:08:09.88,Default,,0000,0000,0000,,what's the chance that they'll win or lose against Louisville, right? Dialogue: 0,1:08:09.88,1:08:11.17,Default,,0000,0000,0000,,So Dialogue: 0,1:08:11.17,1:08:14.91,Default,,0000,0000,0000,,they played five away games last season and lost all of them, but it Dialogue: 0,1:08:14.91,1:08:17.01,Default,,0000,0000,0000,,seems awfully harsh to say that - so it Dialogue: 0,1:08:17.01,1:08:24.01,Default,,0000,0000,0000,,seems awfully harsh to say there's zero chance that they'll Dialogue: 0,1:08:37.08,1:08:40.27,Default,,0000,0000,0000,,win the next game.
So here's the idea behind Laplace smoothing, Dialogue: 0,1:08:40.27,1:08:42.63,Default,,0000,0000,0000,,which is Dialogue: 0,1:08:42.63,1:08:44.82,Default,,0000,0000,0000,,that we're estimating Dialogue: 0,1:08:44.82,1:08:48.22,Default,,0000,0000,0000,,the probability of Y being equal to one, right? Dialogue: 0,1:08:48.22,1:08:53.19,Default,,0000,0000,0000,,Normally, the maximum likelihood estimate is the Dialogue: 0,1:08:53.19,1:08:56.06,Default,,0000,0000,0000,,number of ones Dialogue: 0,1:08:56.06,1:08:57.66,Default,,0000,0000,0000,,divided by Dialogue: 0,1:08:57.66,1:09:00.76,Default,,0000,0000,0000,,the number of zeros Dialogue: 0,1:09:00.76,1:09:05.35,Default,,0000,0000,0000,,plus the number of ones, okay? I Dialogue: 0,1:09:05.35,1:09:07.87,Default,,0000,0000,0000,,hope this informal notation makes sense, right? You know, Dialogue: 0,1:09:07.87,1:09:11.35,Default,,0000,0000,0000,,the maximum likelihood estimate for, sort of, a win or loss - for a Bernoulli random Dialogue: 0,1:09:11.35,1:09:12.22,Default,,0000,0000,0000,,variable - Dialogue: 0,1:09:12.22,1:09:14.63,Default,,0000,0000,0000,,is Dialogue: 0,1:09:14.63,1:09:16.97,Default,,0000,0000,0000,,just the number of ones you saw Dialogue: 0,1:09:16.97,1:09:20.86,Default,,0000,0000,0000,,divided by the total number of examples, which is the number of zeros you saw plus the number of ones you saw.
So in Dialogue: 0,1:09:20.86,1:09:22.86,Default,,0000,0000,0000,,Laplace Dialogue: 0,1:09:22.86,1:09:24.17,Default,,0000,0000,0000,,smoothing, Dialogue: 0,1:09:24.17,1:09:25.80,Default,,0000,0000,0000,,we're going to Dialogue: 0,1:09:25.80,1:09:29.55,Default,,0000,0000,0000,,just take each of these terms, the number Dialogue: 0,1:09:29.55,1:09:31.83,Default,,0000,0000,0000,,of zeros, and add one to that, and the Dialogue: 0,1:09:31.83,1:09:34.13,Default,,0000,0000,0000,,number of ones, and add one to that, Dialogue: 0,1:09:34.13,1:09:37.11,Default,,0000,0000,0000,,and so in our example, Dialogue: 0,1:09:37.11,1:09:38.30,Default,,0000,0000,0000,,instead of estimating Dialogue: 0,1:09:38.30,1:09:42.89,Default,,0000,0000,0000,,the probability of winning the next game to be 0 over Dialogue: 0,1:09:42.89,1:09:45.21,Default,,0000,0000,0000,,5 + 0, Dialogue: 0,1:09:45.21,1:09:50.32,Default,,0000,0000,0000,,we'll add one to all of these counts, and so we say that the chance of Dialogue: 0,1:09:50.32,1:09:51.85,Default,,0000,0000,0000,,their Dialogue: 0,1:09:51.85,1:09:53.80,Default,,0000,0000,0000,,winning the next game is 1/7th, Dialogue: 0,1:09:53.80,1:09:55.18,Default,,0000,0000,0000,,okay? Which is to say, Dialogue: 0,1:09:55.18,1:09:59.31,Default,,0000,0000,0000,,having seen them lose, you know, five away games in a row, we aren't terribly - Dialogue: 0,1:09:59.31,1:10:02.51,Default,,0000,0000,0000,,we don't think it's terribly likely they'll win the next game, but at Dialogue: 0,1:10:02.51,1:10:05.98,Default,,0000,0000,0000,,least we're not saying it's impossible. Dialogue: 0,1:10:05.98,1:10:09.95,Default,,0000,0000,0000,,As a historical side note, Laplace actually came up with this method. Dialogue: 0,1:10:09.95,1:10:14.32,Default,,0000,0000,0000,,It's called Laplace smoothing after him.
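The basketball arithmetic can be sketched in a couple of lines (names are my own): the MLE of P(win) after zero wins and five losses is 0/5 = 0, while Laplace smoothing adds one to each of the two counts.

```python
from fractions import Fraction

def laplace_bernoulli(num_ones, num_zeros):
    """Laplace-smoothed estimate of P(y = 1) for a Bernoulli variable:
    add one to the count of ones and one to the count of zeros."""
    return Fraction(num_ones + 1, num_ones + num_zeros + 2)
```

With zero wins in five games this gives 1/7 instead of the MLE's 0, which matches the estimate on the board.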
Dialogue: 0,1:10:14.32,1:10:18.40,Default,,0000,0000,0000,, Dialogue: 0,1:10:18.40,1:10:21.92,Default,,0000,0000,0000,,He came up with it when he was trying to estimate the probability that the sun will rise tomorrow, and his rationale Dialogue: 0,1:10:21.92,1:10:25.24,Default,,0000,0000,0000,,was that for a lot of days now, we've seen the sun rise, Dialogue: 0,1:10:25.24,1:10:28.49,Default,,0000,0000,0000,,but that doesn't mean we can be absolutely certain the sun will rise tomorrow. Dialogue: 0,1:10:28.49,1:10:31.42,Default,,0000,0000,0000,,He was using this method to estimate that probability. This is, kind of, Dialogue: 0,1:10:31.42,1:10:36.45,Default,,0000,0000,0000,,cool. So, Dialogue: 0,1:10:36.45,1:10:38.55,Default,,0000,0000,0000,,more generally, Dialogue: 0,1:10:38.55,1:10:41.57,Default,,0000,0000,0000,, Dialogue: 0,1:10:41.57,1:10:43.97,Default,,0000,0000,0000,, Dialogue: 0,1:10:43.97,1:10:45.43,Default,,0000,0000,0000,, Dialogue: 0,1:10:45.43,1:10:46.45,Default,,0000,0000,0000,, Dialogue: 0,1:10:46.45,1:10:49.41,Default,,0000,0000,0000,,if Y Dialogue: 0,1:10:49.41,1:10:52.84,Default,,0000,0000,0000,,takes on Dialogue: 0,1:10:52.84,1:10:55.15,Default,,0000,0000,0000,,K possible values, Dialogue: 0,1:10:55.15,1:10:59.48,Default,,0000,0000,0000,,if you're trying to estimate the parameters of the multinomial, then you estimate P of Y = 1. Dialogue: 0,1:10:59.48,1:11:01.73,Default,,0000,0000,0000,,Let's Dialogue: 0,1:11:01.73,1:11:05.90,Default,,0000,0000,0000,,see. Dialogue: 0,1:11:05.90,1:11:11.38,Default,,0000,0000,0000,,So the maximum likelihood estimate will be sum from I = 1 to M, Dialogue: 0,1:11:11.38,1:11:16.50,Default,,0000,0000,0000,,indicator YI = J, divided by M, Dialogue: 0,1:11:16.50,1:11:19.38,Default,,0000,0000,0000,,right?
Dialogue: 0,1:11:19.38,1:11:21.63,Default,,0000,0000,0000,,That's the maximum likelihood estimate Dialogue: 0,1:11:21.63,1:11:24.37,Default,,0000,0000,0000,,of the multinomial probability of Y Dialogue: 0,1:11:24.37,1:11:26.93,Default,,0000,0000,0000,,being equal to - oh, Dialogue: 0,1:11:26.93,1:11:29.23,Default,,0000,0000,0000,,excuse me, Y = J. All right. Dialogue: 0,1:11:29.23,1:11:33.11,Default,,0000,0000,0000,,That's the maximum likelihood estimate for the probability of Y = J, Dialogue: 0,1:11:33.11,1:11:36.89,Default,,0000,0000,0000,,and so when you apply Laplace smoothing to that, Dialogue: 0,1:11:36.89,1:11:39.64,Default,,0000,0000,0000,,you add one to the numerator, and Dialogue: 0,1:11:39.64,1:11:41.89,Default,,0000,0000,0000,,add K to the denominator, Dialogue: 0,1:11:41.89,1:11:48.89,Default,,0000,0000,0000,,if Y can take on K possible values, okay? Dialogue: 0,1:12:06.47,1:12:08.58,Default,,0000,0000,0000,,So for Naive Bayes, Dialogue: 0,1:12:08.58,1:12:14.35,Default,,0000,0000,0000,,what that gives us is - Dialogue: 0,1:12:14.35,1:12:21.35,Default,,0000,0000,0000,,shoot. Dialogue: 0,1:12:38.64,1:12:44.32,Default,,0000,0000,0000,,Right? So that was the maximum likelihood estimate, and what you Dialogue: 0,1:12:44.32,1:12:47.95,Default,,0000,0000,0000,,end up doing is adding one to the numerator and adding two to the denominator, Dialogue: 0,1:12:47.95,1:12:52.03,Default,,0000,0000,0000,,and this solves the problem of the zero probabilities, and when your friend sends Dialogue: 0,1:12:52.03,1:12:57.49,Default,,0000,0000,0000,,you email about the NIPS conference, Dialogue: 0,1:12:57.49,1:13:00.80,Default,,0000,0000,0000,,your spam filter will still be able to Dialogue: 0,1:13:00.80,1:13:07.80,Default,,0000,0000,0000,,make a meaningful prediction, all right? Okay. Dialogue: 0,1:13:12.03,1:13:14.83,Default,,0000,0000,0000,,Shoot. Any questions about this? Yeah?
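The general K-class rule just stated - add one to the numerator and K to the denominator - contains the binary case above as K = 2. A sketch (hypothetical names, my own code):

```python
import numpy as np

def laplace_multinomial(y, k):
    """Laplace-smoothed estimates of P(y = j) for j = 0, ..., k-1.

    y: integer labels in {0, ..., k-1}; k: number of classes.
    The MLE is counts / m; smoothing gives (counts + 1) / (m + k),
    so no class ever gets probability zero.
    """
    counts = np.bincount(np.asarray(y), minlength=k)
    return (counts + 1) / (len(y) + k)
```

For example, labels [0, 0, 1] with k = 3 give estimates (3/6, 2/6, 1/6) rather than the MLE's (2/3, 1/3, 0), and the smoothed estimates still sum to one.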
Student:So that's what doesn't make sense, because, for instance, if you take the Dialogue: 0,1:13:14.83,1:13:16.24,Default,,0000,0000,0000,,games on the board, it's a reasonable assumption that the probability Dialogue: 0,1:13:16.24,1:13:23.24,Default,,0000,0000,0000,,of Dialogue: 0,1:13:24.65,1:13:27.96,Default,,0000,0000,0000,,winning is very close to zero, so, I mean, the prediction should Dialogue: 0,1:13:27.96,1:13:29.79,Default,,0000,0000,0000,,be equal to, say, zero. Instructor (Andrew Ng):Right. Dialogue: 0,1:13:29.79,1:13:35.96,Default,,0000,0000,0000,,I would say that Dialogue: 0,1:13:35.96,1:13:38.14,Default,,0000,0000,0000,,in this case the prediction Dialogue: 0,1:13:38.14,1:13:41.17,Default,,0000,0000,0000,,is 1/7th, right? We don't have a lot of - if you see somebody lose five games Dialogue: 0,1:13:41.17,1:13:44.29,Default,,0000,0000,0000,,in a row, you may not have a lot of faith in them, Dialogue: 0,1:13:44.29,1:13:48.52,Default,,0000,0000,0000,,but as an extreme example, suppose you saw them lose just one game, Dialogue: 0,1:13:48.52,1:13:51.50,Default,,0000,0000,0000,,right? It's just not reasonable to say that the chance of winning the next game Dialogue: 0,1:13:51.50,1:13:52.86,Default,,0000,0000,0000,,is zero, but Dialogue: 0,1:13:52.86,1:13:58.90,Default,,0000,0000,0000,,that's what the maximum likelihood Dialogue: 0,1:13:58.90,1:14:01.15,Default,,0000,0000,0000,,estimate Dialogue: 0,1:14:01.15,1:14:05.10,Default,,0000,0000,0000,,will say. Student:Yes.
Instructor (Andrew Ng):And - Dialogue: 0,1:14:05.10,1:14:08.14,Default,,0000,0000,0000,,Student:In such a case, would the learning algorithm [inaudible] or - Instructor (Andrew Ng):So it's a question of, you Dialogue: 0,1:14:08.14,1:14:11.31,Default,,0000,0000,0000,,know, given just five training examples, what's a reasonable estimate for the chance of Dialogue: 0,1:14:11.31,1:14:13.17,Default,,0000,0000,0000,,winning the next game, Dialogue: 0,1:14:13.17,1:14:14.12,Default,,0000,0000,0000,,and Dialogue: 0,1:14:14.12,1:14:18.46,Default,,0000,0000,0000,,1/7th, I think, is actually pretty reasonable. It's less than 1/5th, for instance. Dialogue: 0,1:14:18.46,1:14:21.12,Default,,0000,0000,0000,,We're saying the chance of winning the next game is less Dialogue: 0,1:14:21.12,1:14:25.07,Default,,0000,0000,0000,,than 1/5th. It turns out, under a certain set of Dialogue: 0,1:14:25.07,1:14:27.86,Default,,0000,0000,0000,,Bayesian assumptions about the prior and posterior, Dialogue: 0,1:14:27.86,1:14:31.28,Default,,0000,0000,0000,,this Laplace smoothing actually gives the optimal estimate, Dialogue: 0,1:14:31.28,1:14:33.46,Default,,0000,0000,0000,,in a certain sense I won't go into, Dialogue: 0,1:14:33.46,1:14:37.01,Default,,0000,0000,0000,,of what's the chance of winning the next game - under a certain assumption Dialogue: 0,1:14:37.01,1:14:38.08,Default,,0000,0000,0000,,about the Dialogue: 0,1:14:38.08,1:14:40.68,Default,,0000,0000,0000,,Bayesian prior on the parameter. Dialogue: 0,1:14:40.68,1:14:44.26,Default,,0000,0000,0000,,So I don't know. It actually seems like a pretty reasonable estimate to me. Dialogue: 0,1:14:44.26,1:14:46.94,Default,,0000,0000,0000,,Although, I should say, it actually Dialogue: 0,1:14:46.94,1:14:50.06,Default,,0000,0000,0000,,turned out - Dialogue: 0,1:14:50.06,1:14:53.53,Default,,0000,0000,0000,,No, I'm just being mean.
We actually are a pretty good basketball team, but I chose a Dialogue: 0,1:14:53.53,1:14:54.66,Default,,0000,0000,0000,,losing streak Dialogue: 0,1:14:54.66,1:14:58.77,Default,,0000,0000,0000,,because it's funnier that way. Dialogue: 0,1:14:58.77,1:15:05.77,Default,,0000,0000,0000,,Let's see. Shoot. Does someone want to - are there other questions about Dialogue: 0,1:15:07.49,1:15:09.32,Default,,0000,0000,0000,, Dialogue: 0,1:15:09.32,1:15:13.38,Default,,0000,0000,0000,,this? No? Okay. So there's more that I want to say about Naive Bayes, but Dialogue: 0,1:15:13.38,1:15:15.06,Default,,0000,0000,0000,, Dialogue: 0,1:15:15.06,1:15:15.86,Default,,0000,0000,0000,,we'll do that in the next lecture. So let's wrap it up there.