This presentation is delivered by the Stanford Center for Professional Development.

Okay. Good morning. Welcome back. What I want to do today is actually wrap up our discussion on learning theory. I'm gonna start by talking about Bayesian statistics and regularization, and then take a very brief digression to tell you about online learning. And most of today's lecture will actually be on advice for applying machine learning algorithms to problems like, you know, the project, or other problems you may go work on after you graduate from this class.

But let's start by talking about Bayesian statistics and regularization. So you remember from last week, we started to talk about learning theory, and we learned about bias and variance. And I guess we spent most of the previous lecture talking about algorithms for model selection and for feature selection. We talked about cross-validation. Right? So most of the methods we talked about in the previous lecture were ways for you to try to simplify the model. So for example, the feature selection algorithms we talked about give you a way to eliminate a number of features, so as to reduce the number of parameters you need to fit and thereby reduce overfitting. Right? You remember that? So feature selection algorithms choose a subset of the features so that you have fewer parameters and you may be less likely to overfit. Right?

What I want to do today is to talk about a different way to prevent overfitting. There's a method called regularization, which is a way that lets you keep all the parameters.

So here's the idea, and I'm gonna illustrate this with, say, linear regression. So you take the linear regression model, the very first model we learned about, right, and we said that we would choose the parameters via maximum likelihood. Right?
And that meant that, you know, you would choose the parameters theta that maximized the probability of the data we observe:

theta_ML = arg max_theta prod_{i=1}^{m} p(y^(i) | x^(i); theta)

Right? And so, to give this sort of procedure a name, this is one example of the most common frequentist procedure. And frequentist statistics you can think of, sort of, as maybe one school of statistics. And the philosophical view behind writing this down was that we envisioned there was some true parameter theta out there that generated, you know, the Xs and the Ys. There's some true parameter theta that governs housing prices, Y as a function of X, and we don't know what the value of theta is, and we'd like to come up with some procedure for estimating the value of theta. Okay? And so maximum likelihood is just one possible procedure for estimating the unknown value of theta.

And the way you formulated this, you know, theta was not a random variable. Right? That's why we said theta is just some true value out there. It's not random or anything, we just don't know what it is, and we have a procedure called maximum likelihood for estimating the value of theta. So this is one example of what's called a frequentist procedure.

The alternative to the, I guess, the frequentist school of statistics is the Bayesian school, in which we're gonna say that we don't know what theta is, and so we will put a prior on theta. Okay? So in the Bayesian school, you would say, "Well, we don't know what the value of theta is, so let's represent our uncertainty over theta with a prior."

So for example, our prior on theta may be a Gaussian distribution with mean zero and covariance matrix given by tau squared I:

theta ~ N(0, tau^2 I)

Okay? And so - actually, if I use S to denote my training set, well - right, this prior represents my beliefs about what the parameters are in the absence of any data. So not having seen any data, the prior represents, you know, what I think theta is most likely to be.
And so, given the training set S, in the sort of Bayesian procedure, we would calculate the posterior probability of my parameters given my training set. Let's write this on the next board. So my posterior on my parameters given my training set, by Bayes' rule, will be proportional to, you know, this:

p(theta | S) ∝ p(S | theta) p(theta)

Right? So, by Bayes' rule. Let's call it the posterior. And this distribution now represents my beliefs about what theta is after I've seen the training set.

And when you now want to make a new prediction on the price of a new house, on the input X, I would say that, well, the distribution over the possible housing prices for this new house I'm trying to estimate the price of - given the size of the house, the features of the house X, and the training set I had previously - is going to be given by an integral over my parameters theta of the probability of Y given X comma theta, times the posterior distribution of theta given the training set:

p(y | x, S) = ∫_theta p(y | x, theta) p(theta | S) d theta

Okay? And in particular, if you want your prediction to be the expected value of Y given the input X and the training set, you would integrate over Y times that posterior:

E[y | x, S] = ∫_y y p(y | x, S) dy

You would take the expectation of Y with respect to your posterior distribution. Okay?

And you notice that when I was writing this down, with the Bayesian formulation, I've now started to write Y given X comma theta, because this formula now is the probability of Y conditioned on the values of the random variables X and theta. So I'm no longer writing semicolon theta, I'm writing comma theta, because I'm now treating theta as a random variable. So all of this is somewhat abstract but this is - and it turns out - actually, let's check. Are there questions about this? No? Okay. Let's try to make this more concrete.
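As a minimal numerical sketch of that prediction integral - this is not from the lecture - consider Bayesian linear regression with Gaussian noise and the Gaussian prior above. There the posterior p(theta | S) happens to be Gaussian in closed form, so E[y | x, S] can be approximated by sampling theta from the posterior and averaging; the noise level sigma and all variable names below are illustrative assumptions.

```python
import numpy as np

# Hypothetical setup: y = theta^T x + noise, noise ~ N(0, sigma^2),
# with prior theta ~ N(0, tau^2 I). All values here are made up.
rng = np.random.default_rng(0)
m, n = 20, 3
X = rng.normal(size=(m, n))
theta_true = np.array([1.0, -2.0, 0.5])
sigma, tau = 0.3, 1.0
y = X @ theta_true + sigma * rng.normal(size=m)

# For this Gaussian model the posterior p(theta | S) is Gaussian in
# closed form, so we can compute it rather than just write it down.
Sigma_post = np.linalg.inv(X.T @ X / sigma**2 + np.eye(n) / tau**2)
mu_post = Sigma_post @ (X.T @ y) / sigma**2

# Approximate the integral E[y | x, S] by Monte Carlo: sample theta from
# the posterior and average E[y | x, theta] = theta^T x over the samples.
x_new = rng.normal(size=n)
samples = rng.multivariate_normal(mu_post, Sigma_post, size=5000)
print(np.mean(samples @ x_new))   # ~ mu_post @ x_new for this linear model
```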
It turns out that for many problems, both of these steps in the computation are difficult. Because if, you know, theta is an n-plus-one-dimensional parameter vector, then this is an integral over an n-plus-one-dimensional space, over R^{n+1}. And numerically it's very difficult to compute integrals over very high-dimensional spaces. All right? So usually it's hard to compute the posterior over theta, and it's also hard to compute this integral, if theta is very high-dimensional. There are a few exceptions for which this can be done in closed form, but for many learning algorithms, say, Bayesian logistic regression, this is hard to do.

And so what's commonly done is, instead of actually computing the full posterior distribution p(theta | S), we'll take this quantity on the right-hand side and just maximize it. So let me write this down. Commonly, instead of computing the full posterior distribution, we will choose the following. Okay? We will choose what's called the MAP estimate, or the maximum a posteriori estimate of theta, which is the most likely value of theta - the most probable value of theta - under your posterior distribution. And that's just

theta_MAP = arg max_theta prod_{i=1}^{m} p(y^(i) | x^(i), theta) p(theta)

And then when you need to make a prediction, you know, you would just predict using your usual hypothesis h_{theta_MAP}, using this MAP value of theta in place of - as the parameter vector you'd choose. Okay?

And notice, the only difference between this and standard maximum likelihood estimation is that instead of choosing the maximum likelihood value of theta, you're maximizing this, which is what you have for maximum likelihood estimation, times this other quantity, which is the prior. Right?
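For linear regression in particular, this MAP estimate with the Gaussian prior above has a closed form. Here's a minimal sketch, not from the lecture: the equivalence to ridge regression with lambda = sigma^2 / tau^2 is a standard fact, but the function name and interface are mine.

```python
import numpy as np

def map_estimate(X, y, sigma2, tau2):
    """MAP estimate for linear regression with prior theta ~ N(0, tau2 * I).

    Maximizing prod_i p(y_i | x_i, theta) * p(theta) with Gaussian noise of
    variance sigma2 is the same as ridge regression with lam = sigma2 / tau2.
    """
    lam = sigma2 / tau2
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

# As tau2 grows, the prior flattens, lam -> 0, and theta_MAP approaches
# the ordinary maximum likelihood (least squares) estimate.
```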
And, let's see, one intuition is that if your prior is theta being Gaussian with mean zero and some covariance, then for a distribution like this, most of the probability mass is close to zero. Right? It's a Gaussian centered around the point zero, and so most of the mass is close to zero.

And so the prior distribution is saying that you think most of the parameters should be close to zero. And if you remember our discussion on feature selection: if you eliminate a feature from consideration, that's the same as setting the corresponding value of theta equal to zero. All right? So if you set theta five equal to zero, that's the same as, you know, eliminating feature five from your hypothesis. And so, this is a prior that drives most of the parameter values to values close to zero. And you can think of this as doing something analogous to, or reminiscent of, feature selection. Okay? And it turns out that with this formulation, the parameters won't actually be exactly zero, but many of the values will be close to zero.

And I guess in pictures: if you remember, I said that if you have, say, five data points and you fit a fourth-order polynomial - well, I think that had too many bumps in it, but never mind. If you fit a very high-order polynomial to a very small dataset, you can get these very large oscillations if you use maximum likelihood estimation. All right? In contrast, if you apply this sort of Bayesian regularization, you can actually fit a higher-order polynomial and still get a smoother and smoother fit to the data as you decrease tau. So as you decrease tau, you're driving the parameters to be closer and closer to zero. And in practice - it's sort of hard to see, but you can take my word for it - as tau becomes smaller and smaller, the curves you tend to fit to your data become smoother and smoother, and so you tend to overfit less and less, even when you're fitting a large number of parameters. Okay?
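Here's a tiny sketch of that picture in code, again not from the lecture: fit a fourth-order polynomial to five points by MAP and watch the parameter norm shrink as tau decreases. The data, the assumed noise variance, and the tau values are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 5)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=5)
Phi = np.vander(x, 5)          # polynomial features x^4, x^3, ..., x, 1

for tau in [10.0, 1.0, 0.1]:
    lam = 0.01 / tau**2        # lambda = sigma^2 / tau^2, assuming sigma^2 = 0.01
    theta = np.linalg.solve(Phi.T @ Phi + lam * np.eye(5), Phi.T @ y)
    print(f"tau = {tau:5.1f}   ||theta|| = {np.linalg.norm(theta):8.3f}")
# Smaller tau => smaller ||theta|| => a smoother fitted polynomial.
```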
Let's see. And one last piece of intuition that I would just toss out there - and you get to play with this particular set of ideas more in Problem Set 3, which I'll post online later this week, I guess - is that whereas maximum likelihood for, say, linear regression turns out to be minimizing this:

sum_{i=1}^{m} (y^(i) - theta^T x^(i))^2

it turns out that if you add this prior term there, the optimization objective you end up optimizing turns out to be that:

sum_{i=1}^{m} (y^(i) - theta^T x^(i))^2 + lambda ||theta||^2

where you add an extra term that, you know, penalizes your parameters theta for being large. And so this ends up being an algorithm that's very similar to maximum likelihood, except that you tend to keep your parameters small. And this has the effect - again, it's kind of hard to see, but just take my word for it - that shrinking the parameters has the effect of keeping the functions you fit smoother and less likely to overfit. Okay?

Okay, hopefully this will make more sense when you play with these ideas a bit more in the next problem set. But let's check questions about all this.

Student: The smoothing behavior - is it because [inaudible] actually get different [inaudible]?

Instructor: Let's see. Yeah. It depends on - well, most priors with most of the mass close to zero will get this effect, I guess. And just by convention, the Gaussian prior is what's used most, the most common for models like logistic regression and linear regression - generalized linear models. There are a few other priors that people sometimes use, like the Laplace prior, but all of them will tend to have these sorts of smoothing effects. All right. Cool.

And so it turns out that for problems like text classification - text classification has like 30,000 features or 50,000 features - it seems like an algorithm like logistic regression would be very much prone to overfitting. Right? So imagine trying to build a spam classifier: maybe you have 100 training examples but you have 30,000 features or 50,000 features. That seems clearly to be prone to overfitting. Right?
But it turns out that with this sort of Bayesian regularization, with the [inaudible] Gaussian, logistic regression becomes a very effective text classification algorithm. Alex?

Student: [Inaudible]?

Instructor: Yeah, right. And so, to pick either tau squared or lambda - I think the relation is lambda equals one over tau squared - but right, to pick either tau squared or lambda, you could use cross-validation, yeah. All right? Okay, cool.

So, all right, that was all I wanted to say about methods for preventing overfitting. What I want to do next is just spend, you know, five minutes talking about online learning. And this is sort of a digression. And so, you know, when you're designing the syllabus of a class, I guess, sometimes there are just some ideas you want to talk about but can't find a very good place to fit in anywhere. So this is one of those ideas that may seem a bit disjointed from the rest of the class, but I just want to tell you a little bit about it.

Okay. So here's the idea. So far, all the learning algorithms we've talked about are what's called batch learning algorithms, where you're given a training set, then you get to run your learning algorithm on the training set, and then maybe you test it on some other test set. And there's another learning setting called online learning, in which you have to make predictions even while you are in the process of learning.

So here's how the problem goes. All right? I'm first gonna give you x1. Let's say this is a classification problem, so I'm first gonna give you x1 and then ask you, you know, "Can you make a prediction on x1? Is the label one or zero?" And you've not seen any data yet. And so you make a guess. Right? We'll call your guess y-hat-1. And after you've made your prediction, I will then reveal to you the true label y1. Okay? And not having seen any data before, your odds of getting the first one right are only 50 percent, right, if you guess randomly. And then I show you x2.
And then I ask you, "Can you make a prediction on x2?" And so you now maybe are gonna make a slightly more educated guess, and call that y-hat-2. And after you've made your guess, I reveal the true label to you. And then I show you x3, you make your guess, and learning proceeds like this.

So this is different from batch machine learning; this models settings where you have to keep learning even as you're making predictions. Okay? So, I don't know, say you're running a website and you have users coming in. As the first user comes in, you need to start making predictions already about what the user likes or dislikes. And it's only, you know, as you're making predictions that you get shown more and more training examples.

So in online learning, what you care about is the total online error, which is

sum_{i=1}^{m} 1{y-hat^(i) ≠ y^(i)}

if you get the sequence of m examples all together. Okay? So the total online error is the total number of mistakes you make on a sequence of examples like this.

And it turns out that, you know, many of the learning algorithms you've learned about can be applied to this setting. One thing you could do is, when you're asked to make a prediction - to output y-hat-3, say - one simple thing to do is: well, you've seen some other training examples up to this point, so you can just take your learning algorithm and run it on those examples. So just run the learning algorithm on all the examples you've seen previous to being asked to make a prediction on a certain example, and then use your learning algorithm to make a prediction on the next example. And it turns out that there are also algorithms, especially the algorithms that we saw you could train with stochastic gradient descent, that can be adapted very nicely to this.

So as a concrete example, if you remember the perceptron algorithm, say, right, you would initialize the parameters theta to be equal to zero.
And then after seeing the i-th training example, you'd update the parameters, you know, using - you've seen this rule a lot of times now, right - using the standard perceptron learning rule:

theta := theta + alpha (y^(i) - h_theta(x^(i))) x^(i)

And the same thing if you were using logistic regression: you can, again, after seeing each training example, just run, essentially, one step of stochastic gradient descent on just the example you saw. Okay?

And so the reason I've put this into the sort of "learning theory" section of this class is because it turns out that sometimes you can prove fairly amazing results on your total online error using algorithms like these. I don't actually want to spend the time in the main lecture to prove this, but, for example, you can prove that when you use the perceptron algorithm, then even when the features x^(i) are maybe infinite-dimensional feature vectors, like we saw for support vector machines - and sometimes you may use infinite-dimensional feature vectors via kernel representations. Okay? It turns out that you can prove that when you use the perceptron algorithm, even when the data is maybe extremely high-dimensional and it seems like you'd be prone to overfitting, right, so long as the positive and negative examples are separated by a margin - so in this infinite-dimensional space, so long as, you know, there is some margin down there separating the positive and negative examples - you can prove that the perceptron algorithm will converge to a hypothesis that perfectly separates the positive and negative examples. Okay? And so after seeing only a finite number of examples, it'll converge to a decision boundary that perfectly separates the positive and negative examples, even though you may be in an infinite-dimensional space. Okay? So let's see.
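Here's a short sketch of the online protocol with a perceptron learner. This isn't from the lecture itself, and the {-1, +1} label convention with a mistake-driven update is just one standard formulation of the rule.

```python
import numpy as np

def online_perceptron(stream):
    """Run the perceptron in the online setting and count the total online
    error. `stream` yields (x, y) pairs with labels y in {-1, +1}."""
    theta = None
    mistakes = 0
    for x, y in stream:
        if theta is None:
            theta = np.zeros_like(x)          # initialize theta = 0
        y_hat = 1 if theta @ x >= 0 else -1   # predict before the label is revealed
        if y_hat != y:
            mistakes += 1                     # total online error counts mistakes
            theta = theta + y * x             # update only when we got it wrong
    return mistakes
```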
The proof itself would take me almost an entire lecture to do, and there are other things that I want to do more than that. So if you want to see the proof of this yourself, it's actually written up in the lecture notes that I posted online. For the purposes of this class's syllabus, you can treat the proof of this result as optional reading. And by that I mean, you know, it won't appear on the midterm and you won't be asked about it specifically in the problem sets, but I thought it'd be - I know some of you were curious after the previous lecture about why you can prove that, you know, SVMs can have bounded VC dimension even in these infinite-dimensional spaces, and how you prove learning theory results in these infinite-dimensional feature spaces. And so the perceptron bound that I just talked about is the simplest instance I know of that you can sort of read in like half an hour and understand. So if you're interested, there are lecture notes online for how this perceptron bound is actually proved. It's a very [inaudible]; you can prove it in like a page or so, so go ahead and take a look at that if you're interested. Okay?

But regardless of the theoretical results, you know, the online learning setting is something that comes up reasonably often, and so these algorithms based on stochastic gradient descent often work very well. Okay, any questions about this before I move on?

All right. Cool. So the last thing I want to do today - and this was going to be the majority of today's lecture; actually, can I switch to the PowerPoint slides, please? - is I actually want to spend most of today's lecture sort of talking about advice for applying different machine learning algorithms. And so, you know, right now you already have, I think, a good understanding of really the most powerful tools known to humankind in machine learning. Right?
And what I want to do today is give you some advice on how to apply them really powerfully. Because, you know, the same tool - it turns out that you can take the same machine learning tool, say logistic regression, and you can ask two different people to apply it to the same problem. And sometimes one person will do an amazing job and it'll work amazingly well, and the second person will sort of not really get it to work, even though it was exactly the same algorithm. Right? And so what I want to do today, in the rest of the time I have, is try to convey to you, you know, some of the methods for how to make sure you're one of the people who really knows how to get these learning algorithms to work well on problems.

So, just some caveats on what I'm gonna talk about in the rest of today's lecture. What I want to talk about is actually not very mathematical, but it's also some of the hardest, conceptually most difficult material in this class to understand. All right? So this is not mathematical, but this is not easy. And I want to add this caveat: some of what I'll say today is debatable. I think most good machine learning people will agree with most of what I say, but maybe not everything I say. And some of what I'll say is also not good advice for doing machine learning research either; I'll say more about this later. What I'm focusing on today is advice for how to just get stuff to work: if you work in a company and you want to deliver a product, or you're, you know, building a system and you just want your machine learning system to work. Okay? Some of what I'm about to say isn't great advice if your goal is to invent a new machine learning algorithm, but this is advice for how to make a machine learning algorithm work and, you know, deploy a working system.

So, three key areas I'm gonna talk about. One: diagnostics for debugging learning algorithms. Second: I'll talk briefly about error analysis and ablative analysis. And third, I want to talk about advice for how to get started on a machine learning problem.
And one theme that'll come up later is - you've heard about premature optimization, right, in writing software? This is when someone over-designs from the start, when someone, you know, is writing a piece of code and they choose a subroutine to optimize heavily, and maybe they write the subroutine in assembly or something. Many of us have been guilty of premature optimization, where we're trying to get a piece of code to run faster, and we pick a piece of code, implement it in assembly, and really tune it and get it to run really quickly - and it turns out that wasn't the bottleneck in the code at all. Right? And we call that premature optimization. And in undergraduate programming classes, we warn people all the time not to do premature optimization, and people still do it all the time. Right? And it turns out a very similar thing happens in building machine learning systems: many people are often guilty of what I call premature statistical optimization, where they heavily optimize one part of a machine learning system and that turns out not to be the important piece. Okay? So I'll talk about that later as well.

So let's first talk about debugging learning algorithms. As a motivating example, let's say you want to build an anti-spam system, and let's say you've carefully chosen, you know, a small set of 100 words to use as features. All right? So instead of using 50,000 words, you've chosen a small set of 100 features to use for your anti-spam system. And let's say you implement Bayesian logistic regression, implement gradient descent, and you get 20 percent test error, which is unacceptably high. Right? So this is Bayesian logistic regression:

max_theta sum_{i=1}^{m} log p(y^(i) | x^(i), theta) - lambda ||theta||^2

and so it's just like maximum likelihood but, you know, with that additional lambda-norm-of-theta-squared term. And we're maximizing rather than minimizing, so there's a minus lambda theta squared instead of a plus lambda theta squared. So the question is: you implemented your Bayesian logistic regression algorithm, you tested it on your test set, and you got unacceptably high error. So what do you do next? Right?
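For concreteness, here's a minimal sketch of the objective just described, not the system from the story: the training loop, step size, and function names are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def objective_and_grad(theta, X, y, lam):
    """Bayesian logistic regression objective: log-likelihood minus
    lam * ||theta||^2 (maximized, hence the minus sign on the penalty)."""
    p = sigmoid(X @ theta)
    eps = 1e-12                                # avoid log(0)
    loglik = np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    return loglik - lam * theta @ theta, X.T @ (y - p) - 2 * lam * theta

def train(X, y, lam=1.0, alpha=0.01, iters=500):
    theta = np.zeros(X.shape[1])
    for _ in range(iters):                     # plain gradient ascent
        _, grad = objective_and_grad(theta, X, y, lam)
        theta = theta + alpha * grad
    return theta
```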
So, you know, one thing you could do is think about the ways you could improve this algorithm. And this is probably what most people will do, instead of saying, "Well, let's sit down and figure out what actually went wrong, and then fix that."

Well, obviously having more training data could only help, so one thing you can do is try to get more training examples. Maybe you suspect that even 100 features were too many, so you might try to get a smaller set of features. What's more common is you might suspect your features aren't good enough, so you might spend some time looking at the email headers and see if you can figure out better features for, you know, finding spam emails or whatever - so you sit around and come up with better features, such as from the email headers. You may also suspect that gradient descent hasn't quite converged yet, so let's try running gradient descent a bit longer to see if that works - and clearly that can't hurt, right, just run gradient descent longer. Or maybe you remember hearing from class that maybe Newton's method converges better, so let's try that instead. You may want to tune the value for lambda, because you're not sure that was the right choice, or maybe you even want to try an SVM, because maybe you think an SVM might work better than logistic regression.

So I only listed eight things here, but you can imagine, if you were actually sitting down building a machine learning system, the options available to you are endless. You can think of, you know, hundreds of ways to improve a learning system. And some of these things - like, well, getting more training examples, surely that's gonna help, so that seems like a good use of your time. Right?

And it turns out that this approach of picking ways to improve the learning algorithm, picking one, and going for it might work, in the sense that it may eventually get you to a working system. But often it's very time-consuming, and I think it's often largely a matter of luck whether you end up fixing what the problem is. In particular, these eight improvements all fix very different problems.
And some of them will be fixing problems that you don't have. And if you can rule out six of these eight, say - if by somehow looking at the problem more deeply, you can figure out which one of these eight things is actually the right thing to do - you can save yourself a lot of time. So let's see how we can go about doing that.

The people in industry and in research that I see who are really good would not go and try to change a learning algorithm randomly. There are lots of things that could obviously improve your learning algorithm, but the problem is there are so many of them it's hard to know what to do. So you find that all the really good people run various diagnostics to figure out where the problem is, and then fix exactly that. Okay?

So, for our motivating story, right, we said: let's say Bayesian logistic regression test error was 20 percent, which let's say is unacceptably high. And let's suppose you suspected the problem is either overfitting, so it's high bias, or - oh, excuse me, I think I wrote that wrong. Let's forget the labels for a second. Suppose you suspect the problem is either high bias or high variance. You want to know if you're overfitting, which would be high variance, or if you have too few features to classify spam with, which would be high bias. I had those two reversed, sorry. Okay?

So how can you figure out whether the problem is one of high bias or high variance? Right? It turns out there's a simple diagnostic you can look at that will tell you whether the problem is high bias or high variance. If you remember the cartoon we'd seen previously for high variance problems: when you have high variance, the training error will be much lower than the test error. All right? When you have a high variance problem, that's when you're fitting your training set very well.
That's when you're fitting, you know, a tenth-order polynomial to 11 data points. All right? That's when you're fitting the data set very well, and so your training error will be much lower than your test error. And in contrast, if you have high bias, that's when your training error will also be high. Right? That's when your data is quadratic, say, but you're fitting a linear function to it, and so you aren't even fitting your training set well.

So, just in cartoons, I guess, this is what a typical learning curve for high variance looks like. On the horizontal axis I'm plotting the training set size m, right, and on the vertical axis I'm plotting the error. And so, let's see: if you have a high variance problem, you'll notice that as the training set size m increases, your test set error will keep on decreasing. And so this sort of suggests that, well, if you can increase the training set size even further - maybe if you extrapolate the green curve out - maybe that test set error will decrease even further. All right?

Another thing that's useful to plot here - let's say the red horizontal line is the desired performance you're trying to reach - another useful thing to plot is actually the training error. Right? And it turns out that your training error will actually grow as a function of the training set size, because the larger your training set, the harder it is to fit, you know, your training set perfectly. Right? So this is just a cartoon, don't take it too seriously, but in general your training error will actually grow as a function of your training set size. Because with small training sets - if you have one data point, it's really easy to fit that perfectly, but if you have 10,000 data points, it's much harder to fit them all perfectly. All right?

And so another diagnostic for high variance, and the one that I tend to use more, is to just look at training versus test error. And if there's a large gap between them, then this suggests that, you know, getting more training data may help you close that gap. Okay?
So this is what the cartoon would look like in the case of high variance. And this is what the cartoon looks like for high bias. Right? If you look at the learning curve, you see that the curve for test error has flattened out already. And so this is a sign that, you know, if you get more training examples - if you extrapolate this curve further to the right - it's maybe not likely to go down much further. And this is a property of high bias: getting more training data won't necessarily help.

But again, to me the more useful diagnostic is if you plot the training error as well - if you look at your training error as well as your, you know, held-out test set error. If you find that even your training error is high, then that's a sign that getting more training data is not going to help. Right? In fact, you know, think about it: training error grows as a function of your training set size. And so if your training error is already above your level of desired performance, then getting even more training data is not going to reduce your training error down to the desired level of performance. Right? Because, you know, your training error sort of only gets worse as you get more and more training examples. So if you extrapolate further to the right, it's not like this blue line will come back down to the level of desired performance; it will stay up there. Okay?

So for me personally, when looking at a curve like the green curve on test error, I actually tend to find it very difficult to tell if the curve is still going down or if it's flattened out. Sometimes you can tell, but very often it's somewhat ambiguous. So for me personally, the diagnostic I tend to use most often to tell if I have a bias problem or a variance problem is to look at training and test error and see if they're very close together or if they're relatively far apart. Okay?
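Here's what that diagnostic looks like as code - a sketch rather than anything from the slides, where `train_fn` and `error_fn` are placeholder hooks for whatever learner you happen to be debugging.

```python
def learning_curve(train_fn, error_fn, X, y, X_test, y_test, sizes):
    """For each training set size m, fit on the first m examples and record
    (m, training error, test error). A large gap that is still closing
    suggests high variance; both errors high and flat suggests high bias."""
    curves = []
    for m in sizes:
        model = train_fn(X[:m], y[:m])
        train_err = error_fn(model, X[:m], y[:m])
        test_err = error_fn(model, X_test, y_test)
        curves.append((m, train_err, test_err))
    return curves
```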
And so, going back to the list of fixes: look at the first fix - getting more training examples is a way to fix high variance. Right? If you have a high variance problem, getting more training examples will help. Trying a smaller set of features: that also fixes high variance. All right? Trying a larger set of features, or adding email features: these are solutions that fix high bias. Right? High bias being: your hypothesis was too simple, you didn't have enough features. Okay?

And so, quite often you see people working on machine learning problems, and they'll remember that getting more training examples helps. And so they'll build a learning system - build an anti-spam system - and it doesn't work, and then they go off and spend lots of time and money and effort collecting more training data, because they'll say, "Oh well, getting more data's obviously got to help." But if they had a high bias problem in the first place, and not a high variance problem, it's entirely possible to spend three months or six months collecting more and more training data, not realizing that it couldn't possibly help. Right?

And this actually happens a lot in, you know, Silicon Valley and companies; this happens a lot. There will often be people building various machine learning systems, and you often see people spending six months working on fixing a learning algorithm when you could've told them six months ago that, you know, that fix couldn't possibly help. But because they didn't know what the problem was, they'd easily spend six months trying to invent new features or something. And you see this surprisingly often, and it's somewhat depressing. You could've gone to them and told them, "I could've told you six months ago that this was not going to help." And the six months is not a joke; you actually see this.

And in contrast, if you actually figure out whether the problem is one of high bias or high variance, then you can rule out two of these solutions and save yourself many months of fruitless effort. Okay?
I actually want to talk about these four at the bottom as well. But before I move on, let me just check if there were questions about what I've talked about so far. No? Okay, great.

So bias versus variance is one thing that comes up often; it's one common diagnostic. For other machine learning problems, it's often up to your own ingenuity to figure out your own diagnostics, to figure out what's wrong. All right? So if a machine learning algorithm isn't working, very often it's up to you to construct your own tests. Do you look at the difference between training and test errors, or do you look at something else? It's often up to your own ingenuity to construct your own diagnostics to figure out what's going on.

What I want to do is go through another example. This one is slightly more contrived, but it'll illustrate another common question that comes up, another one of the most common issues in applying learning algorithms.

So in this example, let's say you implement Bayesian logistic regression and you get 2 percent error on spam mail and 2 percent error on non-spam mail. So it's rejecting 98 percent of your spam, meaning 2 percent of all spam gets through, which is fine; but it's also rejecting 2 percent of your good email, 2 percent of the email from your friends, and that's unacceptably high, let's say.

And let's say that a support vector machine using a linear kernel gets 10 percent error on spam and 0.01 percent error on non-spam, which is more the acceptable performance you want. And let's say, for the sake of this example, you're trying to build an anti-spam system, and you really want to deploy logistic regression to your customers, because of computational efficiency, or because you need to retrain overnight every day, and logistic regression just runs more easily and more quickly, or something. Okay? So let's say you want to deploy logistic regression, but it's just not working out well. So the question is: what do you do next?
So it turns out that the issue that comes up here, the one other common question, is whether the algorithm is converging. You might suspect that maybe the problem with logistic regression is that it's just not converging; maybe you need to run more iterations. And it turns out that if you look at the optimization objective (say logistic regression is optimizing J of theta) and you plot the objective as a function of the number of iterations, the curve sort of looks like it's going up, but it also sort of looks like there are asymptotes. And when you look at these curves, it's often very hard to tell if the curve has already flattened out. All right? You look at these curves a lot, and when you look at J of theta like this, it's often hard to tell whether the algorithm has converged. You can run it ten times as long to see if it's flattened out, and it'll often still look like maybe it's going up very slowly, or something. Right? So you'd like a better diagnostic for whether logistic regression has converged than looking at this curve.

The other thing you might suspect is a problem is: are you optimizing the right function? So what you care about in spam, say, is a weighted accuracy function like that. So a of theta is a sum over your examples of some weights times whether you got that example right. And the weight may be higher for non-spam than for spam mail, because you care about getting your predictions correct on non-spam email, the email from your friends, much more than on spam, say. So let's say a of theta is the objective that you really care about, but what Bayesian logistic regression optimizes is a quantity like that: it's this sort of maximum likelihood thing, and then with this two-norm penalty term that we saw previously. And you might be wondering: is this the right optimization function to be optimizing?
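For concreteness, here is one way to write down the two objectives just described, as I understand them from the lecture; the weights $w^{(i)}$ and the regularization weight $\lambda$ are whatever you choose:

$$a(\theta) = \sum_i w^{(i)} \, 1\{\, h_\theta(x^{(i)}) = y^{(i)} \,\}$$

$$J(\theta) = \sum_i \log p\big(y^{(i)} \mid x^{(i)}; \theta\big) \;-\; \lambda \|\theta\|^2$$

Here $a(\theta)$ is the weighted accuracy you actually care about, and $J(\theta)$ is the regularized log-likelihood that Bayesian logistic regression maximizes.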
Or do I maybe need to change the value for lambda, to change that regularization parameter? Or should I maybe really be switching to the support vector machine optimization objective? Okay? Does that make sense?

So the second diagnostic I'm gonna talk about is for when you want to figure out: is the optimization algorithm converging, or is the problem with the optimization objective I chose in the first place? Okay?

So here's the diagnostic you can use. Just to reiterate the story: let's say an SVM outperforms Bayesian logistic regression, but you really want to deploy Bayesian logistic regression for your problem. Let me let theta subscript SVM be the parameters learned by the SVM, and I'll let theta subscript BLR be the parameters learned by Bayesian logistic regression.

So the optimization objective you care about is this weighted accuracy criterion that I talked about just now. And the support vector machine outperforms Bayesian logistic regression, so the weighted accuracy of the support vector machine's parameters is better than the weighted accuracy of Bayesian logistic regression's: a of theta SVM is greater than a of theta BLR.

Further, Bayesian logistic regression tries to optimize an optimization objective, which I denoted J of theta. And so the diagnostic I choose to use is to see whether J of theta SVM is bigger than or less than J of theta BLR. Okay? I'll explain this on the next slide.

So we know one fact: the weighted accuracy of the support vector machine is bigger than the weighted accuracy of Bayesian logistic regression. And in order to figure out whether Bayesian logistic regression has failed to converge, or whether I'm just optimizing the wrong objective function, the diagnostic I'm gonna use is to check which way the inequality between J of theta SVM and J of theta BLR goes. Okay? So let me explain this. In Case 1, it's just those two relations copied over.
In Case 1, let's say that J of theta SVM is greater than J of theta BLR. But we know that Bayesian logistic regression was trying to maximize J of theta; that's the definition of Bayesian logistic regression. So this means that the value of theta output by Bayesian logistic regression actually fails to maximize J, because the support vector machine returned a value of theta that does a better job of maximizing J. And so this tells me that Bayesian logistic regression didn't actually maximize J correctly, and the problem is with the optimization algorithm: the optimization algorithm hasn't converged.

The other case is as follows: J of theta SVM is less than or equal to J of theta BLR. Okay? In this case, what does that mean? This means that Bayesian logistic regression actually attains the higher value for the optimization objective J than the support vector machine does. But the support vector machine, which does worse on your optimization objective, actually does better on the weighted accuracy measure. So something that does worse on your optimization objective, on J, can actually do better on the weighted accuracy objective. And this really means that maximizing J of theta doesn't correspond that well to maximizing your weighted accuracy criterion, and therefore that J of theta is maybe the wrong optimization objective to be maximizing. Maximizing J of theta just wasn't a good objective to be choosing if you care about the weighted accuracy. Okay?

Can you raise your hand if this made sense? Cool, good.
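In code, the whole diagnostic is just one comparison. Here is a minimal sketch, under the assumption that you can evaluate both a(theta) and J(theta) on the two learned parameter vectors; all of the names below are hypothetical:

```python
# a(theta): weighted accuracy, higher is better (the thing you care about).
# J(theta): the objective Bayesian logistic regression maximizes.

def diagnose(theta_svm, theta_blr, a, J):
    # Premise of the story: the SVM wins on the metric you actually care about.
    assert a(theta_svm) > a(theta_blr)
    if J(theta_svm) > J(theta_blr):
        # BLR was supposed to maximize J, yet the SVM's parameters score
        # higher on J, so BLR never actually maximized it.
        return "optimization ALGORITHM: it hasn't converged"
    else:
        # BLR does maximize J, yet still loses on weighted accuracy, so
        # maximizing J doesn't track what you care about.
        return "optimization OBJECTIVE: change lambda or switch objectives"
```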
So that tells us whether the problem is with the optimization algorithm or with the optimization objective. And going back to this slide, the eight fixes we had: you notice that running gradient descent for more iterations fixes the optimization algorithm, and trying Newton's method also fixes the optimization algorithm, whereas using a different value for lambda (in that lambda times norm of theta squared term in your objective) fixes the optimization objective. And changing to an SVM is another way of trying to fix the optimization objective. Okay?

And so once again, you actually see this quite often: people will have a problem with the optimization objective and be working harder and harder to fix the optimization algorithm. That's another very common pattern. The problem is in the formula for your J of theta, and yet you see people just running more and more iterations of gradient descent, then trying Newton's method, then trying conjugate gradient, and then trying more and more exotic optimization algorithms, when optimizing J of theta better wasn't going to fix the problem at all. Okay? So there's another example of how these sorts of diagnostics will help you figure out whether you should be fixing your optimization algorithm or fixing the optimization objective.

Okay, let me think how much time I have. Hmm, let's see. Well okay, we have time. Let's do this: I'll show you one last example of a diagnostic. This is one that came up in my students' and my work on flying helicopters. This example is the most complex of the three I'm gonna do today, and I'll go through it somewhat quickly. It actually draws on reinforcement learning, which is something I'm not gonna talk about until close to the end of the course, but this is just a more complicated example of a diagnostic.
What I'll do is probably go over this fairly quickly, and then after we've talked about reinforcement learning in the class, I'll probably come back and redo this exact same example, because you'll understand it more deeply. Okay?

So some of you know that my students and I fly autonomous helicopters. So how do you get a machine learning algorithm to design the controller for a helicopter? This is what we do. The first step is to build a simulator for the helicopter; there's a screenshot of our simulator. It's like a joystick simulator; you can fly a helicopter in simulation. Then you choose a cost function (in reinforcement learning it's usually called a reward function, but for this I'll call it a cost function). Say J of theta is the expected squared error in your helicopter's position. Okay? So J of theta is maybe the expected squared error, or just the squared error.

And then you run a reinforcement learning algorithm (you'll learn about RL algorithms in a few weeks) in your simulator, to try to minimize this cost function, to try to minimize the squared error in how well you're controlling your helicopter's position. Okay? The reinforcement learning algorithm will output some parameters, which I'm denoting theta subscript RL, and then you'll use that to fly your helicopter.

So suppose you run this learning algorithm and you get out a set of controller parameters, theta subscript RL, that gives much worse performance than a human pilot. Then what do you do next? In particular, corresponding to the three steps above, there are three natural things you can try. (Oh, the bottom of the slide got chopped off.) You can try to improve the simulator: maybe you think your simulator isn't that accurate and you need to capture the aerodynamic effects more accurately, capture the airflow and turbulence effects around the helicopter more accurately. Maybe you need to modify the cost function; maybe your squared error isn't cutting it.
Maybe what a human pilot does isn't just optimizing squared error but something more subtle. Or maybe the reinforcement learning algorithm isn't working; maybe it's not quite converging, or something. Okay? So these are the diagnostics that my students and I actually use to figure out what's going on.

Actually, why don't you just think about this for a second, think about what you'd do, and then I'll go on and tell you what we do.

All right, so let me tell you how we do this, and you can see whether it's the same as your answer or not. And if you have a better idea than I do, let me know and I'll let you try it on my helicopter.

So here's the reasoning. Let's say the controller output by our reinforcement learning algorithm does poorly. Now suppose the contrary: suppose the following three things hold true. Suppose the helicopter simulator is accurate, so we have an accurate model of our helicopter. Suppose the reinforcement learning algorithm correctly controls the helicopter in simulation (we run the learning algorithm in simulation so that it can crash the helicopter and it's fine, right), so assume our reinforcement learning algorithm correctly controls the helicopter so as to minimize the cost function J of theta. And suppose that minimizing J of theta does indeed correspond to correct autonomous flight.

If all of these things held true, then the parameters theta RL should actually fly well on my real helicopter. Right? And so the fact that the learned control parameters theta RL do not fly well on my helicopter means that one of these three assumptions must be wrong, and I'd like to figure out which of the three it is. Okay? So these are the diagnostics we use.
The first one is we look at the controller and see if it even flies well in simulation, in the simulator of the helicopter that we did the learning on. If the learned controller flies well in the simulator but doesn't fly well on my real helicopter, then that tells me the problem is probably in the simulator. Right? My simulator predicts that the controller will fly well, but it doesn't actually fly well in real life, so the problem could be in the simulator, and we should spend our efforts improving the accuracy of our simulator.

Otherwise, let me write theta subscript human for the human control policy. All right? So let's go ahead and ask a human to fly the helicopter (it could be in the simulator, it could be in real life), and let's measure the mean squared error of the human pilot's flight. And let's see if the human pilot does better or worse than the learned controller in terms of optimizing this objective function J of theta. Okay?

So if even a very good human pilot attains a better value, meaning a lower value, on my optimization objective, on my cost function, than my learning algorithm does, then the problem is in the reinforcement learning algorithm. Because my reinforcement learning algorithm was trying to minimize J of theta, but a human actually attains a lower value for J of theta than my algorithm does. And so that tells me that clearly my algorithm isn't managing to minimize J of theta, and the problem is in the reinforcement learning algorithm.

And finally, suppose the human actually attains a larger value for J of theta; the human actually has larger mean squared error for the helicopter position than my reinforcement learning algorithm does, and yet I like the way the human flies much better than my reinforcement learning algorithm.
So if that holds true, then clearly the problem is in the cost function, right, because the human does worse on my cost function but flies much better than my learning algorithm. (Oh, excuse me, I meant minimizing it, not maximizing it; there's a typo on the slide.) That means my learning algorithm does a better job of minimizing the cost function, yet doesn't fly as well as a human pilot. So minimizing the cost function doesn't correspond to good autonomous flight, and what you should do is go back and see if you can change J of theta. Okay?

And so for these reinforcement learning problems, if something doesn't work (often reinforcement learning algorithms just work, but when they don't), these are the sorts of diagnostics you use to figure out whether you should be focusing on the simulator, on changing the cost function, or on changing the reinforcement learning algorithm. And again, if you don't know which of the three problems it is, it's entirely possible to spend two years, whatever, building a better simulator for your helicopter. But it turns out that modeling helicopter aerodynamics is an active area of research; there are people writing entire PhD theses on this still. So it's entirely possible to go out and spend six years, write a PhD thesis, and build a much better helicopter simulator, but if you're fixing the wrong problem, it's not gonna help.

So quite often, you need to come up with your own diagnostics to figure out what's happening in an algorithm when something is going wrong. What I've described are maybe some of the most common diagnostics, ones that I've used and seen be useful for many problems. But very often, you'll need to come up with your own for your own specific learning problem.
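Putting the three checks from the last few paragraphs together, here is a minimal sketch of the decision logic; the predicates and parameter vectors are hypothetical stand-ins, since the real checks involve actual simulator runs and test flights:

```python
# J is the cost to be MINIMIZED; theta_rl and theta_human are the learned
# controller and the human control policy, respectively.

def helicopter_diagnostic(theta_rl, theta_human, J,
                          flies_well_in_sim, flies_well_in_real):
    if flies_well_in_sim(theta_rl) and not flies_well_in_real(theta_rl):
        # Simulation says the controller is good but reality disagrees.
        return "problem is the SIMULATOR: improve its accuracy"
    if J(theta_human) < J(theta_rl):
        # A human beats the optimizer at the optimizer's own objective,
        # so the optimizer never actually minimized J.
        return "problem is the RL ALGORITHM: it isn't minimizing J"
    # The algorithm minimizes J at least as well as the human, yet the human
    # flies better, so low cost doesn't correspond to good flight.
    return "problem is the COST FUNCTION: change J"
```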
And I just want to point out that even when the learning algorithm is working well, it's often a good idea to run diagnostics like the ones I talked about, to make sure you really understand what's going on. All right? And this is useful for a couple of reasons. One is that diagnostics like these will often help you understand your application problem better.

So some of you will graduate from Stanford and go on to get some amazingly high-paying job applying machine learning algorithms to some application problem of significant economic interest. Right? And you're gonna be working on one specific, important machine learning application for many months, or even for years. One of the most valuable things for you personally will be to build an intuitive understanding of what works and what doesn't work for your problem. Right now in the industry, in Silicon Valley and around the world, there are many companies with important machine learning problems, and there are often people working on the same machine learning problem for many months or years on end. And when you're doing that, solving a really important problem using learning algorithms, one of the most valuable things is just your own personal, intuitive understanding of the problem. Okay? And diagnostics, like the sort I talked about, will be one way for you to get a better and better understanding of these problems.

It turns out, by the way, there are some Silicon Valley companies that outsource their machine learning. So there will be a company in Silicon Valley, and they'll hire a firm in New York to run all their learning algorithms for them. And I'm not a businessman, but I personally think that's often a terrible idea, because if your expertise, your understanding of your data, is handed over to an outsourcing agency, then you don't maintain that expertise. If there's a problem you really care about, it'll be your own understanding of the problem, built up over months, that'll be really valuable.
And if that knowledge is outsourced, you don't get to keep that knowledge yourself. I personally think that's a terrible idea. But I'm not a businessman; I just see people do that a lot. Let's see.

Another reason for running diagnostics like these is in writing research papers. Diagnostics and error analyses, which I'll talk about in a minute, often help to convey insight about the problem and help justify your research claims. So for example, rather than writing a research paper that says, "Oh well, here's an algorithm that works. I built this helicopter and it flies," or whatever, it's often much more interesting to say, "Here's an algorithm that works, and it works because of a specific component X. And moreover, here's the diagnostic that justifies the claim that X was the thing that fixed this problem and made it work." Okay?

So that leads me into a discussion of error analysis, which is often good machine learning practice: a way of understanding what your sources of error are. So this is what I call error analysis. And let me check for questions about this first. Yeah?

Student: What ended up being wrong with the helicopter?

Instructor (Andrew Ng): Oh, I don't know. Let's see, we've flown so many times. The thing that is most difficult with a helicopter, I don't know, it changes all the time. Quite often, it's actually the simulator; building an accurate simulator of a helicopter is very hard. Yeah. Okay.

So error analysis is a way of figuring out what is working in your algorithm and what isn't working. And we're gonna talk about two specific examples. So there are many AI systems, many machine learning systems, that combine many different components into a pipeline. Here's a somewhat contrived example of this, not dissimilar in many ways from actual machine learning systems you see. Let's say you want to recognize people from images. This is a picture of one of my friends.
So you take this input camera image, say, and you often run it through a long pipeline. For example, the first thing you may do is preprocess the image and remove the background. And then you run a face detection algorithm, a machine learning algorithm to detect people's faces. And then, let's say you want to recognize the identity of the person; that's your application. You then segment out the eyes, segment out the nose, and have different learning algorithms to detect the mouth and so on. (I know; she might not want to be my friend after she sees this.) And then, having found all these features, based on what the nose looks like, what the eyes look like, and so on, you feed all the features into a logistic regression algorithm. And your logistic regression, or softmax regression, or whatever, will tell you the identity of this person. Okay?

So this is what error analysis is for. You have a long, complicated pipeline combining many machine learning components, many of them learned. And so it's often very useful to figure out how much of your error can be attributed to each of these components.

So what we'll do in a typical error analysis procedure is repeatedly plug in the ground truth for each component and see how the accuracy changes. What I mean by that is, in the figure on the bottom right, let's say the overall accuracy of the system is 85 percent. Then I want to know where my 15 percent of error comes from. So what I'll do is go to my test set and, instead of running my own background removal, plug in the correct background removal: I go in and give my algorithm the correct background-versus-foreground segmentation. And if I do that (let's color that blue to denote that I'm giving it ground-truth data on the test set), let's assume our accuracy increases to 85.1 percent. Okay?
And now I'll go in and give my algorithm the ground-truth face detection output; on my test set, I'll just tell the algorithm where the face is. And if I do that, let's say my algorithm's accuracy increases to 91 percent, and so on. And then I'll go through each of these components and give each one the ground-truth label. For instance, the nose segmentation algorithm is trying to figure out where the nose is; I just go in and tell it where the nose is, so it doesn't have to figure that out. And as I do this, one component after another, I end up giving everything the correct output labels and end up with 100 percent accuracy. (Sorry, this table is cut off on the bottom; it says logistic regression, 100 percent.)

Now you can look at this table and see how much giving the ground-truth labels for each of these components could help boost your final performance. In particular, if you look at this table, you notice that when I added the face detection ground truth, my performance jumped from 85.1 percent accuracy to 91 percent accuracy. Right? So this tells me that if only I could get better face detection, maybe I could boost my accuracy by 6 percent. Whereas in contrast, when I plugged in better background removal, my accuracy improved only from 85 to 85.1 percent. And so this sort of diagnostic also tells you that if your goal is to improve the system, it's probably a waste of your time to try to improve your background subtraction: even with the ground truth, it gives you at most 0.1 percent accuracy, whereas with better face detection, maybe there's a much larger potential for gains. Okay?

So this sort of diagnostic, again, is very useful, because if your goal is to improve the system, there are so many different pieces you could easily choose to spend the next three months on. Right?
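As a sketch, the whole procedure is a loop that swaps in ground truth for one more component each time; the stage names and the evaluate function below are hypothetical placeholders, not the lecture's code:

```python
# `stages` is an ordered list of component names, e.g.
# ["background removal", "face detection", "eyes segmentation", ...].
# `evaluate(test_set, use_ground_truth=...)` is assumed to run the pipeline,
# replacing the listed components' outputs with annotated ground truth.

def error_analysis(stages, test_set, evaluate):
    rows = [("baseline (no ground truth)",
             evaluate(test_set, use_ground_truth=[]))]
    for k in range(1, len(stages) + 1):
        # Components stages[0:k] emit ground truth; the rest run for real.
        acc = evaluate(test_set, use_ground_truth=stages[:k])
        rows.append((stages[k - 1], acc))
    # The accuracy jump from one row to the next is the headroom you could
    # gain by perfecting that row's component.
    return rows
```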
And choosing the right piece is critical, and this sort of diagnostic tells you which piece may actually be worth your time to work on.

There's another type of analysis that's sort of the opposite of what I just talked about. The error analysis I just described tries to explain the difference between the current performance and perfect performance, whereas this sort of ablative analysis tries to explain the difference between some baseline, some really bad performance, and your current performance.

So for this example, let's suppose you've built a very good anti-spam classifier by adding lots of clever features to your logistic regression algorithm. So you added features for spelling correction, sender host features, email header features, email text parser features, JavaScript parser features, features for embedded images, and so on. Now let's say you review the system and you want to figure out how much each of these components actually contributed. Maybe you want to write a research paper and claim this was the piece that made the big difference; can you actually document that claim and justify it?

So in ablative analysis, here's what we do. In this example, let's say that simple logistic regression, without any of your clever improvements, gets 94 percent performance, and you want to figure out what accounts for your improvement from 94 to 99.9 percent performance. In ablative analysis, instead of adding components one at a time, we remove components one at a time and see how performance degrades. So you start with your overall system, which has 99.9 percent accuracy. Then you remove spelling correction and see how much performance drops. Then you remove the sender host features and see how much performance drops, and so on. All right?

And so, in this contrived example, you see that the biggest drop occurred when you removed the text parser features. And so you can then make a credible case that the text parser features were what really made the biggest difference here. Okay?
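Here is the mirror-image loop as a minimal sketch, again with hypothetical names; train_and_score stands in for whatever retrains your classifier on a subset of the feature groups:

```python
# `feature_groups` is an ordered list of feature-group names, e.g.
# ["spelling correction", "sender host", "email header", "text parser", ...].
# `train_and_score(groups)` is assumed to train on those groups only and
# return test accuracy.

def ablative_analysis(feature_groups, train_and_score):
    rows = [("full system", train_and_score(list(feature_groups)))]
    remaining = list(feature_groups)
    for group in feature_groups:
        remaining.remove(group)  # remove one group at a time, in this order
        rows.append((f"without {group}", train_and_score(list(remaining))))
    # The biggest drop between consecutive rows is your best evidence for
    # which feature group mattered most (with the ordering caveat discussed
    # right after this).
    return rows
```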
And you can also tell, for instance, that removing the sender host features (on this line, performance dropped from 99.9 to 98.9) means that in case you want to get rid of the sender host features to speed up computation or something, that would be a good candidate for elimination. Okay?

Student: Are there any guarantees that if you shuffle around the order in which you drop those features, you'll get the same result?

Instructor (Andrew Ng): Yeah, let's address the question: what if you shuffle the order in which you remove things? The answer is no; there's no guarantee you'd get a similar result. So in practice, sometimes there's a fairly natural ordering for both types of analyses, the error analysis and the ablative analysis, in which you add things or remove things, and sometimes there isn't. And quite often, you either choose one ordering and just go for it, or... And don't think of these analyses as formulas that are set in stone; feel free to invent your own as well. One of the things that's done quite often is to take the overall system and just remove one component and then put it back, then remove a different one and put it back, until all of these are done. Okay.

So the very last thing I want to talk about is some general advice for how to get started on a learning problem. Here's a cartoon description of two broad ways to get started on a learning problem. The first one is to carefully design your system: you spend a long time designing exactly the right features, collecting the right data set, and designing the right algorithmic structure, and then you implement it and hope it works. All right? The benefit of this sort of approach is that you get maybe nicer, more scalable algorithms, and maybe you come up with new, elegant learning algorithms.
And if your goal is to contribute to basic research in machine learning, if your goal is to invent new machine learning algorithms, this process of slowing down and thinking deeply about the problem is sort of the right way to go about it: think deeply about a problem and invent new solutions.

The second sort of approach is what I call build-and-fix: you implement something quick and dirty, and then you run error analyses and diagnostics to figure out what's wrong, and you fix those errors. The benefit of this second type of approach is that it'll often get your application working much more quickly. And especially for those of you who end up working in a company: very often it's not the best product that wins; it's the first product to market that wins. So, especially in industry, there's really something to be said for building a system quickly and getting it deployed quickly. And the second approach, building a quick-and-dirty (I'm gonna say hack) and then fixing the problems, will actually get you to a system that works well much more quickly.

And the reason is that very often it's really not clear which parts of a system are easy or hard to build, and therefore which parts you need to spend lots of time focusing on. So take that example I talked about just now, identifying people. With a big, complicated learning system like this, a big, complicated pipeline, it's really not obvious at the outset which of these components you should spend lots of time working on. And if you didn't know that preprocessing wasn't the right component, you could easily have spent three months working on better background subtraction, not knowing that it's just not gonna ultimately matter.
And so the only way to find out what really works is to implement something quickly, and find out which parts are really the hard parts to implement, or which parts are hard parts that could make a difference in performance. In fact, I'd say that if your goal is to build a people recognition system, a system like this is actually far too complicated as your initial system. Maybe after you've prototyped a few systems you'd converge to a system like this, but if this is the first system you're designing, it's much too complicated.

Also, and this is a very concrete piece of advice that applies to your projects as well: if your goal is to build a working application, Step 1 is probably not to design a system like this. Step 1 is to plot your data (there's a small sketch of this below). And very often, if you just take the data you're trying to predict and plot it, plot X, plot Y, plot your data every way you can think of, half the time you look at it and go, "Gee, how come all those numbers are negative? I thought they should be positive. Something's wrong with this dataset." About half the time you find something obviously wrong with your data, or something very surprising. And this is something you find out just by plotting your data, and that you won't find out by implementing big, complicated learning algorithms on it. Plotting the data sounds so simple; it's one of those pieces of advice that lots of us give but hardly anyone follows. So you can take that for what it's worth.

Let me just reiterate: what I just said may be bad advice if your goal is to come up with new machine learning algorithms. All right? For me personally, the learning algorithm I use most often is probably logistic regression, because I have code lying around. So given a learning problem, I probably won't try anything more complicated than logistic regression on it first. And it's only after trying something really simple, and figuring out what's easy and what's hard, that you know where to focus your efforts.
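As a minimal sketch of that plot-your-data step, with made-up file and column names; the point is only to eyeball the data before building anything complicated:

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("my_dataset.csv")   # hypothetical dataset
print(df.describe())                 # ranges, means, obvious anomalies

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(df["y"], bins=50)                 # does the target look sane?
axes[0].set_title("target distribution")
axes[1].scatter(df["x1"], df["y"], s=5)        # any structure, or junk?
axes[1].set_title("x1 versus target")
plt.tight_layout()
plt.show()
```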
But 1417 01:16:03,940 --> 01:16:07,610 again, if your goal is to invent new machine learning algorithms, then you sort of don't 1418 01:16:07,610 --> 01:16:10,750 want to hack up something and then add another hack to fix it, and hack it even more to 1419 01:16:10,750 --> 01:16:12,219 fix it. Right? So if 1420 01:16:12,219 --> 01:16:15,919 your goal is to do novel machine learning research, then it pays to think more deeply about the 1421 01:16:15,919 --> 01:16:21,340 problem and not follow this advice too literally. 1422 01:16:21,340 --> 01:16:22,919 Shoot, you know what? All 1423 01:16:22,919 --> 01:16:28,280 right, sorry, I'm running late, but I just have two more slides so I'm gonna go through these quickly. 1424 01:16:28,280 --> 01:16:30,620 And so, this is what I think 1425 01:16:30,620 --> 01:16:33,459 of as premature statistical optimization, 1426 01:16:33,459 --> 01:16:35,079 where, 1427 01:16:35,079 --> 01:16:38,320 just like with premature optimization of code, quite often 1428 01:16:38,320 --> 01:16:44,369 people will prematurely optimize one component of a big complicated machine learning system. Okay? Just two more 1429 01:16:44,369 --> 01:16:46,949 slides. This 1430 01:16:46,949 --> 01:16:48,539 is 1431 01:16:48,539 --> 01:16:52,070 a sort of cartoon that highly influenced my own thinking. It was based on 1432 01:16:52,070 --> 01:16:55,340 a paper written by Christos Papadimitriou. 1433 01:16:55,340 --> 01:16:57,429 This is how 1434 01:16:57,429 --> 01:16:59,360 progress - this is how 1435 01:16:59,360 --> 01:17:02,360 the developmental progress of research often happens. Right? 1436 01:17:02,360 --> 01:17:05,559 Let's say you want to build a mail delivery robot, so I've drawn a circle there that says mail delivery robot. It 1437 01:17:05,559 --> 01:17:06,519 seems like a useful thing to have. 1438 01:17:06,519 --> 01:17:09,670 Right? You know, free people up so they don't have 1439 01:17:09,670 --> 01:17:12,760 to deliver mail. Well, 1440 01:17:12,760 --> 01:17:14,280 to deliver mail, 1441 01:17:14,280 --> 01:17:19,139 obviously you need a robot to wander around indoor environments and you need a robot to 1442 01:17:19,139 --> 01:17:21,480 manipulate objects and pick up envelopes. And so, 1443 01:17:21,480 --> 01:17:24,890 you need to build those two components in order to get a mail delivery robot. And 1444 01:17:24,890 --> 01:17:25,590 so I've 1445 01:17:25,590 --> 01:17:29,650 drawn those two components with little arrows to denote that, you know, obstacle avoidance 1446 01:17:29,650 --> 01:17:30,460 is 1447 01:17:30,460 --> 01:17:32,229 needed or would help build 1448 01:17:32,229 --> 01:17:35,510 your mail delivery robot. Well, 1449 01:17:35,510 --> 01:17:37,189 for obstacle avoidance, 1450 01:17:37,189 --> 01:17:43,159 clearly, you need a robot that can navigate and you need to detect objects so you can avoid the obstacles. 1451 01:17:43,159 --> 01:17:46,840 Now we're gonna use computer vision to detect the objects. And so, 1452 01:17:46,840 --> 01:17:51,120 we know that, you know, lighting sometimes changes, right, depending on whether it's the 1453 01:17:51,120 --> 01:17:52,709 morning or noontime or evening. 1454 01:17:52,709 --> 01:17:53,930 Lighting 1455 01:17:53,930 --> 01:17:56,639 causes the color of things to change, and so you need 1456 01:17:56,639 --> 01:18:00,509 an object detection system that's invariant to the specific colors of an 1457 01:18:00,509 --> 01:18:01,199 object. Right?
1458 01:18:01,199 --> 01:18:04,420 Because lighting 1459 01:18:04,420 --> 01:18:05,400 changes, 1460 01:18:05,400 --> 01:18:09,849 say. Well, color - RGB values - is represented by three-dimensional vectors. And 1461 01:18:09,849 --> 01:18:11,170 so you need to learn 1462 01:18:11,170 --> 01:18:13,499 when two colors might be the same thing - 1463 01:18:13,499 --> 01:18:15,260 when, you know, 1464 01:18:15,260 --> 01:18:18,159 the visual appearances of two colors may be the same thing under just a lighting change or 1465 01:18:18,159 --> 01:18:19,539 something. 1466 01:18:19,539 --> 01:18:20,590 And 1467 01:18:20,590 --> 01:18:24,059 to understand that properly, we can go out and study the differential geometry 1468 01:18:24,059 --> 01:18:27,510 of 3D manifolds, because that helps us build a sound theory on which 1469 01:18:27,510 --> 01:18:32,250 to develop our 3D similarity learning algorithms. 1470 01:18:32,250 --> 01:18:36,159 And to really understand the fundamental aspects of this problem, 1471 01:18:36,159 --> 01:18:40,110 we have to study the complexity of non-Riemannian geometries. And on 1472 01:18:40,110 --> 01:18:43,850 and on it goes until eventually you're proving convergence bounds for 1473 01:18:43,850 --> 01:18:49,790 sampled non-monotonic logic. I don't even know what that is, because I just made it up. 1474 01:18:49,790 --> 01:18:51,530 Whereas in reality, 1475 01:18:51,530 --> 01:18:53,970 you know, chances are that link isn't real. 1476 01:18:53,970 --> 01:18:55,660 Color invariance 1477 01:18:55,660 --> 01:18:59,550 just barely helped object recognition, maybe. I'm making this up. 1478 01:18:59,550 --> 01:19:03,499 Maybe differential geometry was hardly gonna help 3D similarity learning, and that link's also gonna fail. Okay? 1479 01:19:03,499 --> 01:19:05,270 So, each of 1480 01:19:05,270 --> 01:19:09,130 these circles can represent a person, or a research community, or a thought in your 1481 01:19:09,130 --> 01:19:12,020 head. And there's a very real chance that 1482 01:19:12,020 --> 01:19:15,469 maybe there are all these papers written on the differential geometry of 3D manifolds, and they were 1483 01:19:15,469 --> 01:19:18,570 written because some guy once told someone else that it'll help 3D similarity learning. 1484 01:19:18,570 --> 01:19:20,489 And, 1485 01:19:20,489 --> 01:19:23,369 you know, it's like, "A friend of mine told me that color invariance would help in 1486 01:19:23,369 --> 01:19:26,119 object recognition, so I'm working on color invariance. And now I'm gonna tell a friend 1487 01:19:26,119 --> 01:19:27,440 of mine 1488 01:19:27,440 --> 01:19:30,280 that his thing will help my problem. And he'll tell a friend of his that his thing will help 1489 01:19:30,280 --> 01:19:31,619 with his problem." 1490 01:19:31,619 --> 01:19:33,520 And pretty soon, you're working on 1491 01:19:33,520 --> 01:19:37,540 convergence bounds for sampled non-monotonic logic, when in reality none of this will 1492 01:19:37,540 --> 01:19:39,129 ever see the light of 1493 01:19:39,129 --> 01:19:42,519 day in your mail delivery robot. Okay? 1494 01:19:42,519 --> 01:19:46,599 I'm not criticizing the role of theory. There are very powerful theories, like the 1495 01:19:46,599 --> 01:19:48,400 theory of VC dimension, 1496 01:19:48,400 --> 01:19:52,090 which is far, far, far to the right of this. So VC dimension is about 1497 01:19:52,090 --> 01:19:53,289 as theoretical 1498 01:19:53,289 --> 01:19:57,119 as it can get. And it's clearly had a huge impact on many applications.
And it's, 1499 01:19:57,119 --> 01:19:59,559 you know, dramatically advanced machine learning. And another example is the theory of NP-hardness, which again, you know, 1500 01:19:59,559 --> 01:20:00,750 is about as 1501 01:20:00,750 --> 01:20:04,219 theoretical as it can get. It's 1502 01:20:04,219 --> 01:20:05,800 had a huge impact 1503 01:20:05,800 --> 01:20:09,309 on all of computer science, the theory of NP-hardness. 1504 01:20:09,309 --> 01:20:10,669 But 1505 01:20:10,669 --> 01:20:13,799 when you are off working on highly theoretical things, I guess, to me 1506 01:20:13,799 --> 01:20:16,849 personally, it's important to keep in mind: 1507 01:20:16,849 --> 01:20:19,699 are you working on something like VC dimension, which is high impact, or are you 1508 01:20:19,699 --> 01:20:23,290 working on something like convergence bounds for sampled non-monotonic logic, which 1509 01:20:23,290 --> 01:20:24,710 you're only hoping 1510 01:20:24,710 --> 01:20:25,900 has some peripheral relevance 1511 01:20:25,900 --> 01:20:30,040 to some application? Okay? 1512 01:20:30,040 --> 01:20:34,849 For me personally, I tend to work on an application only if I - excuse me. 1513 01:20:34,849 --> 01:20:36,989 For me personally, and this is a personal choice, 1514 01:20:36,989 --> 01:20:41,340 I tend to trust a theory only if I personally can see a link from the 1515 01:20:41,340 --> 01:20:42,679 theory I'm working on 1516 01:20:42,679 --> 01:20:44,430 all the way back to an application. 1517 01:20:44,430 --> 01:20:46,010 And 1518 01:20:46,010 --> 01:20:50,299 if I don't personally see a direct link from what I'm doing to an application, then, 1519 01:20:50,299 --> 01:20:53,429 you know, that's fine. I can choose to work on theory, but 1520 01:20:53,429 --> 01:20:55,650 I wouldn't necessarily trust that 1521 01:20:55,650 --> 01:20:59,210 the theory I'm working on will relate to an application, if I don't personally 1522 01:20:59,210 --> 01:21:02,429 see a link all the way back. 1523 01:21:02,429 --> 01:21:04,400 Just to summarize. 1524 01:21:04,400 --> 01:21:06,409 1525 01:21:06,409 --> 01:21:08,679 One lesson to take away from today is, I think, 1526 01:21:08,679 --> 01:21:12,529 time spent coming up with diagnostics for learning algorithms is often time well spent. 1527 01:21:12,529 --> 01:21:13,029 1528 01:21:13,029 --> 01:21:16,199 It's often up to your own ingenuity to come up with great diagnostics. And 1529 01:21:16,199 --> 01:21:19,019 when I personally work on a machine learning algorithm, 1530 01:21:19,019 --> 01:21:21,169 it's not uncommon for me to be spending 1531 01:21:21,169 --> 01:21:23,679 between a third and often half of my time 1532 01:21:23,679 --> 01:21:26,409 just writing diagnostics and trying to figure out what's going right and what's 1533 01:21:26,409 --> 01:21:28,079 going wrong. 1534 01:21:28,079 --> 01:21:31,500 Sometimes it's tempting not to, right, because you want to be implementing learning algorithms and 1535 01:21:31,500 --> 01:21:34,780 making progress. You don't want to be spending all this time, you know, implementing tests on your 1536 01:21:34,780 --> 01:21:38,280 learning algorithms; it doesn't feel like you're doing anything. But when 1537 01:21:38,280 --> 01:21:41,419 I implement learning algorithms, at least a third, and quite often half, of 1538 01:21:41,419 --> 01:21:45,880 my time is actually spent implementing those tests, so you can figure out what to work on.
And 1539 01:21:45,880 --> 01:21:49,219 I think it's actually one of the best uses of your time. We talked 1540 01:21:49,219 --> 01:21:50,729 about error 1541 01:21:50,729 --> 01:21:54,319 analyses and ablative analyses, and lastly we 1542 01:21:54,319 --> 01:21:56,890 talked about, you know, different approaches and the 1543 01:21:56,890 --> 01:22:00,979 risks of premature statistical optimization. Okay. 1544 01:22:00,979 --> 01:22:04,339 Sorry I ran over. I'll be here for a few more minutes for your questions.
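(For reference, here is a minimal sketch of one standard diagnostic of the kind discussed above: a learning curve comparing training and validation accuracy as the training set grows. The synthetic dataset and model are placeholders, not anything from the lecture; a large persistent gap between the curves suggests high variance, while two low, converged curves suggest high bias.)

```python
# Sketch of a learning-curve diagnostic (placeholder data and model).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n:4d}  train acc={tr:.3f}  val acc={va:.3f}")
# A large, persistent train/val gap suggests high variance (get more
# data or regularize); two low, converged curves suggest high bias.
```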