1
00:00:00,000 --> 00:00:18,229
35C3 preroll music
2
00:00:18,229 --> 00:00:24,750
Herald Angel: Welcome to our introduction
to deep learning with Teubi. Deep
3
00:00:24,750 --> 00:00:30,247
learning, also often called machine
learning, is a hype word which we hear in
4
00:00:30,247 --> 00:00:37,152
the media all the time. It's nearly as bad
as blockchain. It's a solution for
5
00:00:37,152 --> 00:00:43,249
everything. Today we'll get a sneak peek
into the internals of this mystical black
6
00:00:43,249 --> 00:00:48,820
box they are talking about. And Teubi
will show us why people who know what
7
00:00:48,820 --> 00:00:53,040
machine learning really is about have to
facepalm so often when they read the
8
00:00:53,040 --> 00:00:58,715
news. So please welcome Teubi
with a big round of applause!
9
00:00:58,715 --> 00:01:10,245
Applause
Teubi: Alright! Good morning and welcome
10
00:01:10,245 --> 00:01:14,470
to Introduction to Deep Learning. The
title will already tell you what this talk
11
00:01:14,470 --> 00:01:19,920
is about. I want to give you an
introduction to how deep learning works,
12
00:01:19,920 --> 00:01:27,090
what happens inside this black box. But,
first of all, who am I? I'm Teubi. It's a
13
00:01:27,090 --> 00:01:32,280
German nickname; it has nothing to do with
toys or bees. You might have heard my
14
00:01:32,280 --> 00:01:36,480
voice before, because I host the
Nussschale podcast. There I explain
15
00:01:36,480 --> 00:01:41,560
scientific topics in under 10 minutes.
I'll have to use a little more time today,
16
00:01:41,560 --> 00:01:46,850
and you'll also have fancy animations
which hopefully will help. In my day job
17
00:01:46,850 --> 00:01:52,540
I'm a research scientist at an institute
for computer vision. I analyze microscopy
18
00:01:52,540 --> 00:01:58,240
images of bone marrow blood cells and try
to find ways to teach the computer to
19
00:01:58,240 --> 00:02:04,660
understand what it sees. Namely, to
differentiate between certain cells or,
20
00:02:04,660 --> 00:02:09,449
first of all, find cells in an image,
which is a task that is more complex than
21
00:02:09,449 --> 00:02:17,180
it might sound. Let me start with the
introduction to deep learning. We all know
22
00:02:17,180 --> 00:02:22,769
how to code. We code in a very simple way.
We have some input for a computer
23
00:02:22,769 --> 00:02:27,618
algorithm. Then we have an algorithm which
says: Do this, do that. If this, then
24
00:02:28,510 --> 00:02:28,906
that. And in that way we generate some
output. This is not how machine learning
25
00:02:29,495 --> 00:02:30,754
works. Machine learning assumes you have
some input, and you also have some output.
26
00:02:40,810 --> 00:02:46,180
And what you also have is some statistical
model. This statistical model is flexible.
27
00:02:46,180 --> 00:02:51,549
It has certain parameters, which it can
learn from the distribution of inputs and
28
00:02:51,549 --> 00:02:57,430
outputs you give it for training. So you
basically teach the statistical model to
29
00:02:57,430 --> 00:03:03,659
generate the desired output from the given
input. Let me give you a really simple
30
00:03:03,659 --> 00:03:09,980
example of how this might work. Let's say
we have two animals. Well, we have two
31
00:03:09,980 --> 00:03:15,689
kinds of animals: unicorns and rabbits.
And now we want to find an algorithm that
32
00:03:15,689 --> 00:03:24,270
tells us whether this animal we have right
now as an input is a rabbit or a unicorn.
33
00:03:24,270 --> 00:03:28,230
We can write a simple algorithm to do
that, but we can also do it with machine
34
00:03:28,230 --> 00:03:34,590
learning. The first thing we need is some
input. I choose two features that are able
35
00:03:34,590 --> 00:03:42,269
to tell me whether this animal is a rabbit
or a unicorn. Namely, speed and size. We
36
00:03:42,269 --> 00:03:46,859
call these features, and they describe
something about what we want to classify.
37
00:03:46,859 --> 00:03:52,409
And the class is in this case our animal.
First thing I need is some training data,
38
00:03:52,409 --> 00:03:59,170
some input. The inputs here are just pairs
of speed and size. What I also need is
39
00:03:59,170 --> 00:04:04,129
information about the desired output. The
desired output, of course, being the
40
00:04:04,129 --> 00:04:12,999
class. So either unicorn or rabbit, here
denoted by yellow and red X's. So let's
41
00:04:12,999 --> 00:04:18,298
try to find a statistical model which we
can use to separate this feature space
42
00:04:18,298 --> 00:04:24,150
into two halves: One for the rabbits, one
for the unicorns. Looking at this, we can
43
00:04:24,150 --> 00:04:28,660
actually find a really simple statistical
model, and our statistical model in this
44
00:04:28,660 --> 00:04:34,390
case is just a straight line. And the
learning process is then to find where in
45
00:04:34,390 --> 00:04:41,080
this feature space the line should be.
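On the slide this is just a picture, but a line as a statistical model is easy to write down. A minimal sketch, with invented feature values and a line placed by hand rather than learned:

```python
# Classify animals by which side of a line in feature space they fall
# on. The line is w1*speed + w2*size + b = 0; all values here are
# illustrative, not learned.

def classify(speed, size, w1, w2, b):
    """Return 'unicorn' if the point lies above the line, else 'rabbit'."""
    return "unicorn" if w1 * speed + w2 * size + b > 0 else "rabbit"

# A line placed between the two clusters, e.g. speed + size = 10,
# assuming unicorns are the faster, larger class.
w1, w2, b = 1.0, 1.0, -10.0

print(classify(speed=8, size=6, w1=w1, w2=w2, b=b))  # prints "unicorn"
print(classify(speed=2, size=3, w1=w1, w2=w2, b=b))  # prints "rabbit"
```

The learning process the talk describes would adjust w1, w2, and b from training data instead of fixing them by hand.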
Ideally, for example, here. Right in the
46
00:04:41,080 --> 00:04:45,220
middle between the two classes rabbit and
unicorn. Of course this is an overly
47
00:04:45,220 --> 00:04:50,370
simplified example. Real-world
applications have feature distributions
48
00:04:50,370 --> 00:04:56,080
which look much more like this. So, we
have a gradient, we don't have a perfect
49
00:04:56,080 --> 00:05:00,130
separation between those two classes, and
those two classes are definitely not
50
00:05:00,130 --> 00:05:05,560
separable by a line. If we look again at
some training samples — training samples
51
00:05:05,560 --> 00:05:11,730
are the data points we use for the machine
learning process, so, to try to find the
52
00:05:11,730 --> 00:05:17,540
parameters of our statistical model — if
we look at the line again, then this will
53
00:05:17,540 --> 00:05:23,000
not be able to separate this training set.
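To make underfitting concrete, here is a toy illustration of my own (XOR-style points, not the talk's data): no straight line classifies all four points correctly, so a linear model necessarily makes errors no matter how it is trained.

```python
# Four XOR-style points: no straight line separates the two classes,
# so a linear model underfits. Brute-force over many candidate lines.

points = [((0, 0), 0), ((1, 1), 0), ((0, 1), 1), ((1, 0), 1)]

def errors(w1, w2, b):
    """Count misclassified points for the line w1*x + w2*y + b = 0."""
    n = 0
    for (x, y), label in points:
        pred = 1 if w1 * x + w2 * y + b > 0 else 0
        n += (pred != label)
    return n

steps = [i / 4 for i in range(-8, 9)]  # candidate parameters -2.0 .. 2.0
best = min(errors(w1, w2, b) for w1 in steps for w2 in steps for b in steps)
print(best)  # prints 1: even the best line misclassifies a point
```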
Well, we will have a line that has some
54
00:05:23,000 --> 00:05:27,320
errors, some unicorns which will be
classified as rabbits, some rabbits which
55
00:05:27,320 --> 00:05:33,070
will be classified as unicorns. This is
what we call underfitting. Our model is
56
00:05:33,070 --> 00:05:40,150
just not able to express what we want it
to learn. There is the opposite case. The
57
00:05:40,150 --> 00:05:45,510
opposite case being: we just learn all the
training samples by heart. This is if we
58
00:05:45,510 --> 00:05:50,020
have a very complex model and just a few
training samples to teach the model what
59
00:05:50,020 --> 00:05:55,120
it should learn. In this case we have a
perfect separation of unicorns and
60
00:05:55,120 --> 00:06:00,700
rabbits, at least for the few data points
we have. If we draw another example from
61
00:06:00,700 --> 00:06:07,300
the real world, some other data points,
they will most likely be wrong. And this
62
00:06:07,300 --> 00:06:11,380
is what we call overfitting. The perfect
scenario in this case would be something
63
00:06:11,380 --> 00:06:17,340
like this: a classifier which is really
close to the distribution we have in the
64
00:06:17,340 --> 00:06:23,350
real world, and machine learning is tasked
with finding this perfect model and its
65
00:06:23,350 --> 00:06:28,960
parameters. Let me show you a different
kind of model, something you probably all
66
00:06:28,960 --> 00:06:35,670
have heard about: Neural networks. Neural
networks are inspired by the brain.
67
00:06:35,670 --> 00:06:41,210
Or more precisely, by the neurons in our
brain. Neurons are tiny objects, tiny
68
00:06:41,210 --> 00:06:47,250
cells in our brain that take some input
and generate some output. Sounds familiar,
69
00:06:47,250 --> 00:06:52,680
right? We have inputs usually in the form
of electrical signals. And if they are
70
00:06:52,680 --> 00:06:57,860
strong enough, this neuron will also send
out an electrical signal. And this is
71
00:06:57,860 --> 00:07:03,430
something we can model in a computer-
engineering way. So, what we do is: We
72
00:07:03,430 --> 00:07:09,240
take a neuron. The neuron is just a simple
mapping from input to output. Input here,
73
00:07:09,240 --> 00:07:17,200
just three input nodes. We denote them by
i1, i2, and i3, and the output by o. And
74
00:07:17,200 --> 00:07:20,840
now you will actually see some
mathematical equations. There are not many
75
00:07:20,840 --> 00:07:26,700
of these in this foundation talk, don't
worry, and it's really simple. There's one
76
00:07:26,700 --> 00:07:30,250
more thing we need first, though, if we
want to map input to output in the way a
77
00:07:30,250 --> 00:07:35,490
neuron does. Namely, the weights. The
weights are just some arbitrary numbers
78
00:07:35,490 --> 00:07:43,020
for now. Let's call them w1, w2 and w3.
So, we take those weights and we multiply
79
00:07:43,020 --> 00:07:51,360
them with the input. Input1 times weight1,
input2 times weight2, and so on. And
80
00:07:51,360 --> 00:07:57,550
this sum will just be our output. Well,
not quite. We make it a little bit more
81
00:07:57,550 --> 00:08:02,430
complicated. We also use something called
an activation function. The activation
82
00:08:02,430 --> 00:08:08,520
function is just a mapping from one scalar
value to another scalar value. In this
83
00:08:08,520 --> 00:08:14,280
case from what we got as an output, the
sum, to something that more closely fits
84
00:08:14,280 --> 00:08:19,360
what we need. This could for example be
something binary, where we have all the
85
00:08:19,360 --> 00:08:23,780
negative numbers being mapped to zero and
all the positive numbers being mapped to
86
00:08:23,780 --> 00:08:30,910
one. And then this zero and one can encode
something. For example: rabbit or unicorn.
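The weighted sum plus binary step activation just described can be sketched in a few lines. The speed, size, and constant-10 inputs anticipate the example the talk gives next:

```python
# A single artificial neuron: weighted sum of the inputs, then a binary
# step activation that maps negatives to 0 and positives to 1.

def step(x):
    return 1 if x > 0 else 0

def neuron(inputs, weights):
    total = sum(i * w for i, w in zip(inputs, weights))
    return step(total)

# Inputs (speed, size, 10) with weights (1, 1, -1): the neuron
# outputs 1 exactly when speed + size > 10.
print(neuron([8, 6, 10], [1, 1, -1]))  # prints 1, e.g. "unicorn"
print(neuron([2, 3, 10], [1, 1, -1]))  # prints 0, e.g. "rabbit"
```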
87
00:08:30,910 --> 00:08:35,309
So, let me give you an example of how we
can make the previous example with the
88
00:08:35,309 --> 00:08:41,729
rabbits and unicorns work with such a
simple neuron. We just use speed, size,
89
00:08:41,729 --> 00:08:49,650
and the arbitrarily chosen number 10 as
our inputs and the weights 1, 1, and -1.
90
00:08:49,650 --> 00:08:54,400
If we look at the equations, then we get
for our negative numbers — so, speed plus
91
00:08:54,400 --> 00:09:01,440
size being less than 10 — a 0, and a 1 for
all positive numbers — being speed plus
92
00:09:01,440 --> 00:09:07,680
size larger than 10, greater than 10. This
way we again have a separating line
93
00:09:07,680 --> 00:09:14,600
between unicorns and rabbits. But again we
have this really simplistic model. We want
94
00:09:14,600 --> 00:09:21,529
the model to become more and more complicated in
order to express more complex tasks. So
95
00:09:21,529 --> 00:09:26,279
what do we do? We take more neurons. We
take our three input values and put them
96
00:09:26,279 --> 00:09:31,920
into one neuron, and into a second neuron,
and into a third neuron. And we take the
97
00:09:31,920 --> 00:09:38,330
output of those three neurons as input for
another neuron. We also call this a
98
00:09:38,330 --> 00:09:42,140
multilayer perceptron, perceptron just
being a different name for the neurons
99
00:09:42,140 --> 00:09:48,670
we have there. And the whole thing is also
called a neural network. So now the
100
00:09:48,670 --> 00:09:53,300
question: How do we train this? How do we
learn what this network should encode?
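The answer the talk sketches next — nudge each weight up or down in proportion to the error and a learning rate — is essentially the classic perceptron update rule. A minimal sketch on made-up rabbit/unicorn data (data, weights, and epoch count are all invented for illustration):

```python
# Perceptron-style training: adjust each weight by
# learning_rate * error * input. The third input is a constant 1,
# acting as a bias term.

samples = [([8, 6, 1], 1), ([9, 7, 1], 1),  # fast/large -> class 1
           ([2, 3, 1], 0), ([1, 2, 1], 0)]  # slow/small -> class 0
weights = [0.0, 0.0, 0.0]
learning_rate = 0.1

def predict(inputs, weights):
    total = sum(i * w for i, w in zip(inputs, weights))
    return 1 if total > 0 else 0

for epoch in range(20):
    for inputs, target in samples:
        error = target - predict(inputs, weights)     # -1, 0, or +1
        for j, x in enumerate(inputs):
            weights[j] += learning_rate * error * x   # raise or lower

print([predict(x, weights) for x, _ in samples])  # prints [1, 1, 0, 0]
```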
101
00:09:53,300 --> 00:09:57,620
Well, we want a mapping from input to
output, and what we can change are the
102
00:09:57,620 --> 00:10:02,880
weights. First, what we do is we take a
training sample, some input. Put it
103
00:10:02,880 --> 00:10:07,010
through the network, get an output. But
this might not be the desired output, which
104
00:10:07,010 --> 00:10:13,570
we know. So, in the binary case there are
four possible cases: computed output,
105
00:10:13,570 --> 00:10:19,860
expected output, each two values, 0 and 1.
The best case would be: we want a 0, get a
106
00:10:19,860 --> 00:10:27,120
0, want a 1 and get a 1. But there is also
the opposite case. In these two cases we
107
00:10:27,120 --> 00:10:31,440
can learn something about our model.
Namely, in which direction to change the
108
00:10:31,440 --> 00:10:37,270
weights. It's a little bit simplified, but
in principle you just raise the weights if
109
00:10:37,270 --> 00:10:41,250
you need a higher number as output and you
lower the weights if you need a lower
110
00:10:41,250 --> 00:10:47,350
number as output. To tell you how much, we
have two terms. First term being the
111
00:10:47,350 --> 00:10:53,110
error, so in this case just the difference
between desired and computed output – also
112
00:10:53,110 --> 00:10:56,890
often called a loss function, especially
in deep learning and more complex
113
00:10:56,890 --> 00:11:04,120
applications. You also have a second term
we call the learning rate, and the
114
00:11:04,120 --> 00:11:09,170
learning rate is what tells us how quickly
we should change the weights, how quickly
115
00:11:09,170 --> 00:11:14,890
we should adapt the weights. Okay, this is
how we learn a model. This is almost
116
00:11:14,890 --> 00:11:18,550
everything you need to know. There are
mathematical equations that tell you how
117
00:11:18,550 --> 00:11:23,770
much to change based on the error and the
learning rate. And this is the entire
118
00:11:23,770 --> 00:11:30,339
learning process. Let's get back to the
terminology. We have the input layer. We
119
00:11:30,339 --> 00:11:34,020
have the output layer, which somehow
encodes our output either in one value or
120
00:11:34,020 --> 00:11:39,650
in several values if we have multiple
classes. We also have
121
00:11:39,650 --> 00:11:45,930
the hidden layers, which are actually what
makes our model deep. What we can change,
122
00:11:45,930 --> 00:11:51,980
what we can learn, are the weights,
the parameters of this model. But what we
123
00:11:51,980 --> 00:11:55,490
also need to keep in mind, is the number
of layers, the number of neurons per
124
00:11:55,490 --> 00:11:59,590
layer, the learning rate, and the
activation function. These are called
125
00:11:59,590 --> 00:12:04,240
hyperparameters, and they determine how
complex our model is, how well it is
126
00:12:04,240 --> 00:12:09,970
suited to solve the task at hand. I quite
often spoke about solving tasks, so the
127
00:12:09,970 --> 00:12:14,630
question is: What can we actually do with
neural networks? Mostly classification
128
00:12:14,630 --> 00:12:19,560
tasks, for example: Tell me, is this
animal a rabbit or unicorn? Is this text
129
00:12:19,560 --> 00:12:24,690
message spam or legitimate? Is this
patient healthy or ill? Is this image a
130
00:12:24,690 --> 00:12:30,710
picture of a cat or a dog? We already saw
for the animal that we need something
131
00:12:30,710 --> 00:12:35,040
called features, which somehow encode
information about what we want to
132
00:12:35,040 --> 00:12:39,530
classify, something we can use as input
for the neural network. Some kind of
133
00:12:39,530 --> 00:12:43,830
number that is meaningful. So, for the
animal it could be speed, size, or
134
00:12:43,830 --> 00:12:48,740
something like color. Color, of course,
being more complex again, because we have,
135
00:12:48,740 --> 00:12:55,940
for example, RGB, so three values. And,
text message being a more complex case
136
00:12:55,940 --> 00:13:00,060
again, because we somehow need to encode
the sender, and whether the sender is
137
00:13:00,060 --> 00:13:04,770
legitimate. Same for the recipient, or the
number of hyperlinks, or where the
138
00:13:04,770 --> 00:13:11,400
hyperlinks refer to, or whether there
are certain words present in the text. It
139
00:13:11,400 --> 00:13:16,720
gets more and more complicated. Even more
so for a patient. How do we encode medical
140
00:13:16,720 --> 00:13:22,420
history in a proper way for the network to
learn? I mean, temperature is simple. It's
141
00:13:22,420 --> 00:13:26,750
a scalar value, we just have a number. But
how do we encode whether certain symptoms
142
00:13:26,750 --> 00:13:32,720
are present? And the image, which is
actually what I work with everyday, is
143
00:13:32,720 --> 00:13:38,350
again quite complex. We have values, we
have numbers, but only pixel values, which
144
00:13:38,350 --> 00:13:43,450
are difficult to
use as input for a neural network. Why?
145
00:13:43,450 --> 00:13:48,350
I'll show you. I'll actually show you with
this picture, it's a very famous picture,
146
00:13:48,350 --> 00:13:53,970
and everybody uses it in computer vision.
They will tell you, it's because there is
147
00:13:53,970 --> 00:14:01,010
a multitude of different characteristics
in this image: shapes, edges, whatever you
148
00:14:01,010 --> 00:14:07,080
desire. The truth is, it's a crop from the
centrefold of the Playboy, and in earlier
149
00:14:07,080 --> 00:14:12,070
years, computer vision engineers were a
mostly male audience. Anyway, let's take
150
00:14:12,070 --> 00:14:16,850
five by five pixels. Let's assume this is
a five-by-five-pixel, really small,
151
00:14:16,850 --> 00:14:22,230
image. If we take those 25 pixels and use
them as input for a neural network you
152
00:14:22,230 --> 00:14:26,730
already see that we have many connections
- many weights - which means a very
153
00:14:26,730 --> 00:14:32,540
complex model. Complex model, of course,
prone to overfitting. But there are more
154
00:14:32,540 --> 00:14:38,800
problems. The first being that we have
disconnected a
155
00:14:38,800 --> 00:14:43,670
pixel from its neighbors. We can't encode
information about the neighborhood
156
00:14:43,670 --> 00:14:47,850
anymore, and that really sucks. If we just
take the whole picture, and move it to the
157
00:14:47,850 --> 00:14:52,790
left or to the right by just one pixel,
the network will see something completely
158
00:14:52,790 --> 00:14:58,470
different, even though to us it is exactly
the same. But we can solve that with some
159
00:14:58,470 --> 00:15:03,400
very clever engineering, something we call
a convolutional layer. It is again a
160
00:15:03,400 --> 00:15:08,860
hidden layer in a neural network, but it
does something special. It actually is a
161
00:15:08,860 --> 00:15:13,970
very simple neuron again, just four input
values - one output value. But the four
162
00:15:13,970 --> 00:15:19,780
input values look at two by two pixels,
and encode one output value. And then the
163
00:15:19,780 --> 00:15:23,790
same network is shifted to the right, and
encodes another pixel, and another pixel,
164
00:15:23,790 --> 00:15:30,150
and the next row of pixels. And in this
way it creates another 2D image. We have
165
00:15:30,150 --> 00:15:34,900
preserved information about the
neighborhood, and we just have a very low
166
00:15:34,900 --> 00:15:41,910
number of weights, not the huge number of
parameters we saw earlier. We can use this
167
00:15:41,910 --> 00:15:49,640
once, or twice, or several hundred times.
And this is actually where we go deep.
168
00:15:49,640 --> 00:15:54,920
Deep means: We have several layers, and
having layers that don't need thousands or
169
00:15:54,920 --> 00:16:01,040
millions of connections, but only a few.
This is what allows us to go really deep.
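The sliding two-by-two neuron described above can be sketched directly. The filter weights here are invented; in a real network they would be learned:

```python
# Slide one 2x2 "neuron" (filter) across an image: at each position,
# take the weighted sum of the 2x2 patch. The same four weights are
# reused everywhere, which keeps the parameter count tiny.

def conv2x2(image, weights):
    h, w = len(image), len(image[0])
    out = []
    for r in range(h - 1):
        row = []
        for c in range(w - 1):
            patch = [image[r][c], image[r][c + 1],
                     image[r + 1][c], image[r + 1][c + 1]]
            row.append(sum(p * wt for p, wt in zip(patch, weights)))
        out.append(row)
    return out

image = [[1, 2, 0],
         [0, 1, 3],
         [4, 0, 1]]
weights = [1, 0, 0, 1]  # illustrative filter: sums the patch's diagonal

print(conv2x2(image, weights))  # prints [[2, 5], [0, 2]]
# 3x3 input -> 2x2 output, yet only 4 weights in total
```

Compare this with a fully connected layer, where every pixel would need its own weight per neuron.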
170
00:16:01,040 --> 00:16:06,250
And in this fashion we can encode an
entire image in just a few meaningful
171
00:16:06,250 --> 00:16:11,480
values. What these values look like, and
what they encode, this is learned through
172
00:16:11,480 --> 00:16:18,240
the learning process. And we can then, for
example, use these few values as input for
173
00:16:18,240 --> 00:16:24,709
a classification network.
The fully connected network we saw earlier.
174
00:16:24,709 --> 00:16:29,560
Or we can do something more clever. We can
do the inverse operation and create an image
175
00:16:29,560 --> 00:16:35,170
again, for example, the same image, which
is then called an autoencoder. Auto-
176
00:16:35,170 --> 00:16:40,200
encoders are tremendously useful, even
though they don't appear that way. For
177
00:16:40,200 --> 00:16:43,959
example, imagine you want to check whether
something has a defect or not: a picture
178
00:16:43,959 --> 00:16:51,290
of a fabric or something. You just
train the network with normal pictures.
179
00:16:51,290 --> 00:16:56,770
And then, if you have a defect picture,
the network is not able to produce this
180
00:16:56,770 --> 00:17:02,149
defect. And so the difference between the
reproduced picture and the real picture
181
00:17:02,149 --> 00:17:07,420
will show you where errors are. If it
works properly, I'll have to admit that.
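The defect-detection idea boils down to a pixelwise difference between the input and the autoencoder's reconstruction. In this sketch the "reconstruction" is a hard-coded stand-in, since no real network is trained here:

```python
# Anomaly detection with an autoencoder, in miniature: a network
# trained only on normal images reproduces normal structure, so a
# defect shows up as a large input/reconstruction difference.
# The reconstruction below is a stand-in, not a real network output.

def difference_map(image, reconstruction):
    return [[abs(a - b) for a, b in zip(row_i, row_r)]
            for row_i, row_r in zip(image, reconstruction)]

defective = [[0.1, 0.9, 0.1],      # bright defect pixel in the middle
             [0.1, 0.1, 0.1]]
reconstructed = [[0.1, 0.1, 0.1],  # the "network" only knows normal fabric
                 [0.1, 0.1, 0.1]]

diff = difference_map(defective, reconstructed)
peak = max(v for row in diff for v in row)
print(diff)  # the defect location lights up in the difference map
```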
182
00:17:07,420 --> 00:17:12,569
But we can go even further. Let's say, we
want to encode something else entirely.
183
00:17:12,569 --> 00:17:17,400
Well, let's encode the image, the
information in the image, but in another
184
00:17:17,400 --> 00:17:21,859
representation. For example, let's say we
have three classes again. The background
185
00:17:21,859 --> 00:17:30,049
class in grey, a class called hat or
headwear in blue, and person in green. We
186
00:17:30,049 --> 00:17:34,309
can also use this for other applications
than just for pictures of humans. For
187
00:17:34,309 --> 00:17:38,370
example, we have a picture of a street and
want to encode: Where is the car, where's
188
00:17:38,370 --> 00:17:44,860
the pedestrian? Tremendously useful. Or we
have an MRI scan of a brain: Where in the
189
00:17:44,860 --> 00:17:51,110
brain is the tumor? Can we somehow learn
this? Yes, we can, with methods
190
00:17:51,110 --> 00:17:57,480
like these, if they are trained properly.
More about that later. Well we expect
191
00:17:57,480 --> 00:18:01,020
something like this to come out but the
truth looks rather like this – especially
192
00:18:01,020 --> 00:18:05,870
if it's not properly trained. We don't get
the real shape we want, but
193
00:18:05,870 --> 00:18:11,980
something distorted. So here is again
where we need to do learning. First we
194
00:18:11,980 --> 00:18:15,790
take a picture, put it through the
network, get our output representation.
195
00:18:15,790 --> 00:18:21,110
And we have the information about how we
want it to look. We again compute some
196
00:18:21,110 --> 00:18:27,040
kind of loss value. This time for example
being the overlap between the shape we get
197
00:18:27,040 --> 00:18:34,040
out of the model and the shape we want to
have. And we use this error, this loss
198
00:18:34,040 --> 00:18:38,660
function, to update the weights of our
network. Again – even though it's more
199
00:18:38,660 --> 00:18:43,570
complicated here, even though we have more
layers, and even though the layers look
200
00:18:43,570 --> 00:18:48,640
slightly different – it is the same
process all over again as with a binary
201
00:18:48,640 --> 00:18:56,540
case. And we need lots of training data.
This is something that you'll hear often
202
00:18:56,540 --> 00:19:02,960
in connection with deep learning: You need
lots of training data to make this work.
203
00:19:02,960 --> 00:19:10,100
Images are complex things and in order to
meaningfully extract knowledge from them,
204
00:19:10,100 --> 00:19:17,090
the network needs to see a multitude of
different images. Well now I already
205
00:19:17,090 --> 00:19:22,230
showed you some things we use in network
architecture, some support networks: The
206
00:19:22,230 --> 00:19:26,679
fully convolutional encoder, which takes
an image and produces a few meaningful
207
00:19:26,679 --> 00:19:33,110
values out of this image; its counterpart
the fully convolutional decoder – fully
208
00:19:33,110 --> 00:19:36,960
convolutional meaning by the way that we
only have these convolutional layers with
209
00:19:36,960 --> 00:19:42,980
a few parameters that somehow encode
spatial information and keep it for the
210
00:19:42,980 --> 00:19:49,360
next layers. The decoder takes a few
meaningful numbers and reproduces an image
211
00:19:49,360 --> 00:19:55,420
– either the same image or another
representation of the information encoded
212
00:19:55,420 --> 00:20:01,400
in the image. We also already saw the
fully connected network. Fully connected
213
00:20:01,400 --> 00:20:06,640
meaning every neuron is connected to every
neuron in the next layer. This of course
214
00:20:06,640 --> 00:20:12,570
can be dangerous because this is where we
actually get most of our parameters. If we
215
00:20:12,570 --> 00:20:16,390
have a fully connected network, this is
where the most parameters will be present
216
00:20:16,390 --> 00:20:21,580
because connecting every node to every
node … this is just a high number of
217
00:20:21,580 --> 00:20:25,860
connections. We can also do other things.
For example something called a pooling
218
00:20:25,860 --> 00:20:32,280
layer. A pooling layer being basically the
same as one of those convolutional layers,
219
00:20:32,280 --> 00:20:36,370
just that we don't have parameters we need
to learn. This works without parameters
220
00:20:36,370 --> 00:20:43,740
because this neuron just chooses whichever
value is the highest and takes that value
221
00:20:43,740 --> 00:20:49,600
as output. This is really great for
reducing the size of your image and also
222
00:20:49,600 --> 00:20:55,150
getting rid of information that might not
be that important. We can also do some
223
00:20:55,150 --> 00:20:59,890
clever techniques like adding a dropout
layer. A dropout layer just being a normal
224
00:20:59,890 --> 00:21:05,799
layer in a neural network where we remove
some connections: In one training step
225
00:21:05,799 --> 00:21:10,720
these connections, in the next training
step some other connections. This way we
226
00:21:10,720 --> 00:21:18,049
teach the other connections to become more
resilient against errors. I would like to
227
00:21:18,049 --> 00:21:22,750
start with something I call the "Model
Show" now, and show you some models and
228
00:21:22,750 --> 00:21:28,870
how we train those models. And I will
start with a fully convolutional decoder
229
00:21:28,870 --> 00:21:34,740
we saw earlier: This thing that takes a
number and creates a picture. I would like
230
00:21:34,740 --> 00:21:41,420
to take this model, put in some number and
get out a picture – a picture of a horse
231
00:21:41,420 --> 00:21:46,000
for example. If I put in a different
number I also want to get a picture of a
232
00:21:46,000 --> 00:21:52,390
horse, but of a different horse. So what I
want to get is a mapping from some
233
00:21:52,390 --> 00:21:56,730
numbers, some features that encode
something about the horse picture, and get
234
00:21:56,730 --> 00:22:03,450
a horse picture out of it. You might see
already why this is problematic. It is
235
00:22:03,450 --> 00:22:08,230
problematic because we don't have a
mapping from feature to horse or from
236
00:22:08,230 --> 00:22:15,050
horse to features. So we don't have a
truth value we can use to learn how to
237
00:22:15,050 --> 00:22:21,790
generate this mapping. Well computer
vision engineers – or deep learning
238
00:22:21,790 --> 00:22:26,800
professionals – they're smart and have
clever ideas. Let's just assume we have
239
00:22:26,800 --> 00:22:32,870
such a network and let's call it a
generator. Let's take some numbers, put
240
00:22:32,870 --> 00:22:39,240
them into the generator and get some
horses. Well it doesn't work yet. We still
241
00:22:39,240 --> 00:22:42,490
have to train it. So they're probably not
only horses but also some very special
242
00:22:42,490 --> 00:22:47,970
unicorns among the horses; which might be
nice for other applications, but I wanted
243
00:22:47,970 --> 00:22:55,480
pictures of horses right now. So I can't
train with this data directly. But what I
244
00:22:55,480 --> 00:23:01,600
can do is I can create a second network.
This network is called a discriminator and
245
00:23:01,600 --> 00:23:08,820
I can give it the input generated from the
generator as well as the real data I have:
246
00:23:08,820 --> 00:23:13,920
the real horse pictures. And then I can
teach the discriminator to distinguish
247
00:23:13,920 --> 00:23:22,080
between those. Tell me it is a real horse
or it's not a real horse. And there I know
248
00:23:22,080 --> 00:23:27,000
what is the truth because I either take
real horse pictures or fake horse pictures
249
00:23:27,000 --> 00:23:34,170
from the generator. So I have a truth
value for this discriminator. But in doing
250
00:23:34,170 --> 00:23:39,070
this I also have a truth value for the
generator. Because I want the generator to
251
00:23:39,070 --> 00:23:43,799
work against the discriminator. So I can
also use the information how well the
252
00:23:43,799 --> 00:23:51,010
discriminator does to train the generator
to become better at fooling it. This is
253
00:23:51,010 --> 00:23:57,470
called a generative adversarial network.
And it can be used to generate pictures of
254
00:23:57,470 --> 00:24:02,350
an arbitrary distribution. Let's do this
with numbers and I will actually show you
255
00:24:02,350 --> 00:24:07,590
the training process. Before I start the
video, I'll tell you what I did. I took
256
00:24:07,590 --> 00:24:11,550
some handwritten digits. There is a
database called "MNIST of handwritten
257
00:24:11,550 --> 00:24:18,570
digits" so the numbers 0 to 9. And I
took those and used them as training data.
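The alternating generator/discriminator training just described can be sketched with stand-in models: a one-parameter "generator" and a "discriminator" that only tracks the mean of the real data. Everything here is a deliberately tiny caricature of a GAN, kept only to show the alternating loop, not the real training procedure:

```python
import random

random.seed(0)

# Toy GAN: real data are numbers near 5. The "generator" maps noise z
# to theta + z and learns theta; the "discriminator" judges how real a
# number looks by its distance to its running estimate of the real mean.

real_mean_estimate = 0.0   # discriminator's single parameter
theta = -3.0               # generator's single parameter
d_lr, g_lr = 0.1, 0.1

def sample_real():
    return 5.0 + random.uniform(-0.5, 0.5)

for step in range(200):
    # 1) Train the discriminator: move its estimate toward real data.
    real_mean_estimate += d_lr * (sample_real() - real_mean_estimate)
    # 2) Train the generator: move theta so its fakes land near the
    #    discriminator's estimate, i.e. fool it.
    z = random.uniform(-0.5, 0.5)
    fake = theta + z
    theta += g_lr * (real_mean_estimate - fake)

print(round(theta, 1))  # theta ends up close to 5: fakes resemble real data
```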
258
00:24:18,570 --> 00:24:24,299
I trained a generator in the way I showed
you on the previous slide, and then I just
259
00:24:24,299 --> 00:24:30,110
took some random numbers. I put those
random numbers into the network and just
260
00:24:30,110 --> 00:24:35,960
stored the image of what came out of the
network. And here in the video you'll see
261
00:24:35,960 --> 00:24:43,090
how the network improved with ongoing
training. You will see that we start
262
00:24:43,090 --> 00:24:50,179
basically with just noisy images … and
then after some – what we call epochs,
263
00:24:50,179 --> 00:24:55,919
so training iterations – the network is
able to almost perfectly generate
264
00:24:55,919 --> 00:25:05,679
handwritten digits just from noise. Which
I find truly fascinating. Of course this
265
00:25:05,679 --> 00:25:11,270
is an example where it works. It highly
depends on your data set and how you train
266
00:25:11,270 --> 00:25:15,600
the model whether it is a success or not.
But if it works, you can use it to
267
00:25:15,600 --> 00:25:22,559
generate fonts. You can generate
characters, 3D objects, pictures of
268
00:25:22,559 --> 00:25:28,700
animals, whatever you want as long as you
have training data. Let's go more crazy.
269
00:25:28,700 --> 00:25:34,539
Let's take two of those and let's say we
have pictures of horses and pictures of
270
00:25:34,539 --> 00:25:41,150
zebras. I want to convert those pictures
of horses into pictures of zebras, and I
271
00:25:41,150 --> 00:25:44,590
want to convert pictures of zebras into
pictures of horses. So I want to have the
272
00:25:44,590 --> 00:25:49,690
same picture just with the other animal.
But I don't have training data of the same
273
00:25:49,690 --> 00:25:56,270
situation just once with a horse and once
with a zebra. Doesn't matter. We can train
274
00:25:56,270 --> 00:26:00,650
a network that does that for us. Again we
just have a network – we call it the
275
00:26:00,650 --> 00:26:05,730
generator – and we have two of those: One
that converts horses to zebras and one
276
00:26:05,730 --> 00:26:14,840
that converts zebras to horses. And then
we also have two discriminators that tell
277
00:26:14,840 --> 00:26:21,150
us: real horse – fake horse – real zebra –
fake zebra. And then we again need to
278
00:26:21,150 --> 00:26:27,210
perform some training. So we need to
somehow encode: Did what we wanted
279
00:26:27,210 --> 00:26:31,460
to do? And a very simple way to do this is
we take a picture of a horse, put it
280
00:26:31,460 --> 00:26:35,470
through the generator that generates a
zebra. Take this fake picture of a zebra,
281
00:26:35,470 --> 00:26:39,340
put it through the generator that
generates a picture of a horse. And if
282
00:26:39,340 --> 00:26:43,700
this is the same picture as we put in,
then our model worked. And if it didn't,
283
00:26:43,700 --> 00:26:48,549
we can use that information to update the
weights. I just took a random picture,
284
00:26:48,549 --> 00:26:54,460
from a free library in the Internet, of a
horse and generated a zebra and it worked
285
00:26:54,460 --> 00:26:59,470
remarkably well. I actually didn't even do
training. It also doesn't need to be a
286
00:26:59,470 --> 00:27:03,120
picture. You can also convert text to
images: You describe something in words
287
00:27:03,120 --> 00:27:09,570
and generate images. You can age your face
or age a cell; or make a patient healthy
288
00:27:09,570 --> 00:27:15,510
or sick – or the image of a patient, not
the patient themselves, unfortunately. You can
289
00:27:15,510 --> 00:27:20,690
do style transfer like take a picture of
Van Gogh and apply it to your own picture.
290
00:27:20,690 --> 00:27:27,559
Stuff like that. Something else that we
can do with neural networks. Let's assume
291
00:27:27,559 --> 00:27:31,030
we have a classification network, we have
a picture of a toothbrush and the network
292
00:27:31,030 --> 00:27:36,770
tells us: Well, this is a toothbrush.
Great! But how resilient is this network?
293
00:27:36,770 --> 00:27:44,530
Does it really work in every scenario?
There's a second network we can apply: We
294
00:27:44,530 --> 00:27:48,701
call it an adversarial network. And that
network is trained to do one thing: Look
295
00:27:48,701 --> 00:27:52,289
at the network, look at the picture, and
then find the one weak spot in the
296
00:27:52,289 --> 00:27:55,880
picture: Just change one pixel slightly so
that the network will tell me this
297
00:27:55,880 --> 00:28:03,600
toothbrush is an octopus. Works remarkably
well. Also works with just changing the
298
00:28:03,600 --> 00:28:08,940
picture slightly, so changing all the
pixels, but just slight minute changes
299
00:28:08,940 --> 00:28:12,860
that we don't perceive, but the network –
the classification network – is completely
300
00:28:12,860 --> 00:28:19,640
thrown off. Well, that sounds bad. It is bad if you
don't consider it. But you can also for
301
00:28:19,640 --> 00:28:24,200
example use this for training your network
and make your network resilient. So
302
00:28:24,200 --> 00:28:28,460
there's always an upside and downside.
Something entirely else: Now I'd like to
303
00:28:28,460 --> 00:28:32,880
show you something about text. A
word-language model. I want to generate
304
00:28:32,880 --> 00:28:38,101
sentences for my podcast. I have a network
that gives me a word, and then if I want
305
00:28:38,101 --> 00:28:42,640
to somehow get the next word in the
sentence, I also need to consider this
306
00:28:42,640 --> 00:28:47,070
word. So another network architecture –
quite interestingly – just takes the
307
00:28:47,070 --> 00:28:52,179
hidden states of the network and uses them
as the input for the same network so that
308
00:28:52,179 --> 00:28:58,780
in the next iteration we still know what
we did in the previous step. I tried to
309
00:28:58,780 --> 00:29:04,730
train a network that generates podcast
episodes for my podcast. Didn't work.
310
00:29:04,730 --> 00:29:08,450
What I learned is I don't have enough
training data. I really need to produce
311
00:29:08,450 --> 00:29:15,790
more podcast episodes in order to train a
model to do my job for me. And this is
312
00:29:15,790 --> 00:29:21,539
very important, a very crucial point:
Training data. We need shitloads of
313
00:29:21,539 --> 00:29:26,081
training data. And actually the more
complicated our model and our training
314
00:29:26,081 --> 00:29:30,990
process becomes, the more training data we
need. I started with a supervised case –
315
00:29:30,990 --> 00:29:35,990
the really simple case where we
316
00:29:35,990 --> 00:29:40,660
have a picture and a label that
corresponds to that picture; or a
317
00:29:40,660 --> 00:29:46,280
representation of that picture showing
entirely what I wanted to learn. But we
318
00:29:46,280 --> 00:29:51,909
also saw a more complex task, where I had
two pictures – horses and zebras – that are
319
00:29:51,909 --> 00:29:56,400
from two different domains – but domains
with no direct mapping. What can also
320
00:29:56,400 --> 00:30:01,020
happen – and actually happens quite a lot
– is weakly annotated data, so data that
321
00:30:01,020 --> 00:30:08,750
is not precisely annotated; where we can't
rely on the information we get. Or even
322
00:30:08,750 --> 00:30:13,050
more complicated: Something called
reinforcement learning where we perform a
323
00:30:13,050 --> 00:30:19,380
sequence of actions and then in the end
are told "yeah that was great". Which is
324
00:30:19,380 --> 00:30:24,080
often not enough information to really
perform proper training. But of course
325
00:30:24,080 --> 00:30:28,190
there are also methods for that. As well
as there are methods for the unsupervised
326
00:30:28,190 --> 00:30:33,590
case where we don't have annotations,
labeled data – no ground truth at all –
327
00:30:33,590 --> 00:30:41,241
just the picture itself. Well I talked
about pictures. I told you that we can
328
00:30:41,241 --> 00:30:45,320
learn features and create images from
them. And we can use them for
329
00:30:45,320 --> 00:30:51,640
classification. And for this there exist
many databases. There are public data sets
330
00:30:51,640 --> 00:30:56,659
we can use. Often they refer to, for
example, Flickr. They're just hyperlinks,
331
00:30:56,659 --> 00:31:00,960
which is also why I didn't show you many
pictures right here, because I am honestly
332
00:31:00,960 --> 00:31:05,690
not sure about the copyright in those
cases. But there are also challenge
333
00:31:05,690 --> 00:31:11,190
datasets where you can just sign up, get
some for example medical data sets, and
334
00:31:11,190 --> 00:31:16,650
then compete against other researchers.
And of course there are those companies
335
00:31:16,650 --> 00:31:22,090
that just have lots of data. And those
companies also have the means, the
336
00:31:22,090 --> 00:31:28,110
capacity to perform intense computations.
And those are also often the companies you
337
00:31:28,110 --> 00:31:36,179
hear from in terms of innovation for deep
learning. Well this was mostly to tell you
338
00:31:36,179 --> 00:31:40,200
that you can process images quite well
with deep learning if you have enough
339
00:31:40,200 --> 00:31:46,029
training data, if you have a proper
training process and also a little if you
340
00:31:46,029 --> 00:31:52,090
know what you're doing. But you can also
process text, you can process audio and
341
00:31:52,090 --> 00:31:58,520
time series like prices or a stock
exchange – stuff like that. You can
342
00:31:58,520 --> 00:32:02,929
process almost everything if you make it
encodable to your network. Sounds like a
343
00:32:02,929 --> 00:32:08,120
dream come true. But – as I already told
you – you need data, a lot of it. I told
344
00:32:08,120 --> 00:32:14,020
you about those companies that have lots
of data sets and the publicly available
345
00:32:14,020 --> 00:32:21,370
data sets which you can actually use to
get started with your own experiments. But
346
00:32:21,370 --> 00:32:24,309
that also makes it a little dangerous
because deep learning still is a black box
347
00:32:24,309 --> 00:32:30,820
to us. I told you what happens inside the
black box on a level that teaches you how
348
00:32:30,820 --> 00:32:36,529
we learn and how the network is
structured, but not really what the
349
00:32:36,529 --> 00:32:42,831
network learned. It is for us computer
vision engineers really nice that we can
350
00:32:42,831 --> 00:32:48,590
visualize the first layers of a neural
network and see what is actually encoded
351
00:32:48,590 --> 00:32:53,950
in those first layers; what information
the network looks at. But you can't really
352
00:32:53,950 --> 00:32:59,059
mathematically prove what happens in a
network. Which is one major downside. And
353
00:32:59,059 --> 00:33:02,150
so if you want to use it, the numbers may
be really great but be sure to properly
354
00:33:02,150 --> 00:33:08,059
evaluate them. In summary I call that
"easy to learn". Every one – every single
355
00:33:08,059 --> 00:33:12,679
one of you – can just start with deep
learning right away. You don't need to do
356
00:33:12,679 --> 00:33:19,440
much work. You don't need to do much
learning. The model learns for you. But
357
00:33:19,440 --> 00:33:23,770
they're hard to master in a way that makes
them useful for production use cases for
358
00:33:23,770 --> 00:33:29,900
example. So if you want to use deep
learning for something – if you really
359
00:33:29,900 --> 00:33:34,299
want to seriously use it –, make sure that
it really does what you wanted to and
360
00:33:34,299 --> 00:33:38,900
doesn't learn something else – which also
happens. Pretty sure you saw some talks
361
00:33:38,900 --> 00:33:43,670
about deep learning fails – which is not
what this talk is about. They're quite
362
00:33:43,670 --> 00:33:47,370
funny to look at. Just make sure that they
don't happen to you! If you do that
363
00:33:47,370 --> 00:33:53,300
though, you'll achieve great things with
deep learning, I'm sure. And that was
364
00:33:53,300 --> 00:34:00,740
introduction to deep learning. Thank you!
Applause
365
00:34:09,172 --> 00:34:13,449
Herald Angel: So now it's question and
answer time. So if you have a question,
366
00:34:13,449 --> 00:34:19,110
please line up at the mikes. We have in
total eight, so it shouldn't be far from
367
00:34:19,110 --> 00:34:26,139
you. They are here in the corridors and on
these sides. Please line up! For
368
00:34:26,139 --> 00:34:31,540
everybody: A question consists of one
sentence with a question mark at the end
369
00:34:31,540 --> 00:34:38,449
– not three minutes of rambling. And also
if you go to the microphone, speak into
370
00:34:38,449 --> 00:34:53,889
the microphone, so you really get close to
it. Okay. Where do we have … Number 7!
371
00:34:53,889 --> 00:35:02,200
We start with mic number 7:
Question: Hello. My question is: How did
372
00:35:02,200 --> 00:35:13,020
you compute the example for the fonts, the
numbers? I didn't really understand it,
373
00:35:13,020 --> 00:35:19,770
you just said it was made from white
noise.
374
00:35:19,770 --> 00:35:25,580
Teubi: I'll give you a really brief recap
of what I did. I showed you that we have a
375
00:35:25,580 --> 00:35:31,140
model that maps an image to some meaningful
values, that an image can be encoded in
376
00:35:31,140 --> 00:35:36,860
just a few values. What happens here is
exactly the other way round. We have some
377
00:35:36,860 --> 00:35:43,270
values, just some arbitrary values we
actually know nothing about. We can
378
00:35:43,270 --> 00:35:47,480
generate pictures out of those. So I
trained this model to just take some
379
00:35:47,480 --> 00:35:54,560
random values and show the pictures
generated from the model. The training
380
00:35:54,560 --> 00:36:03,320
process was this "min max game", as it's
called. We have two networks that try to
381
00:36:03,320 --> 00:36:08,260
compete against each other. One network
trying to distinguish, whether a picture
382
00:36:08,260 --> 00:36:12,790
it sees is real or one of those fake
pictures, and the network that actually
383
00:36:12,790 --> 00:36:18,510
generates those pictures and in training
the network that is able to distinguish
384
00:36:18,510 --> 00:36:24,599
between those, we can also get information
for the training of the network that
385
00:36:24,599 --> 00:36:30,410
generates the pictures. So the videos you
saw were just animations of what happens
386
00:36:30,410 --> 00:36:36,440
during this training process. At first if
we input noise we get noise. But as the
387
00:36:36,440 --> 00:36:41,510
network is able to better and better
recreate those images from the dataset we
388
00:36:41,510 --> 00:36:47,390
used as input, in this case pictures of
handwritten digits, the output also became
389
00:36:47,390 --> 00:36:54,660
more and more like those handwritten
digits. Hope that helped.
390
00:36:54,660 --> 00:37:06,590
Herald Angel: Now we go to the
Internet. – Can we get sound for the signal
391
00:37:06,590 --> 00:37:10,040
Angel, please? Teubi: Sounded so great,
"now we go to the Internet."
392
00:37:10,040 --> 00:37:11,040
Herald Angel: Yeah, that sounds like
"yeeaah".
393
00:37:11,040 --> 00:37:13,040
Signal Angel: And now we're finally ready
to go to the interwebs. "Schorsch" is
394
00:37:13,040 --> 00:37:18,040
asking: Do you have any recommendations
for a beginner regarding the framework or
395
00:37:18,040 --> 00:37:26,460
the software?
Teubi: I, of course, am very biased to
396
00:37:26,460 --> 00:37:34,150
recommend what I use everyday. But I also
think that it is a great start. Basically,
397
00:37:34,150 --> 00:37:40,210
use Python and use PyTorch. Many people
will disagree with me and tell you
398
00:37:40,210 --> 00:37:45,930
"TensorFlow is better." It might be, in my
opinion not for getting started, and there
399
00:37:45,930 --> 00:37:51,560
are also some nice tutorials on the
PyTorch website. What you can also do is
400
00:37:51,560 --> 00:37:57,200
look at websites like OpenAI, where they
have a gym to get you started with some
401
00:37:57,200 --> 00:38:02,371
training exercises, where you already have
datasets. Yeah, basically my
402
00:38:02,371 --> 00:38:08,600
recommendation is get used to Python and
start with a PyTorch tutorial, see where
403
00:38:08,600 --> 00:38:13,590
to go from there. Often there are also some
GitHub repositories linked with many
404
00:38:13,590 --> 00:38:18,740
examples for already established network
architectures like the CycleGAN or the
405
00:38:18,740 --> 00:38:26,250
GAN itself or basically everything else.
There will be a repo you can use to get
406
00:38:26,250 --> 00:38:29,940
started.
Herald Angel: OK, we stay with the
407
00:38:29,940 --> 00:38:32,589
internet. There's some more questions, I
heard.
408
00:38:32,589 --> 00:38:37,920
Signal Angel: Yes. Rubin8 is asking: Have
you ever come across an example
409
00:38:37,920 --> 00:38:42,580
of a neural network that deals with audio
instead of images?
410
00:38:42,580 --> 00:38:49,410
Teubi: Me personally, no. At least not
directly. I've heard about examples, like
411
00:38:49,410 --> 00:38:54,859
where you can change the voice to sound
like another person, but there is not much
412
00:38:54,859 --> 00:38:59,980
I can reliably tell about that. My
expertise really is in image processing,
413
00:38:59,980 --> 00:39:05,550
I'm sorry.
Herald Angel: And I think we have time for
414
00:39:05,550 --> 00:39:12,340
one more question. We have one at number
8. Microphone number 8.
415
00:39:12,340 --> 00:39:20,730
Question: Is the current face recognition
technology in, for example, the iPhone X – is
416
00:39:20,730 --> 00:39:26,420
it also a deep learning algorithm or is
it something more simple? Do you have any
417
00:39:26,420 --> 00:39:31,880
idea about that?
Teubi: As far as I know, yes. That's all I
418
00:39:31,880 --> 00:39:38,630
can reliably tell you about that, but it
is not only based on images but also uses
419
00:39:38,630 --> 00:39:45,420
other information. I think distance
information encoded with some infrared
420
00:39:45,420 --> 00:39:50,599
signals. I don't really know exactly how
it works, but at least iPhones already
421
00:39:50,599 --> 00:39:56,000
have a neural network
processing engine built in, so a chip
422
00:39:56,000 --> 00:40:01,190
dedicated to just doing those
computations. You saw that many of those
423
00:40:01,190 --> 00:40:05,820
things can be parallelized, and this is
what those hardware architectures make use
424
00:40:05,820 --> 00:40:10,380
of. So I'm pretty confident in saying,
yes, they also do it there.
425
00:40:10,380 --> 00:40:12,786
How exactly, no clue.
426
00:40:13,760 --> 00:40:15,323
Herald Angel: OK. I myself have a last
427
00:40:15,390 --> 00:40:20,680
completely unrelated question: Did you
create the design of the slides yourself?
428
00:40:20,680 --> 00:40:29,060
Teubi: I had some help. We have a really
great Congress design and I use that as an
429
00:40:29,060 --> 00:40:32,790
inspiration to create those slides, yes.
430
00:40:32,790 --> 00:40:36,760
Herald Angel: OK, yeah, because those are really amazing. I love them.
431
00:40:36,760 --> 00:40:38,140
Teubi: Thank you!
432
00:40:38,470 --> 00:40:41,200
Herald Angel: OK, thank you very much
Teubi.
433
00:40:45,130 --> 00:40:48,900
35C3 outro music
434
00:40:48,900 --> 00:41:07,000
subtitles created by c3subtitles.de
in the year 2019. Join, and help us!