WEBVTT
00:00:00.000 --> 00:00:18.229
35C3 preroll music
00:00:18.229 --> 00:00:24.750
Herald Angel: Welcome to our introduction
to deep learning with Teubi. Deep
00:00:24.750 --> 00:00:30.247
learning, also often called machine
learning, is a hype word which we hear in
00:00:30.247 --> 00:00:37.152
the media all the time. It's nearly as bad
as blockchain. It's a solution for
00:00:37.152 --> 00:00:43.249
everything. Today we'll get a sneak peek
into the internals of this mystical black
00:00:43.249 --> 00:00:48.820
box they are talking about. And Teubi
will show us why people who know what
00:00:48.820 --> 00:00:53.040
machine learning really is about have to
facepalm so often when they read the
00:00:53.040 --> 00:00:58.715
news. So please welcome Teubi
with a big round of applause!
00:00:58.715 --> 00:01:10.245
Applause
Teubi: Alright! Good morning and welcome
00:01:10.245 --> 00:01:14.470
to Introduction to Deep Learning. The
title will already tell you what this talk
00:01:14.470 --> 00:01:19.920
is about. I want to give you an
introduction to how deep learning works,
00:01:19.920 --> 00:01:27.090
what happens inside this black box. But,
first of all, who am I? I'm Teubi. It's a
00:01:27.090 --> 00:01:32.280
German nickname, it has nothing to do with
toys or bees. You might have heard my
00:01:32.280 --> 00:01:36.480
voice before, because I host the
Nussschale podcast. There I explain
00:01:36.480 --> 00:01:41.560
scientific topics in under 10 minutes.
I'll have to use a little more time today,
00:01:41.560 --> 00:01:46.850
and you'll also have fancy animations
which hopefully will help. In my day job
00:01:46.850 --> 00:01:52.540
I'm a research scientist at an institute
for computer vision. I analyze microscopy
00:01:52.540 --> 00:01:58.240
images of bone marrow blood cells and try
to find ways to teach the computer to
00:01:58.240 --> 00:02:04.660
understand what it sees. Namely, to
differentiate between certain cells or,
00:02:04.660 --> 00:02:09.449
first of all, find cells in an image,
which is a task that is more complex than
00:02:09.449 --> 00:02:17.180
it might sound. Let me start with the
introduction to deep learning. We all know
00:02:17.180 --> 00:02:22.769
how to code. We code in a very simple way.
We have some input for our computer
00:02:22.769 --> 00:02:27.618
algorithm. Then we have an algorithm which
says: Do this, do that. If this, then
00:02:28.510 --> 00:02:28.906
that. And in that way we generate some
output. This is not how machine learning
00:02:29.495 --> 00:02:30.754
works. Machine learning assumes you have
some input, and you also have some output.
00:02:40.810 --> 00:02:46.180
And what you also have is some statistical
model. This statistical model is flexible.
00:02:46.180 --> 00:02:51.549
It has certain parameters, which it can
learn from the distribution of inputs and
00:02:51.549 --> 00:02:57.430
outputs you give it for training. So you
basically teach the statistical model to
00:02:57.430 --> 00:03:03.659
generate the desired output from the given
input. Let me give you a really simple
00:03:03.659 --> 00:03:09.980
example of how this might work. Let's say
we have two animals. Well, we have two
00:03:09.980 --> 00:03:15.689
kinds of animals: unicorns and rabbits.
And now we want to find an algorithm that
00:03:15.689 --> 00:03:24.270
tells us whether this animal we have right
now as an input is a rabbit or a unicorn.
00:03:24.270 --> 00:03:28.230
We can write a simple algorithm to do
that, but we can also do it with machine
00:03:28.230 --> 00:03:34.590
learning. The first thing we need is some
input. I choose two features that are able
00:03:34.590 --> 00:03:42.269
to tell me whether this animal is a rabbit
or a unicorn. Namely, speed and size. We
00:03:42.269 --> 00:03:46.859
call these features, and they describe
something about what we want to classify.
00:03:46.859 --> 00:03:52.409
And the class is in this case our animal.
First thing I need is some training data,
00:03:52.409 --> 00:03:59.170
some input. The input here are just pairs
of speed and size. What I also need is
00:03:59.170 --> 00:04:04.129
information about the desired output. The
desired output, of course, being the
00:04:04.129 --> 00:04:12.999
class. So either unicorn or rabbit, here
denoted by yellow and red X's. So let's
00:04:12.999 --> 00:04:18.298
try to find a statistical model which we
can use to separate this feature space
00:04:18.298 --> 00:04:24.150
into two halves: One for the rabbits, one
for the unicorns. Looking at this, we can
00:04:24.150 --> 00:04:28.660
actually find a really simple statistical
model, and our statistical model in this
00:04:28.660 --> 00:04:34.390
case is just a straight line. And the
learning process is then to find where in
00:04:34.390 --> 00:04:41.080
this feature space the line should be.
Ideally, for example, here. Right in the
00:04:41.080 --> 00:04:45.220
middle between the two classes rabbit and
unicorn. Of course this is an overly
00:04:45.220 --> 00:04:50.370
simplified example. Real-world
applications have feature distributions
00:04:50.370 --> 00:04:56.080
which look much more like this. So, we
have a gradient, we don't have a perfect
00:04:56.080 --> 00:05:00.130
separation between those two classes, and
those two classes are definitely not
00:05:00.130 --> 00:05:05.560
separable by a line. If we look again at
some training samples — training samples
00:05:05.560 --> 00:05:11.730
are the data points we use for the machine
learning process, so, to try to find the
00:05:11.730 --> 00:05:17.540
parameters of our statistical model — if
we look at the line again, then this will
00:05:17.540 --> 00:05:23.000
not be able to separate this training set.
Well, we will have a line that has some
00:05:23.000 --> 00:05:27.320
errors, some unicorns which will be
classified as rabbits, some rabbits which
00:05:27.320 --> 00:05:33.070
will be classified as unicorns. This is
what we call underfitting. Our model is
00:05:33.070 --> 00:05:40.150
just not able to express what we want it
to learn. There is the opposite case. The
00:05:40.150 --> 00:05:45.510
opposite case being: we just learn all the
training samples by heart. This is if we
00:05:45.510 --> 00:05:50.020
have a very complex model and just a few
training samples to teach the model what
00:05:50.020 --> 00:05:55.120
it should learn. In this case we have a
perfect separation of unicorns and
00:05:55.120 --> 00:06:00.700
rabbits, at least for the few data points
we have. If we draw another example from
00:06:00.700 --> 00:06:07.300
the real world, some other data points,
they will most likely be wrong. And this
00:06:07.300 --> 00:06:11.380
is what we call overfitting. The perfect
scenario in this case would be something
00:06:11.380 --> 00:06:17.340
like this: a classifier which is really
close to the distribution we have in the
00:06:17.340 --> 00:06:23.350
real world and machine learning is tasked
with finding this perfect model and its
00:06:23.350 --> 00:06:28.960
parameters. Let me show you a different
kind of model, something you probably all
00:06:28.960 --> 00:06:35.670
have heard about: Neural networks. Neural
networks are inspired by the brain.
00:06:35.670 --> 00:06:41.210
Or more precisely, by the neurons in our
brain. Neurons are tiny objects, tiny
00:06:41.210 --> 00:06:47.250
cells in our brain that take some input
and generate some output. Sounds familiar,
00:06:47.250 --> 00:06:52.680
right? We have inputs usually in the form
of electrical signals. And if they are
00:06:52.680 --> 00:06:57.860
strong enough, this neuron will also send
out an electrical signal. And this is
00:06:57.860 --> 00:07:03.430
something we can model in a computer-
engineering way. So, what we do is: We
00:07:03.430 --> 00:07:09.240
take a neuron. The neuron is just a simple
mapping from input to output. Input here,
00:07:09.240 --> 00:07:17.200
just three input nodes. We denote them by
i1, i2 and i3 and output denoted by o. And
00:07:17.200 --> 00:07:20.840
now you will actually see some
mathematical equations. There are not many
00:07:20.840 --> 00:07:26.700
of these in this foundation talk, don't
worry, and it's really simple. There's one
00:07:26.700 --> 00:07:30.250
more thing we need first, though, if we
want to map input to output in the way a
00:07:30.250 --> 00:07:35.490
neuron does. Namely, the weights. The
weights are just some arbitrary numbers
00:07:35.490 --> 00:07:43.020
for now. Let's call them w1, w2 and w3.
So, we take those weights and we multiply
00:07:43.020 --> 00:07:51.360
them with the input. Input1 times weight1,
input2 times weight2, and so on. And this
00:07:51.360 --> 00:07:57.550
sum will just be our output. Well,
not quite. We make it a little bit more
00:07:57.550 --> 00:08:02.430
complicated. We also use something called
an activation function. The activation
00:08:02.430 --> 00:08:08.520
function is just a mapping from one scalar
value to another scalar value. In this
00:08:08.520 --> 00:08:14.280
case from what we got as an output, the
sum, to something that more closely fits
00:08:14.280 --> 00:08:19.360
what we need. This could for example be
something binary, where we have all the
00:08:19.360 --> 00:08:23.780
negative numbers being mapped to zero and
all the positive numbers being mapped to
00:08:23.780 --> 00:08:30.910
one. And then this zero and one can encode
something. For example: rabbit or unicorn.
00:08:30.910 --> 00:08:35.309
So, let me give you an example of how we
can make the previous example with the
00:08:35.309 --> 00:08:41.729
rabbits and unicorns work with such a
simple neuron. We just use speed, size,
00:08:41.729 --> 00:08:49.650
and the arbitrarily chosen number 10 as
our inputs and the weights 1, 1, and -1.
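This neuron can be written down in a few lines of Python. A minimal sketch (the function names are my own; the step activation maps negative sums to 0 and positive sums to 1, as described):

```python
def step(x):
    # binary activation: negative sums -> 0, positive sums -> 1
    return 1 if x > 0 else 0

def neuron(inputs, weights):
    # weighted sum of inputs, passed through the activation
    total = sum(i * w for i, w in zip(inputs, weights))
    return step(total)

def classify(speed, size):
    # inputs: speed, size and the constant 10; weights: 1, 1, -1
    # so the neuron outputs 1 exactly when speed + size > 10
    return neuron([speed, size, 10], [1, 1, -1])
```

Here classify(8, 6) returns 1 and classify(2, 3) returns 0: the two sides of the separating line speed + size = 10. Which side means rabbit and which means unicorn is just a matter of how we read the 0 and the 1.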
00:08:49.650 --> 00:08:54.400
If we look at the equations, then we get
for our negative numbers — so, speed plus
00:08:54.400 --> 00:09:01.440
size being less than 10 — a 0, and a 1 for
all positive numbers — being speed plus
00:09:01.440 --> 00:09:07.680
size greater than 10. This
way we again have a separating line
00:09:07.680 --> 00:09:14.600
between unicorns and rabbits. But again we
have this really simplistic model. We want
00:09:14.600 --> 00:09:21.529
to become more and more complicated in
order to express more complex tasks. So
00:09:21.529 --> 00:09:26.279
what do we do? We take more neurons. We
take our three input values and put them
00:09:26.279 --> 00:09:31.920
into one neuron, and into a second neuron,
and into a third neuron. And we take the
00:09:31.920 --> 00:09:38.330
output of those three neurons as input for
another neuron. We also call this a
00:09:38.330 --> 00:09:42.140
multilayer perceptron, perceptron just
being a different name for the neuron
00:09:42.140 --> 00:09:48.670
we have there. And the whole thing is also
called a neural network. So now the
00:09:48.670 --> 00:09:53.300
question: How do we train this? How do we
learn what this network should encode?
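The multilayer perceptron just described, three inputs feeding three hidden neurons whose outputs feed one output neuron, can be sketched like this. A sigmoid is used as the activation here; the talk used a binary step, and the sigmoid is a common smooth alternative. All weights and names are illustrative:

```python
import math

def sigmoid(x):
    # smooth activation mapping any number into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def neuron(inputs, weights):
    # weighted sum of inputs, passed through the activation
    return sigmoid(sum(i * w for i, w in zip(inputs, weights)))

def mlp(inputs, hidden_weights, output_weights):
    # layer 1: every input value goes into every hidden neuron
    hidden = [neuron(inputs, w) for w in hidden_weights]
    # layer 2: the three hidden outputs feed one output neuron
    return neuron(hidden, output_weights)
```

Training then means adjusting the numbers in hidden_weights and output_weights until the output matches the desired output.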
00:09:53.300 --> 00:09:57.620
Well, we want a mapping from input to
output, and what we can change are the
00:09:57.620 --> 00:10:02.880
weights. First, what we do is we take a
training sample, some input. Put it
00:10:02.880 --> 00:10:07.010
through the network, get an output. But
this might not be the desired output which
00:10:07.010 --> 00:10:13.570
we know. So, in the binary case there are
four possible cases: computed output,
00:10:13.570 --> 00:10:19.860
expected output, each with two values, 0 and 1.
The best case would be: we want a 0, get a
00:10:19.860 --> 00:10:27.120
0, want a 1 and get a 1. But there is also
the opposite case. In these two cases we
00:10:27.120 --> 00:10:31.440
can learn something about our model.
Namely, in which direction to change the
00:10:31.440 --> 00:10:37.270
weights. It's a little bit simplified, but
in principle you just raise the weights if
00:10:37.270 --> 00:10:41.250
you need a higher number as output and you
lower the weights if you need a lower
00:10:41.250 --> 00:10:47.350
number as output. To tell you how much, we
have two terms. First term being the
00:10:47.350 --> 00:10:53.110
error, so in this case just the difference
between desired and actual output – also
00:10:53.110 --> 00:10:56.890
often called a loss function, especially
in deep learning and more complex
00:10:56.890 --> 00:11:04.120
applications. You also have a second term
we call the learning rate, and the
00:11:04.120 --> 00:11:09.170
learning rate is what tells us how quickly
we should change the weights, how quickly
00:11:09.170 --> 00:11:14.890
we should adapt the weights. Okay, this is
how we learn a model. This is almost
00:11:14.890 --> 00:11:18.550
everything you need to know. There are
mathematical equations that tell you how
00:11:18.550 --> 00:11:23.770
much to change based on the error and the
learning rate. And this is the entire
00:11:23.770 --> 00:11:30.339
learning process. Let's get back to the
terminology. We have the input layer. We
00:11:30.339 --> 00:11:34.020
have the output layer, which somehow
encodes our output either in one value or
00:11:34.020 --> 00:11:39.650
in several values if we have
multiple classes. We also have
00:11:39.650 --> 00:11:45.930
the hidden layers, which are actually what
makes our model deep. What we can change,
00:11:45.930 --> 00:11:51.980
what we can learn, are the weights,
the parameters of this model. But what we
00:11:51.980 --> 00:11:55.490
also need to keep in mind, is the number
of layers, the number of neurons per
00:11:55.490 --> 00:11:59.590
layer, the learning rate, and the
activation function. These are called
00:11:59.590 --> 00:12:04.240
hyperparameters, and they determine how
complex our model is, how well it is
00:12:04.240 --> 00:12:09.970
suited to solve the task at hand. I quite
often spoke about solving tasks, so the
00:12:09.970 --> 00:12:14.630
question is: What can we actually do with
neural networks? Mostly classification
00:12:14.630 --> 00:12:19.560
tasks, for example: Tell me, is this
animal a rabbit or unicorn? Is this text
00:12:19.560 --> 00:12:24.690
message spam or legitimate? Is this
patient healthy or ill? Is this image a
00:12:24.690 --> 00:12:30.710
picture of a cat or a dog? We already saw
for the animal that we need something
00:12:30.710 --> 00:12:35.040
called features, which somehow encodes
information about what we want to
00:12:35.040 --> 00:12:39.530
classify, something we can use as input
for the neural network. Some kind of
00:12:39.530 --> 00:12:43.830
number that is meaningful. So, for the
animal it could be speed, size, or
00:12:43.830 --> 00:12:48.740
something like color. Color, of course,
being more complex again, because we have,
00:12:48.740 --> 00:12:55.940
for example, RGB, so three values. And
text message being a more complex case
00:12:55.940 --> 00:13:00.060
again, because we somehow need to encode
the sender, and whether the sender is
00:13:00.060 --> 00:13:04.770
legitimate. Same for the recipient, or the
number of hyperlinks, or where the
00:13:04.770 --> 00:13:11.400
hyperlinks refer to, or whether there
are certain words present in the text. It
00:13:11.400 --> 00:13:16.720
gets more and more complicated. Even more
so for a patient. How do we encode medical
00:13:16.720 --> 00:13:22.420
history in a proper way for the network to
learn? I mean, temperature is simple. It's
00:13:22.420 --> 00:13:26.750
a scalar value, we just have a number. But
how do we encode whether certain symptoms
00:13:26.750 --> 00:13:32.720
are present. And the image, which is
actually what I work with everyday, is
00:13:32.720 --> 00:13:38.350
again quite complex. We have values, we
have numbers, but only pixel values, which
00:13:38.350 --> 00:13:43.450
are difficult to
use as input for a neural network. Why?
00:13:43.450 --> 00:13:48.350
I'll show you. I'll actually show you with
this picture, it's a very famous picture,
00:13:48.350 --> 00:13:53.970
and everybody uses it in computer vision.
They will tell you, it's because there is
00:13:53.970 --> 00:14:01.010
a multitude of different characteristics
in this image: shapes, edges, whatever you
00:14:01.010 --> 00:14:07.080
desire. The truth is, it's a crop from the
centrefold of the Playboy, and in earlier
00:14:07.080 --> 00:14:12.070
years, computer vision engineers were a
mostly male audience. Anyway, let's take
00:14:12.070 --> 00:14:16.850
five by five pixels. Let's assume this is
a five-by-five-pixel, really small,
00:14:16.850 --> 00:14:22.230
image. If we take those 25 pixels and use
them as input for a neural network you
00:14:22.230 --> 00:14:26.730
already see that we have many connections
- many weights - which means a very
00:14:26.730 --> 00:14:32.540
complex model. Complex model, of course,
prone to overfitting. But there are more
00:14:32.540 --> 00:14:38.800
problems. First being, we have
disconnected a
00:14:38.800 --> 00:14:43.670
pixel from its neighbors. We can't encode
information about the neighborhood
00:14:43.670 --> 00:14:47.850
anymore, and that really sucks. If we just
take the whole picture, and move it to the
00:14:47.850 --> 00:14:52.790
left or to the right by just one pixel,
the network will see something completely
00:14:52.790 --> 00:14:58.470
different, even though to us it is exactly
the same. But, we can solve that with some
00:14:58.470 --> 00:15:03.400
very clever engineering, something we call
a convolutional layer. It is again a
00:15:03.400 --> 00:15:08.860
hidden layer in a neural network, but it
does something special. It actually is a
00:15:08.860 --> 00:15:13.970
very simple neuron again, just four input
values - one output value. But the four
00:15:13.970 --> 00:15:19.780
input values look at two by two pixels,
and encode one output value. And then the
00:15:19.780 --> 00:15:23.790
same network is shifted to the right, and
encodes another pixel, and another pixel,
00:15:23.790 --> 00:15:30.150
and the next row of pixels. And in this
way creates another 2D image. We have
00:15:30.150 --> 00:15:34.900
preserved information about the
neighborhood, and we just have a very low
00:15:34.900 --> 00:15:41.910
number of weights, not the huge number of
parameters we saw earlier. We can use this
00:15:41.910 --> 00:15:49.640
once, or twice, or several hundred times.
And this is actually where we go deep.
00:15:49.640 --> 00:15:54.920
Deep means: We have several layers, and
layers that don't need thousands or
00:15:54.920 --> 00:16:01.040
millions of connections, but only a few.
This is what allows us to go really deep.
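The convolutional layer described above, a single two-by-two neuron slid across the image, can be sketched in plain Python. The names are mine and real frameworks implement this far more efficiently, but the mechanics are the same:

```python
def conv2x2(image, kernel):
    # slide a 2x2 kernel over the image; each position yields one
    # output pixel from the weighted sum of a 2x2 neighborhood
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(rows - 1):
        out_row = []
        for c in range(cols - 1):
            s = (image[r][c]     * kernel[0][0] +
                 image[r][c + 1] * kernel[0][1] +
                 image[r + 1][c] * kernel[1][0] +
                 image[r + 1][c + 1] * kernel[1][1])
            out_row.append(s)
        out.append(out_row)
    return out
```

For a five-by-five image this produces a four-by-four image, and the layer has only the four kernel weights, no matter how large the input is. That is what keeps deep stacks of such layers affordable.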
00:16:01.040 --> 00:16:06.250
And in this fashion we can encode an
entire image in just a few meaningful
00:16:06.250 --> 00:16:11.480
values. What these values look like, and
what they encode, this is learned through
00:16:11.480 --> 00:16:18.240
the learning process. And we can then, for
example, use these few values as input for
00:16:18.240 --> 00:16:24.709
a classification network.
The fully connected network we saw earlier.
00:16:24.709 --> 00:16:29.560
Or we can do something more clever. We can
do the inverse operation and create an image
00:16:29.560 --> 00:16:35.170
again, for example, the same image, which
is then called an autoencoder.
00:16:35.170 --> 00:16:40.200
Autoencoders are tremendously useful, even
though they don't appear that way. For
00:16:40.200 --> 00:16:43.959
example, imagine you want to check whether
something has a defect, or not, a picture
00:16:43.959 --> 00:16:51.290
of a fabric, or of something. You just
train the network with normal pictures.
00:16:51.290 --> 00:16:56.770
And then, if you have a defect picture,
the network is not able to produce this
00:16:56.770 --> 00:17:02.149
defect. And so the difference between the
reproduced picture, and the real picture
00:17:02.149 --> 00:17:07.420
will show you where errors are. If it
works properly, I'll have to admit that.
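The defect-detection idea can be sketched as follows, assuming `autoencoder` is an already trained model that maps a 2D image (a list of pixel rows) to its reconstruction; the function name is hypothetical:

```python
def defect_map(image, autoencoder):
    # the autoencoder was trained on normal pictures only, so it
    # reproduces normal structure but fails to reproduce defects
    reconstruction = autoencoder(image)
    # per-pixel absolute difference: large values mark likely defects
    return [[abs(a - b) for a, b in zip(row_in, row_out)]
            for row_in, row_out in zip(image, reconstruction)]
```

Wherever the map is close to zero the network reproduced the picture; wherever it is large, the picture contains something the network never learned to reproduce.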
00:17:07.420 --> 00:17:12.569
But we can go even further. Let's say, we
want to encode something entirely else.
00:17:12.569 --> 00:17:17.400
Well, let's encode the image, the
information in the image, but in another
00:17:17.400 --> 00:17:21.859
representation. For example, let's say we
have three classes again. The background
00:17:21.859 --> 00:17:30.049
class in grey, a class called hat or
headwear in blue, and person in green. We
00:17:30.049 --> 00:17:34.309
can also use this for other applications
than just for pictures of humans. For
00:17:34.309 --> 00:17:38.370
example, we have a picture of a street and
want to encode: Where is the car, where's
00:17:38.370 --> 00:17:44.860
the pedestrian? Tremendously useful. Or we
have an MRI scan of a brain: Where in the
00:17:44.860 --> 00:17:51.110
brain is the tumor? Can we somehow learn
this? Yes we can do this, with methods
00:17:51.110 --> 00:17:57.480
like these, if they are trained properly.
More about that later. Well we expect
00:17:57.480 --> 00:18:01.020
something like this to come out but the
truth looks rather like this – especially
00:18:01.020 --> 00:18:05.870
if it's not properly trained. We do not get
the real shape we want, but
00:18:05.870 --> 00:18:11.980
something distorted. So here is again
where we need to do learning. First we
00:18:11.980 --> 00:18:15.790
take a picture, put it through the
network, get our output representation.
00:18:15.790 --> 00:18:21.110
And we have the information about how we
want it to look. We again compute some
00:18:21.110 --> 00:18:27.040
kind of loss value. This time for example
being the overlap between the shape we get
00:18:27.040 --> 00:18:34.040
out of the model and the shape we want to
have. And we use this error, this loss
00:18:34.040 --> 00:18:38.660
function, to update the weights of our
network. Again – even though it's more
00:18:38.660 --> 00:18:43.570
complicated here, even though we have more
layers, and even though the layers look
00:18:43.570 --> 00:18:48.640
slightly different – it is the same
process all over again as with a binary
00:18:48.640 --> 00:18:56.540
case. And we need lots of training data.
This is something that you'll hear often
00:18:56.540 --> 00:19:02.960
in connection with deep learning: You need
lots of training data to make this work.
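A common way to turn the overlap between the predicted shape and the desired shape into a loss value is the Dice score. The talk does not name a specific loss, so this is one standard choice, sketched for binary masks:

```python
def dice_loss(predicted, target):
    # predicted and target are binary masks (rows of 0/1 pixels);
    # the Dice score measures overlap: 1.0 = perfect, 0.0 = none
    p = [v for row in predicted for v in row]
    t = [v for row in target for v in row]
    intersection = sum(a * b for a, b in zip(p, t))
    total = sum(p) + sum(t)
    dice = 2.0 * intersection / total if total else 1.0
    # a loss should be small when the overlap is good
    return 1.0 - dice
```

Updating the weights to shrink this loss pushes the predicted shape towards the desired one.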
00:19:02.960 --> 00:19:10.100
Images are complex things and in order to
meaningfully extract knowledge from them,
00:19:10.100 --> 00:19:17.090
the network needs to see a multitude of
different images. Well now I already
00:19:17.090 --> 00:19:22.230
showed you some things we use in network
architecture, some support networks: The
00:19:22.230 --> 00:19:26.679
fully convolutional encoder, which takes
an image and produces a few meaningful
00:19:26.679 --> 00:19:33.110
values out of this image; its counterpart
the fully convolutional decoder – fully
00:19:33.110 --> 00:19:36.960
convolutional meaning by the way that we
only have these convolutional layers with
00:19:36.960 --> 00:19:42.980
a few parameters that somehow encode
spatial information and keep it for the
00:19:42.980 --> 00:19:49.360
next layers. The decoder takes a few
meaningful numbers and reproduces an image
00:19:49.360 --> 00:19:55.420
– either the same image or another
representation of the information encoded
00:19:55.420 --> 00:20:01.400
in the image. We also already saw the
fully connected network. Fully connected
00:20:01.400 --> 00:20:06.640
meaning every neuron is connected to every
neuron in the next layer. This of course
00:20:06.640 --> 00:20:12.570
can be dangerous because this is where we
actually get most of our parameters. If we
00:20:12.570 --> 00:20:16.390
have a fully connected network, this is
where the most parameters will be present
00:20:16.390 --> 00:20:21.580
because connecting every node to every
node … this is just a high number of
00:20:21.580 --> 00:20:25.860
connections. We can also do other things.
For example something called a pooling
00:20:25.860 --> 00:20:32.280
layer. A pooling layer being basically the
same as one of those convolutional layers,
00:20:32.280 --> 00:20:36.370
just that we don't have parameters we need
to learn. This works without parameters
00:20:36.370 --> 00:20:43.740
because this neuron just chooses whichever
value is the highest and takes that value
00:20:43.740 --> 00:20:49.600
as output. This is really great for
reducing the size of your image and also
00:20:49.600 --> 00:20:55.150
getting rid of information that might not
be that important. We can also do some
00:20:55.150 --> 00:20:59.890
clever techniques like adding a dropout
layer. A dropout layer just being a normal
00:20:59.890 --> 00:21:05.799
layer in a neural network where we remove
some connections: In one training step
00:21:05.799 --> 00:21:10.720
these connections, in the next training
step some other connections. This way we
00:21:10.720 --> 00:21:18.049
teach the other connections to become more
resilient against errors. I would like to
00:21:18.049 --> 00:21:22.750
start with something I call the "Model
Show" now, and show you some models and
00:21:22.750 --> 00:21:28.870
how we train those models. And I will
start with a fully convolutional decoder
00:21:28.870 --> 00:21:34.740
we saw earlier: This thing that takes a
number and creates a picture. I would like
00:21:34.740 --> 00:21:41.420
to take this model, put in some number and
get out a picture – a picture of a horse
00:21:41.420 --> 00:21:46.000
for example. If I put in a different
number I also want to get a picture of a
00:21:46.000 --> 00:21:52.390
horse, but of a different horse. So what I
want to get is a mapping from some
00:21:52.390 --> 00:21:56.730
numbers, some features that encode
something about the horse picture, and get
00:21:56.730 --> 00:22:03.450
a horse picture out of it. You might see
already why this is problematic. It is
00:22:03.450 --> 00:22:08.230
problematic because we don't have a
mapping from feature to horse or from
00:22:08.230 --> 00:22:15.050
horse to features. So we don't have a
truth value we can use to learn how to
00:22:15.050 --> 00:22:21.790
generate this mapping. Well computer
vision engineers – or deep learning
00:22:21.790 --> 00:22:26.800
professionals – they're smart and have
clever ideas. Let's just assume we have
00:22:26.800 --> 00:22:32.870
such a network and let's call it a
generator. Let's take some numbers, put
00:22:32.870 --> 00:22:39.240
them into the generator and get some
horses. Well it doesn't work yet. We still
00:22:39.240 --> 00:22:42.490
have to train it. So they're probably not
only horses but also some very special
00:22:42.490 --> 00:22:47.970
unicorns among the horses; which might be
nice for other applications, but I wanted
00:22:47.970 --> 00:22:55.480
pictures of horses right now. So I can't
train with this data directly. But what I
00:22:55.480 --> 00:23:01.600
can do is I can create a second network.
This network is called a discriminator and
00:23:01.600 --> 00:23:08.820
I can give it the input generated from the
generator as well as the real data I have:
00:23:08.820 --> 00:23:13.920
the real horse pictures. And then I can
teach the discriminator to distinguish
00:23:13.920 --> 00:23:22.080
between those. Tell me it is a real horse
or it's not a real horse. And there I know
00:23:22.080 --> 00:23:27.000
what is the truth because I either take
real horse pictures or fake horse pictures
00:23:27.000 --> 00:23:34.170
from the generator. So I have a truth
value for this discriminator. But in doing
00:23:34.170 --> 00:23:39.070
this I also have a truth value for the
generator. Because I want the generator to
00:23:39.070 --> 00:23:43.799
work against the discriminator. So I can
also use the information how well the
00:23:43.799 --> 00:23:51.010
discriminator does to train the generator
to become better at fooling it. This is
00:23:51.010 --> 00:23:57.470
called a generative adversarial network.
And it can be used to generate pictures of
00:23:57.470 --> 00:24:02.350
an arbitrary distribution. Let's do this
with numbers and I will actually show you
00:24:02.350 --> 00:24:07.590
the training process. Before I start the
video, I'll tell you what I did. I took
00:24:07.590 --> 00:24:11.550
some handwritten digits. There is a
database called "MNIST" of handwritten
00:24:11.550 --> 00:24:18.570
digits, so the digits 0 to 9. And I
took those and used them as training data.
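One training step of such a generative adversarial network can be sketched as follows. This is a structural illustration only: the generator and discriminator are passed in as plain functions, squared error stands in for the usual cross-entropy loss, and a real implementation would backpropagate both losses through the networks:

```python
import random

def gan_training_step(generator, discriminator, real_batch, noise_dim):
    # 1. the generator turns random noise into fake samples
    noise = [[random.random() for _ in range(noise_dim)]
             for _ in real_batch]
    fakes = [generator(z) for z in noise]

    # 2. the discriminator is trained with known labels:
    #    real samples -> 1, generated samples -> 0
    disc_loss = sum((discriminator(x) - 1) ** 2 for x in real_batch)
    disc_loss += sum(discriminator(x) ** 2 for x in fakes)

    # 3. the generator is trained to fool the discriminator,
    #    i.e. to push discriminator(fake) towards 1
    gen_loss = sum((discriminator(x) - 1) ** 2 for x in fakes)

    # a real implementation would now update both sets of weights
    return disc_loss / (2 * len(real_batch)), gen_loss / len(real_batch)
```

The two losses pull against each other: lowering gen_loss makes the fakes harder to spot, which in turn raises disc_loss until the discriminator adapts.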
00:24:18.570 --> 00:24:24.299
I trained a generator in the way I showed
you on the previous slide, and then I just
00:24:24.299 --> 00:24:30.110
took some random numbers. I put those
random numbers into the network and just
00:24:30.110 --> 00:24:35.960
stored the image of what came out of the
network. And here in the video you'll see
00:24:35.960 --> 00:24:43.090
how the network improved with ongoing
training. You will see that we start
00:24:43.090 --> 00:24:50.179
basically with just noisy images … and
then after some – what we call epochs,
00:24:50.179 --> 00:24:55.919
so training iterations – the network is
able to almost perfectly generate
00:24:55.919 --> 00:25:05.679
handwritten digits just from noise. Which
I find truly fascinating. Of course this
00:25:05.679 --> 00:25:11.270
is an example where it works. It highly
depends on your data set and how you train
00:25:11.270 --> 00:25:15.600
the model whether it is a success or not.
But if it works, you can use it to
00:25:15.600 --> 00:25:22.559
generate fonts. You can generate
characters, 3D objects, pictures of
00:25:22.559 --> 00:25:28.700
animals, whatever you want as long as you
have training data. Let's go more crazy.
00:25:28.700 --> 00:25:34.539
Let's take two of those and let's say we
have pictures of horses and pictures of
00:25:34.539 --> 00:25:41.150
zebras. I want to convert those pictures
of horses into pictures of zebras, and I
00:25:41.150 --> 00:25:44.590
want to convert pictures of zebras into
pictures of horses. So I want to have the
00:25:44.590 --> 00:25:49.690
same picture just with the other animal.
But I don't have training data of the same
00:25:49.690 --> 00:25:56.270
situation just once with a horse and once
with a zebra. Doesn't matter. We can train
00:25:56.270 --> 00:26:00.650
a network that does that for us. Again we
just have a network – we call it the
00:26:00.650 --> 00:26:05.730
generator – and we have two of those: One
that converts horses to zebras and one
00:26:05.730 --> 00:26:14.840
that converts zebras to horses. And then
we also have two discriminators that tell
00:26:14.840 --> 00:26:21.150
us: real horse – fake horse – real zebra –
fake zebra. And then we again need to
00:26:21.150 --> 00:26:27.210
perform some training. So we need to
somehow encode: Did what we wanted to do
00:26:27.210 --> 00:26:31.460
work? And a very simple way to do this is
we take a picture of a horse, put it
00:26:31.460 --> 00:26:35.470
through the generator that generates a
zebra. Take this fake picture of a zebra,
00:26:35.470 --> 00:26:39.340
put it through the generator that
generates a picture of a horse. And if
00:26:39.340 --> 00:26:43.700
this is the same picture as we put in,
then our model worked. And if it didn't,
00:26:43.700 --> 00:26:48.549
we can use that information to update the
weights. I just took a random picture,
00:26:48.549 --> 00:26:54.460
from a free library on the Internet, of a
horse and generated a zebra and it worked
00:26:54.460 --> 00:26:59.470
remarkably well. I actually didn't even do
training. It also doesn't need to be a
00:26:59.470 --> 00:27:03.120
picture. You can also convert text to
images: You describe something in words
00:27:03.120 --> 00:27:09.570
and generate images. You can age your face
or age a cell; or make a patient healthy
00:27:09.570 --> 00:27:15.510
or sick – or the image of a patient, not
the patient themselves, unfortunately. You can
00:27:15.510 --> 00:27:20.690
do style transfer like take a picture of
Van Gogh and apply it to your own picture.
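The horse-to-zebra-and-back check described above is known as a cycle-consistency loss. A sketch, assuming the two generators are trained models operating on 2D images (lists of pixel rows); the names are mine:

```python
def cycle_consistency_loss(horse_image, horse_to_zebra, zebra_to_horse):
    # convert the horse to a zebra and back again; if both
    # generators work, we should get the original picture back
    fake_zebra = horse_to_zebra(horse_image)
    reconstructed = zebra_to_horse(fake_zebra)
    # mean absolute pixel difference between original and round trip
    diffs = [abs(a - b)
             for row_a, row_b in zip(horse_image, reconstructed)
             for a, b in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)
```

This value is what replaces the missing paired training data: we never need the same scene photographed once with a horse and once with a zebra, only the requirement that the round trip reproduces the input.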
00:27:20.690 --> 00:27:27.559
Stuff like that. Something else that we
can do with neural networks. Let's assume
00:27:27.559 --> 00:27:31.030
we have a classification network, we have
a picture of a toothbrush and the network
00:27:31.030 --> 00:27:36.770
tells us: Well, this is a toothbrush.
Great! But how resilient is this network?
00:27:36.770 --> 00:27:44.530
Does it really work in every scenario?
There's a second network we can apply: We
00:27:44.530 --> 00:27:48.701
call it an adversarial network. And that
network is trained to do one thing: Look
00:27:48.701 --> 00:27:52.289
at the network, look at the picture, and
then find the one weak spot in the
00:27:52.289 --> 00:27:55.880
picture: Just change one pixel slightly so
that the network will tell me this
00:27:55.880 --> 00:28:03.600
toothbrush is an octopus. Works remarkably
well. Also works with just changing the
00:28:03.600 --> 00:28:08.940
picture slightly, so changing all the
pixels, but just slight minute changes
00:28:08.940 --> 00:28:12.860
that we don't perceive, but the network –
the classification network – is completely
00:28:12.860 --> 00:28:19.640
thrown off. Well, that sounds bad. It is bad if you
don't consider it. But you can also for
00:28:19.640 --> 00:28:24.200
example use this for training your network
and make your network resilient. So
00:28:24.200 --> 00:28:28.460
there's always an upside and downside.
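The adversarial trick just described can be illustrated with a toy example that assumes nothing from the talk: a hypothetical three-pixel "image", a tiny linear classifier with made-up weights, and a perturbation in the spirit of the fast gradient sign method. The step size is exaggerated so the flip is easy to see; in real attacks the changes are far below what we perceive.

```python
# Toy adversarial perturbation against a made-up linear "classifier".
# Nudging every input slightly in the direction that lowers the class
# score can flip the prediction even though each change is small.

WEIGHTS = [0.6, -0.4, 0.2]  # hypothetical classifier: score = w . x

def score(x):
    return sum(w * xi for w, xi in zip(WEIGHTS, x))

def predict(x):
    return "toothbrush" if score(x) > 0 else "octopus"

def perturb(x, epsilon):
    # step each "pixel" against the sign of the score's gradient,
    # which for a linear model is simply the sign of the weight
    return [xi - epsilon * (1 if w > 0 else -1) for w, xi in zip(WEIGHTS, x)]

x = [0.30, 0.20, 0.10]
print(predict(x))                  # toothbrush
x_adv = perturb(x, epsilon=0.15)
print(predict(x_adv))              # octopus -- flipped by small changes
```

Each pixel moved by at most epsilon, yet the classification flipped; the same principle, with much smaller steps spread over many pixels, is what throws off real classification networks.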
Something else entirely: Now I'd like to
00:28:28.460 --> 00:28:32.880
show you something about text: a word-based
language model. I want to generate
00:28:32.880 --> 00:28:38.101
sentences for my podcast. I have a network
that gives me a word, and then if I want
00:28:38.101 --> 00:28:42.640
to somehow get the next word in the
sentence, I also need to consider this
00:28:42.640 --> 00:28:47.070
word. So another network architecture –
quite interestingly – just takes the
00:28:47.070 --> 00:28:52.179
hidden states of the network and uses them
as the input for the same network so that
00:28:52.179 --> 00:28:58.780
in the next iteration we still know what
we did in the previous step. I tried to
00:28:58.780 --> 00:29:04.730
train a network that generates podcast
episodes for my podcasts. Didn't work.
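The feedback loop described above – feeding the hidden state back in as input for the next step – is the core of a recurrent network. It can be sketched with a single made-up scalar "neuron"; the weights here are arbitrary numbers for illustration, not a trained model.

```python
# Minimal sketch of a recurrent step: the hidden state is fed back in,
# so each step still "knows" what happened in the previous steps.
import math

W_IN, W_HIDDEN = 0.5, 0.8   # hypothetical weights, not trained values

def rnn_step(word_value, hidden):
    # the new hidden state mixes the current input word with the
    # previous hidden state, so earlier words keep influencing later ones
    return math.tanh(W_IN * word_value + W_HIDDEN * hidden)

hidden = 0.0
for word_value in [1.0, 0.0, 0.0]:  # toy encoding of a three-word sentence
    hidden = rnn_step(word_value, hidden)
    print(hidden)  # stays non-zero: the first word is still "remembered"
```

Even though the second and third inputs are zero, the hidden state stays positive, carrying information from the first word forward, which is exactly what a language model needs to pick a plausible next word.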
00:29:04.730 --> 00:29:08.450
What I learned is I don't have enough
training data. I really need to produce
00:29:08.450 --> 00:29:15.790
more podcast episodes in order to train a
model to do my job for me. And this is
00:29:15.790 --> 00:29:21.539
very important, a very crucial point:
Training data. We need shitloads of
00:29:21.539 --> 00:29:26.081
training data. And actually the more
complicated our model and our training
00:29:26.081 --> 00:29:30.990
process becomes, the more training data we
need. I started with a supervised case –
00:29:30.990 --> 00:29:35.990
the really simple case where we
00:29:35.990 --> 00:29:40.660
have a picture and a label that
corresponds to that picture; or a
00:29:40.660 --> 00:29:46.280
representation of that picture showing
entirely what I wanted to learn. But we
00:29:46.280 --> 00:29:51.909
also saw a more complex task, where I had
two sets of pictures – horses and zebras – that are
00:29:51.909 --> 00:29:56.400
from two different domains – but domains
with no direct mapping. What can also
00:29:56.400 --> 00:30:01.020
happen – and actually happens quite a lot
– is weakly annotated data, so data that
00:30:01.020 --> 00:30:08.750
is not precisely annotated; where we can't
rely on the information we get. Or even
00:30:08.750 --> 00:30:13.050
more complicated: Something called
reinforcement learning where we perform a
00:30:13.050 --> 00:30:19.380
sequence of actions and then in the end
are told "yeah that was great". Which is
00:30:19.380 --> 00:30:24.080
often not enough information to really
perform proper training. But of course
00:30:24.080 --> 00:30:28.190
there are also methods for that. As well
as there are methods for the unsupervised
00:30:28.190 --> 00:30:33.590
case where we don't have annotations,
labeled data – no ground truth at all –
00:30:33.590 --> 00:30:41.241
just the picture itself. Well I talked
about pictures. I told you that we can
00:30:41.241 --> 00:30:45.320
learn features and create images from
them. And we can use them for
00:30:45.320 --> 00:30:51.640
classification. And for this there exist
many databases. There are public data sets
00:30:51.640 --> 00:30:56.659
we can use. Often they just refer to, for
example, Flickr: they're just hyperlinks,
00:30:56.659 --> 00:31:00.960
which is also why I didn't show you many
pictures right here, because I am honestly
00:31:00.960 --> 00:31:05.690
not sure about the copyright in those
cases. But there are also challenge
00:31:05.690 --> 00:31:11.190
datasets where you can just sign up, get
some for example medical data sets, and
00:31:11.190 --> 00:31:16.650
then compete against other researchers.
And of course there are those companies
00:31:16.650 --> 00:31:22.090
that just have lots of data. And those
companies also have the means, the
00:31:22.090 --> 00:31:28.110
capacity to perform intense computations.
And those are also often the companies you
00:31:28.110 --> 00:31:36.179
hear from in terms of innovation for deep
learning. Well this was mostly to tell you
00:31:36.179 --> 00:31:40.200
that you can process images quite well
with deep learning if you have enough
00:31:40.200 --> 00:31:46.029
training data, if you have a proper
training process and also a little if you
00:31:46.029 --> 00:31:52.090
know what you're doing. But you can also
process text, you can process audio and
00:31:52.090 --> 00:31:58.520
time series like prices or a stock
exchange – stuff like that. You can
00:31:58.520 --> 00:32:02.929
process almost everything if you make it
encodable for your network. Sounds like a
00:32:02.929 --> 00:32:08.120
dream come true. But – as I already told
you – you need data, a lot of it. I told
00:32:08.120 --> 00:32:14.020
you about those companies that have lots
of data sets and the publicly available
00:32:14.020 --> 00:32:21.370
data sets which you can actually use to
get started with your own experiments. But
00:32:21.370 --> 00:32:24.309
that also makes it a little dangerous
because deep learning still is a black box
00:32:24.309 --> 00:32:30.820
to us. I told you what happens inside the
black box on a level that teaches you how
00:32:30.820 --> 00:32:36.529
we learn and how the network is
structured, but not really what the
00:32:36.529 --> 00:32:42.831
network learned. For us computer vision
engineers it is really nice that we can
00:32:42.831 --> 00:32:48.590
visualize the first layers of a neural
network and see what is actually encoded
00:32:48.590 --> 00:32:53.950
in those first layers; what information
the network looks at. But you can't really
00:32:53.950 --> 00:32:59.059
mathematically prove what happens in a
network. Which is one major downside. And
00:32:59.059 --> 00:33:02.150
so if you want to use it, the numbers may
be really great but be sure to properly
00:33:02.150 --> 00:33:08.059
evaluate them. In summary I call that
"easy to learn". Every one – every single
00:33:08.059 --> 00:33:12.679
one of you – can just start with deep
learning right away. You don't need to do
00:33:12.679 --> 00:33:19.440
much work. You don't need to do much
learning. The model learns for you. But
00:33:19.440 --> 00:33:23.770
they're hard to master in a way that makes
them useful for production use cases for
00:33:23.770 --> 00:33:29.900
example. So if you want to use deep
learning for something – if you really
00:33:29.900 --> 00:33:34.299
want to seriously use it –, make sure that
it really does what you wanted to and
00:33:34.299 --> 00:33:38.900
doesn't learn something else – which also
happens. Pretty sure you saw some talks
00:33:38.900 --> 00:33:43.670
about deep learning fails – which is not
what this talk is about. They're quite
00:33:43.670 --> 00:33:47.370
funny to look at. Just make sure that they
don't happen to you! If you do that
00:33:47.370 --> 00:33:53.300
though, you'll achieve great things with
deep learning, I'm sure. And that was
00:33:53.300 --> 00:34:00.740
introduction to deep learning. Thank you!
Applause
00:34:09.172 --> 00:34:13.449
Herald Angel: So now it's question and
answer time. So if you have a question,
00:34:13.449 --> 00:34:19.110
please line up at the mikes. We have in
total eight, so it shouldn't be far from
00:34:19.110 --> 00:34:26.139
you. They are here in the corridors and on
these sides. Please line up! For
00:34:26.139 --> 00:34:31.540
everybody: A question consists of one
sentence with the question mark in the end
00:34:31.540 --> 00:34:38.449
– not three minutes of rambling. And also
if you go to the microphone, speak into
00:34:38.449 --> 00:34:53.889
the microphone, so you really get close to
it. Okay. Where do we have … Number 7!
00:34:53.889 --> 00:35:02.200
We start with mic number 7:
Question: Hello. My question is: How did
00:35:02.200 --> 00:35:13.020
you compute the example for the fonts, the
numbers? I didn't really understand it,
00:35:13.020 --> 00:35:19.770
you just said it was made from white
noise.
00:35:19.770 --> 00:35:25.580
Teubi: I'll give you a really brief recap
of what I did. I showed you that we have a
00:35:25.580 --> 00:35:31.140
model that maps an image to some meaningful
values, that an image can be encoded in
00:35:31.140 --> 00:35:36.860
just a few values. What happens here is
exactly the other way round. We have some
00:35:36.860 --> 00:35:43.270
values, just some arbitrary values we
actually know nothing about. We can
00:35:43.270 --> 00:35:47.480
generate pictures out of those. So I
trained this model to just take some
00:35:47.480 --> 00:35:54.560
random values and show the pictures
generated from the model. The training
00:35:54.560 --> 00:36:03.320
process was this "min-max game", as it's
called. We have two networks that try to
00:36:03.320 --> 00:36:08.260
compete against each other. One network
trying to distinguish, whether a picture
00:36:08.260 --> 00:36:12.790
it sees is real or one of those fake
pictures, and the network that actually
00:36:12.790 --> 00:36:18.510
generates those pictures and in training
the network that is able to distinguish
00:36:18.510 --> 00:36:24.599
between those, we can also get information
for the training of the network that
00:36:24.599 --> 00:36:30.410
generates the pictures. So the videos you
saw were just animations of what happens
00:36:30.410 --> 00:36:36.440
during this training process. At first if
we input noise we get noise. But as the
00:36:36.440 --> 00:36:41.510
network is able to better and better
recreate those images from the dataset we
00:36:41.510 --> 00:36:47.390
used as input, in this case pictures of
handwritten digits, the output also became
00:36:47.390 --> 00:36:54.660
more and more like those handwritten
digits. Hope that helped.
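The "min-max game" in this answer can be caricatured in a few lines of Python. Everything here is a stand-in: the "generator" is a single number, the "discriminator" is a fixed hand-written scoring function (in real GAN training it is a second network trained simultaneously), and the update uses a numerical gradient instead of backpropagation.

```python
# Caricature of the generator's side of GAN training: climb the
# discriminator's "realness" score by gradient ascent.

theta = 0.0  # the toy "generator": it simply outputs this number

def discriminator(x):
    # hand-written critic: scores 1.0 at the "real" data value 5.0 and
    # falls off linearly; a real discriminator would be trained as well
    return max(0.0, 1.0 - abs(x - 5.0) / 5.0)

eps, lr = 0.01, 0.5
for step in range(100):
    # generator update: move theta toward a higher discriminator score
    # (numerical gradient instead of backpropagation, for the sketch)
    grad = (discriminator(theta + eps) - discriminator(theta - eps)) / (2 * eps)
    theta += lr * grad

print(round(theta, 2))  # close to 5.0, the value the critic calls "real"
```

The generator parameter climbs toward the value the critic scores as most "real" – the same dynamic that, in the animations, pushed the noise images toward handwritten digits.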
00:36:54.660 --> 00:37:06.590
Herald Angel: Now we go to the
Internet. – Can we get sound for the signal
00:37:06.590 --> 00:37:10.040
Angel, please? Teubi: Sounded so great,
"now we go to the Internet."
00:37:10.040 --> 00:37:11.040
Herald Angel: Yeah, that sounds like
"yeeaah".
00:37:11.040 --> 00:37:13.040
Signal Angel: And now we're finally ready
to go to the interwebs. "Schorsch" is
00:37:13.040 --> 00:37:18.040
asking: Do you have any recommendations
for a beginner regarding the framework or
00:37:18.040 --> 00:37:26.460
the software?
Teubi: I, of course, am very biased to
00:37:26.460 --> 00:37:34.150
recommend what I use everyday. But I also
think that it is a great start. Basically,
00:37:34.150 --> 00:37:40.210
use Python and use PyTorch. Many people
will disagree with me and tell you
00:37:40.210 --> 00:37:45.930
"tensorflow is better." It might be, in my
opinion not for getting started, and there
00:37:45.930 --> 00:37:51.560
are also some nice tutorials on the
pytorch website. What you can also do is
00:37:51.560 --> 00:37:57.200
look at websites like OpenAI, where they
have a gym to get you started with some
00:37:57.200 --> 00:38:02.371
training exercises, where you already have
datasets. Yeah, basically my
00:38:02.371 --> 00:38:08.600
recommendation is get used to Python and
start with a pytorch tutorial, see where
00:38:08.600 --> 00:38:13.590
to go from there. Often there are also some
GitHub repositories linked with many
00:38:13.590 --> 00:38:18.740
examples for already established network
architectures like the CycleGAN or the
00:38:18.740 --> 00:38:26.250
GAN itself or basically everything else.
There will be a repo you can use to get
00:38:26.250 --> 00:38:29.940
started.
Herald Angel: OK, we stay with the
00:38:29.940 --> 00:38:32.589
internet. There's some more questions, I
heard.
00:38:32.589 --> 00:38:37.920
Signal Angel: Yes. Rubin8 is asking: Have
you ever come across an example
00:38:37.920 --> 00:38:42.580
of a neural network that deals with audio
instead of images?
00:38:42.580 --> 00:38:49.410
Teubi: Me personally, no. At least not
directly. I've heard about examples, like
00:38:49.410 --> 00:38:54.859
where you can change the voice to sound
like another person, but there is not much
00:38:54.859 --> 00:38:59.980
I can reliably tell about that. My
expertise really is in image processing,
00:38:59.980 --> 00:39:05.550
I'm sorry.
Herald Angel: And I think we have time for
00:39:05.550 --> 00:39:12.340
one more question. We have one at number
8. Microphone number 8.
00:39:12.340 --> 00:39:20.730
Question: Is the current face recognition
technology in, for example, the iPhone X – is
00:39:20.730 --> 00:39:26.420
it also a deep learning algorithm or is
it something more simple? Do you have any
00:39:26.420 --> 00:39:31.880
idea about that?
Teubi: As far as I know, yes. That's all I
00:39:31.880 --> 00:39:38.630
can reliably tell you about that, but it
is not only based on images but also uses
00:39:38.630 --> 00:39:45.420
other information. I think distance
information encoded with some infrared
00:39:45.420 --> 00:39:50.599
signals. I don't really know exactly how
it works, but at least iPhones already
00:39:50.599 --> 00:39:56.000
have a neural network
processing engine built in, so a chip
00:39:56.000 --> 00:40:01.190
dedicated to just doing those
computations. You saw that many of those
00:40:01.190 --> 00:40:05.820
things can be parallelized, and this is
what those hardware architectures make use
00:40:05.820 --> 00:40:10.380
of. So I'm pretty confident in saying,
yes, they also do it there.
00:40:10.380 --> 00:40:12.786
How exactly, no clue.
00:40:13.760 --> 00:40:15.323
Herald Angel: OK. I myself have a last
00:40:15.390 --> 00:40:20.680
completely unrelated question: Did you
create the design of the slides yourself?
00:40:20.680 --> 00:40:29.060
Teubi: I had some help. We have a really
great Congress design and I use that as an
00:40:29.060 --> 00:40:32.790
inspiration to create those slides, yes.
00:40:32.790 --> 00:40:36.760
Herald Angel: OK, yeah, because those are really amazing. I love them.
00:40:36.760 --> 00:40:38.140
Teubi: Thank you!
00:40:38.470 --> 00:40:41.200
Herald Angel: OK, thank you very much
Teubi.
00:40:45.130 --> 00:40:48.900
35C3 outro music
00:40:48.900 --> 00:41:07.000
subtitles created by c3subtitles.de
in the year 2019. Join, and help us!