WEBVTT
00:00:00.000 --> 00:00:18.229
35C3 preroll music
00:00:18.229 --> 00:00:24.750
Herald Angel: Welcome to our introduction
to deep learning with Teubi. Deep
00:00:24.750 --> 00:00:30.247
learning, also often called machine
learning, is a hype word which we hear in
00:00:30.247 --> 00:00:37.152
the media all the time. It's nearly as bad
as blockchain. It's a solution for
00:00:37.152 --> 00:00:43.249
everything. Today we'll get a sneak peek
into the internals of this mystical black
00:00:43.249 --> 00:00:48.820
box they are talking about. And Teubi
will show us why people who know what
00:00:48.820 --> 00:00:53.040
machine learning really is about have to
facepalm so often when they read the
00:00:53.040 --> 00:00:58.715
news. So please welcome Teubi
with a big round of applause!
00:00:58.715 --> 00:01:10.245
Applause
Teubi: Alright! Good morning and welcome
00:01:10.245 --> 00:01:14.470
to Introduction to Deep Learning. The
title will already tell you what this talk
00:01:14.470 --> 00:01:19.920
is about. I want to give you an
introduction to how deep learning works,
00:01:19.920 --> 00:01:27.090
what happens inside this black box. But,
first of all, who am I? I'm Teubi. It's a
00:01:27.090 --> 00:01:32.280
German nickname, it has nothing to do with
toys or bees. You might have heard my
00:01:32.280 --> 00:01:36.480
voice before, because I host the
Nussschale podcast. There I explain
00:01:36.480 --> 00:01:41.560
scientific topics in under 10 minutes.
I'll have to use a little more time today,
00:01:41.560 --> 00:01:46.850
and you'll also have fancy animations
which hopefully will help. In my day job
00:01:46.850 --> 00:01:52.540
I'm a research scientist at an institute
for computer vision. I analyze microscopy
00:01:52.540 --> 00:01:58.240
images of bone marrow blood cells and try
to find ways to teach the computer to
00:01:58.240 --> 00:02:04.660
understand what it sees. Namely, to
differentiate between certain cells or,
00:02:04.660 --> 00:02:09.449
first of all, find cells in an image,
which is a task that is more complex than
00:02:09.449 --> 00:02:17.180
it might sound. Let me start with the
introduction to deep learning. We all know
00:02:17.180 --> 00:02:22.769
how to code. We code in a very simple way.
We have some input for our computer
00:02:22.769 --> 00:02:27.618
algorithm. Then we have an algorithm which
says: Do this, do that. If this, then
00:02:28.510 --> 00:02:28.906
that. And in that way we generate some
output. This is not how machine learning
00:02:29.495 --> 00:02:30.754
works. Machine learning assumes you have
some input, and you also have some output.
00:02:40.810 --> 00:02:46.180
And what you also have is some statistical
model. This statistical model is flexible.
00:02:46.180 --> 00:02:51.549
It has certain parameters, which it can
learn from the distribution of inputs and
00:02:51.549 --> 00:02:57.430
outputs you give it for training. So you
basically teach the statistical model to
00:02:57.430 --> 00:03:03.659
generate the desired output from the given
input. Let me give you a really simple
00:03:03.659 --> 00:03:09.980
example of how this might work. Let's say
we have two animals. Well, we have two
00:03:09.980 --> 00:03:15.689
kinds of animals: unicorns and rabbits.
And now we want to find an algorithm that
00:03:15.689 --> 00:03:24.270
tells us whether this animal we have right
now as an input is a rabbit or a unicorn.
00:03:24.270 --> 00:03:28.230
We can write a simple algorithm to do
that, but we can also do it with machine
00:03:28.230 --> 00:03:34.590
learning. The first thing we need is some
input. I choose two features that are able
00:03:34.590 --> 00:03:42.269
to tell me whether this animal is a rabbit
or a unicorn. Namely, speed and size. We
00:03:42.269 --> 00:03:46.859
call these features, and they describe
something about what we want to classify.
00:03:46.859 --> 00:03:52.409
And the class is in this case our animal.
First thing I need is some training data,
00:03:52.409 --> 00:03:59.170
some input. The input here are just pairs
of speed and size. What I also need is
00:03:59.170 --> 00:04:04.129
information about the desired output. The
desired output, of course, being the
00:04:04.129 --> 00:04:12.999
class. So either unicorn or rabbit, here
denoted by yellow and red X's. So let's
00:04:12.999 --> 00:04:18.298
try to find a statistical model which we
can use to separate this feature space
00:04:18.298 --> 00:04:24.150
into two halves: One for the rabbits, one
for the unicorns. Looking at this, we can
00:04:24.150 --> 00:04:28.660
actually find a really simple statistical
model, and our statistical model in this
00:04:28.660 --> 00:04:34.390
case is just a straight line. And the
learning process is then to find where in
00:04:34.390 --> 00:04:41.080
this feature space the line should be.
Ideally, for example, here. Right in the
00:04:41.080 --> 00:04:45.220
middle between the two classes rabbit and
unicorn. Of course this is an overly
00:04:45.220 --> 00:04:50.370
simplified example. Real-world
applications have feature distributions
00:04:50.370 --> 00:04:56.080
which look much more like this. So, we
have a gradient, we don't have a perfect
00:04:56.080 --> 00:05:00.130
separation between those two classes, and
those two classes are definitely not
00:05:00.130 --> 00:05:05.560
separable by a line. If we look again at
some training samples — training samples
00:05:05.560 --> 00:05:11.730
are the data points we use for the machine
learning process, so, to try to find the
00:05:11.730 --> 00:05:17.540
parameters of our statistical model — if
we look at the line again, then this will
00:05:17.540 --> 00:05:23.000
not be able to separate this training set.
Well, we will have a line that has some
00:05:23.000 --> 00:05:27.320
errors, some unicorns which will be
classified as rabbits, some rabbits which
00:05:27.320 --> 00:05:33.070
will be classified as unicorns. This is
what we call underfitting. Our model is
00:05:33.070 --> 00:05:40.150
just not able to express what we want it
to learn. There is the opposite case. The
00:05:40.150 --> 00:05:45.510
opposite case being: we just learn all the
training samples by heart. This is if we
00:05:45.510 --> 00:05:50.020
have a very complex model and just a few
training samples to teach the model what
00:05:50.020 --> 00:05:55.120
it should learn. In this case we have a
perfect separation of unicorns and
00:05:55.120 --> 00:06:00.700
rabbits, at least for the few data points
we have. If we draw another example from
00:06:00.700 --> 00:06:07.300
the real world, some other data points,
they will most likely be wrong. And this
00:06:07.300 --> 00:06:11.380
is what we call overfitting. The perfect
scenario in this case would be something
00:06:11.380 --> 00:06:17.340
like this: a classifier which is really
close to the distribution we have in the
00:06:17.340 --> 00:06:23.350
real world and machine learning is tasked
with finding this perfect model and its
00:06:23.350 --> 00:06:28.960
parameters. Let me show you a different
kind of model, something you probably all
00:06:28.960 --> 00:06:35.670
have heard about: Neural networks. Neural
networks are inspired by the brain.
00:06:35.670 --> 00:06:41.210
Or more precisely, by the neurons in our
brain. Neurons are tiny objects, tiny
00:06:41.210 --> 00:06:47.250
cells in our brain that take some input
and generate some output. Sounds familiar,
00:06:47.250 --> 00:06:52.680
right? We have inputs usually in the form
of electrical signals. And if they are
00:06:52.680 --> 00:06:57.860
strong enough, this neuron will also send
out an electrical signal. And this is
00:06:57.860 --> 00:07:03.430
something we can model in a computer-
engineering way. So, what we do is: We
00:07:03.430 --> 00:07:09.240
take a neuron. The neuron is just a simple
mapping from input to output. Input here,
00:07:09.240 --> 00:07:17.200
just three input nodes. We denote them by
i1, i2 and i3 and output denoted by o. And
00:07:17.200 --> 00:07:20.840
now you will actually see some
mathematical equations. There are not many
00:07:20.840 --> 00:07:26.700
of these in this foundation talk, don't
worry, and it's really simple. There's one
00:07:26.700 --> 00:07:30.250
more thing we need first, though, if we
want to map input to output in the way a
00:07:30.250 --> 00:07:35.490
neuron does. Namely, the weights. The
weights are just some arbitrary numbers
00:07:35.490 --> 00:07:43.020
for now. Let's call them w1, w2 and w3.
So, we take those weights and we multiply
00:07:43.020 --> 00:07:51.360
them with the input. Input1 times weight1,
input2 times weight2, and so on. And this
00:07:51.360 --> 00:07:57.550
sum will just be our output. Well,
not quite. We make it a little bit more
00:07:57.550 --> 00:08:02.430
complicated. We also use something called
an activation function. The activation
00:08:02.430 --> 00:08:08.520
function is just a mapping from one scalar
value to another scalar value. In this
00:08:08.520 --> 00:08:14.280
case from what we got as an output, the
sum, to something that more closely fits
00:08:14.280 --> 00:08:19.360
what we need. This could for example be
something binary, where we have all the
00:08:19.360 --> 00:08:23.780
negative numbers being mapped to zero and
all the positive numbers being mapped to
00:08:23.780 --> 00:08:30.910
one. And then this zero and one can encode
something. For example: rabbit or unicorn.
00:08:30.910 --> 00:08:35.309
So, let me give you an example of how we
can make the previous example with the
00:08:35.309 --> 00:08:41.729
rabbits and unicorns work with such a
simple neuron. We just use speed, size,
00:08:41.729 --> 00:08:49.650
and the arbitrarily chosen number 10 as
our inputs and the weights 1, 1, and -1.
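This neuron can be written down in a few lines of Python. A minimal sketch (the function names are my own; the step activation maps negative sums to 0 and positive sums to 1, as described):

```python
def step(x):
    # binary activation: negative sums -> 0, positive sums -> 1
    return 1 if x > 0 else 0

def neuron(inputs, weights):
    # weighted sum of inputs, passed through the activation
    total = sum(i * w for i, w in zip(inputs, weights))
    return step(total)

def classify(speed, size):
    # inputs: speed, size and the constant 10; weights: 1, 1, -1
    # so the neuron outputs 1 exactly when speed + size > 10
    return neuron([speed, size, 10], [1, 1, -1])
```

Here classify(8, 6) returns 1 and classify(2, 3) returns 0: the two sides of the separating line speed + size = 10. Which side means rabbit and which means unicorn is just a matter of how we read the 0 and the 1.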
00:08:49.650 --> 00:08:54.400
If we look at the equations, then we get
for our negative numbers — so, speed plus
00:08:54.400 --> 00:09:01.440
size being less than 10 — a 0, and a 1 for
all positive numbers — being speed plus
00:09:01.440 --> 00:09:07.680
size greater than 10. This
way we again have a separating line
00:09:07.680 --> 00:09:14.600
between unicorns and rabbits. But again we
have this really simplistic model. We want
00:09:14.600 --> 00:09:21.529
to become more and more complicated in
order to express more complex tasks. So
00:09:21.529 --> 00:09:26.279
what do we do? We take more neurons. We
take our three input values and put them
00:09:26.279 --> 00:09:31.920
into one neuron, and into a second neuron,
and into a third neuron. And we take the
00:09:31.920 --> 00:09:38.330
output of those three neurons as input for
another neuron. We also call this a
00:09:38.330 --> 00:09:42.140
multilayer perceptron, perceptron just
being a different name for the neuron
00:09:42.140 --> 00:09:48.670
we have there. And the whole thing is also
called a neural network. So now the
00:09:48.670 --> 00:09:53.300
question: How do we train this? How do we
learn what this network should encode?
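The multilayer perceptron just described, three inputs feeding three hidden neurons whose outputs feed one output neuron, can be sketched like this. A sigmoid is used as the activation here; the talk used a binary step, and the sigmoid is a common smooth alternative. All weights and names are illustrative:

```python
import math

def sigmoid(x):
    # smooth activation mapping any number into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def neuron(inputs, weights):
    # weighted sum of inputs, passed through the activation
    return sigmoid(sum(i * w for i, w in zip(inputs, weights)))

def mlp(inputs, hidden_weights, output_weights):
    # layer 1: every input value goes into every hidden neuron
    hidden = [neuron(inputs, w) for w in hidden_weights]
    # layer 2: the three hidden outputs feed one output neuron
    return neuron(hidden, output_weights)
```

Training then means adjusting the numbers in hidden_weights and output_weights until the output matches the desired output.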
00:09:53.300 --> 00:09:57.620
Well, we want a mapping from input to
output, and what we can change are the
00:09:57.620 --> 00:10:02.880
weights. First, what we do is we take a
training sample, some input. Put it
00:10:02.880 --> 00:10:07.010
through the network, get an output. But
this might not be the desired output which
00:10:07.010 --> 00:10:13.570
we know. So, in the binary case there are
four possible cases: computed output,
00:10:13.570 --> 00:10:19.860
expected output, each with two values, 0 and 1.
The best case would be: we want a 0, get a
00:10:19.860 --> 00:10:27.120
0, want a 1 and get a 1. But there is also
the opposite case. In these two cases we
00:10:27.120 --> 00:10:31.440
can learn something about our model.
Namely, in which direction to change the
00:10:31.440 --> 00:10:37.270
weights. It's a little bit simplified, but
in principle you just raise the weights if
00:10:37.270 --> 00:10:41.250
you need a higher number as output and you
lower the weights if you need a lower
00:10:41.250 --> 00:10:47.350
number as output. To tell you how much, we
have two terms. First term being the
00:10:47.350 --> 00:10:53.110
error, so in this case just the difference
between desired and actual output – also
00:10:53.110 --> 00:10:56.890
often called a loss function, especially
in deep learning and more complex
00:10:56.890 --> 00:11:04.120
applications. You also have a second term
we call the learning rate, and the
00:11:04.120 --> 00:11:09.170
learning rate is what tells us how quickly
we should change the weights, how quickly
00:11:09.170 --> 00:11:14.890
we should adapt the weights. Okay, this is
how we learn a model. This is almost
00:11:14.890 --> 00:11:18.550
everything you need to know. There are
mathematical equations that tell you how
00:11:18.550 --> 00:11:23.770
much to change based on the error and the
learning rate. And this is the entire
00:11:23.770 --> 00:11:30.339
learning process. Let's get back to the
terminology. We have the input layer. We
00:11:30.339 --> 00:11:34.020
have the output layer, which somehow
encodes our output either in one value or
00:11:34.020 --> 00:11:39.650
in several values if we have
multiple classes. We also have
00:11:39.650 --> 00:11:45.930
the hidden layers, which are actually what
makes our model deep. What we can change,
00:11:45.930 --> 00:11:51.980
what we can learn, are the weights,
the parameters of this model. But what we
00:11:51.980 --> 00:11:55.490
also need to keep in mind, is the number
of layers, the number of neurons per
00:11:55.490 --> 00:11:59.590
layer, the learning rate, and the
activation function. These are called
00:11:59.590 --> 00:12:04.240
hyperparameters, and they determine how
complex our model is, how well it is
00:12:04.240 --> 00:12:09.970
suited to solve the task at hand. I quite
often spoke about solving tasks, so the
00:12:09.970 --> 00:12:14.630
question is: What can we actually do with
neural networks? Mostly classification
00:12:14.630 --> 00:12:19.560
tasks, for example: Tell me, is this
animal a rabbit or unicorn? Is this text
00:12:19.560 --> 00:12:24.690
message spam or legitimate? Is this
patient healthy or ill? Is this image a
00:12:24.690 --> 00:12:30.710
picture of a cat or a dog? We already saw
for the animal that we need something
00:12:30.710 --> 00:12:35.040
called features, which somehow encodes
information about what we want to
00:12:35.040 --> 00:12:39.530
classify, something we can use as input
for the neural network. Some kind of
00:12:39.530 --> 00:12:43.830
number that is meaningful. So, for the
animal it could be speed, size, or
00:12:43.830 --> 00:12:48.740
something like color. Color, of course,
being more complex again, because we have,
00:12:48.740 --> 00:12:55.940
for example, RGB, so three values. And
text message being a more complex case
00:12:55.940 --> 00:13:00.060
again, because we somehow need to encode
the sender, and whether the sender is
00:13:00.060 --> 00:13:04.770
legitimate. Same for the recipient, or the
number of hyperlinks, or where the
00:13:04.770 --> 00:13:11.400
hyperlinks refer to, or whether there
are certain words present in the text. It
00:13:11.400 --> 00:13:16.720
gets more and more complicated. Even more
so for a patient. How do we encode medical
00:13:16.720 --> 00:13:22.420
history in a proper way for the network to
learn? I mean, temperature is simple. It's
00:13:22.420 --> 00:13:26.750
a scalar value, we just have a number. But
how do we encode whether certain symptoms
00:13:26.750 --> 00:13:32.720
are present. And the image, which is
actually what I work with everyday, is
00:13:32.720 --> 00:13:38.350
again quite complex. We have values, we
have numbers, but only pixel values, which
00:13:38.350 --> 00:13:43.450
are difficult to
use as input for a neural network. Why?
00:13:43.450 --> 00:13:48.350
I'll show you. I'll actually show you with
this picture, it's a very famous picture,
00:13:48.350 --> 00:13:53.970
and everybody uses it in computer vision.
They will tell you, it's because there is
00:13:53.970 --> 00:14:01.010
a multitude of different characteristics
in this image: shapes, edges, whatever you
00:14:01.010 --> 00:14:07.080
desire. The truth is, it's a crop from the
centrefold of the Playboy, and in earlier
00:14:07.080 --> 00:14:12.070
years, computer vision engineers were a
mostly male audience. Anyway, let's take
00:14:12.070 --> 00:14:16.850
five by five pixels. Let's assume this is
a five-by-five-pixel, really small,
00:14:16.850 --> 00:14:22.230
image. If we take those 25 pixels and use
them as input for a neural network you
00:14:22.230 --> 00:14:26.730
already see that we have many connections
- many weights - which means a very
00:14:26.730 --> 00:14:32.540
complex model. Complex model, of course,
prone to overfitting. But there are more
00:14:32.540 --> 00:14:38.800
problems. First being, we have
disconnected a
00:14:38.800 --> 00:14:43.670
pixel from its neighbors. We can't encode
information about the neighborhood
00:14:43.670 --> 00:14:47.850
anymore, and that really sucks. If we just
take the whole picture, and move it to the
00:14:47.850 --> 00:14:52.790
left or to the right by just one pixel,
the network will see something completely
00:14:52.790 --> 00:14:58.470
different, even though to us it is exactly
the same. But, we can solve that with some
00:14:58.470 --> 00:15:03.400
very clever engineering, something we call
a convolutional layer. It is again a
00:15:03.400 --> 00:15:08.860
hidden layer in a neural network, but it
does something special. It actually is a
00:15:08.860 --> 00:15:13.970
very simple neuron again, just four input
values - one output value. But the four
00:15:13.970 --> 00:15:19.780
input values look at two by two pixels,
and encode one output value. And then the
00:15:19.780 --> 00:15:23.790
same network is shifted to the right, and
encodes another pixel, and another pixel,
00:15:23.790 --> 00:15:30.150
and the next row of pixels. And in this
way creates another 2D image. We have
00:15:30.150 --> 00:15:34.900
preserved information about the
neighborhood, and we just have a very low
00:15:34.900 --> 00:15:41.910
number of weights, not the huge number of
parameters we saw earlier. We can use this
00:15:41.910 --> 00:15:49.640
once, or twice, or several hundred times.
And this is actually where we go deep.
00:15:49.640 --> 00:15:54.920
Deep means: We have several layers, and
layers that don't need thousands or
00:15:54.920 --> 00:16:01.040
millions of connections, but only a few.
This is what allows us to go really deep.
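The convolutional layer described above, a single two-by-two neuron slid across the image, can be sketched in plain Python. The names are mine and real frameworks implement this far more efficiently, but the mechanics are the same:

```python
def conv2x2(image, kernel):
    # slide a 2x2 kernel over the image; each position yields one
    # output pixel from the weighted sum of a 2x2 neighborhood
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(rows - 1):
        out_row = []
        for c in range(cols - 1):
            s = (image[r][c]     * kernel[0][0] +
                 image[r][c + 1] * kernel[0][1] +
                 image[r + 1][c] * kernel[1][0] +
                 image[r + 1][c + 1] * kernel[1][1])
            out_row.append(s)
        out.append(out_row)
    return out
```

For a five-by-five image this produces a four-by-four image, and the layer has only the four kernel weights, no matter how large the input is. That is what keeps deep stacks of such layers affordable.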
00:16:01.040 --> 00:16:06.250
And in this fashion we can encode an
entire image in just a few meaningful
00:16:06.250 --> 00:16:11.480
values. What these values look like, and
what they encode, this is learned through
00:16:11.480 --> 00:16:18.240
the learning process. And we can then, for
example, use these few values as input for
00:16:18.240 --> 00:16:24.709
a classification network.
The fully connected network we saw earlier.
00:16:24.709 --> 00:16:29.560
Or we can do something more clever. We can
do the inverse operation and create an image
00:16:29.560 --> 00:16:35.170
again, for example, the same image, which
is then called an autoencoder.
00:16:35.170 --> 00:16:40.200
Autoencoders are tremendously useful, even
though they don't appear that way. For
00:16:40.200 --> 00:16:43.959
example, imagine you want to check whether
something has a defect, or not, a picture
00:16:43.959 --> 00:16:51.290
of a fabric, or of something. You just
train the network with normal pictures.
00:16:51.290 --> 00:16:56.770
And then, if you have a defect picture,
the network is not able to produce this
00:16:56.770 --> 00:17:02.149
defect. And so the difference between the
reproduced picture, and the real picture
00:17:02.149 --> 00:17:07.420
will show you where errors are. If it
works properly, I'll have to admit that.
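The defect-detection idea can be sketched as follows, assuming `autoencoder` is an already trained model that maps a 2D image (a list of pixel rows) to its reconstruction; the function name is hypothetical:

```python
def defect_map(image, autoencoder):
    # the autoencoder was trained on normal pictures only, so it
    # reproduces normal structure but fails to reproduce defects
    reconstruction = autoencoder(image)
    # per-pixel absolute difference: large values mark likely defects
    return [[abs(a - b) for a, b in zip(row_in, row_out)]
            for row_in, row_out in zip(image, reconstruction)]
```

Wherever the map is close to zero the network reproduced the picture; wherever it is large, the picture contains something the network never learned to reproduce.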
00:17:07.420 --> 00:17:12.569
But we can go even further. Let's say, we
want to encode something entirely else.
00:17:12.569 --> 00:17:17.400
Well, let's encode the image, the
information in the image, but in another
00:17:17.400 --> 00:17:21.859
representation. For example, let's say we
have three classes again. The background
00:17:21.859 --> 00:17:30.049
class in grey, a class called hat or
headwear in blue, and person in green. We
00:17:30.049 --> 00:17:34.309
can also use this for other applications
than just for pictures of humans. For
00:17:34.309 --> 00:17:38.370
example, we have a picture of a street and
want to encode: Where is the car, where's
00:17:38.370 --> 00:17:44.860
the pedestrian? Tremendously useful. Or we
have an MRI scan of a brain: Where in the
00:17:44.860 --> 00:17:51.110
brain is the tumor? Can we somehow learn
this? Yes we can do this, with methods
00:17:51.110 --> 00:17:57.480
like these, if they are trained properly.
More about that later. Well we expect
00:17:57.480 --> 00:18:01.020
something like this to come out but the
truth looks rather like this – especially
00:18:01.020 --> 00:18:05.870
if it's not properly trained. We do not get
the real shape we want, but
00:18:05.870 --> 00:18:11.980
something distorted. So here is again
where we need to do learning. First we
00:18:11.980 --> 00:18:15.790
take a picture, put it through the
network, get our output representation.
00:18:15.790 --> 00:18:21.110
And we have the information about how we
want it to look. We again compute some
00:18:21.110 --> 00:18:27.040
kind of loss value. This time for example
being the overlap between the shape we get
00:18:27.040 --> 00:18:34.040
out of the model and the shape we want to
have. And we use this error, this loss
00:18:34.040 --> 00:18:38.660
function, to update the weights of our
network. Again – even though it's more
00:18:38.660 --> 00:18:43.570
complicated here, even though we have more
layers, and even though the layers look
00:18:43.570 --> 00:18:48.640
slightly different – it is the same
process all over again as with a binary
00:18:48.640 --> 00:18:56.540
case. And we need lots of training data.
This is something that you'll hear often
00:18:56.540 --> 00:19:02.960
in connection with deep learning: You need
lots of training data to make this work.
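A common way to turn the overlap between the predicted shape and the desired shape into a loss value is the Dice score. The talk does not name a specific loss, so this is one standard choice, sketched for binary masks:

```python
def dice_loss(predicted, target):
    # predicted and target are binary masks (rows of 0/1 pixels);
    # the Dice score measures overlap: 1.0 = perfect, 0.0 = none
    p = [v for row in predicted for v in row]
    t = [v for row in target for v in row]
    intersection = sum(a * b for a, b in zip(p, t))
    total = sum(p) + sum(t)
    dice = 2.0 * intersection / total if total else 1.0
    # a loss should be small when the overlap is good
    return 1.0 - dice
```

Updating the weights to shrink this loss pushes the predicted shape towards the desired one.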
00:19:02.960 --> 00:19:10.100
Images are complex things and in order to
meaningfully extract knowledge from them,
00:19:10.100 --> 00:19:17.090
the network needs to see a multitude of
different images. Well now I already
00:19:17.090 --> 00:19:22.230
showed you some things we use in network
architecture, some support networks: The
00:19:22.230 --> 00:19:26.679
fully convolutional encoder, which takes
an image and produces a few meaningful
00:19:26.679 --> 00:19:33.110
values out of this image; its counterpart
the fully convolutional decoder – fully
00:19:33.110 --> 00:19:36.960
convolutional meaning by the way that we
only have these convolutional layers with
00:19:36.960 --> 00:19:42.980
a few parameters that somehow encode
spatial information and keep it for the
00:19:42.980 --> 00:19:49.360
next layers. The decoder takes a few
meaningful numbers and reproduces an image
00:19:49.360 --> 00:19:55.420
– either the same image or another
representation of the information encoded
00:19:55.420 --> 00:20:01.400
in the image. We also already saw the
fully connected network. Fully connected
00:20:01.400 --> 00:20:06.640
meaning every neuron is connected to every
neuron in the next layer. This of course
00:20:06.640 --> 00:20:12.570
can be dangerous because this is where we
actually get most of our parameters. If we
00:20:12.570 --> 00:20:16.390
have a fully connected network, this is
where the most parameters will be present
00:20:16.390 --> 00:20:21.580
because connecting every node to every
node … this is just a high number of
00:20:21.580 --> 00:20:25.860
connections. We can also do other things.
For example something called a pooling
00:20:25.860 --> 00:20:32.280
layer. A pooling layer being basically the
same as one of those convolutional layers,
00:20:32.280 --> 00:20:36.370
just that we don't have parameters we need
to learn. This works without parameters
00:20:36.370 --> 00:20:43.740
because this neuron just chooses whichever
value is the highest and takes that value
00:20:43.740 --> 00:20:49.600
as output. This is really great for
reducing the size of your image and also
00:20:49.600 --> 00:20:55.150
getting rid of information that might not
be that important. We can also do some
00:20:55.150 --> 00:20:59.890
clever techniques like adding a dropout
layer. A dropout layer just being a normal
00:20:59.890 --> 00:21:05.799
layer in a neural network where we remove
some connections: In one training step
00:21:05.799 --> 00:21:10.720
these connections, in the next training
step some other connections. This way we
00:21:10.720 --> 00:21:18.049
teach the other connections to become more
resilient against errors. I would like to
00:21:18.049 --> 00:21:22.750
start with something I call the "Model
Show" now, and show you some models and
00:21:22.750 --> 00:21:28.870
how we train those models. And I will
start with a fully convolutional decoder
00:21:28.870 --> 00:21:34.740
we saw earlier: This thing that takes a
number and creates a picture. I would like
00:21:34.740 --> 00:21:41.420
to take this model, put in some number and
get out a picture – a picture of a horse
00:21:41.420 --> 00:21:46.000
for example. If I put in a different
number I also want to get a picture of a
00:21:46.000 --> 00:21:52.390
horse, but of a different horse. So what I
want to get is a mapping from some
00:21:52.390 --> 00:21:56.730
numbers, some features that encode
something about the horse picture, and get
00:21:56.730 --> 00:22:03.450
a horse picture out of it. You might see
already why this is problematic. It is
00:22:03.450 --> 00:22:08.230
problematic because we don't have a
mapping from feature to horse or from
00:22:08.230 --> 00:22:15.050
horse to features. So we don't have a
truth value we can use to learn how to
00:22:15.050 --> 00:22:21.790
generate this mapping. Well computer
vision engineers – or deep learning
00:22:21.790 --> 00:22:26.800
professionals – they're smart and have
clever ideas. Let's just assume we have
00:22:26.800 --> 00:22:32.870
such a network and let's call it a
generator. Let's take some numbers, put
00:22:32.870 --> 00:22:39.240
them into the generator and get some
horses. Well it doesn't work yet. We still
00:22:39.240 --> 00:22:42.490
have to train it. So they're probably not
only horses but also some very special
00:22:42.490 --> 00:22:47.970
unicorns among the horses; which might be
nice for other applications, but I wanted
00:22:47.970 --> 00:22:55.480
pictures of horses right now. So I can't
train with this data directly. But what I
00:22:55.480 --> 00:23:01.600
can do is I can create a second network.
This network is called a discriminator and
00:23:01.600 --> 00:23:08.820
I can give it the input generated from the
generator as well as the real data I have:
00:23:08.820 --> 00:23:13.920
the real horse pictures. And then I can
teach the discriminator to distinguish
00:23:13.920 --> 00:23:22.080
between those. Tell me it is a real horse
or it's not a real horse. And there I know
00:23:22.080 --> 00:23:27.000
what is the truth because I either take
real horse pictures or fake horse pictures
00:23:27.000 --> 00:23:34.170
from the generator. So I have a truth
value for this discriminator. But in doing
00:23:34.170 --> 00:23:39.070
this I also have a truth value for the
generator. Because I want the generator to
00:23:39.070 --> 00:23:43.799
work against the discriminator. So I can
also use the information how well the
00:23:43.799 --> 00:23:51.010
discriminator does to train the generator
to become better at fooling it. This is
00:23:51.010 --> 00:23:57.470
called a generative adversarial network.
And it can be used to generate pictures of
00:23:57.470 --> 00:24:02.350
an arbitrary distribution. Let's do this
with numbers and I will actually show you
00:24:02.350 --> 00:24:07.590
the training process. Before I start the
video, I'll tell you what I did. I took
00:24:07.590 --> 00:24:11.550
some handwritten digits. There is a
database called "MNIST" of handwritten
00:24:11.550 --> 00:24:18.570
digits, so the digits 0 to 9. And I
took those and used them as training data.
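One training step of such a generative adversarial network can be sketched as follows. This is a structural illustration only: the generator and discriminator are passed in as plain functions, squared error stands in for the usual cross-entropy loss, and a real implementation would backpropagate both losses through the networks:

```python
import random

def gan_training_step(generator, discriminator, real_batch, noise_dim):
    # 1. the generator turns random noise into fake samples
    noise = [[random.random() for _ in range(noise_dim)]
             for _ in real_batch]
    fakes = [generator(z) for z in noise]

    # 2. the discriminator is trained with known labels:
    #    real samples -> 1, generated samples -> 0
    disc_loss = sum((discriminator(x) - 1) ** 2 for x in real_batch)
    disc_loss += sum(discriminator(x) ** 2 for x in fakes)

    # 3. the generator is trained to fool the discriminator,
    #    i.e. to push discriminator(fake) towards 1
    gen_loss = sum((discriminator(x) - 1) ** 2 for x in fakes)

    # a real implementation would now update both sets of weights
    return disc_loss / (2 * len(real_batch)), gen_loss / len(real_batch)
```

The two losses pull against each other: lowering gen_loss makes the fakes harder to spot, which in turn raises disc_loss until the discriminator adapts.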
00:24:18.570 --> 00:24:24.299
I trained a generator in the way I showed
you on the previous slide, and then I just
00:24:24.299 --> 00:24:30.110
took some random numbers. I put those
random numbers into the network and just
00:24:30.110 --> 00:24:35.960
stored the image of what came out of the
network. And here in the video you'll see
00:24:35.960 --> 00:24:43.090
how the network improved with ongoing
training. You will see that we start
00:24:43.090 --> 00:24:50.179
basically with just noisy images … and
then after some – what we call epochs,
00:24:50.179 --> 00:24:55.919
so training iterations – the network is
able to almost perfectly generate
00:24:55.919 --> 00:25:05.679
handwritten digits just from noise. Which
I find truly fascinating. Of course this
00:25:05.679 --> 00:25:11.270
is an example where it works. It highly
depends on your data set and how you train
00:25:11.270 --> 00:25:15.600
the model whether it is a success or not.
But if it works, you can use it to
00:25:15.600 --> 00:25:22.559
generate fonts. You can generate
characters, 3D objects, pictures of
00:25:22.559 --> 00:25:28.700
animals, whatever you want as long as you
have training data. Let's go more crazy.
00:25:28.700 --> 00:25:34.539
Let's take two of those and let's say we
have pictures of horses and pictures of
00:25:34.539 --> 00:25:41.150
zebras. I want to convert those pictures
of horses into pictures of zebras, and I
00:25:41.150 --> 00:25:44.590
want to convert pictures of zebras into
pictures of horses. So I want to have the
00:25:44.590 --> 00:25:49.690
same picture just with the other animal.
But I don't have training data of the same
00:25:49.690 --> 00:25:56.270
situation just once with a horse and once
with a zebra. Doesn't matter. We can train
00:25:56.270 --> 00:26:00.650
a network that does that for us. Again we
just have a network – we call it the
00:26:00.650 --> 00:26:05.730
generator – and we have two of those: One
that converts horses to zebras and one
00:26:05.730 --> 00:26:14.840
that converts zebras to horses. And then
we also have two discriminators that tell
00:26:14.840 --> 00:26:21.150
us: real horse – fake horse – real zebra –
fake zebra. And then we again need to
00:26:21.150 --> 00:26:27.210
perform some training. So we need to
somehow encode: Did what we wanted to do
00:26:27.210 --> 00:26:31.460
work? And a very simple way to do this is
we take a picture of a horse, put it
00:26:31.460 --> 00:26:35.470
through the generator that generates a
zebra. Take this fake picture of a zebra,
00:26:35.470 --> 00:26:39.340
put it through the generator that
generates a picture of a horse. And if
00:26:39.340 --> 00:26:43.700
this is the same picture as we put in,
then our model worked. And if it didn't,
00:26:43.700 --> 00:26:48.549
we can use that information to update the
weights. I just took a random picture,
00:26:48.549 --> 00:26:54.460
from a free library on the Internet, of a
horse and generated a zebra and it worked
00:26:54.460 --> 00:26:59.470
remarkably well. I actually didn't even do
training. It also doesn't need to be a
00:26:59.470 --> 00:27:03.120
picture. You can also convert text to
images: You describe something in words
00:27:03.120 --> 00:27:09.570
and generate images. You can age your face
or age a cell; or make a patient healthy
00:27:09.570 --> 00:27:15.510
or sick – or the image of a patient, not
the patient themselves, unfortunately. You can
00:27:15.510 --> 00:27:20.690
do style transfer like take a picture of
Van Gogh and apply it to your own picture.
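The horse-to-zebra-and-back check described above is known as a cycle-consistency loss. A sketch, assuming the two generators are trained models operating on 2D images (lists of pixel rows); the names are mine:

```python
def cycle_consistency_loss(horse_image, horse_to_zebra, zebra_to_horse):
    # convert the horse to a zebra and back again; if both
    # generators work, we should get the original picture back
    fake_zebra = horse_to_zebra(horse_image)
    reconstructed = zebra_to_horse(fake_zebra)
    # mean absolute pixel difference between original and round trip
    diffs = [abs(a - b)
             for row_a, row_b in zip(horse_image, reconstructed)
             for a, b in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)
```

This value is what replaces the missing paired training data: we never need the same scene photographed once with a horse and once with a zebra, only the requirement that the round trip reproduces the input.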
00:27:20.690 --> 00:27:27.559
Stuff like that. Something else that we
can do with neural networks. Let's assume
00:27:27.559 --> 00:27:31.030
we have a classification network, we have
a picture of a toothbrush and the network
00:27:31.030 --> 00:27:36.770
tells us: Well, this is a toothbrush.
Great! But how resilient is this network?
00:27:36.770 --> 00:27:44.530
Does it really work in every scenario?
There's a second network we can apply: We
00:27:44.530 --> 00:27:48.701
call it an adversarial network. And that
network is trained to do one thing: Look
00:27:48.701 --> 00:27:52.289
at the network, look at the picture, and
then find the one weak spot in the
00:27:52.289 --> 00:27:55.880
picture: Just change one pixel slightly so
that the network will tell me this
00:27:55.880 --> 00:28:03.600
toothbrush is an octopus. Works remarkably
well. Also works with just changing the
00:28:03.600 --> 00:28:08.940
picture slightly, so changing all the
pixels, but just slight minute changes
00:28:08.940 --> 00:28:12.860
that we don't perceive, but the network –
the classification network – is completely
00:28:12.860 --> 00:28:19.640
thrown off. Well, that sounds bad. It is bad if you
don't consider it. But you can also for
00:28:19.640 --> 00:28:24.200
example use this for training your network
and make your network resilient. So
00:28:24.200 --> 00:28:28.460
there's always an upside and downside.
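The adversarial trick just described can be illustrated with a toy example that assumes nothing from the talk: a hypothetical three-pixel "image", a tiny linear classifier with made-up weights, and a perturbation in the spirit of the fast gradient sign method. The step size is exaggerated so the flip is easy to see; in real attacks the changes are far below what we perceive.

```python
# Toy adversarial perturbation against a made-up linear "classifier".
# Nudging every input slightly in the direction that lowers the class
# score can flip the prediction even though each change is small.

WEIGHTS = [0.6, -0.4, 0.2]  # hypothetical classifier: score = w . x

def score(x):
    return sum(w * xi for w, xi in zip(WEIGHTS, x))

def predict(x):
    return "toothbrush" if score(x) > 0 else "octopus"

def perturb(x, epsilon):
    # step each "pixel" against the sign of the score's gradient,
    # which for a linear model is simply the sign of the weight
    return [xi - epsilon * (1 if w > 0 else -1) for w, xi in zip(WEIGHTS, x)]

x = [0.30, 0.20, 0.10]
print(predict(x))                  # toothbrush
x_adv = perturb(x, epsilon=0.15)
print(predict(x_adv))              # octopus -- flipped by small changes
```

Each pixel moved by at most epsilon, yet the classification flipped; the same principle, with much smaller steps spread over many pixels, is what throws off real classification networks.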
Something else entirely: Now I'd like to
00:28:28.460 --> 00:28:32.880
show you something about text: a word-based
language model. I want to generate
00:28:32.880 --> 00:28:38.101
sentences for my podcast. I have a network
that gives me a word, and then if I want
00:28:38.101 --> 00:28:42.640
to somehow get the next word in the
sentence, I also need to consider this
00:28:42.640 --> 00:28:47.070
word. So another network architecture –
quite interestingly – just takes the
00:28:47.070 --> 00:28:52.179
hidden states of the network and uses them
as the input for the same network so that
00:28:52.179 --> 00:28:58.780
in the next iteration we still know what
we did in the previous step. I tried to
00:28:58.780 --> 00:29:04.730
train a network that generates podcast
episodes for my podcasts. Didn't work.
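The feedback loop described above – feeding the hidden state back in as input for the next step – is the core of a recurrent network. It can be sketched with a single made-up scalar "neuron"; the weights here are arbitrary numbers for illustration, not a trained model.

```python
# Minimal sketch of a recurrent step: the hidden state is fed back in,
# so each step still "knows" what happened in the previous steps.
import math

W_IN, W_HIDDEN = 0.5, 0.8   # hypothetical weights, not trained values

def rnn_step(word_value, hidden):
    # the new hidden state mixes the current input word with the
    # previous hidden state, so earlier words keep influencing later ones
    return math.tanh(W_IN * word_value + W_HIDDEN * hidden)

hidden = 0.0
for word_value in [1.0, 0.0, 0.0]:  # toy encoding of a three-word sentence
    hidden = rnn_step(word_value, hidden)
    print(hidden)  # stays non-zero: the first word is still "remembered"
```

Even though the second and third inputs are zero, the hidden state stays positive, carrying information from the first word forward, which is exactly what a language model needs to pick a plausible next word.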
00:29:04.730 --> 00:29:08.450
What I learned is I don't have enough
training data. I really need to produce
00:29:08.450 --> 00:29:15.790
more podcast episodes in order to train a
model to do my job for me. And this is
00:29:15.790 --> 00:29:21.539
very important, a very crucial point:
Training data. We need shitloads of
00:29:21.539 --> 00:29:26.081
training data. And actually the more
complicated our model and our training
00:29:26.081 --> 00:29:30.990
process becomes, the more training data we
need. I started with a supervised case –
00:29:30.990 --> 00:29:35.990
the really simple case where we
00:29:35.990 --> 00:29:40.660
have a picture and a label that
corresponds to that picture; or a
00:29:40.660 --> 00:29:46.280
representation of that picture showing
entirely what I wanted to learn. But we
00:29:46.280 --> 00:29:51.909
also saw a more complex task, where I had
two sets of pictures – horses and zebras – that are
00:29:51.909 --> 00:29:56.400
from two different domains – but domains
with no direct mapping. What can also
00:29:56.400 --> 00:30:01.020
happen – and actually happens quite a lot
– is weakly annotated data, so data that
00:30:01.020 --> 00:30:08.750
is not precisely annotated; where we can't
rely on the information we get. Or even
00:30:08.750 --> 00:30:13.050
more complicated: Something called
reinforcement learning where we perform a
00:30:13.050 --> 00:30:19.380
sequence of actions and then in the end
are told "yeah that was great". Which is
00:30:19.380 --> 00:30:24.080
often not enough information to really
perform proper training. But of course
00:30:24.080 --> 00:30:28.190
there are also methods for that. As well
as there are methods for the unsupervised
00:30:28.190 --> 00:30:33.590
case where we don't have annotations,
labeled data – no ground truth at all –
00:30:33.590 --> 00:30:41.241
just the picture itself. Well I talked
about pictures. I told you that we can
00:30:41.241 --> 00:30:45.320
learn features and create images from
them. And we can use them for
00:30:45.320 --> 00:30:51.640
classification. And for this there exist
many databases. There are public data sets
00:30:51.640 --> 00:30:56.659
we can use. Often they just refer to, for
example, Flickr: they're just hyperlinks,
00:30:56.659 --> 00:31:00.960
which is also why I didn't show you many
pictures right here, because I am honestly
00:31:00.960 --> 00:31:05.690
not sure about the copyright in those
cases. But there are also challenge
00:31:05.690 --> 00:31:11.190
datasets where you can just sign up, get
some for example medical data sets, and
00:31:11.190 --> 00:31:16.650
then compete against other researchers.
And of course there are those companies
00:31:16.650 --> 00:31:22.090
that just have lots of data. And those
companies also have the means, the
00:31:22.090 --> 00:31:28.110
capacity to perform intense computations.
And those are also often the companies you
00:31:28.110 --> 00:31:36.179
hear from in terms of innovation for deep
learning. Well this was mostly to tell you
00:31:36.179 --> 00:31:40.200
that you can process images quite well
with deep learning if you have enough
00:31:40.200 --> 00:31:46.029
training data, if you have a proper
training process and also a little if you
00:31:46.029 --> 00:31:52.090
know what you're doing. But you can also
process text, you can process audio and
00:31:52.090 --> 00:31:58.520
time series like prices or a stock
exchange – stuff like that. You can
00:31:58.520 --> 00:32:02.929
process almost everything if you make it
encodable for your network. Sounds like a
00:32:02.929 --> 00:32:08.120
dream come true. But – as I already told
you – you need data, a lot of it. I told
00:32:08.120 --> 00:32:14.020
you about those companies that have lots
of data sets and the publicly available
00:32:14.020 --> 00:32:21.370
data sets which you can actually use to
get started with your own experiments. But
00:32:21.370 --> 00:32:24.309
that also makes it a little dangerous
because deep learning still is a black box
00:32:24.309 --> 00:32:30.820
to us. I told you what happens inside the
black box on a level that teaches you how
00:32:30.820 --> 00:32:36.529
we learn and how the network is
structured, but not really what the
00:32:36.529 --> 00:32:42.831
network learned. For us computer vision
engineers it is really nice that we can
00:32:42.831 --> 00:32:48.590
visualize the first layers of a neural
network and see what is actually encoded
00:32:48.590 --> 00:32:53.950
in those first layers; what information
the network looks at. But you can't really
00:32:53.950 --> 00:32:59.059
mathematically prove what happens in a
network. Which is one major downside. And
00:32:59.059 --> 00:33:02.150
so if you want to use it, the numbers may
be really great but be sure to properly
00:33:02.150 --> 00:33:08.059
evaluate them. In summary I call that
"easy to learn". Every one – every single
00:33:08.059 --> 00:33:12.679
one of you – can just start with deep
learning right away. You don't need to do
00:33:12.679 --> 00:33:19.440
much work. You don't need to do much
learning. The model learns for you. But
00:33:19.440 --> 00:33:23.770
they're hard to master in a way that makes
them useful for production use cases for
00:33:23.770 --> 00:33:29.900
example. So if you want to use deep
learning for something – if you really
00:33:29.900 --> 00:33:34.299
want to seriously use it –, make sure that
it really does what you wanted to and
00:33:34.299 --> 00:33:38.900
doesn't learn something else – which also
happens. Pretty sure you saw some talks
00:33:38.900 --> 00:33:43.670
about deep learning fails – which is not
what this talk is about. They're quite
00:33:43.670 --> 00:33:47.370
funny to look at. Just make sure that they
don't happen to you! If you do that
00:33:47.370 --> 00:33:53.300
though, you'll achieve great things with
deep learning, I'm sure. And that was
00:33:53.300 --> 00:34:00.740
introduction to deep learning. Thank you!
Applause
00:34:09.172 --> 00:34:13.449
Herald Angel: So now it's question and
answer time. So if you have a question,
00:34:13.449 --> 00:34:19.110
please line up at the mikes. We have in
total eight, so it shouldn't be far from
00:34:19.110 --> 00:34:26.139
you. They are here in the corridors and on
these sides. Please line up! For
00:34:26.139 --> 00:34:31.540
everybody: A question consists of one
sentence with the question mark in the end
00:34:31.540 --> 00:34:38.449
– not three minutes of rambling. And also
if you go to the microphone, speak into
00:34:38.449 --> 00:34:53.889
the microphone, so you really get close to
it. Okay. Where do we have … Number 7!
00:34:53.889 --> 00:35:02.200
We start with mic number 7:
Question: Hello. My question is: How did
00:35:02.200 --> 00:35:13.020
you compute the example for the fonts, the
numbers? I didn't really understand it,
00:35:13.020 --> 00:35:19.770
you just said it was made from white
noise.
00:35:19.770 --> 00:35:25.580
Teubi: I'll give you a really brief recap
of what I did. I showed you that we have a
00:35:25.580 --> 00:35:31.140
model that maps an image to some meaningful
values, that an image can be encoded in
00:35:31.140 --> 00:35:36.860
just a few values. What happens here is
exactly the other way round. We have some
00:35:36.860 --> 00:35:43.270
values, just some arbitrary values we
actually know nothing about. We can
00:35:43.270 --> 00:35:47.480
generate pictures out of those. So I
trained this model to just take some
00:35:47.480 --> 00:35:54.560
random values and show the pictures
generated from the model. The training
00:35:54.560 --> 00:36:03.320
process was this "min-max game", as it's
called. We have two networks that try to
00:36:03.320 --> 00:36:08.260
compete against each other. One network
trying to distinguish, whether a picture
00:36:08.260 --> 00:36:12.790
it sees is real or one of those fake
pictures, and the network that actually
00:36:12.790 --> 00:36:18.510
generates those pictures and in training
the network that is able to distinguish
00:36:18.510 --> 00:36:24.599
between those, we can also get information
for the training of the network that
00:36:24.599 --> 00:36:30.410
generates the pictures. So the videos you
saw were just animations of what happens
00:36:30.410 --> 00:36:36.440
during this training process. At first if
we input noise we get noise. But as the
00:36:36.440 --> 00:36:41.510
network is able to better and better
recreate those images from the dataset we
00:36:41.510 --> 00:36:47.390
used as input, in this case pictures of
handwritten digits, the output also became
00:36:47.390 --> 00:36:54.660
more and more like those handwritten
digits. Hope that helped.
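The "min-max game" in this answer can be caricatured in a few lines of Python. Everything here is a stand-in: the "generator" is a single number, the "discriminator" is a fixed hand-written scoring function (in real GAN training it is a second network trained simultaneously), and the update uses a numerical gradient instead of backpropagation.

```python
# Caricature of the generator's side of GAN training: climb the
# discriminator's "realness" score by gradient ascent.

theta = 0.0  # the toy "generator": it simply outputs this number

def discriminator(x):
    # hand-written critic: scores 1.0 at the "real" data value 5.0 and
    # falls off linearly; a real discriminator would be trained as well
    return max(0.0, 1.0 - abs(x - 5.0) / 5.0)

eps, lr = 0.01, 0.5
for step in range(100):
    # generator update: move theta toward a higher discriminator score
    # (numerical gradient instead of backpropagation, for the sketch)
    grad = (discriminator(theta + eps) - discriminator(theta - eps)) / (2 * eps)
    theta += lr * grad

print(round(theta, 2))  # close to 5.0, the value the critic calls "real"
```

The generator parameter climbs toward the value the critic scores as most "real" – the same dynamic that, in the animations, pushed the noise images toward handwritten digits.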
00:36:54.660 --> 00:37:06.590
Herald Angel: Now we go to the
Internet. – Can we get sound for the signal
00:37:06.590 --> 00:37:10.040
Angel, please? Teubi: Sounded so great,
"now we go to the Internet."
00:37:10.040 --> 00:37:11.040
Herald Angel: Yeah, that sounds like
"yeeaah".
00:37:11.040 --> 00:37:13.040
Signal Angel: And now we're finally ready
to go to the interwebs. "Schorsch" is
00:37:13.040 --> 00:37:18.040
asking: Do you have any recommendations
for a beginner regarding the framework or
00:37:18.040 --> 00:37:26.460
the software?
Teubi: I, of course, am very biased to
00:37:26.460 --> 00:37:34.150
recommend what I use everyday. But I also
think that it is a great start. Basically,
00:37:34.150 --> 00:37:40.210
use Python and use PyTorch. Many people
will disagree with me and tell you
00:37:40.210 --> 00:37:45.930
"tensorflow is better." It might be, in my
opinion not for getting started, and there
00:37:45.930 --> 00:37:51.560
are also some nice tutorials on the
pytorch website. What you can also do is
00:37:51.560 --> 00:37:57.200
look at websites like OpenAI, where they
have a gym to get you started with some
00:37:57.200 --> 00:38:02.371
training exercises, where you already have
datasets. Yeah, basically my
00:38:02.371 --> 00:38:08.600
recommendation is get used to Python and
start with a pytorch tutorial, see where
00:38:08.600 --> 00:38:13.590
to go from there. Often there are also some
GitHub repositories linked with many
00:38:13.590 --> 00:38:18.740
examples for already established network
architectures like the CycleGAN or the
00:38:18.740 --> 00:38:26.250
GAN itself or basically everything else.
There will be a repo you can use to get
00:38:26.250 --> 00:38:29.940
started.
Herald Angel: OK, we stay with the
00:38:29.940 --> 00:38:32.589
internet. There's some more questions, I
heard.
00:38:32.589 --> 00:38:37.920
Signal Angel: Yes. Rubin8 is asking: Have
you ever come across an example
00:38:37.920 --> 00:38:42.580
of a neural network that deals with audio
instead of images?
00:38:42.580 --> 00:38:49.410
Teubi: Me personally, no. At least not
directly. I've heard about examples, like
00:38:49.410 --> 00:38:54.859
where you can change the voice to sound
like another person, but there is not much
00:38:54.859 --> 00:38:59.980
I can reliably tell about that. My
expertise really is in image processing,
00:38:59.980 --> 00:39:05.550
I'm sorry.
Herald Angel: And I think we have time for
00:39:05.550 --> 00:39:12.340
one more question. We have one at number
8. Microphone number 8.
00:39:12.340 --> 00:39:20.730
Question: Is the current face recognition
technology in, for example, the iPhone X – is
00:39:20.730 --> 00:39:26.420
it also a deep learning algorithm or is
it something more simple? Do you have any
00:39:26.420 --> 00:39:31.880
idea about that?
Teubi: As far as I know, yes. That's all I
00:39:31.880 --> 00:39:38.630
can reliably tell you about that, but it
is not only based on images but also uses
00:39:38.630 --> 00:39:45.420
other information. I think distance
information encoded with some infrared
00:39:45.420 --> 00:39:50.599
signals. I don't really know exactly how
it works, but at least iPhones already
00:39:50.599 --> 00:39:56.000
have a neural network
processing engine built in, so a chip
00:39:56.000 --> 00:40:01.190
dedicated to just doing those
computations. You saw that many of those
00:40:01.190 --> 00:40:05.820
things can be parallelized, and this is
what those hardware architectures make use
00:40:05.820 --> 00:40:10.380
of. So I'm pretty confident in saying,
yes, they also do it there.
00:40:10.380 --> 00:40:12.786
How exactly, no clue.
00:40:13.760 --> 00:40:15.323
Herald Angel: OK. I myself have a last
00:40:15.390 --> 00:40:20.680
completely unrelated question: Did you
create the design of the slides yourself?
00:40:20.680 --> 00:40:29.060
Teubi: I had some help. We have a really
great Congress design and I use that as an
00:40:29.060 --> 00:40:32.790
inspiration to create those slides, yes.
00:40:32.790 --> 00:40:36.760
Herald Angel: OK, yeah, because those are really amazing. I love them.
00:40:36.760 --> 00:40:38.140
Teubi: Thank you!
00:40:38.470 --> 00:40:41.200
Herald Angel: OK, thank you very much
Teubi.
00:40:45.130 --> 00:40:48.900
35C3 outro music
00:40:48.900 --> 00:41:07.000
subtitles created by c3subtitles.de
in the year 2019. Join, and help us!