
35C3 - Introduction to Deep Learning

  • 0:00 - 0:18
    35C3 preroll music
  • 0:18 - 0:25
    Herald Angel: Welcome to our introduction
    to deep learning with Teubi. Deep
  • 0:25 - 0:30
    learning, also often called machine
    learning, is a hype word which we hear in
  • 0:30 - 0:37
    the media all the time. It's nearly as bad
    as blockchain. It's a solution for
  • 0:37 - 0:43
    everything. Today we'll get a sneak peek
    into the internals of this mystical black
  • 0:43 - 0:49
    box they are talking about. And Teubi
    will show us why people, who know what
  • 0:49 - 0:53
    machine learning really is about, have to
    facepalm so often, when they read the
  • 0:53 - 0:59
    news. So please welcome Teubi
    with a big round of applause!
  • 0:59 - 1:10
    Applause
    Teubi: Alright! Good morning and welcome
  • 1:10 - 1:14
    to Introduction to Deep Learning. The
    title will already tell you what this talk
  • 1:14 - 1:20
    is about. I want to give you an
    introduction to how deep learning works,
  • 1:20 - 1:27
    what happens inside this black box. But,
    first of all, who am I? I'm Teubi. It's a
  • 1:27 - 1:32
    German nickname, it has nothing to do with
    toys or bees. You might have heard my
  • 1:32 - 1:36
    voice before, because I host the
    Nussschale podcast. There I explain
  • 1:36 - 1:42
    scientific topics in under 10 minutes.
    I'll have to use a little more time today,
  • 1:42 - 1:47
    and you'll also have fancy animations
    which hopefully will help. In my day job
  • 1:47 - 1:53
    I'm a research scientist at an institute
    for computer vision. I analyze microscopy
  • 1:53 - 1:58
    images of bone marrow blood cells and try
    to find ways to teach the computer to
  • 1:58 - 2:05
    understand what it sees. Namely, to
    differentiate between certain cells or,
  • 2:05 - 2:09
    first of all, find cells in an image,
    which is a task that is more complex than
  • 2:09 - 2:17
    it might sound. Let me start with the
    introduction to deep learning. We all know
  • 2:17 - 2:23
    how to code. We code in a very simple way.
    We have some input for our computer
  • 2:23 - 2:28
    algorithm. Then we have an algorithm which
    says: Do this, do that. If this, then
  • 2:29 - 2:29
    that. And in that way we generate some
    output. This is not how machine learning
  • 2:29 - 2:31
    works. Machine learning assumes you have
    some input, and you also have some output.
  • 2:41 - 2:46
    And what you also have is some statistical
    model. This statistical model is flexible.
  • 2:46 - 2:52
    It has certain parameters, which it can
    learn from the distribution of inputs and
  • 2:52 - 2:57
    outputs you give it for training. So you
    basically teach the statistical model to
  • 2:57 - 3:04
    generate the desired output from the given
    input. Let me give you a really simple
  • 3:04 - 3:10
    example of how this might work. Let's say
    we have two animals. Well, we have two
  • 3:10 - 3:16
    kinds of animals: unicorns and rabbits.
    And now we want to find an algorithm that
  • 3:16 - 3:24
    tells us whether this animal we have right
    now as an input is a rabbit or a unicorn.
  • 3:24 - 3:28
    We can write a simple algorithm to do
    that, but we can also do it with machine
  • 3:28 - 3:35
    learning. The first thing we need is some
    input. I choose two features that are able
  • 3:35 - 3:42
    to tell me whether this animal is a rabbit
    or a unicorn. Namely, speed and size. We
  • 3:42 - 3:47
    call these features, and they describe
    something about what we want to classify.
  • 3:47 - 3:52
    And the class is in this case our animal.
    First thing I need is some training data,
  • 3:52 - 3:59
    some input. The input here are just pairs
    of speed and size. What I also need is
  • 3:59 - 4:04
    information about the desired output. The
    desired output, of course, being the
  • 4:04 - 4:13
    class. So either unicorn or rabbit, here
    denoted by yellow and red X's. So let's
  • 4:13 - 4:18
    try to find a statistical model which we
    can use to separate this feature space
  • 4:18 - 4:24
    into two halves: One for the rabbits, one
    for the unicorns. Looking at this, we can
  • 4:24 - 4:29
    actually find a really simple statistical
    model, and our statistical model in this
  • 4:29 - 4:34
    case is just a straight line. And the
    learning process is then to find where in
  • 4:34 - 4:41
    this feature space the line should be.
    Ideally, for example, here. Right in the
  • 4:41 - 4:45
    middle between the two classes rabbit and
    unicorn. Of course this is an overly
  • 4:45 - 4:50
    simplified example. Real-world
    applications have feature distributions
  • 4:50 - 4:56
    which look much more like this. So, we
    have a gradient, we don't have a perfect
  • 4:56 - 5:00
    separation between those two classes, and
    those two classes are definitely not
  • 5:00 - 5:06
    separable by a line. If we look again at
    some training samples — training samples
  • 5:06 - 5:12
    are the data points we use for the machine
    learning process, so, to try to find the
  • 5:12 - 5:18
    parameters of our statistical model — if
    we look at the line again, then this will
  • 5:18 - 5:23
    not be able to separate this training set.
    Well, we will have a line that has some
  • 5:23 - 5:27
    errors, some unicorns which will be
    classified as rabbits, some rabbits which
  • 5:27 - 5:33
    will be classified as unicorns. This is
    what we call underfitting. Our model is
  • 5:33 - 5:40
    just not able to express what we want it
    to learn. There is the opposite case. The
  • 5:40 - 5:46
    opposite case being: we just learn all the
    training samples by heart. This is if we
  • 5:46 - 5:50
    have a very complex model and just a few
    training samples to teach the model what
  • 5:50 - 5:55
    it should learn. In this case we have a
    perfect separation of unicorns and
  • 5:55 - 6:01
    rabbits, at least for the few data points
    we have. If we draw another example from
  • 6:01 - 6:07
    the real world, some other data points,
    they will most likely be wrong. And this
  • 6:07 - 6:11
    is what we call overfitting. The perfect
    scenario in this case would be something
  • 6:11 - 6:17
    like this: a classifier which is really
    close to the distribution we have in the
  • 6:17 - 6:23
    real world and machine learning is tasked
    with finding this perfect model and its
  • 6:23 - 6:29
    parameters. Let me show you a different
    kind of model, something you probably all
  • 6:29 - 6:36
    have heard about: Neural networks. Neural
    networks are inspired by the brain.
  • 6:36 - 6:41
    Or more precisely, by the neurons in our
    brain. Neurons are tiny objects, tiny
  • 6:41 - 6:47
    cells in our brain that take some input
    and generate some output. Sounds familiar,
  • 6:47 - 6:53
    right? We have inputs usually in the form
    of electrical signals. And if they are
  • 6:53 - 6:58
    strong enough, this neuron will also send
    out an electrical signal. And this is
  • 6:58 - 7:03
    something we can model in a computer-
    engineering way. So, what we do is: We
  • 7:03 - 7:09
    take a neuron. The neuron is just a simple
    mapping from input to output. Input here,
  • 7:09 - 7:17
    just three input nodes. We denote them by
    i1, i2 and i3 and output denoted by o. And
  • 7:17 - 7:21
    now you will actually see some
    mathematical equations. There are not many
  • 7:21 - 7:27
    of these in this foundation talk, don't
    worry, and it's really simple. There's one
  • 7:27 - 7:30
    more thing we need first, though, if we
    want to map input to output in the way a
  • 7:30 - 7:35
    neuron does. Namely, the weights. The
    weights are just some arbitrary numbers
  • 7:35 - 7:43
    for now. Let's call them w1, w2 and w3.
    So, we take those weights and we multiply
  • 7:43 - 7:51
    them with the input. Input1 times weight1,
    input2 times weight2, and so on. And this,
  • 7:51 - 7:58
    this sum will just be our output. Well,
    not quite. We make it a little bit more
  • 7:58 - 8:02
    complicated. We also use something called
    an activation function. The activation
  • 8:02 - 8:09
    function is just a mapping from one scalar
    value to another scalar value. In this
  • 8:09 - 8:14
    case from what we got as an output, the
    sum, to something that more closely fits
  • 8:14 - 8:19
    what we need. This could for example be
    something binary, where we have all the
  • 8:19 - 8:24
    negative numbers being mapped to zero and
    all the positive numbers being mapped to
  • 8:24 - 8:31
    one. And then this zero and one can encode
    something. For example: rabbit or unicorn.
  • 8:31 - 8:35
    So, let me give you an example of how we
    can make the previous example with the
  • 8:35 - 8:42
    rabbits and unicorns work with such a
    simple neuron. We just use speed, size,
  • 8:42 - 8:50
    and the arbitrarily chosen number 10 as
    our inputs and the weights 1, 1, and -1.
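
    As a rough illustration, not part of the talk itself: this single
    neuron can be written in a few lines of plain Python. Mapping the
    output 1 to "unicorn" and 0 to "rabbit" is an assumption made only
    for this sketch.

    def step(x):
        # activation function: negative sums become 0, positive sums become 1
        return 1 if x > 0 else 0

    def classify(speed, size):
        inputs = [speed, size, 10]   # speed, size, and the arbitrarily chosen 10
        weights = [1, 1, -1]
        weighted_sum = sum(i * w for i, w in zip(inputs, weights))
        return "unicorn" if step(weighted_sum) == 1 else "rabbit"

    print(classify(speed=9, size=4))   # 9 + 4 > 10, so the neuron outputs 1
    print(classify(speed=2, size=3))   # 2 + 3 < 10, so the neuron outputs 0
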
  • 8:50 - 8:54
    If we look at the equations, then we get
    for our negative numbers — so, speed plus
  • 8:54 - 9:01
    size being less than 10 — a 0, and a 1 for
    all positive numbers — being speed plus
  • 9:01 - 9:08
    size larger than 10. This
    way we again have a separating line
  • 9:08 - 9:15
    between unicorns and rabbits. But again we
    have this really simplistic model. We want
  • 9:15 - 9:22
    to become more and more complicated in
    order to express more complex tasks. So
  • 9:22 - 9:26
    what do we do? We take more neurons. We
    take our three input values and put them
  • 9:26 - 9:32
    into one neuron, and into a second neuron,
    and into a third neuron. And we take the
  • 9:32 - 9:38
    output of those three neurons as input for
    another neuron. We also call this a
  • 9:38 - 9:42
    multilayer perceptron, perceptron just
    being a different name for a neuron, what
  • 9:42 - 9:49
    we have there. And the whole thing is also
    called a neural network. So now the
  • 9:49 - 9:53
    question: How do we train this? How do we
    learn what this network should encode?
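
    Before that question is answered, purely as an illustration: such a
    small multilayer perceptron could be set up in PyTorch (the framework
    recommended later in the Q&A) roughly like this; the layer sizes and
    the activation are arbitrary choices for the sketch, not the speaker's.

    import torch

    # three inputs -> three hidden neurons -> one output neuron
    model = torch.nn.Sequential(
        torch.nn.Linear(3, 3),   # every hidden neuron sees all three inputs
        torch.nn.Sigmoid(),      # activation function
        torch.nn.Linear(3, 1),   # one neuron combines the three hidden outputs
        torch.nn.Sigmoid(),
    )

    example = torch.tensor([[4.0, 9.0, 10.0]])   # e.g. speed, size, constant
    print(model(example))   # untrained, so the output is not meaningful yet
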
  • 9:53 - 9:58
    Well, we want a mapping from input to
    output, and what we can change are the
  • 9:58 - 10:03
    weights. First, what we do is we take a
    training sample, some input. Put it
  • 10:03 - 10:07
    through the network, get an output. But
    this might not be the desired output which
  • 10:07 - 10:14
    we know. So, in the binary case there are
    four possible cases: computed output,
  • 10:14 - 10:20
    expected output, each two values, 0 and 1.
    The best case would be: we want a 0, get a
  • 10:20 - 10:27
    0, want a 1 and get a 1. But there are also
    the opposite case. In these two cases we
  • 10:27 - 10:31
    can learn something about our model.
    Namely, in which direction to change the
  • 10:31 - 10:37
    weights. It's a little bit simplified, but
    in principle you just raise the weights if
  • 10:37 - 10:41
    you need a higher number as output and you
    lower the weights if you need a lower
  • 10:41 - 10:47
    number as output. To tell you how much, we
    have two terms. First term being the
  • 10:47 - 10:53
    error, so in this case just the difference
    between desired and expected output – also
  • 10:53 - 10:57
    often called a loss function, especially
    in deep learning and more complex
  • 10:57 - 11:04
    applications. You also have a second term
    we call the learning rate, and the
  • 11:04 - 11:09
    learning rate is what tells us how quickly
    we should change the weights, how quickly
  • 11:09 - 11:15
    we should adapt the weights. Okay, this is
    how we learn a model. This is almost
  • 11:15 - 11:19
    everything you need to know. There are
    mathematical equations that tell you how
  • 11:19 - 11:24
    much to change based on the error and the
    learning rate. And this is the entire
  • 11:24 - 11:30
    learning process. Let's get back to the
    terminology. We have the input layer. We
  • 11:30 - 11:34
    have the output layer, which somehow
    encodes our output either in one value or
  • 11:34 - 11:40
    in several values if we have multiple
    classes. We also have
  • 11:40 - 11:46
    the hidden layers, which are actually what
    makes our model deep. What we can change,
  • 11:46 - 11:52
    what we can learn, is the are the weights,
    the parameters of this model. But what we
  • 11:52 - 11:55
    also need to keep in mind, is the number
    of layers, the number of neurons per
  • 11:55 - 12:00
    layer, the learning rate, and the
    activation function. These are called
  • 12:00 - 12:04
    hyperparameters, and they determine how
    complex our model is, how well it is
  • 12:04 - 12:10
    suited to solve the task at hand. I quite
    often spoke about solving tasks, so the
  • 12:10 - 12:15
    question is: What can we actually do with
    neural networks? Mostly classification
  • 12:15 - 12:20
    tasks, for example: Tell me, is this
    animal a rabbit or unicorn? Is this text
  • 12:20 - 12:25
    message spam or legitimate? Is this
    patient healthy or ill? Is this image a
  • 12:25 - 12:31
    picture of a cat or a dog? We already saw
    for the animal that we need something
  • 12:31 - 12:35
    called features, which somehow encodes
    information about what we want to
  • 12:35 - 12:40
    classify, something we can use as input
    for the neural network. Some kind of
  • 12:40 - 12:44
    number that is meaningful. So, for the
    animal it could be speed, size, or
  • 12:44 - 12:49
    something like color. Color, of course,
    being more complex again, because we have,
  • 12:49 - 12:56
    for example, RGB, so three values. And,
    text message being a more complex case
  • 12:56 - 13:00
    again, because we somehow need to encode
    the sender, and whether the sender is
  • 13:00 - 13:05
    legitimate. Same for the recipient, or the
    number of hyperlinks, or where the
  • 13:05 - 13:11
    hyperlinks refer to, or whether there
    are certain words present in the text. It
  • 13:11 - 13:17
    gets more and more complicated. Even more
    so for a patient. How do we encode medical
  • 13:17 - 13:22
    history in a proper way for the network to
    learn? I mean, temperature is simple. It's
  • 13:22 - 13:27
    a scalar value, we just have a number. But
    how do we encode whether certain symptoms
  • 13:27 - 13:33
    are present. And the image, which is
    actually what I work with everyday, is
  • 13:33 - 13:38
    again quite complex. We have values, we
    have numbers, but only pixel values, which
  • 13:38 - 13:43
    are difficult to
    use as input for a neural network. Why?
  • 13:43 - 13:48
    I'll show you. I'll actually show you with
    this picture, it's a very famous picture,
  • 13:48 - 13:54
    and everybody uses it in computer vision.
    They will tell you, it's because there is
  • 13:54 - 14:01
    a multitude of different characteristics
    in this image: shapes, edges, whatever you
  • 14:01 - 14:07
    desire. The truth is, it's a crop from the
    centrefold of the Playboy, and in earlier
  • 14:07 - 14:12
    years, the computer vision engineers were a
    mostly male audience. Anyway, let's take
  • 14:12 - 14:17
    five by five pixels. Let's assume this is
    a five-by-five-pixel, really small,
  • 14:17 - 14:22
    image. If we take those 25 pixels and use
    them as input for a neural network you
  • 14:22 - 14:27
    already see that we have many connections
    - many weights - which means a very
  • 14:27 - 14:33
    complex model. A complex model, of course, is
    prone to overfitting. But there are more
  • 14:33 - 14:39
    problems. First being, we have
    disconnected a
  • 14:39 - 14:44
    pixel from its neighbors. We can't encode
    information about the neighborhood
  • 14:44 - 14:48
    anymore, and that really sucks. If we just
    take the whole picture, and move it to the
  • 14:48 - 14:53
    left or to the right by just one pixel,
    the network will see something completely
  • 14:53 - 14:58
    different, even though to us it is exactly
    the same. But, we can solve that with some
  • 14:58 - 15:03
    very clever engineering, something we call
    a convolutional layer. It is again a
  • 15:03 - 15:09
    hidden layer in a neural network, but it
    does something special. It actually is a
  • 15:09 - 15:14
    very simple neuron again, just four input
    values - one output value. But the four
  • 15:14 - 15:20
    input values look at two by two pixels,
    and encode one output value. And then the
  • 15:20 - 15:24
    same network is shifted to the right, and
    encodes another pixel, and another pixel,
  • 15:24 - 15:30
    and the next row of pixels. And in this
    way creates another 2D image. We have
  • 15:30 - 15:35
    preserved information about the
    neighborhood, and we just have a very low
  • 15:35 - 15:42
    number of weights, not the huge number of
    parameters we saw earlier. We can use this
  • 15:42 - 15:50
    once, or twice, or several hundred times.
    And this is actually where we go deep.
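
    A rough sketch of that sliding two-by-two window, with made-up
    weights (plain Python/NumPy, not the speaker's code): the same four
    weights look at every two-by-two patch and produce one value per
    position, so the result is again a small 2D image.

    import numpy as np

    def conv2x2(image, weights):
        h, w = image.shape
        out = np.zeros((h - 1, w - 1))
        for y in range(h - 1):
            for x in range(w - 1):
                patch = image[y:y + 2, x:x + 2]
                out[y, x] = (patch * weights).sum()   # same shared weights everywhere
        return out

    image = np.arange(25, dtype=float).reshape(5, 5)   # the 5x5 toy image
    weights = np.array([[1.0, -1.0], [0.5, 0.0]])      # only four numbers to learn
    print(conv2x2(image, weights).shape)               # (4, 4) - still an image
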
  • 15:50 - 15:55
    Deep means: We have several layers, and
    having layers that don't need thousands or
  • 15:55 - 16:01
    millions of connections, but only a few.
    This is what allows us to go really deep.
  • 16:01 - 16:06
    And in this fashion we can encode an
    entire image in just a few meaningful
  • 16:06 - 16:11
    values. What these values look like, and
    what they encode, this is learned through
  • 16:11 - 16:18
    the learning process. And we can then, for
    example, use these few values as input for
  • 16:18 - 16:25
    a classification network.
    The fully connected network we saw earlier.
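
    Purely as an illustration, one way this pipeline might look in
    PyTorch; the architecture, channel counts and the two output classes
    are invented for the sketch, not taken from the talk.

    import torch

    encoder = torch.nn.Sequential(
        torch.nn.Conv2d(1, 8, kernel_size=3, padding=1),    # learn local filters
        torch.nn.ReLU(),
        torch.nn.MaxPool2d(2),                              # pooling: shrink, keep the strongest response
        torch.nn.Conv2d(8, 16, kernel_size=3, padding=1),
        torch.nn.ReLU(),
        torch.nn.MaxPool2d(2),
        torch.nn.Flatten(),                                 # a few meaningful values
    )
    classifier = torch.nn.Linear(16 * 7 * 7, 2)             # fully connected head, e.g. cat vs. dog

    image = torch.randn(1, 1, 28, 28)                       # stand-in for a 28x28 input image
    print(classifier(encoder(image)))                       # two class scores
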
  • 16:25 - 16:30
    Or we can do something more clever. We can
    do the inverse operation and create an image
  • 16:30 - 16:35
    again, for example, the same image, which
    is then called an autoencoder.
  • 16:35 - 16:40
    Autoencoders are tremendously useful, even
    though they don't appear that way. For
  • 16:40 - 16:44
    example, imagine you want to check whether
    something has a defect, or not, a picture
  • 16:44 - 16:51
    of a fabric, or of something. You just
    train the network with normal pictures.
  • 16:51 - 16:57
    And then, if you have a defect picture,
    the network is not able to produce this
  • 16:57 - 17:02
    defect. And so the difference of the
    reproduced picture, and the real picture
  • 17:02 - 17:07
    will show you where errors are. If it
    works properly, I'll have to admit that.
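
    The defect idea as a short, hedged sketch; the trained autoencoder
    itself is assumed to exist already.

    import torch

    def defect_map(autoencoder, image):
        # autoencoder: a model trained only on defect-free pictures
        # image: tensor of shape (1, channels, height, width)
        with torch.no_grad():
            reconstruction = autoencoder(image)
        return (image - reconstruction).abs()   # large differences hint at defects
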
  • 17:07 - 17:13
    But we can go even further. Let's say, we
    want to encode something entirely else.
  • 17:13 - 17:17
    Well, let's encode the image, the
    information in the image, but in another
  • 17:17 - 17:22
    representation. For example, let's say we
    have three classes again. The background
  • 17:22 - 17:30
    class in grey, a class called hat or
    headwear in blue, and person in green. We
  • 17:30 - 17:34
    can also use this for other applications
    than just for pictures of humans. For
  • 17:34 - 17:38
    example, we have a picture of a street and
    want to encode: Where is the car, where's
  • 17:38 - 17:45
    the pedestrian? Tremendously useful. Or we
    have an MRI scan of a brain: Where in the
  • 17:45 - 17:51
    brain is the tumor? Can we somehow learn
    this? Yes we can do this, with methods
  • 17:51 - 17:57
    like these, if they are trained properly.
    More about that later. Well we expect
  • 17:57 - 18:01
    something like this to come out but the
    truth looks rather like this – especially
  • 18:01 - 18:06
    if it's not properly trained. We don't have
    the real shape we want to get but
  • 18:06 - 18:12
    something distorted. So here is again
    where we need to do learning. First we
  • 18:12 - 18:16
    take a picture, put it through the
    network, get our output representation.
  • 18:16 - 18:21
    And we have the information about how we
    want it to look. We again compute some
  • 18:21 - 18:27
    kind of loss value. This time for example
    being the overlap between the shape we get
  • 18:27 - 18:34
    out of the model and the shape we want to
    have. And we use this error, this lost
  • 18:34 - 18:39
    function, to update the weights of our
    network. Again – even though it's more
  • 18:39 - 18:44
    complicated here, even though we have more
    layers, and even though the layers look
  • 18:44 - 18:49
    slightly different – it is the same
    process all over again as with a binary
  • 18:49 - 18:57
    case. And we need lots of training data.
    This is something that you'll hear often
  • 18:57 - 19:03
    in connection with deep learning: You need
    lots of training data to make this work.
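
    The forward / compare / update loop just described, as a hedged
    PyTorch sketch; the tiny model, the random stand-in data and the
    choice of loss are mine, not the speaker's setup.

    import torch

    model = torch.nn.Conv2d(1, 1, kernel_size=3, padding=1)    # toy stand-in for a segmentation network
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # lr is the learning rate
    loss_fn = torch.nn.BCEWithLogitsLoss()                     # one possible loss function

    training_data = [(torch.randn(1, 1, 32, 32),                        # image
                      torch.randint(0, 2, (1, 1, 32, 32)).float())      # desired mask
                     for _ in range(8)]

    for image, wanted_mask in training_data:
        predicted_mask = model(image)                # forward pass
        loss = loss_fn(predicted_mask, wanted_mask)  # how far off are we?
        optimizer.zero_grad()
        loss.backward()                              # in which direction to change each weight
        optimizer.step()                             # change them a little, scaled by the learning rate
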
  • 19:03 - 19:10
    Images are complex things and in order to
    meaningfully extract knowledge from them,
  • 19:10 - 19:17
    the network needs to see a multitude of
    different images. Well now I already
  • 19:17 - 19:22
    showed you some things we use in network
    architecture, some support networks: The
  • 19:22 - 19:27
    fully convolutional encoder, which takes
    an image and produces a few meaningful
  • 19:27 - 19:33
    values out of this image; its counterpart
    the fully convolutional decoder – fully
  • 19:33 - 19:37
    convolutional meaning by the way that we
    only have these convolutional layers with
  • 19:37 - 19:43
    a few parameters that somehow encode
    spatial information and keep it for the
  • 19:43 - 19:49
    next layers. The decoder takes a few
    meaningful numbers and reproduces an image
  • 19:49 - 19:55
    – either the same image or another
    representation of the information encoded
  • 19:55 - 20:01
    in the image. We also already saw the
    fully connected network. Fully connected
  • 20:01 - 20:07
    meaning every neuron is connected to every
    neuron in the next layer. This of course
  • 20:07 - 20:13
    can be dangerous because this is where we
    actually get most of our parameters. If we
  • 20:13 - 20:16
    have a fully connected network, this is
    where the most parameters will be present
  • 20:16 - 20:22
    because connecting every node to every
    node … this is just a high number of
  • 20:22 - 20:26
    connections. We can also do other things.
    For example something called a pooling
  • 20:26 - 20:32
    layer. A pooling layer being basically the
    same as one of those convolutional layers,
  • 20:32 - 20:36
    just that we don't have parameters we need
    to learn. This works without parameters
  • 20:36 - 20:44
    because this neuron just chooses whichever
    value is the highest and takes that value
  • 20:44 - 20:50
    as output. This is really great for
    reducing the size of your image and also
  • 20:50 - 20:55
    getting rid of information that might not
    be that important. We can also do some
  • 20:55 - 21:00
    clever techniques like adding a dropout
    layer. A dropout layer just being a normal
  • 21:00 - 21:06
    layer in a neural network where we remove
    some connections: In one training step
  • 21:06 - 21:11
    these connections, in the next training
    step some other connections. This way we
  • 21:11 - 21:18
    teach the other connections to become more
    resilient against errors. I would like to
  • 21:18 - 21:23
    start with something I call the "Model
    Show" now, and show you some models and
  • 21:23 - 21:29
    how we train those models. And I will
    start with a fully convolutional decoder
  • 21:29 - 21:35
    we saw earlier: This thing that takes a
    number and creates a picture. I would like
  • 21:35 - 21:41
    to take this model, put in some number and
    get out a picture – a picture of a horse
  • 21:41 - 21:46
    for example. If I put in a different
    number I also want to get a picture of a
  • 21:46 - 21:52
    horse, but of a different horse. So what I
    want to get is a mapping from some
  • 21:52 - 21:57
    numbers, some features that encode
    something about the horse picture, and get
  • 21:57 - 22:03
    a horse picture out of it. You might see
    already why this is problematic. It is
  • 22:03 - 22:08
    problematic because we don't have a
    mapping from feature to horse or from
  • 22:08 - 22:15
    horse to features. So we don't have a
    truth value we can use to learn how to
  • 22:15 - 22:22
    generate this mapping. Well computer
    vision engineers – or deep learning
  • 22:22 - 22:27
    professionals – they're smart and have
    clever ideas. Let's just assume we have
  • 22:27 - 22:33
    such a network and let's call it a
    generator. Let's take some numbers, put
  • 22:33 - 22:39
    them into the generator and get some
    horses. Well it doesn't work yet. We still
  • 22:39 - 22:42
    have to train it. So they're probably not
    only horses but also some very special
  • 22:42 - 22:48
    unicorns among the horses; which might be
    nice for other applications, but I wanted
  • 22:48 - 22:55
    pictures of horses right now. So I can't
    train with this data directly. But what I
  • 22:55 - 23:02
    can do is I can create a second network.
    This network is called a discriminator and
  • 23:02 - 23:09
    I can give it the input generated from the
    generator as well as the real data I have:
  • 23:09 - 23:14
    the real horse pictures. And then I can
    teach the discriminator to distinguish
  • 23:14 - 23:22
    between those. Tell me it is a real horse
    or it's not a real horse. And there I know
  • 23:22 - 23:27
    what is the truth because I either take
    real horse pictures or fake horse pictures
  • 23:27 - 23:34
    from the generator. So I have a truth
    value for this discriminator. But in doing
  • 23:34 - 23:39
    this I also have a truth value for the
    generator. Because I want the generator to
  • 23:39 - 23:44
    work against the discriminator. So I can
    also use the information how well the
  • 23:44 - 23:51
    discriminator does to train the generator
    to become better in fooling. This is
  • 23:51 - 23:57
    called a generative adversarial network.
    And it can be used to generate pictures of
  • 23:57 - 24:02
    an arbitrary distribution. Let's do this
    with numbers and I will actually show you
  • 24:02 - 24:08
    the training process. Before I start the
    video, I'll tell you what I did. I took
  • 24:08 - 24:12
    some handwritten digits. There is a
    database called "??? of handwritten
  • 24:12 - 24:19
    digits" so the numbers of 0 to 9. And I
    took those and used them as training data.
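
    The generator/discriminator game he describes can be sketched
    roughly like this in PyTorch; the tiny fully connected networks and
    the random stand-in batches are mine, a real setup uses actual digit
    images and larger architectures.

    import torch
    from torch import nn

    generator = nn.Sequential(nn.Linear(100, 256), nn.ReLU(),
                              nn.Linear(256, 28 * 28), nn.Tanh())
    discriminator = nn.Sequential(nn.Linear(28 * 28, 256), nn.ReLU(),
                                  nn.Linear(256, 1))

    bce = nn.BCEWithLogitsLoss()
    opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

    digit_batches = [torch.randn(64, 28 * 28) for _ in range(10)]   # stand-in for real digit batches

    for real_images in digit_batches:
        noise = torch.randn(real_images.size(0), 100)
        fake_images = generator(noise)
        ones = torch.ones(real_images.size(0), 1)
        zeros = torch.zeros(real_images.size(0), 1)

        # 1) teach the discriminator: real -> 1, fake -> 0
        d_loss = bce(discriminator(real_images), ones) + \
                 bce(discriminator(fake_images.detach()), zeros)
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()

        # 2) teach the generator to fool the discriminator: fake -> "looks real"
        g_loss = bce(discriminator(fake_images), ones)
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()
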
  • 24:19 - 24:24
    I trained a generator in the way I showed
    you on the previous slide, and then I just
  • 24:24 - 24:30
    took some random numbers. I put those
    random numbers into the network and just
  • 24:30 - 24:36
    stored the image of what came out of the
    network. And here in the video you'll see
  • 24:36 - 24:43
    how the network improved with ongoing
    training. You will see that we start
  • 24:43 - 24:50
    basically with just noisy images … and
    then after some – what we call epochs,
  • 24:50 - 24:56
    so training iterations – the network is
    able to almost perfectly generate
  • 24:56 - 25:06
    handwritten digits just from noise. Which
    I find truly fascinating. Of course this
  • 25:06 - 25:11
    is an example where it works. It highly
    depends on your data set and how you train
  • 25:11 - 25:16
    the model whether it is a success or not.
    But if it works, you can use it to
  • 25:16 - 25:23
    generate fonts. You can generate
    characters, 3D objects, pictures of
  • 25:23 - 25:29
    animals, whatever you want as long as you
    have training data. Let's go more crazy.
  • 25:29 - 25:35
    Let's take two of those and let's say we
    have pictures of horses and pictures of
  • 25:35 - 25:41
    zebras. I want to convert those pictures
    of horses into pictures of zebras, and I
  • 25:41 - 25:45
    want to convert pictures of zebras into
    pictures of horses. So I want to have the
  • 25:45 - 25:50
    same picture just with the other animal.
    But I don't have training data of the same
  • 25:50 - 25:56
    situation just once with a horse and once
    with a zebra. Doesn't matter. We can train
  • 25:56 - 26:01
    a network that does that for us. Again we
    just have a network – we call it the
  • 26:01 - 26:06
    generator – and we have two of those: One
    that converts horses to zebras and one
  • 26:06 - 26:15
    that converts zebras to horses. And then
    we also have two discriminators that tell
  • 26:15 - 26:21
    us: real horse – fake horse – real zebra –
    fake zebra. And then we again need to
  • 26:21 - 26:27
    perform some training. So we need to
    somehow encode: Did it work what we wanted
  • 26:27 - 26:31
    to do? And a very simple way to do this is
    we take a picture of a horse, put it
  • 26:31 - 26:35
    through the generator that generates a
    zebra. Take this fake picture of a zebra,
  • 26:35 - 26:39
    put it through the generator that
    generates a picture of a horse. And if
  • 26:39 - 26:44
    this is the same picture as we put in,
    then our model worked. And if it didn't,
  • 26:44 - 26:49
    we can use that information to update the
    weights. I just took a random picture,
  • 26:49 - 26:54
    from a free library on the Internet, of a
    horse and generated a zebra and it worked
  • 26:54 - 26:59
    remarkably well. I actually didn't even do
    training. It also doesn't need to be a
  • 26:59 - 27:03
    picture. You can also convert text to
    images: You describe something in words
  • 27:03 - 27:10
    and generate images. You can age your face
    or age a cell; or make a patient healthy
  • 27:10 - 27:16
    or sick – or the image of a patient, not
    the patient themselves, unfortunately. You can
  • 27:16 - 27:21
    do style transfer like take a picture of
    Van Gogh and apply it to your own picture.
  • 27:21 - 27:28
    Stuff like that. Something else that we
    can do with neural networks. Let's assume
  • 27:28 - 27:31
    we have a classification network, we have
    a picture of a toothbrush and the network
  • 27:31 - 27:37
    tells us: Well, this is a toothbrush.
    Great! But how resilient is this network?
  • 27:37 - 27:45
    Does it really work in every scenario?
    There's a second network we can apply: We
  • 27:45 - 27:49
    call it an adversarial network. And that
    network is trained to do one thing: Look
  • 27:49 - 27:52
    at the network, look at the picture, and
    then find the one weak spot in the
  • 27:52 - 27:56
    picture: Just change one pixel slightly so
    that the network will tell me this
  • 27:56 - 28:04
    toothbrush is an octopus. Works remarkably
    well. Also works with just changing the
  • 28:04 - 28:09
    picture slightly, so changing all the
    pixels, but just slight minute changes
  • 28:09 - 28:13
    that we don't perceive, but the network –
    the classification network – is completely
  • 28:13 - 28:20
    thrown off. Well sounds bad. Is bad if you
    don't consider it. But you can also for
  • 28:20 - 28:24
    example use this for training your network
    and make your network resilient. So
  • 28:24 - 28:28
    there's always an upside and downside.
    Something entirely else: Now I'd like to
  • 28:28 - 28:33
    show you something about text. A word-
    language model. I want to generate
  • 28:33 - 28:38
    sentences for my podcast. I have a network
    that gives me a word, and then if I want
  • 28:38 - 28:43
    to somehow get the next word in the
    sentence, I also need to consider this
  • 28:43 - 28:47
    word. So another network architecture –
    quite interestingly – just takes the
  • 28:47 - 28:52
    hidden states of the network and uses them
    as the input for the same network so that
  • 28:52 - 28:59
    in the next iteration we still know what
    we did in the previous step. I tried to
  • 28:59 - 29:05
    train a network that generates podcast
    episodes for my podcasts. Didn't work.
  • 29:05 - 29:08
    What I learned is I don't have enough
    training data. I really need to produce
  • 29:08 - 29:16
    more podcast episodes in order to train a
    model to do my job for me. And this is
  • 29:16 - 29:22
    very important, a very crucial point:
    Training data. We need shitloads of
  • 29:22 - 29:26
    training data. And actually the more
    complicated our model and our training
  • 29:26 - 29:31
    process becomes, the more training data we
    need. I started with a supervised case –
  • 29:31 - 29:36
    the really simple case, where we
  • 29:36 - 29:41
    have a picture and a label that
    corresponds to that picture; or a
  • 29:41 - 29:46
    representation of that picture showing
    entirely what I wanted to learn. But we
  • 29:46 - 29:52
    also saw a more complex task, where I had
    pictures – horses and zebras – that are
  • 29:52 - 29:56
    from two different domains – but domains
    with no direct mapping. What can also
  • 29:56 - 30:01
    happen – and actually happens quite a lot
    – is weakly annotated data, so data that
  • 30:01 - 30:09
    is not precisely annotated; where we can't
    rely on the information we get. Or even
  • 30:09 - 30:13
    more complicated: Something called
    reinforcement learning where we perform a
  • 30:13 - 30:19
    sequence of actions and then in the end
    are told "yeah that was great". Which is
  • 30:19 - 30:24
    often not enough information to really
    perform proper training. But of course
  • 30:24 - 30:28
    there are also methods for that. As well
    as there are methods for the unsupervised
  • 30:28 - 30:34
    case where we don't have annotations,
    labeled data – no ground truth at all –
  • 30:34 - 30:41
    just the picture itself. Well I talked
    about pictures. I told you that we can
  • 30:41 - 30:45
    learn features and create images from
    them. And we can use them for
  • 30:45 - 30:52
    classification. And for this there exist
    many databases. There are public data sets
  • 30:52 - 30:57
    we can use. Often they refer to for
    example Flickr. They're just hyperlinks
  • 30:57 - 31:01
    which is also why I didn't show you many
    pictures right here, because I am honestly
  • 31:01 - 31:06
    not sure about the copyright in those
    cases. But there are also challenge
  • 31:06 - 31:11
    datasets where you can just sign up, get
    some for example medical data sets, and
  • 31:11 - 31:17
    then compete against other researchers.
    And of course there are those companies
  • 31:17 - 31:22
    that just have lots of data. And those
    companies also have the means, the
  • 31:22 - 31:28
    capacity to perform intense computations.
    And those are also often the companies you
  • 31:28 - 31:36
    hear from in terms of innovation for deep
    learning. Well this was mostly to tell you
  • 31:36 - 31:40
    that you can process images quite well
    with deep learning if you have enough
  • 31:40 - 31:46
    training data, if you have a proper
    training process and also a little if you
  • 31:46 - 31:52
    know what you're doing. But you can also
    process text, you can process audio and
  • 31:52 - 31:59
    time series like prices or a stock
    exchange – stuff like that. You can
  • 31:59 - 32:03
    process almost everything if you make it
    encodeable to your network. Sounds like a
  • 32:03 - 32:08
    dream come true. But – as I already told
    you – you need data, a lot of it. I told
  • 32:08 - 32:14
    you about those companies that have lots
    of data sets and the publicly available
  • 32:14 - 32:21
    data sets which you can actually use to
    get started with your own experiments. But
  • 32:21 - 32:24
    that also makes it a little dangerous
    because deep learning still is a black box
  • 32:24 - 32:31
    to us. I told you what happens inside the
    black box on a level that teaches you how
  • 32:31 - 32:37
    we learn and how the network is
    structured, but not really what the
  • 32:37 - 32:43
    network learned. It is for us computer
    vision engineers really nice that we can
  • 32:43 - 32:49
    visualize the first layers of a neural
    network and see what is actually encoded
  • 32:49 - 32:54
    in those first layers; what information
    the network looks at. But you can't really
  • 32:54 - 32:59
    mathematically prove what happens in a
    network. Which is one major downside. And
  • 32:59 - 33:02
    so if you want to use it, the numbers may
    be really great but be sure to properly
  • 33:02 - 33:08
    evaluate them. In summary I call that
    "easy to learn". Every one – every single
  • 33:08 - 33:13
    one of you – can just start with deep
    learning right away. You don't need to do
  • 33:13 - 33:19
    much work. You don't need to do much
    learning. The model learns for you. But
  • 33:19 - 33:24
    they're hard to master in a way that makes
    them useful for production use cases for
  • 33:24 - 33:30
    example. So if you want to use deep
    learning for something – if you really
  • 33:30 - 33:34
    want to seriously use it –, make sure that
    it really does what you want it to and
  • 33:34 - 33:39
    doesn't learn something else – which also
    happens. Pretty sure you saw some talks
  • 33:39 - 33:44
    about deep learning fails – which is not
    what this talk is about. They're quite
  • 33:44 - 33:47
    funny to look at. Just make sure that they
    don't happen to you! If you do that
  • 33:47 - 33:53
    though, you'll achieve great things with
    deep learning, I'm sure. And that was
  • 33:53 - 34:01
    introduction to deep learning. Thank you!
    Applause
  • 34:09 - 34:13
    Herald Angel: So now it's question and
    answer time. So if you have a question,
  • 34:13 - 34:19
    please line up at the mikes. We have in
    total eight, so it shouldn't be far from
  • 34:19 - 34:26
    you. They are here in the corridors and on
    these sides. Please line up! For
  • 34:26 - 34:32
    everybody: A question consists of one
    sentence with a question mark at the end
  • 34:32 - 34:38
    – not three minutes of rambling. And also
    if you go to the microphone, speak into
  • 34:38 - 34:54
    the microphone, so you really get close to
    it. Okay. Where do we have … Number 7!
  • 34:54 - 35:02
    We start with mic number 7:
    Question: Hello. My question is: How did
  • 35:02 - 35:13
    you compute the example for the fonts, the
    numbers? I didn't really understand it,
  • 35:13 - 35:20
    you just said it was made from white
    noise.
  • 35:20 - 35:26
    Teubi: I'll give you a really brief recap
    of what I did. I showed you that we have a
  • 35:26 - 35:31
    model that maps image to some meaningful
    values, that an image can be encoded in
  • 35:31 - 35:37
    just a few values. What happens here is
    exactly the other way round. We have some
  • 35:37 - 35:43
    values, just some arbitrary values we
    actually know nothing about. We can
  • 35:43 - 35:47
    generate pictures out of those. So I
    trained this model to just take some
  • 35:47 - 35:55
    random values and show the pictures
    generated from the model. The training
  • 35:55 - 36:03
    process was this "min max game", as its
    called. We have two networks that try to
  • 36:03 - 36:08
    compete against each other. One network
    trying to distinguish, whether a picture
  • 36:08 - 36:13
    it sees is real or one of those fake
    pictures, and the network that actually
  • 36:13 - 36:19
    generates those pictures and in training
    the network that is able to distinguish
  • 36:19 - 36:25
    between those, we can also get information
    for the training of the network that
  • 36:25 - 36:30
    generates the pictures. So the videos you
    saw were just animations of what happens
  • 36:30 - 36:36
    during this training process. At first if
    we input noise we get noise. But as the
  • 36:36 - 36:42
    network is able to better and better
    recreate those images from the dataset we
  • 36:42 - 36:47
    used as input, in this case pictures of
    handwritten digits, the output also became
  • 36:47 - 36:55
    more and more similar to those numbers, these
    handwritten digits. Hope that helped.
  • 36:55 - 37:07
    Herald Angel: Now we go to the
    Internet. – Can we get sound for the signal
  • 37:07 - 37:10
    Angel, please? Teubi: Sounded so great,
    "now we go to the Internet."
  • 37:10 - 37:11
    Herald Angel: Yeah, that sounds like
    "yeeaah".
  • 37:11 - 37:13
    Signal Angel: And now we're finally ready
    to go to the interwebs. "Schorsch" is
  • 37:13 - 37:18
    asking: Do you have any recommendations
    for a beginner regarding the framework or
  • 37:18 - 37:26
    the software?
    Teubi: I, of course, am very biased to
  • 37:26 - 37:34
    recommend what I use everyday. But I also
    think that it is a great start. Basically,
  • 37:34 - 37:40
    use Python and use PyTorch. Many people
    will disagree with me and tell you
  • 37:40 - 37:46
    "tensorflow is better." It might be, in my
    opinion not for getting started, and there
  • 37:46 - 37:52
    are also some nice tutorials on the
    pytorch website. What you can also do is
  • 37:52 - 37:57
    look at websites like OpenAI, where they
    have a gym to get you started with some
  • 37:57 - 38:02
    training exercises, where you already have
    datasets. Yeah, basically my
  • 38:02 - 38:09
    recommendation is get used to Python and
    start with a PyTorch tutorial, see where
  • 38:09 - 38:14
    to go from there. Often there are also some
    GitHub repositories linked with many
  • 38:14 - 38:19
    examples for already established network
    architectures like the cycle GAN or the
  • 38:19 - 38:26
    GAN itself or basically everything else.
    There will be a repo you can use to get
  • 38:26 - 38:30
    started.
    Herald Angel: OK, we stay with the
  • 38:30 - 38:33
    internet. There's some more questions, I
    heard.
  • 38:33 - 38:38
    Signal Angel: Yes. Rubin8 is asking: Have
    you ever come across an example
  • 38:38 - 38:43
    of a neural network that deals with audio
    instead of images?
  • 38:43 - 38:49
    Teubi: Me personally, no. At least not
    directly. I've heard about examples, like
  • 38:49 - 38:55
    where you can change the voice to sound
    like another person, but there is not much
  • 38:55 - 39:00
    I can reliably tell about that. My
    expertise really is in image processing,
  • 39:00 - 39:06
    I'm sorry.
    Herald Angel: And I think we have time for
  • 39:06 - 39:12
    one more question. We have one at number
    8. Microphone number 8.
  • 39:12 - 39:21
    Question: Is the current face recognition
    technology in, for example, the iPhone X, is
  • 39:21 - 39:26
    it also a deep learning algorithm or is
    it something simpler? Do you have any
  • 39:26 - 39:32
    idea about that?
    Teubi: As far as I know, yes. That's all I
  • 39:32 - 39:39
    can reliably tell you about that, but it
    is not only based on images but also uses
  • 39:39 - 39:45
    other information. I think distance
    information encoded with some infrared
  • 39:45 - 39:51
    signals. I don't really know exactly how
    it works, but at least iPhones already
  • 39:51 - 39:56
    have a neural network
    processing engine built in, so a chip
  • 39:56 - 40:01
    dedicated to just doing those
    computations. You saw that many of those
  • 40:01 - 40:06
    things can be parallelized, and this is
    what those hardware architectures make use
  • 40:06 - 40:10
    of. So I'm pretty confident in saying,
    yes, they also do it there.
  • 40:10 - 40:13
    How exactly, no clue.
  • 40:14 - 40:15

    Herald Angel: OK. I myself have a last
  • 40:15 - 40:21
    completely unrelated question: Did you
    create the design of the slides yourself?
  • 40:21 - 40:29
    Teubi: I had some help. We have a really
    great Congress design and I use that as an
  • 40:29 - 40:33
    inspiration to create those slides, yes.
  • 40:33 - 40:37
    Herald Angel: OK, yeah, because those are really amazing. I love them.
  • 40:37 - 40:38
    Teubi: Thank you!
  • 40:38 - 40:41
    Herald Angel: OK, thank you very much
    Teubi.
  • 40:45 - 40:49
    35C3 outro music
  • 40:49 - 41:07
    subtitles created by c3subtitles.de
    in the year 2019. Join, and help us!