
35C3 - Introduction to Deep Learning

  • 0:00 - 0:18
    35C3 preroll music
  • 0:18 - 0:25
    Herald Angel: Welcome to our introduction
    to deep learning with Teubi. Deep
  • 0:25 - 0:30
    learning, also often called machine
    learning, is a hype word which we hear in
  • 0:30 - 0:37
    the media all the time. It's nearly as bad
    as blockchain. It's a solution for
  • 0:37 - 0:43
    everything. Today we'll get a sneak peek
    into the internals of this mystical black
  • 0:43 - 0:49
    box they are talking about. And Teubi
    will show us why people, who know what
  • 0:49 - 0:53
    machine learning really is about, have to
    facepalm so often, when they read the
  • 0:53 - 0:59
    news. So please welcome Teubi
    with a big round of applause!
  • 0:59 - 1:10
    Applause
    Teubi: Alright! Good morning and welcome
  • 1:10 - 1:14
    to Introduction to Deep Learning. The
    title will already tell you what this talk
  • 1:14 - 1:20
    is about. I want to give you an
    introduction to how deep learning works,
  • 1:20 - 1:27
    what happens inside this black box. But,
    first of all, who am I? I'm Teubi. It's a
  • 1:27 - 1:32
    German nickname, it has nothing to do with
    toys or bees. You might have heard my
  • 1:32 - 1:36
    voice before, because I host the
    Nussschale podcast. There I explain
  • 1:36 - 1:42
    scientific topics in under 10 minutes.
    I'll have to use a little more time today,
  • 1:42 - 1:47
    and you'll also have fancy animations
    which hopefully will help. In my day job
  • 1:47 - 1:53
    I'm a research scientist at an institute
    for computer vision. I analyze microscopy
  • 1:53 - 1:58
    images of bone marrow blood cells and try
    to find ways to teach the computer to
  • 1:58 - 2:05
    understand what it sees. Namely, to
    differentiate between certain cells or,
  • 2:05 - 2:09
    first of all, find cells in an image,
    which is a task that is more complex than
  • 2:09 - 2:17
    it might sound. Let me start with the
    introduction to deep learning. We all know
  • 2:17 - 2:23
    how to code. We code in a very simple way.
    We have some input for our computer
  • 2:23 - 2:28
    algorithm. Then we have an algorithm which
    says: Do this, do that. If this, then
  • 2:29 - 2:29
    that. And in that way we generate some
    output. This is not how machine learning
  • 2:29 - 2:31
    works. Machine learning assumes you have
    some input, and you also have some output.
  • 2:41 - 2:46
    And what you also have is some statistical
    model. This statistical model is flexible.
  • 2:46 - 2:52
    It has certain parameters, which it can
    learn from the distribution of inputs and
  • 2:52 - 2:57
    outputs you give it for training. So you
    basically teach the statistical model to
  • 2:57 - 3:04
    generate the desired output from the given
    input. Let me give you a really simple
  • 3:04 - 3:10
    example of how this might work. Let's say
    we have two animals. Well, we have two
  • 3:10 - 3:16
    kinds of animals: unicorns and rabbits.
    And now we want to find an algorithm that
  • 3:16 - 3:24
    tells us whether this animal we have right
    now as an input is a rabbit or a unicorn.
  • 3:24 - 3:28
    We can write a simple algorithm to do
    that, but we can also do it with machine
  • 3:28 - 3:35
    learning. The first thing we need is some
    input. I choose two features that are able
  • 3:35 - 3:42
    to tell me whether this animal is a rabbit
    or a unicorn. Namely, speed and size. We
  • 3:42 - 3:47
    call these features, and they describe
    something about what we want to classify.
  • 3:47 - 3:52
    And the class is in this case our animal.
    First thing I need is some training data,
  • 3:52 - 3:59
    some input. The input here are just pairs
    of speed and size. What I also need is
  • 3:59 - 4:04
    information about the desired output. The
    desired output, of course, being the
  • 4:04 - 4:13
    class. So either unicorn or rabbit, here
    denoted by yellow and red X's. So let's
  • 4:13 - 4:18
    try to find a statistical model which we
    can use to separate this feature space
  • 4:18 - 4:24
    into two halves: One for the rabbits, one
    for the unicorns. Looking at this, we can
  • 4:24 - 4:29
    actually find a really simple statistical
    model, and our statistical model in this
  • 4:29 - 4:34
    case is just a straight line. And the
    learning process is then to find where in
  • 4:34 - 4:41
    this feature space the line should be.
    Ideally, for example, here. Right in the
  • 4:41 - 4:45
    middle between the two classes rabbit and
    unicorn. Of course this is an overly
  • 4:45 - 4:50
    simplified example. Real-world
    applications have feature distributions
  • 4:50 - 4:56
    which look much more like this. So, we
    have a gradient, we don't have a perfect
  • 4:56 - 5:00
    separation between those two classes, and
    those two classes are definitely not
  • 5:00 - 5:06
    separable by a line. If we look again at
    some training samples — training samples
  • 5:06 - 5:12
    are the data points we use for the machine
    learning process, so, to try to find the
  • 5:12 - 5:18
    parameters of our statistical model — if
    we look at the line again, then this will
  • 5:18 - 5:23
    not be able to separate this training set.
    Well, we will have a line that has some
  • 5:23 - 5:27
    errors, some unicorns which will be
    classified as rabbits, some rabbits which
  • 5:27 - 5:33
    will be classified as unicorns. This is
    what we call underfitting. Our model is
  • 5:33 - 5:40
    just not able to express what we want it
    to learn. There is the opposite case. The
  • 5:40 - 5:46
    opposite case being: we just learn all the
    training samples by heart. This is if we
  • 5:46 - 5:50
    have a very complex model and just a few
    training samples to teach the model what
  • 5:50 - 5:55
    it should learn. In this case we have a
    perfect separation of unicorns and
  • 5:55 - 6:01
    rabbits, at least for the few data points
    we have. If we draw another example from
  • 6:01 - 6:07
    the real world, some other data points,
    they will most likely be wrong. And this
  • 6:07 - 6:11
    is what we call overfitting. The perfect
    scenario in this case would be something
  • 6:11 - 6:17
    like this: a classifier which is really
    close to the distribution we have in the
  • 6:17 - 6:23
    real world and machine learning is tasked
    with finding this perfect model and its
  • 6:23 - 6:29
    parameters. Let me show you a different
    kind of model, something you probably all
  • 6:29 - 6:36
    have heard about: Neural networks. Neural
    networks are inspired by the brain.
  • 6:36 - 6:41
    Or more precisely, by the neurons in our
    brain. Neurons are tiny objects, tiny
  • 6:41 - 6:47
    cells in our brain that take some input
    and generate some output. Sounds familiar,
  • 6:47 - 6:53
    right? We have inputs usually in the form
    of electrical signals. And if they are
  • 6:53 - 6:58
    strong enough, this neuron will also send
    out an electrical signal. And this is
  • 6:58 - 7:03
    something we can model in a computer-
    engineering way. So, what we do is: We
  • 7:03 - 7:09
    take a neuron. The neuron is just a simple
    mapping from input to output. Input here,
  • 7:09 - 7:17
    just three input nodes. We denote them by
    i1, i2 and i3 and output denoted by o. And
  • 7:17 - 7:21
    now you will actually see some
    mathematical equations. There are not many
  • 7:21 - 7:27
    of these in this foundation talk, don't
    worry, and it's really simple. There's one
  • 7:27 - 7:30
    more thing we need first, though, if we
    want to map input to output in the way a
  • 7:30 - 7:35
    neuron does. Namely, the weights. The
    weights are just some arbitrary numbers
  • 7:35 - 7:43
    for now. Let's call them w1, w2 and w3.
    So, we take those weights and we multiply
  • 7:43 - 7:51
    them with the input. Input1 times weight1,
    input2 times weight2, and so on. And this,
  • 7:51 - 7:58
    this sum will just be our output. Well,
    not quite. We make it a little bit more
  • 7:58 - 8:02
    complicated. We also use something called
    an activation function. The activation
  • 8:02 - 8:09
    function is just a mapping from one scalar
    value to another scalar value. In this
  • 8:09 - 8:14
    case from what we got as an output, the
    sum, to something that more closely fits
  • 8:14 - 8:19
    what we need. This could for example be
    something binary, where we have all the
  • 8:19 - 8:24
    negative numbers being mapped to zero and
    all the positive numbers being mapped to
  • 8:24 - 8:31
    one. And then this zero and one can encode
    something. For example: rabbit or unicorn.
  • 8:31 - 8:35
    So, let me give you an example of how we
    can make the previous example with the
  • 8:35 - 8:42
    rabbits and unicorns work with such a
    simple neuron. We just use speed, size,
  • 8:42 - 8:50
    and the arbitrarily chosen number 10 as
    our inputs and the weights 1, 1, and -1.
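
    As a rough illustration, not part of the talk itself: this single
    neuron can be written in a few lines of plain Python. Mapping the
    output 1 to "unicorn" and 0 to "rabbit" is an assumption made only
    for this sketch.

    def step(x):
        # activation function: negative sums become 0, positive sums become 1
        return 1 if x > 0 else 0

    def classify(speed, size):
        inputs = [speed, size, 10]   # speed, size, and the arbitrarily chosen 10
        weights = [1, 1, -1]
        weighted_sum = sum(i * w for i, w in zip(inputs, weights))
        return "unicorn" if step(weighted_sum) == 1 else "rabbit"

    print(classify(speed=9, size=4))   # 9 + 4 > 10, so the neuron outputs 1
    print(classify(speed=2, size=3))   # 2 + 3 < 10, so the neuron outputs 0
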
  • 8:50 - 8:54
    If we look at the equations, then we get
    for our negative numbers — so, speed plus
  • 8:54 - 9:01
    size being less than 10 — a 0, and a 1 for
    all positive numbers — being speed plus
  • 9:01 - 9:08
    size larger than 10. This
    way we again have a separating line
  • 9:08 - 9:15
    between unicorns and rabbits. But again we
    have this really simplistic model. We want
  • 9:15 - 9:22
    to become more and more complicated in
    order to express more complex tasks. So
  • 9:22 - 9:26
    what do we do? We take more neurons. We
    take our three input values and put them
  • 9:26 - 9:32
    into one neuron, and into a second neuron,
    and into a third neuron. And we take the
  • 9:32 - 9:38
    output of those three neurons as input for
    another neuron. We also call this a
  • 9:38 - 9:42
    multilayer perceptron, perceptron just
    being a different name for a neuron, what
  • 9:42 - 9:49
    we have there. And the whole thing is also
    called a neural network. So now the
  • 9:49 - 9:53
    question: How do we train this? How do we
    learn what this network should encode?
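
    Before that question is answered, purely as an illustration: such a
    small multilayer perceptron could be set up in PyTorch (the framework
    recommended later in the Q&A) roughly like this; the layer sizes and
    the activation are arbitrary choices for the sketch, not the speaker's.

    import torch

    # three inputs -> three hidden neurons -> one output neuron
    model = torch.nn.Sequential(
        torch.nn.Linear(3, 3),   # every hidden neuron sees all three inputs
        torch.nn.Sigmoid(),      # activation function
        torch.nn.Linear(3, 1),   # one neuron combines the three hidden outputs
        torch.nn.Sigmoid(),
    )

    example = torch.tensor([[4.0, 9.0, 10.0]])   # e.g. speed, size, constant
    print(model(example))   # untrained, so the output is not meaningful yet
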
  • 9:53 - 9:58
    Well, we want a mapping from input to
    output, and what we can change are the
  • 9:58 - 10:03
    weights. First, what we do is we take a
    training sample, some input. Put it
  • 10:03 - 10:07
    through the network, get an output. But
    this might not be the desired output which
  • 10:07 - 10:14
    we know. So, in the binary case there are
    four possible cases: computed output,
  • 10:14 - 10:20
    expected output, each two values, 0 and 1.
    The best case would be: we want a 0, get a
  • 10:20 - 10:27
    0, want a 1 and get a 1. But there are also
    the opposite case. In these two cases we
  • 10:27 - 10:31
    can learn something about our model.
    Namely, in which direction to change the
  • 10:31 - 10:37
    weights. It's a little bit simplified, but
    in principle you just raise the weights if
  • 10:37 - 10:41
    you need a higher number as output and you
    lower the weights if you need a lower
  • 10:41 - 10:47
    number as output. To tell you how much, we
    have two terms. First term being the
  • 10:47 - 10:53
    error, so in this case just the difference
    between desired and expected output – also
  • 10:53 - 10:57
    often called a loss function, especially
    in deep learning and more complex
  • 10:57 - 11:04
    applications. You also have a second term
    we call the learning rate, and the
  • 11:04 - 11:09
    learning rate is what tells us how quickly
    we should change the weights, how quickly
  • 11:09 - 11:15
    we should adapt the weights. Okay, this is
    how we learn a model. This is almost
  • 11:15 - 11:19
    everything you need to know. There are
    mathematical equations that tell you how
  • 11:19 - 11:24
    much to change based on the error and the
    learning rate. And this is the entire
  • 11:24 - 11:30
    learning process. Let's get back to the
    terminology. We have the input layer. We
  • 11:30 - 11:34
    have the output layer, which somehow
    encodes our output either in one value or
  • 11:34 - 11:40
    in several values if we have multiple
    classes. We also have
  • 11:40 - 11:46
    the hidden layers, which are actually what
    makes our model deep. What we can change,
  • 11:46 - 11:52
    what we can learn, is the are the weights,
    the parameters of this model. But what we
  • 11:52 - 11:55
    also need to keep in mind, is the number
    of layers, the number of neurons per
  • 11:55 - 12:00
    layer, the learning rate, and the
    activation function. These are called
  • 12:00 - 12:04
    hyperparameters, and they determine how
    complex our model is, how well it is
  • 12:04 - 12:10
    suited to solve the task at hand. I quite
    often spoke about solving tasks, so the
  • 12:10 - 12:15
    question is: What can we actually do with
    neural networks? Mostly classification
  • 12:15 - 12:20
    tasks, for example: Tell me, is this
    animal a rabbit or unicorn? Is this text
  • 12:20 - 12:25
    message spam or legitimate? Is this
    patient healthy or ill? Is this image a
  • 12:25 - 12:31
    picture of a cat or a dog? We already saw
    for the animal that we need something
  • 12:31 - 12:35
    called features, which somehow encodes
    information about what we want to
  • 12:35 - 12:40
    classify, something we can use as input
    for the neural network. Some kind of
  • 12:40 - 12:44
    number that is meaningful. So, for the
    animal it could be speed, size, or
  • 12:44 - 12:49
    something like color. Color, of course,
    being more complex again, because we have,
  • 12:49 - 12:56
    for example, RGB, so three values. And,
    text message being a more complex case
  • 12:56 - 13:00
    again, because we somehow need to encode
    the sender, and whether the sender is
  • 13:00 - 13:05
    legitimate. Same for the recipient, or the
    number of hyperlinks, or where the
  • 13:05 - 13:11
    hyperlinks refer to, or whether there
    are certain words present in the text. It
  • 13:11 - 13:17
    gets more and more complicated. Even more
    so for a patient. How do we encode medical
  • 13:17 - 13:22
    history in a proper way for the network to
    learn? I mean, temperature is simple. It's
  • 13:22 - 13:27
    a scalar value, we just have a number. But
    how do we encode whether certain symptoms
  • 13:27 - 13:33
    are present. And the image, which is
    actually what I work with everyday, is
  • 13:33 - 13:38
    again quite complex. We have values, we
    have numbers, but only pixel values, which
  • 13:38 - 13:43
    are difficult to
    use as input for a neural network. Why?
  • 13:43 - 13:48
    I'll show you. I'll actually show you with
    this picture, it's a very famous picture,
  • 13:48 - 13:54
    and everybody uses it in computer vision.
    They will tell you, it's because there is
  • 13:54 - 14:01
    a multitude of different characteristics
    in this image: shapes, edges, whatever you
  • 14:01 - 14:07
    desire. The truth is, it's a crop from the
    centrefold of the Playboy, and in earlier
  • 14:07 - 14:12
    years, the computer vision engineers were a
    mostly male audience. Anyway, let's take
  • 14:12 - 14:17
    five by five pixels. Let's assume this is
    a five-by-five-pixel, really small,
  • 14:17 - 14:22
    image. If we take those 25 pixels and use
    them as input for a neural network you
  • 14:22 - 14:27
    already see that we have many connections
    - many weights - which means a very
  • 14:27 - 14:33
    complex model. A complex model, of course, is
    prone to overfitting. But there are more
  • 14:33 - 14:39
    problems. First being, we have
    disconnected a
  • 14:39 - 14:44
    pixel from its neighbors. We can't encode
    information about the neighborhood
  • 14:44 - 14:48
    anymore, and that really sucks. If we just
    take the whole picture, and move it to the
  • 14:48 - 14:53
    left or to the right by just one pixel,
    the network will see something completely
  • 14:53 - 14:58
    different, even though to us it is exactly
    the same. But, we can solve that with some
  • 14:58 - 15:03
    very clever engineering, something we call
    a convolutional layer. It is again a
  • 15:03 - 15:09
    hidden layer in a neural network, but it
    does something special. It actually is a
  • 15:09 - 15:14
    very simple neuron again, just four input
    values - one output value. But the four
  • 15:14 - 15:20
    input values look at two by two pixels,
    and encode one output value. And then the
  • 15:20 - 15:24
    same network is shifted to the right, and
    encodes another pixel, and another pixel,
  • 15:24 - 15:30
    and the next row of pixels. And in this
    way creates another 2D image. We have
  • 15:30 - 15:35
    preserved information about the
    neighborhood, and we just have a very low
  • 15:35 - 15:42
    number of weights, not the huge number of
    parameters we saw earlier. We can use this
  • 15:42 - 15:50
    once, or twice, or several hundred times.
    And this is actually where we go deep.
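
    A rough sketch of that sliding two-by-two window, with made-up
    weights (plain Python/NumPy, not the speaker's code): the same four
    weights look at every two-by-two patch and produce one value per
    position, so the result is again a small 2D image.

    import numpy as np

    def conv2x2(image, weights):
        h, w = image.shape
        out = np.zeros((h - 1, w - 1))
        for y in range(h - 1):
            for x in range(w - 1):
                patch = image[y:y + 2, x:x + 2]
                out[y, x] = (patch * weights).sum()   # same shared weights everywhere
        return out

    image = np.arange(25, dtype=float).reshape(5, 5)   # the 5x5 toy image
    weights = np.array([[1.0, -1.0], [0.5, 0.0]])      # only four numbers to learn
    print(conv2x2(image, weights).shape)               # (4, 4) - still an image
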
  • 15:50 - 15:55
    Deep means: We have several layers, and
    having layers that don't need thousands or
  • 15:55 - 16:01
    millions of connections, but only a few.
    This is what allows us to go really deep.
  • 16:01 - 16:06
    And in this fashion we can encode an
    entire image in just a few meaningful
  • 16:06 - 16:11
    values. What these values look like, and
    what they encode, this is learned through
  • 16:11 - 16:18
    the learning process. And we can then, for
    example, use these few values as input for
  • 16:18 - 16:25
    a classification network.
    The fully connected network we saw earlier.
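
    Purely as an illustration, one way this pipeline might look in
    PyTorch; the architecture, channel counts and the two output classes
    are invented for the sketch, not taken from the talk.

    import torch

    encoder = torch.nn.Sequential(
        torch.nn.Conv2d(1, 8, kernel_size=3, padding=1),    # learn local filters
        torch.nn.ReLU(),
        torch.nn.MaxPool2d(2),                              # pooling: shrink, keep the strongest response
        torch.nn.Conv2d(8, 16, kernel_size=3, padding=1),
        torch.nn.ReLU(),
        torch.nn.MaxPool2d(2),
        torch.nn.Flatten(),                                 # a few meaningful values
    )
    classifier = torch.nn.Linear(16 * 7 * 7, 2)             # fully connected head, e.g. cat vs. dog

    image = torch.randn(1, 1, 28, 28)                       # stand-in for a 28x28 input image
    print(classifier(encoder(image)))                       # two class scores
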
  • 16:25 - 16:30
    Or we can do something more clever. We can
    do the inverse operation and create an image
  • 16:30 - 16:35
    again, for example, the same image, which
    is then called an autoencoder.
  • 16:35 - 16:40
    Autoencoders are tremendously useful, even
    though they don't appear that way. For
  • 16:40 - 16:44
    example, imagine you want to check whether
    something has a defect, or not, a picture
  • 16:44 - 16:51
    of a fabric, or of something. You just
    train the network with normal pictures.
  • 16:51 - 16:57
    And then, if you have a defect picture,
    the network is not able to produce this
  • 16:57 - 17:02
    defect. And so the difference of the
    reproduced picture, and the real picture
  • 17:02 - 17:07
    will show you where errors are. If it
    works properly, I'll have to admit that.
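
    The defect idea as a short, hedged sketch; the trained autoencoder
    itself is assumed to exist already.

    import torch

    def defect_map(autoencoder, image):
        # autoencoder: a model trained only on defect-free pictures
        # image: tensor of shape (1, channels, height, width)
        with torch.no_grad():
            reconstruction = autoencoder(image)
        return (image - reconstruction).abs()   # large differences hint at defects
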
  • 17:07 - 17:13
    But we can go even further. Let's say, we
    want to encode something entirely else.
  • 17:13 - 17:17
    Well, let's encode the image, the
    information in the image, but in another
  • 17:17 - 17:22
    representation. For example, let's say we
    have three classes again. The background
  • 17:22 - 17:30
    class in grey, a class called hat or
    headwear in blue, and person in green. We
  • 17:30 - 17:34
    can also use this for other applications
    than just for pictures of humans. For
  • 17:34 - 17:38
    example, we have a picture of a street and
    want to encode: Where is the car, where's
  • 17:38 - 17:45
    the pedestrian? Tremendously useful. Or we
    have an MRI scan of a brain: Where in the
  • 17:45 - 17:51
    brain is the tumor? Can we somehow learn
    this? Yes we can do this, with methods
  • 17:51 - 17:57
    like these, if they are trained properly.
    More about that later. Well we expect
  • 17:57 - 18:01
    something like this to come out but the
    truth looks rather like this – especially
  • 18:01 - 18:06
    if it's not properly trained. We don't have
    the real shape we want to get but
  • 18:06 - 18:12
    something distorted. So here is again
    where we need to do learning. First we
  • 18:12 - 18:16
    take a picture, put it through the
    network, get our output representation.
  • 18:16 - 18:21
    And we have the information about how we
    want it to look. We again compute some
  • 18:21 - 18:27
    kind of loss value. This time for example
    being the overlap between the shape we get
  • 18:27 - 18:34
    out of the model and the shape we want to
    have. And we use this error, this lost
  • 18:34 - 18:39
    function, to update the weights of our
    network. Again – even though it's more
  • 18:39 - 18:44
    complicated here, even though we have more
    layers, and even though the layers look
  • 18:44 - 18:49
    slightly different – it is the same
    process all over again as with a binary
  • 18:49 - 18:57
    case. And we need lots of training data.
    This is something that you'll hear often
  • 18:57 - 19:03
    in connection with deep learning: You need
    lots of training data to make this work.
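
    The forward / compare / update loop just described, as a hedged
    PyTorch sketch; the tiny model, the random stand-in data and the
    choice of loss are mine, not the speaker's setup.

    import torch

    model = torch.nn.Conv2d(1, 1, kernel_size=3, padding=1)    # toy stand-in for a segmentation network
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # lr is the learning rate
    loss_fn = torch.nn.BCEWithLogitsLoss()                     # one possible loss function

    training_data = [(torch.randn(1, 1, 32, 32),                        # image
                      torch.randint(0, 2, (1, 1, 32, 32)).float())      # desired mask
                     for _ in range(8)]

    for image, wanted_mask in training_data:
        predicted_mask = model(image)                # forward pass
        loss = loss_fn(predicted_mask, wanted_mask)  # how far off are we?
        optimizer.zero_grad()
        loss.backward()                              # in which direction to change each weight
        optimizer.step()                             # change them a little, scaled by the learning rate
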
  • 19:03 - 19:10
    Images are complex things and in order to
    meaningfully extract knowledge from them,
  • 19:10 - 19:17
    the network needs to see a multitude of
    different images. Well now I already
  • 19:17 - 19:22
    showed you some things we use in network
    architecture, some support networks: The
  • 19:22 - 19:27
    fully convolutional encoder, which takes
    an image and produces a few meaningful
  • 19:27 - 19:33
    values out of this image; its counterpart
    the fully convolutional decoder – fully
  • 19:33 - 19:37
    convolutional meaning by the way that we
    only have these convolutional layers with
  • 19:37 - 19:43
    a few parameters that somehow encode
    spatial information and keep it for the
  • 19:43 - 19:49
    next layers. The decoder takes a few
    meaningful numbers and reproduces an image
  • 19:49 - 19:55
    – either the same image or another
    representation of the information encoded
  • 19:55 - 20:01
    in the image. We also already saw the
    fully connected network. Fully connected
  • 20:01 - 20:07
    meaning every neuron is connected to every
    neuron in the next layer. This of course
  • 20:07 - 20:13
    can be dangerous because this is where we
    actually get most of our parameters. If we
  • 20:13 - 20:16
    have a fully connected network, this is
    where the most parameters will be present
  • 20:16 - 20:22
    because connecting every node to every
    node … this is just a high number of
  • 20:22 - 20:26
    connections. We can also do other things.
    For example something called a pooling
  • 20:26 - 20:32
    layer. A pooling layer being basically the
    same as one of those convolutional layers,
  • 20:32 - 20:36
    just that we don't have parameters we need
    to learn. This works without parameters
  • 20:36 - 20:44
    because this neuron just chooses whichever
    value is the highest and takes that value
  • 20:44 - 20:50
    as output. This is really great for
    reducing the size of your image and also
  • 20:50 - 20:55
    getting rid of information that might not
    be that important. We can also do some
  • 20:55 - 21:00
    clever techniques like adding a dropout
    layer. A dropout layer just being a normal
  • 21:00 - 21:06
    layer in a neural network where we remove
    some connections: In one training step
  • 21:06 - 21:11
    these connections, in the next training
    step some other connections. This way we
  • 21:11 - 21:18
    teach the other connections to become more
    resilient against errors. I would like to
  • 21:18 - 21:23
    start with something I call the "Model
    Show" now, and show you some models and
  • 21:23 - 21:29
    how we train those models. And I will
    start with a fully convolutional decoder
  • 21:29 - 21:35
    we saw earlier: This thing that takes a
    number and creates a picture. I would like
  • 21:35 - 21:41
    to take this model, put in some number and
    get out a picture – a picture of a horse
  • 21:41 - 21:46
    for example. If I put in a different
    number I also want to get a picture of a
  • 21:46 - 21:52
    horse, but of a different horse. So what I
    want to get is a mapping from some
  • 21:52 - 21:57
    numbers, some features that encode
    something about the horse picture, and get
  • 21:57 - 22:03
    a horse picture out of it. You might see
    already why this is problematic. It is
  • 22:03 - 22:08
    problematic because we don't have a
    mapping from feature to horse or from
  • 22:08 - 22:15
    horse to features. So we don't have a
    truth value we can use to learn how to
  • 22:15 - 22:22
    generate this mapping. Well computer
    vision engineers – or deep learning
  • 22:22 - 22:27
    professionals – they're smart and have
    clever ideas. Let's just assume we have
  • 22:27 - 22:33
    such a network and let's call it a
    generator. Let's take some numbers, put
  • 22:33 - 22:39
    them into the generator and get some
    horses. Well it doesn't work yet. We still
  • 22:39 - 22:42
    have to train it. So they're probably not
    only horses but also some very special
  • 22:42 - 22:48
    unicorns among the horses; which might be
    nice for other applications, but I wanted
  • 22:48 - 22:55
    pictures of horses right now. So I can't
    train with this data directly. But what I
  • 22:55 - 23:02
    can do is I can create a second network.
    This network is called a discriminator and
  • 23:02 - 23:09
    I can give it the input generated from the
    generator as well as the real data I have:
  • 23:09 - 23:14
    the real horse pictures. And then I can
    teach the discriminator to distinguish
  • 23:14 - 23:22
    between those. Tell me it is a real horse
    or it's not a real horse. And there I know
  • 23:22 - 23:27
    what is the truth because I either take
    real horse pictures or fake horse pictures
  • 23:27 - 23:34
    from the generator. So I have a truth
    value for this discriminator. But in doing
  • 23:34 - 23:39
    this I also have a truth value for the
    generator. Because I want the generator to
  • 23:39 - 23:44
    work against the discriminator. So I can
    also use the information how well the
  • 23:44 - 23:51
    discriminator does to train the generator
    to become better in fooling. This is
  • 23:51 - 23:57
    called a generative adversarial network.
    And it can be used to generate pictures of
  • 23:57 - 24:02
    an arbitrary distribution. Let's do this
    with numbers and I will actually show you
  • 24:02 - 24:08
    the training process. Before I start the
    video, I'll tell you what I did. I took
  • 24:08 - 24:12
    some handwritten digits. There is a
    database called "??? of handwritten
  • 24:12 - 24:19
    digits" so the numbers of 0 to 9. And I
    took those and used them as training data.
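
    The generator/discriminator game he describes can be sketched
    roughly like this in PyTorch; the tiny fully connected networks and
    the random stand-in batches are mine, a real setup uses actual digit
    images and larger architectures.

    import torch
    from torch import nn

    generator = nn.Sequential(nn.Linear(100, 256), nn.ReLU(),
                              nn.Linear(256, 28 * 28), nn.Tanh())
    discriminator = nn.Sequential(nn.Linear(28 * 28, 256), nn.ReLU(),
                                  nn.Linear(256, 1))

    bce = nn.BCEWithLogitsLoss()
    opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

    digit_batches = [torch.randn(64, 28 * 28) for _ in range(10)]   # stand-in for real digit batches

    for real_images in digit_batches:
        noise = torch.randn(real_images.size(0), 100)
        fake_images = generator(noise)
        ones = torch.ones(real_images.size(0), 1)
        zeros = torch.zeros(real_images.size(0), 1)

        # 1) teach the discriminator: real -> 1, fake -> 0
        d_loss = bce(discriminator(real_images), ones) + \
                 bce(discriminator(fake_images.detach()), zeros)
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()

        # 2) teach the generator to fool the discriminator: fake -> "looks real"
        g_loss = bce(discriminator(fake_images), ones)
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()
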
  • 24:19 - 24:24
    I trained a generator in the way I showed
    you on the previous slide, and then I just
  • 24:24 - 24:30
    took some random numbers. I put those
    random numbers into the network and just
  • 24:30 - 24:36
    stored the image of what came out of the
    network. And here in the video you'll see
  • 24:36 - 24:43
    how the network improved with ongoing
    training. You will see that we start
  • 24:43 - 24:50
    basically with just noisy images … and
    then after some – what we call epochs,
  • 24:50 - 24:56
    so training iterations – the network is
    able to almost perfectly generate
  • 24:56 - 25:06
    handwritten digits just from noise. Which
    I find truly fascinating. Of course this
  • 25:06 - 25:11
    is an example where it works. It highly
    depends on your data set and how you train
  • 25:11 - 25:16
    the model whether it is a success or not.
    But if it works, you can use it to
  • 25:16 - 25:23
    generate fonts. You can generate
    characters, 3D objects, pictures of
  • 25:23 - 25:29
    animals, whatever you want as long as you
    have training data. Let's go more crazy.
  • 25:29 - 25:35
    Let's take two of those and let's say we
    have pictures of horses and pictures of
  • 25:35 - 25:41
    zebras. I want to convert those pictures
    of horses into pictures of zebras, and I
  • 25:41 - 25:45
    want to convert pictures of zebras into
    pictures of horses. So I want to have the
  • 25:45 - 25:50
    same picture just with the other animal.
    But I don't have training data of the same
  • 25:50 - 25:56
    situation just once with a horse and once
    with a zebra. Doesn't matter. We can train
  • 25:56 - 26:01
    a network that does that for us. Again we
    just have a network – we call it the
  • 26:01 - 26:06
    generator – and we have two of those: One
    that converts horses to zebras and one
  • 26:06 - 26:15
    that converts zebras to horses. And then
    we also have two discriminators that tell
  • 26:15 - 26:21
    us: real horse – fake horse – real zebra –
    fake zebra. And then we again need to
  • 26:21 - 26:27
    perform some training. So we need to
    somehow encode: Did it work what we wanted
  • 26:27 - 26:31
    to do? And a very simple way to do this is
    we take a picture of a horse, put it
  • 26:31 - 26:35
    through the generator that generates a
    zebra. Take this fake picture of a zebra,
  • 26:35 - 26:39
    put it through the generator that
    generates a picture of a horse. And if
  • 26:39 - 26:44
    this is the same picture as we put in,
    then our model worked. And if it didn't,
  • 26:44 - 26:49
    we can use that information to update the
    weights. I just took a random picture,
  • 26:49 - 26:54
    from a free library on the Internet, of a
    horse and generated a zebra and it worked
  • 26:54 - 26:59
    remarkably well. I actually didn't even do
    training. It also doesn't need to be a
  • 26:59 - 27:03
    picture. You can also convert text to
    images: You describe something in words
  • 27:03 - 27:10
    and generate images. You can age your face
    or age a cell; or make a patient healthy
  • 27:10 - 27:16
    or sick – or the image of a patient, not
    the patient themselves, unfortunately. You can
  • 27:16 - 27:21
    do style transfer like take a picture of
    Van Gogh and apply it to your own picture.
  • 27:21 - 27:28
    Stuff like that. Something else that we
    can do with neural networks. Let's assume
  • 27:28 - 27:31
    we have a classification network, we have
    a picture of a toothbrush and the network
  • 27:31 - 27:37
    tells us: Well, this is a toothbrush.
    Great! But how resilient is this network?
  • 27:37 - 27:45
    Does it really work in every scenario?
    There's a second network we can apply: We
  • 27:45 - 27:49
    call it an adversarial network. And that
    network is trained to do one thing: Look
  • 27:49 - 27:52
    at the network, look at the picture, and
    then find the one weak spot in the
  • 27:52 - 27:56
    picture: Just change one pixel slightly so
    that the network will tell me this
  • 27:56 - 28:04
    toothbrush is an octopus. Works remarkably
    well. Also works with just changing the
  • 28:04 - 28:09
    picture slightly, so changing all the
    pixels, but just slight minute changes
  • 28:09 - 28:13
    that we don't perceive, but the network –
    the classification network – is completely
  • 28:13 - 28:20
    thrown off. Well sounds bad. Is bad if you
    don't consider it. But you can also for
  • 28:20 - 28:24
    example use this for training your network
    and make your network resilient. So
  • 28:24 - 28:28
    there's always an upside and downside.
    Something entirely else: Now I'd like to
  • 28:28 - 28:33
    show you something about text. A word-
    language model. I want to generate
  • 28:33 - 28:38
    sentences for my podcast. I have a network
    that gives me a word, and then if I want
  • 28:38 - 28:43
    to somehow get the next word in the
    sentence, I also need to consider this
  • 28:43 - 28:47
    word. So another network architecture –
    quite interestingly – just takes the
  • 28:47 - 28:52
    hidden states of the network and uses them
    as the input for the same network so that
  • 28:52 - 28:59
    in the next iteration we still know what
    we did in the previous step. I tried to
  • 28:59 - 29:05
    train a network that generates podcast
    episodes for my podcasts. Didn't work.
  • 29:05 - 29:08
    What I learned is I don't have enough
    training data. I really need to produce
  • 29:08 - 29:16
    more podcast episodes in order to train a
    model to do my job for me. And this is
  • 29:16 - 29:22
    very important, a very crucial point:
    Training data. We need shitloads of
  • 29:22 - 29:26
    training data. And actually the more
    complicated our model and our training
  • 29:26 - 29:31
    process becomes, the more training data we
    need. I started with a supervised case –
  • 29:31 - 29:36
    the really simple case, where we
  • 29:36 - 29:41
    have a picture and a label that
    corresponds to that picture; or a
  • 29:41 - 29:46
    representation of that picture showing
    entirely what I wanted to learn. But we
  • 29:46 - 29:52
    also saw a more complex task, where I had
    pictures – horses and zebras – that are
  • 29:52 - 29:56
    from two different domains – but domains
    with no direct mapping. What can also
  • 29:56 - 30:01
    happen – and actually happens quite a lot
    – is weakly annotated data, so data that
  • 30:01 - 30:09
    is not precisely annotated; where we can't
    rely on the information we get. Or even
  • 30:09 - 30:13
    more complicated: Something called
    reinforcement learning where we perform a
  • 30:13 - 30:19
    sequence of actions and then in the end
    are told "yeah that was great". Which is
  • 30:19 - 30:24
    often not enough information to really
    perform proper training. But of course
  • 30:24 - 30:28
    there are also methods for that. As well
    as there are methods for the unsupervised
  • 30:28 - 30:34
    case where we don't have annotations,
    labeled data – no ground truth at all –
  • 30:34 - 30:41
    just the picture itself. Well I talked
    about pictures. I told you that we can
  • 30:41 - 30:45
    learn features and create images from
    them. And we can use them for
  • 30:45 - 30:52
    classification. And for this there exist
    many databases. There are public data sets
  • 30:52 - 30:57
    we can use. Often they refer to for
    example Flickr. They're just hyperlinks
  • 30:57 - 31:01
    which is also why I didn't show you many
    pictures right here, because I am honestly
  • 31:01 - 31:06
    not sure about the copyright in those
    cases. But there are also challenge
  • 31:06 - 31:11
    datasets where you can just sign up, get
    some for example medical data sets, and
  • 31:11 - 31:17
    then compete against other researchers.
    And of course there are those companies
  • 31:17 - 31:22
    that just have lots of data. And those
    companies also have the means, the
  • 31:22 - 31:28
    capacity to perform intense computations.
    And those are also often the companies you
  • 31:28 - 31:36
    hear from in terms of innovation for deep
    learning. Well this was mostly to tell you
  • 31:36 - 31:40
    that you can process images quite well
    with deep learning if you have enough
  • 31:40 - 31:46
    training data, if you have a proper
    training process and also a little if you
  • 31:46 - 31:52
    know what you're doing. But you can also
    process text, you can process audio and
  • 31:52 - 31:59
    time series like prices or a stock
    exchange – stuff like that. You can
  • 31:59 - 32:03
    process almost everything if you make it
    encodeable to your network. Sounds like a
  • 32:03 - 32:08
    dream come true. But – as I already told
    you – you need data, a lot of it. I told
  • 32:08 - 32:14
    you about those companies that have lots
    of data sets and the publicly available
  • 32:14 - 32:21
    data sets which you can actually use to
    get started with your own experiments. But
  • 32:21 - 32:24
    that also makes it a little dangerous
    because deep learning still is a black box
  • 32:24 - 32:31
    to us. I told you what happens inside the
    black box on a level that teaches you how
  • 32:31 - 32:37
    we learn and how the network is
    structured, but not really what the
  • 32:37 - 32:43
    network learned. It is for us computer
    vision engineers really nice that we can
  • 32:43 - 32:49
    visualize the first layers of a neural
    network and see what is actually encoded
  • 32:49 - 32:54
    in those first layers; what information
    the network looks at. But you can't really
  • 32:54 - 32:59
    mathematically prove what happens in a
    network. Which is one major downside. And
  • 32:59 - 33:02
    so if you want to use it, the numbers may
    be really great but be sure to properly
  • 33:02 - 33:08
    evaluate them. In summary I call that
    "easy to learn". Every one – every single
  • 33:08 - 33:13
    one of you – can just start with deep
    learning right away. You don't need to do
  • 33:13 - 33:19
    much work. You don't need to do much
    learning. The model learns for you. But
  • 33:19 - 33:24
    they're hard to master in a way that makes
    them useful for production use cases for
  • 33:24 - 33:30
    example. So if you want to use deep
    learning for something – if you really
  • 33:30 - 33:34
    want to seriously use it –, make sure that
    it really does what you want it to and
  • 33:34 - 33:39
    doesn't learn something else – which also
    happens. Pretty sure you saw some talks
  • 33:39 - 33:44
    about deep learning fails – which is not
    what this talk is about. They're quite
  • 33:44 - 33:47
    funny to look at. Just make sure that they
    don't happen to you! If you do that
  • 33:47 - 33:53
    though, you'll achieve great things with
    deep learning, I'm sure. And that was
  • 33:53 - 34:01
    introduction to deep learning. Thank you!
    Applause
  • 34:09 - 34:13
    Herald Angel: So now it's question and
    answer time. So if you have a question,
  • 34:13 - 34:19
    please line up at the mikes. We have in
    total eight, so it shouldn't be far from
  • 34:19 - 34:26
    you. They are here in the corridors and on
    these sides. Please line up! For
  • 34:26 - 34:32
    everybody: A question consists of one
    sentence with a question mark at the end
  • 34:32 - 34:38
    – not three minutes of rambling. And also
    if you go to the microphone, speak into
  • 34:38 - 34:54
    the microphone, so you really get close to
    it. Okay. Where do we have … Number 7!
  • 34:54 - 35:02
    We start with mic number 7:
    Question: Hello. My question is: How did
  • 35:02 - 35:13
    you compute the example for the fonts, the
    numbers? I didn't really understand it,
  • 35:13 - 35:20
    you just said it was made from white
    noise.
  • 35:20 - 35:26
    Teubi: I'll give you a really brief recap
    of what I did. I showed you that we have a
  • 35:26 - 35:31
    model that maps image to some meaningful
    values, that an image can be encoded in
  • 35:31 - 35:37
    just a few values. What happens here is
    exactly the other way round. We have some
  • 35:37 - 35:43
    values, just some arbitrary values we
    actually know nothing about. We can
  • 35:43 - 35:47
    generate pictures out of those. So I
    trained this model to just take some
  • 35:47 - 35:55
    random values and show the pictures
    generated from the model. The training
  • 35:55 - 36:03
    process was this "min max game", as its
    called. We have two networks that try to
  • 36:03 - 36:08
    compete against each other. One network
    trying to distinguish, whether a picture
  • 36:08 - 36:13
    it sees is real or one of those fake
    pictures, and the network that actually
  • 36:13 - 36:19
    generates those pictures and in training
    the network that is able to distinguish
  • 36:19 - 36:25
    between those, we can also get information
    for the training of the network that
  • 36:25 - 36:30
    generates the pictures. So the videos you
    saw were just animations of what happens
  • 36:30 - 36:36
    during this training process. At first if
    we input noise we get noise. But as the
  • 36:36 - 36:42
    network is able to better and better
    recreate those images from the dataset we
  • 36:42 - 36:47
    used as input, in this case pictures of
    handwritten digits, the output also became
  • 36:47 - 36:55
    more and more similar to those numbers, these
    handwritten digits. Hope that helped.
  • 36:55 - 37:07
    Herald Angel: Now we go to the
    Internet. – Can we get sound for the signal
  • 37:07 - 37:10
    Angel, please? Teubi: Sounded so great,
    "now we go to the Internet."
  • 37:10 - 37:11
    Herald Angel: Yeah, that sounds like
    "yeeaah".
  • 37:11 - 37:13
    Signal Angel: And now we're finally ready
    to go to the interwebs. "Schorsch" is
  • 37:13 - 37:18
    asking: Do you have any recommendations
    for a beginner regarding the framework or
  • 37:18 - 37:26
    the software?
    Teubi: I, of course, am very biased to
  • 37:26 - 37:34
    recommend what I use everyday. But I also
    think that it is a great start. Basically,
  • 37:34 - 37:40
    use Python and use PyTorch. Many people
    will disagree with me and tell you
  • 37:40 - 37:46
    "tensorflow is better." It might be, in my
    opinion not for getting started, and there
  • 37:46 - 37:52
    are also some nice tutorials on the
    pytorch website. What you can also do is
  • 37:52 - 37:57
    look at websites like OpenAI, where they
    have a gym to get you started with some
  • 37:57 - 38:02
    training exercises, where you already have
    datasets. Yeah, basically my
  • 38:02 - 38:09
    recommendation is get used to Python and
    start with a PyTorch tutorial, see where
  • 38:09 - 38:14
    to go from there. Often there are also some
    GitHub repositories linked with many
  • 38:14 - 38:19
    examples for already established network
    architectures like the cycle GAN or the
  • 38:19 - 38:26
    GAN itself or basically everything else.
    There will be a repo you can use to get
  • 38:26 - 38:30
    started.
    Herald Angel: OK, we stay with the
  • 38:30 - 38:33
    internet. There's some more questions, I
    heard.
  • 38:33 - 38:38
    Signal Angel: Yes. Rubin8 is asking: Have
    you ever come across an example
  • 38:38 - 38:43
    of a neural network that deals with audio
    instead of images?
  • 38:43 - 38:49
    Teubi: Me personally, no. At least not
    directly. I've heard about examples, like
  • 38:49 - 38:55
    where you can change the voice to sound
    like another person, but there is not much
  • 38:55 - 39:00
    I can reliably tell about that. My
    expertise really is in image processing,
  • 39:00 - 39:06
    I'm sorry.
    Herald Angel: And I think we have time for
  • 39:06 - 39:12
    one more question. We have one at number
    8. Microphone number 8.
  • 39:12 - 39:21
    Question: Is the current face recognition
    technology in, for example, the iPhone X, is
  • 39:21 - 39:26
    it also a deep learning algorithm or is
    it something simpler? Do you have any
  • 39:26 - 39:32
    idea about that?
    Teubi: As far as I know, yes. That's all I
  • 39:32 - 39:39
    can reliably tell you about that, but it
    is not only based on images but also uses
  • 39:39 - 39:45
    other information. I think distance
    information encoded with some infrared
  • 39:45 - 39:51
    signals. I don't really know exactly how
    it works, but at least iPhones already
  • 39:51 - 39:56
    have a neural network
    processing engine built in, so a chip
  • 39:56 - 40:01
    dedicated to just doing those
    computations. You saw that many of those
  • 40:01 - 40:06
    things can be parallelized, and this is
    what those hardware architectures make use
  • 40:06 - 40:10
    of. So I'm pretty confident in saying,
    yes, they also do it there.
  • 40:10 - 40:13
    How exactly, no clue.
  • 40:14 - 40:15

    Herald Angel: OK. I myself have a last
  • 40:15 - 40:21
    completely unrelated question: Did you
    create the design of the slides yourself?
  • 40:21 - 40:29
    Teubi: I had some help. We have a really
    great Congress design and I use that as an
  • 40:29 - 40:33
    inspiration to create those slides, yes.
  • 40:33 - 40:37
    Herald Angel: OK, yeah, because those are really amazing. I love them.
  • 40:37 - 40:38
    Teubi: Thank you!
  • 40:38 - 40:41
    Herald Angel: OK, thank you very much
    Teubi.
  • 40:45 - 40:49
    35C3 outro music
  • 40:49 - 41:07
    subtitles created by c3subtitles.de
    in the year 2019. Join, and help us!