WEBVTT 00:00:00.000 --> 00:00:18.229 35C3 preroll music 00:00:18.229 --> 00:00:24.750 Herald Angel: Welcome to our introduction to deep learning with Teubi. Deep 00:00:24.750 --> 00:00:30.247 learning, also often called machine learning is a hype word which we hear in 00:00:30.247 --> 00:00:37.152 the media all the time. It's nearly as bad as blockchain. It's a solution for 00:00:37.152 --> 00:00:43.249 everything. Today we'll get a sneak peek into the internals of this mystical black 00:00:43.249 --> 00:00:48.820 box, they are talking about. And Teubi will show us why people, who know what 00:00:48.820 --> 00:00:53.040 machine learning really is about, have to facepalm so often, when they read the 00:00:53.040 --> 00:00:58.715 news. So please welcome Teubi with a big round of applause! 00:00:58.715 --> 00:01:10.245 Applause Teubi: Alright! Good morning and welcome 00:01:10.245 --> 00:01:14.470 to Introduction to Deep Learning. The title will already tell you what this talk 00:01:14.470 --> 00:01:19.920 is about. I want to give you an introduction onto how deep learning works, 00:01:19.920 --> 00:01:27.090 what happens inside this black box. But, first of all, who am I? I'm Teubi. It's a 00:01:27.090 --> 00:01:32.280 German nickname, it has nothing to do with toys or bees. You might have heard my 00:01:32.280 --> 00:01:36.480 voice before, because I host the Nussschale podcast. There I explain 00:01:36.480 --> 00:01:41.560 scientific topics in under 10 minutes. I'll have to use a little more time today, 00:01:41.560 --> 00:01:46.850 and you'll also have fancy animations which hopefully will help. In my day job 00:01:46.850 --> 00:01:52.540 I'm a research scientist at an institute for computer vision. I analyze microscopy 00:01:52.540 --> 00:01:58.240 images of bone marrow blood cells and try to find ways to teach the computer to 00:01:58.240 --> 00:02:04.660 understand what it sees. Namely, to differentiate between certain cells or, 00:02:04.660 --> 00:02:09.449 first of all, find cells in an image, which is a task that is more complex than 00:02:09.449 --> 00:02:17.180 it might sound like. Let me start with the introduction to deep learning. We all know 00:02:17.180 --> 00:02:22.769 how to code. We code in a very simple way. We have some input for all computer 00:02:22.769 --> 00:02:27.618 algorithm. Then we have an algorithm which says: Do this, do that. If this, then 00:02:28.510 --> 00:02:28.906 that. And in that way we generate some output. This is not how machine learning 00:02:29.495 --> 00:02:30.754 works. Machine learning assumes you have some input, and you also have some output. 00:02:40.810 --> 00:02:46.180 And what you also have is some statistical model. This statistical model is flexible. 00:02:46.180 --> 00:02:51.549 It has certain parameters, which it can learn from the distribution of inputs and 00:02:51.549 --> 00:02:57.430 outputs you give it for training. So you basically learn the statistical model to 00:02:57.430 --> 00:03:03.659 generate the desired output from the given input. Let me give you a really simple 00:03:03.659 --> 00:03:09.980 example of how this might work. Let's say we have two animals. Well, we have two 00:03:09.980 --> 00:03:15.689 kinds of animals: unicorns and rabbits. And now we want to find an algorithm that 00:03:15.689 --> 00:03:24.270 tells us whether this animal we have right now as an input is a rabbit or a unicorn. 
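To make that difference concrete, here is a minimal Python sketch contrasting a hand-coded rule with a tiny statistical model whose single parameter is learned from input/output pairs. The feature values, the hand-written rule and the midpoint recipe are illustrative assumptions, not taken from the talk.

# A hand-coded rule: the programmer fixes the decision logic up front.
def rule_based(speed, size):
    # assumption: unicorns are fast and large, rabbits are not
    return "unicorn" if speed > 30 and size > 1.0 else "rabbit"

# A tiny "statistical model": one threshold on speed + size,
# learned from training pairs (input features, desired output).
training = [
    ((8.0, 0.4), "rabbit"), ((12.0, 0.5), "rabbit"), ((10.0, 0.3), "rabbit"),
    ((45.0, 1.8), "unicorn"), ((50.0, 2.0), "unicorn"), ((40.0, 1.6), "unicorn"),
]

def learn_threshold(samples):
    # the learned parameter: midpoint between the class means of (speed + size)
    rabbit = [s + z for (s, z), label in samples if label == "rabbit"]
    unicorn = [s + z for (s, z), label in samples if label == "unicorn"]
    return (sum(rabbit) / len(rabbit) + sum(unicorn) / len(unicorn)) / 2

threshold = learn_threshold(training)

def learned_model(speed, size):
    return "unicorn" if speed + size > threshold else "rabbit"

print(threshold, learned_model(47.0, 1.9), learned_model(9.0, 0.4))

The only thing that changes between the two is where the decision boundary comes from: the programmer writes it down in the first case, the training data determines it in the second.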
00:03:24.270 --> 00:03:28.230 We can write a simple algorithm to do that, but we can also do it with machine 00:03:28.230 --> 00:03:34.590 learning. The first thing we need is some input. I choose two features that are able 00:03:34.590 --> 00:03:42.269 to tell me whether this animal is a rabbit or a unicorn. Namely, speed and size. We 00:03:42.269 --> 00:03:46.859 call these features, and they describe something about what we want to classify. 00:03:46.859 --> 00:03:52.409 And the class is in this case our animal. First thing I need is some training data, 00:03:52.409 --> 00:03:59.170 some input. The input here are just pairs of speed and size. What I also need is 00:03:59.170 --> 00:04:04.129 information about the desired output. The desired output, of course, being the 00:04:04.129 --> 00:04:12.999 class. So either unicorn or rabbit, here denoted by yellow and red X's. So let's 00:04:12.999 --> 00:04:18.298 try to find a statistical model which we can use to separate this feature space 00:04:18.298 --> 00:04:24.150 into two halves: One for the rabbits, one for the unicorns. Looking at this, we can 00:04:24.150 --> 00:04:28.660 actually find a really simple statistical model, and our statistical model in this 00:04:28.660 --> 00:04:34.390 case is just a straight line. And the learning process is then to find where in 00:04:34.390 --> 00:04:41.080 this feature space the line should be. Ideally, for example, here. Right in the 00:04:41.080 --> 00:04:45.220 middle between the two classes rabbit and unicorn. Of course this is an overly 00:04:45.220 --> 00:04:50.370 simplified example. Real-world applications have feature distributions 00:04:50.370 --> 00:04:56.080 which look much more like this. So, we have a gradient, we don't have a perfect 00:04:56.080 --> 00:05:00.130 separation between those two classes, and those two classes are definitely not 00:05:00.130 --> 00:05:05.560 separable by a line. If we look again at some training samples — training samples 00:05:05.560 --> 00:05:11.730 are the data points we use for the machine learning process, so, to try to find the 00:05:11.730 --> 00:05:17.540 parameters of our statistical model — if we look at the line again, then this will 00:05:17.540 --> 00:05:23.000 not be able to separate this training set. Well, we will have a line that has some 00:05:23.000 --> 00:05:27.320 errors, some unicorns which will be classified as rabbits, some rabbits which 00:05:27.320 --> 00:05:33.070 will be classified as unicorns. This is what we call underfitting. Our model is 00:05:33.070 --> 00:05:40.150 just not able to express what we want it to learn. There is the opposite case. The 00:05:40.150 --> 00:05:45.510 opposite case being: we just learn all the training samples by heart. This is if we 00:05:45.510 --> 00:05:50.020 have a very complex model and just a few training samples to teach the model what 00:05:50.020 --> 00:05:55.120 it should learn. In this case we have a perfect separation of unicorns and 00:05:55.120 --> 00:06:00.700 rabbits, at least for the few data points we have. If we draw another example from 00:06:00.700 --> 00:06:07.300 the real world,some other data points, they will most likely be wrong. And this 00:06:07.300 --> 00:06:11.380 is what we call overfitting. 
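The underfitting/overfitting contrast can be reproduced on toy data. The sketch below compares a fixed straight line with a model that simply memorizes the training set; the data generation, the noise level and the use of a 1-nearest-neighbour lookup as a stand-in for "learning the training samples by heart" are all assumptions chosen for illustration.

import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    # noisy toy data: the label mostly follows the sum of the two features
    X = rng.normal(0.0, 1.0, size=(n, 2))
    y = (X[:, 0] + X[:, 1] + rng.normal(0.0, 1.5, size=n) > 0).astype(int)
    return X, y

X_train, y_train = sample(40)
X_test, y_test = sample(400)

def line(X):
    # very simple model: a fixed straight line through the feature space
    return (X[:, 0] + X[:, 1] > 0).astype(int)

def memorize(X):
    # "learn the training samples by heart": copy the label of the
    # nearest training point (1-nearest-neighbour lookup)
    d = ((X[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    return y_train[d.argmin(axis=1)]

for name, model in [("line", line), ("memorize", memorize)]:
    print(name,
          "train acc:", (model(X_train) == y_train).mean(),
          "test acc:", (model(X_test) == y_test).mean())

The memorizing model is perfect on its own training points by construction; the interesting part is the printed held-out accuracy, where it typically falls behind the simple line. That gap between training and test performance is the overfitting the talk describes.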
The perfect scenario in this case would be something 00:06:11.380 --> 00:06:17.340 like this: a classifier which is really close to the distribution we have in the 00:06:17.340 --> 00:06:23.350 real world and machine learning is tasked with finding this perfect model and its 00:06:23.350 --> 00:06:28.960 parameters. Let me show you a different kind of model, something you probably all 00:06:28.960 --> 00:06:35.670 have heard about: Neural networks. Neural networks are inspired by the brain. 00:06:35.670 --> 00:06:41.210 Or more precisely, by the neurons in our brain. Neurons are tiny objects, tiny 00:06:41.210 --> 00:06:47.250 cells in our brain that take some input and generate some output. Sounds familiar, 00:06:47.250 --> 00:06:52.680 right? We have inputs usually in the form of electrical signals. And if they are 00:06:52.680 --> 00:06:57.860 strong enough, this neuron will also send out an electrical signal. And this is 00:06:57.860 --> 00:07:03.430 something we can model in a computer- engineering way. So, what we do is: We 00:07:03.430 --> 00:07:09.240 take a neuron. The neuron is just a simple mapping from input to output. Input here, 00:07:09.240 --> 00:07:17.200 just three input nodes. We denote them by i1, i2 and i3 and output denoted by o. And 00:07:17.200 --> 00:07:20.840 now you will actually see some mathematical equations. There are not many 00:07:20.840 --> 00:07:26.700 of these in this foundation talk, don't worry, and it's really simple. There's one 00:07:26.700 --> 00:07:30.250 more thing we need first, though, if we want to map input to output in the way a 00:07:30.250 --> 00:07:35.490 neuron does. Namely, the weights. The weights are just some arbitrary numbers 00:07:35.490 --> 00:07:43.020 for now. Let's call them w1, w2 and w3. So, we take those weights and we multiply 00:07:43.020 --> 00:07:51.360 them with the input. Input1 times weight1, input2 times weight2, and so on. And this, 00:07:51.360 --> 00:07:57.550 this sum just will be our output. Well, not quite. We make it a little bit more 00:07:57.550 --> 00:08:02.430 complicated. We also use something called an activation function. The activation 00:08:02.430 --> 00:08:08.520 function is just a mapping from one scalar value to another scalar value. In this 00:08:08.520 --> 00:08:14.280 case from what we got as an output, the sum, to something that more closely fits 00:08:14.280 --> 00:08:19.360 what we need. This could for example be something binary, where we have all the 00:08:19.360 --> 00:08:23.780 negative numbers being mapped to zero and all the positive numbers being mapped to 00:08:23.780 --> 00:08:30.910 one. And then this zero and one can encode something. For example: rabbit or unicorn. 00:08:30.910 --> 00:08:35.309 So, let me give you an example of how we can make the previous example with the 00:08:35.309 --> 00:08:41.729 rabbits and unicorns work with such a simple neuron. We just use speed, size, 00:08:41.729 --> 00:08:49.650 and the arbitrarily chosen number 10 as our inputs and the weights 1, 1, and -1. 00:08:49.650 --> 00:08:54.400 If we look at the equations, then we get for our negative numbers — so, speed plus 00:08:54.400 --> 00:09:01.440 size being less than 10 — a 0, and a 1 for all positive numbers — being speed plus 00:09:01.440 --> 00:09:07.680 size larger than 10, greater than 10. This way we again have a separating line 00:09:07.680 --> 00:09:14.600 between unicorns and rabbits. But again we have this really simplistic model. 
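As a worked example, the neuron just described can be written down directly. Only the weights 1, 1, -1 and the threshold of 10 come from the talk; the speed and size values fed in at the end are made-up numbers.

def step(x):
    # binary activation function: negative sums map to 0, positive sums to 1
    return 1 if x > 0 else 0

def neuron(inputs, weights):
    # weighted sum i1*w1 + i2*w2 + i3*w3, then the activation function
    s = sum(i * w for i, w in zip(inputs, weights))
    return step(s)

weights = [1, 1, -1]                  # w1, w2, w3 from the talk
print(neuron([3, 2, 10], weights))    # speed + size = 5  < 10 -> 0 (rabbit)
print(neuron([8, 6, 10], weights))    # speed + size = 14 > 10 -> 1 (unicorn)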
We want 00:09:14.600 --> 00:09:21.529 to become more and more complicated in order to express more complex tasks. So 00:09:21.529 --> 00:09:26.279 what do we do? We take more neurons. We take our three input values and put them 00:09:26.279 --> 00:09:31.920 into one neuron, and into a second neuron, and into a third neuron. And we take the 00:09:31.920 --> 00:09:38.330 output of those three neurons as input for another neuron. We also call this a 00:09:38.330 --> 00:09:42.140 multilayer perceptron, perceptron just being a different name for a neuron, what 00:09:42.140 --> 00:09:48.670 we have there. And the whole thing is also called a neural network. So now the 00:09:48.670 --> 00:09:53.300 question: How do we train this? How do we learn what this network should encode? 00:09:53.300 --> 00:09:57.620 Well, we want a mapping from input to output, and what we can change are the 00:09:57.620 --> 00:10:02.880 weights. First, what we do is we take a training sample, some input. Put it 00:10:02.880 --> 00:10:07.010 through the network, get an output. But this might not be the desired output which 00:10:07.010 --> 00:10:13.570 we know. So, in the binary case there are four possible cases: computed output, 00:10:13.570 --> 00:10:19.860 expected output, each two values, 0 and 1. The best case would be: we want a 0, get a 00:10:19.860 --> 00:10:27.120 0, want a 1 and get a 1. But there is also the opposite case. In these two cases we 00:10:27.120 --> 00:10:31.440 can learn something about our model. Namely, in which direction to change the 00:10:31.440 --> 00:10:37.270 weights. It's a little bit simplified, but in principle you just raise the weights if 00:10:37.270 --> 00:10:41.250 you need a higher number as output and you lower the weights if you need a lower 00:10:41.250 --> 00:10:47.350 number as output. To tell you how much, we have two terms. First term being the 00:10:47.350 --> 00:10:53.110 error, so in this case just the difference between desired and computed output – also 00:10:53.110 --> 00:10:56.890 often called a loss function, especially in deep learning and more complex 00:10:56.890 --> 00:11:04.120 applications. You also have a second term we call the learning rate, and the 00:11:04.120 --> 00:11:09.170 learning rate is what tells us how quickly we should change the weights, how quickly 00:11:09.170 --> 00:11:14.890 we should adapt the weights. Okay, this is how we learn a model. This is almost 00:11:14.890 --> 00:11:18.550 everything you need to know. There are mathematical equations that tell you how 00:11:18.550 --> 00:11:23.770 much to change based on the error and the learning rate. And this is the entire 00:11:23.770 --> 00:11:30.339 learning process. Let's get back to the terminology. We have the input layer. We 00:11:30.339 --> 00:11:34.020 have the output layer, which somehow encodes our output either in one value or 00:11:34.020 --> 00:11:39.650 in several values if we have multiple classes. We also have 00:11:39.650 --> 00:11:45.930 the hidden layers, which are actually what makes our model deep. What we can change, 00:11:45.930 --> 00:11:51.980 what we can learn, are the weights, the parameters of this model. But what we 00:11:51.980 --> 00:11:55.490 also need to keep in mind, is the number of layers, the number of neurons per 00:11:55.490 --> 00:11:59.590 layer, the learning rate, and the activation function.
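A minimal sketch of this update rule for a single neuron, which is essentially the classic perceptron learning rule; the toy training pairs and the learning rate of 0.1 are assumptions, not numbers from the talk.

def step(x):
    return 1 if x > 0 else 0

# training pairs: ([speed, size, 10], desired class) -- toy numbers
data = [([3, 2, 10], 0), ([4, 1, 10], 0), ([8, 6, 10], 1), ([9, 7, 10], 1)]
weights = [0.0, 0.0, 0.0]
lr = 0.1                       # learning rate: how quickly the weights adapt

for epoch in range(20):
    for inputs, desired in data:
        out = step(sum(i * w for i, w in zip(inputs, weights)))
        error = desired - out  # the error term from the talk
        # raise the weights if the output was too low, lower them if too high,
        # scaled by the learning rate and by the input values
        weights = [w + lr * error * i for w, i in zip(weights, inputs)]

print(weights)

Each pass nudges the weights in proportion to the error and the learning rate until the computed outputs match the desired classes on this separable toy set.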
These are called 00:11:59.590 --> 00:12:04.240 hyperparameters, and they determine how complex our model is, how well it is 00:12:04.240 --> 00:12:09.970 suited to solve the task at hand. I quite often spoke about solving tasks, so the 00:12:09.970 --> 00:12:14.630 question is: What can we actually do with neural networks? Mostly classification 00:12:14.630 --> 00:12:19.560 tasks, for example: Tell me, is this animal a rabbit or unicorn? Is this text 00:12:19.560 --> 00:12:24.690 message spam or legitimate? Is this patient healthy or ill? Is this image a 00:12:24.690 --> 00:12:30.710 picture of a cat or a dog? We already saw for the animal that we need something 00:12:30.710 --> 00:12:35.040 called features, which somehow encodes information about what we want to 00:12:35.040 --> 00:12:39.530 classify, something we can use as input for the neural network. Some kind of 00:12:39.530 --> 00:12:43.830 number that is meaningful. So, for the animal it could be speed, size, or 00:12:43.830 --> 00:12:48.740 something like color. Color, of course, being more complex again, because we have, 00:12:48.740 --> 00:12:55.940 for example, RGB, so three values. And a text message being a more complex case 00:12:55.940 --> 00:13:00.060 again, because we somehow need to encode the sender, and whether the sender is 00:13:00.060 --> 00:13:04.770 legitimate. Same for the recipient, or the number of hyperlinks, or where the 00:13:04.770 --> 00:13:11.400 hyperlinks refer to, or whether there are certain words present in the text. It 00:13:11.400 --> 00:13:16.720 gets more and more complicated. Even more so for a patient. How do we encode medical 00:13:16.720 --> 00:13:22.420 history in a proper way for the network to learn? I mean, temperature is simple. It's 00:13:22.420 --> 00:13:26.750 a scalar value, we just have a number. But how do we encode whether certain symptoms 00:13:26.750 --> 00:13:32.720 are present? And the image, which is actually what I work with every day, is 00:13:32.720 --> 00:13:38.350 again quite complex. We have values, we have numbers, but only pixel values, which 00:13:38.350 --> 00:13:43.450 are difficult to use as input for a neural network. Why? 00:13:43.450 --> 00:13:48.350 I'll show you. I'll actually show you with this picture, it's a very famous picture, 00:13:48.350 --> 00:13:53.970 and everybody uses it in computer vision. They will tell you, it's because there is 00:13:53.970 --> 00:14:01.010 a multitude of different characteristics in this image: shapes, edges, whatever you 00:14:01.010 --> 00:14:07.080 desire. The truth is, it's a crop from the centrefold of the Playboy, and in earlier 00:14:07.080 --> 00:14:12.070 years, computer vision engineers were a mostly male audience. Anyway, let's take 00:14:12.070 --> 00:14:16.850 five by five pixels. Let's assume this is a five by five pixel, really small, 00:14:16.850 --> 00:14:22.230 image. If we take those 25 pixels and use them as input for a neural network you 00:14:22.230 --> 00:14:26.730 already see that we have many connections - many weights - which means a very 00:14:26.730 --> 00:14:32.540 complex model. Complex model, of course, prone to overfitting. But there are more 00:14:32.540 --> 00:14:38.800 problems. First being, we have disconnected a 00:14:38.800 --> 00:14:43.670 pixel from its neighbors. We can't encode information about the neighborhood 00:14:43.670 --> 00:14:47.850 anymore, and that really sucks.
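The "many connections" problem is easy to quantify. Below is a small sketch of the weight count of a fully connected layer; the layer widths and the 256x256 image size are assumed purely for illustration.

def fully_connected_weights(inputs, neurons):
    # every input is connected to every neuron -> inputs * neurons weights
    return inputs * neurons

print(fully_connected_weights(5 * 5, 10))        # 5x5 image, 10 neurons: 250 weights
print(fully_connected_weights(256 * 256, 1000))  # 256x256 image, 1000 neurons: 65,536,000 weights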
If we just take the whole picture, and move it to the 00:14:47.850 --> 00:14:52.790 left or to the right by just one pixel, the network will see something completely 00:14:52.790 --> 00:14:58.470 different, even though to us it is exactly the same. But, we can solve that with some 00:14:58.470 --> 00:15:03.400 very clever engineering, something we call a convolutional layer. It is again a 00:15:03.400 --> 00:15:08.860 hidden layer in a neural network, but it does something special. It actually is a 00:15:08.860 --> 00:15:13.970 very simple neuron again, just four input values - one output value. But the four 00:15:13.970 --> 00:15:19.780 input values look at two by two pixels, and encode one output value. And then the 00:15:19.780 --> 00:15:23.790 same network is shifted to the right, and encodes another pixel, and another pixel, 00:15:23.790 --> 00:15:30.150 and the next row of pixels. And in this way creates another 2D image. We have 00:15:30.150 --> 00:15:34.900 preserved information about the neighborhood, and we just have a very low 00:15:34.900 --> 00:15:41.910 number of weights, not the huge number of parameters we saw earlier. We can use this 00:15:41.910 --> 00:15:49.640 once, or twice, or several hundred times. And this is actually where we go deep. 00:15:49.640 --> 00:15:54.920 Deep means: We have several layers, and having layers that don't need thousands or 00:15:54.920 --> 00:16:01.040 millions of connections, but only a few. This is what allows us to go really deep. 00:16:01.040 --> 00:16:06.250 And in this fashion we can encode an entire image in just a few meaningful 00:16:06.250 --> 00:16:11.480 values. How these values look like, and what they encode, this is learned through 00:16:11.480 --> 00:16:18.240 the learning process. And we can then, for example, use these few values as input for 00:16:18.240 --> 00:16:24.709 a classification network. The fully connected network we saw earlier. 00:16:24.709 --> 00:16:29.560 Or we can do something more clever. We can do the inverse operation and create an image 00:16:29.560 --> 00:16:35.170 again, for example, the same image, which is then called an auto encoder. Auto 00:16:35.170 --> 00:16:40.200 encoders are tremendously useful, even though they don't appear that way. For 00:16:40.200 --> 00:16:43.959 example, imagine you want to check whether something has a defect, or not, a picture 00:16:43.959 --> 00:16:51.290 of a fabric, or of something. You just train the network with normal pictures. 00:16:51.290 --> 00:16:56.770 And then, if you have a defect picture, the network is not able to produce this 00:16:56.770 --> 00:17:02.149 defect. And so the difference of the reproduced picture, and the real picture 00:17:02.149 --> 00:17:07.420 will show you where errors are. If it works properly, I'll have to admit that. 00:17:07.420 --> 00:17:12.569 But we can go even further. Let's say, we want to encode something entirely else. 00:17:12.569 --> 00:17:17.400 Well, let's encode the image, the information in the image, but in another 00:17:17.400 --> 00:17:21.859 representation. For example, let's say we have three classes again. The background 00:17:21.859 --> 00:17:30.049 class in grey, a class called hat or headwear in blue, and person in green. We 00:17:30.049 --> 00:17:34.309 can also use this for other applications than just for pictures of humans. For 00:17:34.309 --> 00:17:38.370 example, we have a picture of a street and want to encode: Where is the car, where's 00:17:38.370 --> 00:17:44.860 the pedestrian? Tremendously useful. 
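A plain-Python sketch of such a convolutional layer: the same four weights are slid across the image as a 2x2 window, producing a new, slightly smaller 2D image. The example image and the weights are made up; these particular weights happen to respond only at the vertical edge in the toy image, whereas in a real network they would be learned.

def conv2x2(image, weights):
    # slide a 2x2 window over the image; the same 4 weights are reused at
    # every position, so neighbourhood information is preserved
    h, w = len(image), len(image[0])
    out = []
    for y in range(h - 1):
        row = []
        for x in range(w - 1):
            patch = [image[y][x], image[y][x + 1],
                     image[y + 1][x], image[y + 1][x + 1]]
            row.append(sum(p * k for p, k in zip(patch, weights)))
        out.append(row)
    return out

image = [[0, 0, 1, 1, 1],
         [0, 0, 1, 1, 1],
         [0, 0, 1, 1, 1],
         [0, 0, 1, 1, 1],
         [0, 0, 1, 1, 1]]
weights = [1, -1, 1, -1]   # only 4 parameters, reused across the whole image
for row in conv2x2(image, weights):
    print(row)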
Or we have an MRI scan of a brain: Where in the 00:17:44.860 --> 00:17:51.110 brain is the tumor? Can we somehow learn this? Yes, we can do this, with methods 00:17:51.110 --> 00:17:57.480 like these, if they are trained properly. More about that later. Well, we expect 00:17:57.480 --> 00:18:01.020 something like this to come out but the truth looks rather like this – especially 00:18:01.020 --> 00:18:05.870 if it's not properly trained. We don't get the real shape we want, but 00:18:05.870 --> 00:18:11.980 something distorted. So here is again where we need to do learning. First we 00:18:11.980 --> 00:18:15.790 take a picture, put it through the network, get our output representation. 00:18:15.790 --> 00:18:21.110 And we have the information about how we want it to look. We again compute some 00:18:21.110 --> 00:18:27.040 kind of loss value. This time for example being the overlap between the shape we get 00:18:27.040 --> 00:18:34.040 out of the model and the shape we want to have. And we use this error, this loss 00:18:34.040 --> 00:18:38.660 function, to update the weights of our network. Again – even though it's more 00:18:38.660 --> 00:18:43.570 complicated here, even though we have more layers, and even though the layers look 00:18:43.570 --> 00:18:48.640 slightly different – it is the same process all over again as with a binary 00:18:48.640 --> 00:18:56.540 case. And we need lots of training data. This is something that you'll hear often 00:18:56.540 --> 00:19:02.960 in connection with deep learning: You need lots of training data to make this work. 00:19:02.960 --> 00:19:10.100 Images are complex things and in order to meaningfully extract knowledge from them, 00:19:10.100 --> 00:19:17.090 the network needs to see a multitude of different images. Well, now I already 00:19:17.090 --> 00:19:22.230 showed you some things we use in network architecture, some support networks: The 00:19:22.230 --> 00:19:26.679 fully convolutional encoder, which takes an image and produces a few meaningful 00:19:26.679 --> 00:19:33.110 values out of this image; its counterpart the fully convolutional decoder – fully 00:19:33.110 --> 00:19:36.960 convolutional meaning by the way that we only have these convolutional layers with 00:19:36.960 --> 00:19:42.980 a few parameters that somehow encode spatial information and keep it for the 00:19:42.980 --> 00:19:49.360 next layers. The decoder takes a few meaningful numbers and reproduces an image 00:19:49.360 --> 00:19:55.420 – either the same image or another representation of the information encoded 00:19:55.420 --> 00:20:01.400 in the image. We also already saw the fully connected network. Fully connected 00:20:01.400 --> 00:20:06.640 meaning every neuron is connected to every neuron in the next layer. This of course 00:20:06.640 --> 00:20:12.570 can be dangerous because this is where we actually get most of our parameters. If we 00:20:12.570 --> 00:20:16.390 have a fully connected network, this is where the most parameters will be present 00:20:16.390 --> 00:20:21.580 because connecting every node to every node … this is just a high number of 00:20:21.580 --> 00:20:25.860 connections. We can also do other things. For example something called a pooling 00:20:25.860 --> 00:20:32.280 layer. A pooling layer being basically the same as one of those convolutional layers, 00:20:32.280 --> 00:20:36.370 just that we don't have parameters we need to learn.
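Here is a sketch of such an overlap-based loss. One common choice for segmentation is the Dice coefficient (named here as an assumption, since the talk only says "overlap"); it is computed on two toy binary masks, and in real training this value would be backpropagated through the network to update the weights.

import numpy as np

def dice_loss(pred, target, eps=1e-6):
    # overlap-based loss: perfect overlap -> 0, no overlap -> close to 1
    intersection = (pred * target).sum()
    return 1.0 - (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

target = np.zeros((6, 6)); target[2:5, 2:5] = 1   # the shape we want to have
pred   = np.zeros((6, 6)); pred[1:4, 1:4] = 1     # distorted network output

print(dice_loss(pred, target))    # about 0.56: plenty left to learn
print(dice_loss(target, target))  # about 0.0: perfect overlap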
This works without parameters 00:20:36.370 --> 00:20:43.740 because this neuron just chooses whichever value is the highest and takes that value 00:20:43.740 --> 00:20:49.600 as output. This is really great for reducing the size of your image and also 00:20:49.600 --> 00:20:55.150 getting rid of information that might not be that important. We can also do some 00:20:55.150 --> 00:20:59.890 clever techniques like adding a dropout layer. A dropout layer just being a normal 00:20:59.890 --> 00:21:05.799 layer in a neural network where we remove some connections: In one training step 00:21:05.799 --> 00:21:10.720 these connections, in the next training step some other connections. This way we 00:21:10.720 --> 00:21:18.049 teach the other connections to become more resilient against errors. I would like to 00:21:18.049 --> 00:21:22.750 start with something I call the "Model Show" now, and show you some models and 00:21:22.750 --> 00:21:28.870 how we train those models. And I will start with a fully convolutional decoder 00:21:28.870 --> 00:21:34.740 we saw earlier: This thing that takes a number and creates a picture. I would like 00:21:34.740 --> 00:21:41.420 to take this model, put in some number and get out a picture – a picture of a horse 00:21:41.420 --> 00:21:46.000 for example. If I put in a different number I also want to get a picture of a 00:21:46.000 --> 00:21:52.390 horse, but of a different horse. So what I want to get is a mapping from some 00:21:52.390 --> 00:21:56.730 numbers, some features that encode something about the horse picture, and get 00:21:56.730 --> 00:22:03.450 a horse picture out of it. You might see already why this is problematic. It is 00:22:03.450 --> 00:22:08.230 problematic because we don't have a mapping from feature to horse or from 00:22:08.230 --> 00:22:15.050 horse to features. So we don't have a truth value we can use to learn how to 00:22:15.050 --> 00:22:21.790 generate this mapping. Well computer vision engineers – or deep learning 00:22:21.790 --> 00:22:26.800 professionals – they're smart and have clever ideas. Let's just assume we have 00:22:26.800 --> 00:22:32.870 such a network and let's call it a generator. Let's take some numbers put, 00:22:32.870 --> 00:22:39.240 them into the generator and get some horses. Well it doesn't work yet. We still 00:22:39.240 --> 00:22:42.490 have to train it. So they're probably not only horses but also some very special 00:22:42.490 --> 00:22:47.970 unicorns among the horses; which might be nice for other applications, but I wanted 00:22:47.970 --> 00:22:55.480 pictures of horses right now. So I can't train with this data directly. But what I 00:22:55.480 --> 00:23:01.600 can do is I can create a second network. This network is called a discriminator and 00:23:01.600 --> 00:23:08.820 I can give it the input generated from the generator as well as the real data I have: 00:23:08.820 --> 00:23:13.920 the real horse pictures. And then I can teach the discriminator to distinguish 00:23:13.920 --> 00:23:22.080 between those. Tell me it is a real horse or it's not a real horse. And there I know 00:23:22.080 --> 00:23:27.000 what is the truth because I either take real horse pictures or fake horse pictures 00:23:27.000 --> 00:23:34.170 from the generator. So I have a truth value for this discriminator. But in doing 00:23:34.170 --> 00:23:39.070 this I also have a truth value for the generator. Because I want the generator to 00:23:39.070 --> 00:23:43.799 work against the discriminator. 
So I can also use the information about how well the 00:23:43.799 --> 00:23:51.010 discriminator does to train the generator to become better at fooling it. This is 00:23:51.010 --> 00:23:57.470 called a generative adversarial network. And it can be used to generate pictures of 00:23:57.470 --> 00:24:02.350 an arbitrary distribution. Let's do this with numbers and I will actually show you 00:24:02.350 --> 00:24:07.590 the training process. Before I start the video, I'll tell you what I did. I took 00:24:07.590 --> 00:24:11.550 some handwritten digits. There is a database called MNIST of handwritten 00:24:11.550 --> 00:24:18.570 digits, so the numbers 0 to 9. And I took those and used them as training data. 00:24:18.570 --> 00:24:24.299 I trained a generator in the way I showed you on the previous slide, and then I just 00:24:24.299 --> 00:24:30.110 took some random numbers. I put those random numbers into the network and just 00:24:30.110 --> 00:24:35.960 stored the image of what came out of the network. And here in the video you'll see 00:24:35.960 --> 00:24:43.090 how the network improved with ongoing training. You will see that we start 00:24:43.090 --> 00:24:50.179 basically with just noisy images … and then after some – what we call epochs, 00:24:50.179 --> 00:24:55.919 so training iterations – the network is able to almost perfectly generate 00:24:55.919 --> 00:25:05.679 handwritten digits just from noise. Which I find truly fascinating. Of course this 00:25:05.679 --> 00:25:11.270 is an example where it works. It highly depends on your data set and how you train 00:25:11.270 --> 00:25:15.600 the model whether it is a success or not. But if it works, you can use it to 00:25:15.600 --> 00:25:22.559 generate fonts. You can generate characters, 3D objects, pictures of 00:25:22.559 --> 00:25:28.700 animals, whatever you want as long as you have training data. Let's go more crazy. 00:25:28.700 --> 00:25:34.539 Let's take two of those and let's say we have pictures of horses and pictures of 00:25:34.539 --> 00:25:41.150 zebras. I want to convert those pictures of horses into pictures of zebras, and I 00:25:41.150 --> 00:25:44.590 want to convert pictures of zebras into pictures of horses. So I want to have the 00:25:44.590 --> 00:25:49.690 same picture just with the other animal. But I don't have training data of the same 00:25:49.690 --> 00:25:56.270 situation just once with a horse and once with a zebra. Doesn't matter. We can train 00:25:56.270 --> 00:26:00.650 a network that does that for us. Again we just have a network – we call it the 00:26:00.650 --> 00:26:05.730 generator – and we have two of those: One that converts horses to zebras and one 00:26:05.730 --> 00:26:14.840 that converts zebras to horses. And then we also have two discriminators that tell 00:26:14.840 --> 00:26:21.150 us: real horse – fake horse – real zebra – fake zebra. And then we again need to 00:26:21.150 --> 00:26:27.210 perform some training. So we need to somehow encode: Did what we wanted 00:26:27.210 --> 00:26:31.460 to do work? And a very simple way to do this is we take a picture of a horse, put it 00:26:31.460 --> 00:26:35.470 through the generator that generates a zebra. Take this fake picture of a zebra, 00:26:35.470 --> 00:26:39.340 put it through the generator that generates a picture of a horse. And if 00:26:39.340 --> 00:26:43.700 this is the same picture as we put in, then our model worked. And if it didn't, 00:26:43.700 --> 00:26:48.549 we can use that information to update the weights.
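A compact sketch of one training step of such a generative adversarial network, written with pytorch (the framework recommended later in the Q&A). The layer sizes are arbitrary and the random "real" batch merely stands in for an actual dataset such as the handwritten digits.

import torch
import torch.nn as nn

latent, image_size = 16, 28 * 28

generator = nn.Sequential(nn.Linear(latent, 128), nn.ReLU(),
                          nn.Linear(128, image_size), nn.Tanh())
discriminator = nn.Sequential(nn.Linear(image_size, 128), nn.ReLU(),
                              nn.Linear(128, 1), nn.Sigmoid())

loss = nn.BCELoss()
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

real = torch.rand(64, image_size)     # placeholder for a batch of real images
noise = torch.randn(64, latent)
fake = generator(noise)

# step 1: train the discriminator to label real images 1 and fake images 0
opt_d.zero_grad()
d_loss = (loss(discriminator(real), torch.ones(64, 1)) +
          loss(discriminator(fake.detach()), torch.zeros(64, 1)))
d_loss.backward()
opt_d.step()

# step 2: train the generator to fool the discriminator (fake -> "real")
opt_g.zero_grad()
g_loss = loss(discriminator(fake), torch.ones(64, 1))
g_loss.backward()
opt_g.step()
print(d_loss.item(), g_loss.item())

Step 1 teaches the discriminator to tell real from fake; step 2 reuses its judgement as the training signal for the generator, which is the adversarial game described above.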
I just took a random picture, 00:26:48.549 --> 00:26:54.460 from a free library in the Internet, of a horse and generated a zebra and it worked 00:26:54.460 --> 00:26:59.470 remarkably well. I actually didn't even do training. It also doesn't need to be a 00:26:59.470 --> 00:27:03.120 picture. You can also convert text to images: You describe something in words 00:27:03.120 --> 00:27:09.570 and generate images. You can age your face or age a cell; or make a patient healthy 00:27:09.570 --> 00:27:15.510 or sick – or the image of a patient, not the patient self, unfortunately. You can 00:27:15.510 --> 00:27:20.690 do style transfer like take a picture of Van Gogh and apply it to your own picture. 00:27:20.690 --> 00:27:27.559 Stuff like that. Something else that we can do with neural networks. Let's assume 00:27:27.559 --> 00:27:31.030 we have a classification network, we have a picture of a toothbrush and the network 00:27:31.030 --> 00:27:36.770 tells us: Well, this is a toothbrush. Great! But how resilient is this network? 00:27:36.770 --> 00:27:44.530 Does it really work in every scenario. There's a second network we can apply: We 00:27:44.530 --> 00:27:48.701 call it an adversarial network. And that network is trained to do one thing: Look 00:27:48.701 --> 00:27:52.289 at the network, look at the picture, and then find the one weak spot in the 00:27:52.289 --> 00:27:55.880 picture: Just change one pixel slightly so that the network will tell me this 00:27:55.880 --> 00:28:03.600 toothbrush is an octopus. Works remarkably well. Also works with just changing the 00:28:03.600 --> 00:28:08.940 picture slightly, so changing all the pixels, but just slight minute changes 00:28:08.940 --> 00:28:12.860 that we don't perceive, but the network – the classification network – is completely 00:28:12.860 --> 00:28:19.640 thrown off. Well sounds bad. Is bad if you don't consider it. But you can also for 00:28:19.640 --> 00:28:24.200 example use this for training your network and make your network resilient. So 00:28:24.200 --> 00:28:28.460 there's always an upside and downside. Something entirely else: Now I'd like to 00:28:28.460 --> 00:28:32.880 show you something about text. A word- language model. I want to generate 00:28:32.880 --> 00:28:38.101 sentences for my podcast. I have a network that gives me a word, and then if I want 00:28:38.101 --> 00:28:42.640 to somehow get the next word in the sentence, I also need to consider this 00:28:42.640 --> 00:28:47.070 word. So another network architecture – quite interestingly – just takes the 00:28:47.070 --> 00:28:52.179 hidden states of the network and uses them as the input for the same network so that 00:28:52.179 --> 00:28:58.780 in the next iteration we still know what we did in the previous step. I tried to 00:28:58.780 --> 00:29:04.730 train a network that generates podcast episodes for my podcasts. Didn't work. 00:29:04.730 --> 00:29:08.450 What I learned is I don't have enough training data. I really need to produce 00:29:08.450 --> 00:29:15.790 more podcast episodes in order to train a model to do my job for me. And this is 00:29:15.790 --> 00:29:21.539 very important, a very crucial point: Training data. We need shitloads of 00:29:21.539 --> 00:29:26.081 training data. And actually the more complicated our model and our training 00:29:26.081 --> 00:29:30.990 process becomes, the more training data we need. 
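A tiny numpy sketch of such a recurrent word-language model: the hidden state from the previous step is fed back in together with the current word, so the network keeps a memory of what it produced before. The four-word vocabulary and the random, untrained weights are assumptions, so the generated "sentence" is gibberish; only the wiring of the recurrence is the point.

import numpy as np

rng = np.random.default_rng(1)
vocab = ["deep", "learning", "is", "fun"]
V, H = len(vocab), 8                      # vocabulary size, hidden size

Wxh = rng.normal(0, 0.5, (H, V))          # input  -> hidden
Whh = rng.normal(0, 0.5, (H, H))          # hidden -> hidden (the feedback loop)
Why = rng.normal(0, 0.5, (V, H))          # hidden -> next-word scores

def one_hot(i):
    x = np.zeros(V); x[i] = 1.0
    return x

h = np.zeros(H)                           # hidden state carried between steps
word = 0                                  # start with "deep"
sentence = [vocab[word]]
for _ in range(5):
    h = np.tanh(Wxh @ one_hot(word) + Whh @ h)   # reuse the previous hidden state
    scores = Why @ h
    word = int(np.argmax(scores))                # pick the most likely next word
    sentence.append(vocab[word])
print(" ".join(sentence))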
I started with a supervised case – 00:29:30.990 --> 00:29:35.990 the really simple case where we 00:29:35.990 --> 00:29:40.660 have a picture and a label that corresponds to that picture; or a 00:29:40.660 --> 00:29:46.280 representation of that picture showing entirely what I wanted to learn. But we 00:29:46.280 --> 00:29:51.909 also saw a more complex task, where I had two pictures – horses and zebras – that are 00:29:51.909 --> 00:29:56.400 from two different domains – but domains with no direct mapping. What can also 00:29:56.400 --> 00:30:01.020 happen – and actually happens quite a lot – is weakly annotated data, so data that 00:30:01.020 --> 00:30:08.750 is not precisely annotated; where we can't rely on the information we get. Or even 00:30:08.750 --> 00:30:13.050 more complicated: Something called reinforcement learning where we perform a 00:30:13.050 --> 00:30:19.380 sequence of actions and then in the end are told "yeah that was great". Which is 00:30:19.380 --> 00:30:24.080 often not enough information to really perform proper training. But of course 00:30:24.080 --> 00:30:28.190 there are also methods for that. As well as there are methods for the unsupervised 00:30:28.190 --> 00:30:33.590 case where we don't have annotations, labeled data – no ground truth at all – 00:30:33.590 --> 00:30:41.241 just the picture itself. Well, I talked about pictures. I told you that we can 00:30:41.241 --> 00:30:45.320 learn features and create images from them. And we can use them for 00:30:45.320 --> 00:30:51.640 classification. And for this there exist many databases. There are public data sets 00:30:51.640 --> 00:30:56.659 we can use. Often they refer to, for example, Flickr. They're just hyperlinks, 00:30:56.659 --> 00:31:00.960 which is also why I didn't show you many pictures right here, because I am honestly 00:31:00.960 --> 00:31:05.690 not sure about the copyright in those cases. But there are also challenge 00:31:05.690 --> 00:31:11.190 datasets where you can just sign up, get, for example, some medical data sets, and 00:31:11.190 --> 00:31:16.650 then compete against other researchers. And of course there are those companies 00:31:16.650 --> 00:31:22.090 that just have lots of data. And those companies also have the means, the 00:31:22.090 --> 00:31:28.110 capacity to perform intense computations. And those are also often the companies you 00:31:28.110 --> 00:31:36.179 hear from in terms of innovation for deep learning. Well, this was mostly to tell you 00:31:36.179 --> 00:31:40.200 that you can process images quite well with deep learning if you have enough 00:31:40.200 --> 00:31:46.029 training data, if you have a proper training process and also a little if you 00:31:46.029 --> 00:31:52.090 know what you're doing. But you can also process text, you can process audio and 00:31:52.090 --> 00:31:58.520 time series like prices on a stock exchange – stuff like that. You can 00:31:58.520 --> 00:32:02.929 process almost everything if you make it encodable for your network. Sounds like a 00:32:02.929 --> 00:32:08.120 dream come true. But – as I already told you – you need data, a lot of it. I told 00:32:08.120 --> 00:32:14.020 you about those companies that have lots of data sets and the publicly available 00:32:14.020 --> 00:32:21.370 data sets which you can actually use to get started with your own experiments.
But 00:32:21.370 --> 00:32:24.309 that also makes it a little dangerous because deep learning still is a black box 00:32:24.309 --> 00:32:30.820 to us. I told you what happens inside the black box on a level that teaches you how 00:32:30.820 --> 00:32:36.529 we learn and how the network is structured, but not really what the 00:32:36.529 --> 00:32:42.831 network learned. It is for us computer vision engineers really nice that we can 00:32:42.831 --> 00:32:48.590 visualize the first layers of a neural network and see what is actually encoded 00:32:48.590 --> 00:32:53.950 in those first layers; what information the network looks at. But you can't really 00:32:53.950 --> 00:32:59.059 mathematically prove what happens in a network. Which is one major downside. And 00:32:59.059 --> 00:33:02.150 so if you want to use it, the numbers may be really great but be sure to properly 00:33:02.150 --> 00:33:08.059 evaluate them. In summary I call that "easy to learn". Every one – every single 00:33:08.059 --> 00:33:12.679 one of you – can just start with deep learning right away. You don't need to do 00:33:12.679 --> 00:33:19.440 much work. You don't need to do much learning. The model learns for you. But 00:33:19.440 --> 00:33:23.770 they're hard to master in a way that makes them useful for production use cases for 00:33:23.770 --> 00:33:29.900 example. So if you want to use deep learning for something – if you really 00:33:29.900 --> 00:33:34.299 want to seriously use it –, make sure that it really does what you wanted to and 00:33:34.299 --> 00:33:38.900 doesn't learn something else – which also happens. Pretty sure you saw some talks 00:33:38.900 --> 00:33:43.670 about deep learning fails – which is not what this talk is about. They're quite 00:33:43.670 --> 00:33:47.370 funny to look at. Just make sure that they don't happen to you! If you do that 00:33:47.370 --> 00:33:53.300 though, you'll achieve great things with deep learning, I'm sure. And that was 00:33:53.300 --> 00:34:00.740 introduction to deep learning. Thank you! Applause 00:34:09.172 --> 00:34:13.449 Herald Angel: So now it's question and answer time. So if you have a question, 00:34:13.449 --> 00:34:19.110 please line up at the mikes. We have in total eight, so it shouldn't be far from 00:34:19.110 --> 00:34:26.139 you. They are here in the corridors and on these sides. Please line up! For 00:34:26.139 --> 00:34:31.540 everybody: A question consists of one sentence with the question mark in the end 00:34:31.540 --> 00:34:38.449 – not three minutes of rambling. And also if you go to the microphone, speak into 00:34:38.449 --> 00:34:53.889 the microphone, so you really get close to it. Okay. Where do we have … Number 7! 00:34:53.889 --> 00:35:02.200 We start with mic number 7: Question: Hello. My question is: How did 00:35:02.200 --> 00:35:13.020 you compute the example for the fonts, the numbers? I didn't really understand it, 00:35:13.020 --> 00:35:19.770 you just said it was made from white noise. 00:35:19.770 --> 00:35:25.580 Teubi: I'll give you a really brief recap of what I did. I showed you that we have a 00:35:25.580 --> 00:35:31.140 model that maps image to some meaningful values, that an image can be encoded in 00:35:31.140 --> 00:35:36.860 just a few values. What happens here is exactly the other way round. We have some 00:35:36.860 --> 00:35:43.270 values, just some arbitrary values we actually know nothing about. We can 00:35:43.270 --> 00:35:47.480 generate pictures out of those. 
So I trained this model to just take some 00:35:47.480 --> 00:35:54.560 random values and show the pictures generated from the model. The training 00:35:54.560 --> 00:36:03.320 process was this "min max game", as its called. We have two networks that try to 00:36:03.320 --> 00:36:08.260 compete against each other. One network trying to distinguish, whether a picture 00:36:08.260 --> 00:36:12.790 it sees is real or one of those fake pictures, and the network that actually 00:36:12.790 --> 00:36:18.510 generates those pictures and in training the network that is able to distinguish 00:36:18.510 --> 00:36:24.599 between those, we can also get information for the training of the network that 00:36:24.599 --> 00:36:30.410 generates the pictures. So the videos you saw were just animations of what happens 00:36:30.410 --> 00:36:36.440 during this training process. At first if we input noise we get noise. But as the 00:36:36.440 --> 00:36:41.510 network is able to better and better recreate those images from the dataset we 00:36:41.510 --> 00:36:47.390 used as input, in this case pictures of handwritten digits, the output also became 00:36:47.390 --> 00:36:54.660 more lookalike to those numbers, these handwritten digits. Hope that helped. 00:36:54.660 --> 00:37:06.590 Herald Angel: Now we go to the Internet. – Can we get sound for the signal 00:37:06.590 --> 00:37:10.040 Angel, please? Teubi: Sounded so great, "now we go to the Internet." 00:37:10.040 --> 00:37:11.040 Herald Angel: Yeah, that sounds like "yeeaah". 00:37:11.040 --> 00:37:13.040 Signal Angel: And now we're finally ready to go to the interwebs. "Schorsch" is 00:37:13.040 --> 00:37:18.040 asking: Do you have any recommendations for a beginner regarding the framework or 00:37:18.040 --> 00:37:26.460 the software? Teubi: I, of course, am very biased to 00:37:26.460 --> 00:37:34.150 recommend what I use everyday. But I also think that it is a great start. Basically, 00:37:34.150 --> 00:37:40.210 use python and use pytorch. Many people will disagree with me and tell you 00:37:40.210 --> 00:37:45.930 "tensorflow is better." It might be, in my opinion not for getting started, and there 00:37:45.930 --> 00:37:51.560 are also some nice tutorials on the pytorch website. What you can also do is 00:37:51.560 --> 00:37:57.200 look at websites like OpenAI, where they have a gym to get you started with some 00:37:57.200 --> 00:38:02.371 training exercises, where you already have datasets. Yeah, basically my 00:38:02.371 --> 00:38:08.600 recommendation is get used to Python and start with a pytorch tutorial, see where 00:38:08.600 --> 00:38:13.590 to go from there. Often there also some github repositories linked with many 00:38:13.590 --> 00:38:18.740 examples for already established network architectures like the cycle GAN or the 00:38:18.740 --> 00:38:26.250 GAN itself or basically everything else. There will be a repo you can use to get 00:38:26.250 --> 00:38:29.940 started. Herald Angel: OK, we stay with the 00:38:29.940 --> 00:38:32.589 internet. There's some more questions, I heard. 00:38:32.589 --> 00:38:37.920 Signal Angel: Yes. Rubin8 is asking: Have you have you ever come across an example 00:38:37.920 --> 00:38:42.580 of a neural network that deals with audio instead of images? 00:38:42.580 --> 00:38:49.410 Teubi: Me personally, no. At least not directly. 
I've heard about examples, like 00:38:49.410 --> 00:38:54.859 where you can change the voice to sound like another person, but there is not much 00:38:54.859 --> 00:38:59.980 I can reliably tell about that. My expertise really is in image processing, 00:38:59.980 --> 00:39:05.550 I'm sorry. Herald Angel: And I think we have time for 00:39:05.550 --> 00:39:12.340 one more question. We have one at number 8. Microphone number 8. 00:39:12.340 --> 00:39:20.730 Question: The current face recognition technology in, for example, the iPhone X – is 00:39:20.730 --> 00:39:26.420 it also a deep learning algorithm or is it something more simple? Do you have any 00:39:26.420 --> 00:39:31.880 idea about that? Teubi: As far as I know, yes. That's all I 00:39:31.880 --> 00:39:38.630 can reliably tell you about that, but it is not only based on images but also uses 00:39:38.630 --> 00:39:45.420 other information. I think distance information encoded with some infrared 00:39:45.420 --> 00:39:50.599 signals. I don't really know exactly how it works, but at least iPhones already 00:39:50.599 --> 00:39:56.000 have a neural network processing engine built in, so a chip 00:39:56.000 --> 00:40:01.190 dedicated to just doing those computations. You saw that many of those 00:40:01.190 --> 00:40:05.820 things can be parallelized, and this is what those hardware architectures make use 00:40:05.820 --> 00:40:10.380 of. So I'm pretty confident in saying, yes, they also do it there. 00:40:10.380 --> 00:40:12.786 How exactly, no clue. 00:40:13.760 --> 00:40:15.323 Herald Angel: OK. I myself have a last 00:40:15.390 --> 00:40:20.680 completely unrelated question: Did you create the design of the slides yourself? 00:40:20.680 --> 00:40:29.060 Teubi: I had some help. We have a really great Congress design and I use that as an 00:40:29.060 --> 00:40:32.790 inspiration to create those slides, yes. 00:40:32.790 --> 00:40:36.760 Herald Angel: OK, yeah, because those are really amazing. I love them. 00:40:36.760 --> 00:40:38.140 Teubi: Thank you! 00:40:38.470 --> 00:40:41.200 Herald Angel: OK, thank you very much Teubi. 00:40:45.130 --> 00:40:48.900 35C3 outro music 00:40:48.900 --> 00:41:07.000 subtitles created by c3subtitles.de in the year 2019. Join, and help us!