Herald: All right, now it's my great pleasure to introduce Paul Emmerich, who is going to talk about "Demystifying Network Cards". Paul is a PhD student at the Technical University of Munich. He's doing all kinds of network-related stuff, and hopefully today he's going to help us make network cards a bit less of a black box. So, please give a warm welcome to Paul.

[Applause]

Paul: Thank you. As the introduction already said, I'm a PhD student and I'm researching the performance of software packet processing and forwarding systems. That means I spend a lot of time doing low-level optimizations and looking into what makes a system fast, what makes it slow, and what can be done to improve it. I'm mostly working on my packet generator MoonGen; I have some cross-promotion of a lightning talk about that on Saturday. But here I have this long slot and I brought a lot of content, so I have to talk really fast; sorry to the translators, and I hope you can follow along.

So, this is about network cards, meaning network cards you have all seen. This is a usual 10G network card with an SFP+ port, and this is a faster network card with a QSFP+ port; that is 20, 40, or 100G. Now you've bought this fancy network card, you plug it into your server or your MacBook or whatever, and you start your web server that serves cat pictures and cat videos. You all know that there's a whole stack of protocols that your cat picture has to go through until it arrives at the network card at the bottom, and the only things I care about are the lower layers. I don't care about TCP; I have no idea how TCP works. Well, I have some idea how it works, but it is not my research and I don't care about it. I just want to look at individual packets, and the highest thing I look at is maybe an IP address, or maybe a part of the protocol to identify flows.

Now you might wonder: is there anything even interesting in these lower layers? People nowadays think that everything runs on top of HTTP, but you might be surprised that not all applications do. There is a lot of software that needs to run at these lower levels, and in recent years there has been a trend of moving network infrastructure from specialized hardware black boxes to open software boxes. Examples of such software that used to be hardware are routers, switches, firewalls, middleboxes and so on.
If you want to look up the relevant buzzword, it's called Network Function Virtualization, and it is a trend of the recent years.

Now let's say we want to build our own fancy application on that low level. We want to build our firewall/router/packet-forwarder/modifier thing that does whatever is useful on that lower layer for network infrastructure, and I will use this as the demo application for this talk: everything will be about this hypothetical router/firewall/packet-forwarder/modifier thing. What it does: it receives packets on one or multiple network interfaces, it does stuff with the packets, it filters them, modifies them, routes them, and it sends them out to some other port, or maybe the same port, or maybe multiple ports, whatever these low-level applications do. This means the application operates on individual packets, not on a stream of TCP or UDP packets, and it has to cope with small packets, because that is just the worst case: you get a lot of small packets.

Now you want to build the application. You go to the Internet and look up how to build a packet forwarding application, and the Internet tells you: there is the socket API, the socket API is great, and it allows you to get packets into your program. So you build your application on top of the socket API. You run in user space, you use your socket, the socket talks to the operating system, the operating system talks to the driver, the driver talks to the network card, and everything is fine, except that it isn't. Because what it really looks like when you build this application: there is this huge, scary gap between user space and kernel space, and you somehow need your packets to go across it without being eaten.

You might wonder why I say this gap is a big deal, because you think: "Well, my web server serving cat pictures is doing just fine on a fast connection." Well, it is, because it is serving large packets, or even large chunks of files that it hands to the kernel at once. You can take your whole cat video, give it to the kernel, and the kernel will handle everything,
from packetizing it for TCP. But what we want to build is an application that has to cope with the worst case of lots of small packets coming in, and the overhead you get from this gap is mostly per packet, not per byte. So lots of small packets are a problem for this interface. When I say "problem" I'm always talking about performance, because I'm mostly about performance.

So if you look at performance, a few figures to get started: how many packets can you fit over your usual 10G link? That's around fifteen million per second. But 10G is last year's news; this year you have multiple-hundred-gigabit connections, even to this location here. A 100G link can carry up to 150 million packets per second. And how much time does that give us on a CPU? Say we have a three gigahertz CPU in our MacBook running the router; that means we have around 200 cycles per packet if we want to handle one 10G link with one CPU core. Of course we have multiple cores, but you also have multiple links, and faster links than 10G. So the typical performance target you would aim for when building such an application is five to ten million packets per second per CPU core, per thread that you start. That's the usual target, and it is just for forwarding, just to receive the packet and send it back out; all the remaining cycles can be used for your application. We don't want any big overhead just for receiving and sending packets without doing any useful work. These figures translate to around 300 to 600 cycles per packet on a three gigahertz CPU core.

Now, how long does it take to cross that user space boundary? Very, very long for an individual packet. In some performance measurements of single-core packet forwarding, with a raw socket you can maybe achieve 300,000 packets per second; with libpcap you can achieve a million packets per second. These figures can be tuned, you can maybe get a factor of two out of that, but there are more problems, like multicore scaling being unnecessarily hard, so this doesn't really seem to work.

So the boundary is the problem; let's get rid of the boundary by just moving the application into the kernel. We rewrite our application as a kernel module and handle the packets directly there. You might think: "What an incredibly stupid idea, to write kernel code for something that clearly should be user space."
Well, it's not that unreasonable; there are lots of examples of applications doing this. A certain web server by Microsoft runs as a kernel module, the latest Linux kernel has TLS offloading to speed that up, and another interesting case is Open vSwitch, which has a fast cache in the kernel that just caches flows and does the complex processing in a user space component. So it's not completely unreasonable, but it comes with a lot of drawbacks: it's very cumbersome to develop, most of your usual tools don't work or don't work as expected, you have to follow the usual kernel restrictions, like using C as the programming language, which maybe you don't want to, and your application can and will crash the kernel, which can be quite bad.

But let's not care about the restrictions; we wanted to fix performance. Same figures again: we have 300 to 600 cycles to receive and send a packet. I tested this: I profiled the Linux kernel to see how long it takes to receive a packet until I can do some useful work on it. As an average over a longer profiling run, it takes around 500 cycles just to receive the packet. That's bad. Sending it out is slightly faster, and again we are over our budget. Now you might think: "What else do I need to do besides receiving and sending the packet?" There is some more overhead: you need some time for the sk_buff, the data structure the kernel uses for all packet buffers. It's a bloated, old, big data structure that grows bigger with each release, and it costs another 400 cycles. If you measure a real-world application, single-core packet forwarding with Open vSwitch with the minimum processing possible, one OpenFlow rule that matches on physical ports, I profiled the processing itself at around 200 cycles per packet, and the overhead of the kernel is another thousand-something cycles. In the end you achieve two million packets per second. This is faster than our user space socket numbers but still kind of slow, and we want to be faster.

The currently hottest topic in the Linux kernel, which I'm not talking about, is XDP. It fixes some of these problems but comes with new restrictions; I cut it from this talk for time reasons, so let's talk about everything except XDP.

So the problem was that we wanted to move the application into kernel space and it didn't work. Can we instead move stuff from the kernel into user space? Yes, we can: there are libraries called user space packet processing frameworks.
They come in two parts: one is a library you link your program against in user space, and one is a kernel module. These two parts communicate and set up shared, mapped memory, and this shared memory is used to talk directly from your application to the driver: you directly fill the packet buffers that the driver then sends out, and this is way faster. You might have noticed that the operating system box here is not connected to anything; that means your operating system in most cases doesn't even know the network card is there, which can be quite annoying.

There are quite a few such frameworks; the biggest examples are netmap, PF_RING and pfq, and they come with restrictions. There is a non-standard API: you can't port an application from one framework to another, or back to the kernel or to sockets. A custom kernel module is required, most of these frameworks need small patches to the drivers, it's a mess to maintain, and of course they need exclusive access to the network card, because this one application is talking directly to it. You also lose access to the usual kernel features, which can be quite annoying, and there is often poor support for the hardware offloading features of the network cards, because those live in other parts of the kernel that we no longer have reasonable access to. And since these frameworks talk directly to the network card, they need explicit support for each network card; usually they support only one, two or maybe three NIC families, which can be quite restricting if you don't have that specific NIC.

Can we take an even more radical approach, given all these problems with kernel dependencies? It turns out we can get rid of the kernel entirely and move everything into one application. We take the driver, put it in the application, the driver directly accesses the network card and sets up DMA memory in user space, because the network card doesn't care where it copies the packets from; we just have to set up the pointers the right way, and we can build the framework so that everything runs in the application. We remove the driver from the kernel, so no kernel driver is running, and this is super fast. We can also use it to implement crazy and obscure hardware features of network cards that are not supported by the standard driver. Now, I'm not the first one to do this; there are two big frameworks that do that. One is DPDK, which is quite big:
it is a Linux Foundation project and it has support from basically all NIC vendors, meaning everyone who builds a high-speed NIC writes a driver that works with DPDK. The second such framework is Snabb, which I think is quite interesting, because its drivers are not written in C; it is written entirely in Lua, the scripting language, and it's kind of nice to see a driver written in a scripting language.

Okay, what problems did we solve and what problems did we gain? We still have a non-standard API, and we still need exclusive access to the network card from one application, because the driver runs inside it; there are some hardware tricks to work around that, but mainly it's one application that is running. The framework needs explicit support for all the NIC models out there; that's not a big problem with DPDK, because it's such a big project that virtually every NIC has a DPDK driver. And yes, there is limited support for interrupts, but it turns out interrupts are not that useful when you are building something that processes more than a few hundred thousand packets per second, because the overhead of an interrupt is just too large; it is mainly a power saving thing if you ever run into low load. I don't care about the low-load scenario and power saving, so for me it's polling all the way, using all the CPU. And of course you lose access to the usual kernel features.

Well, time to ask: what has the kernel ever done for us? Well, the kernel has lots of mature drivers. Okay, what has the kernel ever done for us, except for all these nice mature drivers? There are very nice protocol implementations that actually work; the kernel TCP stack is a work of art. It actually works in real-world scenarios, unlike all these other TCP stacks that fail in some cases or don't support the features we want. Okay, what has the kernel ever done for us, except for these mature drivers and these nice protocol stack implementations? Quite a few things, and we are throwing them all out. One thing to note: we mostly don't care about these features when building our packet-forward-modify-router-firewall thing, because they are mostly high-level features. But it is still a lot of features that we are losing; for example, building a TCP stack on top of these frameworks is kind of an unsolved problem. There are TCP stacks, but they all suck in different ways.
Ok, we lost features, but we didn't care about the features in the first place; we wanted performance. Back to our performance figure: we have 300 to 600 cycles per packet available, so how long does it take, for example in DPDK, to receive and send a packet? That is around a hundred cycles to get a packet through the whole stack: receiving it, handing it to the application, and giving it back to the driver to send it out. A hundred cycles, and the other frameworks typically play in the same league. DPDK is slightly faster than the others, because it's full of magic SSE and AVX intrinsics and the driver is kind of black magic, but it's super fast. In a more real-world scenario, Open vSwitch, which I mentioned earlier at 2 million packets per second for the kernel version, can be compiled with an optional DPDK backend: you set some magic flags when compiling, it links against DPDK, uses the network card directly and runs completely in user space, and now it's a factor of around 6 or 7 faster; we achieve 13 million packets per second with around the same processing on a single CPU core.

Great, so where do the performance gains come from? Keep in mind this is compared to the kernel, not compared to sockets. What people often say is that this is "zero copy", which is a stupid term, because the kernel doesn't copy packets either; it's not copying that was slow, it was other things. Mainly it's batching: it is very efficient to process a relatively large number of packets at once. And it's the reduced memory overhead: the sk_buff data structure is really big, and if you cut that down you save a lot of cycles. These DPDK figures already include memory management, which DPDK, unlike some other frameworks, provides.

Okay, now we know that these frameworks exist, and the next obvious question is: can we build our own driver? Well, but why? First, for fun, obviously, and then to understand how that stuff works: how these drivers work, how these packet processing frameworks work. In my work in academia I've seen a lot of people using these frameworks, which is nice, because they are fast and they enable things that just weren't possible before. But people often treat them as magic black boxes: you put in your packet and it magically gets faster. Sometimes I don't blame them; if you look at the DPDK source code, there are more than 20,000 lines of code for each driver.
Just as an example, the receive and transmit functions of the ixgbe driver in DPDK are one file with around 3,000 lines of code, and they do a lot of magic just to receive and send packets. No one wants to read through that, so the question is: how hard can it be to write your own driver? Turns out it's quite easy. This was like a weekend project: I have written this driver, called ixy. It's less than a thousand lines of C code, and that is the full driver for 10G network cards, the full framework to run applications, and two simple example applications. It took me less than two days to write it completely, then two more days to debug it and fix performance.

I built this driver for the Intel ixgbe family. This is a family of network cards you know if you have ever had a 10G server to test with, because almost all servers with 10G connections have these Intel cards; they are also embedded in some Xeon CPUs and they are onboard chips on many mainboards. The nice thing about them is that they have a publicly available datasheet, meaning Intel publishes a 1,000-page PDF that describes everything you ever wanted to know when writing a driver for them. The next nice thing is that there is almost no logic hidden behind black-box magic firmware. Many newer network cards, especially the newer Mellanox ones, hide a lot of functionality behind firmware, and the driver mostly just exchanges messages with the firmware, which is kind of boring. With this family that is not the case, which I think is very nice.

So how can we build a driver for this in four very simple steps? One: we remove the kernel driver that is currently loaded, because we don't want it to interfere with our stuff. Easy so far. Two: we memory-map the PCIe memory-mapped I/O address space; this lets us access the PCI Express device. Three: we figure out the physical addresses of the DMA memory regions in our process and use them for DMA. And step four is slightly more complicated than the first three: we write the driver.

First thing to do: we figure out where our network card is. Say we have a server and we plug in our network card; it gets assigned an address on the PCI bus. We can figure that out with lspci; that gives us the address, which we need in a slightly different form with the fully qualified ID, and then we can remove the kernel driver by telling the currently bound driver to unbind from that specific device.
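(A minimal sketch in C of this unbind step, not the exact ixy code: it writes the device's PCI address into the bound driver's sysfs "unbind" file. The address 0000:03:00.0 is only an example, yours will differ, and the write needs root privileges.)

    // Unbind whatever kernel driver currently owns the PCI device by writing
    // its fully qualified address to the driver's sysfs "unbind" file.
    #include <stdio.h>

    static int unbind_kernel_driver(const char *pci_addr) {
        char path[256];
        snprintf(path, sizeof(path),
                 "/sys/bus/pci/devices/%s/driver/unbind", pci_addr);
        FILE *f = fopen(path, "w");
        if (!f) {
            // no driver bound, or not running as root: nothing we can do here
            return -1;
        }
        fputs(pci_addr, f);   // the kernel detaches the driver from this device
        fclose(f);
        return 0;
    }

    int main(void) {
        return unbind_kernel_driver("0000:03:00.0") == 0 ? 0 : 1;
    }

Afterwards, lspci -k should list the device without a "Kernel driver in use" line.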
Now the operating system doesn't know that this is a network card; it doesn't know anything, it just notes that some PCI device has no driver. Then we write our application. It's written in C, and we just open this magic file in sysfs and mmap it. There is no magic there, just a normal mmap, but what we get back is a special memory region: the memory-mapped I/O region of the PCI device, the BAR, where all the registers of the card are available. I will show you what that means in a second. If we go through the datasheet, there are hundreds of pages of tables like this, and these tables tell us the registers that exist on the network card, the offsets they have, and a link to a more detailed description. In code that looks like this: for example, the LED control register is at this offset, and within that 32-bit register there are bits at certain offsets; bit 7 is called LED0_BLINK, and if we set that bit in that register, one of the LEDs on the card starts to blink. We can do that via our magic memory region, because all the reads and writes we do to it go directly over the PCI Express bus to the network card, and the network card does whatever it wants with them. It doesn't have to be backed by an actual register; basically it's just a command sent to the network card, and mapping it into memory is just a nice and convenient interface. This is a very common technique that you will also find when you do some microprocessor programming or similar.

One thing to note: since this is not real memory, it can't be cached. There is no cache in between; each of these accesses triggers a PCI Express transaction and takes quite some time, lots of cycles, where "lots" means a hundred or a few hundred cycles, which is a lot for me.

So how do we handle packets? We now have access to these registers, we can read the datasheet and write the driver, but we need some way to get packets through. It would be possible to build a network card that does that via this memory-mapped I/O region, but it would be kind of annoying. The second way a PCI Express device communicates with your server or MacBook is DMA, direct memory access. A DMA transfer, unlike the memory-mapped I/O accesses, is initiated by the network card, and that means the network card can just write to arbitrary addresses in main memory.
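(Before moving on to DMA: a minimal sketch in C of the register access just described. It maps BAR0 of the device via the sysfs resource0 file and sets the LED0_BLINK bit. The constants, LEDCTL at offset 0x00200 and blink bit 7, are what the 82599 datasheet documents, but verify them against your datasheet revision; the PCI address is again just an example, and memory space access to the device is assumed to be already enabled, which it usually is if a driver was bound before.)

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    #define LEDCTL      0x00200u      // LED control register offset (datasheet)
    #define LED0_BLINK  (1u << 7)     // "blink LED 0" bit in that register

    int main(void) {
        const char *bar0 = "/sys/bus/pci/devices/0000:03:00.0/resource0";
        int fd = open(bar0, O_RDWR);
        if (fd < 0) { perror("open"); return 1; }
        struct stat st;
        fstat(fd, &st);
        // Not ordinary memory: every access becomes a PCIe transaction.
        volatile uint8_t *regs = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                                      MAP_SHARED, fd, 0);
        if (regs == MAP_FAILED) { perror("mmap"); return 1; }
        volatile uint32_t *ledctl = (volatile uint32_t *)(regs + LEDCTL);
        *ledctl |= LED0_BLINK;        // one of the LEDs on the card starts blinking
        munmap((void *)regs, st.st_size);
        close(fd);
        return 0;
    }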
For this, the network card offers so-called rings, which are queue interfaces for receiving packets and for sending packets, and there are multiple of these queues, because that is how you do multi-core scaling: if you want to transmit from multiple cores, you allocate multiple queues, each core sends to one queue, and the network card merges these queues in hardware onto the link. On the receive side, the network card can either hash over the incoming packets' protocol headers or you can set explicit filters. This is not specific to network cards; most PCI Express devices work like this: GPUs have command queues, NVMe PCI Express disks have queues, and so on.

So let's look at queues using the ixgbe family as an example; you will find that most NICs work in a very similar way, with sometimes small differences. These rings are just circular buffers filled with so-called DMA descriptors. A DMA descriptor is a 16-byte struct: eight bytes are a physical pointer to the location where the actual packet data is, and eight bytes are metadata, things like "I fetched this packet", or "this packet needs VLAN tag offloading", or "this packet had a VLAN tag that I removed". What we then need to do is translate virtual addresses from our address space into physical addresses, because the PCI Express device of course works with physical addresses. We can do that using procfs, via /proc/self/pagemap.

Next, we have this queue of DMA descriptors in memory, and the queue itself is also accessed via DMA. It is controlled the way you would expect a circular ring to work: it has a head and a tail, and the head and tail pointers are available as registers in the memory-mapped I/O address space. As a picture it looks like this: we have the descriptor ring in our physical memory on the left, full of pointers, and somewhere else we have the packets in a memory pool. One thing to note when allocating this kind of memory: there is a small trick you have to use, because the descriptor ring needs to be in contiguous physical memory. If you just assume that everything that is contiguous in your process is also physically contiguous: no, it isn't, and if you have a bug there and the card writes somewhere else, then your filesystem dies, as I found out, which was not a good thing.
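(Back to the virtual-to-physical translation just mentioned: a minimal sketch in C that reads /proc/self/pagemap. It assumes root privileges, since the kernel hides page frame numbers from unprivileged processes, and assumes the page is resident, for example because it was touched or mlock'ed beforehand.)

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    static uintptr_t virt_to_phys(const void *virt) {
        long page_size = sysconf(_SC_PAGESIZE);
        int fd = open("/proc/self/pagemap", O_RDONLY);
        if (fd < 0) return 0;
        // one 64-bit entry per virtual page; bits 0-54 hold the page frame number
        uint64_t entry = 0;
        off_t offset = (uintptr_t)virt / page_size * sizeof(entry);
        if (pread(fd, &entry, sizeof(entry), offset) != (ssize_t)sizeof(entry)) {
            close(fd);
            return 0;
        }
        close(fd);
        uint64_t pfn = entry & ((1ULL << 55) - 1);
        return pfn * page_size + (uintptr_t)virt % page_size;
    }

    int main(void) {
        static int x = 42;   // touch something so the page is resident
        printf("virt %p -> phys 0x%lx\n", (void *)&x,
               (unsigned long)virt_to_phys(&x));
        return 0;
    }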
So what I'm doing is using huge pages, two-megabyte pages: that is enough contiguous memory and it is guaranteed not to have weird gaps.

Now, to receive packets, we need to set up the ring. We tell the network card, via memory-mapped I/O, the location and the size of the ring; then we fill the ring with pointers to freshly allocated packet buffers that are just empty; and then we set the head and tail pointers to tell the card that the queue is full, because at the moment the queue is full of packet buffers, they just aren't filled with anything yet. Now, what the NIC does: it fetches one of the DMA descriptors, and as soon as it receives a packet it writes the packet via DMA to the location specified in the descriptor and increments the head pointer of the queue. It also sets a status flag in the DMA descriptor once it has written the packet to memory, and this step is important, because reading back the head pointer via memory-mapped I/O would be way too slow. Instead we check the status flag, which lives in normal memory and ends up in the CPU cache, so we can check it really fast.

Next, we periodically poll that status flag. This is the point where interrupts might come in useful. There is a misconception: people sometimes believe that when you receive a packet you get an interrupt and the interrupt somehow magically contains the packet. No, it doesn't; the interrupt only tells you that there is a new packet. After the interrupt you would have to poll the status flag anyway. So now we have the packet, we process it or do whatever, then we reset the DMA descriptor; we can either recycle the old packet buffer or allocate a new one. We set the ready flag in the descriptor and we adjust the tail pointer register to tell the network card that we are done with it. We don't have to do that every time, because we don't have to keep the queue 100% utilized; we can update the tail pointer only every hundred packets or so, and then it's not a performance problem.

Now we have a driver that can receive packets. Next step: transmitting packets, which basically works the same way, so I won't bore you with the details. Then there is, of course, a lot of boring initialization code, which is just following the datasheet: set this register, set that register, do this. I just coded it down from the datasheet and it works, big surprise. So now you know how to write a driver like this.
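(To make the receive path concrete, here is a heavily simplified sketch of that polling loop. The descriptor layout and the receive-descriptor-tail register offset are modeled loosely on the ixgbe advanced receive descriptor and RDT register from the datasheet; treat all constants and field names as illustrative placeholders rather than the exact ixy code.)

    #include <stdint.h>

    #define RDT(queue)  (0x01018u + 0x40u * (queue))  // RX descriptor tail register
    #define STATUS_DD   (1u << 0)                     // descriptor done
    #define RING_SIZE   512u

    struct rx_desc {              // simplified 16-byte descriptor
        uint64_t addr;            // read format: physical buffer address
        uint32_t status;          // write-back: status flags (DD, ...)
        uint16_t length;          // write-back: packet length in bytes
        uint16_t vlan;            // write-back: stripped VLAN tag, if any
    };

    void rx_poll(volatile uint8_t *regs, volatile struct rx_desc *ring,
                 uint8_t *bufs[], const uint64_t buf_phys[], uint16_t *rx_index,
                 void (*handle)(uint8_t *pkt, uint16_t len)) {
        uint16_t i = *rx_index;
        for (;;) {
            volatile struct rx_desc *desc = &ring[i];
            if (!(desc->status & STATUS_DD)) {
                break;                         // NIC hasn't written this slot yet
            }
            handle(bufs[i], desc->length);     // process the packet

            // hand the slot back to the NIC: reset to "read" format
            desc->addr = buf_phys[i];
            desc->status = 0;

            i = (uint16_t)((i + 1) % RING_SIZE);
        }
        // tell the NIC how far we got; done once per batch, not per packet
        *(volatile uint32_t *)(regs + RDT(0)) =
            (uint32_t)((i + RING_SIZE - 1) % RING_SIZE);
        *rx_index = i;
    }

The point to notice is that the hot path only touches ordinary cached memory; the single uncached MMIO write to the tail register happens once per batch, which is exactly the batching effect mentioned earlier.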
Here are a few ideas of what I want to do, and what maybe you want to do, with a driver like this. One, of course, is performance: looking at what makes this faster than the kernel. Then there are obscure hardware offloading features; in the past I've looked at IPsec offloading, which is quite interesting, because the Intel network cards have hardware support for IPsec offloading, but none of the Intel drivers use it, and it seems to work just fine, so I'm not sure what's going on there. Then security is interesting: there are obvious security implications of having the whole driver in a user space process, and I'm wondering how we can use the IOMMU, because it turns out that once we have set up the memory mapping we can drop all privileges; we don't need them anymore. And if we set up the IOMMU beforehand to restrict what the network card can access, then we could have a safe driver in user space that can't do anything wrong, because the process has no privileges and the network card's accesses go through the IOMMU; there are performance implications of the IOMMU, and so on. Of course there is support for other NICs: I want to support virtio virtual NICs, and other programming languages for the driver would also be interesting. It's written in C only because C is the lowest common denominator of programming languages.

To conclude: check out ixy, it's BSD licensed and on GitHub, and the main thing to take with you is that drivers are really simple. Don't be afraid of drivers, don't be afraid of writing your own drivers. You can do it in any language and you don't even need to add kernel code: just map the stuff into your process, write the driver and do whatever you want. Okay, thanks for your attention.

[Applause]

Herald: You have very few minutes left for questions, so if you have a question in the room, please go quickly to one of the 8 microphones. Does the signal angel already have a question ready? I don't see anything. Anybody lining up at any microphones? All right, number 6, please.

Mic 6: As you're not actually using any of the Linux drivers, is there an advantage to using Linux here, or could you use any open source operating system?

Paul: I don't know about other operating systems, but the only thing I'm using from Linux here is the ability to easily map that memory. For some other operating systems we might need a small stub driver that maps the stuff in there; you can check out the DPDK FreeBSD port, which has a small stub driver that just handles the memory mapping.

Herald: Here, at number 2.
Mic 2: Hi. Slightly disconnected from the talk, but I'd just like to hear your opinion on smart NICs, where they're considering putting CPUs on the NIC itself, so you could imagine running Open vSwitch on the CPU on the NIC.

Paul: Yeah, I have some smart NIC somewhere in some lab and I have also done work with the NetFPGA. I think it's very interesting, but it's a complicated trade-off, because these smart NICs come with new restrictions and they are not dramatically faster. So it's interesting from a performance perspective to see when it's worth it and when it's not worth it, and personally I think it's probably better to do everything with raw CPU power.

Mic 2: Thanks.

Herald: All right, before we take the next question, just for the people who don't want to stick around for the Q&A: if you really do have to leave the room early, please do so quietly, so we can continue the Q&A. Number 6, please.

Mic 6: So how does the performance of the user space driver compare to the XDP solution?

Paul: It's slightly faster. But one important thing about XDP is that it is still new work and there are a few important restrictions. With a user space driver you can write your application in whatever programming language you want; like I mentioned, Snabb has a driver entirely written in Lua. With XDP you are restricted to eBPF, meaning usually a restricted subset of C, and there's a bytecode verifier. You can disable the verifier if you want, but then you again have weird restrictions that you maybe don't want. Also, XDP requires, not patched drivers, but a new memory model for the drivers, so at the moment DPDK supports more drivers than XDP in the kernel, which is kind of weird, and XDP is still lacking some features, like sending out via a different NIC. One very good use case for XDP is firewalling for applications on the same host, because you can pass the packet on to the TCP stack; that is a very good use case for XDP. But overall I think the two things are very different, and XDP is slightly slower, but not slower in a way that would be relevant. So: it's fast, to answer the question.

Herald: All right, unfortunately we are out of time, so that was the last question. Thanks again, Paul.

[Applause]