[Script Info]
Title:

[Events]
Format: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text
Dialogue: 0,0:00:00.10,0:00:15.69,Default,,0000,0000,0000,,{\i1}34c3 intro{\i0}
Dialogue: 0,0:00:15.69,0:00:20.27,Default,,0000,0000,0000,,Herald: All right, now it's my great\Npleasure to introduce Paul Emmerich who is
Dialogue: 0,0:00:20.27,0:00:26.52,Default,,0000,0000,0000,,going to talk about "Demystifying Network\NCards". Paul is a PhD student at the
Dialogue: 0,0:00:26.52,0:00:33.66,Default,,0000,0000,0000,,Technical University of Munich. He's doing\Nall kinds of network-related stuff and
Dialogue: 0,0:00:33.66,0:00:37.95,Default,,0000,0000,0000,,hopefully today he's gonna help us make\Nnetwork cards a bit less of a black box.
Dialogue: 0,0:00:37.95,0:00:48.53,Default,,0000,0000,0000,,So, please give a warm welcome to Paul!\N{\i1}applause{\i0}
Dialogue: 0,0:00:48.53,0:00:50.56,Default,,0000,0000,0000,,Paul: Thank you, and as the introduction
Dialogue: 0,0:00:50.56,0:00:54.65,Default,,0000,0000,0000,,already said, I'm a PhD student and I'm\Nresearching performance of software packet
Dialogue: 0,0:00:54.65,0:00:58.32,Default,,0000,0000,0000,,processing and forwarding systems.\NThat means I spend a lot of time doing
Dialogue: 0,0:00:58.32,0:01:02.56,Default,,0000,0000,0000,,low-level optimizations and looking into\Nwhat makes a system fast, what makes it
Dialogue: 0,0:01:02.56,0:01:05.98,Default,,0000,0000,0000,,slow, what can be done to improve it,\Nand I'm mostly working on my packet
Dialogue: 0,0:01:05.98,0:01:09.77,Default,,0000,0000,0000,,generator MoonGen.\NI have some cross-promotion: a lightning
Dialogue: 0,0:01:09.77,0:01:13.49,Default,,0000,0000,0000,,talk about this on Saturday, but here I\Nhave this long slot
Dialogue: 0,0:01:13.49,0:01:17.55,Default,,0000,0000,0000,,and I brought a lot of content, so I\Nhave to talk really fast, so sorry for the
Dialogue: 0,0:01:17.55,0:01:20.56,Default,,0000,0000,0000,,translators, and I hope you can
mainly\Nfollow along.
Dialogue: 0,0:01:20.56,0:01:24.92,Default,,0000,0000,0000,,So: this is about network cards, meaning\Nnetwork cards you all have seen. This is a
Dialogue: 0,0:01:24.92,0:01:30.37,Default,,0000,0000,0000,,usual 10G network card with an SFP+ port,\Nand this is a faster network card with a
Dialogue: 0,0:01:30.37,0:01:35.36,Default,,0000,0000,0000,,QSFP+ port. This is 20, 40, or 100G,\Nand now you bought this fancy network
Dialogue: 0,0:01:35.36,0:01:38.23,Default,,0000,0000,0000,,card, you plug it into your server or your\NMacBook or whatever,
Dialogue: 0,0:01:38.23,0:01:41.52,Default,,0000,0000,0000,,and you start your web server that serves\Ncat pictures and cat videos.
Dialogue: 0,0:01:41.52,0:01:45.74,Default,,0000,0000,0000,,You all know that there's a whole stack of\Nprotocols that your cat picture has to go
Dialogue: 0,0:01:45.74,0:01:48.09,Default,,0000,0000,0000,,through until it arrives at the network card\Nat the bottom,
Dialogue: 0,0:01:48.09,0:01:52.12,Default,,0000,0000,0000,,and the only thing that I care about is\Nthe lower layers. I don't care about TCP,
Dialogue: 0,0:01:52.12,0:01:55.52,Default,,0000,0000,0000,,I have no idea how TCP works.\NWell, I have some idea how it works, but
Dialogue: 0,0:01:55.52,0:01:57.70,Default,,0000,0000,0000,,this is not my research, I don't care\Nabout it.
Dialogue: 0,0:01:57.70,0:02:01.28,Default,,0000,0000,0000,,I just want to look at individual packets,\Nand the highest thing I look at is maybe
Dialogue: 0,0:02:01.28,0:02:07.73,Default,,0000,0000,0000,,an IP address or maybe a part of the\Nprotocol to identify flows or anything.
Dialogue: 0,0:02:07.73,0:02:11.05,Default,,0000,0000,0000,,Now you might wonder: Is there anything\Neven interesting in these lower layers?
Dialogue: 0,0:02:11.05,0:02:15.08,Default,,0000,0000,0000,,Because people nowadays think that\Neverything runs on top of HTTP,
Dialogue: 0,0:02:15.08,0:02:19.16,Default,,0000,0000,0000,,but you might be surprised that not all\Napplications run on top of HTTP.
Dialogue: 0,0:02:19.16,0:02:23.38,Default,,0000,0000,0000,,There is a lot of software that needs to\Nrun at these lower levels, and in
Dialogue: 0,0:02:23.38,0:02:26.15,Default,,0000,0000,0000,,recent years\Nthere is a trend of moving network
Dialogue: 0,0:02:26.15,0:02:30.81,Default,,0000,0000,0000,,infrastructure stuff from specialized\Nhardware black boxes to open software
Dialogue: 0,0:02:30.81,0:02:33.22,Default,,0000,0000,0000,,boxes,\Nand examples of such software that was
Dialogue: 0,0:02:33.22,0:02:37.78,Default,,0000,0000,0000,,hardware in the past are: routers, switches,\Nfirewalls, middleboxes and so on.
Dialogue: 0,0:02:37.78,0:02:40.42,Default,,0000,0000,0000,,If you want to look up the relevant\Nbuzzword: it's Network Function
Dialogue: 0,0:02:40.42,0:02:45.85,Default,,0000,0000,0000,,Virtualization, as it's called, and this\Nis a trend of the recent years.
Dialogue: 0,0:02:45.85,0:02:50.61,Default,,0000,0000,0000,,Now let's say we want to build our own\Nfancy application on that low-level thing.
Dialogue: 0,0:02:50.61,0:02:55.12,Default,,0000,0000,0000,,We want to build our firewall router\Npacket forward modifier thing that does
Dialogue: 0,0:02:55.12,0:02:59.41,Default,,0000,0000,0000,,whatever is useful on that lower layer for\Nnetwork infrastructure,
Dialogue: 0,0:02:59.41,0:03:03.76,Default,,0000,0000,0000,,and I will use this application as a demo\Napplication for this talk, as everything
Dialogue: 0,0:03:03.76,0:03:08.31,Default,,0000,0000,0000,,will be about this hypothetical router\Nfirewall packet forward modifier thing.
Dialogue: 0,0:03:08.31,0:03:11.80,Default,,0000,0000,0000,,What it does: It receives packets on one\Nor multiple network interfaces, it does
Dialogue: 0,0:03:11.80,0:03:16.27,Default,,0000,0000,0000,,stuff with the packets - filter them,\Nmodify them, route them -
Dialogue: 0,0:03:16.27,0:03:19.98,Default,,0000,0000,0000,,and sends them out to some other port, or\Nmaybe the same port, or maybe multiple
Dialogue: 0,0:03:19.98,0:03:23.14,Default,,0000,0000,0000,,ports - whatever these low-level\Napplications do.
Dialogue: 0,0:03:23.14,0:03:27.54,Default,,0000,0000,0000,,And this means the application operates on\Nindividual packets, not a stream of TCP
Dialogue: 0,0:03:27.54,0:03:31.30,Default,,0000,0000,0000,,packets, not a stream of UDP packets; it\Nhas to cope with small packets.
Dialogue: 0,0:03:31.30,0:03:34.20,Default,,0000,0000,0000,,Because that's just the worst case: You\Nget a lot of small packets.
Dialogue: 0,0:03:34.20,0:03:37.76,Default,,0000,0000,0000,,Now you want to build the application. You\Ngo to the Internet and you look up: How to
Dialogue: 0,0:03:37.76,0:03:41.29,Default,,0000,0000,0000,,build a packet forwarding application?\NThe internet tells you: There is the
Dialogue: 0,0:03:41.29,0:03:46.04,Default,,0000,0000,0000,,socket API, the socket API is great, and it\Nallows you to get packets to your program.
Dialogue: 0,0:03:46.04,0:03:50.08,Default,,0000,0000,0000,,So you build your application on top of\Nthe socket API.
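The socket-based design described here boils down to one receive and one send system call per packet, which is exactly where the per-packet overhead discussed next comes from. A minimal sketch (not code from the talk), written against plain file descriptors; with a real NIC the descriptor would come from a raw socket such as socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL)), which requires root:

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/* One recv() and one send() syscall per packet: this per-packet
 * kernel crossing is the overhead the talk measures. The loop is
 * written against plain fds so it works with any socket; a raw
 * AF_PACKET socket would plug in here unchanged. */
static long forward_packets(int in_fd, int out_fd, long npackets)
{
    char buf[2048]; /* enough for a standard 1518-byte Ethernet frame */
    long forwarded = 0;
    while (forwarded < npackets) {
        ssize_t n = recv(in_fd, buf, sizeof(buf), 0);
        if (n <= 0)
            break;
        /* filtering/modifying/routing would happen here */
        if (send(out_fd, buf, (size_t)n, 0) != n)
            break;
        forwarded++;
    }
    return forwarded;
}
```

Each iteration pays two user/kernel transitions regardless of how small the packet is, which is why the cost is per-packet rather than per-byte.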
Once in userspace, you use
Dialogue: 0,0:03:50.08,0:03:52.93,Default,,0000,0000,0000,,your socket, the socket talks to the\Noperating system,
Dialogue: 0,0:03:52.93,0:03:56.03,Default,,0000,0000,0000,,the operating system talks to the driver,\Nand the driver talks to the network card,
Dialogue: 0,0:03:56.03,0:03:59.34,Default,,0000,0000,0000,,and everything is fine - except that it\Nisn't,
Dialogue: 0,0:03:59.34,0:04:02.08,Default,,0000,0000,0000,,because what it really looks like if you\Nbuild this application:
Dialogue: 0,0:04:02.08,0:04:07.46,Default,,0000,0000,0000,,There is this huge scary big gap between\Nuser space and kernel space, and you
Dialogue: 0,0:04:07.46,0:04:13.17,Default,,0000,0000,0000,,somehow need your packets to go across\Nthat without being eaten.
Dialogue: 0,0:04:13.17,0:04:16.36,Default,,0000,0000,0000,,You might wonder why I said this is a big\Ndeal and a huge deal that you have this
Dialogue: 0,0:04:16.36,0:04:19.40,Default,,0000,0000,0000,,gap in there,\Nbecause you think: "Well, my web server
Dialogue: 0,0:04:19.40,0:04:23.12,Default,,0000,0000,0000,,serving cat pictures is doing just fine on\Na fast connection."
Dialogue: 0,0:04:23.12,0:04:28.89,Default,,0000,0000,0000,,Well, it is, because it is serving large\Npackets or even large chunks of files that
Dialogue: 0,0:04:28.89,0:04:33.93,Default,,0000,0000,0000,,it sends at once to the kernel,\Nlike you can take your whole
Dialogue: 0,0:04:33.93,0:04:36.51,Default,,0000,0000,0000,,cat video, give it to the kernel, and the\Nkernel will handle everything
Dialogue: 0,0:04:36.51,0:04:42.80,Default,,0000,0000,0000,,from doing...
from packetizing it to TCP.\NBut what we want to build is an application
Dialogue: 0,0:04:42.80,0:04:47.64,Default,,0000,0000,0000,,that needs to cope with the worst case of\Nlots of small packets coming in,
Dialogue: 0,0:04:47.64,0:04:53.60,Default,,0000,0000,0000,,and then the overhead that you get here\Nfrom this gap is mostly on a per-packet basis,
Dialogue: 0,0:04:53.60,0:04:57.42,Default,,0000,0000,0000,,not on a per-byte basis.\NSo, lots of small packets are a problem
Dialogue: 0,0:04:57.42,0:05:00.69,Default,,0000,0000,0000,,for this interface.\NWhen I say "problem" I'm always talking
Dialogue: 0,0:05:00.69,0:05:03.24,Default,,0000,0000,0000,,about performance, because I mostly care about\Nperformance.
Dialogue: 0,0:05:03.24,0:05:09.39,Default,,0000,0000,0000,,So if you look at performance... a few\Nfigures to get started:
Dialogue: 0,0:05:09.39,0:05:13.25,Default,,0000,0000,0000,,well, how many packets can you fit over\Nyour usual 10G link? That's around fifteen
Dialogue: 0,0:05:13.25,0:05:17.81,Default,,0000,0000,0000,,million.\NBut 10G, that's last year's news; this year
Dialogue: 0,0:05:17.81,0:05:21.37,Default,,0000,0000,0000,,you have multiple hundred-G connections,\Neven to this location here.
Dialogue: 0,0:05:21.37,0:05:28.28,Default,,0000,0000,0000,,So a 100G link can handle up to 150 million\Npackets per second, and, well, how much
Dialogue: 0,0:05:28.28,0:05:32.82,Default,,0000,0000,0000,,time does that give us on a CPU?\NSay we have a three gigahertz CPU in
Dialogue: 0,0:05:32.82,0:05:37.26,Default,,0000,0000,0000,,our MacBook running the router, and that\Nmeans we have around 200 cycles per packet
Dialogue: 0,0:05:37.26,0:05:40.40,Default,,0000,0000,0000,,if we want to handle one 10G link with one\NCPU core.
Dialogue: 0,0:05:40.40,0:05:46.00,Default,,0000,0000,0000,,Okay, we don't want to handle... we have of\Ncourse multiple cores.
But you also have
Dialogue: 0,0:05:46.00,0:05:50.43,Default,,0000,0000,0000,,multiple links, and faster links than 10G.\NSo the typical performance target that you
Dialogue: 0,0:05:50.43,0:05:54.51,Default,,0000,0000,0000,,would aim for when building such an\Napplication is five to ten million packets
Dialogue: 0,0:05:54.51,0:05:56.88,Default,,0000,0000,0000,,per second per CPU core, per thread that\Nyou start.
Dialogue: 0,0:05:56.88,0:06:00.55,Default,,0000,0000,0000,,That's like a usual target. And that is\Njust for forwarding, just to receive the
Dialogue: 0,0:06:00.55,0:06:05.63,Default,,0000,0000,0000,,packet and to send it back out. All the\Nrest, that is: all the remaining cycles,
Dialogue: 0,0:06:05.63,0:06:09.11,Default,,0000,0000,0000,,can be used for your application.\NSo we don't want any big overhead just for
Dialogue: 0,0:06:09.11,0:06:11.70,Default,,0000,0000,0000,,receiving and sending packets without doing\Nany useful work.
Dialogue: 0,0:06:11.70,0:06:20.37,Default,,0000,0000,0000,,So these figures translate to\Naround 300 to 600 cycles per packet on a
Dialogue: 0,0:06:20.37,0:06:24.38,Default,,0000,0000,0000,,three gigahertz CPU core. Now, how long\Ndoes it take to cross that userspace
Dialogue: 0,0:06:24.38,0:06:30.86,Default,,0000,0000,0000,,boundary? Well, very, very, very long for an\Nindividual packet. So in some performance
Dialogue: 0,0:06:30.86,0:06:34.62,Default,,0000,0000,0000,,measurements: if you do single-core packet\Nforwarding with a raw socket, you
Dialogue: 0,0:06:34.62,0:06:38.92,Default,,0000,0000,0000,,can maybe achieve 300,000 packets per\Nsecond; if you use libpcap, you can
Dialogue: 0,0:06:38.92,0:06:42.74,Default,,0000,0000,0000,,achieve a million packets per second.\NThese figures can be tuned.
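The figures above follow from simple arithmetic: a minimum-sized Ethernet frame occupies 84 bytes on the wire (64-byte frame plus 8 bytes of preamble/start-of-frame delimiter and a 12-byte inter-frame gap), which yields roughly 14.88 million packets per second on 10G and about 200 cycles per packet on a 3 GHz core. A small sketch of that calculation:

```c
#include <stdint.h>

/* Back-of-the-envelope math behind the talk's figures. A minimum-sized
 * Ethernet frame takes 64 bytes plus 20 bytes of per-frame overhead on
 * the wire (preamble + SFD + inter-frame gap) = 84 bytes = 672 bits. */
static uint64_t max_pps(uint64_t link_bits_per_sec)
{
    const uint64_t bits_per_min_frame = (64 + 20) * 8; /* 672 bits */
    return link_bits_per_sec / bits_per_min_frame;
}

/* Per-packet cycle budget for one core at the given clock rate. */
static uint64_t cycles_per_packet(uint64_t cpu_hz, uint64_t pps)
{
    return cpu_hz / pps;
}
```

With multiple cores (or a more modest target of 5-10 Mpps per core) the budget relaxes to the 300-600 cycles per packet quoted in the talk.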
You can maybe
Dialogue: 0,0:06:42.74,0:06:46.08,Default,,0000,0000,0000,,get a factor of two out of that by some tuning,\Nbut there are more problems, like
Dialogue: 0,0:06:46.08,0:06:50.34,Default,,0000,0000,0000,,multicore scaling is unnecessarily hard\Nand so on, so this doesn't really seem to
Dialogue: 0,0:06:50.34,0:06:54.80,Default,,0000,0000,0000,,work. So the boundary is the problem, so\Nlet's get rid of the boundary by just
Dialogue: 0,0:06:54.80,0:06:59.31,Default,,0000,0000,0000,,moving the application into the kernel. We\Nrewrite our application as a kernel module
Dialogue: 0,0:06:59.31,0:07:04.33,Default,,0000,0000,0000,,and use it directly. You might think: "What\Nan incredibly stupid idea, to write kernel
Dialogue: 0,0:07:04.33,0:07:08.58,Default,,0000,0000,0000,,code for something that clearly should be\Nuser space." Well, it's not that
Dialogue: 0,0:07:08.58,0:07:11.95,Default,,0000,0000,0000,,unreasonable, there are lots of examples\Nof applications doing this, like a certain
Dialogue: 0,0:07:11.95,0:07:16.85,Default,,0000,0000,0000,,web server by Microsoft runs as a kernel\Nmodule, the latest Linux kernel has TLS
Dialogue: 0,0:07:16.85,0:07:20.85,Default,,0000,0000,0000,,offloading to speed that up. Another\Ninteresting use case is Open vSwitch, which
Dialogue: 0,0:07:20.85,0:07:24.17,Default,,0000,0000,0000,,has a fast internal cache that just\Ncaches stuff and does complex processing
Dialogue: 0,0:07:24.17,0:07:27.42,Default,,0000,0000,0000,,in a userspace thing, so it's not\Ncompletely unreasonable.
Dialogue: 0,0:07:27.42,0:07:30.89,Default,,0000,0000,0000,,But it comes with a lot of drawbacks, like\Nit's very cumbersome to develop, most of your
Dialogue: 0,0:07:30.89,0:07:34.93,Default,,0000,0000,0000,,usual tools don't work or don't work as\Nexpected, you have to follow the usual
Dialogue: 0,0:07:34.93,0:07:38.00,Default,,0000,0000,0000,,kernel restrictions, like you have to use\NC as a programming language, which you
Dialogue: 0,0:07:38.00,0:07:42.26,Default,,0000,0000,0000,,maybe don't want to, and your application\Ncan and will crash the kernel, which can
Dialogue: 0,0:07:42.26,0:07:46.75,Default,,0000,0000,0000,,be quite bad. But let's not care about the\Nrestrictions, we wanted to fix
Dialogue: 0,0:07:46.75,0:07:50.53,Default,,0000,0000,0000,,performance, so same figures again: We\Nhave 300 to 600 cycles to receive and send
Dialogue: 0,0:07:50.53,0:07:54.66,Default,,0000,0000,0000,,a packet. What I did: I tested this, I\Nprofiled the Linux kernel to see how long
Dialogue: 0,0:07:54.66,0:07:58.84,Default,,0000,0000,0000,,it takes to receive a packet until I\Ncan do some useful work on it. This is the
Dialogue: 0,0:07:58.84,0:08:03.55,Default,,0000,0000,0000,,average cost over a longer profiling run. So\Non average it takes 500 cycles just to
Dialogue: 0,0:08:03.55,0:08:08.01,Default,,0000,0000,0000,,receive the packet. Well, that's bad, but\Nsending it out is slightly faster, and
Dialogue: 0,0:08:08.01,0:08:11.49,Default,,0000,0000,0000,,again, we are now over our budget.
Now you\Nmight think: "What else do I need to do
Dialogue: 0,0:08:11.49,0:08:15.64,Default,,0000,0000,0000,,besides receiving and sending the packet?"\NThere is some more overhead: you
Dialogue: 0,0:08:15.64,0:08:20.71,Default,,0000,0000,0000,,need some time for the sk_buff, the data\Nstructure used in the kernel for all
Dialogue: 0,0:08:20.71,0:08:24.91,Default,,0000,0000,0000,,packet buffers, and this is a quite bloated,\Nold, big data structure that is growing
Dialogue: 0,0:08:24.91,0:08:29.76,Default,,0000,0000,0000,,bigger and bigger with each release, and\Nthis takes another 400 cycles. So if you
Dialogue: 0,0:08:29.76,0:08:33.00,Default,,0000,0000,0000,,measure a real-world application - single\Ncore packet forwarding with Open vSwitch
Dialogue: 0,0:08:33.00,0:08:36.43,Default,,0000,0000,0000,,with the minimum processing possible: one\NOpenFlow rule that matches on physical
Dialogue: 0,0:08:36.43,0:08:40.53,Default,,0000,0000,0000,,ports - the processing I profiled\Nat around 200 cycles per packet.
Dialogue: 0,0:08:40.53,0:08:44.79,Default,,0000,0000,0000,,And the overhead of the kernel is\Nanother thousand-something cycles, so in
Dialogue: 0,0:08:44.79,0:08:49.36,Default,,0000,0000,0000,,the end you achieve two million packets\Nper second - and this is faster than our
Dialogue: 0,0:08:49.36,0:08:55.32,Default,,0000,0000,0000,,user space stuff but still kind of slow;\Nwell, we want to be faster, because yeah.
Dialogue: 0,0:08:55.32,0:08:59.22,Default,,0000,0000,0000,,And the currently hottest topic, which I'm\Nnot talking about, in the Linux kernel is
Dialogue: 0,0:08:59.22,0:09:03.04,Default,,0000,0000,0000,,XDP. This fixes some of these problems but\Ncomes with new restrictions. I cut that
Dialogue: 0,0:09:03.04,0:09:10.08,Default,,0000,0000,0000,,from my talk for time reasons, so let's\Njust talk about not-XDP.
So the problem
Dialogue: 0,0:09:10.08,0:09:14.44,Default,,0000,0000,0000,,was that our application - we wanted\Nto move the application to the kernel
Dialogue: 0,0:09:14.44,0:09:17.68,Default,,0000,0000,0000,,space - and it didn't work, so can we\Ninstead move stuff from the kernel to the
Dialogue: 0,0:09:17.68,0:09:22.16,Default,,0000,0000,0000,,user space? Well, yes we can. There are\Nlibraries called "user space packet
Dialogue: 0,0:09:22.16,0:09:25.66,Default,,0000,0000,0000,,processing frameworks". They come in two\Nparts: One is a library you link your
Dialogue: 0,0:09:25.66,0:09:29.21,Default,,0000,0000,0000,,program against in the user space, and one\Nis a kernel module. These two parts
Dialogue: 0,0:09:29.21,0:09:34.20,Default,,0000,0000,0000,,communicate and they set up shared, mapped\Nmemory, and this shared mapped memory is
Dialogue: 0,0:09:34.20,0:09:37.77,Default,,0000,0000,0000,,used to communicate directly from your\Napplication to the driver. You directly
Dialogue: 0,0:09:37.77,0:09:41.21,Default,,0000,0000,0000,,fill the packet buffers that the driver\Nthen sends out, and this is way faster.
Dialogue: 0,0:09:41.21,0:09:44.38,Default,,0000,0000,0000,,And you might have noticed that the\Noperating system box here is not connected
Dialogue: 0,0:09:44.38,0:09:47.35,Default,,0000,0000,0000,,to anything. That means your operating\Nsystem doesn't even know that the network
Dialogue: 0,0:09:47.35,0:09:51.59,Default,,0000,0000,0000,,card is there in most cases; this can be\Nquite annoying.
But there are quite a few
Dialogue: 0,0:09:51.59,0:09:58.00,Default,,0000,0000,0000,,such frameworks; the biggest examples are\Nnetmap, PF_RING, and PFQ, and they come with
Dialogue: 0,0:09:58.00,0:10:02.17,Default,,0000,0000,0000,,restrictions, like there is a non-standard\NAPI, you can't port between one framework
Dialogue: 0,0:10:02.17,0:10:06.18,Default,,0000,0000,0000,,and the other, or between a framework and the\Nkernel or sockets; there's a custom kernel
Dialogue: 0,0:10:06.18,0:10:10.65,Default,,0000,0000,0000,,module required, most of these frameworks\Nrequire some small patches to the drivers,
Dialogue: 0,0:10:10.65,0:10:15.70,Default,,0000,0000,0000,,it's just a mess to maintain, and of course\Nthey need exclusive access to the network
Dialogue: 0,0:10:15.70,0:10:18.97,Default,,0000,0000,0000,,card, because this one application is talking
Dialogue: 0,0:10:18.97,0:10:23.54,Default,,0000,0000,0000,,directly to the network card.\NOkay, and the next thing is you lose
Dialogue: 0,0:10:23.54,0:10:27.76,Default,,0000,0000,0000,,access to the usual kernel features, which\Ncan be quite annoying, and then there's
Dialogue: 0,0:10:27.76,0:10:30.97,Default,,0000,0000,0000,,often poor support for hardware offloading\Nfeatures of the network cards, because
Dialogue: 0,0:10:30.97,0:10:33.97,Default,,0000,0000,0000,,they're often found in different parts of the\Nkernel that we no longer have reasonable
Dialogue: 0,0:10:33.97,0:10:37.68,Default,,0000,0000,0000,,access to. And of course with these frameworks\Nwe talk directly to a network card,
Dialogue: 0,0:10:37.68,0:10:41.53,Default,,0000,0000,0000,,meaning we need support for each network\Ncard individually. Usually they just
Dialogue: 0,0:10:41.53,0:10:46.00,Default,,0000,0000,0000,,support one, two, or maybe three NIC\Nfamilies, which can be quite restricting
Dialogue: 0,0:10:46.00,0:10:50.58,Default,,0000,0000,0000,,if you don't have that specific NIC.
But can we take an even more
Dialogue: 0,0:10:50.58,0:10:54.79,Default,,0000,0000,0000,,radical approach, because we have all\Nthese problems with kernel dependencies
Dialogue: 0,0:10:54.79,0:10:59.19,Default,,0000,0000,0000,,and so on? Well, turns out we can get rid\Nof the kernel entirely and move everything
Dialogue: 0,0:10:59.19,0:11:03.65,Default,,0000,0000,0000,,into one application. This means we take\Nour driver, put it in the application; the
Dialogue: 0,0:11:03.65,0:11:08.05,Default,,0000,0000,0000,,driver directly accesses the network card\Nand sets up DMA memory in the user
Dialogue: 0,0:11:08.05,0:11:11.58,Default,,0000,0000,0000,,space, because the network card doesn't\Ncare where it copies the packets from. We
Dialogue: 0,0:11:11.58,0:11:14.74,Default,,0000,0000,0000,,just have to set up the pointers in the\Nright way, and we can build this framework
Dialogue: 0,0:11:14.74,0:11:17.41,Default,,0000,0000,0000,,like this, so that everything runs in the\Napplication.
Dialogue: 0,0:11:17.41,0:11:23.46,Default,,0000,0000,0000,,We remove the driver from the kernel - no\Nkernel driver running - and this is super
Dialogue: 0,0:11:23.46,0:11:27.65,Default,,0000,0000,0000,,fast, and we can also use this to implement\Ncrazy and obscure hardware features on
Dialogue: 0,0:11:27.65,0:11:31.42,Default,,0000,0000,0000,,network cards that are not supported by\Nthe standard driver. Now, I'm not the first
Dialogue: 0,0:11:31.42,0:11:36.20,Default,,0000,0000,0000,,one to do this; there are two big\Nframeworks that do that: One is DPDK,
Dialogue: 0,0:11:36.20,0:11:41.06,Default,,0000,0000,0000,,which is quite big.
This is a Linux\NFoundation project and it has basically
Dialogue: 0,0:11:41.06,0:11:44.71,Default,,0000,0000,0000,,support from all NIC vendors, meaning\Neveryone who builds a high-speed NIC
Dialogue: 0,0:11:44.71,0:11:49.21,Default,,0000,0000,0000,,writes a driver that works for DPDK, and\Nthe second such framework is Snabb, which
Dialogue: 0,0:11:49.21,0:11:54.14,Default,,0000,0000,0000,,I think is quite interesting, because it\Ndoesn't write the drivers in C but is
Dialogue: 0,0:11:54.14,0:11:58.29,Default,,0000,0000,0000,,entirely written in Lua, the scripting\Nlanguage, so it's kind of nice to see a
Dialogue: 0,0:11:58.29,0:12:03.00,Default,,0000,0000,0000,,driver that's written in a scripting\Nlanguage. Okay, what problems did we solve
Dialogue: 0,0:12:03.00,0:12:06.68,Default,,0000,0000,0000,,and what problems did we now gain? One\Nproblem is we still have the non-standard
Dialogue: 0,0:12:06.68,0:12:11.33,Default,,0000,0000,0000,,API, we still need exclusive access to the\Nnetwork card from one application, because
Dialogue: 0,0:12:11.33,0:12:15.19,Default,,0000,0000,0000,,the driver runs in that thing; there are\Nsome hardware tricks to solve that, but
Dialogue: 0,0:12:15.19,0:12:18.33,Default,,0000,0000,0000,,mainly it's one application that is\Nrunning.
Dialogue: 0,0:12:18.33,0:12:22.46,Default,,0000,0000,0000,,Then the framework needs explicit support\Nfor all the NIC models out there. It's
Dialogue: 0,0:12:22.46,0:12:26.37,Default,,0000,0000,0000,,not that big a problem with DPDK, because\Nit's such a big project that virtually
Dialogue: 0,0:12:26.37,0:12:31.32,Default,,0000,0000,0000,,every NIC has a DPDK driver.
And\Nyes, limited support for interrupts, but
Dialogue: 0,0:12:31.32,0:12:34.17,Default,,0000,0000,0000,,it turns out interrupts are not something\Nthat is useful when you are building
Dialogue: 0,0:12:34.17,0:12:38.00,Default,,0000,0000,0000,,something that processes more than a few\Nhundred thousand packets per second,
Dialogue: 0,0:12:38.00,0:12:41.38,Default,,0000,0000,0000,,because the overhead of the interrupt is\Njust too large; it's mainly a power
Dialogue: 0,0:12:41.38,0:12:44.84,Default,,0000,0000,0000,,saving thing if you ever run into low\Nload. But I don't care about the low-load
Dialogue: 0,0:12:44.84,0:12:50.41,Default,,0000,0000,0000,,scenario and power saving, so for me it's\Npolling all the way and all the CPU. And
Dialogue: 0,0:12:50.41,0:12:55.26,Default,,0000,0000,0000,,you of course lose all access to the\Nusual kernel features. And, well, time to
Dialogue: 0,0:12:55.26,0:12:59.88,Default,,0000,0000,0000,,ask: "What has the kernel ever done for\Nus?" Well, the kernel has lots of mature
Dialogue: 0,0:12:59.88,0:13:03.14,Default,,0000,0000,0000,,drivers. Okay, what has the kernel ever\Ndone for us, except for all these nice,
Dialogue: 0,0:13:03.14,0:13:07.64,Default,,0000,0000,0000,,mature drivers? There are very nice\Nprotocol implementations that actually
Dialogue: 0,0:13:07.64,0:13:10.22,Default,,0000,0000,0000,,work, like the kernel TCP stack is a work\Nof art.
Dialogue: 0,0:13:10.22,0:13:14.32,Default,,0000,0000,0000,,It actually works in real-world scenarios,\Nunlike all these other TCP stacks that
Dialogue: 0,0:13:14.32,0:13:18.41,Default,,0000,0000,0000,,fail in some cases or don't support\Nthe features we want, so there is quite
Dialogue: 0,0:13:18.41,0:13:22.51,Default,,0000,0000,0000,,some nice stuff. But what has the kernel\Never done for us, except for these mature
Dialogue: 0,0:13:22.51,0:13:26.80,Default,,0000,0000,0000,,drivers and these nice protocol stack\Nimplementations?
Okay, quite a few things,
Dialogue: 0,0:13:26.80,0:13:32.87,Default,,0000,0000,0000,,and we are throwing them all out. And one\Nthing to notice: We mostly don't care
Dialogue: 0,0:13:32.87,0:13:37.61,Default,,0000,0000,0000,,about these features when building our\Npacket forward modify router firewall
Dialogue: 0,0:13:37.61,0:13:44.35,Default,,0000,0000,0000,,thing, because these are mostly high-level\Nfeatures, I think. But it's still a
Dialogue: 0,0:13:44.35,0:13:49.20,Default,,0000,0000,0000,,lot of features that we are losing; like,\Nbuilding a TCP stack on top of these
Dialogue: 0,0:13:49.20,0:13:53.00,Default,,0000,0000,0000,,frameworks is kind of an unsolved problem.\NThere are TCP stacks, but they all suck in
Dialogue: 0,0:13:53.00,0:13:58.41,Default,,0000,0000,0000,,different ways. Okay, we lost features, but\Nwe didn't care about the features in the
Dialogue: 0,0:13:58.41,0:14:02.64,Default,,0000,0000,0000,,first place, we wanted performance.\NBack to our performance figures: we have 300
Dialogue: 0,0:14:02.64,0:14:06.49,Default,,0000,0000,0000,,to 600 cycles per packet\Navailable; how long does it take in, for
Dialogue: 0,0:14:06.49,0:14:10.90,Default,,0000,0000,0000,,example, DPDK to receive and send a\Npacket? That is around a hundred cycles to
Dialogue: 0,0:14:10.90,0:14:15.24,Default,,0000,0000,0000,,get a packet through the whole stack, from\Nreceiving a packet, processing
Dialogue: 0,0:14:15.24,0:14:19.66,Default,,0000,0000,0000,,it - well, not processing it, but getting it\Nto the application and back to the driver
Dialogue: 0,0:14:19.66,0:14:23.08,Default,,0000,0000,0000,,to send it out. A hundred cycles, and the\Nother frameworks typically play in the
Dialogue: 0,0:14:23.08,0:14:27.71,Default,,0000,0000,0000,,same league.
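The receive-to-send path being measured here has the same shape in all of these frameworks: one call pulls a whole batch of packet buffers from the NIC's receive ring, the application touches each packet, and one call hands the batch back for transmission. A sketch of that pattern, with stand-in rx_burst/tx_burst functions backed by a tiny in-memory ring so it is self-contained (in DPDK the real calls are rte_eth_rx_burst and rte_eth_tx_burst):

```c
#include <stdint.h>

#define BURST_SIZE 32

struct pkt_buf { uint16_t len; uint8_t data[2048]; };

/* Stand-in "driver": a tiny in-memory ring instead of real NIC
 * descriptor rings, just to show the calling pattern. */
static struct pkt_buf ring[4];
static int rx_avail = 4;
static int tx_done = 0;

static uint16_t rx_burst(struct pkt_buf **bufs, uint16_t n)
{
    uint16_t i = 0;
    while (i < n && rx_avail > 0)
        bufs[i++] = &ring[--rx_avail];
    return i; /* may return fewer than n - that's normal */
}

static uint16_t tx_burst(struct pkt_buf **bufs, uint16_t n)
{
    (void)bufs;
    tx_done += n;
    return n;
}

/* Forward one batch; returns how many packets were sent. The per-call
 * overhead is amortized over up to BURST_SIZE packets, which is the
 * "batching" the talk credits most of the speedup to. */
static uint16_t forward_burst(void)
{
    struct pkt_buf *bufs[BURST_SIZE];
    uint16_t nb_rx = rx_burst(bufs, BURST_SIZE);
    for (uint16_t i = 0; i < nb_rx; i++)
        bufs[i]->data[0] ^= 1; /* per-packet work goes here */
    return tx_burst(bufs, nb_rx);
}
```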
DPDK is slightly faster than\Nthe other ones, because it's full of magic
Dialogue: 0,0:14:27.71,0:14:33.00,Default,,0000,0000,0000,,SSE and AVX intrinsics and the driver is\Nkind of black magic, but it's super fast.
Dialogue: 0,0:14:33.00,0:14:37.48,Default,,0000,0000,0000,,Now, in a kind of real-world scenario: Open\NvSwitch, as I've mentioned as an example
Dialogue: 0,0:14:37.48,0:14:41.69,Default,,0000,0000,0000,,earlier - that was 2 million packets for\Nthe kernel version - Open vSwitch can be
Dialogue: 0,0:14:41.69,0:14:45.22,Default,,0000,0000,0000,,compiled with an optional DPDK backend, so\Nyou set some magic flags when compiling,
Dialogue: 0,0:14:45.22,0:14:49.73,Default,,0000,0000,0000,,then it links against DPDK and uses the\Nnetwork card directly, runs completely in
Dialogue: 0,0:14:49.73,0:14:54.71,Default,,0000,0000,0000,,userspace, and now it's a factor of around\N6 or 7 faster, and we can achieve 13
Dialogue: 0,0:14:54.71,0:14:58.43,Default,,0000,0000,0000,,million packets per second with\Naround the same processing steps on a
Dialogue: 0,0:14:58.43,0:15:03.12,Default,,0000,0000,0000,,single CPU core. So, great, where do\Nthe performance gains come from? Well,
Dialogue: 0,0:15:03.12,0:15:08.13,Default,,0000,0000,0000,,there are two things - mainly it's compared\Nto the kernel, not compared to sockets.
Dialogue: 0,0:15:08.13,0:15:13.29,Default,,0000,0000,0000,,What people often say is that this is\Nzero copy, which is a stupid term, because
Dialogue: 0,0:15:13.29,0:15:18.28,Default,,0000,0000,0000,,the kernel doesn't copy packets either, so\Nit's not copying packets that was slow, it
Dialogue: 0,0:15:18.28,0:15:22.30,Default,,0000,0000,0000,,was other things.
Mainly it's batching,\Nmeaning it's very efficient to process a
Dialogue: 0,0:15:22.30,0:15:28.62,Default,,0000,0000,0000,,relatively large number of packets at once,\Nand that really helps; and the second thing is
Dialogue: 0,0:15:28.62,0:15:32.51,Default,,0000,0000,0000,,reduced memory overhead: the sk_buff data\Nstructure is really big, and if you cut
Dialogue: 0,0:15:32.51,0:15:37.32,Default,,0000,0000,0000,,that down you save a lot of cycles. These\Nare DPDK figures; DPDK, unlike
Dialogue: 0,0:15:37.32,0:15:42.68,Default,,0000,0000,0000,,some other frameworks, has memory\Nmanagement, and this is already included
Dialogue: 0,0:15:42.68,0:15:46.55,Default,,0000,0000,0000,,in these 50 cycles.\NOkay, now we know that these frameworks
Dialogue: 0,0:15:46.55,0:15:52.01,Default,,0000,0000,0000,,exist and everything, and the next obvious\Nquestion is: "Can we build our own
Dialogue: 0,0:15:52.01,0:15:57.69,Default,,0000,0000,0000,,driver?" Well, but why? First, for fun,\Nobviously, and then to understand how that
Dialogue: 0,0:15:57.69,0:16:01.16,Default,,0000,0000,0000,,stuff works: how these drivers work,\Nhow these packet processing frameworks
Dialogue: 0,0:16:01.16,0:16:04.68,Default,,0000,0000,0000,,work.\NIn my work in academia I've
Dialogue: 0,0:16:04.68,0:16:07.84,Default,,0000,0000,0000,,seen a lot of people using these\Nframeworks. It's nice, because they are
Dialogue: 0,0:16:07.84,0:16:12.26,Default,,0000,0000,0000,,fast and they enable a few things that\Njust weren't possible before. But people
Dialogue: 0,0:16:12.26,0:16:16.17,Default,,0000,0000,0000,,often treat these as magic black boxes: you\Nput in your packet and then it magically
Dialogue: 0,0:16:16.17,0:16:20.43,Default,,0000,0000,0000,,is faster - and sometimes I don't blame\Nthem. If you look at DPDK source code,
Dialogue: 0,0:16:20.43,0:16:24.27,Default,,0000,0000,0000,,there are more than 20,000 lines of code\Nfor each driver.
And just for example,
Dialogue: 0,0:16:24.27,0:16:28.81,Default,,0000,0000,0000,,looking at the receive and transmit\Nfunctions of the IXGBE driver in DPDK:
Dialogue: 0,0:16:28.81,0:16:33.77,Default,,0000,0000,0000,,this is one file with around 3,000 lines\Nof code, and they do a lot of magic just
Dialogue: 0,0:16:33.77,0:16:37.95,Default,,0000,0000,0000,,to receive and send packets. No one wants\Nto read through that, so the question is:
Dialogue: 0,0:16:37.95,0:16:40.96,Default,,0000,0000,0000,,"How hard can it be to write your own\Ndriver?"
Dialogue: 0,0:16:40.96,0:16:44.85,Default,,0000,0000,0000,,Turns out: It's quite easy! This was like\Na weekend project. I have written a
Dialogue: 0,0:16:44.85,0:16:48.37,Default,,0000,0000,0000,,driver called ixy. It's less than a\Nthousand lines of C code. That is the full
Dialogue: 0,0:16:48.37,0:16:53.56,Default,,0000,0000,0000,,driver for 10G network cards and the full\Nframework to get some applications, and two
Dialogue: 0,0:16:53.56,0:16:58.10,Default,,0000,0000,0000,,simple example applications. Took me like\Nless than two days to write it completely,
Dialogue: 0,0:16:58.10,0:17:00.90,Default,,0000,0000,0000,,then two more days to debug it and fix\Nperformance.
Dialogue: 0,0:17:02.38,0:17:08.21,Default,,0000,0000,0000,,So I've been building this driver for the\NIntel IXGBE family. This is a family of
Dialogue: 0,0:17:08.21,0:17:13.04,Default,,0000,0000,0000,,network cards that you know of if you've\Never had a server to test this, because
Dialogue: 0,0:17:13.04,0:17:17.64,Default,,0000,0000,0000,,almost all servers that have 10G\Nconnections have these Intel cards. And
Dialogue: 0,0:17:17.64,0:17:22.83,Default,,0000,0000,0000,,they are also embedded in some Xeon CPUs.\NThey are also onboard chips on many
Dialogue: 0,0:17:22.83,0:17:29.48,Default,,0000,0000,0000,,mainboards, and the nice thing about them\Nis that they have a publicly available data\Nsheet.
Meaning Intel publishes this 1,000-page\NPDF that describes everything Dialogue: 0,0:17:33.62,0:17:37.14,Default,,0000,0000,0000,,you ever wanted to know, when writing a\Ndriver for these. And the next nice thing Dialogue: 0,0:17:37.14,0:17:41.32,Default,,0000,0000,0000,,is, that there is almost no logic hidden\Nbehind black-box magic firmware. Many Dialogue: 0,0:17:41.32,0:17:46.21,Default,,0000,0000,0000,,newer network cards -especially Mellanox,\Nthe newer ones- hide a lot of Dialogue: 0,0:17:46.21,0:17:50.12,Default,,0000,0000,0000,,functionality behind the firmware, and the\Ndriver mostly just exchanges messages Dialogue: 0,0:17:50.12,0:17:54.17,Default,,0000,0000,0000,,with the firmware, which is kind of\Nboring, and with this family, it is not Dialogue: 0,0:17:54.17,0:17:58.34,Default,,0000,0000,0000,,the case, which I think is very nice. So\Nhow can we build a driver for this in four Dialogue: 0,0:17:58.34,0:18:02.88,Default,,0000,0000,0000,,very simple steps? One: We remove the\Ndriver that is currently loaded, because Dialogue: 0,0:18:02.88,0:18:07.60,Default,,0000,0000,0000,,we don't want it to interfere with our\Nstuff. Okay, easy so far. Second, we Dialogue: 0,0:18:07.60,0:18:12.59,Default,,0000,0000,0000,,memory-map the PCIe memory-mapped I/O\Naddress space. This allows us to access Dialogue: 0,0:18:12.59,0:18:16.43,Default,,0000,0000,0000,,the PCI Express device. Number three: We\Nfigure out the physical addresses of our Dialogue: 0,0:18:16.43,0:18:22.75,Default,,0000,0000,0000,,process's memory regions and\Nthen we use them for DMA. And step four is Dialogue: 0,0:18:22.75,0:18:26.78,Default,,0000,0000,0000,,slightly more complicated, than the first\Nthree steps, as we write the driver. 
Now, Dialogue: 0,0:18:26.78,0:18:31.85,Default,,0000,0000,0000,,first thing to do, we figure out, where\Nour network card is -let's say we have a Dialogue: 0,0:18:31.85,0:18:35.44,Default,,0000,0000,0000,,server and we plugged in our network card-\Nthen it gets assigned an address on the Dialogue: 0,0:18:35.44,0:18:39.61,Default,,0000,0000,0000,,PCI bus. We can figure that out with\Nlspci, this is the address. We need it in Dialogue: 0,0:18:39.61,0:18:43.43,Default,,0000,0000,0000,,a slightly different version with the\Nfully qualified ID, and then we can remove Dialogue: 0,0:18:43.43,0:18:47.78,Default,,0000,0000,0000,,the kernel driver by telling the currently\Nbound driver to remove that specific ID. Dialogue: 0,0:18:47.78,0:18:52.10,Default,,0000,0000,0000,,Now the operating system doesn't know\Nthat this is a network card; it doesn't know Dialogue: 0,0:18:52.10,0:18:55.87,Default,,0000,0000,0000,,anything, it just notes that some PCI device\Nhas no driver. Then we write our Dialogue: 0,0:18:55.87,0:18:59.21,Default,,0000,0000,0000,,application.\NThis is written in C and we just open Dialogue: 0,0:18:59.21,0:19:04.21,Default,,0000,0000,0000,,this magic file in sysfs and we\Njust mmap it. Ain't no magic, Dialogue: 0,0:19:04.21,0:19:08.18,Default,,0000,0000,0000,,just a normal mmap there. But what we get\Nback is a kind of special memory region. Dialogue: 0,0:19:08.18,0:19:12.16,Default,,0000,0000,0000,,This is the memory-mapped I/O\Nregion of the PCI Express device Dialogue: 0,0:19:12.16,0:19:17.62,Default,,0000,0000,0000,,and this is where all the registers\Nare available. I will show you Dialogue: 0,0:19:17.62,0:19:20.96,Default,,0000,0000,0000,,what that means in just a second. 
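The unbind step described here can be sketched as shell commands. These sysfs paths are the standard Linux PCI driver-unbind interface; the PCI address 0000:03:00.0 and the ixgbe driver name are placeholders, take the real values from lspci on your machine:

```shell
# Find the card's PCI address (placeholder output shown as a comment only):
lspci | grep -i ethernet
# Tell the currently bound driver to release the device (needs root);
# 0000:03:00.0 is the fully qualified version of the lspci address:
echo 0000:03:00.0 > /sys/bus/pci/drivers/ixgbe/unbind
# The device now has no driver; its BAR0 is exposed as a mappable file:
ls /sys/bus/pci/devices/0000:03:00.0/resource0
```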
If we go\Nthrough the datasheet, there are Dialogue: 0,0:19:20.96,0:19:25.53,Default,,0000,0000,0000,,hundreds of pages of tables like this and\Nthese tables tell us the registers, that Dialogue: 0,0:19:25.53,0:19:29.97,Default,,0000,0000,0000,,exist on that network card, the offset\Nthey have and a link to more detailed Dialogue: 0,0:19:29.97,0:19:34.59,Default,,0000,0000,0000,,descriptions. And in code that looks like\Nthis: For example the LED control register Dialogue: 0,0:19:34.59,0:19:38.09,Default,,0000,0000,0000,,is at this offset, and in the LED control\Nregister Dialogue: 0,0:19:38.09,0:19:42.52,Default,,0000,0000,0000,,there are 32 bits at certain\Noffsets. Bit 7 is called Dialogue: 0,0:19:42.52,0:19:48.59,Default,,0000,0000,0000,,LED0_BLINK and if we set that bit in that\Nregister, then one of the LEDs will start Dialogue: 0,0:19:48.59,0:19:53.67,Default,,0000,0000,0000,,to blink. And we can just do that via our\Nmagic memory region, because all the reads Dialogue: 0,0:19:53.67,0:19:57.68,Default,,0000,0000,0000,,and writes, that we do to that memory\Nregion, go directly over the PCI Express Dialogue: 0,0:19:57.68,0:20:01.57,Default,,0000,0000,0000,,bus to the network card and the network\Ncard does whatever it wants to do with Dialogue: 0,0:20:01.57,0:20:03.13,Default,,0000,0000,0000,,them.\NIt doesn't have to be a register, Dialogue: 0,0:20:03.13,0:20:08.69,Default,,0000,0000,0000,,basically it's just a command sent to\Nthe network card, and it's just a nice and Dialogue: 0,0:20:08.69,0:20:11.67,Default,,0000,0000,0000,,convenient interface to map that into\Nmemory. This is a very common technique, Dialogue: 0,0:20:11.67,0:20:15.10,Default,,0000,0000,0000,,that you will also find when you do some\Nmicroprocessor programming or something. 
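The LED0_BLINK example above can be sketched in C. This is a minimal sketch, not the talk's actual ixy code: the register offset 0x00200 and bit 7 are from the 82599 datasheet's LEDCTL description, while the function names, the sysfs path and the PCI address are illustrative assumptions:

```c
#include <fcntl.h>
#include <stdint.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* From the 82599 datasheet: LED control register at offset 0x00200,
 * bit 7 of it is LED0_BLINK. */
#define LEDCTL     0x00200
#define LED0_BLINK (1u << 7)

/* Set the blink bit in the LED control register. `bar0` is the mmap'ed
 * device memory; on real hardware each access is a PCI Express transaction. */
void led0_blink(volatile uint8_t *bar0) {
    volatile uint32_t *ledctl = (volatile uint32_t *)(bar0 + LEDCTL);
    *ledctl |= LED0_BLINK;
}

/* Map BAR0 of an unbound PCI device via its sysfs resource0 file,
 * e.g. "/sys/bus/pci/devices/0000:03:00.0/resource0" (address is a placeholder). */
volatile uint8_t *map_bar0(const char *path) {
    int fd = open(path, O_RDWR);
    if (fd < 0) return NULL;
    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return NULL; }
    void *mem = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    return mem == MAP_FAILED ? NULL : (volatile uint8_t *)mem;
}
```

Usage would be `led0_blink(map_bar0("/sys/bus/pci/devices/0000:03:00.0/resource0"))`; note the pointer is `volatile` so the compiler does not cache or reorder the register accesses.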
Dialogue: 0,0:20:16.26,0:20:20.11,Default,,0000,0000,0000,,So, one thing to note is: since this\Nis not memory, it can't Dialogue: 0,0:20:20.11,0:20:24.11,Default,,0000,0000,0000,,be cached. There's no cache in between.\NEach of these accesses will trigger a PCI Dialogue: 0,0:20:24.11,0:20:29.21,Default,,0000,0000,0000,,Express transaction and it will take quite\Nsome time, lots of Dialogue: 0,0:20:29.21,0:20:32.92,Default,,0000,0000,0000,,cycles, where lots means like a hundred\Ncycles, which is a lot Dialogue: 0,0:20:32.92,0:20:37.21,Default,,0000,0000,0000,,for me.\NSo how do we now handle packets? We now Dialogue: 0,0:20:37.21,0:20:42.40,Default,,0000,0000,0000,,have access to these registers, we\Ncan read the datasheet and we can write Dialogue: 0,0:20:42.40,0:20:47.25,Default,,0000,0000,0000,,the driver, but we need some way to\Nget packets through it. Of course it Dialogue: 0,0:20:47.25,0:20:51.47,Default,,0000,0000,0000,,would be possible for a network card\Nto transfer packets via this memory-mapped I/O Dialogue: 0,0:20:51.47,0:20:56.80,Default,,0000,0000,0000,,region but it's kind of annoying. The\Nsecond way a PCI Express device Dialogue: 0,0:20:56.80,0:21:01.43,Default,,0000,0000,0000,,communicates with your server or macbook\Nis via DMA, direct memory access, and a Dialogue: 0,0:21:01.43,0:21:07.54,Default,,0000,0000,0000,,DMA transfer, unlike the memory-mapped I/O\Nstuff, is initiated by the network card and Dialogue: 0,0:21:07.54,0:21:14.05,Default,,0000,0000,0000,,this means the network card can just write\Nto arbitrary addresses in main memory. 
Dialogue: 0,0:21:14.05,0:21:20.20,Default,,0000,0000,0000,,And for this the network card offers so-called\Nrings, which are queue interfaces, Dialogue: 0,0:21:20.20,0:21:22.95,Default,,0000,0000,0000,,for receiving packets and for sending\Npackets, and there are multiple of these Dialogue: 0,0:21:22.95,0:21:26.58,Default,,0000,0000,0000,,interfaces, because this is how you do\Nmulti-core scaling. If you want to Dialogue: 0,0:21:26.58,0:21:30.65,Default,,0000,0000,0000,,transmit from multiple cores, you allocate\Nmultiple queues. Each core sends to one Dialogue: 0,0:21:30.65,0:21:34.27,Default,,0000,0000,0000,,queue and the network card just merges\Nthese queues in hardware onto the link, Dialogue: 0,0:21:34.27,0:21:38.79,Default,,0000,0000,0000,,and on receiving the network card can\Neither hash on the incoming Dialogue: 0,0:21:38.79,0:21:42.82,Default,,0000,0000,0000,,packet, like hash over protocol headers, or\Nyou can set explicit filters. Dialogue: 0,0:21:42.82,0:21:46.63,Default,,0000,0000,0000,,This is not specific to network cards,\Nmost PCI Express devices work like this: Dialogue: 0,0:21:46.63,0:21:52.00,Default,,0000,0000,0000,,GPUs have command queues\Nand so on, NVMe PCI Express disks have Dialogue: 0,0:21:52.00,0:21:56.66,Default,,0000,0000,0000,,queues too.\NSo let's look at queues using the example of the Dialogue: 0,0:21:56.66,0:22:01.48,Default,,0000,0000,0000,,ixgbe family, but you will find that most\NNICs work in a very similar way. There are Dialogue: 0,0:22:01.48,0:22:04.11,Default,,0000,0000,0000,,sometimes small differences but mainly\Nthey work like this. Dialogue: 0,0:22:04.34,0:22:08.90,Default,,0000,0000,0000,,And these rings are just circular buffers\Nfilled with so-called DMA descriptors. 
A Dialogue: 0,0:22:08.90,0:22:14.18,Default,,0000,0000,0000,,DMA descriptor is a 16-byte struct and\Nthat is eight bytes of a physical pointer Dialogue: 0,0:22:14.18,0:22:18.96,Default,,0000,0000,0000,,pointing to some location where more stuff\Nis and eight bytes of metadata like "I Dialogue: 0,0:22:18.96,0:22:24.39,Default,,0000,0000,0000,,fetched the stuff" or "this packet needs\NVLAN tag offloading" or "this packet had a Dialogue: 0,0:22:24.39,0:22:27.12,Default,,0000,0000,0000,,VLAN tag that I removed", information like\Nthat is stored in there. Dialogue: 0,0:22:27.12,0:22:31.20,Default,,0000,0000,0000,,And what we then need to do is we\Ntranslate virtual addresses from our Dialogue: 0,0:22:31.20,0:22:34.51,Default,,0000,0000,0000,,address space to physical addresses\Nbecause the PCI Express device of course Dialogue: 0,0:22:34.51,0:22:39.20,Default,,0000,0000,0000,,needs physical addresses.\NAnd we can do that using procfs: Dialogue: 0,0:22:39.20,0:22:45.59,Default,,0000,0000,0000,,in /proc/self/pagemap we can look that up.\NAnd the next thing is: we now have this Dialogue: 0,0:22:45.59,0:22:51.61,Default,,0000,0000,0000,,queue of DMA descriptors in memory\Nand this queue itself is also accessed via Dialogue: 0,0:22:51.61,0:22:57.10,Default,,0000,0000,0000,,DMA and it works like\Nyou expect a circular ring to work. It has Dialogue: 0,0:22:57.10,0:23:00.97,Default,,0000,0000,0000,,a head and a tail, and the head and tail\Npointers are available via registers in Dialogue: 0,0:23:00.97,0:23:05.68,Default,,0000,0000,0000,,memory-mapped I/O address space, meaning\Nin an image it looks kind of like this: We Dialogue: 0,0:23:05.68,0:23:09.65,Default,,0000,0000,0000,,have this descriptor ring in our physical\Nmemory to the left full of pointers and Dialogue: 0,0:23:09.65,0:23:16.00,Default,,0000,0000,0000,,then we have somewhere else these packets\Nin some memory pool. 
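The /proc/self/pagemap lookup mentioned here can be sketched in C. A minimal sketch, not the talk's ixy code: the pagemap entry layout (64-bit entries, page frame number in bits 0-54) is the documented kernel interface, the function names are my own. Note that on modern kernels the PFN reads as 0 without CAP_SYS_ADMIN:

```c
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* pagemap entries are 64 bits; bits 0-54 hold the page frame number (PFN). */
#define PFN_MASK ((1ull << 55) - 1)

/* Pure arithmetic: combine a pagemap entry with the offset inside the page. */
uint64_t phys_from_entry(uint64_t entry, uintptr_t virt, uintptr_t page_size) {
    return (entry & PFN_MASK) * page_size + virt % page_size;
}

/* Look up the physical address backing `virt` for the current process.
 * Each page has one 8-byte entry in /proc/self/pagemap, indexed by
 * virtual page number. Returns 0 on failure (or without root privileges). */
uint64_t virt_to_phys(const void *virt) {
    uintptr_t page_size = (uintptr_t)sysconf(_SC_PAGESIZE);
    FILE *f = fopen("/proc/self/pagemap", "rb");
    if (!f) return 0;
    uint64_t entry = 0;
    fseek(f, (long)((uintptr_t)virt / page_size * sizeof(entry)), SEEK_SET);
    if (fread(&entry, sizeof(entry), 1, f) != 1) entry = 0;
    fclose(f);
    return phys_from_entry(entry, (uintptr_t)virt, page_size);
}
```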
And one thing to note Dialogue: 0,0:23:16.00,0:23:20.27,Default,,0000,0000,0000,,when allocating this kind of memory: There\Nis a small trick you have to do because Dialogue: 0,0:23:20.27,0:23:25.06,Default,,0000,0000,0000,,the descriptor ring needs to be\Ncontiguous in your physical memory Dialogue: 0,0:23:25.06,0:23:29.14,Default,,0000,0000,0000,,and if you just assume that\Neverything that's contiguous in your Dialogue: 0,0:23:29.14,0:23:34.40,Default,,0000,0000,0000,,process is also physically contiguous in hardware: No,\Nit isn't, and if you have a bug in there Dialogue: 0,0:23:34.40,0:23:37.92,Default,,0000,0000,0000,,and it writes somewhere else, then\Nyour filesystem dies, as I figured out, Dialogue: 0,0:23:37.92,0:23:43.18,Default,,0000,0000,0000,,which was not a good thing.\NSo what I'm doing is I'm using Dialogue: 0,0:23:43.18,0:23:46.79,Default,,0000,0000,0000,,huge pages, two megabyte pages, that's\Nenough contiguous memory and that's Dialogue: 0,0:23:46.79,0:23:53.99,Default,,0000,0000,0000,,guaranteed to not have weird gaps.\NSo, to receive packets we need to Dialogue: 0,0:23:53.99,0:23:58.60,Default,,0000,0000,0000,,set up the ring, so we tell the network\Ncard via memory-mapped I/O the location and Dialogue: 0,0:23:58.60,0:24:03.07,Default,,0000,0000,0000,,the size of the ring, then we fill up the\Nring with pointers to freshly allocated Dialogue: 0,0:24:03.07,0:24:09.82,Default,,0000,0000,0000,,buffers that are just empty and then we set\Nthe head and tail pointers to tell the Dialogue: 0,0:24:09.82,0:24:13.10,Default,,0000,0000,0000,,network card that the queue is full,\Nbecause the queue at the moment is full, Dialogue: 0,0:24:13.10,0:24:16.96,Default,,0000,0000,0000,,it's full of packets. These packets are\Njust not yet filled with anything. 
And now Dialogue: 0,0:24:16.96,0:24:20.63,Default,,0000,0000,0000,,what the NIC does: it fetches one of the\NDMA descriptors and as soon as it receives Dialogue: 0,0:24:20.63,0:24:25.54,Default,,0000,0000,0000,,a packet it writes the packet via DMA to\Nthe location specified in the descriptor and Dialogue: 0,0:24:25.54,0:24:30.30,Default,,0000,0000,0000,,increments the head pointer of the queue\Nand it also sets a status flag in the DMA Dialogue: 0,0:24:30.30,0:24:33.59,Default,,0000,0000,0000,,descriptor once it's done writing the\Npacket to memory, and this step is Dialogue: 0,0:24:33.59,0:24:39.61,Default,,0000,0000,0000,,important because reading back the head\Npointer via MMIO would be way too slow. Dialogue: 0,0:24:39.61,0:24:43.33,Default,,0000,0000,0000,,So instead we check the status flag,\Nbecause the status flag gets handled by Dialogue: 0,0:24:43.33,0:24:47.30,Default,,0000,0000,0000,,the cache and is already in\Ncache, so we can check it really fast. Dialogue: 0,0:24:48.79,0:24:52.12,Default,,0000,0000,0000,,Next step is we periodically poll the\Nstatus flag. This is the point where Dialogue: 0,0:24:52.12,0:24:56.01,Default,,0000,0000,0000,,interrupts might come in useful.\NThere's some misconception: people Dialogue: 0,0:24:56.01,0:24:59.42,Default,,0000,0000,0000,,sometimes believe that if you receive a\Npacket then you get an interrupt and the Dialogue: 0,0:24:59.42,0:25:02.42,Default,,0000,0000,0000,,interrupt somehow magically contains the\Npacket. No it doesn't. The interrupt just Dialogue: 0,0:25:02.42,0:25:05.60,Default,,0000,0000,0000,,contains the information that there is a\Nnew packet. After the interrupt you would Dialogue: 0,0:25:05.60,0:25:12.45,Default,,0000,0000,0000,,have to poll the status flag anyways. 
So\Nwe now have the packet, we process the Dialogue: 0,0:25:12.45,0:25:16.17,Default,,0000,0000,0000,,packet or do whatever, then we reset the\NDMA descriptor, we can either recycle the Dialogue: 0,0:25:16.17,0:25:21.65,Default,,0000,0000,0000,,old packet or allocate a new one and we\Nset the ready flag in the status register Dialogue: 0,0:25:21.65,0:25:25.53,Default,,0000,0000,0000,,and we adjust the tail pointer register to\Ntell the network card that we are done Dialogue: 0,0:25:25.53,0:25:28.39,Default,,0000,0000,0000,,with this, and we don't have to do that\Nevery time because we don't have to keep the Dialogue: 0,0:25:28.39,0:25:33.22,Default,,0000,0000,0000,,queue 100% utilized. We can update\Nthe tail pointer only every hundred Dialogue: 0,0:25:33.22,0:25:37.56,Default,,0000,0000,0000,,packets or so and then that's not a\Nperformance problem. Now we have a Dialogue: 0,0:25:37.56,0:25:42.02,Default,,0000,0000,0000,,driver that can receive packets. Next\Nstep, well, transmit packets; it basically Dialogue: 0,0:25:42.02,0:25:46.37,Default,,0000,0000,0000,,works the same. I won't bore you with the\Ndetails. Then there's of course a lot of Dialogue: 0,0:25:46.37,0:25:50.60,Default,,0000,0000,0000,,boring initialization code and it's\Njust following the datasheet, it's Dialogue: 0,0:25:50.60,0:25:54.07,Default,,0000,0000,0000,,like: set this register, set that\Nregister, do that, and I just coded it down Dialogue: 0,0:25:54.07,0:25:58.87,Default,,0000,0000,0000,,from the datasheet and it works, so big\Nsurprise. So now you know how to write a Dialogue: 0,0:25:58.87,0:26:03.80,Default,,0000,0000,0000,,driver like this, and here are a few ideas of\Nwhat I want to do, what maybe you want Dialogue: 0,0:26:03.80,0:26:06.82,Default,,0000,0000,0000,,to do with a driver like this. 
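The receive loop described above (poll the status flag, recycle the descriptor, update the tail pointer only in batches) can be sketched as plain C over an in-memory ring. This is a simplified model, not the talk's ixy code: the struct layout, names like rx_poll, and the TAIL_BATCH value are illustrative, and the real 16-byte advanced descriptors carry more fields:

```c
#include <stdint.h>

/* Simplified receive descriptor; only the "descriptor done" status bit
 * matters here. The NIC sets STATUS_DD after DMA-ing the packet. */
#define STATUS_DD 0x1u

struct rx_desc {
    uint64_t addr;      /* physical pointer to the packet buffer */
    uint32_t length;    /* bytes written by the NIC */
    uint32_t status;    /* STATUS_DD once the packet is in memory */
};

struct rx_ring {
    struct rx_desc *descs;
    uint16_t size;         /* number of descriptors, power of two */
    uint16_t index;        /* next descriptor we expect the NIC to fill */
    uint16_t tail_pending; /* processed descriptors not yet reported via tail */
};

/* Harvest up to `max` finished packets; return how many were received.
 * `tail_reg` stands in for the MMIO tail pointer register; writing it
 * only every TAIL_BATCH packets keeps expensive PCIe writes rare. */
#define TAIL_BATCH 64
int rx_poll(struct rx_ring *ring, volatile uint32_t *tail_reg, int max) {
    int received = 0;
    while (received < max) {
        struct rx_desc *d = &ring->descs[ring->index];
        if (!(d->status & STATUS_DD))
            break;                 /* NIC hasn't written this one yet */
        /* ... hand d->addr / d->length to the application here ... */
        d->status = 0;             /* reset: descriptor ready for reuse */
        ring->index = (ring->index + 1) & (ring->size - 1);
        ring->tail_pending++;
        received++;
    }
    if (ring->tail_pending >= TAIL_BATCH) {
        *tail_reg = (uint32_t)((ring->index - 1) & (ring->size - 1));
        ring->tail_pending = 0;
    }
    return received;
}
```

The loop stops at the first descriptor without the done bit, which is exactly the cheap cached check the talk contrasts with reading the head pointer back over MMIO.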
Of course\NI want to look at performance, to look Dialogue: 0,0:26:06.82,0:26:09.93,Default,,0000,0000,0000,,at what makes this faster than the kernel,\Nthen I want to look at some obscure Dialogue: 0,0:26:09.93,0:26:12.53,Default,,0000,0000,0000,,hardware offloading features.\NIn the past I've looked at IPsec Dialogue: 0,0:26:12.53,0:26:15.84,Default,,0000,0000,0000,,offloading, which is quite interesting,\Nbecause the Intel network cards have Dialogue: 0,0:26:15.84,0:26:19.87,Default,,0000,0000,0000,,hardware support for IPsec offloading, but\Nnone of the Intel drivers had it and it Dialogue: 0,0:26:19.87,0:26:24.20,Default,,0000,0000,0000,,seems to work just fine. So not sure\Nwhat's going on there. Then security is Dialogue: 0,0:26:24.20,0:26:29.44,Default,,0000,0000,0000,,interesting. There are\Nobviously some security implications of Dialogue: 0,0:26:29.44,0:26:33.40,Default,,0000,0000,0000,,having the whole driver in a user space\Nprocess and I'm wondering about Dialogue: 0,0:26:33.40,0:26:37.12,Default,,0000,0000,0000,,how we can use the IOMMU, because it turns\Nout, once we have set up the memory Dialogue: 0,0:26:37.12,0:26:40.13,Default,,0000,0000,0000,,mapping we can drop all the privileges, we\Ndon't need them. Dialogue: 0,0:26:40.13,0:26:43.66,Default,,0000,0000,0000,,And if we set up the IOMMU beforehand to\Nrestrict the network card to certain Dialogue: 0,0:26:43.66,0:26:48.75,Default,,0000,0000,0000,,things then we could have a safe driver in\Nuserspace that can't do anything wrong, Dialogue: 0,0:26:48.75,0:26:52.26,Default,,0000,0000,0000,,because it has no privileges, and the network\Ncard has no access because it goes through Dialogue: 0,0:26:52.26,0:26:56.05,Default,,0000,0000,0000,,the IOMMU, and there are performance\Nimplications of the IOMMU and so on. Of Dialogue: 0,0:26:56.05,0:26:59.89,Default,,0000,0000,0000,,course, support for other NICs. 
I want to\Nsupport virtio, virtual NICs, and other Dialogue: 0,0:26:59.89,0:27:03.56,Default,,0000,0000,0000,,programming languages for the driver would\Nalso be interesting. It's just written in Dialogue: 0,0:27:03.56,0:27:06.69,Default,,0000,0000,0000,,C because C is the lowest common\Ndenominator of programming languages. Dialogue: 0,0:27:06.99,0:27:12.70,Default,,0000,0000,0000,,To conclude, check out ixy. It's BSD\Nlicensed on GitHub and the main thing to Dialogue: 0,0:27:12.70,0:27:16.09,Default,,0000,0000,0000,,take with you is that drivers are really\Nsimple. Don't be afraid of drivers. Don't Dialogue: 0,0:27:16.09,0:27:20.06,Default,,0000,0000,0000,,be afraid of writing your own drivers. You can\Ndo it in any language and you don't even Dialogue: 0,0:27:20.06,0:27:23.14,Default,,0000,0000,0000,,need to add kernel code. Just map the\Nstuff to your process, write the driver Dialogue: 0,0:27:23.14,0:27:27.02,Default,,0000,0000,0000,,and do whatever you want. Okay, thanks for\Nyour attention. Dialogue: 0,0:27:27.02,0:27:33.34,Default,,0000,0000,0000,,{\i1}Applause{\i0} Dialogue: 0,0:27:33.34,0:27:36.08,Default,,0000,0000,0000,,Herald: You have very few minutes left for Dialogue: 0,0:27:36.08,0:27:40.53,Default,,0000,0000,0000,,questions. So if you have a question in\Nthe room please go quickly to one of the 8 Dialogue: 0,0:27:40.53,0:27:46.90,Default,,0000,0000,0000,,microphones in the room. Does the signal\Nangel already have a question ready? I Dialogue: 0,0:27:46.90,0:27:53.00,Default,,0000,0000,0000,,don't see anything. Anybody lining up at\Nany microphones? Dialogue: 0,0:28:07.18,0:28:08.95,Default,,0000,0000,0000,,Alright, number 6 please. Dialogue: 0,0:28:09.93,0:28:15.14,Default,,0000,0000,0000,,Mic 6: As you're not actually using any of\Nthe Linux drivers, is there an advantage Dialogue: 0,0:28:15.14,0:28:19.47,Default,,0000,0000,0000,,to using Linux here or could you use any\Nopen source operating system? 
Dialogue: 0,0:28:19.47,0:28:24.20,Default,,0000,0000,0000,,Paul: I don't know about other operating\Nsystems but the only thing I'm using from Dialogue: 0,0:28:24.20,0:28:28.65,Default,,0000,0000,0000,,Linux here is the ability to easily map\Nthat. For some other operating systems we Dialogue: 0,0:28:28.65,0:28:32.78,Default,,0000,0000,0000,,might need a small stub driver that maps\Nthe stuff in there. You can check out the Dialogue: 0,0:28:32.78,0:28:36.82,Default,,0000,0000,0000,,DPDK FreeBSD port which has a small stub\Ndriver that just handles the memory Dialogue: 0,0:28:36.82,0:28:41.38,Default,,0000,0000,0000,,mapping.\NHerald: Here, at number 2. Dialogue: 0,0:28:41.38,0:28:45.34,Default,,0000,0000,0000,,Mic 2: Hi, erm, slightly disconnected from\Nthe talk, but I'd just like to hear your Dialogue: 0,0:28:45.34,0:28:50.88,Default,,0000,0000,0000,,opinion on smart NICs where they're\Nconsidering putting CPUs on the NIC Dialogue: 0,0:28:50.88,0:28:55.28,Default,,0000,0000,0000,,itself. So you could imagine running Open\NvSwitch on the CPU on the NIC. Dialogue: 0,0:28:55.28,0:28:59.53,Default,,0000,0000,0000,,Paul: Yeah, I have some smart NIC\Nsomewhere in some lab and have also done Dialogue: 0,0:28:59.53,0:29:05.64,Default,,0000,0000,0000,,work with the NetFPGA. I think it's\Nvery interesting, but it's a Dialogue: 0,0:29:05.64,0:29:09.82,Default,,0000,0000,0000,,complicated trade-off, because these smart\NNICs come with new restrictions and they Dialogue: 0,0:29:09.82,0:29:13.82,Default,,0000,0000,0000,,are not dramatically super fast. So it's\Ninteresting from a performance Dialogue: 0,0:29:13.82,0:29:17.61,Default,,0000,0000,0000,,perspective to see when it's worth it,\Nwhen it's not worth it, and I Dialogue: 0,0:29:17.61,0:29:22.10,Default,,0000,0000,0000,,personally think it's probably better to\Ndo everything with raw CPU power. 
Dialogue: 0,0:29:22.10,0:29:25.20,Default,,0000,0000,0000,,Mic 2: Thanks.\NHerald: Alright, before we take the next Dialogue: 0,0:29:25.20,0:29:29.73,Default,,0000,0000,0000,,question, just for the people who don't\Nwant to stick around for the Q&A: If you Dialogue: 0,0:29:29.73,0:29:33.72,Default,,0000,0000,0000,,really do have to leave the room early,\Nplease do so quietly, so we can continue Dialogue: 0,0:29:33.72,0:29:39.44,Default,,0000,0000,0000,,the Q&A. Number 6, please.\NMic 6: So how does the performance of the Dialogue: 0,0:29:39.44,0:29:42.81,Default,,0000,0000,0000,,userspace driver compare to the XDP\Nsolution? Dialogue: 0,0:29:42.81,0:29:51.19,Default,,0000,0000,0000,,Paul: Um, it's slightly faster. But one\Nimportant thing about XDP is, if you look Dialogue: 0,0:29:51.19,0:29:54.91,Default,,0000,0000,0000,,at this, this is still new work and there\Nare a few important Dialogue: 0,0:29:54.91,0:29:58.34,Default,,0000,0000,0000,,restrictions: here you can write your\Nuserspace thing in whatever programming Dialogue: 0,0:29:58.34,0:30:01.52,Default,,0000,0000,0000,,language you want. Like I mentioned, Snabb\Nhas a driver entirely written in Lua. With Dialogue: 0,0:30:01.52,0:30:06.98,Default,,0000,0000,0000,,XDP you are restricted to eBPF, meaning\Nusually a restricted subset of C, and then Dialogue: 0,0:30:06.98,0:30:09.67,Default,,0000,0000,0000,,there's the bytecode verifier, but you can\Ndisable the bytecode verifier if you want Dialogue: 0,0:30:09.67,0:30:13.99,Default,,0000,0000,0000,,to, meaning you again have\Nweird restrictions that you maybe don't Dialogue: 0,0:30:13.99,0:30:18.96,Default,,0000,0000,0000,,want. And also XDP requires, not patched\Ndrivers, but a new Dialogue: 0,0:30:18.96,0:30:23.55,Default,,0000,0000,0000,,memory model for the drivers. 
So at the moment\NDPDK supports more drivers than XDP in the Dialogue: 0,0:30:23.55,0:30:26.74,Default,,0000,0000,0000,,kernel, which is kind of weird, and\Nthey're still lacking many features like Dialogue: 0,0:30:26.74,0:30:31.19,Default,,0000,0000,0000,,sending back to a different NIC.\NOne very good use case for XDP is Dialogue: 0,0:30:31.19,0:30:35.34,Default,,0000,0000,0000,,firewalling for applications on the same\Nhost because you can pass on a packet to Dialogue: 0,0:30:35.34,0:30:40.31,Default,,0000,0000,0000,,the TCP stack.\NBut overall, I think that Dialogue: 0,0:30:40.31,0:30:46.76,Default,,0000,0000,0000,,both things are very\Ndifferent and XDP is slightly slower, but Dialogue: 0,0:30:46.76,0:30:51.08,Default,,0000,0000,0000,,it's not slower in such a way that it\Nwould be relevant. So it's fast, to Dialogue: 0,0:30:51.08,0:30:54.96,Default,,0000,0000,0000,,answer the question.\NHerald: All right, unfortunately we are Dialogue: 0,0:30:54.96,0:30:59.17,Default,,0000,0000,0000,,out of time. So that was the last\Nquestion. Thanks again, Paul. Dialogue: 0,0:30:59.17,0:31:07.96,Default,,0000,0000,0000,,{\i1}Applause{\i0} Dialogue: 0,0:31:07.96,0:31:12.77,Default,,0000,0000,0000,,{\i1}34c3 outro{\i0} Dialogue: 0,0:31:12.77,0:31:30.00,Default,,0000,0000,0000,,subtitles created by c3subtitles.de\Nin the year 2018. Join, and help us!