0:00:00.099,0:00:15.690
34c3 intro
0:00:15.690,0:00:20.270
Herald: All right, now it's my great[br]pleasure to introduce Paul Emmerich who is
0:00:20.270,0:00:26.520
going to talk about "Demystifying Network[br]Cards". Paul is a PhD student at the
0:00:26.520,0:00:33.660
Technical University in Munich. He's doing[br]all kinds of network related stuff and[br]
0:00:33.660,0:00:37.950
hopefully today he's gonna help us make[br]network cards a bit less of a black box.
0:00:37.950,0:00:48.530
So, please give a warm welcome to Paul.[br]Applause
0:00:48.530,0:00:50.559
Paul: Thank you and as the introduction
0:00:50.559,0:00:54.649
already said I'm a PhD student and I'm[br]researching performance of software packet
0:00:54.649,0:00:58.319
processing and forwarding systems.[br]That means I spend a lot of time doing
0:00:58.319,0:01:02.559
low-level optimizations and looking into[br]what makes a system fast, what makes it
0:01:02.559,0:01:05.980
slow, what can be done to improve it[br]and I'm mostly working on my packet
0:01:05.980,0:01:09.770
generator MoonGen.[br]I have some cross-promotion of a lightning
0:01:09.770,0:01:13.490
talk about this on Saturday but here I[br]have this long slot
0:01:13.490,0:01:17.550
and I brought a lot of content here so I[br]have to talk really fast so sorry for the
0:01:17.550,0:01:20.560
translators, and I hope you can mostly[br]follow along.
0:01:20.560,0:01:24.920
So: this is about Network cards meaning[br]network cards you all have seen. This is a
0:01:24.920,0:01:30.369
usual 10G network card with the SFP+ port[br]and this is a faster network card with a
0:01:30.369,0:01:35.359
QSFP+ port. This is 20, 40, or 100G[br]and now you bought this fancy network
0:01:35.359,0:01:38.229
card, you plug it into your server or your[br]macbook or whatever,
0:01:38.229,0:01:41.520
and you start your web server that serves[br]cat pictures and cat videos.
0:01:41.520,0:01:45.739
You all know that there's a whole stack of[br]protocols that your cat picture has to go
0:01:45.739,0:01:48.089
through until it arrives at a network card[br]at the bottom
0:01:48.089,0:01:52.120
and the only thing that I care about are[br]the lower layers. I don't care about TCP,
0:01:52.120,0:01:55.520
I have no idea how TCP works.[br]Well I have some idea how it works, but
0:01:55.520,0:01:57.701
this is not my research, I don't care[br]about it.
0:01:57.701,0:02:01.280
I just want to look at individual packets[br]and the highest thing I look at it's maybe
0:02:01.280,0:02:07.729
an IP address or maybe a part of the[br]protocol to identify flows or anything.
0:02:07.729,0:02:11.050
Now you might wonder: Is there anything[br]even interesting in these lower layers?
0:02:11.050,0:02:15.080
Because people nowadays think that[br]everything runs on top of HTTP,
0:02:15.080,0:02:19.160
but you might be surprised that not all[br]applications run on top of HTTP.
0:02:19.160,0:02:23.380
There is a lot of software that needs to[br]run at these lower levels and in the
0:02:23.380,0:02:26.150
recent years[br]there is a trend of moving network
0:02:26.150,0:02:30.810
infrastructure stuff from specialized[br]hardware black boxes to open software
0:02:30.810,0:02:33.220
boxes[br]and examples for such software that was
0:02:33.220,0:02:37.780
hardware in the past are: routers, switches,[br]firewalls, middle boxes and so on.
0:02:37.780,0:02:40.420
If you want to look up the relevant[br]buzzwords: It's Network Function
0:02:40.420,0:02:45.850
Virtualization it's called, and this[br]is a trend of the recent years.
0:02:45.850,0:02:50.610
Now let's say we want to build our own[br]fancy application on that low-level thing.
0:02:50.610,0:02:55.120
We want to build our firewall router[br]packet forward modifier thing that does
0:02:55.120,0:02:59.410
whatever useful on that lower layer for[br]network infrastructure
0:02:59.410,0:03:03.760
and I will use this application as a demo[br]application for this talk as everything
0:03:03.760,0:03:08.310
will be about this hypothetical router[br]firewall packet forward modifier thing.
0:03:08.310,0:03:11.800
What it does: It receives packets on one[br]or multiple network interfaces, it does
0:03:11.800,0:03:16.270
stuff with the packets - filter them,[br]modify them, route them
0:03:16.270,0:03:19.980
and sends them out to some other port or[br]maybe the same port or maybe multiple
0:03:19.980,0:03:23.140
ports - whatever these low-level[br]applications do.
0:03:23.140,0:03:27.540
And this means the application operates on[br]individual packets, not a stream of TCP
0:03:27.540,0:03:31.300
packets, not a stream of UDP packets; it[br]has to cope with small packets.
0:03:31.300,0:03:34.200
Because that's just the worst case: You[br]get a lot of small packets.
0:03:34.200,0:03:37.760
Now you want to build the application. You[br]go to the Internet and you look up: How to
0:03:37.760,0:03:41.290
build a packet forwarding application?[br]The internet tells you: There is the
0:03:41.290,0:03:46.040
socket API, the socket API is great and it[br]allows you to get packets to your program.
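(A hedged aside, not code from the talk: just to make concrete what "building on the socket API" means on Linux, a minimal sketch of receiving raw Ethernet frames with an AF_PACKET socket looks roughly like this. It needs root or CAP_NET_RAW, and every received packet pays one kernel/user crossing.)

#include <stdio.h>
#include <sys/socket.h>
#include <linux/if_packet.h>
#include <linux/if_ether.h>   /* ETH_P_ALL */
#include <arpa/inet.h>        /* htons */

int main(void) {
    /* One raw socket that sees every Ethernet frame the kernel passes up. */
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0) { perror("socket"); return 1; }
    unsigned char frame[2048];
    for (;;) {
        /* Each recv() is a system call, i.e. one trip across the
         * user/kernel boundary per packet. */
        ssize_t len = recv(fd, frame, sizeof(frame), 0);
        if (len > 0)
            printf("got a %zd byte frame\n", len);
    }
}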
0:03:46.040,0:03:50.080
So you build your application on top of[br]the socket API: it runs in userspace, you use
0:03:50.080,0:03:52.930
your socket, the socket talks to the[br]operating system,
0:03:52.930,0:03:56.030
the operating system talks to the driver[br]and the driver talks to the network cards,
0:03:56.030,0:03:59.340
and everything is fine, except that it[br]isn't,
0:03:59.340,0:04:02.080
because what it really looks like if you[br]build this application:
0:04:02.080,0:04:07.460
There is this huge scary big gap between[br]user space and kernel space and you
0:04:07.460,0:04:13.170
somehow need your packets to go across[br]that without being eaten.
0:04:13.170,0:04:16.359
You might wonder why I said this is a big[br]deal and a huge deal that you have this
0:04:16.359,0:04:19.399
gap in there,[br]because you think: "Well, my web server
0:04:19.399,0:04:23.120
serving cat pictures is doing just fine on[br]a fast connection."
0:04:23.120,0:04:28.890
Well, it is because it is serving large[br]packets or even large chunks of files that
0:04:28.890,0:04:33.930
it sends out at once,[br]like you can take your whole
0:04:33.930,0:04:36.510
cat video, give it to the kernel and the[br]kernel will handle everything
0:04:36.510,0:04:42.800
from packetizing it to TCP.[br]But what we want to build is an application
0:04:42.800,0:04:47.640
that needs to cope with the worst case of[br]lots of small packets coming in,
0:04:47.640,0:04:53.600
and then the overhead that you get here[br]from this gap is mostly on a packet basis
0:04:53.600,0:04:57.421
not on a per-byte basis.[br]So, lots of small packets are a problem
0:04:57.421,0:05:00.690
for this interface.[br]When I say "problem" I'm always talking
0:05:00.690,0:05:03.240
about performance, because I mostly care about[br]performance.
0:05:03.240,0:05:09.390
So if you look at performance... a few[br]figures to get started:
0:05:09.390,0:05:13.250
well how many packets can you fit over[br]your usual 10G link? That's around fifteen
0:05:13.250,0:05:17.810
million packets per second.[br]But 10G, that's last year's news; this year
0:05:17.810,0:05:21.370
you have multiple hundred G connections[br]even to this location here.
0:05:21.370,0:05:28.280
So a 100G link can handle up to 150 million[br]packets per second, and, well, how long
0:05:28.280,0:05:32.819
does that give us if we have a CPU?[br]And say we have a three gigahertz CPU in
0:05:32.819,0:05:37.260
our Macbook running the router and that[br]means we have around 200 cycles per packet
0:05:37.260,0:05:40.400
if we want to handle one 10G link with one[br]CPU core.
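(A quick sanity check of that number, assuming minimum-size 64-byte packets, which come to roughly 14.88 million packets per second on a 10 Gbit/s link: 3,000,000,000 cycles/s divided by 14,880,000 packets/s is about 200 cycles per packet. The 300 to 600 cycle budget that comes up a moment later falls out of the same division with the 5 to 10 million packets per second per core target.)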
0:05:40.400,0:05:46.000
Okay, we of course have[br]multiple cores. But you also have
0:05:46.000,0:05:50.430
multiple links, and faster links than 10G.[br]So the typical performance target that you
0:05:50.430,0:05:54.510
would aim for when building such an[br]application is five to ten million packets
0:05:54.510,0:05:56.880
per second per CPU core per thread that[br]you start.
0:05:56.880,0:06:00.550
That's the usual target. And that is[br]just for forwarding, just to receive the
0:06:00.550,0:06:05.630
packet and to send it back out. Everything[br]else, that is, all the remaining cycles,
0:06:05.630,0:06:09.110
can be used for your application.[br]So we don't want any big overhead just for
0:06:09.110,0:06:11.700
receiving and sending them without doing[br]any useful work.
0:06:11.700,0:06:20.370
So these figures translate to[br]around 300 to 600 cycles per packet on a
0:06:20.370,0:06:24.380
three gigahertz CPU core. Now, how long[br]does it take to cross that userspace
0:06:24.380,0:06:30.860
boundary? Well, very very very long for an[br]individual packet. So in some performance
0:06:30.860,0:06:34.620
measurements, if you do single-core packet[br]forwarding with a raw socket, you
0:06:34.620,0:06:38.920
can maybe achieve 300,000 packets per[br]second, if you use libpcap, you can
0:06:38.920,0:06:42.740
achieve a million packets per second.[br]These figures can be tuned. You can maybe
0:06:42.740,0:06:46.080
get factor two out of that by some tuning,[br]but there are more problems, like
0:06:46.080,0:06:50.340
multicore scaling is unnecessarily hard[br]and so on, so this doesn't really seem to
0:06:50.340,0:06:54.800
work. So the boundary is the problem, so[br]let's get rid of the boundary by just
0:06:54.800,0:06:59.310
moving the application into the kernel. We[br]rewrite our application as a kernel module
0:06:59.310,0:07:04.330
and use it directly. You might think "what[br]an incredibly stupid idea, to write kernel
0:07:04.330,0:07:08.580
code for something that clearly should be[br]user space". Well, it's not that
0:07:08.580,0:07:11.949
unreasonable, there are lots of examples[br]of applications doing this, like a certain
0:07:11.949,0:07:16.850
web server by Microsoft runs as a kernel[br]module, the latest Linux kernel has TLS
0:07:16.850,0:07:20.850
offloading, to speed that up. Another[br]interesting use case is Open vSwitch, that
0:07:20.850,0:07:24.170
has a fast internal cache that just[br]caches stuff and does the complex processing
0:07:24.170,0:07:27.419
in a userspace thing, so it's not[br]completely unreasonable.
0:07:27.419,0:07:30.890
But it comes with a lot of drawbacks, like[br]it's very cumbersome to develop, most your
0:07:30.890,0:07:34.930
usual tools don't work or don't work as[br]expected, you have to follow the usual
0:07:34.930,0:07:38.000
kernel restrictions, like you have to use[br]C as a programming language, which you
0:07:38.000,0:07:42.260
maybe don't want to, and your application[br]can and will crash the kernel, which can
0:07:42.260,0:07:46.750
be quite bad. But lets not care about the[br]restrictions, we wanted to fix
0:07:46.750,0:07:50.530
performance, so same figures again: We[br]have 300 to 600 cycles to receive and send
0:07:50.530,0:07:54.660
a packet. What I did: I tested this, I[br]profiled the Linux kernel to see how long
0:07:54.660,0:07:58.840
does it take to receive a packet until I[br]can do some useful work on it. This is an
0:07:58.840,0:08:03.550
average cost of a longer profiling run. So[br]on average it takes 500 cycles just to
0:08:03.550,0:08:08.010
receive the packet. Well, that's bad but[br]sending it out is slightly faster and
0:08:08.010,0:08:11.490
again, we are now over our budget. Now you[br]might think "what else do I need to do
0:08:11.490,0:08:15.639
besides receiving and sending the packet?"[br]There is some more overhead: you
0:08:15.639,0:08:20.710
need some time for the sk_buff, the data[br]structure used in the kernel for all
0:08:20.710,0:08:24.910
packet buffers, and this is quite bloated,[br]old, big data structure that is growing
0:08:24.910,0:08:29.760
bigger and bigger with each release and[br]this takes another 400 cycles. So if you
0:08:29.760,0:08:32.999
measure a real world application, single[br]core packet forwarding with Open vSwitch
0:08:32.999,0:08:36.429
with the minimum processing possible: one[br]OpenFlow rule that matches on physical
0:08:36.429,0:08:40.529
ports, and the processing I profiled[br]at around 200 cycles per packet.
0:08:40.529,0:08:44.790
And the overhead of the kernel is[br]another thousand-something cycles, so in
0:08:44.790,0:08:49.360
the end you achieve two million packets[br]per second - and this is faster than our
0:08:49.360,0:08:55.320
user space stuff, but still kind of slow;[br]well, we want to be faster.
0:08:55.320,0:08:59.220
And the currently hottest topic in the Linux[br]kernel, which I'm not talking about, is
0:08:59.220,0:09:03.040
XDP. This fixes some of these problems but[br]comes with new restrictions. I cut that
0:09:03.040,0:09:10.079
from my talk for time reasons, so let's[br]just talk about everything but XDP. So the problem
0:09:10.079,0:09:14.439
was that we wanted[br]to move the application to the kernel
0:09:14.439,0:09:17.680
space and it didn't work, so can we[br]instead move stuff from the kernel to the
0:09:17.680,0:09:22.160
user space? Well, yes we can. There are[br]libraries called "user space packet
0:09:22.160,0:09:25.660
processing frameworks". They come in two[br]parts: One is a library, you link your
0:09:25.660,0:09:29.209
program against in the user space, and one[br]is a kernel module. These two parts
0:09:29.209,0:09:34.199
communicate and they set up shared mapped[br]memory, and this shared mapped memory is
0:09:34.199,0:09:37.770
used to directly communicate from your[br]application to the driver. You directly
0:09:37.770,0:09:41.209
fill the packet buffers that the driver[br]then sends out and this is way faster.
0:09:41.209,0:09:44.379
And you might have noticed that the[br]operating system box here is not connected
0:09:44.379,0:09:47.349
to anything. That means your operating[br]system doesn't even know that the network
0:09:47.349,0:09:51.589
card is there in most cases, this can be[br]quite annoying. But there are quite a few
0:09:51.589,0:09:58.000
such frameworks; the biggest examples are[br]netmap, PF_RING and PFQ, and they come with
0:09:58.000,0:10:02.170
restrictions, like there is a non-standard[br]API, you can't port between one framework
0:10:02.170,0:10:06.180
and another, or between a framework and the[br]kernel or sockets; there's a custom kernel
0:10:06.180,0:10:10.650
module required, most of these frameworks[br]require some small patches to the drivers,
0:10:10.650,0:10:15.699
it's just a mess to maintain and of course[br]they need exclusive access to the network
0:10:15.699,0:10:18.970
card, because this one application is talking
0:10:18.970,0:10:23.540
directly to the network card.[br]Ok, and the next thing is you lose the
0:10:23.540,0:10:27.759
access to the usual kernel features, which[br]can be quite annoying and then there's
0:10:27.759,0:10:30.970
often poor support for hardware offloading[br]features of the network cards, because
0:10:30.970,0:10:33.970
they are often found in different parts of the[br]kernel that we no longer have reasonable
0:10:33.970,0:10:37.679
access to. And of course with these frameworks[br]we talk directly to the network card,
0:10:37.679,0:10:41.529
meaning we need support for each network[br]card individually. Usually they just
0:10:41.529,0:10:46.000
support one to two or maybe three NIC[br]families, which can be quite restricting,
0:10:46.000,0:10:50.579
if you don't have that specific NIC that[br]is supported. But can we do an even more
0:10:50.579,0:10:54.790
radical approach, because we have all[br]these problems with kernel dependencies
0:10:54.790,0:10:59.189
and so on? Well, turns out we can get rid[br]of the kernel entirely and move everything
0:10:59.189,0:11:03.650
into one application. This means we take[br]our driver put it in the application, the
0:11:03.650,0:11:08.050
driver directly accesses the network card[br]and sets up DMA memory in the user
0:11:08.050,0:11:11.579
space, because the network card doesn't[br]care where it copies the packets from. We
0:11:11.579,0:11:14.739
just have to set up the pointers in the[br]right way and we can build this framework
0:11:14.739,0:11:17.410
like this, that everything runs in the[br]application.
0:11:17.410,0:11:23.459
We remove the driver from the kernel, no[br]kernel driver running and this is super
0:11:23.459,0:11:27.649
fast and we can also use this to implement[br]crazy and obscure hardware features and
0:11:27.649,0:11:31.420
network cards that are not supported by[br]the standard driver. Now I'm not the first
0:11:31.420,0:11:36.200
one to do this; there are two big[br]frameworks that do that: One is DPDK,
0:11:36.200,0:11:41.060
which is quite big. This is a Linux[br]Foundation project and it has basically
0:11:41.060,0:11:44.709
support by all NIC vendors, meaning[br]everyone who builds a high-speed NIC
0:11:44.709,0:11:49.209
writes a driver that works for DPDK and[br]the second such framework is Snabb, which
0:11:49.209,0:11:54.139
I think is quite interesting, because it[br]doesn't write the drivers in C but is
0:11:54.139,0:11:58.290
entirely written in Lua, the scripting[br]language, so it is kind of nice to see a
0:11:58.290,0:12:02.999
driver that's written in a scripting[br]language. Okay, what problems did we solve
0:12:02.999,0:12:06.679
and what problems did we now gain? One [br]problem is we still have the non-standard
0:12:06.679,0:12:11.329
API, we still need exclusive access to the[br]network card from one application, because
0:12:11.329,0:12:15.189
the driver runs in that thing, so there's[br]some hardware tricks to solve that, but
0:12:15.189,0:12:18.329
mainly it's one application that is[br]running.
0:12:18.329,0:12:22.459
Then the framework needs explicit support[br]for all the unique models out there. It's
0:12:22.459,0:12:26.369
not that big a problem with DPDK, because[br]it's such a big project that virtually
0:12:26.369,0:12:31.319
every NIC vendor has a DPDK driver. And[br]yes, there's limited support for interrupts, but
0:12:31.319,0:12:34.170
it turns out interrupts are not something[br]that is useful, when you are building
0:12:34.170,0:12:37.999
something that processes more than a few[br]hundred thousand packets per second,
0:12:37.999,0:12:41.379
because the overhead of the interrupt is[br]just too large, it's just mainly a power
0:12:41.379,0:12:44.839
saving thing, if you ever run into low[br]load. But I don't care about the low load
0:12:44.839,0:12:50.410
scenario and power saving, so for me it's[br]polling all the way and all the CPU. And
0:12:50.410,0:12:55.260
you of course lose all the access to the[br]usual kernel features. And, well, time to
0:12:55.260,0:12:59.880
ask "what has the kernel ever done for[br]us?" Well, the kernel has lots of mature
0:12:59.880,0:13:03.139
drivers. Okay, what has the kernel ever[br]done for us, except for all these nice
0:13:03.139,0:13:07.639
mature drivers? There are very nice[br]protocol implementations that actually
0:13:07.639,0:13:10.220
work, like the kernel TCP stack is a work[br]of art.
0:13:10.220,0:13:14.319
It actually works in real world scenarios,[br]unlike all these other TCP stacks that
0:13:14.319,0:13:18.410
fail under some things or don't support[br]the features we want, so there is quite
0:13:18.410,0:13:22.509
some nice stuff. But what has the kernel[br]ever done for us, except for these mature
0:13:22.509,0:13:26.799
drivers and these nice protocol stack[br]implementations? Okay, quite a few things
0:13:26.799,0:13:32.870
and we are all throwing them out. And one[br]thing to notice: We mostly don't care
0:13:32.870,0:13:37.610
about these features, when building our[br]packet forward modify router firewall
0:13:37.610,0:13:44.349
thing, because these are mostly high-level[br]features, I think. But it's still a
0:13:44.349,0:13:49.199
lot of features that we are losing, like[br]building a TCP stack on top of these
0:13:49.199,0:13:52.999
frameworks is kind of an unsolved problem.[br]There are TCP stacks but they all suck in
0:13:52.999,0:13:58.409
different ways. Ok, we lost features but[br]we didn't care about the features in the
0:13:58.409,0:14:02.640
first place, we wanted performance.[br]Back to our performance figures: we have 300
0:14:02.640,0:14:06.490
to 600 cycles per packet[br]available; how long does it take in, for
0:14:06.490,0:14:10.899
example, DPDK to receive and send a[br]packet? That is around a hundred cycles to
0:14:10.899,0:14:15.239
get a packet through the whole stack, from[br]receiving a packet, processing
0:14:15.239,0:14:19.660
it, well, not processing it but getting it[br]to the application and back to the driver
0:14:19.660,0:14:23.080
to send it out. A hundred cycles and the[br]other frameworks typically play in the
0:14:23.080,0:14:27.709
same league. DPDK is slightly faster than[br]the other ones, because it's full of magic
0:14:27.709,0:14:33.000
SSE and AVX intrinsics and the driver is[br]kind of black magic but it's super fast.
0:14:33.000,0:14:37.480
Now in a kind of real-world scenario: Open[br]vSwitch, as I mentioned as an example
0:14:37.480,0:14:41.689
earlier, was 2 million packets per second in[br]the kernel version, and Open vSwitch can be
0:14:41.689,0:14:45.220
compiled with an optional DPDK backend, so[br]you set some magic flags when compiling,
0:14:45.220,0:14:49.729
then it links against DPDK and uses the[br]network card directly, runs completely in
0:14:49.729,0:14:54.709
userspace and now it's a factor of around[br]6 or 7 faster and we can achieve 13
0:14:54.709,0:14:58.429
million packets per second with around[br]the same processing steps on a
0:14:58.429,0:15:03.119
single CPU core. So, great, where do[br]the performance gains come from? Well,
0:15:03.119,0:15:08.129
there are two things: Mainly it's compared[br]to the kernel, not compared to sockets.
0:15:08.129,0:15:13.290
What people often say is that this is[br]zero-copy, which is a stupid term, because
0:15:13.290,0:15:18.279
the kernel doesn't copy packets either, so[br]it's not copying packets that was slow, it
0:15:18.279,0:15:22.299
was other things. Mainly it's batching,[br]meaning it's very efficient to process a
0:15:22.299,0:15:28.619
relatively large number of packets at once[br]and that really helps and the thing has
0:15:28.619,0:15:32.509
reduced memory overhead: the sk_buff data[br]structure is really big, and if you cut
0:15:32.509,0:15:37.319
that down you save a lot of cycles. About these[br]DPDK figures: DPDK, unlike
0:15:37.319,0:15:42.679
some other frameworks, has memory[br]management, and this is already included
0:15:42.679,0:15:46.549
in these 50 cycles.[br]Okay, now we know that these frameworks
0:15:46.549,0:15:52.009
exist and everything, and the next obvious[br]question is: "Can we build our own
0:15:52.009,0:15:57.689
driver?" Well, but why? First for fun,[br]obviously, and then to understand how that
0:15:57.689,0:16:01.159
stuff works; how these drivers work,[br]how these packet processing frameworks
0:16:01.159,0:16:04.679
work.[br]In my work in academia I've
0:16:04.679,0:16:07.840
seen a lot of people using these[br]frameworks. It's nice, because they are
0:16:07.840,0:16:12.260
fast and they enable a few things, that[br]just weren't possible before. But people
0:16:12.260,0:16:16.170
often treat these as magic black boxes: you[br]put your packet in and then it magically
0:16:16.170,0:16:20.429
is faster and sometimes I don't blame[br]them. If you look at DPDK source code,
0:16:20.429,0:16:24.269
there are more than 20,000 lines of code[br]for each driver. And just for example,
0:16:24.269,0:16:28.809
looking at the receive and transmit[br]functions of the ixgbe driver in DPDK,
0:16:28.809,0:16:33.769
this is one file with around 3,000 lines[br]of code and they do a lot of magic, just
0:16:33.769,0:16:37.950
to receive and send packets. No one wants[br]to read through that, so the question is:
0:16:37.950,0:16:40.960
"How hard can it be to write your own[br]driver?"
0:16:40.960,0:16:44.850
Turns out: It's quite easy! This was like[br]a weekend project. I have written the
0:16:44.850,0:16:48.369
driver called ixy. It's less than a[br]thousand lines of C code. That is the full
0:16:48.369,0:16:53.559
driver for 10G network cards, the full[br]framework to build applications on, and two
0:16:53.559,0:16:58.099
simple example applications. Took me like[br]less than two days to write it completely,
0:16:58.099,0:17:00.897
then two more days to debug it and fix[br]performance.
0:17:02.385,0:17:08.209
So I've been building this driver on the[br]Intel IXGBE family. This is a family of
0:17:08.209,0:17:13.041
network cards that you will know of if you[br]ever had a server with one, because
0:17:13.041,0:17:17.639
almost all servers that have 10G[br]connections have these Intel cards. And
0:17:17.639,0:17:22.829
they are also embedded in some Xeon CPUs.[br]They are also onboard chips on many
0:17:22.829,0:17:29.480
mainboards and the nice thing about them[br]is that they have a publicly available data
0:17:29.480,0:17:33.620
sheet. Meaning Intel publishes these 1,000[br]pages of PDF that describe everything
0:17:33.620,0:17:37.140
you ever wanted to know, when writing a[br]driver for these. And the next nice thing
0:17:37.140,0:17:41.324
is that there is almost no logic hidden[br]behind black-box magic firmware. Many
0:17:41.324,0:17:46.210
newer network cards -especially Mellanox,[br]the newer ones- hide a lot of
0:17:46.210,0:17:50.120
functionality behind the firmware, and the[br]driver mostly just exchanges messages
0:17:50.120,0:17:54.169
with the firmware, which is kind of[br]boring; with this family, that is not
0:17:54.169,0:17:58.340
the case, which I think is very nice. So[br]how can we build a driver for this in four
0:17:58.340,0:18:02.884
very simple steps? One: We remove the[br]driver that is currently loaded, because
0:18:02.884,0:18:07.600
we don't want it to interfere with our[br]stuff. Okay, easy so far. Second, we
0:18:07.600,0:18:12.590
memory-map the PCIe memory-mapped I/O[br]address space. This allows us to access
0:18:12.590,0:18:16.430
the PCI Express device. Number three: We[br]figure out the physical addresses of our
0:18:16.430,0:18:22.750
process's address regions and[br]then we use them for DMA. And step four is
0:18:22.750,0:18:26.779
slightly more complicated, than the first[br]three steps, as we write the driver. Now,
0:18:26.779,0:18:31.849
first thing to do: we figure out where[br]our network card is - let's say we have a
0:18:31.849,0:18:35.444
server and we plugged in our network card -[br]then it gets assigned an address on the
0:18:35.444,0:18:39.611
PCI bus. We can figure that out with[br]lspci, this is the address. We need it in
0:18:39.611,0:18:43.429
a slightly different version with the[br]fully qualified ID, and then we can remove
0:18:43.429,0:18:47.775
the kernel driver by telling the currently[br]bound driver to remove that specific ID.
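(A minimal sketch of that unbind step, assuming a hypothetical device at the fully qualified PCI address 0000:03:00.0; the path is the standard sysfs driver interface, not code from the talk.)

#include <stdio.h>

/* Tell whatever driver is currently bound to this device to let go of it.
 * Afterwards the kernel only sees "some PCI device without a driver". */
int unbind_kernel_driver(const char *pci_addr /* e.g. "0000:03:00.0" */) {
    char path[128];
    snprintf(path, sizeof(path),
             "/sys/bus/pci/devices/%s/driver/unbind", pci_addr);
    FILE *f = fopen(path, "w");
    if (!f) return -1;        /* no driver bound, or not running as root */
    fputs(pci_addr, f);       /* writing the ID triggers the unbind */
    fclose(f);
    return 0;
}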
0:18:47.775,0:18:52.100
Now the operating system doesn't know[br]that this is a network card; it doesn't know
0:18:52.100,0:18:55.870
anything, just notes that some PCI device[br]has no driver. Then we write our
0:18:55.870,0:18:59.209
application.[br]This is written in C and we just open
0:18:59.209,0:19:04.207
this magic file in sysfs and[br]we just mmap it. It ain't magic,
0:19:04.207,0:19:08.183
just a normal mmap there. But what we get[br]back is a kind of special memory region.
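(Roughly what that looks like in code, as a sketch: it assumes the device sits at 0000:03:00.0 so that its first BAR shows up as /sys/bus/pci/devices/0000:03:00.0/resource0, needs root, and skips error handling.)

#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map the device's register area (BAR0) into our process. Every load or
 * store through the returned pointer becomes a PCIe read/write to the NIC. */
volatile uint8_t *map_pci_resource(const char *path /* ".../resource0" */) {
    int fd = open(path, O_RDWR);
    struct stat st;
    fstat(fd, &st);
    void *regs = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    close(fd);                /* the mapping stays valid after close */
    return (volatile uint8_t *)regs;
}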
0:19:08.183,0:19:12.160
This is the memory-mapped I/O[br]region of the PCI device's address
0:19:12.160,0:19:17.620
space and this is where all the registers[br]are available. Meaning, I will show you
0:19:17.620,0:19:20.960
what that means in just a second. If we[br]go through the datasheet, there are
0:19:20.960,0:19:25.532
hundreds of pages of tables like this and[br]these tables tell us the registers, that
0:19:25.532,0:19:29.974
exist on that network card, the offset[br]they have and a link to more detailed
0:19:29.974,0:19:34.589
descriptions. And in code that looks like[br]this: For example the LED control register
0:19:34.589,0:19:38.090
is at a certain offset, and in the LED control[br]register
0:19:38.090,0:19:42.522
there are 32 bits with some bit offsets.[br]Bit 7 is called
0:19:42.522,0:19:48.590
LED0_BLINK and if we set that bit in that[br]register, then one of the LEDs will start
0:19:48.590,0:19:53.669
to blink. And we can just do that via our[br]magic memory region, because all the reads
0:19:53.669,0:19:57.682
and writes, that we do to that memory[br]region, go directly over the PCI Express
0:19:57.682,0:20:01.568
bus to the network card and the network[br]card does whatever it wants to do with
0:20:01.568,0:20:03.128
them.[br]It doesn't have to be a register,
0:20:03.128,0:20:08.690
basically it's just a command to send to[br]the network card, and it's just a nice and
0:20:08.690,0:20:11.669
convenient interface to map that into[br]memory. This is a very common technique,
0:20:11.669,0:20:15.098
that you will also find when you do some[br]microprocessor programming or something.
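(As a sketch of what such a register access looks like in code: the LEDCTL offset and the LED0_BLINK bit below are what the ixgbe datasheet documents, but treat the exact values as illustrative and check the datasheet; regs is the pointer returned by the mmap sketched above.)

#include <stdint.h>

#define IXGBE_LEDCTL     0x00200u   /* LED control register offset */
#define IXGBE_LED0_BLINK (1u << 7)  /* bit 7: blink LED0 */

/* These volatile accesses are real PCIe transactions, not cached memory. */
static void set_reg32(volatile uint8_t *regs, uint32_t off, uint32_t val) {
    *(volatile uint32_t *)(regs + off) = val;
}
static uint32_t get_reg32(volatile uint8_t *regs, uint32_t off) {
    return *(volatile uint32_t *)(regs + off);
}

void blink_led0(volatile uint8_t *regs) {
    set_reg32(regs, IXGBE_LEDCTL,
              get_reg32(regs, IXGBE_LEDCTL) | IXGBE_LED0_BLINK);
}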
0:20:16.260,0:20:20.110
So, one thing to note: since this[br]is not memory, that also means it can't
0:20:20.110,0:20:24.111
be cached. There's no cache in between.[br]Each of these accesses will trigger a PCI
0:20:24.111,0:20:29.210
Express transaction and it will take quite[br]some time. Speaking of lots of
0:20:29.210,0:20:32.919
cycles, where lots means like a hundred[br]cycles, which is a lot
0:20:32.919,0:20:37.206
for me.[br]So how do we now handle packets? We now
0:20:37.206,0:20:42.400
have access to these registers, we[br]can read the datasheet and we can write
0:20:42.400,0:20:47.250
the driver, but we need some way to[br]get packets through it. Of course it
0:20:47.250,0:20:51.470
would be possible to transfer packets[br]via this memory-mapped I/O
0:20:51.470,0:20:56.800
region but it's kind of annoying. The[br]second way a PCI Express device
0:20:56.800,0:21:01.429
communicates with your server or MacBook[br]is via DMA, direct memory access, and a
0:21:01.429,0:21:07.536
DMA transfer, unlike the memory-mapped I/O[br]stuff is initiated by the network card and
0:21:07.536,0:21:14.046
this means the network card can just write[br]to arbitrary addresses in main memory.
0:21:14.050,0:21:20.200
And the network card offers so-called[br]rings, which are queue interfaces
0:21:20.200,0:21:22.946
for receiving packets and for sending[br]packets, and there are multiple of these
0:21:22.946,0:21:26.584
interfaces, because this is how you do[br]multi-core scaling. If you want to
0:21:26.584,0:21:30.649
transmit from multiple cores, you allocate[br]multiple queues. Each core sends to one
0:21:30.649,0:21:34.269
queue and the network card just merges[br]these queues in hardware onto the link,
0:21:34.269,0:21:38.789
and on receiving the network card can[br]either hash on the incoming
0:21:38.789,0:21:42.821
packet, like hashing over protocol headers, or[br]you can set explicit filters.
0:21:42.821,0:21:46.630
This is not specific to network cards;[br]most PCI Express devices work like this:
0:21:46.630,0:21:52.000
GPUs have command queues[br]and so on, NVMe PCI Express disks have
0:21:52.000,0:21:56.660
queues, and so on.[br]So let's look at queues using the example of the
0:21:56.660,0:22:01.480
ixgbe family but you will find that most[br]NICs work in a very similar way. There are
0:22:01.480,0:22:04.110
sometimes small differences but mainly[br]they work like this.
0:22:04.344,0:22:08.902
And these rings are just circular buffers[br]filled with so-called DMA descriptors. A
0:22:08.902,0:22:14.180
DMA descriptor is a 16-byte struct and[br]that is eight bytes of a physical pointer
0:22:14.180,0:22:18.960
pointing to some location where more stuff[br]is, and eight bytes of metadata like "I
0:22:18.960,0:22:24.389
fetch the stuff" or "this packet needs[br]VLAN tag offloading" or "this packet had a
0:22:24.389,0:22:27.124
VLAN tag that I removed", information like[br]that is stored in there.
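(A simplified sketch of such a 16-byte descriptor; the field names are illustrative, not the datasheet names, and the real ixgbe descriptors reuse the same 16 bytes with different layouts for "here is a buffer for you" and "I wrote a packet back".)

#include <stdint.h>

struct dma_descriptor {
    uint64_t buffer_addr;   /* 8 bytes: physical pointer to the packet buffer */
    uint64_t metadata;      /* 8 bytes: status ("descriptor done"), packet
                               length, VLAN offloading info and similar flags */
};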
0:22:27.124,0:22:31.200
And what we then need to do is we[br]translate virtual addresses from our
0:22:31.200,0:22:34.509
address space to physical addresses[br]because the PCI Express device of course
0:22:34.509,0:22:39.198
needs physical addresses.[br]And we can do that using procfs,
0:22:39.198,0:22:45.590
in /proc/self/pagemap.[br]The next thing is: we now have this
0:22:45.590,0:22:51.610
queue of DMA descriptors in memory,[br]and this queue itself is also accessed via
0:22:51.610,0:22:57.101
DMA, and it works like[br]you expect a circular ring to work. It has
0:22:57.101,0:23:00.970
a head and a tail, and the head and tail[br]pointers are available via registers in
0:23:00.970,0:23:05.680
the memory-mapped I/O address space. As[br]an image it looks kind of like this: We
0:23:05.680,0:23:09.650
have this descriptor ring in our physical[br]memory to the left full of pointers and
0:23:09.650,0:23:16.000
then we have somewhere else these packets[br]in some memory pool. And one thing to note
0:23:16.000,0:23:20.269
when allocating this kind of memory: There[br]is a small trick you have to do because
0:23:20.269,0:23:25.059
the descriptor ring needs to be in[br]contiguous memory in your physical memory
0:23:25.059,0:23:29.139
and if you just assume that[br]everything that's contiguous in your
0:23:29.139,0:23:34.399
process is also physically contiguous in hardware: no,[br]it isn't. And if you have a bug in there
0:23:34.399,0:23:37.919
and it writes to somewhere else, then[br]your filesystem dies, as I figured out,
0:23:37.919,0:23:43.179
which was not a good thing.[br]So what I'm doing is I'm using
0:23:43.179,0:23:46.789
huge pages, two-megabyte pages; that's enough[br]contiguous memory and that's guaranteed to not have weird gaps.
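(A sketch of both tricks together: grabbing a 2 MB huge page as DMA-able memory and looking up its physical address in /proc/self/pagemap. ixy does something very similar, but this is a simplified illustration; it assumes huge pages have been reserved via /proc/sys/vm/nr_hugepages and that we run as root, since pagemap hides physical frame numbers from unprivileged processes.)

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define HUGE_PAGE_SIZE (2 * 1024 * 1024)

/* One 2 MB huge page: physically contiguous and never swapped out. */
void *alloc_dma_memory(void) {
    void *mem = mmap(NULL, HUGE_PAGE_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    mlock(mem, HUGE_PAGE_SIZE);   /* keep it resident */
    return mem;
}

/* Translate a virtual address of our process to a physical address by
 * reading the corresponding entry from /proc/self/pagemap. */
uintptr_t virt_to_phys(void *virt) {
    long page_size = sysconf(_SC_PAGESIZE);
    int fd = open("/proc/self/pagemap", O_RDONLY);
    uint64_t entry = 0;
    /* one 8-byte entry per virtual page of our address space */
    pread(fd, &entry, sizeof(entry),
          (uintptr_t)virt / page_size * sizeof(entry));
    close(fd);
    /* bits 0-54 hold the physical frame number */
    return (entry & 0x7fffffffffffffULL) * page_size
           + (uintptr_t)virt % page_size;
}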
0:23:46.789,0:23:53.990
So, now to receive packets we need to
0:23:53.990,0:23:58.600
set up the ring, so we tell the network[br]card via memory-mapped I/O the location and
0:23:58.600,0:24:03.070
the size of the ring, then we fill up the[br]ring with pointers to freshly allocated
0:24:03.070,0:24:09.820
memory that is just empty, and now we set[br]the head and tail pointers to tell the network
0:24:09.820,0:24:13.100
card that the queue is full,[br]because the queue is at the moment full;
0:24:13.100,0:24:16.956
it's full of packets. These packets are[br]just not yet filled with anything. And now
0:24:16.956,0:24:20.629
what the NIC does is it fetches one of the[br]DMA descriptors, and as soon as it receives
0:24:20.629,0:24:25.539
a packet it writes the packet via DMA to[br]the location specified in the descriptor and
0:24:25.539,0:24:30.299
increments the head pointer of the queue[br]and it also sets a status flag in the DMA
0:24:30.299,0:24:33.590
descriptor once it's done writing the[br]packet to memory, and this step is
0:24:33.590,0:24:39.610
important because reading back the head[br]pointer via MMIO would be way too slow.
0:24:39.610,0:24:43.330
So instead we check the status flag,[br]because the status flag goes through
0:24:43.330,0:24:47.302
the cache and is already in[br]cache, so we can check that really fast.
0:24:48.794,0:24:52.121
Next step is we periodically poll the[br]status flag. This is the point where
0:24:52.121,0:24:56.009
interrupts might come in useful.[br]There's some misconception: people
0:24:56.009,0:24:59.419
sometimes believe that if you receive a[br]packet then you get an interrupt and the
0:24:59.419,0:25:02.420
interrupt somehow magically contains the[br]packet. No it doesn't. The interrupt just
0:25:02.420,0:25:05.600
contains the information that there is a[br]new packet. After the interrupt you would
0:25:05.600,0:25:12.450
have to poll the status flag anyways. So[br]we now have the packet, we process the
0:25:12.450,0:25:16.170
packet or do whatever, then we reset the[br]DMA descriptor, we can either recycle the
0:25:16.170,0:25:21.653
old packet or allocate a new one and we[br]set the ready flag on the status register
0:25:21.653,0:25:25.529
and we adjust the tail pointer register to[br]tell the network card that we are done
0:25:25.529,0:25:28.389
with this, and we don't have to do that[br]every time, because we don't have to keep the
0:25:28.389,0:25:33.220
queue 100% utilized. We can update[br]the tail pointer only every hundred
0:25:33.220,0:25:37.559
packets or so, and then that's not a[br]performance problem.
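(Putting the receive path just described together, a stripped-down polling loop could look like the sketch below. The names and the "descriptor done" bit are illustrative, the virt_to_phys helper is the one sketched earlier, and real code additionally batches its work and is more careful about where exactly the tail pointer may point.)

#include <stdint.h>

#define RX_RING_SIZE 512
#define STATUS_DONE  (1ull << 0)   /* illustrative "descriptor done" flag */

struct dma_descriptor { uint64_t buffer_addr; uint64_t metadata; };
uintptr_t virt_to_phys(void *virt);  /* from the earlier sketch */

/* rx_ring is the descriptor ring set up earlier, buffers[] holds the virtual
 * addresses whose physical counterparts sit in the descriptors, regs is the
 * mmap'ed register region and rdt_offset the receive tail register offset. */
void rx_poll_loop(volatile struct dma_descriptor *rx_ring, uint8_t **buffers,
                  volatile uint8_t *regs, uint32_t rdt_offset) {
    uint32_t index = 0;
    for (;;) {
        volatile struct dma_descriptor *desc = &rx_ring[index];
        /* 1. Poll the status flag the NIC sets after DMA-ing a packet in;
         *    unlike reading the head pointer via MMIO, this hits the cache. */
        if (!(desc->metadata & STATUS_DONE))
            continue;
        /* 2. The application does whatever it does with buffers[index] here. */
        /* 3. Reset the descriptor: the NIC's write-back overwrote all 16
         *    bytes, so hand back a fresh physical pointer (recycling the
         *    old buffer) and clear the status metadata. */
        desc->buffer_addr = virt_to_phys(buffers[index]);
        desc->metadata = 0;
        index = (index + 1) % RX_RING_SIZE;
        /* 4. Tell the NIC we are done, but only every few packets: each
         *    tail pointer update is an expensive MMIO write. */
        if (index % 64 == 0)
            *(volatile uint32_t *)(regs + rdt_offset) = index;
    }
}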
0:25:37.559,0:25:42.020
Now we have a driver that can receive packets.[br]Next step: transmitting packets; it basically
0:25:42.020,0:25:46.373
works the same. I won't bore you with the[br]details. Then there's of course a lot of
0:25:46.373,0:25:50.600
boring initialization code, and it's[br]just following the datasheet; it's
0:25:50.600,0:25:54.070
like: set this register, set that[br]register, do that and I just coded it down
0:25:54.070,0:25:58.870
from the datasheet and it works, big[br]surprise. So now you know how to write a
0:25:58.870,0:26:03.799
driver like this. A few ideas of what[br]I want to do, or what maybe you want
0:26:03.799,0:26:06.820
to do, with a driver like this: One is of[br]course to look at performance, to look
0:26:06.820,0:26:09.929
at what makes this faster than the kernel;[br]then I want to look at some obscure
0:26:09.929,0:26:12.529
hardware offloading features.[br]In the past I've looked at IPsec
0:26:12.529,0:26:15.840
offloading, which is quite interesting,[br]because the Intel network cards have
0:26:15.840,0:26:19.870
hardware support for IPsec offloading, but[br]none of the Intel drivers had it, and it
0:26:19.870,0:26:24.200
seems to work just fine. So not sure[br]what's going on there. Then security is
0:26:24.200,0:26:29.440
interesting. There are[br]obviously some security implications of
0:26:29.440,0:26:33.399
having the whole driver in a user space[br]process, and I'm wondering about
0:26:33.399,0:26:37.120
how we can use the IOMMU, because it turns[br]out, once we have set up the memory
0:26:37.120,0:26:40.130
mapping we can drop all the privileges, we[br]don't need them.
0:26:40.130,0:26:43.659
And if we set up the IOMMU beforehand to[br]restrict the network card to certain
0:26:43.659,0:26:48.750
things then we could have a safe driver in[br]userspace that can't do anything wrong,
0:26:48.750,0:26:52.264
because it has no privileges, and the network[br]card has no access because it goes through
0:26:52.264,0:26:56.046
the IOMMU and there are performance[br]implications of the IOMMU and so on. Of
0:26:56.046,0:26:59.889
course, support for other NICs: I want to[br]support virtio virtual NICs, and other
0:26:59.889,0:27:03.564
programming languages for the driver would[br]also be interesting. It's just written in
0:27:03.564,0:27:06.686
C because C is the lowest common[br]denominator of programming languages.
0:27:06.991,0:27:12.700
To conclude, check out ixy. It's BSD-[br]licensed on GitHub and the main thing to
0:27:12.700,0:27:16.094
take with you is that drivers are really[br]simple. Don't be afraid of drivers. Don't
0:27:16.094,0:27:20.059
be afraid of writing your own drivers. You can[br]do it in any language and you don't even
0:27:20.059,0:27:23.139
need to add kernel code. Just map the[br]stuff to your process, write the driver
0:27:23.139,0:27:27.019
and do whatever you want. Okay, thanks for[br]your attention.
0:27:27.019,0:27:33.340
Applause
0:27:33.340,0:27:36.079
Herald: You have very few minutes left for
0:27:36.079,0:27:40.529
questions. So if you have a question in[br]the room please go quickly to one of the 8
0:27:40.529,0:27:46.899
microphones in the room. Does the signal[br]angel already have a question ready? I
0:27:46.899,0:27:52.998
don't see anything. Anybody lining up at[br]any microphones?
0:28:07.182,0:28:08.950
Alright, number 6 please.
0:28:09.926,0:28:15.140
Mic 6: As you're not actually using any of[br]the Linux drivers, is there an advantage
0:28:15.140,0:28:19.470
to using Linux here or could you use any[br]open source operating system?
0:28:19.470,0:28:24.200
Paul: I don't know about other operating[br]systems but the only thing I'm using of
0:28:24.200,0:28:28.649
Linux here is the ability to easily map[br]that. For some other operating systems we
0:28:28.649,0:28:32.779
might need a small stub driver that maps[br]the stuff in there. You can check out the
0:28:32.779,0:28:36.820
DPDK FreeBSD port which has a small stub[br]driver that just handles the memory
0:28:36.820,0:28:41.379
mapping.[br]Herald: Here, at number 2.
0:28:41.379,0:28:45.340
Mic 2: Hi, erm, slightly disconnected to[br]the talk, but I just like to hear your
0:28:45.340,0:28:50.880
opinion on smart NICs where they're[br]considering putting CPUs on the NIC
0:28:50.880,0:28:55.279
itself. So you could imagine running Open[br]vSwitch on the CPU on the NIC.
0:28:55.279,0:28:59.530
Paul: Yeah, I have some smart NIC[br]somewhere in some lab and have also done
0:28:59.530,0:29:05.639
work with the NetFPGA. I think that it's[br]very interesting, but it's a
0:29:05.639,0:29:09.820
complicated trade-off, because these smart[br]NICs come with new restrictions and they
0:29:09.820,0:29:13.820
are not dramatically super fast. So it's[br]interesting from a performance
0:29:13.820,0:29:17.610
perspective to see when it's worth it,[br]when it's not worth it and what I
0:29:17.610,0:29:22.100
personally think is that it's probably better to[br]do everything with raw CPU power.
0:29:22.100,0:29:25.200
Mic 2: Thanks.[br]Herald: Alright, before we take the next
0:29:25.200,0:29:29.730
question, just for the people who don't[br]want to stick around for the Q&A. If you
0:29:29.730,0:29:33.720
really do have to leave the room early,[br]please do so quietly, so we can continue
0:29:33.720,0:29:39.440
the Q&A. Number 6, please.[br]Mic 6: So how does the performance of the
0:29:39.440,0:29:42.809
userspace driver compare to the XDP[br]solution?
0:29:42.809,0:29:51.190
Paul: Um, it's slightly faster. But one[br]important thing about XDP is, if you look
0:29:51.190,0:29:54.910
at this, this is still new work and there[br]are a few important
0:29:54.910,0:29:58.340
restrictions like you can write your[br]userspace thing in whatever programming
0:29:58.340,0:30:01.522
language you want. Like I mentioned, Snabb[br]has a driver entirely written in Lua. With
0:30:01.522,0:30:06.985
XDP you are restricted to eBPF, meaning[br]usually a restricted subset of C and then
0:30:06.985,0:30:09.670
there's a bytecode verifier, but you can[br]disable the bytecode verifier if you want
0:30:09.670,0:30:13.990
to, meaning you again have[br]weird restrictions that you maybe don't
0:30:13.990,0:30:18.960
want. And also XDP requires - not patched[br]drivers, but a new
0:30:18.960,0:30:23.550
memory model for the drivers. So at the moment[br]DPDK supports more drivers than XDP in the
0:30:23.550,0:30:26.740
kernel, which is kind of weird, and[br]it's still lacking many features, like
0:30:26.740,0:30:31.187
sending back to a different NIC.[br]One very very good use case for XDP is
0:30:31.187,0:30:35.340
firewalling for applications on the same[br]host because you can pass on a packet to
0:30:35.340,0:30:40.309
the TCP stack and this is a very good use[br]case for XDP. But overall, I think that
0:30:40.309,0:30:46.761
... that both things are very very[br]different and XDP is slightly slower but
0:30:46.761,0:30:51.077
it's not slower in such a way that it[br]would be relevant. So it's fast, to
0:30:51.077,0:30:54.960
answer the question.[br]Herald: All right, unfortunately we are
0:30:54.960,0:30:59.172
out of time. So that was the last[br]question. Thanks again, Paul.
0:30:59.172,0:31:07.957
Applause
0:31:07.957,0:31:12.769
34c3 outro
0:31:12.769,0:31:30.000
subtitles created by c3subtitles.de[br]in the year 2018. Join, and help us!