34C3 - Demystifying Network Cards

  • 0:00 - 0:16
    34c3 intro
  • 0:16 - 0:20
    Herald: All right, now it's my great
    pleasure to introduce Paul Emmerich who is
  • 0:20 - 0:27
    going to talk about "Demystifying Network
    Cards". Paul is a PhD student at the
  • 0:27 - 0:34
    Technical University in Munich. He's doing
    all kinds of network related stuff and
  • 0:34 - 0:38
    hopefully today he's gonna help us make
    network cards a bit less of a black box.
  • 0:38 - 0:49
    So, please give a warm welcome to Paul
    applause
  • 0:49 - 0:51
    Paul: Thank you and as the introduction
  • 0:51 - 0:55
    already said I'm a PhD student and I'm
    researching performance of software packet
  • 0:55 - 0:58
    processing and forwarding systems.
    That means I spend a lot of time doing
  • 0:58 - 1:03
    low-level optimizations and looking into
    what makes a system fast, what makes it
  • 1:03 - 1:06
    slow, what can be done to improve it
    and I'm mostly working on my packet
  • 1:06 - 1:10
    generator MoonGen
    I have some cross promotion of a lightning
  • 1:10 - 1:13
    talk about this on Saturday but here I
    have this long slot
  • 1:13 - 1:18
    and I brought a lot of content here so I
    have to talk really fast so sorry for the
  • 1:18 - 1:21
    translators and I hope you can mainly
    follow along
  • 1:21 - 1:25
    So: this is about Network cards meaning
    network cards you all have seen. This is a
  • 1:25 - 1:30
    usual 10G network card with the SFP+ port
    and this is a faster network card with a
  • 1:30 - 1:35
    QSFP+ port. This is 20, 40, or 100G
    and now you bought this fancy network
  • 1:35 - 1:38
    card, you plug it into your server or your
    macbook or whatever,
  • 1:38 - 1:42
    and you start your web server that serves
    cat pictures and cat videos.
  • 1:42 - 1:46
    You all know that there's a whole stack of
    protocols that your cat picture has to go
  • 1:46 - 1:48
    through until it arrives at a network card
    at the bottom
  • 1:48 - 1:52
    and the only thing that I care about are
    the lower layers. I don't care about TCP,
  • 1:52 - 1:56
    I have no idea how TCP works.
    Well I have some idea how it works, but
  • 1:56 - 1:58
    this is not my research, I don't care
    about it.
  • 1:58 - 2:01
    I just want to look at individual packets
    and the highest thing I look at is maybe
  • 2:01 - 2:08
    an IP address or maybe a part of the
    protocol to identify flows or anything.
  • 2:08 - 2:11
    Now you might wonder: Is there anything
    even interesting in these lower layers?
  • 2:11 - 2:15
    Because people nowadays think that
    everything runs on top of HTTP,
  • 2:15 - 2:19
    but you might be surprised that not all
    applications run on top of HTTP.
  • 2:19 - 2:23
    There is a lot of software that needs to
    run at these lower levels and in the
  • 2:23 - 2:26
    recent years
    there is a trend of moving network
  • 2:26 - 2:31
    infrastructure stuff from specialized
    hardware black boxes to open software
  • 2:31 - 2:33
    boxes
    and examples for such software that was
  • 2:33 - 2:38
    hardware in the past are: routers, switches,
    firewalls, middle boxes and so on.
  • 2:38 - 2:40
    If you want to look up the relevant
    buzzwords: It's Network Function
  • 2:40 - 2:46
    Virtualization is what it's called and this
    is a recent trend of the recent years.
  • 2:46 - 2:51
    Now let's say we want to build our own
    fancy application on that low-level thing.
  • 2:51 - 2:55
    We want to build our firewall router
    packet forward modifier thing that does
  • 2:55 - 2:59
    whatever useful on that lower layer for
    network infrastructure
  • 2:59 - 3:04
    and I will use this application as a demo
    application for this talk as everything
  • 3:04 - 3:08
    will be about this hypothetical router
    firewall packet forward modifier thing.
  • 3:08 - 3:12
    What it does: It receives packets on one
    or multiple network interfaces, it does
  • 3:12 - 3:16
    stuff with the packets - filter them,
    modify them, route them
  • 3:16 - 3:20
    and send them out to some other port or
    maybe the same port or maybe multiple
  • 3:20 - 3:23
    ports - whatever these low-level
    applications do.
  • 3:23 - 3:28
    And this means the application operates on
    individual packets, not a stream of TCP
  • 3:28 - 3:31
    packets, not a stream of UDP packets, they
    have to cope with small packets.
  • 3:31 - 3:34
    Because that's just the worst case: You
    get a lot of small packets.
  • 3:34 - 3:38
    Now you want to build the application. You
    go to the Internet and you look up: How to
  • 3:38 - 3:41
    build a packet forwarding application?
    The internet tells you: There is the
  • 3:41 - 3:46
    socket API, the socket API is great and it
    allows you to get packets to your program.
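    (To make the socket approach concrete, here is a minimal sketch of
    receiving raw frames with an AF_PACKET socket; it is only an
    illustration, and the interface name "eth0" is a placeholder:)

      /* minimal AF_PACKET receive loop (sketch); needs root / CAP_NET_RAW */
      #include <arpa/inet.h>
      #include <linux/if_ether.h>
      #include <linux/if_packet.h>
      #include <net/if.h>
      #include <stdint.h>
      #include <stdio.h>
      #include <string.h>
      #include <sys/socket.h>
      #include <unistd.h>

      int main(void) {
          int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
          if (fd < 0) { perror("socket"); return 1; }

          struct sockaddr_ll addr;
          memset(&addr, 0, sizeof(addr));
          addr.sll_family = AF_PACKET;
          addr.sll_protocol = htons(ETH_P_ALL);
          addr.sll_ifindex = if_nametoindex("eth0");  /* placeholder interface */
          if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
              perror("bind"); return 1;
          }

          uint8_t buf[2048];
          for (;;) {
              /* one syscall per packet - this per-packet overhead is what
                 the rest of the talk is complaining about */
              ssize_t len = recv(fd, buf, sizeof(buf), 0);
              if (len > 0) {
                  /* filter / modify / forward the frame here */
              }
          }
      }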
  • 3:46 - 3:50
    So you build your application on top of
    the socket API. Once in userspace, you use
  • 3:50 - 3:53
    your socket, the socket talks to the
    operating system,
  • 3:53 - 3:56
    the operating system talks to the driver
    and the driver talks to the network cards,
  • 3:56 - 3:59
    and everything is fine except for that it
    isn't
  • 3:59 - 4:02
    because what it really looks like if you
    build this application:
  • 4:02 - 4:07
    There is this huge scary big gap between
    user space and kernel space and you
  • 4:07 - 4:13
    somehow need your packets to go across
    that without being eaten.
  • 4:13 - 4:16
    You might wonder why I said this is a big
    deal and a huge deal that you have this
  • 4:16 - 4:19
    gap in there
    and because you think: "Well, my web server
  • 4:19 - 4:23
    serving cat pictures is doing just fine on
    a fast connection."
  • 4:23 - 4:29
    Well, it is because it is serving large
    packets or even large chunks of files that
  • 4:29 - 4:34
    it sends at once to the kernel,
    like you can take your whole
  • 4:34 - 4:37
    cat video, give it to the kernel and the
    kernel will handle everything
  • 4:37 - 4:43
    from doing... from packetizing it to TCP.
    But what we want to build is an application
  • 4:43 - 4:48
    that needs to cope with the worst case of
    lots of small packets coming in,
  • 4:48 - 4:54
    and then the overhead that you get here
    from this gap is mostly on a packet basis
  • 4:54 - 4:57
    not on a per-byte basis.
    So, lots of small packets are a problem
  • 4:57 - 5:01
    for this interface.
    When I say "problem" I'm always talking
  • 5:01 - 5:03
    about performance because I mostly care about
    performance.
  • 5:03 - 5:09
    So if you look at performance... a few
    figures to get started is...
  • 5:09 - 5:13
    well how many packets can you fit over
    your usual 10G link? That's around fifteen
  • 5:13 - 5:18
    million.
    But 10G that's last year's news, this year
  • 5:18 - 5:21
    you have multiple hundred G connections
    even to this location here.
  • 5:21 - 5:28
    So a 100G link can handle up to 150 million
    packets per second, and, well, how long
  • 5:28 - 5:33
    does that give us if we have a CPU?
    And say we have a three gigahertz CPU in
  • 5:33 - 5:37
    our Macbook running the router and that
    means we have around 200 cycles per packet
  • 5:37 - 5:40
    if we want to handle one 10G link with one
    CPU core.
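    (As a quick back-of-the-envelope check of those numbers: a minimal
    Ethernet frame is 64 bytes plus 20 bytes of preamble and inter-frame
    gap, i.e. 84 bytes = 672 bits on the wire, so 10 Gbit/s / 672 bits
    ≈ 14.88 million packets per second, and 3 GHz / 14.88 Mpps ≈ 200
    cycles per packet.)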
  • 5:40 - 5:46
    Okay we don't want to handle... we have of
    course multiple cores. But you also have
  • 5:46 - 5:50
    multiple links, and faster links than 10G.
    So the typical performance target that you
  • 5:50 - 5:55
    would aim for when building such an
    application is five to ten million packets
  • 5:55 - 5:57
    per second per CPU core per thread that
    you start.
  • 5:57 - 6:01
    That's like a usual target. And that is
    just for forwarding, just to receive the
  • 6:01 - 6:06
    packet and to send it back out. All the
    stuff, that is: all the remaining cycles
  • 6:06 - 6:09
    can be used for your application.
    So we don't want any big overhead just for
  • 6:09 - 6:12
    receiving and sending them without doing
    any useful work.
  • 6:12 - 6:20
    So these figures translate to
    around 300 to 600 cycles per packet, on a
  • 6:20 - 6:24
    three gigahertz CPU core. Now, how long
    does it take to cross that userspace
  • 6:24 - 6:31
    boundary? Well, very very very long for an
    individual packet. So in some performance
  • 6:31 - 6:35
    measurements, if you do single core packet
    forwarding, with a raw socket you
  • 6:35 - 6:39
    can maybe achieve 300,000 packets per
    second, if you use libpcap, you can
  • 6:39 - 6:43
    achieve a million packets per second.
    These figures can be tuned. You can maybe
  • 6:43 - 6:46
    get factor two out of that by some tuning,
    but there are more problems, like
  • 6:46 - 6:50
    multicore scaling is unnecessarily hard
    and so on, so this doesn't really seem to
  • 6:50 - 6:55
    work. So the boundary is the problem, so
    let's get rid of the boundary by just
  • 6:55 - 6:59
    moving the application into the kernel. We
    rewrite our application as a kernel module
  • 6:59 - 7:04
    and use it directly. You might think "what
    an incredibly stupid idea, to write kernel
  • 7:04 - 7:09
    code for something that clearly should be
    user space". Well, it's not that
  • 7:09 - 7:12
    unreasonable, there are lots of examples
    of applications doing this, like a certain
  • 7:12 - 7:17
    web server by Microsoft runs as a kernel
    module, the latest Linux kernel has TLS
  • 7:17 - 7:21
    offloading, to speed that up. Another
    interesting use case is Open vSwitch, that
  • 7:21 - 7:24
    has a fast internal cache that just
    caches stuff and does complex processing
  • 7:24 - 7:27
    in a userspace thing, so it's not
    completely unreasonable.
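    (As a rough idea of what packet processing in the kernel looks like,
    here is a minimal sketch of a netfilter hook module; it is only an
    illustration of the approach, not how Open vSwitch or the other
    examples are actually implemented:)

      /* minimal in-kernel packet hook (illustration only) */
      #include <linux/module.h>
      #include <linux/netfilter.h>
      #include <linux/netfilter_ipv4.h>
      #include <linux/skbuff.h>
      #include <net/net_namespace.h>

      static unsigned int my_hook(void *priv, struct sk_buff *skb,
                                  const struct nf_hook_state *state)
      {
          /* every incoming IPv4 packet passes through here as an sk_buff;
             returning NF_DROP instead would make this a very crude firewall */
          return NF_ACCEPT;
      }

      static struct nf_hook_ops ops = {
          .hook     = my_hook,
          .pf       = NFPROTO_IPV4,
          .hooknum  = NF_INET_PRE_ROUTING,
          .priority = NF_IP_PRI_FIRST,
      };

      static int __init my_init(void)  { return nf_register_net_hook(&init_net, &ops); }
      static void __exit my_exit(void) { nf_unregister_net_hook(&init_net, &ops); }

      module_init(my_init);
      module_exit(my_exit);
      MODULE_LICENSE("GPL");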
  • 7:27 - 7:31
    But it comes with a lot of drawbacks, like
    it's very cumbersome to develop, most of your
  • 7:31 - 7:35
    usual tools don't work or don't work as
    expected, you have to follow the usual
  • 7:35 - 7:38
    kernel restrictions, like you have to use
    C as a programming language, which you
  • 7:38 - 7:42
    maybe don't want to, and your application
    can and will crash the kernel, which can
  • 7:42 - 7:47
    be quite bad. But let's not care about the
    restrictions, we wanted to fix
  • 7:47 - 7:51
    performance, so same figures again: We
    have 300 to 600 cycles to receive and send
  • 7:51 - 7:55
    a packet. What I did: I tested this, I
    profiled the Linux kernel to see how long
  • 7:55 - 7:59
    does it take to receive a packet until I
    can do some useful work on it. This is an
  • 7:59 - 8:04
    average cost of a longer profiling run. So
    on average it takes 500 cycles just to
  • 8:04 - 8:08
    receive the packet. Well, that's bad but
    sending it out is slightly faster and
  • 8:08 - 8:11
    again, we are now over our budget. Now you
    might think "what else do I need to do
  • 8:11 - 8:16
    besides receiving and sending the packet?"
    There is some more overhead: you
  • 8:16 - 8:21
    need some time for the sk_buff, the data
    structure used in the kernel for all
  • 8:21 - 8:25
    packet buffers, and this is a quite bloated,
    old, big data structure that is growing
  • 8:25 - 8:30
    bigger and bigger with each release and
    this takes another 400 cycles. So if you
  • 8:30 - 8:33
    measure a real world application, single
    core packet forwarding with Open vSwitch
  • 8:33 - 8:36
    with the minimum processing possible: One
    OpenFlow rule that matches on physical
  • 8:36 - 8:41
    ports and the processing, I profiled this
    at around 200 cycles per packet.
  • 8:41 - 8:45
    And then the overhead of the kernel is
    another thousand-something cycles, so in
  • 8:45 - 8:49
    the end you achieve two million packets
    per second - and this is faster than our
  • 8:49 - 8:55
    user space stuff but still kind of slow,
    well, we want to be faster, because yeah.
  • 8:55 - 8:59
    And the currently hottest topic in the
    Linux kernel, which I'm not talking about, is
  • 8:59 - 9:03
    XDP. This fixes some of these problems but
    comes with new restrictions. I cut that
  • 9:03 - 9:10
    for my talk for time reasons and so let's
    just talk about not XDP. So the problem
  • 9:10 - 9:14
    was that our application - and we wanted
    to move the application to the kernel
  • 9:14 - 9:18
    space - and it didn't work, so can we
    instead move stuff from the kernel to the
  • 9:18 - 9:22
    user space? Well, yes we can. There are
    libraries called "user space packet
  • 9:22 - 9:26
    processing frameworks". They come in two
    parts: One is a library, you link your
  • 9:26 - 9:29
    program against, in the user space and one
    is a kernel module. These two parts
  • 9:29 - 9:34
    communicate and they set up shared, mapped
    memory and this shared mapped memory is
  • 9:34 - 9:38
    used to directly communicate from your
    application to the driver. You directly
  • 9:38 - 9:41
    fill the packet buffers that the driver
    then sends out and this is way faster.
  • 9:41 - 9:44
    And you might have noticed that the
    operating system box here is not connected
  • 9:44 - 9:47
    to anything. That means your operating
    system doesn't even know that the network
  • 9:47 - 9:52
    card is there in most cases, this can be
    quite annoying. But there are quite a few
  • 9:52 - 9:58
    such frameworks, the biggest examples are
    netmap, PF_RING, and PFQ, and they come with
  • 9:58 - 10:02
    restrictions, like there is a non-standard
    API, you can't port between one framework
  • 10:02 - 10:06
    and the other or one framework in the
    kernel or sockets, there's a custom kernel
  • 10:06 - 10:11
    module required, most of these frameworks
    require some small patches to the drivers,
  • 10:11 - 10:16
    it's just a mess to maintain and of course
    they need exclusive access to the network
  • 10:16 - 10:19
    card, because this one application is talking
  • 10:19 - 10:24
    directly to the network card.
    Ok, and the next thing is you lose the
  • 10:24 - 10:28
    access to the usual kernel features, which
    can be quite annoying and then there's
  • 10:28 - 10:31
    often poor support for hardware offloading
    features of the network cards, because
  • 10:31 - 10:34
    they are often found in different parts of the
    kernel that we no longer have reasonable
  • 10:34 - 10:38
    access to. And of course with these frameworks,
    we talk directly to the network card,
  • 10:38 - 10:42
    meaning we need support for each network
    card individually. Usually they just
  • 10:42 - 10:46
    support one to two or maybe three NIC
    families, which can be quite restricting,
  • 10:46 - 10:51
    if you don't have one of those specific NICs.
    But can we do an even more
  • 10:51 - 10:55
    radical approach, because we have all
    these problems with kernel dependencies
  • 10:55 - 10:59
    and so on? Well, turns out we can get rid
    of the kernel entirely and move everything
  • 10:59 - 11:04
    into one application. This means we take
    our driver, put it in the application, the
  • 11:04 - 11:08
    driver directly accesses the network card
    and sets up DMA memory in the user
  • 11:08 - 11:12
    space, because the network card doesn't
    care, where it copies the packets from. We
  • 11:12 - 11:15
    just have to set up the pointers in the
    right way and we can build this framework
  • 11:15 - 11:17
    like this, that everything runs in the
    application.
  • 11:17 - 11:23
    We remove the driver from the kernel, no
    kernel driver running and this is super
  • 11:23 - 11:28
    fast and we can also use this to implement
    crazy and obscure hardware features and
  • 11:28 - 11:31
    network cards that are not supported by
    the standard driver. Now I'm not the first
  • 11:31 - 11:36
    one to do this, there are two big
    frameworks that do that: One is DPDK,
  • 11:36 - 11:41
    which is quite big. This is a Linux
    Foundation project and it has basically
  • 11:41 - 11:45
    support by all NIC vendors, meaning
    everyone who builds a high-speed NIC
  • 11:45 - 11:49
    writes a driver that works for DPDK and
    the second such framework is Snabb, which
  • 11:49 - 11:54
    I think is quite interesting, because it
    doesn't write the drivers in C but is
  • 11:54 - 11:58
    entirely written in Lua, the scripting
    language, so this is kind of nice to see a
  • 11:58 - 12:03
    driver that's written in a scripting
    language. Okay, what problems did we solve
  • 12:03 - 12:07
    and what problems did we now gain? One
    problem is we still have the non-standard
  • 12:07 - 12:11
    API, we still need exclusive access to the
    network card from one application, because
  • 12:11 - 12:15
    the driver runs in that thing, so there's
    some hardware tricks to solve that, but
  • 12:15 - 12:18
    mainly it's one application that is
    running.
  • 12:18 - 12:22
    Then the framework needs explicit support
    for all the NIC models out there. It's
  • 12:22 - 12:26
    not that big a problem with DPDK, because
    it's such a big project that virtually
  • 12:26 - 12:31
    everyone has a DPDK driver for their NIC. And
    yes, limited support for interrupts but
  • 12:31 - 12:34
    it turns out interrupts are not something
    that is useful, when you are building
  • 12:34 - 12:38
    something that processes more than a few
    hundred thousand packets per second,
  • 12:38 - 12:41
    because the overhead of the interrupt is
    just too large, it's just mainly a power
  • 12:41 - 12:45
    saving thing, if you ever run into low
    load. But I don't care about the low load
  • 12:45 - 12:50
    scenario and power saving, so for me it's
    polling all the way and all the CPU. And
  • 12:50 - 12:55
    you of course lose all the access to the
    usual kernel features. And, well, time to
  • 12:55 - 13:00
    ask "what has the kernel ever done for
    us?" Well, the kernel has lots of mature
  • 13:00 - 13:03
    drivers. Okay, what has the kernel ever
    done for us, except for all these nice
  • 13:03 - 13:08
    mature drivers? There are very nice
    protocol implementations that actually
  • 13:08 - 13:10
    work, like the kernel TCP stack is a work
    of art.
  • 13:10 - 13:14
    It actually works in real world scenarios,
    unlike all these other TCP stacks that
  • 13:14 - 13:18
    fail under some things or don't support
    the features we want, so there is quite
  • 13:18 - 13:23
    some nice stuff. But what has the kernel
    ever done for us, except for these mature
  • 13:23 - 13:27
    drivers and these nice protocol stack
    implementations? Okay, quite a few things
  • 13:27 - 13:33
    and we are throwing them all out. And one
    thing to notice: We mostly don't care
  • 13:33 - 13:38
    about these features, when building our
    packet forward modify router firewall
  • 13:38 - 13:44
    thing, because these are mostly high-level
    features, I think. But it's still a
  • 13:44 - 13:49
    lot of features that we are losing, like
    building a TCP stack on top of these
  • 13:49 - 13:53
    frameworks is kind of an unsolved problem.
    There are TCP stacks but they all suck in
  • 13:53 - 13:58
    different ways. Ok, we lost features but
    we didn't care about the features in the
  • 13:58 - 14:03
    first place, we wanted performance.
    Back to our performance figure we want 300
  • 14:03 - 14:06
    to 600 cycles per packet that we have
    available, how long does it take in, for
  • 14:06 - 14:11
    example, DPDK to receive and send a
    packet? That is around a hundred cycles to
  • 14:11 - 14:15
    get a packet through the whole stack, from
    like receiving a packet, processing
  • 14:15 - 14:20
    it, well, not processing it but getting it
    to the application and back to the driver
  • 14:20 - 14:23
    to send it out. A hundred cycles and the
    other frameworks typically play in the
  • 14:23 - 14:28
    same league. DPDK is slightly faster than
    the other ones, because it's full of magic
  • 14:28 - 14:33
    SSE and AVX intrinsics and the driver is
    kind of black magic but it's super fast.
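    (To give an idea of what the API of such a framework looks like, here
    is roughly the core of a forwarding loop on top of DPDK's public
    rte_ethdev API; this is only a sketch, with EAL, port and memory pool
    setup omitted:)

      #include <rte_ethdev.h>
      #include <rte_mbuf.h>

      #define BURST_SIZE 32

      /* forward everything arriving on rx_port out of tx_port, in batches */
      static void forward_loop(uint16_t rx_port, uint16_t tx_port)
      {
          struct rte_mbuf *bufs[BURST_SIZE];
          for (;;) {
              /* batching: up to 32 packets per call, no syscall involved */
              uint16_t nb_rx = rte_eth_rx_burst(rx_port, 0, bufs, BURST_SIZE);
              if (nb_rx == 0)
                  continue;
              /* ... look at / modify the packets here ... */
              uint16_t nb_tx = rte_eth_tx_burst(tx_port, 0, bufs, nb_rx);
              for (uint16_t i = nb_tx; i < nb_rx; i++)
                  rte_pktmbuf_free(bufs[i]);  /* free what the NIC didn't take */
          }
      }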
  • 14:33 - 14:37
    Now in a kind of real-world scenario, Open
    vSwitch, as I've mentioned as an example
  • 14:37 - 14:42
    earlier, that was 2 million packets for
    the kernel version and Open vSwitch can be
  • 14:42 - 14:45
    compiled with an optional DPDK backend, so
    you set some magic flags when compiling,
  • 14:45 - 14:50
    then it links against DPDK and uses the
    network card directly, runs completely in
  • 14:50 - 14:55
    userspace and now it's a factor of around
    6 or 7 faster and we can achieve 13
  • 14:55 - 14:58
    million packets per second with the same,
    around the same processing step on a
  • 14:58 - 15:03
    single CPU core. So, great, where do
    the performance gains come from? Well,
  • 15:03 - 15:08
    there are two things: Mainly it's compared
    to the kernel, not compared to sockets.
  • 15:08 - 15:13
    What people often say is that this is
    zero copy, which is a stupid term because
  • 15:13 - 15:18
    the kernel doesn't copy packets either, so
    it's not copying packets that was slow, it
  • 15:18 - 15:22
    was other things. Mainly it's batching,
    meaning it's very efficient to process a
  • 15:22 - 15:29
    relatively large number of packets at once
    and that really helps and the thing has
  • 15:29 - 15:33
    reduced memory overhead, the sk_buff data
    structure is really big and if you cut
  • 15:33 - 15:37
    that down you save a lot of cycles. These
    DPDK figures, because DPDK, unlike
  • 15:37 - 15:43
    some other frameworks, has memory
    management, and this is already included
  • 15:43 - 15:47
    in these 50 cycles.
    Okay, now we know that these frameworks
  • 15:47 - 15:52
    exist and everything, and the next obvious
    question is: "Can we build our own
  • 15:52 - 15:58
    driver?" Well, but why? First for fun,
    obviously, and then to understand how that
  • 15:58 - 16:01
    stuff works; how these drivers work,
    how these packet processing frameworks
  • 16:01 - 16:05
    work.
    I've seen in my work in academia; I've
  • 16:05 - 16:08
    seen a lot of people using these
    frameworks. It's nice, because they are
  • 16:08 - 16:12
    fast and they enable a few things, that
    just weren't possible before. But people
  • 16:12 - 16:16
    often treat these as magic black boxes: you
    put your packet in and then it magically
  • 16:16 - 16:20
    is faster and sometimes I don't blame
    them. If you look at DPDK source code,
  • 16:20 - 16:24
    there are more than 20,000 lines of code
    for each driver. And just for example,
  • 16:24 - 16:29
    looking at the receive and transmit
    functions of the ixgbe driver in DPDK,
  • 16:29 - 16:34
    this is one file with around 3,000 lines
    of code and they do a lot of magic, just
  • 16:34 - 16:38
    to receive and send packets. No one wants
    to read through that, so the question is:
  • 16:38 - 16:41
    "How hard can it be to write your own
    driver?"
  • 16:41 - 16:45
    Turns out: It's quite easy! This was like
    a weekend project. I have written the
  • 16:45 - 16:48
    driver called ixy. It's less than a
    thousand lines of C code. That is the full
  • 16:48 - 16:54
    driver for 10 G network cards and the full
    framework to get some applications running and two
  • 16:54 - 16:58
    simple example applications. Took me like
    less than two days to write it completely,
  • 16:58 - 17:01
    then two more days to debug it and fix
    performance.
  • 17:02 - 17:08
    So I've been building this driver on the
    Intel IXGBE family. This is a family of
  • 17:08 - 17:13
    network cards that you know of, if you
    ever had a server to test this. Because
  • 17:13 - 17:18
    almost all servers that have 10G
    connections have these Intel cards. And
  • 17:18 - 17:23
    they are also embedded in some Xeon CPUs.
    They are also onboard chips on many
  • 17:23 - 17:29
    mainboards and the nice thing about them
    is, they have a publicly available data
  • 17:29 - 17:34
    sheet. Meaning Intel publishes this
    1,000-page PDF that describes everything
  • 17:34 - 17:37
    you ever wanted to know, when writing a
    driver for these. And the next nice thing
  • 17:37 - 17:41
    is, that there is almost no logic hidden
    behind the black box magic firmware. Many
  • 17:41 - 17:46
    newer network cards -especially Mellanox,
    the newer ones- hide a lot of
  • 17:46 - 17:50
    functionality behind a firmware and the
    driver mostly just exchanges messages
  • 17:50 - 17:54
    with the firmware, which is kind of
    boring, and with this family, it is not
  • 17:54 - 17:58
    the case, which I think is very nice. So
    how can we build a driver for this in four
  • 17:58 - 18:03
    very simple steps? One: We remove the
    driver that is currently loaded, because
  • 18:03 - 18:08
    we don't want it to interfere with our
    stuff. Okay, easy so far. Second, we
  • 18:08 - 18:13
    memory-map the PCIe memory-mapped I/O
    address space. This allows us to access
  • 18:13 - 18:16
    the PCI Express device. Number three: We
    figure out the physical addresses of our
  • 18:16 - 18:23
    DMA memory regions in our process's address space and
    then we use them for DMA. And step four is
  • 18:23 - 18:27
    slightly more complicated, than the first
    three steps, as we write the driver. Now,
  • 18:27 - 18:32
    first thing to do, we figure out where
    our network card is - let's say we have a
  • 18:32 - 18:35
    server and we plugged in our network card -
    then it gets assigned an address on the
  • 18:35 - 18:40
    PCI bus. We can figure that out with
    lspci, this is the address. We need it in
  • 18:40 - 18:43
    a slightly different version with the
    fully qualified ID, and then we can remove
  • 18:43 - 18:48
    the kernel driver by telling the currently
    bound driver to remove that specific ID.
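    (In code, removing the driver is just a write to a sysfs file; a small
    sketch, with the PCI address as a placeholder:)

      #include <limits.h>
      #include <stdio.h>

      /* unbind whatever kernel driver is bound to the device;
         pci_addr is the fully qualified ID, e.g. "0000:03:00.0" */
      static void remove_kernel_driver(const char *pci_addr)
      {
          char path[PATH_MAX];
          snprintf(path, sizeof(path),
                   "/sys/bus/pci/devices/%s/driver/unbind", pci_addr);
          FILE *f = fopen(path, "w");
          if (f == NULL)
              return;              /* no driver bound, nothing to do */
          fputs(pci_addr, f);      /* the kernel driver releases the device */
          fclose(f);
      }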
  • 18:48 - 18:52
    Now the operating system doesn't know
    that this is a network card; doesn't know
  • 18:52 - 18:56
    anything, just notes that some PCI device
    has no driver. Then we write our
  • 18:56 - 18:59
    application.
    This is written in C and we just open
  • 18:59 - 19:04
    this magic file in sysfs and this magic
    file; we just mmap it. Ain't no magic,
  • 19:04 - 19:08
    just a normal mmap there. But what we get
    back is a kind of special memory region.
  • 19:08 - 19:12
    This is the memory-mapped I/O
    region of the PCI device's address
  • 19:12 - 19:18
    space and this is where all the registers
    are available. Meaning, I will show you
  • 19:18 - 19:21
    what that means in just a second. If we
    go through the datasheet, there are
  • 19:21 - 19:26
    hundreds of pages of tables like this and
    these tables tell us the registers, that
  • 19:26 - 19:30
    exist on that network card, the offset
    they have and a link to more detailed
  • 19:30 - 19:35
    descriptions. And in code that looks like
    this: For example the LED control register
  • 19:35 - 19:38
    is at this offset, and then there is the LED control
    register itself.
  • 19:38 - 19:43
    In this register there are 32 bits, with
    named bits at various offsets. Bit 7 is called
  • 19:43 - 19:49
    LED0_BLINK and if we set that bit in that
    register, then one of the LEDs will start
  • 19:49 - 19:54
    to blink. And we can just do that via our
    magic memory region, because all the reads
  • 19:54 - 19:58
    and writes, that we do to that memory
    region, go directly over the PCI Express
  • 19:58 - 20:02
    bus to the network card and the network
    card does whatever it wants to do with
  • 20:02 - 20:03
    them.
    It doesn't have to be a register,
  • 20:03 - 20:09
    basically it's just a command, to send to
    a network card and it's just a nice and
  • 20:09 - 20:12
    convenient interface to map that into
    memory. This is a very common technique,
  • 20:12 - 20:15
    that you will also find when you do some
    microprocessor programming or something.
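    (A sketch of that LED example in code; the 0x00200 offset of the LED
    control register is taken from the 82599 datasheet, the PCI address is
    a placeholder and error handling is omitted:)

      #include <fcntl.h>
      #include <stdint.h>
      #include <sys/mman.h>
      #include <sys/stat.h>
      #include <unistd.h>

      #define LEDCTL      0x00200    /* LED control register offset (datasheet) */
      #define LED0_BLINK  (1u << 7)  /* bit 7: LED0_BLINK */

      int main(void)
      {
          /* resource0 in sysfs is the device's first MMIO region (BAR0) */
          int fd = open("/sys/bus/pci/devices/0000:03:00.0/resource0", O_RDWR);
          struct stat st;
          fstat(fd, &st);
          volatile uint32_t *regs = mmap(NULL, st.st_size,
                                         PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
          /* this is not a memory access: the write becomes a PCIe transaction */
          regs[LEDCTL / 4] |= LED0_BLINK;
          return 0;
      }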
  • 20:16 - 20:20
    So, and one thing to note is, since this
    is not memory: That also means, it can't
  • 20:20 - 20:24
    be cached. There's no cache in between.
    Each of these accesses will trigger a PCI
  • 20:24 - 20:29
    Express transaction and it will take quite
    some time. Speaking of lots of
  • 20:29 - 20:33
    cycles, where lots means like hundreds of
    cycles or a hundred cycles, which is a lot
  • 20:33 - 20:37
    for me.
    So how do we now handle packets? We now
  • 20:37 - 20:42
    can, we have access to these registers, we
    can read the datasheet and we can write
  • 20:42 - 20:47
    the driver, but we need some way to
    get packets through that. Of course it
  • 20:47 - 20:51
    would be possible to write a network card
    that does that via this memory-mapped I/O
  • 20:51 - 20:57
    region but it's kind of annoying. The
    second way a PCI Express device
  • 20:57 - 21:01
    communicates with your server or macbook
    is via DMA, direct memory access, and a
  • 21:01 - 21:08
    DMA transfer, unlike the memory-mapped I/O
    stuff is initiated by the network card and
  • 21:08 - 21:14
    this means the network card can just write
    to arbitrary addresses in main memory.
  • 21:14 - 21:20
    And for this the network card offers so-called
    rings, which are queue interfaces
  • 21:20 - 21:23
    for receiving packets and for sending
    packets, and there are multiple of these
  • 21:23 - 21:27
    interfaces, because this is how you do
    multi-core scaling. If you want to
  • 21:27 - 21:31
    transmit from multiple cores, you allocate
    multiple queues. Each core sends to one
  • 21:31 - 21:34
    queue and the network card just merges
    these queues in hardware onto the link,
  • 21:34 - 21:39
    and on receiving the network card can
    either hash on the incoming
  • 21:39 - 21:43
    packet like hash over protocol headers or
    you can set explicit filters.
  • 21:43 - 21:47
    This is not specific to network cards;
    most PCI Express devices work like this:
  • 21:47 - 21:52
    GPUs have command queues
    and so on, NVMe PCI Express disks have
  • 21:52 - 21:57
    queues and...
    So let's look at queues using the example of the
  • 21:57 - 22:01
    ixgbe family but you will find that most
    NICs work in a very similar way. There are
  • 22:01 - 22:04
    sometimes small differences but mainly
    they work like this.
  • 22:04 - 22:09
    And these rings are just circular buffers
    filled with so-called DMA descriptors. A
  • 22:09 - 22:14
    DMA descriptor is a 16-byte struct and
    that is eight bytes of a physical pointer
  • 22:14 - 22:19
    pointing to some location where more stuff
    is and eight bytes of metadata like "I
  • 22:19 - 22:24
    fetch the stuff" or "this packet needs
    VLAN tag offloading" or "this packet had a
  • 22:24 - 22:27
    VLAN tag that I removed", information like
    that is stored in there.
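    (In C such a descriptor is just a small struct; this is a simplified,
    condensed version of the advanced receive descriptor from the 82599
    datasheet, not the exact layout:)

      #include <stdint.h>

      /* 16 bytes: 8 bytes of pointer plus 8 bytes of metadata (simplified) */
      struct rx_descriptor {
          uint64_t buffer_addr;          /* physical address of the packet buffer */
          union {
              uint64_t metadata;
              struct {
                  uint32_t status_error; /* e.g. the "descriptor done" flag */
                  uint16_t length;       /* length of the received packet */
                  uint16_t vlan_tag;     /* stripped VLAN tag, if offloading is on */
              } wb;                      /* "write-back" format, filled by the NIC */
          };
      };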
  • 22:27 - 22:31
    And what we then need to do is we
    translate virtual addresses from our
  • 22:31 - 22:35
    address space to physical addresses
    because the PCI Express device of course
  • 22:35 - 22:39
    needs physical addresses.
    And we can do that using procfs:
  • 22:39 - 22:46
    In the /proc/self/pagemap we can do that.
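    (A sketch of that lookup; pagemap has one 8-byte entry per page, bits
    0-54 are the physical page frame number, and reading them requires
    root. Error handling is omitted:)

      #include <fcntl.h>
      #include <stdint.h>
      #include <unistd.h>

      /* translate a virtual address of our own process to a physical address */
      static uintptr_t virt_to_phys(void *virt)
      {
          long page_size = sysconf(_SC_PAGESIZE);
          int fd = open("/proc/self/pagemap", O_RDONLY);
          uint64_t entry = 0;
          pread(fd, &entry, sizeof(entry),
                (uintptr_t)virt / page_size * sizeof(entry));
          close(fd);
          /* physical frame number * page size + offset within the page */
          return (entry & 0x7fffffffffffffULL) * page_size
                 + (uintptr_t)virt % page_size;
      }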
    And the next thing is we now have this
  • 22:46 - 22:52
    queue of DMA descriptors in memory
    and this queue itself is also accessed via
  • 22:52 - 22:57
    DMA, and it works like
    you expect a circular ring to work. It has
  • 22:57 - 23:01
    a head and a tail, and the head and tail
    pointer are available via registers in
  • 23:01 - 23:06
    memory-mapped I/O address space, meaning
    in an image it looks kind of like this: We
  • 23:06 - 23:10
    have this descriptor ring in our physical
    memory to the left full of pointers and
  • 23:10 - 23:16
    then we have somewhere else these packets
    in some memory pool. And one thing to note
  • 23:16 - 23:20
    when allocating this kind of memory: There
    is a small trick you have to do because
  • 23:20 - 23:25
    the descriptor ring needs to be in
    contiguous memory in your physical memory
  • 23:25 - 23:29
    and if you just assume
    everything that's contiguous in your
  • 23:29 - 23:34
    process is also physically contiguous in hardware: No
    it isn't, and if you have a bug in there
  • 23:34 - 23:38
    and then it writes to somewhere else then
    your filesystem dies as I figured out,
  • 23:38 - 23:43
    which was not a good thing.
    So ... we, what I'm doing is I'm using
  • 23:43 - 23:47
    huge pages, two megabyte pages, that's
    enough of contiguous memory and that's
  • 23:47 - 23:54
    guaranteed to not have weird gaps.
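    (One way to get such memory, as a sketch: an anonymous 2 MB huge-page
    mapping, assuming huge pages have been reserved beforehand, e.g. via
    /proc/sys/vm/nr_hugepages:)

      #include <stddef.h>
      #include <sys/mman.h>

      #define HUGE_PAGE_SIZE (2 * 1024 * 1024)

      /* one 2 MB huge page: physically contiguous and not swapped out,
         so its physical address can safely be handed to the NIC for DMA */
      static void *alloc_dma_memory(void)
      {
          void *mem = mmap(NULL, HUGE_PAGE_SIZE, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
          if (mem == MAP_FAILED)
              return NULL;
          mlock(mem, HUGE_PAGE_SIZE);  /* keep it resident, belt and suspenders */
          return mem;
      }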
    So, um ... now to receive packets we need to
  • 23:54 - 23:59
    set up the ring so we tell the network
    card via memory-mapped I/O the location and
  • 23:59 - 24:03
    the size of the ring, then we fill up the
    ring with pointers to freshly allocated
  • 24:03 - 24:10
    memory that are just empty and now we set
    the head and tail pointer to tell the network
  • 24:10 - 24:13
    card that the queue is full,
    because the queue is at the moment full,
  • 24:13 - 24:17
    it's full of packets. These packets are
    just not yet filled with anything. And now
  • 24:17 - 24:21
    what the NIC does, it fetches one of the
    DMA descriptors and as soon as it receives
  • 24:21 - 24:26
    a packet it writes the packet via DMA to
    the location specified in the descriptor and
  • 24:26 - 24:30
    increments the head pointer of the queue
    and it also sets a status flag in the DMA
  • 24:30 - 24:34
    descriptor once it's done writing the
    packet to memory, and this step is
  • 24:34 - 24:40
    important because reading back the head
    pointer via MM I/O would be way too slow.
  • 24:40 - 24:43
    So instead we check the status flag
    because the status flag gets handled by
  • 24:43 - 24:47
    the cache and is already in
    cache so we can check that really fast.
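    (Putting the receive side together, the core of the poll loop looks
    roughly like this; it reuses the simplified rx_descriptor struct from
    above and leaves out buffer management and the batched tail pointer
    update that the talk describes next:)

      #include <stdint.h>

      #define STATUS_DD 0x1   /* "descriptor done", set by the NIC via DMA */

      static void poll_rx(volatile struct rx_descriptor *ring,
                          uint16_t ring_size, uint16_t *index)
      {
          /* checking the flag is a cheap read that hits the CPU cache */
          while (ring[*index].wb.status_error & STATUS_DD) {
              /* the packet data is in the buffer this descriptor points to:
                 filter / modify / forward it here */

              /* hand the slot back: a real driver re-writes the buffer
                 address here; simplified to just clearing the status */
              ring[*index].wb.status_error = 0;
              *index = (uint16_t)((*index + 1) % ring_size);
          }
      }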
  • 24:49 - 24:52
    Next step is we periodically poll the
    status flag. This is the point where
  • 24:52 - 24:56
    interrupts might come in useful.
    There's some misconception: people
  • 24:56 - 24:59
    sometimes believe that if you receive a
    packet then you get an interrupt and the
  • 24:59 - 25:02
    interrupt somehow magically contains the
    packet. No it doesn't. The interrupt just
  • 25:02 - 25:06
    contains the information that there is a
    new packet. After the interrupt you would
  • 25:06 - 25:12
    have to poll the status flag anyways. So
    we now have the packet, we process the
  • 25:12 - 25:16
    packet or do whatever, then we reset the
    DMA descriptor, we can either recycle the
  • 25:16 - 25:22
    old packet or allocate a new one and we
    set the ready flag on the status register
  • 25:22 - 25:26
    and we adjust the tail pointer register to
    tell the network card that we are done
  • 25:26 - 25:28
    with this, and we don't have to do that
    every time because we don't have to keep the
  • 25:28 - 25:33
    queue 100% utilized. We can update
    the tail pointer only every hundred
  • 25:33 - 25:38
    packets or so and then that's not a
    performance problem. What now? We have a
  • 25:38 - 25:42
    driver that can receive packets. Next
    steps, well transmit packets, it basically
  • 25:42 - 25:46
    works the same. I won't bore you with the
    details. Then there's of course a lot of
  • 25:46 - 25:51
    boring initialization code and it's
    just following the datasheet, they are
  • 25:51 - 25:54
    like: set this register, set that
    register, do that and I just coded it down
  • 25:54 - 25:59
    from the datasheet and it works, so big
    surprise. Then now you know how to write a
  • 25:59 - 26:04
    driver like this and a few ideas of what
    ... what I want to do, what maybe you want
  • 26:04 - 26:07
    to do with a driver like this. One, of
    course, is to look at performance, to look
  • 26:07 - 26:10
    at what makes this faster than the kernel,
    then I want to look at some obscure
  • 26:10 - 26:13
    hardware/offloading features.
    In the past I've looked at IPSec
  • 26:13 - 26:16
    offloading, which is quite interesting,
    because the Intel network cards have
  • 26:16 - 26:20
    hardware support for IPSec offloading, but
    none of the Intel drivers had it and it
  • 26:20 - 26:24
    seems to work just fine. So not sure
    what's going on there. Then security is
  • 26:24 - 26:29
    interesting. There are ... there are
    obviously some security implications of
  • 26:29 - 26:33
    having the whole driver in a user space
    process and ... and I'm wondering about
  • 26:33 - 26:37
    how we can use the IOMMU, because it turns
    out, once we have set up the memory
  • 26:37 - 26:40
    mapping we can drop all the privileges, we
    don't need them.
  • 26:40 - 26:44
    And if we set up the IOMMU before to
    restrict the network card to certain
  • 26:44 - 26:49
    things then we could have a safe driver in
    userspace that can't do anything wrong,
  • 26:49 - 26:52
    because it has no privileges and the network
    card has no access because it goes through
  • 26:52 - 26:56
    the IOMMU and there are performance
    implications of the IOMMU and so on. Of
  • 26:56 - 27:00
    course, support for other NICs. I want to
    support virtIO, virtual NICs and other
  • 27:00 - 27:04
    programming languages for the driver would
    also be interesting. It's just written in
  • 27:04 - 27:07
    C because C is the lowest common
    denominator of programming languages.
  • 27:07 - 27:13
    To conclude, check out ixy. It's
    BSD-licensed on GitHub and the main thing to
  • 27:13 - 27:16
    take with you is that drivers are really
    simple. Don't be afraid of drivers. Don't
  • 27:16 - 27:20
    be afraid of writing your drivers. You can
    do it in any language and you don't even
  • 27:20 - 27:23
    need to add kernel code. Just map the
    stuff to your process, write the driver
  • 27:23 - 27:27
    and do whatever you want. Okay, thanks for
    your attention.
  • 27:27 - 27:33
    Applause
  • 27:33 - 27:36
    Herald: You have very few minutes left for
  • 27:36 - 27:41
    questions. So if you have a question in
    the room please go quickly to one of the 8
  • 27:41 - 27:47
    microphones in the room. Does the signal
    angel already have a question ready? I
  • 27:47 - 27:53
    don't see anything. Anybody lining up at
    any microphones?
  • 28:07 - 28:09
    Alright, number 6 please.
  • 28:10 - 28:15
    Mic 6: As you're not actually using any of
    the Linux drivers, is there an advantage
  • 28:15 - 28:19
    to using Linux here or could you use any
    open source operating system?
  • 28:19 - 28:24
    Paul: I don't know about other operating
    systems but the only thing I'm using of
  • 28:24 - 28:29
    Linux here is the ability to easily map
    that. For some other operating systems we
  • 28:29 - 28:33
    might need a small stub driver that maps
    the stuff in there. You can check out the
  • 28:33 - 28:37
    DPDK FreeBSD port which has a small stub
    driver that just handles the memory
  • 28:37 - 28:41
    mapping.
    Herald: Here, at number 2.
  • 28:41 - 28:45
    Mic 2: Hi, erm, slightly disconnected to
    the talk, but I just like to hear your
  • 28:45 - 28:51
    opinion on smart NICs where they're
    considering putting CPUs on the NIC
  • 28:51 - 28:55
    itself. So you could imagine running Open
    vSwitch on the CPU on the NIC.
  • 28:55 - 29:00
    Paul: Yeah, I have some smart NIC
    somewhere in some lab and have also done
  • 29:00 - 29:06
    work with the NetFPGA. I think that it's
    very interesting, but it ... it's a
  • 29:06 - 29:10
    complicated trade-off, because these smart
    NICs come with new restrictions and they
  • 29:10 - 29:14
    are not dramatically super fast. So it's
    ... it's interesting from a performance
  • 29:14 - 29:18
    perspective to see when it's worth it,
    when it's not worth it and what I
  • 29:18 - 29:22
    personally think it's probably better to
    do everything with raw CPU power.
  • 29:22 - 29:25
    Mic 2: Thanks.
    Herald: Alright, before we take the next
  • 29:25 - 29:30
    question, just for the people who don't
    want to stick around for the Q&A. If you
  • 29:30 - 29:34
    really do have to leave the room early,
    please do so quietly, so we can continue
  • 29:34 - 29:39
    the Q&A. Number 6, please.
    Mic 6: So how does the performance of the
  • 29:39 - 29:43
    userspace driver compare to the XDP
    solution?
  • 29:43 - 29:51
    Paul: Um, it's slightly faster. But one
    important thing about XDP is, if you look
  • 29:51 - 29:55
    at this, this is still new work and there
    is ... there are few important
  • 29:55 - 29:58
    restrictions like you can write your
    userspace thing in whatever programming
  • 29:58 - 30:02
    language you want. Like I mentioned, Snabb
    has a driver entirely written in Lua. With
  • 30:02 - 30:07
    XDP you are restricted to eBPF, meaning
    usually a restricted subset of C and then
  • 30:07 - 30:10
    there's a bytecode verifier but you can
    disable the bytecode verifier if you want
  • 30:10 - 30:14
    to, meaning you again have
    weird restrictions that you maybe don't
  • 30:14 - 30:19
    want and also XDP requires patched driv
    ... not patched drivers but requires a new
  • 30:19 - 30:24
    memory model for the drivers. So at the moment
    DPDK supports more drivers than XDP in the
  • 30:24 - 30:27
    kernel, which is kind of weird, and
    they're still lacking many features like
  • 30:27 - 30:31
    sending back to a different NIC.
    One very very good use case for XDP is
  • 30:31 - 30:35
    firewalling for applications on the same
    host because you can pass on a packet to
  • 30:35 - 30:40
    the TCP stack and this is a very good use
    case for XDP. But overall, I think that
  • 30:40 - 30:47
    ... that both things are very very
    different and XDP is slightly slower but
  • 30:47 - 30:51
    it's not slower in such a way that it
    would be relevant. So it's fast, to
  • 30:51 - 30:55
    answer the question.
    Herald: All right, unfortunately we are
  • 30:55 - 30:59
    out of time. So that was the last
    question. Thanks again, Paul.
  • 30:59 - 31:08
    Applause
  • 31:08 - 31:13
    34c3 outro
  • 31:13 - 31:30
    subtitles created by c3subtitles.de
    in the year 2018. Join, and help us!