-
34c3 intro
-
Herald: All right, now it's my great
pleasure to introduce Paul Emmerich who is
-
going to talk about "Demystifying Network
Cards". Paul is a PhD student at the
-
Technical University of Munich. He's doing
all kinds of network related stuff and
-
hopefully today he's gonna help us make
network cards a bit less of a black box.
-
So, please give a warm welcome to Paul
applause
-
Paul: Thank you and as the introduction
-
already said I'm a PhD student and I'm
researching performance of software packet
-
processing and forwarding systems.
That means I spend a lot of time doing
-
low-level optimizations and looking into
what makes a system fast, what makes it
-
slow, what can be done to improve it
and I'm mostly working on my packet
-
generator MoonGen
I have some cross promotion of a lightning
-
talk about this on Saturday but here I
have this long slot
-
and I brought a lot of content here so I
have to talk really fast so sorry for the
-
translators and I hope you can mainly
follow along
-
So: this is about Network cards meaning
network cards you all have seen. This is a
-
usual 10G network card with the SFP+ port
and this is a faster network card with a
-
QSFP+ port. This is 20, 40, or 100G
and now you bought this fancy network
-
card, you plug it into your server or your
macbook or whatever,
-
and you start your web server that serves
cat pictures and cat videos.
-
You all know that there's a whole stack of
protocols that your cat picture has to go
-
through until it arrives at a network card
at the bottom
-
and the only things that I care about are
the lower layers. I don't care about TCP,
-
I have no idea how TCP works.
Well I have some idea how it works, but
-
this is not my research, I don't care
about it.
-
I just want to look at individual packets
and the highest thing I look at is maybe
-
an IP address or maybe a part of the
protocol to identify flows or anything.
-
Now you might wonder: Is there anything
even interesting in these lower layers?
-
Because people nowadays think that
everything runs on top of HTTP,
-
but you might be surprised that not all
applications run on top of HTTP.
-
There is a lot of software that needs to
run at these lower levels, and in
-
recent years
there has been a trend of moving network
-
infrastructure stuff from specialized
hardware black boxes to open software
-
boxes
and examples of such software that was
-
hardware in the past are routers, switches,
firewalls, middleboxes and so on.
-
If you want to look up the relevant
buzzwords: It's Network Function
-
Virtualization, as it's called, and this
is a trend of recent years.
-
Now let's say we want to build our own
fancy application on that low-level thing.
-
We want to build our firewall router
packet forward modifier thing that does
-
something useful on that lower layer for
network infrastructure,
-
and I will use this application as a demo
application for this talk as everything
-
will be about this hypothetical router
firewall packet forward modifier thing.
-
What it does: It receives packets on one
or multiple network interfaces, it does
-
stuff with the packets - filter them,
modify them, route them
-
and sends them out to some other port or
maybe the same port or maybe multiple
-
ports - whatever these low-level
applications do.
-
And this means the application operates on
individual packets, not a stream of TCP
-
packets, not a stream of UDP packets; it
has to cope with small packets.
-
Because that's just the worst case: You
get a lot of small packets.
-
Now you want to build the application. You
go to the Internet and you look up: How to
-
build a packet forwarding application?
The internet tells you: There is the
-
socket API, the socket API is great and it
allows you to get packets to your program.
-
So you build your application on top of
the socket API. Once in userspace, you use
-
your socket, the socket talks to the
operating system,
-
the operating system talks to the driver
and the driver talks to the network card,
-
and everything is fine, except that it
isn't,
-
because what it really looks like if you
build this application:
-
There is this huge scary big gap between
user space and kernel space and you
-
somehow need your packets to go across
that without being eaten.
-
You might wonder why I said this is such a
big and huge deal that you have this
-
gap in there,
because you think: "Well, my web server
-
serving cat pictures is doing just fine on
a fast connection."
-
Well, it is, because it is serving large
packets or even large chunks of files that
-
it sends out at once,
like you can take your whole
-
cat video, give it to the kernel and the
kernel will handle everything,
-
from packetizing it for TCP.
But what we want to build is an application
-
that needs to cope with the worst case of
lots of small packets coming in,
-
and then the overhead that you get here
from this gap is mostly on a packet basis
-
not on a per-byte basis.
So, lots of small packets are a problem
-
for this interface.
When I say "problem" I'm always talking
-
about performance, because I mostly care
about performance.
-
So if you look at performance... a few
figures to get started is...
-
well how many packets can you fit over
your usual 10G link? That's around fifteen
-
million.
But 10G, that's last year's news; this year
-
you have multiple 100G connections
even to this location here.
-
So a 100G link can handle up to 150 million
packets per second, and, well, how much time
does that give us if we have a CPU?
does that give us if we have a CPU?
And say we have a three gigahertz CPU in
-
our Macbook running the router and that
means we have around 200 cycles per packet
-
if we want to handle one 10G link with one
CPU core.
-
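A quick back-of-the-envelope version of these numbers (my own calculation, assuming minimum-size 64-byte Ethernet frames):
each frame occupies 64 + 20 bytes on the wire (preamble, start-of-frame delimiter and inter-frame gap), i.e. 672 bits,
so a 10G link carries 10^10 / 672 ≈ 14.88 million packets per second, and 3 GHz / 14.88 Mpps ≈ 200 cycles per packet.
At 100G it is ten times that, ≈ 148.8 Mpps, or only about 20 cycles per packet on one 3 GHz core.
-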
Okay, we of course have
multiple cores, but you also have
-
multiple links, and faster links than 10G.
So the typical performance target that you
-
would aim for when building such an
application is five to ten million packets
-
per second per CPU core per thread that
you start.
-
That's like a usual target. And that is
just for forwarding, just to receive the
-
packet and to send it back out. All the
rest, that is all the remaining cycles,
-
can be used for your application.
So we don't want any big overhead just for
-
receiving and sending them without doing
any useful work.
-
So these figures translate to
around 300 to 600 cycles per packet, on a
-
three gigahertz CPU core. Now, how long
does it take to cross that userspace
-
boundary? Well, very very very long for an
individual packet. So in some performance
-
measurements, if you do single core packet
forwarding, with a raw socket you
-
can maybe achieve 300,000 packets per
second, if you use libpcap, you can
-
achieve a million packets per second.
These figures can be tuned. You can maybe
-
get a factor of two out of that by some tuning,
but there are more problems, like
-
multicore scaling is unnecessarily hard
and so on, so this doesn't really seem to
-
work. So the boundary is the problem, so
let's get rid of the boundary by just
-
moving the application into the kernel. We
rewrite our application as a kernel module
-
and use it directly. You might think "what
an incredibly stupid idea, to write kernel
-
code for something that clearly should be
user space". Well, it's not that
-
unreasonable, there are lots of examples
of applications doing this, like a certain
-
web server by Microsoft runs as a kernel
module, the latest Linux kernel has TLS
-
offloading, to speed that up. Another
interesting use case is Open vSwitch, which
-
has a fast internal cache that just
caches stuff and does complex processing
-
in a userspace thing, so it's not
completely unreasonable.
-
But it comes with a lot of drawbacks, like
it's very cumbersome to develop, most of your
-
usual tools don't work or don't work as
expected, you have to follow the usual
-
kernel restrictions, like you have to use
C as a programming language, which you
-
maybe don't want to, and your application
can and will crash the kernel, which can
-
be quite bad. But let's not care about the
restrictions, we wanted to fix
-
performance, so same figures again: We
have 300 to 600 cycles to receive and send
a packet. What I did: I tested this, I
-
profiled the Linux kernel to see how long
it takes to receive a packet until I
does it take to receive a packet until I
can do some useful work on it. This is an
-
average cost of a longer profiling run. So
on average it takes 500 cycles just to
-
receive the packet. Well, that's bad but
sending it out is slightly faster and
-
again, we are now over our budget. Now you
might think "what else do I need to do
-
besides receiving and sending the packet?"
There is some more overhead: you
-
need some time for the sk_buff, the data
structure used in the kernel for all
-
packet buffers, and this is a quite bloated,
old, big data structure that is growing
-
bigger and bigger with each release and
this takes another 400 cycles. So if you
-
measure a real world application, single
core packet forwarding with Open vSwitch
-
with the minimum processing possible: one
OpenFlow rule that matches on physical
-
ports, and that processing I profiled
at around 200 cycles per packet.
-
And the overhead of the kernel is
another thousand-something cycles, so in
-
the end you achieve two million packets
per second - and this is faster than our
-
user space stuff but still kind of slow,
and, well, we want to be faster.
-
And the currently hottest topic in the
Linux kernel, which I'm not talking about, is
-
XDP. This fixes some of these problems but
comes with new restrictions. I cut that
-
from my talk for time reasons, so let's
just not talk about XDP. So the problem
-
was that our application - and we wanted
to move the application to the kernel
-
space - and it didn't work, so can we
instead move stuff from the kernel to the
-
user space? Well, yes we can. There are
libraries called "user space packet
-
processing frameworks". They come in two
parts: One is a library you link your
-
program against in user space, and one
is a kernel module. These two parts
-
communicate and they set up shared mapped
memory, and this shared mapped memory is
-
used to directly communicate from your
application to the driver. You directly
-
fill the packet buffers that the driver
then sends out and this is way faster.
-
And you might have noticed that the
operating system box here is not connected
-
to anything. That means your operating
system doesn't even know that the network
-
card is there in most cases, this can be
quite annoying. But there are quite a few
-
such frameworks; the biggest examples are
netmap, PF_RING and PFQ, and they come with
-
restrictions, like there is a non-standard
API, you can't port between one framework
-
and another, or between a framework and
kernel sockets; there's a custom kernel
-
module required, most of these frameworks
require some small patches to the drivers,
-
it's just a mess to maintain and of course
they need exclusive access to the network
-
card, because this one application is talking
-
directly to the network card.
Ok, and the next thing is you lose the
-
access to the usual kernel features, which
can be quite annoying and then there's
-
often poor support for hardware offloading
features of the network cards, because
-
they are often found in different parts of the
kernel that we no longer have reasonable
-
access to. And of course with these frameworks
we talk directly to the network card,
-
meaning we need support for each network
card individually. Usually they just
-
support one to two or maybe three NIC
families, which can be quite restricting,
-
if you don't have that specific NIC that
is supported. But can we take an even more
-
radical approach, because we have all
these problems with kernel dependencies
-
and so on? Well, turns out we can get rid
of the kernel entirely and move everything
-
into one application. This means we take
our driver, put it in the application, the
-
driver directly accesses the network card
and sets up DMA memory in the user
-
space, because the network card doesn't
care where it copies the packets from. We
-
just have to set up the pointers in the
right way and we can build this framework
-
like this, that everything runs in the
application.
-
We remove the driver from the kernel, no
kernel driver running and this is super
-
fast and we can also use this to implement
crazy and obscure hardware features of
-
network cards that are not supported by
the standard driver. Now I'm not the first
-
one to do this, there are two big
frameworks that do that: One is DPDK,
-
which is quite big. This is a Linux
Foundation project and it has basically
-
support from all NIC vendors, meaning
everyone who builds a high-speed NIC
-
writes a driver that works for DPDK and
the second such framework is Snabb, which
-
I think is quite interesting, because it
doesn't write the drivers in C but is
-
entirely written in Lua, a scripting
language, so it is kind of nice to see a
-
driver that's written in a scripting
language. Okay, what problems did we solve
-
and what problems did we now gain? One
problem is we still have the non-standard
-
API, we still need exclusive access to the
network card from one application, because
-
the driver runs in that thing, so there's
some hardware tricks to solve that, but
-
mainly it's one application that is
running.
-
Then the framework needs explicit support
for all the NIC models out there. It's
-
not that big a problem with DPDK, because
it's such a big project that virtually
-
every NIC has a driver for DPDK. And
yes, limited support for interrupts but
-
it turns out interrupts are not something
that is useful, when you are building
-
something that processes more than a few
hundred thousand packets per second,
-
because the overhead of the interrupt is
just too large, it's just mainly a power
-
saving thing, if you ever run into low
load. But I don't care about the low load
-
scenario and power saving, so for me it's
polling all the way and all the CPU. And
-
you of course lose all the access to the
usual kernel features. And, well, time to
-
ask "what has the kernel ever done for
us?" Well, the kernel has lots of mature
-
drivers. Okay, what has the kernel ever
done for us, except for all these nice
-
mature drivers? There are very nice
protocol implementations that actually
-
work, like the kernel TCP stack is a work
of art.
-
It actually works in real world scenarios,
unlike all these other TCP stacks that
-
fail under some things or don't support
the features we want, so there is quite
-
some nice stuff. But what has the kernel
ever done for us, except for these mature
-
drivers and these nice protocol stack
implementations? Okay, quite a few things
-
and we are throwing them all out. And one
thing to note: We mostly don't care
-
about these features when building our
packet forward modify router firewall
-
thing, because these are mostly high-level
features, I think. But it's still a
-
lot of features that we are losing, like
building a TCP stack on top of these
-
frameworks is kind of an unsolved problem.
There are TCP stacks but they all suck in
-
different ways. Ok, we lost features but
we didn't care about the features in the
-
first place, we wanted performance.
Back to our performance figures: we have 300
-
to 600 cycles per packet
available. How long does it take, for
-
example in DPDK, to receive and send a
packet? That is around a hundred cycles to
-
get a packet through the whole stack, from
receiving a packet, processing
-
it - well, not processing it but getting it
to the application and back to the driver
-
to send it out. A hundred cycles and the
other frameworks typically play in the
-
same league. DPDK is slightly faster than
the other ones, because it's full of magic
-
SSE and AVX intrinsics and the driver is
kind of black magic but it's super fast.
-
Now for a kind of real-world scenario: Open
vSwitch, as I've mentioned as an example
-
earlier, was 2 million packets per second with
the kernel version, and Open vSwitch can be
-
compiled with an optional DPDK backend, so
you set some magic flags when compiling,
-
then it links against DPDK and uses the
network card directly, runs completely in
-
userspace and now it's a factor of around
6 or 7 faster and we can achieve 13
-
million packets per second with around
the same processing on a
-
single CPU core. So, great, where do
the performance gains come from? Well,
-
there are two things: Mainly it's compared
to the kernel, not compared to sockets.
-
What people often say is that this is,
zero copy which is a stupid term because
-
the kernel doesn't copy packets either, so
it's not copying packets that was slow, it
-
was other things. Mainly it's batching,
meaning it's very efficient to process a
-
relatively large number of packets at once
and that really helps; and the other thing is
-
reduced memory overhead: the sk_buff data
structure is really big and if you cut
-
that down you save a lot of cycles. And these
DPDK figures - DPDK, unlike
-
some other frameworks, has memory
management, and this is already included
-
in these 50 cycles.
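-
To make the batching point concrete, a forwarding loop in DPDK looks roughly like this - a sketch using the public rte_eth_rx_burst/rte_eth_tx_burst calls; the port and queue numbers are just examples and all initialization is omitted:
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

/* Sketch of a DPDK-style forwarding loop: packets are always received and
 * sent in batches, which is where much of the speedup comes from. */
static void forward_loop(uint16_t rx_port, uint16_t tx_port) {
    struct rte_mbuf* bufs[BURST_SIZE];
    for (;;) {
        /* receive up to 32 packets with a single call */
        uint16_t rx = rte_eth_rx_burst(rx_port, 0, bufs, BURST_SIZE);
        if (rx == 0) continue;
        /* ... look at / modify the packets here ... */
        uint16_t tx = rte_eth_tx_burst(tx_port, 0, bufs, rx);
        /* free whatever the TX queue did not accept */
        for (uint16_t i = tx; i < rx; i++)
            rte_pktmbuf_free(bufs[i]);
    }
}
-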
Okay, now we know that these frameworks
-
exist and everything, and the next obvious
question is: "Can we build our own
-
driver?" Well, but why? First for fun,
obviously, and then to understand how that
-
stuff works; how these drivers work,
how these packet processing frameworks
-
work.
In my work in academia I've
seen a lot of people using these
seen a lot of people using these
frameworks. It's nice, because they are
-
fast and they enable a few things, that
just weren't possible before. But people
-
often treat these as magic black boxes you
put in your packet and then it magically
-
is faster and sometimes I don't blame
them. If you look at DPDK source code,
-
there are more than 20,000 lines of code
for each driver. And just for example,
-
looking at the receive and transmit
functions of the ixgbe driver in DPDK,
-
this is one file with around 3,000 lines
of code and they do a lot of magic, just
-
to receive and send packets. No one wants
to read through that, so the question is:
-
"How hard can it be to write your own
driver?"
-
Turns out: It's quite easy! This was like
a weekend project. I have written the
-
driver called ixy. It's less than a
thousand lines of C code. That is the full
-
driver for 10G network cards and the full
framework to get some applications and 2
-
simple example applications. It took me like
less than two days to write it completely,
-
then two more days to debug it and fix
performance.
-
So I've been building this driver for the
Intel ixgbe family. This is a family of
-
network cards that you know of if you
ever had a server to test this, because
-
almost all servers that have 10G
connections have these Intel cards. And
-
they are also embedded in some Xeon CPUs.
They are also onboard chips on many
-
mainboards, and the nice thing about them
is they have a publicly available data
-
sheet, meaning Intel publishes this 1,000-page
PDF that describes everything
-
you ever wanted to know when writing a
driver for these. And the next nice thing
-
is that there is almost no logic hidden
behind black-box magic firmware. Many
-
newer network cards - especially Mellanox,
the newer ones - hide a lot of
-
functionality behind the firmware, and the
driver mostly just exchanges messages
-
with the firmware, which is kind of
boring; with this family it is not
-
the case, which I think is very nice. So
how can we build a driver for this in four
-
very simple steps? One: We remove the
driver that is currently loaded, because
-
we don't want it to interfere with our
stuff. Okay, easy so far. Second, we
-
memory-map the PCIe memory-mapped I/O
address space. This allows us to access
-
the PCI Express device. Number three: We
figure out the physical addresses of our
-
DMA; of our process per address region and
then we use them for DMA. And step four is
-
slightly more complicated, than the first
three steps, as we write the driver. Now,
-
first thing to do: we figure out where
our network card is - let's say we have a
-
server and we plugged in our network card -
then it gets assigned an address on the
-
PCI bus. We can figure that out with
lspci; this is the address. We need it in
-
a slightly different version with the
fully qualified ID, and then we can remove
-
the kernel driver by telling the currently
bound driver to remove that specific ID.
-
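In code, that unbind step is just a write to a sysfs file. A minimal sketch (not from the talk; the PCI address is an example in the fully qualified form that lspci -D prints, and error handling is kept minimal):
#include <stdio.h>

/* Sketch: unbind whatever kernel driver is currently bound to the NIC by
 * writing its fully qualified PCI address to the driver's "unbind" file. */
int main(void) {
    const char* pci_addr = "0000:03:00.0";  /* example address, see lspci -D */
    char path[128];
    snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/driver/unbind", pci_addr);
    FILE* f = fopen(path, "w");
    if (f == NULL) {
        perror("no driver bound or no permission");
        return 1;
    }
    fputs(pci_addr, f);  /* writing the address triggers the unbind */
    fclose(f);
    return 0;
}
-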
Now the operating system doesn't know
that this is a network card; it doesn't know
-
anything, it just notes that some PCI device
has no driver. Then we write our
-
application.
This is written in C and we just open
-
this magic file in sysfs and we just
mmap it. Ain't no magic,
-
just a normal mmap there. But what we get
back is a kind of special memory region.
-
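As a sketch of that mmap: my assumption is that the magic file meant here is the device's BAR0 file, resource0, which is what ixy uses; error handling is omitted.
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map BAR0 of the PCI device into our address space; reads and writes to the
 * returned pointer then go straight to the device over PCI Express. */
volatile uint8_t* map_pci_bar0(const char* pci_addr) {
    char path[128];
    snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/resource0", pci_addr);
    int fd = open(path, O_RDWR);
    struct stat st;
    fstat(fd, &st);  /* the file size equals the size of the BAR */
    void* bar0 = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);       /* the mapping stays valid after closing the fd */
    return (volatile uint8_t*) bar0;
}
-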
This is the memory-mapped I/O memory
region of the PCI address
-
space and this is where all the registers
are available. I will show you
-
what that means in just a second. If we
go through the datasheet, there are
-
hundreds of pages of tables like this, and
these tables tell us the registers that
-
exist on that network card, the offset
they have, and a link to more detailed
-
descriptions. And in code that looks like
this: for example, the LED control register
-
is at this offset, and then there is the LED control
register layout.
-
In this register there are 32 bits at
various bit offsets. Bit 7 is called
-
LED0_BLINK and if we set that bit in that
register, then one of the LEDs will start
-
to blink. And we can just do that via our
magic memory region, because all the reads
-
and writes that we do to that memory
region, go directly over the PCI Express
-
bus to the network card and the network
card does whatever it wants to do with
-
them.
It doesn't have to be a register,
-
basically it's just a command, to send to
a network card and it's just a nice and
-
convenient interface to map that into
memory. This is a very common technique,
-
that you will also find when you do some
microprocessor programming or something.
-
So, one thing to note: since this
is not memory, that also means it can't
-
be cached. There's no cache in between.
Each of these accesses will trigger a PCI
-
Express transaction and it will take quite
some time - we are speaking of lots of
-
cycles, where lots means like hundreds of
cycles or a hundred cycles, which is a lot
-
for me.
So how do we now handle packets? We now
-
have access to these registers, we
can read the datasheet and we can write
-
the driver, but we need some way to
get packets through that. Of course it
-
would be possible to build a network card
that does that via this memory-mapped I/O
-
region, but it's kind of annoying. The
second way a PCI Express device
-
communicates with your server or MacBook
is via DMA, direct memory access, and a
-
DMA transfer, unlike the memory-mapped I/O
stuff is initiated by the network card and
-
this means the network card can just write
to arbitrary addresses in main memory.
-
And for this the network card offers so-called
rings, which are queue interfaces
-
for receiving packets and for sending
packets, and there are multiple of these
-
interfaces, because this is how you do
multi-core scaling. If you want to
-
transmit from multiple cores, you allocate
multiple queues. Each core sends to one
-
queue and the network card just merges
these queues in hardware onto the link,
-
and on receiving the network card can
either hash on the incoming
-
packet, like hash over the protocol headers, or
you can set explicit filters.
-
This is not specific to network cards;
most PCI Express devices work like this:
-
GPUs have command queues
and so on, NVMe PCI Express disks have
-
queues, and so on.
So let's look at queues using the example of the
-
ixgbe family, but you will find that most
NICs work in a very similar way. There are
-
sometimes small differences but mainly
they work like this.
-
And these rings are just circular buffers
filled with so-called DMA descriptors. A
-
DMA descriptor is a 16-byte struct and
that is eight bytes of a physical pointer
-
pointing to some location where more stuff
is, and eight bytes of metadata like "I
-
fetched the stuff" or "this packet needs
VLAN tag offloading" or "this packet had a
-
VLAN tag that I removed" - information like
that is stored in there.
-
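As a sketch, the 16-byte ixgbe receive descriptor looks roughly like this (simplified from the datasheet; the NIC reads the first layout and overwrites it with the second once it has written a packet):
#include <stdint.h>

union rx_descriptor {
    struct {                   /* "read" format: what we hand to the NIC */
        uint64_t pkt_addr;     /* physical address the packet is DMA'd to */
        uint64_t hdr_addr;     /* header split target, unused here (0) */
    } read;
    struct {                   /* "write-back" format: what the NIC returns */
        uint64_t rss_and_type; /* packet type, RSS hash, and similar metadata */
        uint32_t status_error; /* bit 0 is DD, "descriptor done" */
        uint16_t length;       /* length of the received packet */
        uint16_t vlan;         /* VLAN tag if the NIC stripped one */
    } wb;
};                             /* sizeof(union rx_descriptor) == 16 */
-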
And what we then need to do is we
translate virtual addresses from our
-
address space to physical addresses
because the PCI Express device of course
-
needs physical addresses.
And we can do that using procfs,
-
specifically via /proc/self/pagemap.
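-
A sketch of that translation, essentially what ixy does (needs root; pagemap has one 8-byte entry per virtual page, with the page frame number in bits 0-54):
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

uintptr_t virt_to_phys(void* virt) {
    long pagesize = sysconf(_SC_PAGESIZE);
    int fd = open("/proc/self/pagemap", O_RDONLY);
    uint64_t entry = 0;
    /* the entry for a virtual address sits at (virtual page number) * 8 */
    pread(fd, &entry, sizeof(entry), (uintptr_t) virt / pagesize * sizeof(entry));
    close(fd);
    uint64_t pfn = entry & 0x7fffffffffffffULL;          /* bits 0-54 */
    return pfn * pagesize + (uintptr_t) virt % pagesize; /* add page offset */
}
-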
And the next thing is: we now have this
-
queue of DMA descriptors in memory
and this queue itself is also accessed via
-
DMA, and it works like
you would expect a circular ring to work. It has
-
a head and a tail, and the head and tail
pointer are available via registers in
-
memory-mapped I/O address space, meaning
in an image it looks kind of like this: We
-
have this descriptor ring in our physical
memory to the left full of pointers and
-
then we have somewhere else these packets
in some memory pool. And one thing to note
-
when allocating this kind of memory: There
is a small trick you have to do because
-
the descriptor ring needs to be in
contiguous memory in your physical memory
-
and if you just assume that
everything that's contiguous in your
-
process is also physically contiguous in hardware: no,
it isn't, and if you have a bug in there
-
and it writes to somewhere else, then
your filesystem dies, as I figured out,
-
which was not a good thing.
So what I'm doing is I'm using
-
huge pages, two-megabyte pages; that's
enough contiguous memory and that's
-
guaranteed to not have weird gaps.
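-
One way to get such memory is a huge-page-backed mmap; this is only a sketch (ixy itself goes through hugetlbfs, but the idea is the same) and it assumes 2 MB hugepages have been reserved, e.g. via /proc/sys/vm/nr_hugepages:
#include <stddef.h>
#include <sys/mman.h>

#define HUGE_PAGE_SIZE (2 * 1024 * 1024)

void* alloc_dma_memory(size_t size) {
    /* round up to whole 2 MB pages: within one huge page the memory is
     * physically contiguous and hugetlb pages are never swapped out */
    size = (size + HUGE_PAGE_SIZE - 1) & ~(size_t) (HUGE_PAGE_SIZE - 1);
    void* mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    return mem == MAP_FAILED ? NULL : mem;
}
-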
So, um ... now to receive packets we need to
-
set up the ring, so we tell the network
card via memory-mapped I/O the location and
-
the size of the ring, then we fill up the
ring with pointers to freshly allocated
-
buffers that are just empty, and now we set
the head and tail pointer to tell the network
-
card that the queue is full,
because the queue is at the moment full:
-
it's full of packets. These packets are
just not yet filled with anything.
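-
A sketch of that setup for receive queue 0, reusing the rx_descriptor union and the helpers sketched earlier; the register names are from the datasheet, the exact offsets are my reading of it and should be verified, and alloc_pkt_buffer() stands in for a hypothetical packet buffer allocator:
#include <stdint.h>

#define RDBAL0 0x01000u  /* receive descriptor base address, low 32 bits  */
#define RDBAH0 0x01004u  /* receive descriptor base address, high 32 bits */
#define RDLEN0 0x01008u  /* ring size in bytes                            */
#define RDH0   0x01010u  /* head pointer, advanced by the NIC             */
#define RDT0   0x01018u  /* tail pointer, advanced by us                  */

#define NUM_RX_DESCS 512

/* helpers from the earlier sketches */
extern void set_reg32(volatile uint8_t* bar0, uint32_t reg, uint32_t value);
extern uintptr_t virt_to_phys(void* virt);
extern void* alloc_pkt_buffer(void);  /* hypothetical buffer allocator */

void setup_rx_ring(volatile uint8_t* bar0, union rx_descriptor* ring,
                   uintptr_t ring_phys) {
    for (int i = 0; i < NUM_RX_DESCS; i++) {
        /* each descriptor points at a freshly allocated, still empty buffer */
        ring[i].read.pkt_addr = virt_to_phys(alloc_pkt_buffer());
        ring[i].read.hdr_addr = 0;
    }
    set_reg32(bar0, RDBAL0, (uint32_t) ring_phys);
    set_reg32(bar0, RDBAH0, (uint32_t) (ring_phys >> 32));
    set_reg32(bar0, RDLEN0, NUM_RX_DESCS * sizeof(union rx_descriptor));
    /* head at 0 and tail at the last descriptor: as far as the NIC is
     * concerned the queue is completely full of (empty) buffers */
    set_reg32(bar0, RDH0, 0);
    set_reg32(bar0, RDT0, NUM_RX_DESCS - 1);
}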
-
And now what the NIC does: it fetches one of the
DMA descriptors and as soon as it receives
-
a packet it writes the packet via DMA to
the location specified in the descriptor and
-
increments the head pointer of the queue,
and it also sets a status flag in the DMA
-
descriptor once it's done writing the
packet to memory, and this step is
-
important because reading back the head
pointer via MMIO would be way too slow.
-
So instead we check the status flag,
because the status flag goes through
-
the cache and is already in
the cache, so we can check that really fast.
-
Next step is we periodically poll the
status flag. This is the point where
-
interrupts might come in useful.
There's some misconception: people
-
sometimes believe that if you receive a
packet then you get an interrupt and the
-
interrupt somehow magically contains the
packet. No it doesn't. The interrupt just
-
contains the information that there is a
new packet. After the interrupt you would
-
have to poll the status flag anyways. So
we now have the packet, we process the
-
packet or do whatever, then we reset the
DMA descriptor: we can either recycle the
-
old packet or allocate a new one, and we
set the ready flag on the status register,
-
and we adjust the tail pointer register to
tell the network card that we are done
-
with this. And we don't have to do that
every time, because we don't have to keep the
-
queue 100% utilized. We can update
the tail pointer only every hundred
-
packets or so, and that's not a
performance problem.
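-
Putting the receive path together, a sketch of that poll loop; handle_packet() and alloc_pkt_buffer() are hypothetical application hooks, the constants and helpers are the ones from the earlier sketches, and a real driver also keeps a parallel array of the buffers' virtual addresses because the write-back format overwrites the address in the descriptor:
#include <stdint.h>

#define STATUS_DD (1u << 0)  /* "descriptor done" bit in status_error */

extern void handle_packet(uint32_t index, uint16_t length);  /* hypothetical */

void rx_poll(volatile uint8_t* bar0, union rx_descriptor* ring) {
    static uint32_t rx_index = 0;  /* our position in the ring */
    uint32_t processed = 0;
    while (ring[rx_index].wb.status_error & STATUS_DD) {
        /* the NIC has DMA'd a packet into this descriptor's buffer */
        handle_packet(rx_index, ring[rx_index].wb.length);
        /* reset the descriptor: recycle the old buffer or allocate a new one */
        ring[rx_index].read.pkt_addr = virt_to_phys(alloc_pkt_buffer());
        ring[rx_index].read.hdr_addr = 0;
        rx_index = (rx_index + 1) % NUM_RX_DESCS;
        processed++;
    }
    if (processed > 0) {
        /* move the tail only once per batch; every descriptor up to here
         * is ready to be reused by the NIC */
        set_reg32(bar0, RDT0, (rx_index + NUM_RX_DESCS - 1) % NUM_RX_DESCS);
    }
}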
-
Now we have a driver that can receive packets. Next
step: well, transmit packets. It basically
-
works the same. I won't bore you with the
details. Then there's of course a lot of
-
boring initialization code, and it's
just following the datasheet; it's
-
like: set this register, set that
register, do that, and I just coded it down
-
from the datasheet and it works, so, big
surprise. So now you know how to write a
-
driver like this, and here are a few ideas of
what I want to do, and what maybe you want
-
to do, with a driver like this. One is of
course to look at performance, to look
-
at what makes this faster than the kernel;
then I want to look at some obscure
-
hardware offloading features.
In the past I've looked at IPsec
-
offloading, which is quite interesting,
because the Intel network cards have
-
hardware support for IPsec offloading, but
none of the Intel drivers had it, and it
-
seems to work just fine. So not sure
what's going on there. Then security is
-
interesting. There are
obviously some security implications of
-
having the whole driver in a user space
process, and I'm wondering about
-
how we can use the IOMMU, because it turns
out, once we have set up the memory
-
mapping we can drop all the privileges, we
don't need them.
-
And if we set up the IOMMU before to
restrict the network card to certain
-
things, then we could have a safe driver in
userspace that can't do anything wrong,
-
because it has no privileges and the network
card has no access because it goes through
-
the IOMMU and there are performance
implications of the IOMMU and so on. Of
-
course, support for other NICs: I want to
support virtio virtual NICs. And other
-
programming languages for the driver would
also be interesting; it's just written in
-
C because C is the lowest common
denominator of programming languages.
-
To conclude: check out ixy. It's BSD
licensed, on GitHub, and the main thing to
-
take with you is that drivers are really
simple. Don't be afraid of drivers. Don't
-
be afraid of writing your own drivers. You can
do it in any language and you don't even
-
need to add kernel code. Just map the
stuff to your process, write the driver
-
and do whatever you want. Okay, thanks for
your attention.
-
Applause
-
Herald: You have very few minutes left for
-
questions. So if you have a question in
the room please go quickly to one of the 8
-
microphones in the room. Does the signal
angel already have a question ready? I
-
don't see anything. Anybody lining up at
any microphones?
-
Alright, number 6 please.
-
Mic 6: As you're not actually using any of
the Linux drivers, is there an advantage
-
to using Linux here or could you use any
open source operating system?
-
Paul: I don't know about other operating
systems but the only thing I'm using of
-
Linux here is the ability to easily map
that. For some other operating systems we
-
might need a small stub driver that maps
the stuff in there. You can check out the
-
DPDK FreeBSD port which has a small stub
driver that just handles the memory
-
mapping.
Herald: Here, at number 2.
-
Mic 2: Hi, erm, slightly disconnected to
the talk, but I just like to hear your
-
opinion on smart NICs where they're
considering putting CPUs on the NIC
-
itself. So you could imagine running Open
vSwitch on the CPU on the NIC.
-
Paul: Yeah, I have some smart NIC
somewhere in some lab and have also done
-
work with the NetFPGA. I think that it's
very interesting, but it ... it's a
-
complicated trade-off, because these smart
NICs come with new restrictions and they
-
are not dramatically super fast. So it's
... it's interesting from a performance
-
perspective to see when it's worth it,
when it's not worth it and what I
-
personally think it's probably better to
do everything with raw CPU power.
-
Mic 2: Thanks.
Herald: Alright, before we take the next
-
question, just for the people who don't
want to stick around for the Q&A. If you
-
really do have to leave the room early,
please do so quietly, so we can continue
-
the Q&A. Number 6, please.
Mic 6: So how does the performance of the
-
userspace driver compare to the XDP
solution?
-
Paul: Um, it's slightly faster. But one
important thing about XDP is, if you look
-
at this, this is still new work and there
is ... there are a few important
-
restrictions like you can write your
userspace thing in whatever programming
-
language you want - like I mentioned, Snabb
has a driver entirely written in Lua. With
-
XDP you are restricted to eBPF, meaning
usually a restricted subset of C and then
-
there's a bytecode verifier, but you can
disable the bytecode verifier if you want
-
to, meaning you again have
weird restrictions that you maybe don't
-
want. And also XDP requires patched driv-
... not patched drivers, but requires a new
-
memory model for the drivers. So at the moment
DPDK supports more drivers than XDP in the
-
kernel, which is kind of weird, and
they're still lacking many features like
-
sending back to a different NIC.
One very very good use case for XDP is
-
firewalling for applications on the same
host because you can pass on a packet to
-
the TCP stack and this is a very good use
case for XDP. But overall, I think that
-
... that both things are very very
different and XDP is slightly slower but
-
it's not slower in such a way that it
would be relevant. So it's fast, to
-
answer the question.
Herald: All right, unfortunately we are
-
out of time. So that was the last
question. Thanks again, Paul.
-
Applause
-
34c3 outro
-
subtitles created by c3subtitles.de
in the year 2018. Join, and help us!