34c3 intro
Herald: All right, now it's my great
pleasure to introduce Paul Emmerich who is
going to talk about "Demystifying Network
Cards". Paul is a PhD student at the
Technical University in Munich. He's doing
all kinds of network related stuff and
hopefully today he's gonna help us make
network cards a bit less of a black box.
So, please give a warm welcome to Paul
Applause
Paul: Thank you and as the introduction
already said I'm a PhD student and I'm
researching performance of software packet
processing and forwarding systems.
That means I spend a lot of time doing
low-level optimizations and looking into
what makes a system fast, what makes it
slow, what can be done to improve it
and I'm mostly working on my packet
generator MoonGen
I have some cross promotion of a lightning
talk about this on Saturday but here I
have this long slot
and I brought a lot of content, so I have
to talk really fast. Sorry for the
translators, and I hope you can still
follow along.
So: this is about network cards, meaning
network cards you have all seen. This is a
usual 10G network card with the SFP+ port
and this is a faster network card with a
QSFP+ port. This is 20, 40, or 100G
and now you bought this fancy network
card, you plug it into your server or your
macbook or whatever,
and you start your web server that serves
cat pictures and cat videos.
You all know that there's a whole stack of
protocols that your cat picture has to go
through until it arrives at a network card
at the bottom
and the only thing that I care about is
the lower layers. I don't care about TCP,
I have no idea how TCP works.
Well I have some idea how it works, but
this is not my research, I don't care
about it.
I just want to look at individual packets,
and the highest thing I look at is maybe
an IP address or maybe a part of the
protocol to identify flows or anything.
Now you might wonder: Is there anything
even interesting in these lower layers?
Because people nowadays think that
everything runs on top of HTTP,
but you might be surprised that not all
applications run on top of HTTP.
There is a lot of software that needs to
run at these lower levels, and in
recent years
there has been a trend of moving network
infrastructure stuff from specialized
hardware black boxes to open software
boxes
and examples for such software that was
hardware in the past are: routers, switches,
firewalls, middle boxes and so on.
If you want to look up the relevant
buzzwords: it's called Network Function
Virtualization, and this is a trend of
recent years.
Now let's say we want to build our own
fancy application on that low-level thing.
We want to build our firewall router
packet forward modifier thing that does
something useful on that lower layer for
network infrastructure,
and I will use this application as a demo
application for this talk, as everything
will be about this hypothetical router
firewall packet forward modifier thing.
What it does: It receives packets on one
or multiple network interfaces, it does
stuff with the packets - filter them,
modify them, route them
and sends them out to some other port or
maybe the same port or maybe multiple
ports - whatever these low-level
applications do.
And this means the application operates on
individual packets, not a stream of TCP
packets, not a stream of UDP packets, and it
has to cope with small packets,
because that's just the worst case: you
get a lot of small packets.
Now you want to build the application. You
go to the Internet and you look up: How to
build a packet forwarding application?
The internet tells you: There is the
socket API, the socket API is great and it
allows you to get packets to your program.
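As a minimal illustration of that approach (not code from the talk, just a sketch): a raw AF_PACKET socket receives one Ethernet frame per recv() call, which is exactly where the per-packet overhead discussed below comes from.

```c
#include <stdio.h>
#include <sys/socket.h>
#include <linux/if_ether.h>
#include <arpa/inet.h>

// Minimal sketch: receive raw Ethernet frames via the socket API.
// Every packet costs at least one syscall here. Requires root.
int main(void) {
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0) { perror("socket"); return 1; }
    unsigned char buf[2048];
    for (;;) {
        ssize_t len = recv(fd, buf, sizeof(buf), 0);
        if (len > 0) {
            // ... process one packet of `len` bytes here ...
        }
    }
}
```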
So you build your application on top of
the socket API in userspace: you use
your socket, the socket talks to the
operating system,
the operating system talks to the driver,
and the driver talks to the network card,
and everything is fine - except that it
isn't,
because what it really looks like if you
build this application:
There is this huge scary big gap between
user space and kernel space and you
somehow need your packets to go across
that without being eaten.
You might wonder why I say this is such a
big, huge deal, that you have this
gap in there,
because you think: "Well, my web server
serving cat pictures is doing just fine on
a fast connection."
Well, it is, because it is serving large
packets or even large chunks of files that
it sends at once to the kernel:
you can take your whole
cat video, give it to the kernel and the
kernel will handle everything,
from packetizing it to TCP.
But what we want to build is an application
that needs to cope with the worst case of
lots of small packets coming in,
and the overhead that you get here
from this gap is mostly on a per-packet
basis, not on a per-byte basis.
So lots of small packets are a problem
for this interface.
When I say "problem" I'm always talking
about performance, because I mostly care
about performance.
So if you look at performance... a few
figures to get started:
how many packets can you fit over
your usual 10G link? That's around fifteen
million per second.
But 10G, that's last year's news; this year
you have multiple-hundred-gigabit
connections, even to this location here.
A 100G link can handle up to 150 million
packets per second, and, well, how much time
does that give us on a CPU?
Say we have a three gigahertz CPU in
our Macbook running the router; that
means we have around 200 cycles per packet
if we want to handle one 10G link with one
CPU core.
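A quick back-of-the-envelope check, assuming minimum-size 64-byte
Ethernet frames, which occupy 84 bytes on the wire with preamble and
inter-frame gap:

  10 Gbit/s / (84 bytes * 8 bits/byte) ~= 14.88 million packets/s
  3 GHz / 14.88 Mpps ~= 200 cycles per packet per core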
Okay, of course we don't have to handle
that with one core, we have
multiple cores. But you also have
multiple links, and faster links than 10G.
So the typical performance target that you
would aim for when building such an
application is five to ten million packets
per second per CPU core, per thread that
you start.
That's a usual target. And that is
just for forwarding, just to receive the
packet and to send it back out; all the
remaining cycles
can be used for your application.
So we don't want any big overhead just for
receiving and sending packets without doing
any useful work.
These figures translate to
around 300 to 600 cycles per packet on a
three gigahertz CPU core. Now, how long
does it take to cross that userspace
boundary? Well, very very very long for an
individual packet. In some performance
measurements, if you do single-core packet
forwarding with a raw socket, you
can maybe achieve 300,000 packets per
second; if you use libpcap, you can
achieve a million packets per second.
These figures can be tuned; you can maybe
get a factor of two out of that with some
tuning, but there are more problems, like
multicore scaling being unnecessarily hard,
and so on, so this doesn't really seem to
work. So the boundary is the problem;
let's get rid of the boundary by just
moving the application into the kernel. We
rewrite our application as a kernel module
and use it directly. You might think "what
an incredibly stupid idea, to write kernel
code for something that clearly should be
user space". Well, it's not that
unreasonable; there are lots of examples
of applications doing this: a certain
web server by Microsoft runs as a kernel
module, and the latest Linux kernel has TLS
offloading to speed that up. Another
interesting use case is Open vSwitch, which
has a fast internal cache that just
caches stuff and does the complex processing
in a userspace part, so it's not
completely unreasonable.
But it comes with a lot of drawbacks: it's
very cumbersome to develop, most of your
usual tools don't work or don't work as
expected, you have to follow the usual
kernel restrictions, like having to use
C as a programming language, which you
maybe don't want to, and your application
can and will crash the kernel, which can
be quite bad. But let's not care about the
restrictions, we wanted to fix
performance, so same figures again: we
have 300 to 600 cycles to receive and send
a packet. What I did: I tested this, I
profiled the Linux kernel to see how long
it takes to receive a packet until I
can do some useful work on it. This is the
average cost over a longer profiling run. So
on average it takes 500 cycles just to
receive the packet. Well, that's bad but
sending it out is slightly faster and
again, we are now over our budget. Now you
might think "what else do I need to do
besides receiving and sending the packet?"
There is some more overhead: you
need some time for the sk_buff, the data
structure used in the kernel for all
packet buffers, and this is a quite bloated,
old, big data structure that is growing
bigger and bigger with each release, and
this takes another 400 cycles. So if you
measure a real-world application, single
core packet forwarding with Open vSwitch
with the minimum processing possible - one
OpenFlow rule that matches on physical
ports - then I profiled the processing
at around 200 cycles per packet,
and the overhead of the kernel is
another thousand-something cycles, so in
the end you achieve two million packets
per second. And this is faster than our
user space stuff but still kind of slow;
well, we want to be faster.
And the currently hottest topic in the
Linux kernel, which I'm not talking about,
is XDP. This fixes some of these problems
but comes with new restrictions. I cut it
from my talk for time reasons, so let's
just not talk about XDP. So the problem
was that we wanted to move the application
into the kernel
and it didn't work, so can we
instead move stuff from the kernel to the
user space? Well, yes we can. There are
libraries called "user space packet
processing frameworks". They come in two
parts: one is a library you link your
program against in the user space, and one
is a kernel module. These two parts
communicate and they set up shared, mapped
memory, and this shared mapped memory is
used to communicate directly from your
application to the driver. You directly
fill the packet buffers that the driver
then sends out, and this is way faster.
And you might have noticed that the
operating system box here is not connected
to anything. That means your operating
system doesn't even know that the network
card is there in most cases, which can be
quite annoying. But there are quite a few
such frameworks, the biggest examples are
netmap, PF_RING and PFQ, and they come with
restrictions: there is a non-standard
API, you can't port between one framework
and another, or back to the
kernel or to sockets; there's a custom kernel
module required; most of these frameworks
require some small patches to the drivers,
so it's just a mess to maintain; and of course
they need exclusive access to the network
card, because this one application is talking
directly to the network card.
Ok, and the next thing is you lose
access to the usual kernel features, which
can be quite annoying, and then there's
often poor support for hardware offloading
features of the network cards, because
these are often found in other parts of the
kernel that we no longer have reasonable
access to. And of course with these frameworks
we talk directly to a network card,
meaning we need support for each network
card individually. Usually they just
support one, two or maybe three NIC
families, which can be quite restricting
if you don't have one of those
specific NICs. But can we do an even more
radical approach, because we have all
these problems with kernel dependencies
and so on? Well, turns out we can get rid
of the kernel entirely and move everything
into one application. This means we take
our driver and put it in the application; the
driver directly accesses the network card
and sets up DMA memory in the user
space, because the network card doesn't
care where it copies the packets from. We
just have to set up the pointers in the
right way and we can build the framework
like this, so that everything runs in the
application.
We remove the driver from the kernel, no
kernel driver running and this is super
fast and we can also use this to implement
crazy and obscure hardware features of
network cards that are not supported by
the standard driver. Now I'm not the first
one to do this, there are two big
frameworks that do that: one is DPDK,
which is quite big. This is a Linux
Foundation project and it is basically
supported by all NIC vendors, meaning
everyone who builds a high-speed NIC
writes a driver that works with DPDK. And
the second such framework is Snabb, which
I think is quite interesting, because it
doesn't write the drivers in C but is
entirely written in Lua, a scripting
language, so it is kind of nice to see a
driver that's written in a scripting
language. Okay, what problems did we solve
and what problems did we now gain? One
problem is we still have the non-standard
API, we still need exclusive access to the
network card from one application, because
the driver runs in that application; there
are some hardware tricks to solve that, but
mainly it's one application that is
running.
Then the framework needs explicit support
for all the NIC models out there. That's
not that big a problem with DPDK, because
it's such a big project that virtually
every NIC vendor writes a DPDK driver. And
yes, there is limited support for interrupts,
but it turns out interrupts are not something
that is useful when you are building
something that processes more than a few
hundred thousand packets per second,
because the overhead of an interrupt is
just too large; it's mainly a power
saving thing if you ever run into low
load. But I don't care about the low load
scenario and power saving, so for me it's
polling all the way and all the CPU. And
you of course lose all the access to the
usual kernel features. And, well, time to
ask "what has the kernel ever done for
us?" Well, the kernel has lots of mature
drivers. Okay, what has the kernel ever
done for us, except for all these nice
mature drivers? There are very nice
protocol implementations that actually
work, like the kernel TCP stack is a work
of art.
It actually works in real world scenarios,
unlike all these other TCP stacks that
fail under some things or don't support
the features we want, so there is quite
some nice stuff. But what has the kernel
ever done for us, except for these mature
drivers and these nice protocol stack
implementations? Okay, quite a few things
and we are throwing them all out. And one
thing to note: we mostly don't care
about these features when building our
packet forward modify router firewall
thing, because these are mostly high-level
features, I think. But it's still a
lot of features that we are losing, like
building a TCP stack on top of these
frameworks is kind of an unsolved problem.
There are TCP stacks but they all suck in
different ways. Ok, we lost features but
we didn't care about the features in the
first place, we wanted performance.
Back to our performance figures: we have 300
to 600 cycles per packet
available. How long does it take in, for
example, DPDK to receive and send a
packet? That is around a hundred cycles to
get a packet through the whole stack, from
receiving a packet, processing
it - well, not processing it but getting it
to the application and back to the driver
to send it out. A hundred cycles, and the
other frameworks typically play in the
same league. DPDK is slightly faster than
the other ones, because it's full of magic
SSE and AVX intrinsics and the driver is
kind of black magic but it's super fast.
Now for a kind of real-world scenario: Open
vSwitch, which I mentioned as an example
earlier - the kernel version was 2 million
packets per second - can be
compiled with an optional DPDK backend. You
set some magic flags when compiling,
then it links against DPDK and uses the
network card directly, runs completely in
userspace, and now it's a factor of around
6 or 7 faster and we can achieve 13
million packets per second with
around the same processing on a
single CPU core. So, great, where do
the performance gains come from? Well,
there are two things - mainly it's compared
to the kernel, not compared to sockets.
What people often say is that this is
zero copy, which is a stupid term because
the kernel doesn't copy packets either, so
it's not copying packets that was slow, it
was other things. Mainly it's batching,
meaning it's very efficient to process a
relatively large number of packets at once,
and that really helps; and the second thing is
reduced memory overhead: the sk_buff data
structure is really big and if you cut
that down you save a lot of cycles. And
these DPDK figures: DPDK, unlike
some other frameworks, has memory
management, and this is already included
in these 50 cycles.
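To make the batching point concrete, here is a sketch of the kind of burst receive/transmit loop a DPDK application runs; the port numbers, queue id and burst size are placeholders and all setup code is omitted. The point is that one call handles up to 32 packets at once.

```c
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

// Sketch of a DPDK-style forwarding loop: receiving and sending packets
// in batches amortizes the per-call overhead over the whole burst.
void forward_loop(uint16_t rx_port, uint16_t tx_port) {
    struct rte_mbuf* bufs[BURST_SIZE];
    for (;;) {
        // receive up to BURST_SIZE packets with a single call
        uint16_t nb_rx = rte_eth_rx_burst(rx_port, 0, bufs, BURST_SIZE);
        if (nb_rx == 0) continue;
        // ... look at or modify the packets here ...
        uint16_t nb_tx = rte_eth_tx_burst(tx_port, 0, bufs, nb_rx);
        // free any packets the transmit queue did not accept
        for (uint16_t i = nb_tx; i < nb_rx; i++) {
            rte_pktmbuf_free(bufs[i]);
        }
    }
}
```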
Okay, now we know that these frameworks
exist and everything, and the next obvious
question is: "Can we build our own
driver?" Well, but why? First for fun,
obviously, and then to understand how that
stuff works; how these drivers work,
how these packet processing frameworks
work.
In my work in academia I've
seen a lot of people using these
frameworks. It's nice, because they are
fast and they enable a few things that
just weren't possible before. But people
often treat them as magic black boxes: you
put in your packet and then it magically
gets faster. And sometimes I don't blame
them. If you look at DPDK source code,
there are more than 20,000 lines of code
for each driver. And just as an example,
looking at the receive and transmit
functions of the ixgbe driver in DPDK:
this is one file with around 3,000 lines
of code, and they do a lot of magic just
to receive and send packets. No one wants
to read through that, so the question is:
"How hard can it be to write your own
driver?"
Turns out: It's quite easy! This was like
a weekend project. I have written this
driver, called ixy. It's less than a
thousand lines of C code. That is the full
driver for 10G network cards, plus the
framework to run applications on it, and two
simple example applications. It took me
less than two days to write it completely,
then two more days to debug it and fix
performance.
So I've been building this driver for the
Intel ixgbe family. This is a family of
network cards that you will know if you
have ever had a server with 10G, because
almost all servers that have 10G
connections have these Intel cards. They
are also embedded in some Xeon CPUs, and
they are also onboard chips on many
mainboards. The nice thing about them
is that they have a publicly available
datasheet, meaning Intel publishes a
1,000-page PDF that describes everything
you ever wanted to know when writing a
driver for these. And the next nice thing
is that there is almost no logic hidden
behind black-box magic firmware. Many
newer network cards - especially the newer
Mellanox ones - hide a lot of
functionality behind the firmware, and the
driver mostly just exchanges messages
with the firmware, which is kind of
boring; with this family that is not
the case, which I think is very nice. So
how can we build a driver for this in four
very simple steps? One: we remove the
driver that is currently loaded, because
we don't want it to interfere with our
stuff. Okay, easy so far. Two: we
memory-map the PCI memory-mapped I/O
address space. This allows us to access
the PCI Express device. Three: we
figure out the physical addresses of our
process's memory regions and
then use them for DMA. And step four is
slightly more complicated than the first
three steps: we write the driver. Now,
first thing to do: we figure out where
our network card is. Let's say we have a
server and we plugged in our network card;
then it gets assigned an address on the
PCI bus. We can figure that out with
lspci; this is the address. We need it in
a slightly different form, the
fully qualified ID, and then we can remove
the kernel driver by telling the currently
bound driver to unbind that specific ID.
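For illustration, this is roughly what that step can look like from C; the PCI address 0000:03:00.0 is just an example value as you would get it from lspci -D, not one from the talk.

```c
#include <stdio.h>

// Unbind whatever kernel driver currently owns the NIC. Equivalent to:
//   echo 0000:03:00.0 > /sys/bus/pci/devices/0000:03:00.0/driver/unbind
int unbind_kernel_driver(const char* pci_addr) {
    char path[128];
    snprintf(path, sizeof(path),
             "/sys/bus/pci/devices/%s/driver/unbind", pci_addr);
    FILE* f = fopen(path, "w");
    if (!f) return -1; // no driver bound, nothing to do
    fputs(pci_addr, f);
    fclose(f);
    return 0;
}
```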
Now the operating system doesn't know
that this is a network card; it doesn't know
anything, it just notes that some PCI device
has no driver. Then we write our
application.
This is written in C: we just open
this magic file in sysfs and we
just mmap it. There is no magic there,
just a normal mmap. But what we get
back is a kind of special memory region:
the memory-mapped I/O memory
region of the PCI device, where
all of its registers
are available. I will show you
what that means in just a second.
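A minimal sketch of that step, assuming the example address 0000:03:00.0 and the resource0 file that sysfs exposes for the first BAR of the device:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

// Map the MMIO region (BAR0) of the NIC into our process. What comes
// back is not normal memory: reads and writes go over the PCI Express
// bus to the device.
uint8_t* map_pci_resource(const char* pci_addr) {
    char path[128];
    snprintf(path, sizeof(path),
             "/sys/bus/pci/devices/%s/resource0", pci_addr);
    int fd = open(path, O_RDWR);
    if (fd < 0) { perror("open resource0"); exit(1); }
    struct stat st;
    fstat(fd, &st); // size of the BAR
    uint8_t* regs = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
    if (regs == MAP_FAILED) { perror("mmap"); exit(1); }
    close(fd);
    return regs;
}
```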
If we go through the datasheet, there are
hundreds of pages of tables like this, and
these tables tell us the registers that
exist on that network card, the offset
they have, and a link to a more detailed
description. In code that looks like
this: for example, the LED control register
is at this offset.
This register is 32 bits and there are
bits at certain offsets: bit 7 is called
LED0_BLINK, and if we set that bit in that
register, then one of the LEDs will start
to blink. And we can just do that via our
magic memory region, because all the reads
and writes that we do to that memory
region go directly over the PCI Express
bus to the network card, and the network
card does whatever it wants to do with
them.
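In code, that could look roughly like this; the LEDCTL offset 0x00200 and the LED0_BLINK bit are taken from the 82599 datasheet, but treat the exact constant names as my shorthand rather than code shown in the talk.

```c
#include <stdint.h>

// MMIO registers must be accessed through volatile pointers so the
// compiler doesn't cache or reorder the accesses.
static inline void set_reg32(uint8_t* base, int offset, uint32_t value) {
    *((volatile uint32_t*) (base + offset)) = value;
}
static inline uint32_t get_reg32(uint8_t* base, int offset) {
    return *((volatile uint32_t*) (base + offset));
}

#define IXGBE_LEDCTL     0x00200   // LED control register (82599 datasheet)
#define IXGBE_LED0_BLINK (1 << 7)  // bit 7: blink LED0

void blink_led0(uint8_t* mmio_base) {
    uint32_t ledctl = get_reg32(mmio_base, IXGBE_LEDCTL);
    set_reg32(mmio_base, IXGBE_LEDCTL, ledctl | IXGBE_LED0_BLINK);
}
```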
It doesn't even have to be a register;
basically it's just a command to send to
the network card, and it's just a nice and
convenient interface to map that into
memory. This is a very common technique
that you will also find when you do some
microprocessor programming or something.
And one thing to note: since this
is not memory, that also means it can't
be cached. There's no cache in between.
Each of these accesses will trigger a PCI
Express transaction and it will take quite
some time - we are speaking of lots of
cycles, where "lots" means like a hundred
cycles or hundreds of cycles, which is a lot
for me.
So how do we now handle packets? We now
have access to these registers, we
can read the datasheet and we can write
the driver, but we need some way to
get packets through that. Of course it
would be possible to build a network card
that does that via this memory-mapped I/O
region, but it's kind of annoying. The
second way a PCI Express device
communicates with your server or macbook
is via DMA, direct memory access, and a
DMA transfer, unlike the memory-mapped I/O
stuff, is initiated by the network card, and
this means the network card can just write
to arbitrary addresses in main memory.
And for this the network card offers so-called
rings, which are queue interfaces
for receiving packets and for sending
packets, and there are multiple of these
interfaces, because this is how you do
multi-core scaling. If you want to
transmit from multiple cores, you allocate
multiple queues; each core sends to one
queue and the network card just merges
these queues in hardware onto the link.
And on receive, the network card can
either hash over the incoming
packet, like hashing over protocol headers, or
you can set explicit filters.
This is not specific to network cards,
most PCI Express devices work like this:
GPUs have command queues
and so on, NVMe PCI Express disks have
queues and...
So let's look at queues using the example
of the ixgbe family, but you will find that
most NICs work in a very similar way. There are
sometimes small differences but mainly
they work like this.
And these rings are just circular buffers
filled with so-called DMA descriptors. A
DMA descriptor is a 16-byte struct:
that is eight bytes of a physical pointer
pointing to some location where more stuff
is, and eight bytes of metadata like "I
fetched this packet" or "this packet needs
VLAN tag offloading" or "this packet had a
VLAN tag that I removed" - information like
that is stored in there.
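As a simplified illustration (not the exact ixgbe layout, which differs between the receive and transmit rings and has a separate write-back format), such a descriptor boils down to something like this:

```c
#include <stdint.h>

// Simplified sketch of a 16-byte DMA descriptor: 8 bytes of physical
// buffer address plus 8 bytes of metadata/status. The real layouts are
// in the datasheet.
struct dma_descriptor {
    uint64_t buffer_addr; // physical address of the packet buffer
    uint64_t metadata;    // length, status ("descriptor done"), offload flags
} __attribute__((packed));
```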
And what we then need to do is we
translate virtual addresses from our
address space to physical addresses
because the PCI Express device of course
needs physical addresses.
And we can do that using procfs,
via /proc/self/pagemap.
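A sketch of that translation, assuming we run as root (the kernel hides the page frame numbers from unprivileged processes):

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>

// Translate a virtual address of our own process to a physical address
// via /proc/self/pagemap. Error handling is minimal; this is a sketch.
static uintptr_t virt_to_phys(void* virt) {
    long page_size = sysconf(_SC_PAGESIZE);
    int fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0) { perror("open pagemap"); exit(1); }
    // each virtual page has one 8-byte entry; seek to the entry for our page
    uint64_t entry;
    off_t offset = (uintptr_t) virt / page_size * sizeof(entry);
    if (pread(fd, &entry, sizeof(entry), offset) != sizeof(entry)) {
        perror("pread pagemap"); exit(1);
    }
    close(fd);
    uint64_t pfn = entry & 0x7fffffffffffffULL; // bits 0-54: page frame number
    return pfn * page_size + ((uintptr_t) virt % page_size);
}
```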
And the next thing is: we now have this
queue of DMA descriptors in memory,
and this queue itself is also accessed via
DMA, and it works like
you expect a circular ring to work. It has
a head and a tail, and the head and tail
pointer are available via registers in the
memory-mapped I/O address space. Meaning,
in an image it looks kind of like this: we
have this descriptor ring in our physical
memory to the left full of pointers and
then we have somewhere else these packets
in some memory pool. And one thing to note
when allocating this kind of memory: there
is a small trick you have to do, because
the descriptor ring needs to be in
contiguous physical memory,
and if you just assume that
everything that's contiguous in your
process is also physically contiguous: no,
it isn't. And if you have a bug in there
and it writes to somewhere else, then
your filesystem dies, as I figured out,
which was not a good thing.
So what I'm doing is I'm using
huge pages, two megabyte pages; that's
enough contiguous memory and that's
guaranteed to not have weird gaps.
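One way to get such memory (a sketch, assuming 2 MiB huge pages have been reserved on the system) is an anonymous mapping with MAP_HUGETLB; within one huge page, physically contiguous layout is guaranteed:

```c
#include <sys/mman.h>

// Allocate one 2 MiB huge page as DMA-capable memory. Assumes huge pages
// have been reserved beforehand, e.g. via
//   echo 512 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
#define HUGE_PAGE_SIZE (2 * 1024 * 1024)

static void* alloc_dma_memory(void) {
    void* mem = mmap(NULL, HUGE_PAGE_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (mem == MAP_FAILED) return NULL;
    // lock the page so it is never swapped out and its physical address stays fixed
    mlock(mem, HUGE_PAGE_SIZE);
    return mem;
}
```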
So, um ... now to receive packets we need to
set up the ring: we tell the network
card via memory-mapped I/O the location and
the size of the ring, then we fill up the
ring with pointers to freshly allocated
buffers that are just empty, and now we set
the head and tail pointer to tell the
network card that the queue is full,
because the queue is full at the moment:
it's full of packet buffers that are
just not yet filled with anything. And now
what the NIC does: it fetches one of the
DMA descriptors, and as soon as it receives
a packet it writes the packet via DMA to
the location specified in the descriptor and
increments the head pointer of the queue,
and it also sets a status flag in the DMA
descriptor once it's done writing the
packet to memory. And this step is
important, because reading back the head
pointer via MMIO would be way too slow.
So instead we check the status flag,
because the status flag goes through
the cache and is already in the
cache, so we can check it really fast.
Next step is we periodically poll the
status flag. This is the point where
interrupts might come in useful.
There's some misconception: people
sometimes believe that if you receive a
packet then you get an interrupt and the
interrupt somehow magically contains the
packet. No it doesn't. The interrupt just
contains the information that there is a
new packet. After the interrupt you would
have to poll the status flag anyways. So
we now have the packet, we process the
packet or do whatever, then we reset the
DMA descriptor: we can either recycle the
old packet buffer or allocate a new one, we
set the ready flag in the descriptor,
and we adjust the tail pointer register to
tell the network card that we are done
with it. And we don't have to do that
every time, because we don't have to keep the
queue 100% utilized: we can update
the tail pointer only every hundred
packets or so, and then it's not a
performance problem.
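Putting these steps together, a receive loop can look roughly like this sketch; the descriptor layout, the length extraction and the RDT_OFFSET tail register offset are simplified placeholders, not the real ixgbe definitions.

```c
#include <stdint.h>
#include <stddef.h>

#define RING_SIZE   512
#define STATUS_DONE (1ULL << 0)  // "descriptor done" flag set by the NIC (simplified)
#define RDT_OFFSET  0x0          // placeholder for the queue's receive tail register

struct dma_descriptor { uint64_t buffer_addr; uint64_t metadata; };

// helpers from the earlier sketches
extern void set_reg32(uint8_t* base, int offset, uint32_t value);
extern uintptr_t virt_to_phys(void* virt);
extern void handle_packet(void* data, size_t len);

void rx_loop(uint8_t* mmio_base, volatile struct dma_descriptor* ring,
             void* buffers[RING_SIZE]) {
    uint32_t idx = 0;          // next descriptor we expect the NIC to fill
    uint32_t since_update = 0; // descriptors handed back since the last tail update
    for (;;) {
        // poll the status flag the NIC sets after DMA-ing the packet to memory
        while (!(ring[idx].metadata & STATUS_DONE)) {
            /* busy-wait; this is where interrupts could help under low load */
        }
        size_t len = (ring[idx].metadata >> 32) & 0xFFFF; // length field (simplified)
        handle_packet(buffers[idx], len);
        // recycle the buffer: re-arm the descriptor for the NIC
        ring[idx].buffer_addr = virt_to_phys(buffers[idx]);
        ring[idx].metadata = 0;
        idx = (idx + 1) % RING_SIZE;
        // only touch the tail register every ~100 packets, MMIO writes are slow
        if (++since_update >= 100) {
            set_reg32(mmio_base, RDT_OFFSET, (idx + RING_SIZE - 1) % RING_SIZE);
            since_update = 0;
        }
    }
}
```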
Now we have a driver that can receive
packets. Next step: transmit packets; it
basically works the same. I won't bore you with the
details. Then there's of course a lot of
boring initialization code, and it's
just following the datasheet; it's
like: set this register, set that
register, do that - and I just coded it down
from the datasheet and it works - so, big
surprise. Now you know how to write a
driver like this. Now a few ideas of
what I want to do, and what maybe you want
to do, with a driver like this. One: of
course look at performance, look
at what makes this faster than the kernel;
then I want to explore some obscure
hardware offloading features.
In the past I've looked at IPsec
offloading, which is quite interesting,
because the Intel network cards have
hardware support for IPsec offloading, but
none of the Intel drivers had it, and it
seems to work just fine, so I'm not sure
what's going on there. Then security is
interesting. There are
obviously some security implications of
having the whole driver in a user space
process, and I'm wondering about
how we can use the IOMMU, because it turns
out that once we have set up the memory
mapping we can drop all the privileges, we
don't need them anymore.
And if we set up the IOMMU beforehand to
restrict the network card to certain
things, then we could have a safe driver in
userspace that can't do anything wrong,
because it has no privileges and the network
card has no access because it goes through
the IOMMU - and there are performance
implications of the IOMMU and so on. Of
course, support for other NICs: I want to
support virtio virtual NICs. And other
programming languages for the driver would
also be interesting; it's just written in
C because C is the lowest common
denominator of programming languages.
To conclude: check out ixy. It's BSD
licensed and on GitHub, and the main thing to
take with you is that drivers are really
simple. Don't be afraid of drivers. Don't
be afraid of writing your drivers. You can
do it in any language and you don't even
need to add kernel code. Just map the
stuff to your process, write the driver
and do whatever you want. Okay, thanks for
your attention.
Applause
Herald: You have very few minutes left for
questions. So if you have a question in
the room please go quickly to one of the 8
microphones in the room. Does the signal
angel already have a question ready? I
don't see anything. Anybody lining up at
any microphones?
Alright, number 6 please.
Mic 6: As you're not actually using any of
the Linux drivers, is there an advantage
to using Linux here or could you use any
open source operating system?
Paul: I don't know about other operating
systems but the only thing I'm using of
Linux here is the ability to easily map
that. For some other operating systems we
might need a small stub driver that maps
the stuff in there. You can check out the
DPDK FreeBSD port which has a small stub
driver that just handles the memory
mapping.
Herald: Here, at number 2.
Mic 2: Hi, erm, slightly disconnected to
the talk, but I just like to hear your
opinion on smart NICs where they're
considering putting CPUs on the NIC
itself. So you could imagine running Open
vSwitch on the CPU on the NIC.
Paul: Yeah, I have some smart NIC
somewhere in some lab and have also done
work with the NetFPGA. I think that it's
very interesting, but it ... it's a
complicated trade-off, because these smart
NICs come with new restrictions and they
are not dramatically super fast. So it's
... it's interesting from a performance
perspective to see when it's worth it,
when it's not worth it. And I
personally think it's probably better to
do everything with raw CPU power.
Mic 2: Thanks.
Herald: Alright, before we take the next
question, just for the people who don't
want to stick around for the Q&A. If you
really do have to leave the room early,
please do so quietly, so we can continue
the Q&A. Number 6, please.
Mic 6: So how does the performance of the
userspace driver compare to the XDP
solution?
Paul: Um, it's slightly faster. But one
important thing about XDP is, if you look
at it, this is still new work and there
are a few important
restrictions. With a userspace framework
you can write your userspace thing in
whatever programming
language you want - like I mentioned, Snabb
has drivers entirely written in Lua. With
XDP you are restricted to eBPF, meaning
usually a restricted subset of C, and then
there's the bytecode verifier, which you can
disable if you want
to, meaning you again have
weird restrictions that you maybe don't
want. And also XDP requires, not patched
drivers, but a new
memory model for the drivers. So at the moment
DPDK supports more drivers than XDP in the
kernel, which is kind of weird, and
they're still lacking many features, like
sending back to a different NIC.
One very very good use case for XDP is
firewalling for applications on the same
host because you can pass on a packet to
the TCP stack and this is a very good use
case for XDP. But overall, I think that
both things are very
different, and XDP is slightly slower, but
it's not slower in a way that
would be relevant. So it's fast, to
answer the question.
Herald: All right, unfortunately we are
out of time. So that was the last
question. Thanks again, Paul.
Applause
34c3 outro
subtitles created by c3subtitles.de
in the year 2018. Join, and help us!