34C3 - Demystifying Network Cards

  • 0:00 - 0:16
    34c3 intro
  • 0:16 - 0:20
    Herald: All right, now it's my great
    pleasure to introduce Paul Emmerich who is
  • 0:20 - 0:27
    going to talk about "Demystifying Network
    Cards". Paul is a PhD student at the
  • 0:27 - 0:34
    Technical University in Munich. He's doing
    all kinds of network related stuff and
  • 0:34 - 0:38
    hopefully today he's gonna help us make
    network cards a bit less of a black box.
  • 0:38 - 0:49
    So, please give a warm welcome to Paul
    applause
  • 0:49 - 0:51
    Paul: Thank you and as the introduction
  • 0:51 - 0:55
    already said I'm a PhD student and I'm
    researching performance of software packet
  • 0:55 - 0:58
    processing and forwarding systems.
    That means I spend a lot of time doing
  • 0:58 - 1:03
    low-level optimizations and looking into
    what makes a system fast, what makes it
  • 1:03 - 1:06
    slow, what can be done to improve it
    and I'm mostly working on my packet
  • 1:06 - 1:10
    generator MoonGen
    I have some cross promotion of a lightning
  • 1:10 - 1:13
    talk about this on Saturday but here I
    have this long slot
  • 1:13 - 1:18
    and I brought a lot of content here so I
    have to talk really fast so sorry for the
  • 1:18 - 1:21
    translators and I hope you can mainly
    follow along
  • 1:21 - 1:25
    So: this is about Network cards meaning
    network cards you all have seen. This is a
  • 1:25 - 1:30
    usual 10G network card with the SFP+ port
    and this is a faster network card with a
  • 1:30 - 1:35
    QSFP+ port. This is 20, 40, or 100G
    and now you bought this fancy network
  • 1:35 - 1:38
    card, you plug it into your server or your
    macbook or whatever,
  • 1:38 - 1:42
    and you start your web server that serves
    cat pictures and cat videos.
  • 1:42 - 1:46
    You all know that there's a whole stack of
    protocols that your cat picture has to go
  • 1:46 - 1:48
    through until it arrives at a network card
    at the bottom
  • 1:48 - 1:52
    and the only thing that I care about are
    the lower layers. I don't care about TCP,
  • 1:52 - 1:56
    I have no idea how TCP works.
    Well I have some idea how it works, but
  • 1:56 - 1:58
    this is not my research, I don't care
    about it.
  • 1:58 - 2:01
    I just want to look at individual packets
    and the highest thing I look at is maybe
  • 2:01 - 2:08
    an IP address or maybe a part of the
    protocol to identify flows or anything.
  • 2:08 - 2:11
    Now you might wonder: Is there anything
    even interesting in these lower layers?
  • 2:11 - 2:15
    Because people nowadays think that
    everything runs on top of HTTP,
  • 2:15 - 2:19
    but you might be surprised that not all
    applications run on top of HTTP.
  • 2:19 - 2:23
    There is a lot of software that needs to
    run at these lower levels and in the
  • 2:23 - 2:26
    recent years
    there is a trend of moving network
  • 2:26 - 2:31
    infrastructure stuff from specialized
    hardware black boxes to open software
  • 2:31 - 2:33
    boxes
    and examples for such software that was
  • 2:33 - 2:38
    hardware in the past are: routers, switches,
    firewalls, middle boxes and so on.
  • 2:38 - 2:40
    If you want to look up the relevant
    buzzwords: It's Network Function
  • 2:40 - 2:46
    Virtualization is what it's called and this
    is a recent trend of the recent years.
  • 2:46 - 2:51
    Now let's say we want to build our own
    fancy application on that low-level thing.
  • 2:51 - 2:55
    We want to build our firewall router
    packet forward modifier thing that does
  • 2:55 - 2:59
    whatever useful on that lower layer for
    network infrastructure
  • 2:59 - 3:04
    and I will use this application as a demo
    application for this talk as everything
  • 3:04 - 3:08
    will be about this hypothetical router
    firewall packet forward modifier thing.
  • 3:08 - 3:12
    What it does: It receives packets on one
    or multiple network interfaces, it does
  • 3:12 - 3:16
    stuff with the packets - filter them,
    modify them, route them
  • 3:16 - 3:20
    and send them out to some other port or
    maybe the same port or maybe multiple
  • 3:20 - 3:23
    ports - whatever these low-level
    applications do.
  • 3:23 - 3:28
    And this means the application operates on
    individual packets, not a stream of TCP
  • 3:28 - 3:31
    packets, not a stream of UDP packets, they
    have to cope with small packets.
  • 3:31 - 3:34
    Because that's just the worst case: You
    get a lot of small packets.
  • 3:34 - 3:38
    Now you want to build the application. You
    go to the Internet and you look up: How to
  • 3:38 - 3:41
    build a packet forwarding application?
    The internet tells you: There is the
  • 3:41 - 3:46
    socket API, the socket API is great and it
    allows you to get packets to your program.
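    (To make the socket approach concrete, here is a minimal sketch of
    receiving raw frames with an AF_PACKET socket; it is only an
    illustration, and the interface name "eth0" is a placeholder:)

      /* minimal AF_PACKET receive loop (sketch); needs root / CAP_NET_RAW */
      #include <arpa/inet.h>
      #include <linux/if_ether.h>
      #include <linux/if_packet.h>
      #include <net/if.h>
      #include <stdint.h>
      #include <stdio.h>
      #include <string.h>
      #include <sys/socket.h>
      #include <unistd.h>

      int main(void) {
          int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
          if (fd < 0) { perror("socket"); return 1; }

          struct sockaddr_ll addr;
          memset(&addr, 0, sizeof(addr));
          addr.sll_family = AF_PACKET;
          addr.sll_protocol = htons(ETH_P_ALL);
          addr.sll_ifindex = if_nametoindex("eth0");  /* placeholder interface */
          if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
              perror("bind"); return 1;
          }

          uint8_t buf[2048];
          for (;;) {
              /* one syscall per packet - this per-packet overhead is what
                 the rest of the talk is complaining about */
              ssize_t len = recv(fd, buf, sizeof(buf), 0);
              if (len > 0) {
                  /* filter / modify / forward the frame here */
              }
          }
      }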
  • 3:46 - 3:50
    So you build your application on top of
    the socket API. Once in userspace, you use
  • 3:50 - 3:53
    your socket, the socket talks to the
    operating system,
  • 3:53 - 3:56
    the operating system talks to the driver
    and the driver talks to the network cards,
  • 3:56 - 3:59
    and everything is fine except for that it
    isn't
  • 3:59 - 4:02
    because what it really looks like if you
    build this application:
  • 4:02 - 4:07
    There is this huge scary big gap between
    user space and kernel space and you
  • 4:07 - 4:13
    somehow need your packets to go across
    that without being eaten.
  • 4:13 - 4:16
    You might wonder why I said this is a big
    deal and a huge deal that you have this
  • 4:16 - 4:19
    gap in there
    and because you think: "Well, my web server
  • 4:19 - 4:23
    serving cat pictures is doing just fine on
    a fast connection."
  • 4:23 - 4:29
    Well, it is because it is serving large
    packets or even large chunks of files that
  • 4:29 - 4:34
    it sends at once to the kernel,
    like you can take your whole
  • 4:34 - 4:37
    cat video, give it to the kernel and the
    kernel will handle everything
  • 4:37 - 4:43
    from doing... from packetizing it to TCP.
    But what we want to build is an application
  • 4:43 - 4:48
    that needs to cope with the worst case of
    lots of small packets coming in,
  • 4:48 - 4:54
    and then the overhead that you get here
    from this gap is mostly on a packet basis
  • 4:54 - 4:57
    not on a per-byte basis.
    So, lots of small packets are a problem
  • 4:57 - 5:01
    for this interface.
    When I say "problem" I'm always talking
  • 5:01 - 5:03
    about performance because I mostly care about
    performance.
  • 5:03 - 5:09
    So if you look at performance... a few
    figures to get started is...
  • 5:09 - 5:13
    well how many packets can you fit over
    your usual 10G link? That's around fifteen
  • 5:13 - 5:18
    million.
    But 10G that's last year's news, this year
  • 5:18 - 5:21
    you have multiple hundred G connections
    even to this location here.
  • 5:21 - 5:28
    So a 100G link can handle up to 150 million
    packets per second, and, well, how long
  • 5:28 - 5:33
    does that give us if we have a CPU?
    And say we have a three gigahertz CPU in
  • 5:33 - 5:37
    our Macbook running the router and that
    means we have around 200 cycles per packet
  • 5:37 - 5:40
    if we want to handle one 10G link with one
    CPU core.
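    (As a quick back-of-the-envelope check of those numbers: a minimal
    Ethernet frame is 64 bytes plus 20 bytes of preamble and inter-frame
    gap, i.e. 84 bytes = 672 bits on the wire, so 10 Gbit/s / 672 bits
    ≈ 14.88 million packets per second, and 3 GHz / 14.88 Mpps ≈ 200
    cycles per packet.)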
  • 5:40 - 5:46
    Okay we don't want to handle... we have of
    course multiple cores. But you also have
  • 5:46 - 5:50
    multiple links, and faster links than 10G.
    So the typical performance target that you
  • 5:50 - 5:55
    would aim for when building such an
    application is five to ten million packets
  • 5:55 - 5:57
    per second per CPU core per thread that
    you start.
  • 5:57 - 6:01
    That's like a usual target. And that is
    just for forwarding, just to receive the
  • 6:01 - 6:06
    packet and to send it back out. All the
    stuff, that is: all the remaining cycles
  • 6:06 - 6:09
    can be used for your application.
    So we don't want any big overhead just for
  • 6:09 - 6:12
    receiving and sending them without doing
    any useful work.
  • 6:12 - 6:20
    So these figures translate to
    around 300 to 600 cycles per packet, on a
  • 6:20 - 6:24
    three gigahertz CPU core. Now, how long
    does it take to cross that userspace
  • 6:24 - 6:31
    boundary? Well, very very very long for an
    individual packet. So in some performance
  • 6:31 - 6:35
    measurements, if you do single core packet
    forwarding, with a raw socket you
  • 6:35 - 6:39
    can maybe achieve 300,000 packets per
    second, if you use libpcap, you can
  • 6:39 - 6:43
    achieve a million packets per second.
    These figures can be tuned. You can maybe
  • 6:43 - 6:46
    get factor two out of that by some tuning,
    but there are more problems, like
  • 6:46 - 6:50
    multicore scaling is unnecessarily hard
    and so on, so this doesn't really seem to
  • 6:50 - 6:55
    work. So the boundary is the problem, so
    let's get rid of the boundary by just
  • 6:55 - 6:59
    moving the application into the kernel. We
    rewrite our application as a kernel module
  • 6:59 - 7:04
    and use it directly. You might think "what
    an incredibly stupid idea, to write kernel
  • 7:04 - 7:09
    code for something that clearly should be
    user space". Well, it's not that
  • 7:09 - 7:12
    unreasonable, there are lots of examples
    of applications doing this, like a certain
  • 7:12 - 7:17
    web server by Microsoft runs as a kernel
    module, the latest Linux kernel has TLS
  • 7:17 - 7:21
    offloading, to speed that up. Another
    interesting use case is Open vSwitch, that
  • 7:21 - 7:24
    has a fast internal cache that just
    caches stuff and does complex processing
  • 7:24 - 7:27
    in a userspace thing, so it's not
    completely unreasonable.
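    (As a rough idea of what packet processing in the kernel looks like,
    here is a minimal sketch of a netfilter hook module; it is only an
    illustration of the approach, not how Open vSwitch or the other
    examples are actually implemented:)

      /* minimal in-kernel packet hook (illustration only) */
      #include <linux/module.h>
      #include <linux/netfilter.h>
      #include <linux/netfilter_ipv4.h>
      #include <linux/skbuff.h>
      #include <net/net_namespace.h>

      static unsigned int my_hook(void *priv, struct sk_buff *skb,
                                  const struct nf_hook_state *state)
      {
          /* every incoming IPv4 packet passes through here as an sk_buff;
             returning NF_DROP instead would make this a very crude firewall */
          return NF_ACCEPT;
      }

      static struct nf_hook_ops ops = {
          .hook     = my_hook,
          .pf       = NFPROTO_IPV4,
          .hooknum  = NF_INET_PRE_ROUTING,
          .priority = NF_IP_PRI_FIRST,
      };

      static int __init my_init(void)  { return nf_register_net_hook(&init_net, &ops); }
      static void __exit my_exit(void) { nf_unregister_net_hook(&init_net, &ops); }

      module_init(my_init);
      module_exit(my_exit);
      MODULE_LICENSE("GPL");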
  • 7:27 - 7:31
    But it comes with a lot of drawbacks, like
    it's very cumbersome to develop, most of your
  • 7:31 - 7:35
    usual tools don't work or don't work as
    expected, you have to follow the usual
  • 7:35 - 7:38
    kernel restrictions, like you have to use
    C as a programming language, which you
  • 7:38 - 7:42
    maybe don't want to, and your application
    can and will crash the kernel, which can
  • 7:42 - 7:47
    be quite bad. But let's not care about the
    restrictions, we wanted to fix
  • 7:47 - 7:51
    performance, so same figures again: We
    have 300 to 600 cycles to receive and send
  • 7:51 - 7:55
    a packet. What I did: I tested this, I
    profiled the Linux kernel to see how long
  • 7:55 - 7:59
    does it take to receive a packet until I
    can do some useful work on it. This is an
  • 7:59 - 8:04
    average cost of a longer profiling run. So
    on average it takes 500 cycles just to
  • 8:04 - 8:08
    receive the packet. Well, that's bad but
    sending it out is slightly faster and
  • 8:08 - 8:11
    again, we are now over our budget. Now you
    might think "what else do I need to do
  • 8:11 - 8:16
    besides receiving and sending the packet?"
    There is some more overhead: you
  • 8:16 - 8:21
    need some time for the sk_buff, the data
    structure used in the kernel for all
  • 8:21 - 8:25
    packet buffers, and this is a quite bloated,
    old, big data structure that is growing
  • 8:25 - 8:30
    bigger and bigger with each release and
    this takes another 400 cycles. So if you
  • 8:30 - 8:33
    measure a real world application, single
    core packet forwarding with Open vSwitch
  • 8:33 - 8:36
    with the minimum processing possible: One
    OpenFlow rule that matches on physical
  • 8:36 - 8:41
    ports and the processing, I profiled this
    at around 200 cycles per packet.
  • 8:41 - 8:45
    And then the overhead of the kernel is
    another thousand-something cycles, so in
  • 8:45 - 8:49
    the end you achieve two million packets
    per second - and this is faster than our
  • 8:49 - 8:55
    user space stuff but still kind of slow,
    well, we want to be faster, because yeah.
  • 8:55 - 8:59
    And the currently hottest topic in the
    Linux kernel, which I'm not talking about, is
  • 8:59 - 9:03
    XDP. This fixes some of these problems but
    comes with new restrictions. I cut that
  • 9:03 - 9:10
    for my talk for time reasons and so let's
    just talk about not XDP. So the problem
  • 9:10 - 9:14
    was that our application - and we wanted
    to move the application to the kernel
  • 9:14 - 9:18
    space - and it didn't work, so can we
    instead move stuff from the kernel to the
  • 9:18 - 9:22
    user space? Well, yes we can. There are
    libraries called "user space packet
  • 9:22 - 9:26
    processing frameworks". They come in two
    parts: One is a library, you link your
  • 9:26 - 9:29
    program against, in the user space and one
    is a kernel module. These two parts
  • 9:29 - 9:34
    communicate and they set up shared, mapped
    memory and this shared mapped memory is
  • 9:34 - 9:38
    used to directly communicate from your
    application to the driver. You directly
  • 9:38 - 9:41
    fill the packet buffers that the driver
    then sends out and this is way faster.
  • 9:41 - 9:44
    And you might have noticed that the
    operating system box here is not connected
  • 9:44 - 9:47
    to anything. That means your operating
    system doesn't even know that the network
  • 9:47 - 9:52
    card is there in most cases, this can be
    quite annoying. But there are quite a few
  • 9:52 - 9:58
    such frameworks, the biggest examples are
    netmap, PF_RING, and PFQ, and they come with
  • 9:58 - 10:02
    restrictions, like there is a non-standard
    API, you can't port between one framework
  • 10:02 - 10:06
    and the other or one framework in the
    kernel or sockets, there's a custom kernel
  • 10:06 - 10:11
    module required, most of these frameworks
    require some small patches to the drivers,
  • 10:11 - 10:16
    it's just a mess to maintain and of course
    they need exclusive access to the network
  • 10:16 - 10:19
    card, because this one application is talking
  • 10:19 - 10:24
    directly to the network card.
    Ok, and the next thing is you lose the
  • 10:24 - 10:28
    access to the usual kernel features, which
    can be quite annoying and then there's
  • 10:28 - 10:31
    often poor support for hardware offloading
    features of the network cards, because
  • 10:31 - 10:34
    they are often found in different parts of the
    kernel that we no longer have reasonable
  • 10:34 - 10:38
    access to. And of course with these frameworks,
    we talk directly to the network card,
  • 10:38 - 10:42
    meaning we need support for each network
    card individually. Usually they just
  • 10:42 - 10:46
    support one to two or maybe three NIC
    families, which can be quite restricting,
  • 10:46 - 10:51
    if you don't have one of those specific NICs.
    But can we do an even more
  • 10:51 - 10:55
    radical approach, because we have all
    these problems with kernel dependencies
  • 10:55 - 10:59
    and so on? Well, turns out we can get rid
    of the kernel entirely and move everything
  • 10:59 - 11:04
    into one application. This means we take
    our driver, put it in the application, the
  • 11:04 - 11:08
    driver directly accesses the network card
    and sets up DMA memory in the user
  • 11:08 - 11:12
    space, because the network card doesn't
    care, where it copies the packets from. We
  • 11:12 - 11:15
    just have to set up the pointers in the
    right way and we can build this framework
  • 11:15 - 11:17
    like this, that everything runs in the
    application.
  • 11:17 - 11:23
    We remove the driver from the kernel, no
    kernel driver running and this is super
  • 11:23 - 11:28
    fast and we can also use this to implement
    crazy and obscure hardware features and
  • 11:28 - 11:31
    network cards that are not supported by
    the standard driver. Now I'm not the first
  • 11:31 - 11:36
    one to do this, there are two big
    frameworks that do that: One is DPDK,
  • 11:36 - 11:41
    which is quite big. This is a Linux
    Foundation project and it has basically
  • 11:41 - 11:45
    support by all NIC vendors, meaning
    everyone who builds a high-speed NIC
  • 11:45 - 11:49
    writes a driver that works for DPDK and
    the second such framework is Snabb, which
  • 11:49 - 11:54
    I think is quite interesting, because it
    doesn't write the drivers in C but is
  • 11:54 - 11:58
    entirely written in Lua, the scripting
    language, so this is kind of nice to see a
  • 11:58 - 12:03
    driver that's written in a scripting
    language. Okay, what problems did we solve
  • 12:03 - 12:07
    and what problems did we now gain? One
    problem is we still have the non-standard
  • 12:07 - 12:11
    API, we still need exclusive access to the
    network card from one application, because
  • 12:11 - 12:15
    the driver runs in that thing, so there's
    some hardware tricks to solve that, but
  • 12:15 - 12:18
    mainly it's one application that is
    running.
  • 12:18 - 12:22
    Then the framework needs explicit support
    for all the NIC models out there. It's
  • 12:22 - 12:26
    not that big a problem with DPDK, because
    it's such a big project that virtually
  • 12:26 - 12:31
    everyone has a DPDK driver for their NIC. And
    yes, limited support for interrupts but
  • 12:31 - 12:34
    it turns out interrupts are not something
    that is useful, when you are building
  • 12:34 - 12:38
    something that processes more than a few
    hundred thousand packets per second,
  • 12:38 - 12:41
    because the overhead of the interrupt is
    just too large, it's just mainly a power
  • 12:41 - 12:45
    saving thing, if you ever run into low
    load. But I don't care about the low load
  • 12:45 - 12:50
    scenario and power saving, so for me it's
    polling all the way and all the CPU. And
  • 12:50 - 12:55
    you of course lose all the access to the
    usual kernel features. And, well, time to
  • 12:55 - 13:00
    ask "what has the kernel ever done for
    us?" Well, the kernel has lots of mature
  • 13:00 - 13:03
    drivers. Okay, what has the kernel ever
    done for us, except for all these nice
  • 13:03 - 13:08
    mature drivers? There are very nice
    protocol implementations that actually
  • 13:08 - 13:10
    work, like the kernel TCP stack is a work
    of art.
  • 13:10 - 13:14
    It actually works in real world scenarios,
    unlike all these other TCP stacks that
  • 13:14 - 13:18
    fail under some things or don't support
    the features we want, so there is quite
  • 13:18 - 13:23
    some nice stuff. But what has the kernel
    ever done for us, except for these mature
  • 13:23 - 13:27
    drivers and these nice protocol stack
    implementations? Okay, quite a few things
  • 13:27 - 13:33
    and we are throwing them all out. And one
    thing to notice: We mostly don't care
  • 13:33 - 13:38
    about these features, when building our
    packet forward modify router firewall
  • 13:38 - 13:44
    thing, because these are mostly high-level
    features, I think. But it's still a
  • 13:44 - 13:49
    lot of features that we are losing, like
    building a TCP stack on top of these
  • 13:49 - 13:53
    frameworks is kind of an unsolved problem.
    There are TCP stacks but they all suck in
  • 13:53 - 13:58
    different ways. Ok, we lost features but
    we didn't care about the features in the
  • 13:58 - 14:03
    first place, we wanted performance.
    Back to our performance figure we want 300
  • 14:03 - 14:06
    to 600 cycles per packet that we have
    available, how long does it take in, for
  • 14:06 - 14:11
    example, DPDK to receive and send a
    packet? That is around a hundred cycles to
  • 14:11 - 14:15
    get a packet through the whole stack, from
    like receiving a packet, processing
  • 14:15 - 14:20
    it, well, not processing it but getting it
    to the application and back to the driver
  • 14:20 - 14:23
    to send it out. A hundred cycles and the
    other frameworks typically play in the
  • 14:23 - 14:28
    same league. DPDK is slightly faster than
    the other ones, because it's full of magic
  • 14:28 - 14:33
    SSE and AVX intrinsics and the driver is
    kind of black magic but it's super fast.
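    (To give an idea of what the API of such a framework looks like, here
    is roughly the core of a forwarding loop on top of DPDK's public
    rte_ethdev API; this is only a sketch, with EAL, port and memory pool
    setup omitted:)

      #include <rte_ethdev.h>
      #include <rte_mbuf.h>

      #define BURST_SIZE 32

      /* forward everything arriving on rx_port out of tx_port, in batches */
      static void forward_loop(uint16_t rx_port, uint16_t tx_port)
      {
          struct rte_mbuf *bufs[BURST_SIZE];
          for (;;) {
              /* batching: up to 32 packets per call, no syscall involved */
              uint16_t nb_rx = rte_eth_rx_burst(rx_port, 0, bufs, BURST_SIZE);
              if (nb_rx == 0)
                  continue;
              /* ... look at / modify the packets here ... */
              uint16_t nb_tx = rte_eth_tx_burst(tx_port, 0, bufs, nb_rx);
              for (uint16_t i = nb_tx; i < nb_rx; i++)
                  rte_pktmbuf_free(bufs[i]);  /* free what the NIC didn't take */
          }
      }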
  • 14:33 - 14:37
    Now in a kind of real-world scenario, Open
    vSwitch, as I've mentioned as an example
  • 14:37 - 14:42
    earlier, that was 2 million packets for
    the kernel version and Open vSwitch can be
  • 14:42 - 14:45
    compiled with an optional DPDK backend, so
    you set some magic flags when compiling,
  • 14:45 - 14:50
    then it links against DPDK and uses the
    network card directly, runs completely in
  • 14:50 - 14:55
    userspace and now it's a factor of around
    6 or 7 faster and we can achieve 13
  • 14:55 - 14:58
    million packets per second with the same,
    around the same processing step on a
  • 14:58 - 15:03
    single CPU core. So, great, where do
    the performance gains come from? Well,
  • 15:03 - 15:08
    there are two things: Mainly it's compared
    to the kernel, not compared to sockets.
  • 15:08 - 15:13
    What people often say is that this is
    zero copy, which is a stupid term because
  • 15:13 - 15:18
    the kernel doesn't copy packets either, so
    it's not copying packets that was slow, it
  • 15:18 - 15:22
    was other things. Mainly it's batching,
    meaning it's very efficient to process a
  • 15:22 - 15:29
    relatively large number of packets at once
    and that really helps and the thing has
  • 15:29 - 15:33
    reduced memory overhead, the sk_buff data
    structure is really big and if you cut
  • 15:33 - 15:37
    that down you save a lot of cycles. These
    DPDK figures, because DPDK, unlike
  • 15:37 - 15:43
    some other frameworks, has memory
    management, and this is already included
  • 15:43 - 15:47
    in these 50 cycles.
    Okay, now we know that these frameworks
  • 15:47 - 15:52
    exist and everything, and the next obvious
    question is: "Can we build our own
  • 15:52 - 15:58
    driver?" Well, but why? First for fun,
    obviously, and then to understand how that
  • 15:58 - 16:01
    stuff works; how these drivers work,
    how these packet processing frameworks
  • 16:01 - 16:05
    work.
    I've seen in my work in academia; I've
  • 16:05 - 16:08
    seen a lot of people using these
    frameworks. It's nice, because they are
  • 16:08 - 16:12
    fast and they enable a few things, that
    just weren't possible before. But people
  • 16:12 - 16:16
    often treat these as magic black boxes: you
    put your packet in and then it magically
  • 16:16 - 16:20
    is faster and sometimes I don't blame
    them. If you look at DPDK source code,
  • 16:20 - 16:24
    there are more than 20,000 lines of code
    for each driver. And just for example,
  • 16:24 - 16:29
    looking at the receive and transmit
    functions of the ixgbe driver in DPDK,
  • 16:29 - 16:34
    this is one file with around 3,000 lines
    of code and they do a lot of magic, just
  • 16:34 - 16:38
    to receive and send packets. No one wants
    to read through that, so the question is:
  • 16:38 - 16:41
    "How hard can it be to write your own
    driver?"
  • 16:41 - 16:45
    Turns out: It's quite easy! This was like
    a weekend project. I have written the
  • 16:45 - 16:48
    driver called ixy. It's less than a
    thousand lines of C code. That is the full
  • 16:48 - 16:54
    driver for 10 G network cards and the full
    framework to get some applications running and two
  • 16:54 - 16:58
    simple example applications. Took me like
    less than two days to write it completely,
  • 16:58 - 17:01
    then two more days to debug it and fix
    performance.
  • 17:02 - 17:08
    So I've been building this driver on the
    Intel IXGBE family. This is a family of
  • 17:08 - 17:13
    network cards that you know of, if you
    ever had a server to test this. Because
  • 17:13 - 17:18
    almost all servers that have 10G
    connections have these Intel cards. And
  • 17:18 - 17:23
    they are also embedded in some Xeon CPUs.
    They are also onboard chips on many
  • 17:23 - 17:29
    mainboards and the nice thing about them
    is, they have a publicly available data
  • 17:29 - 17:34
    sheet. Meaning Intel publishes this
    1,000-page PDF that describes everything
  • 17:34 - 17:37
    you ever wanted to know, when writing a
    driver for these. And the next nice thing
  • 17:37 - 17:41
    is, that there is almost no logic hidden
    behind the black box magic firmware. Many
  • 17:41 - 17:46
    newer network cards -especially Mellanox,
    the newer ones- hide a lot of
  • 17:46 - 17:50
    functionality behind a firmware and the
    driver mostly just exchanges messages
  • 17:50 - 17:54
    with the firmware, which is kind of
    boring, and with this family, it is not
  • 17:54 - 17:58
    the case, which I think is very nice. So
    how can we build a driver for this in four
  • 17:58 - 18:03
    very simple steps? One: We remove the
    driver that is currently loaded, because
  • 18:03 - 18:08
    we don't want it to interfere with our
    stuff. Okay, easy so far. Second, we
  • 18:08 - 18:13
    memory-map the PCIe memory-mapped I/O
    address space. This allows us to access
  • 18:13 - 18:16
    the PCI Express device. Number three: We
    figure out the physical addresses of our
  • 18:16 - 18:23
    DMA memory regions in our process's address space and
    then we use them for DMA. And step four is
  • 18:23 - 18:27
    slightly more complicated, than the first
    three steps, as we write the driver. Now,
  • 18:27 - 18:32
    first thing to do, we figure out where
    our network card is - let's say we have a
  • 18:32 - 18:35
    server and we plugged in our network card -
    then it gets assigned an address on the
  • 18:35 - 18:40
    PCI bus. We can figure that out with
    lspci, this is the address. We need it in
  • 18:40 - 18:43
    a slightly different version with the
    fully qualified ID, and then we can remove
  • 18:43 - 18:48
    the kernel driver by telling the currently
    bound driver to remove that specific ID.
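    (In code, removing the driver is just a write to a sysfs file; a small
    sketch, with the PCI address as a placeholder:)

      #include <limits.h>
      #include <stdio.h>

      /* unbind whatever kernel driver is bound to the device;
         pci_addr is the fully qualified ID, e.g. "0000:03:00.0" */
      static void remove_kernel_driver(const char *pci_addr)
      {
          char path[PATH_MAX];
          snprintf(path, sizeof(path),
                   "/sys/bus/pci/devices/%s/driver/unbind", pci_addr);
          FILE *f = fopen(path, "w");
          if (f == NULL)
              return;              /* no driver bound, nothing to do */
          fputs(pci_addr, f);      /* the kernel driver releases the device */
          fclose(f);
      }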
  • 18:48 - 18:52
    Now the operating system doesn't know
    that this is a network card; doesn't know
  • 18:52 - 18:56
    anything, just notes that some PCI device
    has no driver. Then we write our
  • 18:56 - 18:59
    application.
    This is written in C and we just open
  • 18:59 - 19:04
    this magic file in sysfs and this magic
    file; we just mmap it. Ain't no magic,
  • 19:04 - 19:08
    just a normal mmap there. But what we get
    back is a kind of special memory region.
  • 19:08 - 19:12
    This is the memory-mapped I/O
    region of the PCI device's address
  • 19:12 - 19:18
    space and this is where all the registers
    are available. Meaning, I will show you
  • 19:18 - 19:21
    what that means in just a second. If we
    go through the datasheet, there are
  • 19:21 - 19:26
    hundreds of pages of tables like this and
    these tables tell us the registers, that
  • 19:26 - 19:30
    exist on that network card, the offset
    they have and a link to more detailed
  • 19:30 - 19:35
    descriptions. And in code that looks like
    this: For example the LED control register
  • 19:35 - 19:38
    is at this offset, and then there is the LED control
    register itself.
  • 19:38 - 19:43
    In this register there are 32 bits, with
    named bits at various offsets. Bit 7 is called
  • 19:43 - 19:49
    LED0_BLINK and if we set that bit in that
    register, then one of the LEDs will start
  • 19:49 - 19:54
    to blink. And we can just do that via our
    magic memory region, because all the reads
  • 19:54 - 19:58
    and writes, that we do to that memory
    region, go directly over the PCI Express
  • 19:58 - 20:02
    bus to the network card and the network
    card does whatever it wants to do with
  • 20:02 - 20:03
    them.
    It doesn't have to be a register,
  • 20:03 - 20:09
    basically it's just a command, to send to
    a network card and it's just a nice and
  • 20:09 - 20:12
    convenient interface to map that into
    memory. This is a very common technique,
  • 20:12 - 20:15
    that you will also find when you do some
    microprocessor programming or something.
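    (A sketch of that LED example in code; the 0x00200 offset of the LED
    control register is taken from the 82599 datasheet, the PCI address is
    a placeholder and error handling is omitted:)

      #include <fcntl.h>
      #include <stdint.h>
      #include <sys/mman.h>
      #include <sys/stat.h>
      #include <unistd.h>

      #define LEDCTL      0x00200    /* LED control register offset (datasheet) */
      #define LED0_BLINK  (1u << 7)  /* bit 7: LED0_BLINK */

      int main(void)
      {
          /* resource0 in sysfs is the device's first MMIO region (BAR0) */
          int fd = open("/sys/bus/pci/devices/0000:03:00.0/resource0", O_RDWR);
          struct stat st;
          fstat(fd, &st);
          volatile uint32_t *regs = mmap(NULL, st.st_size,
                                         PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
          /* this is not a memory access: the write becomes a PCIe transaction */
          regs[LEDCTL / 4] |= LED0_BLINK;
          return 0;
      }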
  • 20:16 - 20:20
    So, and one thing to note is, since this
    is not memory: That also means, it can't
  • 20:20 - 20:24
    be cached. There's no cache in between.
    Each of these accesses will trigger a PCI
  • 20:24 - 20:29
    Express transaction and it will take quite
    some time. Speaking of lots of
  • 20:29 - 20:33
    cycles, where lots means like hundreds of
    cycles or a hundred cycles, which is a lot
  • 20:33 - 20:37
    for me.
    So how do we now handle packets? We now
  • 20:37 - 20:42
    can, we have access to these registers, we
    can read the datasheet and we can write
  • 20:42 - 20:47
    the driver, but we need some way to
    get packets through that. Of course it
  • 20:47 - 20:51
    would be possible to write a network card
    that does that via this memory-mapped I/O
  • 20:51 - 20:57
    region but it's kind of annoying. The
    second way a PCI Express device
  • 20:57 - 21:01
    communicates with your server or macbook
    is via DMA, direct memory access, and a
  • 21:01 - 21:08
    DMA transfer, unlike the memory-mapped I/O
    stuff is initiated by the network card and
  • 21:08 - 21:14
    this means the network card can just write
    to arbitrary addresses in main memory.
  • 21:14 - 21:20
    And for this the network card offers so-called
    rings, which are queue interfaces
  • 21:20 - 21:23
    for receiving packets and for sending
    packets, and there are multiple of these
  • 21:23 - 21:27
    interfaces, because this is how you do
    multi-core scaling. If you want to
  • 21:27 - 21:31
    transmit from multiple cores, you allocate
    multiple queues. Each core sends to one
  • 21:31 - 21:34
    queue and the network card just merges
    these queues in hardware onto the link,
  • 21:34 - 21:39
    and on receiving the network card can
    either hash on the incoming
  • 21:39 - 21:43
    packet like hash over protocol headers or
    you can set explicit filters.
  • 21:43 - 21:47
    This is not specific to network cards;
    most PCI Express devices work like this:
  • 21:47 - 21:52
    GPUs have command queues
    and so on, NVMe PCI Express disks have
  • 21:52 - 21:57
    queues and...
    So let's look at queues using the example of the
  • 21:57 - 22:01
    ixgbe family but you will find that most
    NICs work in a very similar way. There are
  • 22:01 - 22:04
    sometimes small differences but mainly
    they work like this.
  • 22:04 - 22:09
    And these rings are just circular buffers
    filled with so-called DMA descriptors. A
  • 22:09 - 22:14
    DMA descriptor is a 16-byte struct and
    that is eight bytes of a physical pointer
  • 22:14 - 22:19
    pointing to some location where more stuff
    is and eight bytes of metadata like "I
  • 22:19 - 22:24
    fetch the stuff" or "this packet needs
    VLAN tag offloading" or "this packet had a
  • 22:24 - 22:27
    VLAN tag that I removed", information like
    that is stored in there.
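    (In C such a descriptor is just a small struct; this is a simplified,
    condensed version of the advanced receive descriptor from the 82599
    datasheet, not the exact layout:)

      #include <stdint.h>

      /* 16 bytes: 8 bytes of pointer plus 8 bytes of metadata (simplified) */
      struct rx_descriptor {
          uint64_t buffer_addr;          /* physical address of the packet buffer */
          union {
              uint64_t metadata;
              struct {
                  uint32_t status_error; /* e.g. the "descriptor done" flag */
                  uint16_t length;       /* length of the received packet */
                  uint16_t vlan_tag;     /* stripped VLAN tag, if offloading is on */
              } wb;                      /* "write-back" format, filled by the NIC */
          };
      };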
  • 22:27 - 22:31
    And what we then need to do is we
    translate virtual addresses from our
  • 22:31 - 22:35
    address space to physical addresses
    because the PCI Express device of course
  • 22:35 - 22:39
    needs physical addresses.
    And we can do that using procfs:
  • 22:39 - 22:46
    In the /proc/self/pagemap we can do that.
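    (A sketch of that lookup; pagemap has one 8-byte entry per page, bits
    0-54 are the physical page frame number, and reading them requires
    root. Error handling is omitted:)

      #include <fcntl.h>
      #include <stdint.h>
      #include <unistd.h>

      /* translate a virtual address of our own process to a physical address */
      static uintptr_t virt_to_phys(void *virt)
      {
          long page_size = sysconf(_SC_PAGESIZE);
          int fd = open("/proc/self/pagemap", O_RDONLY);
          uint64_t entry = 0;
          pread(fd, &entry, sizeof(entry),
                (uintptr_t)virt / page_size * sizeof(entry));
          close(fd);
          /* physical frame number * page size + offset within the page */
          return (entry & 0x7fffffffffffffULL) * page_size
                 + (uintptr_t)virt % page_size;
      }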
    And the next thing is we now have this
  • 22:46 - 22:52
    queue of DMA descriptors in memory
    and this queue itself is also accessed via
  • 22:52 - 22:57
    DMA, and it works like
    you expect a circular ring to work. It has
  • 22:57 - 23:01
    a head and a tail, and the head and tail
    pointer are available via registers in
  • 23:01 - 23:06
    memory-mapped I/O address space, meaning
    in an image it looks kind of like this: We
  • 23:06 - 23:10
    have this descriptor ring in our physical
    memory to the left full of pointers and
  • 23:10 - 23:16
    then we have somewhere else these packets
    in some memory pool. And one thing to note
  • 23:16 - 23:20
    when allocating this kind of memory: There
    is a small trick you have to do because
  • 23:20 - 23:25
    the descriptor ring needs to be in
    contiguous memory in your physical memory
  • 23:25 - 23:29
    and if you just assume
    everything that's contiguous in your
  • 23:29 - 23:34
    process is also physically contiguous in hardware: No
    it isn't, and if you have a bug in there
  • 23:34 - 23:38
    and then it writes to somewhere else then
    your filesystem dies as I figured out,
  • 23:38 - 23:43
    which was not a good thing.
    So ... we, what I'm doing is I'm using
  • 23:43 - 23:47
    huge pages, two megabyte pages, that's
    enough of contiguous memory and that's
  • 23:47 - 23:54
    guaranteed to not have weird gaps.
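    (One way to get such memory, as a sketch: an anonymous 2 MB huge-page
    mapping, assuming huge pages have been reserved beforehand, e.g. via
    /proc/sys/vm/nr_hugepages:)

      #include <stddef.h>
      #include <sys/mman.h>

      #define HUGE_PAGE_SIZE (2 * 1024 * 1024)

      /* one 2 MB huge page: physically contiguous and not swapped out,
         so its physical address can safely be handed to the NIC for DMA */
      static void *alloc_dma_memory(void)
      {
          void *mem = mmap(NULL, HUGE_PAGE_SIZE, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
          if (mem == MAP_FAILED)
              return NULL;
          mlock(mem, HUGE_PAGE_SIZE);  /* keep it resident, belt and suspenders */
          return mem;
      }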
    So, um ... now to receive packets we need to
  • 23:54 - 23:59
    set up the ring so we tell the network
    card via memory-mapped I/O the location and
  • 23:59 - 24:03
    the size of the ring, then we fill up the
    ring with pointers to freshly allocated
  • 24:03 - 24:10
    memory that are just empty and now we set
    the head and tail pointer to tell the network
  • 24:10 - 24:13
    card that the queue is full,
    because the queue is at the moment full,
  • 24:13 - 24:17
    it's full of packets. These packets are
    just not yet filled with anything. And now
  • 24:17 - 24:21
    what the NIC does, it fetches one of the
    DMA descriptors and as soon as it receives
  • 24:21 - 24:26
    a packet it writes the packet via DMA to
    the location specified in the descriptor and
  • 24:26 - 24:30
    increments the head pointer of the queue
    and it also sets a status flag in the DMA
  • 24:30 - 24:34
    descriptor once it's done writing the
    packet to memory, and this step is
  • 24:34 - 24:40
    important because reading back the head
    pointer via MM I/O would be way too slow.
  • 24:40 - 24:43
    So instead we check the status flag
    because the status flag gets handled by
  • 24:43 - 24:47
    the cache and is already in
    cache so we can check that really fast.
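    (Putting the receive side together, the core of the poll loop looks
    roughly like this; it reuses the simplified rx_descriptor struct from
    above and leaves out buffer management and the batched tail pointer
    update that the talk describes next:)

      #include <stdint.h>

      #define STATUS_DD 0x1   /* "descriptor done", set by the NIC via DMA */

      static void poll_rx(volatile struct rx_descriptor *ring,
                          uint16_t ring_size, uint16_t *index)
      {
          /* checking the flag is a cheap read that hits the CPU cache */
          while (ring[*index].wb.status_error & STATUS_DD) {
              /* the packet data is in the buffer this descriptor points to:
                 filter / modify / forward it here */

              /* hand the slot back: a real driver re-writes the buffer
                 address here; simplified to just clearing the status */
              ring[*index].wb.status_error = 0;
              *index = (uint16_t)((*index + 1) % ring_size);
          }
      }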
  • 24:49 - 24:52
    Next step is we periodically poll the
    status flag. This is the point where
  • 24:52 - 24:56
    interrupts might come in useful.
    There's some misconception: people
  • 24:56 - 24:59
    sometimes believe that if you receive a
    packet then you get an interrupt and the
  • 24:59 - 25:02
    interrupt somehow magically contains the
    packet. No it doesn't. The interrupt just
  • 25:02 - 25:06
    contains the information that there is a
    new packet. After the interrupt you would
  • 25:06 - 25:12
    have to poll the status flag anyways. So
    we now have the packet, we process the
  • 25:12 - 25:16
    packet or do whatever, then we reset the
    DMA descriptor, we can either recycle the
  • 25:16 - 25:22
    old packet or allocate a new one and we
    set the ready flag on the status register
  • 25:22 - 25:26
    and we adjust the tail pointer register to
    tell the network card that we are done
  • 25:26 - 25:28
    with this, and we don't have to do that
    every time because we don't have to keep the
  • 25:28 - 25:33
    queue 100% utilized. We can update
    the tail pointer only every hundred
  • 25:33 - 25:38
    packets or so and then that's not a
    performance problem. What now? We have a
  • 25:38 - 25:42
    driver that can receive packets. Next
    steps, well transmit packets, it basically
  • 25:42 - 25:46
    works the same. I won't bore you with the
    details. Then there's of course a lot of
  • 25:46 - 25:51
    boring initialization code and it's
    just following the datasheet, they are
  • 25:51 - 25:54
    like: set this register, set that
    register, do that and I just coded it down
  • 25:54 - 25:59
    from the datasheet and it works, so big
    surprise. Then now you know how to write a
  • 25:59 - 26:04
    driver like this and a few ideas of what
    ... what I want to do, what maybe you want
  • 26:04 - 26:07
    to do with a driver like this. One, of
    course, is to look at performance, to look
  • 26:07 - 26:10
    at what makes this faster than the kernel,
    then I want to look at some obscure
  • 26:10 - 26:13
    hardware/offloading features.
    In the past I've looked at IPSec
  • 26:13 - 26:16
    offloading, which is quite interesting,
    because the Intel network cards have
  • 26:16 - 26:20
    hardware support for IPSec offloading, but
    none of the Intel drivers had it and it
  • 26:20 - 26:24
    seems to work just fine. So not sure
    what's going on there. Then security is
  • 26:24 - 26:29
    interesting. There are ... there are
    obviously some security implications of
  • 26:29 - 26:33
    having the whole driver in a user space
    process and ... and I'm wondering about
  • 26:33 - 26:37
    how we can use the IOMMU, because it turns
    out, once we have set up the memory
  • 26:37 - 26:40
    mapping we can drop all the privileges, we
    don't need them.
  • 26:40 - 26:44
    And if we set up the IOMMU before to
    restrict the network card to certain
  • 26:44 - 26:49
    things then we could have a safe driver in
    userspace that can't do anything wrong,
  • 26:49 - 26:52
    because it has no privileges and the network
    card has no access because it goes through
  • 26:52 - 26:56
    the IOMMU and there are performance
    implications of the IOMMU and so on. Of
  • 26:56 - 27:00
    course, support for other NICs. I want to
    support virtIO, virtual NICs and other
  • 27:00 - 27:04
    programming languages for the driver would
    also be interesting. It's just written in
  • 27:04 - 27:07
    C because C is the lowest common
    denominator of programming languages.
  • 27:07 - 27:13
    To conclude, check out ixy. It's
    BSD-licensed on GitHub and the main thing to
  • 27:13 - 27:16
    take with you is that drivers are really
    simple. Don't be afraid of drivers. Don't
  • 27:16 - 27:20
    be afraid of writing your drivers. You can
    do it in any language and you don't even
  • 27:20 - 27:23
    need to add kernel code. Just map the
    stuff to your process, write the driver
  • 27:23 - 27:27
    and do whatever you want. Okay, thanks for
    your attention.
  • 27:27 - 27:33
    Applause
  • 27:33 - 27:36
    Herald: You have very few minutes left for
  • 27:36 - 27:41
    questions. So if you have a question in
    the room please go quickly to one of the 8
  • 27:41 - 27:47
    microphones in the room. Does the signal
    angel already have a question ready? I
  • 27:47 - 27:53
    don't see anything. Anybody lining up at
    any microphones?
  • 28:07 - 28:09
    Alright, number 6 please.
  • 28:10 - 28:15
    Mic 6: As you're not actually using any of
    the Linux drivers, is there an advantage
  • 28:15 - 28:19
    to using Linux here or could you use any
    open source operating system?
  • 28:19 - 28:24
    Paul: I don't know about other operating
    systems but the only thing I'm using of
  • 28:24 - 28:29
    Linux here is the ability to easily map
    that. For some other operating systems we
  • 28:29 - 28:33
    might need a small stub driver that maps
    the stuff in there. You can check out the
  • 28:33 - 28:37
    DPDK FreeBSD port which has a small stub
    driver that just handles the memory
  • 28:37 - 28:41
    mapping.
    Herald: Here, at number 2.
  • 28:41 - 28:45
    Mic 2: Hi, erm, slightly disconnected to
    the talk, but I just like to hear your
  • 28:45 - 28:51
    opinion on smart NICs where they're
    considering putting CPUs on the NIC
  • 28:51 - 28:55
    itself. So you could imagine running Open
    vSwitch on the CPU on the NIC.
  • 28:55 - 29:00
    Paul: Yeah, I have some smart NIC
    somewhere in some lab and have also done
  • 29:00 - 29:06
    work with the NetFPGA. I think that it's
    very interesting, but it ... it's a
  • 29:06 - 29:10
    complicated trade-off, because these smart
    NICs come with new restrictions and they
  • 29:10 - 29:14
    are not dramatically super fast. So it's
    ... it's interesting from a performance
  • 29:14 - 29:18
    perspective to see when it's worth it,
    when it's not worth it and what I
  • 29:18 - 29:22
    personally think it's probably better to
    do everything with raw CPU power.
  • 29:22 - 29:25
    Mic 2: Thanks.
    Herald: Alright, before we take the next
  • 29:25 - 29:30
    question, just for the people who don't
    want to stick around for the Q&A. If you
  • 29:30 - 29:34
    really do have to leave the room early,
    please do so quietly, so we can continue
  • 29:34 - 29:39
    the Q&A. Number 6, please.
    Mic 6: So how does the performance of the
  • 29:39 - 29:43
    userspace driver compare to the XDP
    solution?
  • 29:43 - 29:51
    Paul: Um, it's slightly faster. But one
    important thing about XDP is, if you look
  • 29:51 - 29:55
    at this, this is still new work and there
    is ... there are few important
  • 29:55 - 29:58
    restrictions like you can write your
    userspace thing in whatever programming
  • 29:58 - 30:02
    language you want. Like I mentioned, Snabb
    has a driver entirely written in Lua. With
  • 30:02 - 30:07
    XDP you are restricted to eBPF, meaning
    usually a restricted subset of C and then
  • 30:07 - 30:10
    there's a bytecode verifier but you can
    disable the bytecode verifier if you want
  • 30:10 - 30:14
    to, meaning you again have
    weird restrictions that you maybe don't
  • 30:14 - 30:19
    want and also XDP requires patched driv
    ... not patched drivers but requires a new
  • 30:19 - 30:24
    memory model for the drivers. So at the moment
    DPDK supports more drivers than XDP in the
  • 30:24 - 30:27
    kernel, which is kind of weird, and
    they're still lacking many features like
  • 30:27 - 30:31
    sending back to a different NIC.
    One very very good use case for XDP is
  • 30:31 - 30:35
    firewalling for applications on the same
    host because you can pass on a packet to
  • 30:35 - 30:40
    the TCP stack and this is a very good use
    case for XDP. But overall, I think that
  • 30:40 - 30:47
    ... that both things are very very
    different and XDP is slightly slower but
  • 30:47 - 30:51
    it's not slower in such a way that it
    would be relevant. So it's fast, to
  • 30:51 - 30:55
    answer the question.
    Herald: All right, unfortunately we are
  • 30:55 - 30:59
    out of time. So that was the last
    question. Thanks again, Paul.
  • 30:59 - 31:08
    Applause
  • 31:08 - 31:13
    34c3 outro
  • 31:13 - 31:30
    subtitles created by c3subtitles.de
    in the year 2018. Join, and help us!