
Building a high throughput low-latency PCIe based SDR (33c3)

  • 0:04 - 0:09
    [Music]
  • 0:09 - 0:22
    Herald: Has anyone in here ever worked
    with libusb or PyUSB? Hands up. Okay. Who
  • 0:22 - 0:32
    also thinks USB is a pain? [laughs] Okay.
    Sergey and Alexander were here back at
  • 0:32 - 0:39
    the 26C3, that's a long time ago. I think
    it was back in Berlin, and back then they
  • 0:39 - 0:45
    presented their first homemade, or not
    homemade, SDR, software-defined radio.
  • 0:45 - 0:49
    This year they are back again and they
    want to show us how they implemented
  • 0:49 - 0:55
    another one, using an FPGA, and to
    communicate with it they used PCI Express.
  • 0:55 - 1:02
    So I think if you thought USB was a pain,
    let's see what they can tell us about PCI
  • 1:02 - 1:07
    Express. A warm round of applause for
    Alexander and Sergey for building a high
  • 1:07 - 1:12
    throughput, low latency, PCIe-based
    software-defined radio
  • 1:12 - 1:20
    [Applause]
    Alexander Chemeris: Hi everyone, good
  • 1:20 - 1:30
    morning, and welcome to the first day of
    the Congress. So, just a little bit
  • 1:30 - 1:36
    background about what we've done
    previously and why we are doing what we
  • 1:36 - 1:42
    are doing right now, is that we started
    working with software-defined radios and
  • 1:42 - 1:52
    by the way, who knows what software-
    defined radio is? Okay, perfect. [laughs]
  • 1:52 - 1:59
    And who ever actually used a software-
    defined radio? RTL-SDR or...? Okay, less
  • 1:59 - 2:06
    people but that's still quite a lot. Okay,
    good. I wonder whether anyone here used
  • 2:06 - 2:17
    more expensive radios like USRPs? Less
    people, but okay, good. Cool. So before
  • 2:17 - 2:23
    2008 I had no idea what software-
    defined radio was; I was working as a voice
  • 2:23 - 2:30
    over IP software person, etc., etc. Then
    in 2008 I heard about OpenBTS, got
  • 2:30 - 2:40
    introduced to software-defined radio and I
    wanted to make it really work and that's
  • 2:40 - 2:52
    what led us to today. In 2009 we had to
    develop the ClockTamer, a piece of hardware which
  • 2:52 - 3:00
    allowed the USRP1 to run GSM
    without problems. If anyone ever tried
  • 3:00 - 3:05
    doing this without a good clock source
    knows what I'm talking about. And we
  • 3:05 - 3:11
    presented this - it wasn't an SDR it was
    just a clock source - we presented this in
  • 3:11 - 3:19
    2009 in 26C3.
    Then I realized that using USRP1 is not
  • 3:19 - 3:24
    really a good idea, because we wanted to
    build a robust, industrial-grade base
  • 3:24 - 3:30
    stations. So we started developing our own
    software defined radio, which we call
  • 3:30 - 3:41
    UmTRX; we started
    this in 2011. Our first base stations with
  • 3:41 - 3:52
    it were deployed in 2013, but I always
    wanted to have something really small and
  • 3:52 - 4:00
    really inexpensive and back then it wasn't
    possible. My original idea in 2011, we
  • 4:00 - 4:08
    was to build a PCI Express card. Mini -
    sorry, not a PCI Express card but a mini PCI
  • 4:08 - 4:10
    card.
    If you remember, there were all these
  • 4:10 - 4:14
    Wi-Fi cards in the mini PCI form factor, and I
    thought that would be really cool to have
  • 4:14 - 4:22
    an SDR in mini PCI, so I could plug it
    into my laptop or into some embedded PC and
  • 4:22 - 4:32
    have a nice SDR equipment, but back then
    it just was not really possible, because
  • 4:32 - 4:38
    electronics were bigger and more power
    hungry and just didn't work that way, so
  • 4:38 - 4:50
    we designed UmTRX to work over gigabit
    ethernet and it was about that size. So
  • 4:50 - 4:57
    now we spent this year designing
    something which really brings me back to what
  • 4:57 - 5:05
    I wanted those years ago, so the XTRX is a
    mini PCI Express card - again, there was no mini PCI
    Express back then - so now it's mini PCI
    Express back then, so now it's mini PCI
    Express, which is even smaller than PCI, I
  • 5:10 - 5:18
    mean mini PCI and it's built to be
    embedded friendly, so you can plug this
  • 5:18 - 5:24
    into a single board computer, embedded
    single board computer. If you have a
  • 5:24 - 5:28
    laptop with a mini PCI Express you can
    plug this into your laptop and you have a
  • 5:28 - 5:35
    really small, software-defined radio
    equipment. And we really want to make it
  • 5:35 - 5:39
    inexpensive, that's why I was asking how
    many of you have ever worked with RTL-
  • 5:39 - 5:44
    SDR, how many of you ever worked with
    USRPs, because the gap between them is
  • 5:44 - 5:54
    pretty big and we want to really bring the
    software-defined radio to masses.
  • 5:54 - 6:00
    Definitely won't be as cheap as RTL-SDR,
    but we try to make it as close as
  • 6:00 - 6:03
    possible.
    And at the same time, so at the size of
  • 6:03 - 6:10
    an RTL-SDR, at a price that is, well, higher, but
    hopefully it will be affordable to
  • 6:10 - 6:17
    pretty much everyone, we really want to
    bring high performance into your hands.
  • 6:17 - 6:23
    And by high performance I mean this is a
    full transmit/receive with two channels
  • 6:23 - 6:28
    transmit, two channels receive, which is
    usually called 2x2 MIMO in the radio
  • 6:28 - 6:37
    world. The goal was to bring it to 160
    megasamples per second, which can roughly
  • 6:37 - 6:44
    give you like 120 MHz of radio spectrum
    available.
  • 6:44 - 6:53
    So what we were able to achieve is, again
    this is mini PCI Express form factor, it
  • 6:53 - 7:02
    has a small Artix-7, the smallest and
    most inexpensive FPGA which has the ability
  • 7:02 - 7:18
    to work with PCI Express. It has the LMS7002M
    chip as the RFIC: a very high performance, very
  • 7:18 - 7:27
    tightly integrated chip with even DSP
    blocks inside. It even has a GPS chip
  • 7:27 - 7:37
    here - on the upper right
    side you can see the GPS chip - so you can
  • 7:37 - 7:44
    actually synchronize your SDR to GPS for
    perfect clock stability,
  • 7:44 - 7:51
    so you won't have any problems running any
    telecommunication systems like GSM, 3G, 4G
  • 7:51 - 7:59
    due to clock problems, and it also has
    interface for SIM cards, so you can
  • 7:59 - 8:06
    actually create a software-defined radio
    modem and run other open source projects
  • 8:06 - 8:16
    to build one - for example an LTE one called srsUE, if
    you're interested - etc., etc. So it's a really,
  • 8:16 - 8:22
    really tightly packed one. And if you put
    this into perspective: that's how it all
  • 8:22 - 8:31
    started in 2006 and that's what you have
    ten years later. It's pretty impressive.
  • 8:31 - 8:37
    [Applause]
    Thanks. But I think it actually applies to
  • 8:37 - 8:40
    the whole industry who is working on
    shrinking the sizes because we just put
  • 8:40 - 8:49
    stuff on the PCB, you know. We're not
    building the silicon itself. Interesting
  • 8:49 - 8:55
    thing is that on our first approach
    we said: let's pack everything, let's do a
  • 8:55 - 9:03
    very tight PCB design. We did an eight-
    layer PCB design, and when we sent it to a
  • 9:03 - 9:10
    fab to estimate the cost it turned out
    it's $15,000 US per piece. Well in small
  • 9:10 - 9:19
    volumes obviously but still a little bit
    too much. So we had to redesign this and
  • 9:19 - 9:27
    the first thing which we did is we still
    kept eight layers, because in our
  • 9:27 - 9:33
    experience number of layers nowadays have
    only minimal impact on the cost of the
  • 9:33 - 9:42
    device. So like six, eight layers - the
    price difference is not so big. But we did
  • 9:42 - 9:52
    a complete rerouting and only kept 2-deep
    microvias and never used buried vias.
  • 9:52 - 9:57
    So this make it much easier and much
    faster for the fab to manufacture it and
  • 9:57 - 10:04
    the price suddenly went down five or six
    times, and in volume it will again be
  • 10:04 - 10:18
    significantly cheaper. And, just as geek
    porn, that's how the PCB looks inside. So now
  • 10:18 - 10:25
    let's go into real stuff. So PCI Express:
    why did we choose PCI Express? As it was
  • 10:25 - 10:33
    said USB is a pain in the ass. You can't
    really use USB in industrial systems. For
  • 10:33 - 10:41
    a whole variety of reasons just unstable.
    So we did use Ethernet for many years
  • 10:41 - 10:47
    successfully but Ethernet has one problem:
    first of all inexpensive Ethernet is only
  • 10:47 - 10:52
    one gigabit and one gigabit does not offer
    you enough bandwidth to carry all the data
  • 10:52 - 11:00
    we want, plus its power-hungry etc. etc.
    So PCI Express is really a good choice
  • 11:00 - 11:06
    because it's low power, it has low
    latency, it has very high bandwidth and
  • 11:06 - 11:11
    it's available almost universally. When we
    started looking into this we realize that
  • 11:11 - 11:17
    even ARM boards, some of ARM boards have
    PCI Express, mini PCI Express slots, which
  • 11:17 - 11:27
    was a big surprise for me for example.
    So the problem is that, unlike USB, you do
  • 11:27 - 11:37
    need to write your own kernel driver for
    this and there's no way around. And it is
  • 11:37 - 11:41
    really hard to write this driver
    universally so we are writing it obviously
  • 11:41 - 11:45
    for Linux, because we're working with
    embedded systems, but if we want to
  • 11:45 - 11:51
    rewrite it for Windows or for macOS we'll
    have to do a lot of rewriting. So we focus
  • 11:51 - 11:57
    on Linux only right now.
    And now the hardest part: debugging is
  • 11:57 - 12:03
    really non-trivial. One small error and
    your PC hangs completely because you
    used something wrong. And you have to
    use something wrong. And you have to
    reboot it and restart it. That's like
  • 12:09 - 12:16
    debugging kernel but sometimes even
    harder. To make it worse there is no
  • 12:16 - 12:19
    really easy-to-use plug-and-play
    interface. Normally, when you develop a PCI
  • 12:19 - 12:24
    Express card and you want
  • 12:24 - 12:31
    to restart it you have to restart your
    development machine. Again not a nice way,
  • 12:31 - 12:39
    it's really hard. So the first thing we
    did is we found, that we can use
  • 12:39 - 12:47
    Thunderbolt 3 which is just recently
    released, and it has ability to work
  • 12:47 - 12:57
    directly with PCI Express bus. So it
    basically has a mode in which it converts
  • 12:57 - 13:01
    a PCI Express into plug-and-play
    interface. So if you have a laptop which
  • 13:01 - 13:09
    supports Thunderbolt 3 then you can use
    this to do plug and play your - plug or
  • 13:09 - 13:16
    unplug your device to make your
    development easier. There are always
  • 13:16 - 13:24
    problems: there's no easy way, there's no
    documentation. Thunderbolt is not
  • 13:24 - 13:27
    compatible with Thunderbolt: Thunderbolt 3
    is not compatible with Thunderbolt 2.
  • 13:27 - 13:34
    So we had to buy a special laptop with
    Thunderbolt 3, with special cables - all
  • 13:34 - 13:40
    this all this hard stuff. And if you
    really want to get documentation you have
  • 13:40 - 13:48
    to sign NDA and send a business plan to
    them so they can approve that your
  • 13:48 - 13:51
    business makes sense.
    [Laughter]
  • 13:51 - 13:59
    I mean... [laughs] So we actually opted
    out. We decided not to go through this; what
  • 13:59 - 14:05
    we did is we found that someone is
    actually making PCI Express to Thunderbolt
  • 14:05 - 14:11
    3 converters and selling them as dev
    boards and that was a big relief because
  • 14:11 - 14:17
    it saved us lots of time, lots of money.
    You just order it from some
  • 14:17 - 14:25
    Asian company. And yeah, this is how this
    converter looks. So you buy
  • 14:25 - 14:30
    like several pieces you can plug in your
    PCI Express card there and you plug this
  • 14:30 - 14:38
    into your laptop. And this is it, with the
    XTRX already plugged into it. Now the only
  • 14:38 - 14:50
    problem we found is that typically UEFI
    has a security control enabled, so that
  • 14:50 - 14:57
    any random Thunderbolt device can't hijack
    your PCI bus and can't get access to your
  • 14:57 - 15:02
    kernel memory and do some bad stuff. Which
    is a good idea - the only problem is that
  • 15:02 - 15:07
    there is, it's not fully implemented in
    Linux. So under Windows if you plug in a
  • 15:07 - 15:12
    device which is which has no security
    features, which is not certified, it will
  • 15:12 - 15:17
    politely ask you like: "Do you really
    trust this device? Do you want to use it?"
  • 15:17 - 15:22
    you can say "yes". Under Linux it just
    does not work. [laughs] So we spent some
  • 15:22 - 15:26
    time trying to figure out how to get
    around this. There are some patches from
  • 15:26 - 15:30
    Intel which are not mainlined, and we were
    not able to actually get them to work. So we
  • 15:30 - 15:39
    just had to disable all these security
    measures in the laptop. So be aware that
  • 15:39 - 15:47
    this is the case and we suspect that happy
    users of Apple might not be able to do
  • 15:47 - 15:54
    this because Apple doesn't have a BIOS setup, so you
    probably can't disable this feature. So
  • 15:54 - 16:02
    probably good incentive for someone to
    actually finish writing the driver.
  • 16:02 - 16:08
    So now to the goal: we
    want to achieve 160 megasamples per
  • 16:08 - 16:14
    second, 2x2 MIMO, which means two
    transceiver, two transmit, two receive
  • 16:14 - 16:24
    channels at 12 bits, which is roughly 7.5
    Gbit/s. So, the first result: when
  • 16:24 - 16:26
    we got this board back from the fab, it
    didn't work.
  • 16:26 - 16:30
    Sergey Kostanbaev (mumbles): As expected.
    Alexander Chemeris: Yes, as expected. So the
  • 16:30 - 16:40
    first interesting thing we realized is
    that the FPGA has hardware
  • 16:40 - 16:47
    blocks for talking to PCI Express,
    called GTP, which basically implement
  • 16:47 - 16:57
    the PCI Express serial physical layer.
    But the thing is, the lane numbering is reversed
  • 16:57 - 17:04
    in the FPGA's PCI Express block, and we did
    not realize this, so we had to do very, very
  • 17:04 - 17:11
    fine soldering to actually swap the
    [laughs] swap the lanes. You can see this
  • 17:11 - 17:18
    very fine work there.
    We also found that one of the components
  • 17:18 - 17:29
    was a 'dead bug', which is a well-known term for
    chips that get soldered upside down. At the design stage we
  • 17:29 - 17:36
    accidentally mirrored the pinout, so we had to
  • 17:36 - 17:42
    solder it upside down and if you can
    realize how small it is you can also
  • 17:42 - 17:49
    appreciate the work done. And what's funny
    when I was looking at dead bugs I actually
  • 17:49 - 17:57
    found a manual from NASA which describes
    how to properly solder dead bugs to get
  • 17:57 - 18:01
    it approved.
    [Audience laughs]
  • 18:01 - 18:08
    So this is the link I think you can go
    there and enjoy it's also fun stuff there.
  • 18:08 - 18:17
    So after fixing all of this, on our next
    attempt it kind of works. So the next stage
  • 18:17 - 18:23
    is debugging the FPGA code, which has to
    talk to PCI Express and PCI Express has to
  • 18:23 - 18:28
    talk to Linux kernel and the kernel has to
    talk to the driver, driver has talked to
  • 18:28 - 18:38
    the user space. So, peripherals are easy:
    the UART and SPIs we got to work almost
  • 18:38 - 18:45
    immediately no problems with that, but DMA
    was a real beast. So we spent a lot of
  • 18:45 - 18:53
    time trying to get DMA to work and the
    problem is that the DMA is on the FPGA, so
  • 18:53 - 19:00
    you can't just place a breakpoint like you
    do in C or C++ or in other languages it's
  • 19:00 - 19:07
    real-time system - real-time hardware - which is running
  • 19:07 - 19:16
    on the fabric. So Sergey, who was
    mainly developing this, had to write a lot
  • 19:16 - 19:23
    of small test benches and and test
    everything piece by piece.
  • 19:23 - 19:31
    So all parts of the DMA code we had was
    wrapped into a small test bench which was
  • 19:31 - 19:40
    emulating all the tricks, and as
    the classics predicted, it took about five to
  • 19:40 - 19:48
    ten times more time than actually writing the
    code. So we really blew past our
  • 19:48 - 19:55
    predicted timelines by doing this, but in the
    end we got really stable operation.
  • 19:55 - 20:04
    So some suggestions for anyone who will
    try to repeat this exercise: there is a
  • 20:04 - 20:10
    logic analyzer built into the Xilinx tools which you
    can use; it's nice, and sometimes it's
  • 20:10 - 20:16
    very helpful but you can't debug
    transient bugs, which only come out
  • 20:16 - 20:23
    when some weird conditions are coming up.
    So you have to implement some read back
  • 20:23 - 20:29
    registers which show important statistics
    about how your system
  • 20:29 - 20:35
    behaves, in our case it's various counters
    on the DMA interface. So you can actually
  • 20:35 - 20:41
    kind of see what's happening with
    your data: Is it received? Is it
  • 20:41 - 20:46
    sent? How much is sent and how much is
    received? So, for example, we can see
  • 20:46 - 20:54
    when we saturate the bus or when there actually
    is an underrun, i.e. the host is not providing
  • 20:54 - 20:57
    data fast enough, so we can at least
    understand whether it's a host problem or
  • 20:57 - 21:02
    an FPGA problem, and know which
    part we debug next, because again
  • 21:02 - 21:08
    it's a very multi layer problem you start
    with FPGA, PCI Express, kernel, driver,
  • 21:08 - 21:15
    user space, and any part can fail. So you
    can't work blind like this.
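A rough illustration of that readback-register idea: you can peek at such counters from user space by mmap'ing the device's BAR0 through sysfs. The PCI address and the register offsets below are made up for illustration; the real XTRX register map is not shown in the talk.

```c
/* Sketch: dump hypothetical DMA debug counters from a PCIe BAR via sysfs. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define REG_DMA_TX_SENT   0x40  /* hypothetical: TX buffers consumed by the device */
#define REG_DMA_RX_RECV   0x44  /* hypothetical: RX buffers delivered to the host   */
#define REG_DMA_UNDERRUNS 0x48  /* hypothetical: TX underrun counter                */

int main(void)
{
    /* PCI address is a placeholder - adjust for the actual device. */
    int fd = open("/sys/bus/pci/devices/0000:01:00.0/resource0", O_RDWR | O_SYNC);
    if (fd < 0) { perror("open BAR0"); return 1; }

    void *map = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }
    volatile uint32_t *regs = map;

    printf("tx sent:   %u\n", regs[REG_DMA_TX_SENT   / 4]);
    printf("rx recv:   %u\n", regs[REG_DMA_RX_RECV   / 4]);
    printf("underruns: %u\n", regs[REG_DMA_UNDERRUNS / 4]);

    munmap(map, 4096);
    close(fd);
    return 0;
}
```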
  • 21:15 - 21:23
    So again, the goal was to get 160 MSPS; with the first
    implementation we could do 2 MSPS: roughly 60
  • 21:23 - 21:30
    times slower.
    The problem is that software just wasn't
  • 21:30 - 21:36
    keeping up and wasn't sending data fast
    enough. So there were many things done,
    but the most important parts are: use real-
    but the most important parts is: use real-
    time priority if you want to get very
  • 21:41 - 21:47
    stable results and well fix software bugs.
    And one of the most important bugs we had
  • 21:47 - 21:54
    was that DMA buffers were not freed
    immediately, so they were busy
  • 21:54 - 21:59
    for longer than they should be, which
    introduced extra cycles and basically just
  • 21:59 - 22:06
    reduced the bandwidth.
    At this point let's talk a little bit
  • 22:06 - 22:14
    about how to implement a high-performance
    driver for Linux, because if you want to
  • 22:14 - 22:21
    get real real performance you have to
    start with the right design. There are
  • 22:21 - 22:27
    basically three approaches and the whole
    spectrum in between; like two approaches
  • 22:27 - 22:34
    and the whole spectrum in between, which
    is where you can refer to three. The first
  • 22:34 - 22:42
    approach is full kernel control, in which
    case kernel driver not only is on the
  • 22:42 - 22:46
    transfer, it actually has all the logics
    of controlling your device and all the
  • 22:46 - 22:52
    export ioctl to the user space and
    that's the kind of a traditional way of
  • 22:52 - 22:58
    writing drivers. Your your user space is
    completely abstracted from all the
  • 22:58 - 23:07
    details. The problem is that this is
    probably the slowest way to do it. The
  • 23:07 - 23:14
    other way is what's called the "zero cup
    interface": your only control is held in
  • 23:14 - 23:21
    the kernel and data is provided, the raw
    data is provided to user space "as-is". So
  • 23:21 - 23:28
    you avoid memory copy which make it
    faster. But still not fast enough if you
  • 23:28 - 23:34
    really want to achieve maximum
    performance, because you still have
  • 23:34 - 23:41
    context switches between the kernel and
    the user space. The most... the fastest
  • 23:41 - 23:47
    approach possible is to have full user
    space implementation when kernel just
  • 23:47 - 23:53
    exposed everything and says "now you do it
    yourself" and you have no you have no
  • 23:53 - 24:02
    context switches, like almost no, and you
    can really optimize everything. So what
  • 24:02 - 24:09
    is... what are the problems with this?
    The pro the pros I already mentioned: no
  • 24:09 - 24:14
    no switches between kernel user space,
    it's very low latency because of this as
  • 24:14 - 24:21
    well, it's very high bandwidth. But if you
    are not interested in getting the very
  • 24:21 - 24:28
    high performance, the most performance, and
    you just want to have like some little,
  • 24:28 - 24:33
    like say low bandwidth performance, then
    you will have to add hacks, because you
  • 24:33 - 24:37
    can't get notifications of the kernel that
    resources available is more data
  • 24:37 - 24:46
    available. It also makes it vulnerable
    vulnerable because if user space can
  • 24:46 - 24:55
    access it, then it can do whatever it
    want. We at the end decided that... one
  • 24:55 - 25:03
    more important thing: how to actually to
    get the best performance out of out of the
  • 25:03 - 25:10
    bus. This is a very (?)(?) set as we want
    to poll your device or not to poll and get
  • 25:10 - 25:14
    notified. What is polling? I guess
    everyone as programmer understands it, so
  • 25:14 - 25:18
    polling is when you asked repeatedly: "Are
    you ready?", "Are you ready?", "Are you
  • 25:18 - 25:20
    ready?" and when it's ready you get the
    data immediately.
  • 25:20 - 25:25
    It's basically a busy loop of your you
    just constantly asking device what's
  • 25:25 - 25:33
    happening. You need to dedicate a full
    core, and thanks God we have multi-core
  • 25:33 - 25:40
    CPUs nowadays, so you can dedicate the
    full core to this polling and you can just
  • 25:40 - 25:46
    pull constantly. But again if you don't
    need this highest performance, you just
  • 25:46 - 25:53
    need to get something, then you will be
    wasting a lot of CPU resources. At the end
  • 25:53 - 26:00
    we decided to do a combined architecture
    of your, it is possible to pull but
  • 26:00 - 26:06
    there's also a chance and to get
    notification from a kernel to for for
  • 26:06 - 26:11
    applications, which recover, which needs
    low bandwidth, but also require a better
  • 26:11 - 26:17
    CPU performance. Which I think is the best
    way if you are trying to target both
  • 26:17 - 26:31
    worlds. Very quickly: the architecture of
    system. We try to make it very very
  • 26:31 - 26:51
    portable so and flexible. There is a
    kernel driver, which talks to low-level
  • 26:51 - 26:56
    library which implements all this logic,
    which we took out of the driver: to
  • 26:56 - 27:01
    control the
    PCI Express, to work with DMA, to provide
  • 27:01 - 27:09
    all the... to hide all the details of the
    actual bus implementation.
  • 27:09 - 27:17
    And then there is a high-level library
    which talks to this low-level library and
  • 27:17 - 27:22
    also to libraries which implement control
    of actual peripherals, and most
  • 27:22 - 27:29
    importantly to the library which
    implements control over our RFIC chip.
  • 27:29 - 27:35
    This way it's very modular, we can replace
    PCI Express with something else later, we
  • 27:35 - 27:46
    might be able to port it to other
    operating systems, and that's the goal.
  • 27:46 - 27:50
    Another interesting issue is: when you
    start writing the Linux kernel driver you
  • 27:50 - 27:57
    very quickly realize that while LDD, which
    is a classic book for a Linux driver,
  • 27:57 - 28:02
    writing is good and it will give you a
    good insight; it's not actually up-to-
  • 28:02 - 28:09
    date. It's more than ten years old and
    there's all of new interfaces which are
  • 28:09 - 28:15
    not described there, so you have to resort
    to reading the manuals and all the
  • 28:15 - 28:20
    documentation in the kernel itself. Well
    at least you get the up-to-date
  • 28:20 - 28:32
    information. The decisions we made is to
    make everything easy. We use TTY for GPS
  • 28:32 - 28:38
    and so you can really attach a pretty much
    any application which talks to GPS. So all
  • 28:38 - 28:46
    of existing applications can just work out
    of the box. And we also wanted to be able
  • 28:46 - 28:55
    to synchronize the system clock to GPS, so we
    get automatic clock synchronization across
  • 28:55 - 28:59
    multiple systems, which is very important
    when we are deploying many, many devices
  • 28:59 - 29:07
    around the world.
    We plan to do two interfaces: one is the kernel
  • 29:07 - 29:16
    PPS interface and the other is the DCD line
    on the UART exposed over the TTY. Because
  • 29:16 - 29:20
    again we found that there are two types of
    applications: one to support one API,
  • 29:20 - 29:26
    others that support other API and there is
    no common thing so we have to support
  • 29:26 - 29:39
    both. As we described, we want to have
    poll support so we can get notifications from the
  • 29:39 - 29:48
    kernel when data is available and we don't
    need to do real busy looping all the time.
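For the 1PPS side, here is a small sketch of how a user-space program could consume the pulse through the standard Linux kernel PPS API, assuming the GPS pulse is exposed as /dev/pps0 (for example via the UART's DCD line with the PPS line discipline attached); tools like ntpd or chrony normally do this for you:

```c
/* Sketch: read 1PPS timestamps via the kernel PPS API (pps-tools headers). */
#include <fcntl.h>
#include <stdio.h>
#include <sys/timepps.h>

int main(void)
{
    int fd = open("/dev/pps0", O_RDWR);              /* assumed PPS source */
    if (fd < 0) { perror("open /dev/pps0"); return 1; }

    pps_handle_t handle;
    if (time_pps_create(fd, &handle) < 0) { perror("time_pps_create"); return 1; }

    pps_info_t info;
    struct timespec timeout = { 3, 0 };              /* give up after 3 s */
    for (;;) {
        if (time_pps_fetch(handle, PPS_TSFMT_TSPEC, &info, &timeout) < 0)
            break;
        printf("PPS edge at %ld.%09ld\n",
               (long)info.assert_timestamp.tv_sec,
               (long)info.assert_timestamp.tv_nsec);
    }
    time_pps_destroy(handle);
    return 0;
}
```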
  • 29:48 - 29:56
    After all the software optimizations we've
    got to like 10 MSPS: still very, very far
  • 29:56 - 30:02
    from what we want to achieve.
    Now there should have been a lot of
  • 30:02 - 30:07
    explanations about PCI Express, but when
    we actually wrote everything we wanted to
  • 30:07 - 30:14
    say, we realized it's just a full two-
    hour talk just on PCI Express. So we are
  • 30:14 - 30:18
    not going to give it here, I'll just give
    some highlights which are most
  • 30:18 - 30:24
    interesting. If there is real
    interest, we can set up a workshop on
  • 30:24 - 30:32
    one of the later days and talk in more
    detail about PCI Express specifically.
  • 30:32 - 30:39
    The thing is, there are no open source cores
    for PCI Express, which are optimized for
  • 30:39 - 30:48
    high performance, real time applications.
    There is Xillybus which as I understand is
  • 30:48 - 30:53
    not going to be open source, but they provide
    you the source if you pay them. It's very
  • 30:53 - 31:00
    popular because it's very very easy to do,
    but it's not giving you performance. If I
  • 31:00 - 31:05
    remember correctly the best it can do is
    maybe like 50 percent bus saturation.
  • 31:05 - 31:11
    So there's also Xilinx implementation, but
    if you are using Xilinx implementation
  • 31:11 - 31:21
    with the AXI bus, then you're really locked into
    the AXI bus and into Xilinx. And it's also not
  • 31:21 - 31:25
    very efficient in terms of resources and
    if you remember we want to make this very,
  • 31:25 - 31:30
    very inexpensive. So our goal
    is to be able to fit everything in the
  • 31:30 - 31:38
    smallest Artix-7 FPGA, and that's quite
    challenging with all the stuff in there
  • 31:38 - 31:48
    and we just can't waste resources. So the
    decision was to write our own PCI Express
  • 31:48 - 31:53
    implementation. That's how it looks like.
    I'm not going to discuss it right now.
  • 31:53 - 32:00
    There are several iterations. Initially it
    looked much simpler, turned out not to
  • 32:00 - 32:06
    work well.
    So some interesting stuff about PCI
  • 32:06 - 32:13
    Express which we stumbled upon is that it
    was working really well on Atom which is
  • 32:13 - 32:17
    our main development platform because we
    are doing a lot of embedded stuff. Worked
  • 32:17 - 32:26
    really well. When we tried to plug this into a
    Core i7, it just started hanging once in a
  • 32:26 - 32:35
    while. So after several days
    of debugging, Sergey found a
  • 32:35 - 32:39
    very interesting statement in the standard
    which says that value is zero in byte
  • 32:39 - 32:46
    count actually stands not for zero bytes
    but for 4096 bytes.
  • 32:46 - 32:59
    I mean that's a really cool optimization.
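In other words, a decoder has to special-case the zero encoding. Schematically (field extraction simplified; in the PCIe TLP header the 10-bit Length field and the completion's 12-bit Byte Count field both use 0 for their maximum):

```c
#include <stdint.h>

static inline uint32_t tlp_length_dw(uint32_t hdr_dw0)
{
    uint32_t len = hdr_dw0 & 0x3FF;      /* 10-bit Length, in 32-bit words */
    return len ? len : 1024;             /* 0 encodes 1024 DW = 4096 bytes */
}

static inline uint32_t cpl_byte_count(uint32_t hdr_dw1)
{
    uint32_t bc = hdr_dw1 & 0xFFF;       /* 12-bit Byte Count field        */
    return bc ? bc : 4096;               /* 0 encodes 4096 bytes           */
}
```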
    So another thing is completion which is a
  • 32:59 - 33:04
    term in PCI Express basically for
    acknowledgment which also can carry some
  • 33:04 - 33:12
    data back to your request. And sometimes
    if you're not sending completion, device
  • 33:12 - 33:21
    just hangs. And what happens is that in
    this case due to some historical heritage
  • 33:21 - 33:30
    of x86 it just starts returning you all-ones (0xFF...F).
    And if you have a register which says: „Is
  • 33:30 - 33:35
    your device okay?“ and this register shows
    one to say „The device is okay“, guess
  • 33:35 - 33:38
    what will happen?
    You will be always reading that your
  • 33:38 - 33:47
    device is okay. So the suggestion is not
    to use one as the status for okay and use
  • 33:47 - 33:53
    either zero or better like a two-beat
    sequence. So you are definitely sure that
  • 33:53 - 34:04
    you are okay and not just getting all-ones back.
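A sketch of that suggestion on the host side; the register offset and the two-bit 'alive' pattern are made up for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

#define REG_ALIVE     0x00      /* hypothetical "are you there?" register      */
#define ALIVE_PATTERN 0x2       /* two-bit pattern '10' - can never look like  */
                                /* the all-ones value a dead bus returns       */

static bool device_alive(volatile uint32_t *regs)
{
    uint32_t v = regs[REG_ALIVE / 4];
    if (v == 0xFFFFFFFF)        /* completion never came back: bus reads all-ones */
        return false;
    return (v & 0x3) == ALIVE_PATTERN;
}
```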
    when you have a device which again may
  • 34:04 - 34:10
    fail at any of the layers, you just got
    this new board, it's really hard, it's
  • 34:10 - 34:18
    really hard to debug because of memory
    corruption. So we had a software bug and
  • 34:18 - 34:25
    it was writing DMA addresses
    incorrectly and we were wondering why we
  • 34:25 - 34:32
    were not getting any data in our buffers. At
    the same time. After several starts,
  • 34:32 - 34:41
    operating system just crashes. Well, that's
    the reason why there is this UEFI
  • 34:41 - 34:47
    protection which prevents you from
    plugging in devices like this into your
  • 34:47 - 34:52
    computer. Because it was basically writing
    data, like random data into random
  • 34:52 - 35:00
    portions of your memory. So a lot of
    debugging, a lot of tests and test benches
  • 35:00 - 35:11
    and we were able to find this. And another
    thing is if you deinitialize your driver
  • 35:11 - 35:15
    incorrectly, and that's what's happening
    when you have plug-and-play device, which
  • 35:15 - 35:22
    you can plug and unplug, then you may end
    up in a situation of your ... you are
  • 35:22 - 35:28
    trying to write into memory which is
    already freed by the operating system and
  • 35:28 - 35:36
    used for something else. Very well-known
    problem, but it also happens here. So,
  • 35:36 - 35:51
    why DMA is really hard is because it
    has this completion architecture for
  • 35:51 - 35:56
    writing - sorry - for reading
    data. Writes are easy. You just send the
  • 35:56 - 36:00
    data, you forget about it. It's a fire-
    and-forget system. But for reading you
  • 36:00 - 36:10
    really need to get your data back. And the
    thing is, it looks like this. You really
  • 36:10 - 36:16
    hope that there would be some pointing
    device here. But basically on the top left
  • 36:16 - 36:24
    you can see requests for read and on the
    right you can see completion transactions.
  • 36:24 - 36:30
    So basically each transaction can be and
    most likely will be split into multiple
  • 36:30 - 36:39
    transactions. So first of all you have to
    collect all these pieces and like write
  • 36:39 - 36:46
    them into proper parts of the memory.
    But that's not all. The thing is the
  • 36:46 - 36:53
    latency between request and completion is
    really high. It's like 50 cycles. So if
  • 36:53 - 36:59
    you have only a single transaction
    in flight, you will get really bad
  • 36:59 - 37:04
    performance. You do need to have multiple
    transactions in flight. And the worst
  • 37:04 - 37:13
    thing is that transactions can return data
    in random order. So it's a much more
  • 37:13 - 37:20
    complicated state machine than we expected
    originally.
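A simplified software model of what that state machine has to do. The real logic lives in the FPGA; this C sketch only illustrates per-tag reassembly, relying on the fact that completions for a single request arrive in address order while completions for different requests may interleave arbitrarily:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MAX_TAGS 32

struct read_req {
    uint8_t *dst;       /* where the data finally belongs        */
    uint32_t done;      /* bytes already received for this tag   */
    uint32_t total;     /* bytes requested with this tag         */
    bool     busy;
};

static struct read_req reqs[MAX_TAGS];

/* Called for every completion TLP that comes back from the host. */
static void on_completion(uint8_t tag, const uint8_t *payload, uint32_t len)
{
    struct read_req *r = &reqs[tag];
    memcpy(r->dst + r->done, payload, len);   /* pieces of one tag arrive in order */
    r->done += len;
    if (r->done == r->total)
        r->busy = false;                      /* request finished: tag may be reused */
}
```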
  • 37:20 - 37:26
    So when I said that the architecture was much simpler originally:
    it didn't have all of this, and we had to
  • 37:26 - 37:32
    realize this while implementing. So again
    here was a whole description of how
  • 37:32 - 37:41
    exactly this works. But not this time. So
    now after all these optimizations we've
  • 37:41 - 37:49
    got 20 megasamples per second, which is
    just eight times lower than what we are
  • 37:49 - 38:00
    aiming at. So now the next thing is PCI
    Express lane scalability. So PCI Express
  • 38:00 - 38:07
    is a serial bus. So it has multiple lanes
    and they allow you to basically
  • 38:07 - 38:14
    horizontally scale your bandwidth. One
    lane is like x, then two lanes are 2x, four
  • 38:14 - 38:20
    lane is 4x. So the more lanes you have the
    more performance you are getting out of
  • 38:20 - 38:24
    your bus. So the more
    bandwidth - not performance - you're getting out of your bus.
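Rough back-of-the-envelope numbers for why the lanes matter here (raw line rates before packet overhead; the sample-rate figures are the goal stated earlier in the talk):

$$
\begin{aligned}
\text{needed} &\approx 160\,\mathrm{MS/s}\times 2\ \text{channels}\times 2\ (I/Q)\times 12\,\mathrm{bit} \approx 7.7\,\mathrm{Gbit/s},\\
\text{Gen1 x1} &= 2.5\,\mathrm{GT/s}\times\tfrac{8}{10} = 2\,\mathrm{Gbit/s},\qquad
\text{Gen1 x2} = 4\,\mathrm{Gbit/s},\\
\text{Gen2 x2} &= 2\times 5\,\mathrm{GT/s}\times\tfrac{8}{10} = 8\,\mathrm{Gbit/s}.
\end{aligned}
$$

So only Gen2 with two lanes clears the target, which is why both the second lane and (as described below) the Gen2-capable switch chip matter.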
  • 38:24 - 38:32
    So the issue is that
    the mini
  • 38:32 - 38:39
    PCI Express standard only standardized one
    lane. And second lane is left as optional.
  • 38:39 - 38:46
    So most motherboards don't support this.
    There are some but not all of them. And we
  • 38:46 - 38:52
    really wanted to get this done. So we
    designed a special converter board which
  • 38:52 - 38:58
    allows you to plug your mini PCI Express
    into a full-size PCI Express and
  • 38:58 - 39:07
    get two lanes working. And we're also
    planning to have a similar board which
  • 39:07 - 39:13
    will have multiple slots so you will be
    able to get multiple XTRX-SDRs on to the
  • 39:13 - 39:21
    same, onto the same carrier board and plug
    this into let's say PCI Express 16x and
  • 39:21 - 39:29
    you will get really a lot of SDR...
    a lot of IQ data, which then will be
  • 39:29 - 39:39
    your problem how to process. So
    with two lanes it's about twice the performance,
  • 39:39 - 39:49
    so we are getting fifty mega samples per
    second. And that's the time to really cut
  • 39:49 - 39:59
    the fat because the real sample size of
    LMS7 is 12 bits and we are transmitting 16
  • 39:59 - 40:07
    because it's easier - the CPU works
    on 8, 16, 32 bits. So we originally
  • 40:07 - 40:14
    designed the driver to support 8-bit, 12-
    bit and 16-bit modes, to be able to do this scaling.
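A sketch of what the 12-bit mode implies on the host side: two 12-bit samples packed into three bytes have to be unpacked into the 16-bit integers the CPU actually works with. The exact bit layout below is an assumption for illustration, not the documented wire format:

```c
#include <stdint.h>

/* Unpack pairs of 12-bit samples (3 bytes each) into sign-extended int16_t. */
static void unpack_12bit(const uint8_t *in, int16_t *out, unsigned pairs)
{
    for (unsigned i = 0; i < pairs; i++, in += 3, out += 2) {
        int a = in[0] | ((in[1] & 0x0F) << 8);        /* low 12 bits          */
        int b = (in[1] >> 4) | (in[2] << 4);          /* high 12 bits         */
        out[0] = (int16_t)(a << 4) >> 4;              /* sign-extend from 12  */
        out[1] = (int16_t)(b << 4) >> 4;
    }
}
```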
  • 40:14 - 40:24
    And for the test we said: okay,
    let's go from 16 to 8 bit. We'll lose
  • 40:24 - 40:33
    some dynamic range but who cares these
    days. The rate still stayed the same - still 50
  • 40:33 - 40:42
    megasamples per second - no matter what we
    did. And there was a lot of interesting
  • 40:42 - 40:50
    debugging going on. And we realized that
    we actually made another - well, not really a
  • 40:50 - 40:59
    mistake; we just didn't know
    this when we designed it. We should have
  • 40:59 - 41:04
    used a higher voltage for this high speed
    bus to get it to the full performance. And
  • 41:04 - 41:13
    at 1.8 V the signal was just degrading too fast and
    the bus itself was not performing well. So
  • 41:13 - 41:22
    our next prototype will be using higher
    voltage specifically for this bus. And
  • 41:22 - 41:27
    this is kind of stuff which makes
    designing hardware for high speed really
  • 41:27 - 41:32
    hard because you have to care about
    coherence of the parallel buses on your,
  • 41:32 - 41:39
    on your system. So at the same time we do
    want to keep 1.8 volts for everything else
  • 41:39 - 41:43
    as much as possible. Because another
    problem we are facing with this device is
  • 41:43 - 41:47
    that by the standard mini PCI Express
    allows only like ...
  • 41:47 - 41:51
    Sergey Kostanbaev: ... 2.5 ...
    Alexander Chemeris: ... 2.5 watts of power
  • 41:51 - 41:58
    consumption, no more. And we were
    very lucky that the LMS7 has such
  • 41:58 - 42:04
    good power consumption.
    We actually had some extra
  • 42:04 - 42:10
    space to have FPGA and GPS and all this
    stuff. But we just can't let the power
  • 42:10 - 42:15
    consumption go up. Our measurements on
    this device showed about ...
  • 42:15 - 42:19
    Sergey Kostanbaev: ... 2.3 ...
    Alexander Chemeris: ... 2.3 watts of power
  • 42:19 - 42:27
    consumption. So we are like at the limit
    at this point. So when we fix the bus with
  • 42:27 - 42:31
    the higher voltage, you know it's a
    theoretical exercise, because we haven't
  • 42:31 - 42:38
    done this yet; that's planned to happen in
    a couple of months. We should be able to get
  • 42:38 - 42:47
    to this numbers which was just 1.2 times
    slower. Then the next thing will be to fix
  • 42:47 - 42:56
    another issue which we made at the very
    beginning: we have procured a wrong chip.
  • 42:56 - 43:05
    Just one digit difference, you can see
    it's highlighted in red and green, and
  • 43:05 - 43:13
    this chip supports only generation 1
    PCI Express which is twice slower than
  • 43:13 - 43:18
    generation 2 PCI Express.
    So again, hopefully we'll replace the chip
  • 43:18 - 43:30
    and just get very simple doubling of the
    performance. Still it will be slower than
  • 43:30 - 43:40
    we wanted it to be, and here is where practical
    versus theoretical numbers come in.
  • 43:40 - 43:47
    Well as every bus it has it has overheads
    and one of the things which again we
  • 43:47 - 43:51
    realized when we were implementing this
    is, that even though the standard
  • 43:51 - 43:59
    standardizes a maximum payload size of 4 kB,
    actual implementations are different. For
  • 43:59 - 44:08
    example, desktop computers like Intel Core
    or Intel Atom only use a 128-byte
  • 44:08 - 44:19
    payload. So there is much more overhead
    going on the bus to transfer data and even
  • 44:19 - 44:29
    theoretically you can only achieve 87%
    efficiency. And on Xeon we tested and we
  • 44:29 - 44:37
    found that they're using 256 payload size
    and this can give you like a 92%
  • 44:37 - 44:45
    efficiency on the bus - and this is before
    the other overheads, so reality is even worse.
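Those efficiency figures follow roughly from assuming around 20 bytes of header, framing and CRC overhead per transaction-layer packet:

$$
\frac{128}{128+20}\approx 87\%,\qquad \frac{256}{256+20}\approx 93\%.
$$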
  • 44:45 - 44:53
    An interesting thing which we also
    did not expect is that we originally were
  • 44:53 - 45:03
    developing on Intel Atom and everything
    was working great. When we plug this into
  • 45:03 - 45:11
    laptop like Core i7 multi-core really
    powerful device, we didn't expect that it
  • 45:11 - 45:20
    wouldn't work. Obviously Core i7 should
    work better than Atom: no, not always.
  • 45:20 - 45:26
    The thing is, we were plugging into a
    laptop, which had a built-in video card
  • 45:26 - 45:45
    which was sitting on the same PCI bus and
    probably manufacturer hard-coded the higher
  • 45:45 - 45:51
    priority for the video card than for
    everything else in the system, because I
  • 45:51 - 45:56
    don't want your screen to flicker.
    And so when you move a window you actually
  • 45:56 - 46:04
    see the late packets coming to your PCI
    device. We had to introduce a jitter
  • 46:04 - 46:15
    buffer and add more FIFO into the device
    to smooth it out. On the other hand the
  • 46:15 - 46:20
    Xeon is performing really well. So it's
    very optimized. That said, we have tested
  • 46:20 - 46:28
    it with a discrete graphics card, and it outperforms
    everything by a whopping five to seven percent.
  • 46:28 - 46:39
    That's what you get for the price. So this
    is actually the end of the presentation.
  • 46:39 - 46:44
    We still have not scheduled any workshop,
    but if there is any interest in
  • 46:44 - 46:53
    actually seeing the device working or if
    you interested in learning more about the
  • 46:53 - 46:58
    PCI Express in detail, let us know and we'll
    schedule something in the next few days.
  • 46:58 - 47:05
    That's the end, I think we can proceed
    with questions if there are any.
  • 47:05 - 47:15
    [Applause]
    Herald: Okay, thank you very much. If you
  • 47:15 - 47:18
    are leaving now: please try to leave
    quietly because we might have some
  • 47:18 - 47:23
    questions and you want to hear them. If
    you have questions please line up right
  • 47:23 - 47:29
    behind the microphones and I think we'll
    just wait because we don't have anything
  • 47:29 - 47:35
    from the signal angel. However, if you are
    watching on stream you can hop into the
  • 47:35 - 47:40
    channels and over social media to ask
    questions and they will be answered,
  • 47:40 - 47:48
    hopefully. So on that microphone.
    Question 1: What's the minimum and maximum
  • 47:48 - 47:52
    frequency of the card?
    Alexander Chemeris: You mean RF
  • 47:52 - 47:56
    frequency?
    Question 1: No, the minimum frequency you
  • 47:56 - 48:06
    can sample at. Most SDR devices can
    only sample at over 50 MHz. Is there a
  • 48:06 - 48:09
    similar limitation at your card?
    Alexander Chemeris: Yeah, so if you're
  • 48:09 - 48:16
    talking about RF frequency it can go
    from like almost zero even though that
  • 48:16 - 48:27
    works worse below 50MHz and all the way to
    3.8GHz if I remember correctly. And in
  • 48:27 - 48:35
    terms of the sample rate right now it
    works from like about 2 MSPS and to about
  • 48:35 - 48:40
    50 right now. But again, we're planning to
    get it to these numbers we quoted.
  • 48:40 - 48:46
    Herald: Okay. The microphone over there.
    Question 2: Thanks for your talk. Did you
  • 48:46 - 48:49
    manage to put your Linux kernel driver to
    the main line?
  • 48:49 - 48:54
    Alexander Chemeris: No, not yet. I mean,
    it's not even like fully published. So I
  • 48:54 - 48:59
    did not say in the beginning, sorry for
    this. We only just manufactured the first
  • 48:59 - 49:04
    prototype, which we debugged heavily. So
    we are only planning to manufacture the
  • 49:04 - 49:10
    second prototype with all these fixes and
    then we will release, like, the kernel
  • 49:10 - 49:17
    driver and everything. And maybe we'll try
    or maybe won't try, haven't decided yet.
  • 49:17 - 49:18
    Question 2: Thanks
    Herald: Okay...
  • 49:18 - 49:22
    Alexander Chemeris: and that will be the
    whole other experience.
  • 49:22 - 49:26
    Herald: Okay, over there.
    Question 3: Hey, looks like you went
  • 49:26 - 49:30
    through some incredible amounts of pain to
    make this work. So, I was wondering,
  • 49:30 - 49:35
    aren't there any simulators at least for
    parts of the system, or the PCIe bus for
  • 49:35 - 49:40
    the DMA something? Any simulator so that
    you can actually first design the system
  • 49:40 - 49:45
    there and debug it more easily?
    Sergey Kostanbaev: Yes, there are
  • 49:45 - 49:50
    available simulators, but the problem is
    they are all non-free. So you have to pay
  • 49:50 - 49:57
    for them. So yeah, we chose the hard
    way.
  • 49:57 - 50:00
    Question 3: Okay thanks.
    Herald: We have a question from the signal
  • 50:00 - 50:03
    angel.
    Question 4: Yeah are the FPGA codes, Linux
  • 50:03 - 50:08
    driver, and library code, and the design
    project files public and if so, did they
  • 50:08 - 50:13
    post them yet? They can't find them on
    xtrx.io.
  • 50:13 - 50:18
    Alexander Chemeris: Yeah, so they're not
    published yet. As I said, we haven't
  • 50:18 - 50:25
    released them. So, the drivers and
    libraries will definitely be available,
  • 50:25 - 50:29
    FPGA code... We are considering this
    probably also will be available in open
  • 50:29 - 50:36
    source. But we will publish them together
    with the public announcement of the
  • 50:36 - 50:42
    device.
    Herald: Ok, that microphone.
  • 50:42 - 50:46
    Question 5: Yes. Did you guys see any
    signal integrity issues between on the PCI
  • 50:46 - 50:50
    bus, or on the bus to the LMS chip, the
    Lime Micro chip, I think, that is doing
  • 50:50 - 50:51
    the RF ?
    AC: Right.
  • 50:51 - 50:56
    Question 5: Did you try to measure signal
    integrity issues, or... because there were
  • 50:56 - 51:01
    some reliability issues, right?
    AC: Yeah, we actually... so, PCI. With PCI
  • 51:01 - 51:03
    we never had issues, if I remember
    correctly.
  • 51:03 - 51:05
    SK: No.
    AC: I just... it was just working.
  • 51:05 - 51:11
    SK: Well, the board is so small, and when
    there are small traces there's no problem
  • 51:11 - 51:15
    in signal integrity. So it's actually
    saved us.
  • 51:15 - 51:21
    AC: Yeah. Designing a small board is easier.
    Yeah, with the LMS 7, the problem is not
  • 51:21 - 51:26
    the signal integrity in terms of
    difference in the length of the traces,
  • 51:26 - 51:37
    but rather the fact that the signal
    degrades - in terms of voltage, more so at speed -
  • 51:37 - 51:44
    and drops below the
    detection level, and all this stuff. We
  • 51:44 - 51:47
    did some measurements. I actually wanted
    to add some pictures here, but decided
  • 51:47 - 51:54
    that's not going to be super interesting.
    H: Okay. Microphone over there.
  • 51:54 - 51:58
    Question 6: Yes. Thanks for the talk. How
    much work would it be to convert the two
  • 51:58 - 52:06
    by two SDR into an 8-input logic analyzer
    in terms of hard- and software? So, if you
  • 52:06 - 52:12
    have a really fast logic analyzer, where
    you can record unlimited traces with?
  • 52:12 - 52:19
    AC: A logic analyzer...
    Q6: So basically it's just also an analog
  • 52:19 - 52:27
    digital converter and you largely want
    fast sampling and a large amount of memory
  • 52:27 - 52:31
    to store the traces.
    AC: Well, I just think it's not the best
  • 52:31 - 52:40
    use for it. It's probably... I don't know.
    Maybe Sergey has any ideas, but I think it
  • 52:40 - 52:48
    just may be easier to get high-speed ADC
    and replace the Lime chip with a high-
  • 52:48 - 52:57
    speed ADC to get what you want, because
    the Lime chip has so many things there
  • 52:57 - 53:01
    specifically for RF.
    SK: Yeah, the main problem you cannot just
  • 53:01 - 53:09
    sample the original data. You have to shift it
    in frequency, so you cannot sample the
  • 53:09 - 53:17
    original signal, and using it for
    something else except spectrum analyzing
  • 53:17 - 53:21
    is hard.
    Q6: OK. Thanks.
  • 53:21 - 53:26
    H: OK. Another question from the internet.
    Signal angel: Yes. Have you compared the
  • 53:26 - 53:32
    sample rate of the ADC of the Lime chip
    to the USRP ADCs, and if so, how does the
  • 53:32 - 53:40
    lower sample rate affect the performance?
    AC: So, comparing low sample rate to
  • 53:40 - 53:49
    higher sample rate. We haven't done much
    testing on the RF performance yet, because
  • 53:49 - 53:58
    we were so busy with all this stuff, so we
    are yet to see how low sample rates
    compare to high sample
    versus sample rates versus high sample
    rate. Well, high sample rate always gives
  • 54:03 - 54:10
    you better performance, but you also get
    higher power consumption. So, I guess it's
  • 54:10 - 54:14
    the question of what's more more important
    for you.
  • 54:14 - 54:20
    H: Okay. Over there.
    Question 7: I've gathered there is no
  • 54:20 - 54:25
    mixer bypass, so you can't directly sample
    the signal. Is there a way to use the same
  • 54:25 - 54:32
    antenna for send and receive, yet.
    AC: Actually, there is... an input for the ADC.
  • 54:32 - 54:38
    SK: But it's not a bypass, it's a
    dedicated pin on LMS chip, and since we're
  • 54:38 - 54:46
    very space-constrained, we didn't route
    it, so you cannot actually bypass it.
  • 54:46 - 54:50
    AC: Okay, that's in our specific hardware. In
    general, in the LMS chip there is a
  • 54:50 - 54:58
    special pin which allows you to drive your
    signal directly to ADC without all the
  • 54:58 - 55:03
    mixers, filters, all this radio stuff,
    just directly to ADC. So, yes,
  • 55:03 - 55:07
    theoretically that's possible.
    SK: We even thought about this, but it
  • 55:07 - 55:11
    doesn't fit this design.
    Q7: Okay. And can I share antennas,
  • 55:11 - 55:16
    because I have an existing laptop with
    existing antennas, but I would use the
  • 55:16 - 55:22
    same antenna to send and receive.
    AC: Yeah, so, I mean, that's... depends on
  • 55:22 - 55:26
    what exactly do you want to do. If you
    want a TDD system, then yes; if you
  • 55:26 - 55:31
    want an FDD system, then you will have to
    put a small duplexer in there, but yeah,
  • 55:31 - 55:35
    that's the idea. So you can plug this into
    your laptop and use your existing
  • 55:35 - 55:40
    antennas. That's one of the ideas of how
    to use xtrx.
  • 55:40 - 55:42
    Q7: Yeah, because there's all four
    connectors.
  • 55:42 - 55:45
    AC: Yeah. One thing which I actually
    forgot to mention is - I kind of mentioned
  • 55:45 - 55:54
    in the slides - that any other SDRs
    which are based on Ethernet or on USB
  • 55:54 - 56:02
    can't work with CSMA wireless systems,
    and the most famous CSMA system is Wi-Fi.
  • 56:02 - 56:09
    So, it turns out that because of the
    latency between your operating system and
  • 56:09 - 56:18
    your radio on USB, you just can't react
    fast enough for Wi-Fi to work, because you
  • 56:18 - 56:23
    - probably you know that - in Wi-Fi you do
    carrier sense, and if you sense that the
  • 56:23 - 56:30
    spectrum is free, you start transmitting.
    That doesn't make sense when you have huge
  • 56:30 - 56:36
    latency, because all you know is that
    the spectrum was free back then, so,
  • 56:36 - 56:44
    with xtrx, you actually can work with CSMA
    systems like Wi-Fi, so again it makes it
  • 56:44 - 56:51
    possible to have a fully software
    implementation of Wi-Fi in your laptop. It
  • 56:51 - 56:59
    obviously won't work as well as your
    commercial Wi-Fi, because you will have to
  • 56:59 - 57:04
    do a lot of processing on your CPU, but
    for some purposes like experimentation,
  • 57:04 - 57:08
    for example, for wireless labs and R&D
    labs, that's really valuable.
  • 57:08 - 57:11
    Q7: Thanks.
    H: Okay. Over there.
  • 57:11 - 57:16
    Q8: Okay. What PCB design package did you
    use?
  • 57:16 - 57:18
    AC: Altium.
    SK: Altium, yeah.
  • 57:18 - 57:23
    Q8: And I'd be interested in the PCIe
    workshop. Would be really great if you do
  • 57:23 - 57:25
    this one.
    AC: Say this again?
  • 57:25 - 57:28
    Q8: Would be really great if you do the
    PCI Express workshop.
  • 57:28 - 57:33
    AC: Ah. PCI Express workshop. Okay. Thank
    you.
  • 57:33 - 57:37
    H: Okay, I think we have one more question
    from the microphones, and that's you.
  • 57:37 - 57:43
    Q9: Okay. Great talk. And again, I would
    appreciate a PCI Express workshop, if it
  • 57:43 - 57:47
    ever happens. What are these
    synchronization options between multiple
  • 57:47 - 57:55
    cards. Can you synchronize the ADC clock,
    and can you synchronize the presumably
  • 57:55 - 58:05
    digitally created IF?
    SK: Yes, so... unfortunately, pure IF synchronization is
  • 58:05 - 58:10
    not possible, because the Lime chip doesn't
    expose the LO frequency. But we can
  • 58:10 - 58:16
    synchronize digitally. So, we have special
    one PPS signal synchronization. We have
  • 58:16 - 58:25
    lines for clock synchronization and other
    stuff. We can do it in software. So the
  • 58:25 - 58:32
    Lime chip has phase correction register,
    so when you measure... if there is a phase
  • 58:32 - 58:35
    difference, so you can compensate it on
    different boards.
  • 58:35 - 58:39
    Q9: Tune to a station a long way away and
    then rotate the phase until it aligns.
  • 58:39 - 58:42
    SK: Yeah.
    Q9: Thank you.
  • 58:42 - 58:46
    AC: Little tricky, but possible. So,
    that's one of our plans for future,
  • 58:46 - 58:53
    because we do want to see, like 128 by 128
    MIMO at home.
  • 58:53 - 58:56
    H: Okay, we have another question from the
    internet.
  • 58:56 - 59:00
    Signal angel: I actually have two
    questions. The first one is: What is the
  • 59:00 - 59:08
    expected price after a prototype stage?
    And the second one is: Can you tell us
  • 59:08 - 59:10
    more about this setup you had for
    debugging the PCIe
  • 59:10 - 59:16
    issues?
    AC: Could you repeat the second question?
  • 59:16 - 59:20
    SK: It's ????????????, I think.
    Signal angel: It's more about the setup you had for
  • 59:20 - 59:24
    debugging the PCIe issues.
    SK: The second question, I think, is more a topic
  • 59:24 - 59:31
    for our next workshop, because it's a
    more complicated setup, so... we mostly
  • 59:31 - 59:36
    removed everything about it from the current
    presentation.
  • 59:36 - 59:40
    AC: Yeah, but in general, and in terms of
    hardware setup, that was our hardware
  • 59:40 - 59:48
    setup, so we bought this PCI Express to
    Thunderbolt3, we bought the laptop which
  • 59:48 - 59:53
    supports Thunderbolt3, and that's how we
    were debugging it. So, we don't need, like
  • 59:53 - 59:58
    a full-fledged PC, we don't have to
    restart it all the time. So, in terms of
  • 59:58 - 60:07
    price, we don't have the fixed price yet.
    So, all I can say right now is that we are
  • 60:07 - 60:18
    targeting no more than your bladeRF or
    HackRF devices, and probably even cheaper.
  • 60:18 - 60:25
    For some versions.
    H: Okay. We are out of time, so thank you
  • 60:25 - 60:45
    again Sergey and Alexander.
    [Applause]
  • 60:45 - 60:50
    [Music]
  • 60:50 - 60:55
    subtitles created by c3subtitles.de
    in the year 20??. Join, and help us!