WEBVTT

00:00:00.000 --> 00:00:09.970
<i>silent 31C3 preroll</i>

00:00:09.970 --> 00:00:13.220
Dr. Gareth Owen: Hello. Can you hear me?
Yes. Okay. So my name is Gareth Owen.

00:00:13.220 --> 00:00:16.150
I’m from the University of Portsmouth.
I’m an academic

00:00:16.150 --> 00:00:19.320
and I’m going to talk to you about
an experiment that we did

00:00:19.320 --> 00:00:22.610
on the Tor hidden services,
trying to categorize them,

00:00:22.610 --> 00:00:25.230
estimate how many they were etc. etc.

00:00:25.230 --> 00:00:27.380
Well, as we go through the talk
I’m going to explain

00:00:27.380 --> 00:00:31.120
how Tor hidden services work internally,
and how the data was collected.

00:00:31.120 --> 00:00:35.320
So what sort of conclusions you can draw
from the data based on the way that we’ve

00:00:35.320 --> 00:00:39.950
collected it. Just so [that] I get
an idea: how many of you use Tor

00:00:39.950 --> 00:00:42.430
on a regular basis, could you
put your hand up for me?

00:00:42.430 --> 00:00:46.120
So quite a big number. Keep your hand
up if… or put your hand up if you’re

00:00:46.120 --> 00:00:48.320
a relay operator.

00:00:48.320 --> 00:00:51.470
Wow, that’s quite a significant number,
isn’t it? And then, put your hand up

00:00:51.470 --> 00:00:55.250
and/or keep it up if you
run a hidden service.

00:00:55.250 --> 00:00:59.530
Okay, so, a fewer number, but still
some people run hidden services.

00:00:59.530 --> 00:01:02.720
Okay, so, some of you may be very familiar
with the way Tor works, sort of,

00:01:02.720 --> 00:01:06.700
in a low level. But I am gonna go through
it for those which aren’t, so they understand

00:01:06.700 --> 00:01:10.380
just how they work. And as we go along,
because I’m explaining how

00:01:10.380 --> 00:01:14.030
the hidden services work, I’m going
to tag on information on how

00:01:14.030 --> 00:01:19.030
the Tor hidden services themselves can be
deanonymised and also how the users

00:01:19.030 --> 00:01:23.090
of those hidden services can be
deanonymised, if you put

00:01:23.090 --> 00:01:27.040
some strict criteria on what it is you
want to do with respect to them.

00:01:27.040 --> 00:01:30.920
So the things that I’m going to go over:
I wanna go over how Tor works,

00:01:30.920 --> 00:01:34.190
and then specifically how hidden services
work. I’m gonna talk about something

00:01:34.190 --> 00:01:37.889
called the “Tor Distributed Hash Table”
for hidden services. If you’ve heard

00:01:37.889 --> 00:01:40.560
that term and don’t know what
it means, don’t worry, I’ll explain

00:01:40.560 --> 00:01:44.010
what a distributed hash table is and
how it works. It’s not as complicated

00:01:44.010 --> 00:01:47.690
as it sounds. And then I wanna go over
Darknet data, so, data that we collected

00:01:47.690 --> 00:01:53.030
from Tor hidden services. And as I say,
as we go along I will sort of explain

00:01:53.030 --> 00:01:56.650
how you do deanonymisation of both the
services themselves and of the visitors

00:01:56.650 --> 00:02:02.400
to the service. And just
how complicated it is.

00:02:02.400 --> 00:02:07.370
So you may have seen this slide which
I think was from GCHQ, released last year

00:02:07.370 --> 00:02:12.099
as part of the Snowden leaks where they
said: “You can deanonymise some users

00:02:12.099 --> 00:02:15.560
some of the time but they’ve had
no success in deanonymising someone

00:02:15.560 --> 00:02:20.109
in response to a specific request.”
So, given all of you e.g., I may be able

00:02:20.109 --> 00:02:25.090
to deanonymise a small fraction of you
but I can’t choose precisely one person

00:02:25.090 --> 00:02:27.499
I want to deanonymise. That’s what
I’m gonna be explaining in relation

00:02:27.499 --> 00:02:30.940
to the deanonymisation attacks, how
you can deanonymise a section but

00:02:30.940 --> 00:02:38.629
you can’t necessarily choose which section
of the users that you will be deanonymising.

00:02:38.629 --> 00:02:42.740
Tor drives with just a couple
of different problems. On one part

00:02:42.740 --> 00:02:46.239
it allows you to bypass censorship. So if
you’re in a country like China, which

00:02:46.239 --> 00:02:51.010
blocks some types of traffic you can use
Tor to bypass their censorship blocks.

00:02:51.010 --> 00:02:55.541
It tries to give you privacy, so, at some
level in the network someone can’t see

00:02:55.541 --> 00:02:59.200
what you’re doing. And at another point
in the network people who don’t know

00:02:59.200 --> 00:03:02.540
who you are but may necessarily
be able to see what you’re doing.

00:03:02.540 --> 00:03:07.099
Now the traditional case
for this is to look at VPNs.

00:03:07.099 --> 00:03:10.669
With a VPN you have
sort of a single provider.

00:03:10.669 --> 00:03:14.689
You have lots of users connecting
to the VPN. The VPN has sort of

00:03:14.689 --> 00:03:18.240
a mixing effect from an outside or
a server’s point of view. And then

00:03:18.240 --> 00:03:22.499
out of the VPN you see requests
to Twitter, Wikipedia etc. etc.

00:03:22.499 --> 00:03:26.830
And if that traffic doesn’t encrypt it then
the VPN can also read the contents

00:03:26.830 --> 00:03:30.980
of the traffic. Now of course there is
a fundamental weakness with this.

00:03:30.980 --> 00:03:35.730
If you trust the VPN provider the VPN
provider knows both who you are

00:03:35.730 --> 00:03:39.629
and what you’re doing and can
link those two together with absolute

00:03:39.629 --> 00:03:43.580
certainty. So you don’t… whilst you do
get some of these properties, assuming

00:03:43.580 --> 00:03:48.069
you’ve got a trustworthy VPN provider
you don’t get them in the face of

00:03:48.069 --> 00:03:51.609
an untrustworthy VPN provider.
And of course: how do you trust the VPN

00:03:51.609 --> 00:03:59.319
provider? What sort of measure do
you use? That’s sort of an open question.

00:03:59.319 --> 00:04:03.729
So Tor tries to solve this problem
by distributing the trust. Tor is

00:04:03.729 --> 00:04:07.500
an open source project, so you can go
on to their Git repository, you can

00:04:07.500 --> 00:04:12.620
download the source code, and change it,
improve it, submit patches etc.

00:04:12.620 --> 00:04:17.108
As you heard earlier, during Jacob and
Roger’s talk they’re currently partly

00:04:17.108 --> 00:04:20.949
sponsored by the US Government which seems
a bit paradoxical, but they explained

00:04:20.949 --> 00:04:24.770
in that talk many of the… that
doesn’t affect like judgment.

00:04:24.770 --> 00:04:28.540
And indeed, they do have some funding from
other sources, and they design that system

00:04:28.540 --> 00:04:30.841
– which I’ll talk about a little bit
later – in a way where they don’t have

00:04:30.841 --> 00:04:34.230
to trust each other. So there’s sort of
some redundancy, and they’re trying

00:04:34.230 --> 00:04:39.650
to minimize these sort of trust issues
related to this. Now, Tor is

00:04:39.650 --> 00:04:43.310
a partially de-centralized network, which
means that it has some centralized

00:04:43.310 --> 00:04:47.870
components which are under the control of
the Tor Project and some de-centralized

00:04:47.870 --> 00:04:51.190
components which are normally the Tor
relays. If you run a relay you’re

00:04:51.190 --> 00:04:56.290
one of those de-centralized components.
There is, however, no single authority

00:04:56.290 --> 00:05:01.110
on the Tor network.
So no single server which is responsible,

00:05:01.110 --> 00:05:04.290
which you’re required to trust.
So the trust is somewhat distributed,

00:05:04.290 --> 00:05:12.000
but not entirely. When you establish
a circuit through Tor you, the user,

00:05:12.000 --> 00:05:15.500
download a list of all of the relays
inside the Tor network.

00:05:15.500 --> 00:05:19.070
And you get to pick – and I’ll tell you
how you do that – which relays

00:05:19.070 --> 00:05:22.750
you’re going to use to route your traffic
through. So here is a typical example:

00:05:22.750 --> 00:05:27.090
You’re here on the left hand side as the
user. You download a list of the relays

00:05:27.090 --> 00:05:32.010
inside the Tor network and you select from
that list three nodes, a guard node

00:05:32.010 --> 00:05:36.580
which is your entry into the Tor network,
a relay node which is a middle node.

00:05:36.580 --> 00:05:39.010
Essentially, it’s going to route your
traffic to a third hop. And then

00:05:39.010 --> 00:05:42.650
the third hop is the exit node where
your traffic essentially exits out

00:05:42.650 --> 00:05:46.840
on the internet. Now, looking at the
circuit. So this is a circuit through

00:05:46.840 --> 00:05:50.170
the Tor network through which you’re
going to route your traffic. There are

00:05:50.170 --> 00:05:52.540
three layers of encryption at the
beginning, so between you

00:05:52.540 --> 00:05:56.150
and the guard node. Your traffic
is encrypted three times.

00:05:56.150 --> 00:05:59.330
In the first instance encrypted to the
guard, and the it’s encrypted again,

00:05:59.330 --> 00:06:03.180
through the relay, and then encrypted
again to the exit, and as the traffic moves

00:06:03.180 --> 00:06:08.710
through the Tor network each of those
layers of encryption are unpeeled

00:06:08.710 --> 00:06:17.300
from the data. The Guard here in this case
knows who you are, and the exit relay

00:06:17.300 --> 00:06:21.590
knows what you’re doing but neither know
both. And the middle relay doesn’t really

00:06:21.590 --> 00:06:26.710
know a lot, except for which relay is
her guard and which relay is her exit.

00:06:26.710 --> 00:06:31.870
Who runs an exit relay? So if you run
an exit relay all of the traffic which

00:06:31.870 --> 00:06:36.210
users are sending out on the internet they
appear to come from your IP address.

00:06:36.210 --> 00:06:41.360
So running an exit relay is potentially
risky because someone may do something

00:06:41.360 --> 00:06:45.590
through your relay which attracts attention.
And then, when law enforcement

00:06:45.590 --> 00:06:48.940
traced that back to an IP address it’s
going to come back to your address.

00:06:48.940 --> 00:06:51.790
So some relay operators have had trouble
with this, with law enforcement coming

00:06:51.790 --> 00:06:55.360
to them, and saying: “Hey we got this
traffic coming through your IP address

00:06:55.360 --> 00:06:57.950
and you have to go and explain it.”
So if you want to run an exit relay

00:06:57.950 --> 00:07:01.400
it’s a little bit risky, but we’re thankful
for those people that do run exit relays

00:07:01.400 --> 00:07:04.870
because ultimately if people didn’t run
an exit relay you wouldn’t be able

00:07:04.870 --> 00:07:08.000
to get out of the Tor network, and it
wouldn’t be terribly useful from this

00:07:08.000 --> 00:07:20.560
point of view. So, yes.
<i>applause</i>

00:07:20.560 --> 00:07:24.610
So every Tor relay, when you set up
a Tor relay you publish something called

00:07:24.610 --> 00:07:28.780
a descriptor which describes your Tor
relay and how to use it to a set

00:07:28.780 --> 00:07:33.430
of servers called the authorities. And the
trust in the Tor network is essentially

00:07:33.430 --> 00:07:38.610
split across these authorities. They’re run
by the core Tor Project members.

00:07:38.610 --> 00:07:42.639
And they maintain a list of all of the
relays in the network. And they observe

00:07:42.639 --> 00:07:46.010
them over a period of time. If the relays
exhibit certain properties they give

00:07:46.010 --> 00:07:50.480
the relays flags. If e.g. a relay allows
traffic to exit from the Tor network

00:07:50.480 --> 00:07:54.450
it will get the ‘Exit’ flag. If they’d been
switched on for a certain period of time,

00:07:54.450 --> 00:07:58.400
or for a certain amount of traffic they’ll
be allowed to become the guard relay

00:07:58.400 --> 00:08:02.180
which is the first node in your circuit.
So when you build your circuit you

00:08:02.180 --> 00:08:07.230
download a list of these descriptors from
one of the Directory Authorities. You look

00:08:07.230 --> 00:08:10.120
at the flags which have been assigned to
each of the relays, and then you pick

00:08:10.120 --> 00:08:14.150
your route based on that. So you’ll pick
the guard node from a set of relays

00:08:14.150 --> 00:08:16.400
which have the ‘Guard’ flag, your exits
from the set of relays which have

00:08:16.400 --> 00:08:20.860
the ‘Exit’ flag etc. etc. Now, as of
a quick count this morning there are

00:08:20.860 --> 00:08:29.229
about 1500 guard relays, around 1000 exit
relays, and six relays flagged as ‘bad’ exits.

00:08:29.229 --> 00:08:34.360
What does a ‘bad exit’ mean?
<i>waits for audience to respond</i>

00:08:34.360 --> 00:08:37.759
That’s not good! That’s exactly
what it means! Yes! <i>laughs</i>

00:08:37.759 --> 00:08:40.450
<i>applause</i>

00:08:40.450 --> 00:08:45.569
So relays which have been flagged as ‘bad
exits’ your client will never chose to exit

00:08:45.569 --> 00:08:50.660
traffic through. And examples of things
which may get a relay flagged as an

00:08:50.660 --> 00:08:53.829
[bad] exit relay – if they’re fiddling with
the traffic which is coming out of

00:08:53.829 --> 00:08:57.019
the Tor relay. Or doing things like
man-in-the-middle attacks against

00:08:57.019 --> 00:09:01.629
SSL traffic. We’ve seen various things,
there have been relays man-in-the-middling

00:09:01.629 --> 00:09:07.050
SSL traffic, there have very, very recently
been an exit relay which was patching

00:09:07.050 --> 00:09:10.800
binaries that you downloaded from the
internet, inserting malware into the binaries.

00:09:10.800 --> 00:09:14.630
So you can do these things but the Tor
Project tries to scan for them. And if

00:09:14.630 --> 00:09:19.829
these things are detected then they’ll be
flagged as ‘Bad Exits’. It’s true to say

00:09:19.829 --> 00:09:24.610
that the scanning mechanism is not 100%
fool-proof by any stretch of the imagination.

00:09:24.610 --> 00:09:28.559
It tries to pick up common types
of attacks, so as a result

00:09:28.559 --> 00:09:32.480
it won’t pick up unknown attacks or
attacks which haven’t been seen or

00:09:32.480 --> 00:09:36.680
have not been known about beforehand.

00:09:36.680 --> 00:09:45.370
So looking at this, how do you deanonymise
the traffic travelling through the Tor

00:09:45.370 --> 00:09:49.449
networks? Given some traffic coming out
of the exit relay, how do you know

00:09:49.449 --> 00:09:54.269
which user that corresponds to? What is
their IP address? You can’t actually

00:09:54.269 --> 00:09:58.279
modify the traffic because if any of the
relays tried to modify the traffic

00:09:58.279 --> 00:10:02.249
which they’re sending through the network
Tor will tear down the circuit through the relay.

00:10:02.249 --> 00:10:06.290
So there’s these integrity checks, each
of the hops. And if you try to sort of

00:10:06.290 --> 00:10:09.870
– because you can’t decrypt the packet
you can’t modify it in any meaningful way,

00:10:09.870 --> 00:10:13.749
and because there’s an integrity check
at the next hop that means that you can’t

00:10:13.749 --> 00:10:17.019
modify the packet because otherwise it’s
detected. So you can’t do this sort of

00:10:17.019 --> 00:10:20.900
marker, and try and follow the marker
through the network. So instead

00:10:20.900 --> 00:10:26.699
what you can do if you control… so let me
give you two cases. In the worst case

00:10:26.699 --> 00:10:31.330
if the attacker controls all three of your
relays that you pick, which is an unlikely

00:10:31.330 --> 00:10:34.739
scenario that needs to control quite
a big proportion of the network. Then

00:10:34.739 --> 00:10:39.550
it should be quite obvious that they can
work out who you are and also

00:10:39.550 --> 00:10:42.369
see what you’re doing because in that
case they can tag the traffic, and

00:10:42.369 --> 00:10:45.709
they can just discard these integrity
checks at each of the following hops.

00:10:45.709 --> 00:10:50.709
Now in a different case, if you control
the Guard relay and the exit relay

00:10:50.709 --> 00:10:54.160
but not the middle relay the Guard relay
can’t tamper with the traffic because

00:10:54.160 --> 00:10:57.660
this middle relay will close down the
circuit as soon as it happens.

00:10:57.660 --> 00:11:01.130
The exit relay can’t send stuff back down
the circuit to try and identify the user,

00:11:01.130 --> 00:11:05.030
either. Because again, the circuit will be
closed down. So what can you do?

00:11:05.030 --> 00:11:09.869
Well, you can count the number of packets
going through the Guard node. And you can

00:11:09.869 --> 00:11:14.690
measure the timing differences between
packets, and try and spot that pattern

00:11:14.690 --> 00:11:18.750
at the Exit relays. You’re looking at counts of
packets and the timing between those

00:11:18.750 --> 00:11:22.360
packets which are being sent, and
essentially trying to correlate them all.

00:11:22.360 --> 00:11:26.869
So if your user happens to pick you as
your Guard node, and then happens to pick

00:11:26.869 --> 00:11:31.850
your exit relay, then you can deanonymise
them with very high probability using

00:11:31.850 --> 00:11:35.649
this technique. You’re just correlating
the timings of packets and counting

00:11:35.649 --> 00:11:38.889
the number of packets going through.
And the attacks demonstrated in literature

00:11:38.889 --> 00:11:44.509
are very reliable for this. We heard
earlier from the Tor talk about the “relay

00:11:44.509 --> 00:11:50.739
early” tag which was the attack discovered
by the cert researches in the US.

00:11:50.739 --> 00:11:55.050
That attack didn’t rely on timing attacks.
Instead, what they were able to do was

00:11:55.050 --> 00:11:58.720
send a special type of cell containing
the data back down the circuit,

00:11:58.720 --> 00:12:01.889
essentially marking this data, and saying:
“This is the data we’re seeing

00:12:01.889 --> 00:12:06.149
at the Exit relay, or at the hidden
service", and encode into the messages

00:12:06.149 --> 00:12:10.049
travelling back down the circuit, what the
data was. And then you could pick

00:12:10.049 --> 00:12:14.269
those up at the Guard relay and say, okay,
whether it’s this person that’s doing that.

00:12:14.269 --> 00:12:18.370
In fact, although this technique works,
and yeah it was a very nice attack,

00:12:18.370 --> 00:12:21.269
the traffic correlation attacks are
actually just as powerful.

00:12:21.269 --> 00:12:25.259
So although this bug has been fixed traffic
correlation attacks still work and are

00:12:25.259 --> 00:12:29.739
still fairly, fairly reliable. So the problem
still does exist. This is very much

00:12:29.739 --> 00:12:33.399
an open question. How do we solve this
problem? We don’t know, currently,

00:12:33.399 --> 00:12:40.040
how to solve this problem of trying
to tackle the traffic correlation.

00:12:40.040 --> 00:12:45.369
There are a couple of solutions.
But they’re not particularly…

00:12:45.369 --> 00:12:48.569
they’re not particularly reliable. Let me
just go through these, and I’ll skip back

00:12:48.569 --> 00:12:53.061
on the few things I’ve missed. The first
thing is, high-latency networks, so

00:12:53.061 --> 00:12:56.999
networks where packets are delayed
in their transit through the network.

00:12:56.999 --> 00:13:00.740
That throws away a lot of the timing
information. So they promise

00:13:00.740 --> 00:13:03.800
to potentially solve this problem.
But of course, if you want to visit

00:13:03.800 --> 00:13:06.779
Google’s home page, and you have to wait
five minutes for it, you’re simply

00:13:06.779 --> 00:13:11.910
just not going to use Tor. The whole point
is trying to make this technology usable.

00:13:11.910 --> 00:13:14.759
And if you got something which is very,
very slow then it doesn’t make it

00:13:14.759 --> 00:13:18.269
attractive to use. But of course,
this case does work slightly better

00:13:18.269 --> 00:13:22.059
for e-mail. If you think about it with
e-mail, you don’t mind if you’re e-mail

00:13:22.059 --> 00:13:25.399
– well, you may not mind, you may mind –
you don’t mind if your e-mail is delayed

00:13:25.399 --> 00:13:29.120
by some period of time. Which makes this
somewhat difficult. And as Roger said

00:13:29.120 --> 00:13:35.130
earlier, you can also introduce padding
into the circuit, so these are dummy cells.

00:13:35.130 --> 00:13:39.839
But, but… with a big caveat: some of the
research suggests that actually you’d

00:13:39.839 --> 00:13:43.439
need to introduce quite a lot of padding
to defeat these attacks, and that would

00:13:43.439 --> 00:13:47.179
overload the Tor network in its current
state. So, again, not a particular

00:13:47.179 --> 00:13:53.860
practical solution.

00:13:53.860 --> 00:13:58.279
How does Tor try to solve this problem?
Well, Tor makes it very difficult

00:13:58.279 --> 00:14:03.171
to become a users Guard relay. If you
can’t become a users Guard relay

00:14:03.171 --> 00:14:07.839
then you don’t know who the user is, quite
simply. And so by making it very hard

00:14:07.839 --> 00:14:13.249
to become the Guard relay therefore you
can’t do this traffic correlation attack.

00:14:13.249 --> 00:14:17.579
So at the moment the Tor client chooses
one Guard relay and keeps it for a period

00:14:17.579 --> 00:14:22.259
of time. So if I want to sort of target
just one of you I would need to control

00:14:22.259 --> 00:14:26.259
the Guard relay that you were using at
that particular point in time. And in fact

00:14:26.259 --> 00:14:30.679
I’d also need to know what that Guard
relay is. So by making it very unlikely

00:14:30.679 --> 00:14:34.129
that you would select a particular malicious
Guard relay, where the number of malicious

00:14:34.129 --> 00:14:39.179
Guard relays is very small, that’s how Tor
tries to solve this problem. And

00:14:39.179 --> 00:14:43.280
at the moment your Guard relay is your
barrier of security. If the attacker can’t

00:14:43.280 --> 00:14:46.460
control the Guard relay then they won’t
know who you are. That doesn’t mean

00:14:46.460 --> 00:14:50.639
they can’t try other sort of side channel
attacks by messing with the traffic

00:14:50.639 --> 00:14:55.129
at the Exit relay etc. You know that you
may sort of e.g. download dodgy documents

00:14:55.129 --> 00:14:59.499
and open one on your computer, and those
sort of things. Now the alternative

00:14:59.499 --> 00:15:02.769
of course to having a Guard relay
and keeping it for a very long time

00:15:02.769 --> 00:15:06.029
will be to have a Guard relay and
to change it on a regular basis.

00:15:06.029 --> 00:15:09.929
Because you might think, well, just choosing
one Guard relay and sticking with it

00:15:09.929 --> 00:15:13.399
is probably a bad idea. But actually,
that’s not the case. If you pick

00:15:13.399 --> 00:15:18.370
the Guard relay, and assuming that the
chance of picking a Guard relay that is

00:15:18.370 --> 00:15:22.800
malicious is very low, then, when you
first use your Guard relay, if you got

00:15:22.800 --> 00:15:27.420
a good choice, then your traffic is safe.
If you haven’t got a good choice then

00:15:27.420 --> 00:15:31.759
your traffic isn’t safe. Whereas if your
Tor client chooses a Guard relay

00:15:31.759 --> 00:15:35.610
every few minutes, or every hour, or
something on those lines at some point

00:15:35.610 --> 00:15:39.179
you’re gonna pick a malicious Guard relay.
So they’re gonna have some of your traffic

00:15:39.179 --> 00:15:43.399
but not all of it. And so currently the
trade-off is that we make it very difficult

00:15:43.399 --> 00:15:48.490
for an attacker to control a Guard relay
and the user picks a Guard relay and

00:15:48.490 --> 00:15:52.449
keeps it for a long period of time. And
so it’s very difficult for the attackers

00:15:52.449 --> 00:15:58.939
to pick that Guard relay when they control
a very small proportion of the network.

00:15:58.939 --> 00:16:06.420
So this, currently, provides those
properties I described earlier, the privacy

00:16:06.420 --> 00:16:11.410
and the anonymity when you’re browsing the
web, when you’re accessing websites etc.

00:16:11.410 --> 00:16:16.519
But still you know who the website is. So
although you’re anonymous and the website

00:16:16.519 --> 00:16:20.730
doesn’t know who you are you know who the
website is. And there may be some cases

00:16:20.730 --> 00:16:25.499
where e.g. the website would also wish to
remain anonymous. You want the person

00:16:25.499 --> 00:16:29.970
accessing the website and the website
itself to be anonymous to each other.

00:16:29.970 --> 00:16:34.230
And you could think about people e.g.
being in countries where running

00:16:34.230 --> 00:16:39.730
a political blog e.g. might be a dangerous
activity. If you run that on a regular

00:16:39.730 --> 00:16:45.660
webserver you’re easily identified whereas,
if you got some way where you as

00:16:45.660 --> 00:16:49.490
the webserver can be anonymous then
that allows you to do that activity without

00:16:49.490 --> 00:16:57.480
being targeted by your government. So
this is what hidden services try to solve.

00:16:57.480 --> 00:17:03.080
Now when you first think about a problem
you kind of think: “Hang on a second,

00:17:03.080 --> 00:17:06.429
the user doesn’t know who the website
is and the website doesn’t know

00:17:06.429 --> 00:17:09.890
who the user is. So how on earth do they
talk to each other?” Well, that’s essentially

00:17:09.890 --> 00:17:14.220
what the Tor hidden service protocol tries
to sort of set up. How do you identify and

00:17:14.220 --> 00:17:19.579
connect to each other. So at the moment
this is what happens: We’ve got Bob

00:17:19.579 --> 00:17:23.780
on the [right] hand side who is the hidden
service. And we got Alice on the left hand

00:17:23.780 --> 00:17:28.620
side here who is the user who wishes to
visit the hidden service. Now when Bob

00:17:28.620 --> 00:17:34.190
sets up his hidden service he picks three
nodes in the Tor network as introduction

00:17:34.190 --> 00:17:38.831
points and builds several hop circuits to
them. So the introduction points don’t know

00:17:38.831 --> 00:17:44.680
who Bob is. Bob has circuits to them. And
Bob says to each of these introduction points

00:17:44.680 --> 00:17:48.240
“Will you relay traffic to me if someone
connects to you asking for me?”

00:17:48.240 --> 00:17:53.030
And then those introduction points
do that. So then, once Bob has picked

00:17:53.030 --> 00:17:56.840
his introduction points he publishes
a descriptor describing the list of his

00:17:56.840 --> 00:18:01.310
introduction points for someone who wishes
to come onto his websites. And then Alice

00:18:01.310 --> 00:18:06.700
on the left hand side wishing to visit Bob
will pick a rendezvous point in the network

00:18:06.700 --> 00:18:10.030
and build a circuit to it. So this “RP”
here is the rendezvous point.

00:18:10.030 --> 00:18:14.530
And she will relay a message via one of
the introduction points saying to Bob:

00:18:14.530 --> 00:18:18.290
“Meet me at the rendezvous point”.
And then Bob will build a 3-hop-circuit

00:18:18.290 --> 00:18:22.870
to the rendezvous point. So now at this
stage we got Alice with a multi-hop circuit

00:18:22.870 --> 00:18:26.890
to the rendezvous point, and Bob with
a multi-hop circuit to the rendezvous point.

00:18:26.890 --> 00:18:32.550
Alice and Bob haven’t connected to one
another directly. The rendezvous point

00:18:32.550 --> 00:18:36.530
doesn’t know who Bob is, the rendezvous
point doesn’t know who Alice is.

00:18:36.530 --> 00:18:40.261
All they’re doing is forwarding the
traffic. And they can’t inspect the traffic,

00:18:40.261 --> 00:18:43.740
either, because the traffic itself
is encrypted.

00:18:43.740 --> 00:18:47.530
So that’s currently how you solve this
problem with trying to communicate

00:18:47.530 --> 00:18:50.820
with someone who you don’t know
who they are and vice versa.

00:18:50.820 --> 00:18:55.740
<i>drinks from the bottle</i>

00:18:55.740 --> 00:18:58.870
The principle thing I’m going to talk
about today is this database.

00:18:58.870 --> 00:19:01.990
So I said, Bob, when he picks his
introduction points he builds this thing

00:19:01.990 --> 00:19:06.080
called a descriptor, describing who his
introduction points are, and he publishes

00:19:06.080 --> 00:19:10.390
them to a database. This database itself
is distributed throughout the Tor network.

00:19:10.390 --> 00:19:17.860
It’s not a single server. So both, Bob and
Alice need to be able to publish information

00:19:17.860 --> 00:19:22.040
to this database, and also retrieve
information from this database. And Tor

00:19:22.040 --> 00:19:24.820
currently uses something called
a distributed hash table, which I’m gonna

00:19:24.820 --> 00:19:27.930
give an example of what this means and
how it works. And then I’ll talk to you

00:19:27.930 --> 00:19:34.380
specifically how the Tor Distributed Hash
Table works itself. So let’s say e.g.

00:19:34.380 --> 00:19:39.830
you've got a set of servers. So here we've
got 26 servers and you’d like to store

00:19:39.830 --> 00:19:44.240
your files across these different servers
without having a single server responsible

00:19:44.240 --> 00:19:48.050
for deciding, “okay, that file is stored
on that server, and this file is stored

00:19:48.050 --> 00:19:53.050
on that server” etc. etc. Now here is my
list of files. You could take a very naive

00:19:53.050 --> 00:19:57.740
approach. And you could say: “Okay, I’ve
got 26 servers, I got all of these file names

00:19:57.740 --> 00:20:01.250
and start with the letter of the alphabet.”
And I could say: “All of the files that begin

00:20:01.250 --> 00:20:05.450
with A are gonna go under server A; or
the files that begin with B are gonna go

00:20:05.450 --> 00:20:09.900
on server B etc.” And then when you want
to retrieve a file you say: “Okay, what

00:20:09.900 --> 00:20:13.950
does my file name begin with?” And then
you know which server it’s stored on.

00:20:13.950 --> 00:20:17.750
Now of course you could have a lot of
servers – sorry – a lot of files

00:20:17.750 --> 00:20:22.780
which begin with a Z, an X or a Y etc. in
which case you’re gonna overload

00:20:22.780 --> 00:20:27.310
that server. You’re gonna have more files
stored on one server than on another server

00:20:27.310 --> 00:20:32.150
in your set. And if you have a lot of big
files, say e.g. beginning with B then

00:20:32.150 --> 00:20:35.520
rather than distributing your files across
all the servers you’re gonna just be

00:20:35.520 --> 00:20:39.060
overloading one or two of them. So to
solve this problem what we tend to do is:

00:20:39.060 --> 00:20:42.410
we take the file name, and we run it
through a cryptographic hash function.

00:20:42.410 --> 00:20:46.930
A hash function produces output which
looks like random, very small changes

00:20:46.930 --> 00:20:50.740
in the input so a cryptographic hash
function produces a very large change

00:20:50.740 --> 00:20:55.240
in the output. And this change looks
random. So if I take all of my file names

00:20:55.240 --> 00:20:59.820
here, and assuming I have a lot more,
I take a hash of them, and then I use

00:20:59.820 --> 00:21:05.470
that hash to determine which server to
store the file on. Then, with high probability

00:21:05.470 --> 00:21:09.670
my files will be distributed evenly across
all of the servers. And then when I want

00:21:09.670 --> 00:21:12.990
to go and retrieve one of the files I take
my file name, I run it through the

00:21:12.990 --> 00:21:15.980
cryptographic hash function, that gives me
the hash, and then I use that hash

00:21:15.980 --> 00:21:19.740
to identify which server that particular
file is stored on. And then I go and

00:21:19.740 --> 00:21:25.990
retrieve it. So that’s the sort of a loose
idea of how a distributed hash table works.

00:21:25.990 --> 00:21:29.340
There are a couple of problems with this.
What if you got a changing size, what

00:21:29.340 --> 00:21:34.700
if the number of servers you got changes
in size as it does in the Tor network.

00:21:34.700 --> 00:21:42.290
It’s a very brief overview of the theory.
So how does it apply for the Tor network?

00:21:42.290 --> 00:21:47.640
Well, the Tor network has a set of relays
and it has a set of hidden services.

00:21:47.640 --> 00:21:52.710
Now we take all of the relays, and they
have a hash identity which identifies them.

00:21:52.710 --> 00:21:57.460
And we map them onto a circle using that
hash value as an identifier. So you can

00:21:57.460 --> 00:22:03.230
imagine the hash value ranging from Zero
to a very large number. We got a Zero point

00:22:03.230 --> 00:22:07.280
at the very top there. And that runs all
the way round to the very large number.

00:22:07.280 --> 00:22:12.130
So given the identity hash for a relay we
can map that to a particular point on

00:22:12.130 --> 00:22:19.070
the server. And then all we have to do
is also do this for hidden services.

00:22:19.070 --> 00:22:22.320
So there’s a hidden service address,
something.onion, so this is

00:22:22.320 --> 00:22:27.750
one of the hidden websites that you might
visit. You take the – I’m not gonna describe

00:22:27.750 --> 00:22:33.980
in too much detail how this is done but –
the value is done in such a way such that

00:22:33.980 --> 00:22:38.020
it’s evenly distributed about the circle.
So your hidden service will have

00:22:38.020 --> 00:22:44.240
a particular point on the circle. And the
relays will also be mapped onto this circle.

00:22:44.240 --> 00:22:49.640
So there’s the relays. And the hidden
service. And in the case of Tor

00:22:49.640 --> 00:22:53.460
the hidden service actually maps to two
positions on the circle, and it publishes

00:22:53.460 --> 00:22:57.850
its descriptor to the three relays to the
right at one position, and the three relays

00:22:57.850 --> 00:23:01.600
to the right at another position. So there
are actually in total six places where

00:23:01.600 --> 00:23:05.060
this descriptor is published on the
circle. And then if I want to go and

00:23:05.060 --> 00:23:09.450
fetch and connect to a hidden service
I go on to go and pull this hidden descriptor

00:23:09.450 --> 00:23:13.780
down to identify what its introduction
points are. I take the hidden service

00:23:13.780 --> 00:23:17.200
address, I find out where it is on the
circle, I map all of the relays onto

00:23:17.200 --> 00:23:21.110
the circle, and then I identify which
relays on the circle are responsible

00:23:21.110 --> 00:23:24.031
for that particular hidden service. And
I just connect, then I say: “Do you have

00:23:24.031 --> 00:23:26.630
a copy of the descriptor for that
particular hidden service?”

00:23:26.630 --> 00:23:29.620
And if so then we’ve got our list of
introduction points. And we can go

00:23:29.620 --> 00:23:38.020
to the next steps to connect to our hidden
service. So I’m gonna explain how we

00:23:38.020 --> 00:23:41.320
sort of set up our experiments. What we
thought, or what we were interested to do,

00:23:41.320 --> 00:23:48.181
was collect publications of hidden
services. So for everytime a hidden service

00:23:48.181 --> 00:23:51.520
gets set up it publishes to this distributed
hash table. What we wanted to do was

00:23:51.520 --> 00:23:55.750
collect those publications so that we
get a complete list of all of the hidden

00:23:55.750 --> 00:23:59.280
services. And what we also wanted to do
is to find out how many times a particular

00:23:59.280 --> 00:24:06.300
hidden service is requested.

00:24:06.300 --> 00:24:10.540
Just one more point that
will become important later.

00:24:10.540 --> 00:24:14.230
The position which the hidden service
appears on the circle changes

00:24:14.230 --> 00:24:18.950
every 24 hours. So there’s not
a fixed position every single day.

00:24:18.950 --> 00:24:24.370
If we run 40 nodes over a long period of
time we will occupy positions within

00:24:24.370 --> 00:24:29.570
that distributed hash table. And we will be
able to collect publications and requests

00:24:29.570 --> 00:24:34.300
for hidden services that are located at
that position inside the distributed

00:24:34.300 --> 00:24:39.251
hash table. So in that case we ran 40 Tor
nodes, we had a student at university

00:24:39.251 --> 00:24:43.950
who said: “Hey, I run a hosting company,
I got loads of server capacity”, and

00:24:43.950 --> 00:24:46.580
we told him what we were doing, and he
said: “Well, you really helped us out,

00:24:46.580 --> 00:24:49.820
these last couple of years…”
and just gave us loads of server capacity

00:24:49.820 --> 00:24:55.500
to allow us to do this. So we spun up 40
Tor nodes. Each Tor node was required

00:24:55.500 --> 00:24:59.560
to advertise a certain amount of bandwidth
to become a part of that distributed

00:24:59.560 --> 00:25:02.200
hash table. It’s actually a very small
amount, so this didn’t matter too much.

00:25:02.200 --> 00:25:06.050
And then, after – this has changed
recently in the last few days,

00:25:06.050 --> 00:25:10.070
it used to be 25 hours, it’s just been
increased as a result of one of the

00:25:10.070 --> 00:25:14.570
attacks last week. But here… certainly
during our study it was 25 hours. You then

00:25:14.570 --> 00:25:18.300
appear at a particular point inside that
distributed hash table. And you’re then

00:25:18.300 --> 00:25:22.750
in a position to record publications of
hidden services and requests for hidden

00:25:22.750 --> 00:25:27.810
services. So not only can you get a full
list of the onion addresses you can also

00:25:27.810 --> 00:25:32.250
find out how many times each of the
onion addresses are requested.

00:25:32.250 --> 00:25:38.270
And so this is what we recorded. And then,
once we had a full list of… or once

00:25:38.270 --> 00:25:41.830
we had run for a long period of time to
collect a long list of .onion addresses

00:25:41.830 --> 00:25:46.850
we then built a custom crawler that would
visit each of the Tor hidden services

00:25:46.850 --> 00:25:51.450
in turn, and pull down the HTML contents,
the text content from the web page,

00:25:51.450 --> 00:25:54.760
so that we could go ahead and classify
the content. Now it’s really important

00:25:54.760 --> 00:25:59.250
to know here, and it will become obvious
why a little bit later, we only pulled down

00:25:59.250 --> 00:26:03.030
HTML content. We didn’t pull out images.
And there’s a very, very important reason

00:26:03.030 --> 00:26:09.980
for that which will become clear shortly.

00:26:09.980 --> 00:26:13.520
We had a lot of questions when we
first started this. Noone really knew

00:26:13.520 --> 00:26:18.000
how many hidden services there were. It had
been suggested to us there was a very high

00:26:18.000 --> 00:26:21.250
turn-over of hidden services. We wanted to
confirm that whether that was true or not.

00:26:21.250 --> 00:26:24.530
And we also wanted to do this so,
what are the hidden services,

00:26:24.530 --> 00:26:30.140
how popular are they, etc. etc. etc. So
our estimate for how many hidden services

00:26:30.140 --> 00:26:34.770
there are, over the period which we
ran our study, this is a graph plotting

00:26:34.770 --> 00:26:38.560
our estimate for each of the individual
days as to how many hidden services

00:26:38.560 --> 00:26:44.850
there were on that particular day. Now the
data is naturally noisy because we’re only

00:26:44.850 --> 00:26:48.590
a very small proportion of that circle.
So we’re only observing a very small

00:26:48.590 --> 00:26:53.250
proportion of the total publications and
requests every single day, for each of

00:26:53.250 --> 00:26:57.260
those hidden services. And if you
take a long term average for this

00:26:57.260 --> 00:27:02.720
there’s about 45.000 hidden services that
we think were present, on average,

00:27:02.720 --> 00:27:07.880
each day, during our entire study. Which
is a large number of hidden services.

00:27:07.880 --> 00:27:11.070
But over the entire length we
collected about 80.000, in total.

00:27:11.070 --> 00:27:14.270
Some came and went etc.
So the next question after how many

00:27:14.270 --> 00:27:17.750
hidden services there are is how long
the hidden service exists for.

00:27:17.750 --> 00:27:20.620
Does it exist for a very long period
of time, does it exist for a very short

00:27:20.620 --> 00:27:24.220
period of time etc. etc.
So what we did was, for every single

00:27:24.220 --> 00:27:30.260
.onion address we plotted how many times
we saw a publication for that particular

00:27:30.260 --> 00:27:34.160
hidden service during the six months.
How many times did we see it.

00:27:34.160 --> 00:27:38.100
If we saw it a lot of times that suggested
in general the hidden service existed

00:27:38.100 --> 00:27:42.180
for a very long period of time. If we saw
a very short number of publications

00:27:42.180 --> 00:27:45.760
for each hidden service then that
suggests that they were only present

00:27:45.760 --> 00:27:51.690
for a very short period of time. This is
our graph. By far the most number

00:27:51.690 --> 00:27:55.890
of hidden services we only saw once during
the entire study. And we never saw them

00:27:55.890 --> 00:28:00.390
again. We suggest that there’s a very high
turnover of the hidden services, they

00:28:00.390 --> 00:28:04.520
don’t tend to exist on average i.e. for
a very long period of time.

00:28:04.520 --> 00:28:10.730
And then you can see the sort of
a tail here. If we plot just those

00:28:10.730 --> 00:28:16.390
hidden services which existed for a long
time, so e.g. we could take hidden services

00:28:16.390 --> 00:28:20.280
which have a high number of hit requests
and say: “Okay, those that have a high number

00:28:20.280 --> 00:28:24.800
of hits probably existed for a long time.”
That’s not absolutely certain, but probably.

00:28:24.800 --> 00:28:29.190
Then you see this sort of -normal- plot
about 4..5, so we saw on average

00:28:29.190 --> 00:28:34.870
most hidden services four or five times
during the entire six months if they were

00:28:34.870 --> 00:28:40.530
popular and we’re using that as a proxy
measure for whether they existed

00:28:40.530 --> 00:28:48.160
for the entire time. Now, this stage was
over 160 days, so almost six months.

00:28:48.160 --> 00:28:51.490
What we also wanted to do was trying
to confirm this over a longer period.

00:28:51.490 --> 00:28:56.310
So last year, in 2013, about February time
some researchers of the University

00:28:56.310 --> 00:29:00.350
of Luxemburg also ran a similar study
but it ran over a very short period of time

00:29:00.350 --> 00:29:05.060
over the day. But they did it in such
a way it could collect descriptors

00:29:05.060 --> 00:29:08.590
across much of the circle during a single
day. That was because of a bug in the way

00:29:08.590 --> 00:29:12.020
Tor did some of the things which has
now been fixed so we can’t repeat that

00:29:12.020 --> 00:29:16.520
as a particular way. So we got a list of
.onion addresses from February 2013

00:29:16.520 --> 00:29:18.960
from these researchers at the University
of Luxemburg. And then we got our list

00:29:18.960 --> 00:29:23.670
of .onion addresses from this six months
which was March to September of this year.

00:29:23.670 --> 00:29:26.700
And we wanted to say, okay, we’re given
these two sets of .onion addresses.

00:29:26.700 --> 00:29:30.740
Which .onion addresses existed in his set
but not ours and vice versa, and which

00:29:30.740 --> 00:29:39.740
.onion addresses existed in both sets?

00:29:39.740 --> 00:29:45.520
So as you can see a very small minority
of hidden service addresses existed

00:29:45.520 --> 00:29:50.000
in both sets. This is over an 18 month
period between these two collection points.

00:29:50.000 --> 00:29:54.430
A very small number of services existed
in both his data set and in

00:29:54.430 --> 00:29:58.390
our data set. Which again suggested
there’s a very high turnover of hidden

00:29:58.390 --> 00:30:02.920
services that don’t tend to exist
for a very long period of time.

00:30:02.920 --> 00:30:06.530
So the question is why is that?
Which we’ll come on to a little bit later.

00:30:06.530 --> 00:30:11.120
It’s a very valid question, can’t answer
it 100%, we have some inclines as to

00:30:11.120 --> 00:30:15.560
why that may be the case. So in terms
of popularity which hidden services

00:30:15.560 --> 00:30:19.700
did we see, or which .onion addresses
did we see requested the most?

00:30:19.700 --> 00:30:26.980
Which got the most number of hits? Or the
most number of directory requests.

00:30:26.980 --> 00:30:30.120
So botnet Command &amp; Control servers
– if you’re not familiar with what

00:30:30.120 --> 00:30:34.340
a botnet is, the idea is to infect lots of
people with a piece of malware.

00:30:34.340 --> 00:30:37.630
And this malware phones home to
a Command &amp; Control server where

00:30:37.630 --> 00:30:41.500
the botnet master can give instructions
to each of the bots on to do things.

00:30:41.500 --> 00:30:46.780
So it might be e.g. to collect passwords,
key strokes, banking details.

00:30:46.780 --> 00:30:51.010
Or it might be to do things like
Distributed Denial of Service attacks,

00:30:51.010 --> 00:30:55.220
or to send spam, those sorts of things.
And a couple of years ago someone gave

00:30:55.220 --> 00:31:00.720
a talk and said: “Well, the problem with
running a botnet is your C&amp;C servers

00:31:00.720 --> 00:31:05.750
are vulnerable.” Once a C&amp;C server is taken
down you no longer have control over

00:31:05.750 --> 00:31:10.030
your botnet. So it’s been a sort of arms
race against anti-virus companies and

00:31:10.030 --> 00:31:15.130
against malware authors to try and come up
with techniques to run C&amp;C servers in a way

00:31:15.130 --> 00:31:18.490
which they can’t be taken down. And
a couple of years ago someone gave a talk

00:31:18.490 --> 00:31:22.450
at a conference that said: “You know what?
It would be a really good idea if botnet

00:31:22.450 --> 00:31:25.809
C&amp;C servers were run as Tor hidden
services because then no one knows

00:31:25.809 --> 00:31:29.370
where they are, and in theory they can’t
be taken down.” So in the fact we have this

00:31:29.370 --> 00:31:33.000
there are loads and loads and loads of
these addresses associated with several

00:31:33.000 --> 00:31:38.122
different botnets, ‘Sefnit’ and ‘Skynet’.
Now Skynet is the one I wanted to talk

00:31:38.122 --> 00:31:42.840
to you about because the guy that runs
Skynet had a twitter account, and he also

00:31:42.840 --> 00:31:47.210
did a Reddit AMA. If you not heard
of a Reddit AMA before, that’s a Reddit

00:31:47.210 --> 00:31:51.500
ask-me-anything. You can go on the website
and ask the guy anything. So this guy

00:31:51.500 --> 00:31:54.790
wasn’t hiding in the shadows. He’d say:
“Hey, I’m running this massive botnet,

00:31:54.790 --> 00:31:58.180
here’s my Twitter account which I update
regularly, here is my Reddit AMA where

00:31:58.180 --> 00:32:01.620
you can ask me questions!” etc.

00:32:01.620 --> 00:32:04.590
He was arrested last year, which is not,
perhaps, a huge surprise.

00:32:04.590 --> 00:32:11.750
<i>laughter and applause</i>

00:32:11.750 --> 00:32:15.970
But… so he was arrested,
his C&amp;C servers disappeared

00:32:15.970 --> 00:32:21.600
but there were still infected hosts trying
to connect with the C&amp;C servers and

00:32:21.600 --> 00:32:24.490
request access to the C&amp;C server.

00:32:24.490 --> 00:32:27.570
This is why we’re saying: “A large number
of hits.” So all of these requests are

00:32:27.570 --> 00:32:31.520
failed requests, i.e. we didn’t have
a descriptor for them because

00:32:31.520 --> 00:32:34.910
the hidden service had gone away but
there were still clients requesting each

00:32:34.910 --> 00:32:38.040
of the hidden services.

00:32:38.040 --> 00:32:41.980
And the next thing we wanted to do was
to try and categorize sites. So, as I said

00:32:41.980 --> 00:32:45.960
earlier, we crawled all of the hidden
services that we could, and we classified

00:32:45.960 --> 00:32:50.230
them into different categories based
on what the type of content was

00:32:50.230 --> 00:32:53.650
on the hidden service side. The first
graph I have is the number of sites

00:32:53.650 --> 00:32:58.040
in each of the categories. So you can see
down the bottom here we got lots of

00:32:58.040 --> 00:33:04.280
different categories. We got drugs, market
places, etc. on the bottom. And the graph

00:33:04.280 --> 00:33:07.360
shows the percentage of the hidden
services that we crawled that fit in

00:33:07.360 --> 00:33:12.680
to each of these categories. So e.g. looking
at this, drugs, the most number of sites

00:33:12.680 --> 00:33:16.250
that we crawled were made up of
drugs-focused websites, followed by

00:33:16.250 --> 00:33:20.970
market places etc. There’s a couple of
questions you might have here,

00:33:20.970 --> 00:33:25.640
so which ones are gonna stick out, what
does ‘porn’ mean, well, you know

00:33:25.640 --> 00:33:31.060
what ‘porn’ means. There are some very
notorious porn sites on the Tor Darknet.

00:33:31.060 --> 00:33:34.470
There was one in particular which was
focused on revenge porn. It turns out

00:33:34.470 --> 00:33:37.520
that youngsters wish to take pictures
of themselves, and send it to their

00:33:37.520 --> 00:33:45.040
boyfriends or their girlfriends. And
when they get dumped they publish them

00:33:45.040 --> 00:33:49.750
on these websites. So there were several
of these sites on the main internet

00:33:49.750 --> 00:33:53.070
which have mostly been shut down.
And some of these sites were archived

00:33:53.070 --> 00:33:58.220
on the Darknet. The second one is that
we should probably wonder what is,

00:33:58.220 --> 00:34:03.430
is ‘abuse’. Abuse was… every single
site we classified in this category

00:34:03.430 --> 00:34:07.750
were child abuse sites. So they were in
some way facilitating child abuse.

00:34:07.750 --> 00:34:10.980
And how do we know that? Well, the data
that came back from the crawler

00:34:10.980 --> 00:34:14.789
made it completely unambiguous as to what
the content was in these sites. That was

00:34:14.789 --> 00:34:18.918
completely obvious, from then content, from
the crawler as to what was on these sites.

00:34:18.918 --> 00:34:23.449
And this is the principal reason why we
didn’t pull down images from sites.

00:34:23.449 --> 00:34:26.099
There are many countries that
would be a criminal offense to do so.

00:34:26.099 --> 00:34:29.530
So our crawler only pulled down text
content from all of these sites, and that

00:34:29.530 --> 00:34:34.470
enabled us to classify them, based on
that. We didn’t pull down any images.

00:34:34.470 --> 00:34:37.880
So of course the next thing we liked to do
is to say: “Okay, well, given each of these

00:34:37.880 --> 00:34:42.759
categories, what proportion of directory
requests went to each of the categories?”

00:34:42.759 --> 00:34:45.489
Now the next graph is going to need some
explaining as to precisely what it

00:34:45.489 --> 00:34:52.090
means, and I’m gonna give that. This is
the proportion of directory requests

00:34:52.090 --> 00:34:55.830
which we saw that went to each of the
categories of hidden service that we

00:34:55.830 --> 00:34:59.740
classified. As you can see, in fact, we
saw a very large number going to these

00:34:59.740 --> 00:35:05.010
abuse sites. And the rest sort of
distributed right there, at the bottom.

00:35:05.010 --> 00:35:07.230
And the question is: “What is it
we’re collecting here?”

00:35:07.230 --> 00:35:12.070
We’re collecting successful hidden service
directory requests. What does a hidden

00:35:12.070 --> 00:35:16.790
service directory request mean?
It probably loosely correlates with

00:35:16.790 --> 00:35:22.230
either a visit or a visitor. So somewhere
in between those two. Because when you

00:35:22.230 --> 00:35:26.790
want to visit a hidden service you make
a request for the hidden service descriptor

00:35:26.790 --> 00:35:31.080
and that allows you to connect to it
and browse through the web site.

00:35:31.080 --> 00:35:34.770
But there are cases where, e.g. if you
restart Tor, you’ll go back and you

00:35:34.770 --> 00:35:40.100
re-fetch the descriptor. So in that case
we’ll count twice, for example.

00:35:40.100 --> 00:35:43.050
What proportion of these are people,
and which proportion of them are

00:35:43.050 --> 00:35:46.619
something else? The answer to that is
we just simply don’t know.

00:35:46.619 --> 00:35:50.250
We've got directory requests but that doesn’t
tell us about what they’re doing on these

00:35:50.250 --> 00:35:55.130
sites, what they’re fetching, or who
indeed they are, or what it is they are.

00:35:55.130 --> 00:35:58.690
So these could be automated requests,
they could be human beings. We can’t

00:35:58.690 --> 00:36:03.750
distinguish between those two things.

00:36:03.750 --> 00:36:06.420
What are the limitations?

00:36:06.420 --> 00:36:12.170
A hidden service directory request neither
exactly correlates to a visit -or- a visitor.

00:36:12.170 --> 00:36:16.380
It’s probably somewhere in between.
So you can’t say whether it’s exactly one

00:36:16.380 --> 00:36:19.810
or the other. We cannot say whether
a hidden service directory request

00:36:19.810 --> 00:36:26.230
is a person or something automated.
We can’t distinguish between those two.

00:36:26.230 --> 00:36:31.890
Any type of site could be targeted by e.g.
DoS attacks, by web crawlers which would

00:36:31.890 --> 00:36:40.040
greatly inflate the figures. If you were
to do a DoS attack it’s likely you’d only

00:36:40.040 --> 00:36:44.700
request a small number of descriptors.
You’d actually be flooding the site itself

00:36:44.700 --> 00:36:47.740
rather than the directories. But, in
theory, you could flood the directories.

00:36:47.740 --> 00:36:52.840
But we didn’t see any sort of shutdown
of our directories based on flooding, e.g.

00:36:52.840 --> 00:36:58.720
Whilst we can’t rule that out, it doesn’t
seem to fit too well with what we’ve got.

00:36:58.720 --> 00:37:02.971
The other question is ‘crawlers’.
I obviously talked with the Tor Project

00:37:02.971 --> 00:37:08.570
about these results and they’ve suggested
that there are groups, so the child

00:37:08.570 --> 00:37:12.740
protection agencies e.g. that will crawl
these sites on a regular basis. And,

00:37:12.740 --> 00:37:15.879
again, that doesn’t necessarily correlate
with a human being. And that could

00:37:15.879 --> 00:37:19.830
inflate the figures. How many hidden
directory requests would there be

00:37:19.830 --> 00:37:24.610
if a crawler was pointed at it. Typically,
if I crawl them on a single day, one request.

00:37:24.610 --> 00:37:27.850
But if they got a large number of servers
doing the crawling then it could be

00:37:27.850 --> 00:37:32.840
a request per day for every single server.
So, again, I can’t give you, definitive,

00:37:32.840 --> 00:37:37.930
“yes, this is human beings” or
“yes, this is automated requests”.

00:37:37.930 --> 00:37:43.300
The other important point is, these two
content graphs are only hidden services

00:37:43.300 --> 00:37:48.550
offering web content. There are hidden
services that do things, e.g. IRC,

00:37:48.550 --> 00:37:52.490
the instant messaging etc. Those aren’t
included in these figures. We’re only

00:37:52.490 --> 00:37:57.990
concentrating on hidden services offering
web sites. They’re HTTP services, or HTTPS

00:37:57.990 --> 00:38:01.640
services. Because that allows to easily
classify them. And, in fact, some of

00:38:01.640 --> 00:38:06.080
the other types are IRC and Jabber the
result was probably not directly comparable

00:38:06.080 --> 00:38:08.920
with web sites. That’s sort of the use
case for using them, it’s probably

00:38:08.920 --> 00:38:16.490
slightly different. So I appreciate the
last graph is somewhat alarming.

00:38:16.490 --> 00:38:20.640
If you have any questions please ask
either me or the Tor developers

00:38:20.640 --> 00:38:24.810
as to how to interpret these results. It’s
not quite as straight-forward as it may

00:38:24.810 --> 00:38:27.500
look when you look at the graph. You
might look at the graph and say: “Hey,

00:38:27.500 --> 00:38:30.980
that looks like there’s lots of people
visiting these sites”. It’s difficult

00:38:30.980 --> 00:38:40.240
to conclude that from the results.

00:38:40.240 --> 00:38:45.990
The next slide is gonna be very
contentious. I will prefix it with:

00:38:45.990 --> 00:38:50.970
“I’m not advocating -any- kind of
action whatsoever. I’m just trying

00:38:50.970 --> 00:38:56.130
to describe technically as to what could
be done. It’s not up to me to make decisions

00:38:56.130 --> 00:39:02.869
on these types of things.” So, of course,
when we found this out, frankly, I think

00:39:02.869 --> 00:39:06.190
we were stunned. I mean, it took us
several days, frankly, it just stunned us,

00:39:06.190 --> 00:39:09.610
“what the hell, this is not
what we expected at all.”

00:39:09.610 --> 00:39:13.210
So a natural step is, well, we think, most
of us think that Tor is a great thing,

00:39:13.210 --> 00:39:18.510
it seems. Could this problem be sorted out
while still keeping Tor as it is?

00:39:18.510 --> 00:39:21.510
And probably the next step to say: “Well,
okay, could we just block this class

00:39:21.510 --> 00:39:26.060
of content and not other types of content?”
So could we block just hidden services

00:39:26.060 --> 00:39:29.630
that are associated with these sites and
not other types of hidden services?

00:39:29.630 --> 00:39:33.370
We thought there’s three ways in which
we could block hidden services.

00:39:33.370 --> 00:39:36.960
And I’ll talk about whether these were
impossible in the coming months,

00:39:36.960 --> 00:39:39.430
after explaining them. But during our
study these would have been impossible

00:39:39.430 --> 00:39:43.590
and presently they are possible.

00:39:43.590 --> 00:39:48.630
A single individual could shut down
a single hidden service by controlling

00:39:48.630 --> 00:39:53.640
all of the relays which are responsible
for receiving a publication request

00:39:53.640 --> 00:39:57.280
on that distributed hash table. It’s
possible to place one of your relays

00:39:57.280 --> 00:40:01.460
at a particular position on that circle
and so therefore make yourself be

00:40:01.460 --> 00:40:04.290
the responsible relay for
a particular hidden service.

00:40:04.290 --> 00:40:08.500
And if you control all of the six relays
which are responsible for a hidden service,

00:40:08.500 --> 00:40:11.390
when someone comes to you and says:
“Can I have a descriptor for that site”

00:40:11.390 --> 00:40:15.910
you can just say: “No, I haven’t got it”.
And provided you control those relays

00:40:15.910 --> 00:40:20.580
users won’t be able to fetch those sites.

00:40:20.580 --> 00:40:25.010
The second option is you could say:
“Okay, the Tor Project are blocking these”

00:40:25.010 --> 00:40:28.941
– which I’ll talk about in a second –
“as a relay operator”. Could I

00:40:28.941 --> 00:40:32.500
as a relay operator say: “Okay, as
a relay operator I don’t want to carry

00:40:32.500 --> 00:40:35.930
this type of content, and I don’t want to
be responsible for serving up this type

00:40:35.930 --> 00:40:39.930
of content.” A relay operator could patch
his relay and say: “You know what,

00:40:39.930 --> 00:40:44.020
if anyone comes to this relay requesting
anyone of these sites then, again, just

00:40:44.020 --> 00:40:48.740
refuse to do it”. The problem is a lot of
relay operators need to do it. So a very,

00:40:48.740 --> 00:40:51.990
very large number of the potential relay
operators would need to do that

00:40:51.990 --> 00:40:56.170
to effectively block these sites. The
final option is the Tor Project could

00:40:56.170 --> 00:41:00.740
modify the Tor program and actually embed
these ingresses in the Tor program itself

00:41:00.740 --> 00:41:05.030
so as that all relays by default both
block hidden service directory requests

00:41:05.030 --> 00:41:10.560
to these sites, and also clients themselves
would say: “Okay, if anyone’s requesting

00:41:10.560 --> 00:41:15.000
these block them at the client level.”
Now I hasten to add: I’m not advocating

00:41:15.000 --> 00:41:18.230
any kind of action that is entirely up to
other people because, frankly, I think

00:41:18.230 --> 00:41:22.530
if I advocated blocking hidden services
I probably wouldn’t make it out alive,

00:41:22.530 --> 00:41:27.050
so I’m just saying: this is a description
of what technical measures could be used

00:41:27.050 --> 00:41:30.730
to block some classes of sites. And of
course there’s lots of questions here.

00:41:30.730 --> 00:41:35.150
If e.g. the Tor Project themselves decided:
“Okay, we’re gonna block these sites”

00:41:35.150 --> 00:41:38.490
that means they are essentially
in control of the block list.

00:41:38.490 --> 00:41:41.360
The block list would be somewhat public
so everyone would be up to inspect

00:41:41.360 --> 00:41:44.930
what the sites are that are being blocked
and they would be in control of some kind

00:41:44.930 --> 00:41:54.360
of block list. Which, you know, arguably
is against what the Tor Projects are after.

00:41:54.360 --> 00:41:59.560
<i>takes a sip, coughs</i>

00:41:59.560 --> 00:42:05.480
So how about deanonymising visitors
to hidden service web sites?

00:42:05.480 --> 00:42:08.940
So in this case we got a user on the
left-hand side who is connected to

00:42:08.940 --> 00:42:12.630
a Guard node. We’ve got a hidden service
on the right-hand side who is connected

00:42:12.630 --> 00:42:17.530
to a Guard node and on the top we got
one of those directory servers which is

00:42:17.530 --> 00:42:21.850
responsible for serving up those
hidden service directory requests.

00:42:21.850 --> 00:42:28.660
Now, when you first want to connect to
a hidden service you connect through

00:42:28.660 --> 00:42:31.619
your Guard node and through a couple of hops
up to the hidden service directory and

00:42:31.619 --> 00:42:35.840
you request the descriptor off of them.
So at this point if you are the attacker

00:42:35.840 --> 00:42:39.440
and you control one of the hidden service
directory nodes for a particular site

00:42:39.440 --> 00:42:43.100
you can send back down the circuit
a particular pattern of traffic.

00:42:43.100 --> 00:42:47.740
And if you control that user’s
Guard node – which is a big if –

00:42:47.740 --> 00:42:52.110
then you can spot that pattern of traffic
at the Guard node. The question is:

00:42:52.110 --> 00:42:56.940
“How do you control a particular user’s
Guard node?” That’s very, very hard.

00:42:56.940 --> 00:43:01.480
But if e.g. I run a hidden service and all
of you visit my hidden service, and

00:43:01.480 --> 00:43:05.670
I’m running a couple of dodgy Guard relays
then the probability is that some of you,

00:43:05.670 --> 00:43:09.760
certainly not all of you by any stretch will
select my dodgy Guard relay, and

00:43:09.760 --> 00:43:13.220
I could deanonymise you, but I couldn’t
deanonymise the rest of them.

00:43:13.220 --> 00:43:18.260
So what we’re saying here is that
you can deanonymise some of the users

00:43:18.260 --> 00:43:22.130
some of the time but you can’t pick which
users those are which you’re going to

00:43:22.130 --> 00:43:26.609
deanonymise. You can’t deanonymise someone
specific but you can deanonymise a fraction

00:43:26.609 --> 00:43:32.170
based on what fraction of the network you
control in terms of Guard capacity.

00:43:32.170 --> 00:43:36.340
How about… so the attacker controls those
two – here’s a picture from a research of

00:43:36.340 --> 00:43:40.200
the University of Luxemburg which
did this. And these are plots of

00:43:40.200 --> 00:43:45.270
taking the user’s IP address visiting
a C&amp;C server, and then geolocating it

00:43:45.270 --> 00:43:48.480
and putting it on a map. So “where was the
user located when they called one of

00:43:48.480 --> 00:43:51.620
the Tor hidden services?” So, again,
this is a selection, a percentage

00:43:51.620 --> 00:43:58.060
of the users visiting C&amp;C servers
using this technique.

00:43:58.060 --> 00:44:03.770
How about deanonymising hidden services
themselves? Well, again, you got a problem.

00:44:03.770 --> 00:44:08.340
You’re the user. You’re gonna connect
through your Guard into the Tor network.

00:44:08.340 --> 00:44:12.160
And then, eventually, through the hidden
service’s Guard node, and talk to

00:44:12.160 --> 00:44:16.740
the hidden service. As the attacker you
need to control the hidden service’s

00:44:16.740 --> 00:44:20.859
Guard node to do these traffic correlation
attacks. So again, it’s very difficult

00:44:20.859 --> 00:44:24.390
to deanonymise a specific Tor hidden
service. But if you think about, okay,

00:44:24.390 --> 00:44:30.200
there is 1.000 Tor hidden services, if you
can control a percentage of the Guard nodes

00:44:30.200 --> 00:44:34.230
then some hidden services will pick you
and then you’ll be able to deanonymise those.

00:44:34.230 --> 00:44:37.330
So provided you don’t care which hidden
services you gonna deanonymise

00:44:37.330 --> 00:44:41.400
then it becomes much more straight-forward
to control the Guard nodes of some hidden

00:44:41.400 --> 00:44:44.910
services but you can’t pick exactly
what those are.

00:44:44.910 --> 00:44:51.040
So what sort of data can you see
traversing a relay?

00:44:51.040 --> 00:44:55.880
This is a modified Tor client which just
dumps cells which are coming…

00:44:55.880 --> 00:44:58.750
essentially packets travelling down
a circuit, and the information you can

00:44:58.750 --> 00:45:04.020
extract from them at a Guard node.
And this is done off the main Tor network.

00:45:04.020 --> 00:45:08.590
So I’ve got a client connected to
a “malicious” Guard relay

00:45:08.590 --> 00:45:14.040
and it logs every single packet – they’re
called ‘cells’ in the Tor protocol –

00:45:14.040 --> 00:45:17.619
coming through the Guard relay. We can’t
decrypt the packet because it’s encrypted

00:45:17.619 --> 00:45:21.780
three times. What we can record,
though, is the IP address of the user,

00:45:21.780 --> 00:45:25.070
the IP address of the next hop,
and we can count packets travelling

00:45:25.070 --> 00:45:29.240
in each direction down the circuit. And we
can also record the time at which those

00:45:29.240 --> 00:45:32.210
packets were sent. So of course, if you’re
doing the traffic correlation attacks

00:45:32.210 --> 00:45:37.970
you’re using that time in the information
to try and work out whether you’re seeing

00:45:37.970 --> 00:45:42.370
traffic which you’ve sent and which
identifies a particular user or not.

00:45:42.370 --> 00:45:44.810
Or indeed traffic which they’ve sent
which you’ve seen at a different point

00:45:44.810 --> 00:45:49.100
in the network.

00:45:49.100 --> 00:45:51.980
Moving on to my…

00:45:51.980 --> 00:45:55.760
…interesting problems,
research questions etc.

00:45:55.760 --> 00:45:59.250
Based on what I’ve said, I’ve said there’s
these directory authorities which are

00:45:59.250 --> 00:46:05.070
controlled by the core Tor members. If
e.g. they were malicious then they could

00:46:05.070 --> 00:46:08.990
manipulate the Tor… – if a big enough
chunk of them are malicious then

00:46:08.990 --> 00:46:12.700
they can manipulate the consensus
to direct you to particular nodes.

00:46:12.700 --> 00:46:15.920
I don’t think that’s the case, and that
anyone thinks that’s the case.

00:46:15.920 --> 00:46:19.180
And Tor is designed in a way to tr…
I mean that you’d have to control

00:46:19.180 --> 00:46:22.480
a certain number of the authorities
to be able to do anything important.

00:46:22.480 --> 00:46:25.270
So the Tor people… I said this
to them a couple of days ago.

00:46:25.270 --> 00:46:28.780
I find it quite funny that you’d design
your system as if you don’t trust

00:46:28.780 --> 00:46:31.880
each other. To which their response was:
“No, we design our system so that

00:46:31.880 --> 00:46:35.620
we don’t have to trust each other.” Which
I think is a very good model to have,

00:46:35.620 --> 00:46:39.430
when you have this type of system.
So could we eliminate these sort of

00:46:39.430 --> 00:46:43.240
centralized servers? I think that’s
actually a very hard problem to do.

00:46:43.240 --> 00:46:46.340
There are lots of attacks which could
potentially be deployed against

00:46:46.340 --> 00:46:51.250
a decentralized network. At the moment the
Tor network is relatively well understood

00:46:51.250 --> 00:46:54.490
both in terms of what types of attack it
is vulnerable to. So if we were to move

00:46:54.490 --> 00:46:58.880
to a new architecture then we may open it
to a whole new class of attacks.

00:46:58.880 --> 00:47:02.000
The Tor network has been existing
for quite some time and it’s been

00:47:02.000 --> 00:47:06.820
very well studied. What about global
adversaries like the NSA, where you could

00:47:06.820 --> 00:47:10.980
monitor network links all across the
world? It’s very difficult to defend

00:47:10.980 --> 00:47:15.530
against that. Where they can monitor…
if they can identify which Guard relay

00:47:15.530 --> 00:47:18.760
you’re using, they can monitor traffic
going into and out of the Guard relay,

00:47:18.760 --> 00:47:23.259
and they log each of the subsequent hops
along. It’s very, very difficult to defend against

00:47:23.259 --> 00:47:26.470
these types of things. Do we know if
they’re doing it? The documents that were

00:47:26.470 --> 00:47:29.850
released yesterday – I’ve only had a very
brief look through them, but they suggest

00:47:29.850 --> 00:47:32.480
that they’re not presently doing it and
they haven’t had much success.

00:47:32.480 --> 00:47:36.450
I don’t know why, there are very powerful
attacks described in the academic literature

00:47:36.450 --> 00:47:40.830
which are very, very reliable and most
academic literature you can access for free

00:47:40.830 --> 00:47:43.960
so it’s not even as if they have to figure
out how to do it. They just have to read

00:47:43.960 --> 00:47:47.010
the academic literature and try and
implement some of these attacks.

00:47:47.010 --> 00:47:52.000
I don’t know what – why they’re not. The
next question is how to detect malicious

00:47:52.000 --> 00:47:57.760
relays. So in my case we’re running
40 relays. Our relays were on consecutive

00:47:57.760 --> 00:48:01.570
IP addresses, so we’re running 40
– well, most of them are on consecutive

00:48:01.570 --> 00:48:04.820
IP addresses in two blocks. So they’re
running on IP addresses numbered

00:48:04.820 --> 00:48:09.280
e.g. 1,2,3,4,…
We were running two relays per IP address,

00:48:09.280 --> 00:48:12.210
and every single relay had my name
plastered across it.

00:48:12.210 --> 00:48:14.740
So after I set up these 40 relays in

00:48:14.740 --> 00:48:17.420
a relatively short period of time
I expected someone from the Tor Project

00:48:17.420 --> 00:48:22.260
to come to me and say: “Hey Gareth, what
are you doing?” – no one noticed,

00:48:22.260 --> 00:48:26.090
no one noticed. So this is presently
an open question. On the Tor Project

00:48:26.090 --> 00:48:28.790
they’re quite open about this. They
acknowledged that, in fact, last year

00:48:28.790 --> 00:48:33.210
we had the CERT researchers launch much
more relays than that. The Tor Project

00:48:33.210 --> 00:48:36.510
spotted those large number of relays
but chose not to do anything about it

00:48:36.510 --> 00:48:40.119
and, in fact, they were deploying an
attack. But, as you know, it’s often very

00:48:40.119 --> 00:48:43.700
difficult to defend against unknown
attacks. So at the moment how to detect

00:48:43.700 --> 00:48:47.780
malicious relays is a bit of an open
question. Which as I think is being

00:48:47.780 --> 00:48:50.720
discussed on the mailing list.

00:48:50.720 --> 00:48:54.230
The other one is defending against unknown
tampering at exits. If you took or take

00:48:54.230 --> 00:48:57.220
the exit relays – the exit relay
can tamper with the traffic.

00:48:57.220 --> 00:49:01.040
So we know particular types of attacks
doing SSL man-in-the-middles etc.

00:49:01.040 --> 00:49:05.350
We’ve seen recently binary patching.
How do we detect unknown tampering

00:49:05.350 --> 00:49:08.970
with traffic, other types of traffic? So
the binary tampering wasn’t spotted

00:49:08.970 --> 00:49:12.060
until it was spotted by someone who
told the Tor Project. So it wasn’t

00:49:12.060 --> 00:49:15.609
detected e.g. by the Tor Project
themselves, it was spotted by someone else

00:49:15.609 --> 00:49:20.500
and notified to them. And then the final
one open on here is the Tor code review.

00:49:20.500 --> 00:49:25.400
So the Tor code is open source. We know
from OpenSSL that, although everyone

00:49:25.400 --> 00:49:29.260
can read source code, people don’t always
look at it. And OpenSSL has been

00:49:29.260 --> 00:49:32.230
a huge mess, and there’s been
lots of stuff disclosed over that

00:49:32.230 --> 00:49:35.880
over the last coming days. There are
lots of eyes on the Tor code but I think

00:49:35.880 --> 00:49:41.519
always, more eyes are better. I’d say,
ideally if we can get people to look

00:49:41.519 --> 00:49:45.140
at the Tor code and look for
vulnerabilities then… I encourage people

00:49:45.140 --> 00:49:49.860
to do that. It’s a very useful thing to
do. There could be unknown vulnerabilities

00:49:49.860 --> 00:49:53.119
as we’ve seen with the “relay early” type
quite recently in the Tor code which

00:49:53.119 --> 00:49:56.990
could be quite serious. The truth is we
just don’t know until people do thorough

00:49:56.990 --> 00:50:02.500
code audits, and even then it’s very
difficult to know for certain.

00:50:02.500 --> 00:50:08.170
So my last point, I think, yes,

00:50:08.170 --> 00:50:11.130
is advice to future researchers.
So if you ever wanted, or are planning

00:50:11.130 --> 00:50:16.349
on doing a study in the future, e.g. on
Tor, do not do what the CERT researchers

00:50:16.349 --> 00:50:20.550
do and start deanonymising people on the
live Tor network and doing it in a way

00:50:20.550 --> 00:50:25.060
which is incredibly irresponsible. I don’t
think…I mean, I tend, myself, to give you with

00:50:25.060 --> 00:50:28.510
the benefit of a doubt, I don’t think the
CERT researchers set out to be malicious.

00:50:28.510 --> 00:50:33.320
I think they’re just very naive.
That’s what it was they were doing.

00:50:33.320 --> 00:50:36.780
That was rapidly pointed out to them.
In my case we are running

00:50:36.780 --> 00:50:43.090
40 relays. Our Tor relays they were forwarding
traffic, they were acting as good relays.

00:50:43.090 --> 00:50:45.970
The only thing that we were doing
was logging publication requests

00:50:45.970 --> 00:50:50.050
to the directories. Big question whether
that’s malicious or not – I don’t know.

00:50:50.050 --> 00:50:53.330
One thing that has been pointed out to me
is that the .onion addresses themselves

00:50:53.330 --> 00:50:58.270
could be considered sensitive information,
so any data we will be retaining

00:50:58.270 --> 00:51:01.840
from the study is the aggregated data.
So we won't be retaining information

00:51:01.840 --> 00:51:05.400
on individual .onion addresses because
that could potentially be considered

00:51:05.400 --> 00:51:08.900
sensitive information. If you think about
someone running an .onion address which

00:51:08.900 --> 00:51:11.240
contains something which they don’t want
other people knowing about. So we won’t

00:51:11.240 --> 00:51:15.060
be retaining that data, and
we’ll be destroying them.

00:51:15.060 --> 00:51:19.920
So I think that brings me now
to starting the questions.

00:51:19.920 --> 00:51:22.770
I want to say “Thanks” to a couple of
people. The student who donated

00:51:22.770 --> 00:51:26.820
the server to us. Nick Savage who is one
of my colleagues who was a sounding board

00:51:26.820 --> 00:51:30.510
during the entire study. Ivan Pustogarov
who is the researcher at the University

00:51:30.510 --> 00:51:34.700
of Luxembourg who sent us the large data
set of .onion addresses from last year.

00:51:34.700 --> 00:51:37.670
He’s also the chap who has demonstrated
those deanonymisation attacks

00:51:37.670 --> 00:51:41.500
that I talked about. A big "Thank you" to
Roger Dingledine who has frankly been…

00:51:41.500 --> 00:51:45.230
presented loads of questions to me over
the last couple of days and allowed me

00:51:45.230 --> 00:51:49.410
to bounce ideas back and forth.
That has been a very useful process.

00:51:49.410 --> 00:51:53.640
If you are doing future research I strongly
encourage you to contact the Tor Project

00:51:53.640 --> 00:51:57.040
at the earliest opportunity. You’ll find
them… certainly I found them to be

00:51:57.040 --> 00:51:59.460
extremely helpful.

00:51:59.460 --> 00:52:04.640
Donncha also did something similar,
so both Ivan and Donncha have done

00:52:04.640 --> 00:52:09.520
a similar study in trying to classify the
types of hidden services or work out

00:52:09.520 --> 00:52:13.520
how many hits there are to particular
types of hidden service. Ivan Pustogarov

00:52:13.520 --> 00:52:17.430
did it on a bigger scale
and found similar results to us.

00:52:17.430 --> 00:52:21.910
That is that these abuse sites
featured frequently

00:52:21.910 --> 00:52:26.740
in the top requested sites. That was done
over a year ago, and again, he was seeing

00:52:26.740 --> 00:52:31.109
similar sorts of pattern. There were these
abuse sites being requested frequently.

00:52:31.109 --> 00:52:35.450
So that also sort of probates
what we’re saying.

00:52:35.450 --> 00:52:38.540
The data I put online is at this address,
there will probably be the slides,

00:52:38.540 --> 00:52:41.609
something called ‘The Tor Research
Framework’ which is an implementation

00:52:41.609 --> 00:52:47.510
of a Java client, so an implementation
of a Tor client in Java specifically aimed

00:52:47.510 --> 00:52:52.080
at researchers. So if e.g. you wanna pull
out data from a consensus you can do.

00:52:52.080 --> 00:52:55.290
If you want to build custom routes
through the network you can do.

00:52:55.290 --> 00:52:58.230
If you want to build routes through the
network and start sending padding traffic

00:52:58.230 --> 00:53:01.720
down them you can do etc.
The code is designed in a way which is

00:53:01.720 --> 00:53:06.000
designed to be easily modifiable
for testing lots of these things.

00:53:06.000 --> 00:53:10.580
There is also a link to the Tor FBI
exploit which they deployed against

00:53:10.580 --> 00:53:16.230
visitors to some Tor hidden services last
year. They exploited a Mozilla Firefox bug

00:53:16.230 --> 00:53:20.540
and then ran code on users who were
visiting these hidden service, and ran

00:53:20.540 --> 00:53:24.619
code on their computer to identify them.
At this address there is a link to that

00:53:24.619 --> 00:53:29.250
including a copy of the shell code and an
analysis of exactly what it was doing.

00:53:29.250 --> 00:53:31.670
And then of course a list of references,
with papers and things.

00:53:31.670 --> 00:53:34.260
So I’m quite happy to take questions now.

00:53:34.260 --> 00:53:46.960
<i>applause</i>

00:53:46.960 --> 00:53:50.880
Herald: Thanks for the nice talk!
Do we have any questions

00:53:50.880 --> 00:53:57.000
from the internet?

00:53:57.000 --> 00:53:59.740
Signal Angel: One question. It’s very hard
to block addresses since creating them

00:53:59.740 --> 00:54:03.620
is cheap, and they can be generated
for each user, and rotated often. So

00:54:03.620 --> 00:54:07.510
can you think of any other way
for doing the blocking?

00:54:07.510 --> 00:54:09.799
Gareth: That is absolutely true, so, yes.
If you were to block a particular .onion

00:54:09.799 --> 00:54:13.060
address they can wail: “I want another
.onion address.” So I don’t know of

00:54:13.060 --> 00:54:16.760
any way to counter that now.

00:54:16.760 --> 00:54:18.510
Herald: Another one from the internet?
<i>inaudible answer from Signal Angel</i>

00:54:18.510 --> 00:54:22.030
Okay, then, Microphone 1, please!

00:54:22.030 --> 00:54:26.359
Question: Thank you, that’s fascinating
research. You mentioned that it is

00:54:26.359 --> 00:54:32.200
possible to influence the hash of your
relay node in a sense that you could

00:54:32.200 --> 00:54:35.970
to be choosing which service you are
advertising, or which hidden service

00:54:35.970 --> 00:54:38.050
you are responsible for. Is that right?
Gareth: Yeah, correct!

00:54:38.050 --> 00:54:40.390
Question: So could you elaborate
on how this is possible?

00:54:40.390 --> 00:54:44.740
Gareth: So e.g. you just keep regenerating
a public key for your relay,

00:54:44.740 --> 00:54:48.140
you’ll get closer and closer to the point
where you’ll be the responsible relay

00:54:48.140 --> 00:54:51.160
for that particular hidden service. That’s
just – you keep regenerating your identity

00:54:51.160 --> 00:54:54.720
hash until you’re at that particular point
in the relay. That’s not particularly

00:54:54.720 --> 00:55:00.490
computationally intensive to do.
That was it?

00:55:00.490 --> 00:55:04.740
Herald: Okay, next question
from Microphone 5, please.

00:55:04.740 --> 00:55:09.490
Question: Hi, I was wondering for the
attacks where you identify a certain number

00:55:09.490 --> 00:55:15.170
of users using a hidden service. Have
those attacks been used, or is there

00:55:15.170 --> 00:55:18.880
any evidence there, and is there
any way of protecting against that?

00:55:18.880 --> 00:55:22.260
Gareth: That’s a very interesting question,
is there any way to detect these types

00:55:22.260 --> 00:55:24.970
of attacks? So some of the attacks,
if you’re going to generate particular

00:55:24.970 --> 00:55:29.030
traffic patterns, one way to do that is to
use the padding cells. The padding cells

00:55:29.030 --> 00:55:32.070
aren’t used at the moment by the official
Tor client. So the detection of those

00:55:32.070 --> 00:55:36.510
could be indicative but it doesn't... 
it`s not conclusive evidence in our tool.

00:55:36.510 --> 00:55:40.050
Question: And is there any way of
protecting against a government

00:55:40.050 --> 00:55:46.510
or something trying to denial-of-service
hidden services?

00:55:46.510 --> 00:55:48.180
Gareth: So I… trying to… did not…

00:55:48.180 --> 00:55:52.500
Question: Is it possible to protect
against this kind of attack?

00:55:52.500 --> 00:55:56.180
Gareth: Not that I’m aware of. The Tor
Project are currently revising how they

00:55:56.180 --> 00:55:59.500
do the hidden service protocol which will
make e.g. what I did, enumerating

00:55:59.500 --> 00:56:03.230
the hidden services, much more difficult.
And to also be in a position on the

00:56:03.230 --> 00:56:07.470
distributed hash table in advance
for a particular hidden service.

00:56:07.470 --> 00:56:10.510
So they are at the moment trying to change
the way it’s done, and make some of

00:56:10.510 --> 00:56:15.270
these things more difficult.

00:56:15.270 --> 00:56:20.290
Herald: Good. Next question
from Microphone 2, please.

00:56:20.290 --> 00:56:27.220
Mic2: Hi. I’m running the Tor2Web abuse,
and so I used to see a lot of abuse of requests

00:56:27.220 --> 00:56:31.130
concerning the Tor hidden service
being exposed on the internet through

00:56:31.130 --> 00:56:37.270
the Tor2Web.org domain name. And I just
wanted to comment on, like you said,

00:56:37.270 --> 00:56:45.410
the abuse number of the requests. I used
to spoke with some of the child protection

00:56:45.410 --> 00:56:50.070
agencies that reported abuse at
Tor2Web.org, and they are effectively

00:56:50.070 --> 00:56:55.570
using crawlers that periodically look for
changes in order to get new images to be

00:56:55.570 --> 00:57:00.190
put in the database. And what I was able
to understand is that the German agency

00:57:00.190 --> 00:57:07.440
doing that is crawling the same sites that
the Italian agencies are crawling, too.

00:57:07.440 --> 00:57:11.890
So it’s likely that in most of the
countries there are the child protection

00:57:11.890 --> 00:57:16.790
agencies that are crawling those few
numbers of Tor hidden services that

00:57:16.790 --> 00:57:22.760
contain child porn. And I saw it also
a bit from the statistics of Tor2Web

00:57:22.760 --> 00:57:28.500
where the amount of abuse relating to
that kind of content, it’s relatively low.

00:57:28.500 --> 00:57:30.000
Just as contribution!

00:57:30.000 --> 00:57:33.500
Gareth: Yes, that’s very interesting,
thank you for that!

00:57:33.500 --> 00:57:37.260
<i>applause</i>

00:57:37.260 --> 00:57:39.560
Herald: Next, Microphone 4, please.

00:57:39.560 --> 00:57:45.260
Mic4: You then attacked or deanonymised
users with an infected or a modified Guard

00:57:45.260 --> 00:57:51.810
relay? Is it required to modify the Guard
relay if I control the entry point

00:57:51.810 --> 00:57:57.360
of the user to the internet?
If I’m his ISP?

00:57:57.360 --> 00:58:01.900
Gareth: Yes, if you observe traffic
travelling into a Guard relay without

00:58:01.900 --> 00:58:04.570
controlling the Guard relay itself.
Mic4: Yeah.

00:58:04.570 --> 00:58:07.500
Gareth: In theory, yes. I wouldn’t be able
to tell you how reliable that is

00:58:07.500 --> 00:58:10.500
off the top of my head.
Mic4: Thanks!

00:58:10.500 --> 00:58:13.630
Herald: So another question
from the internet!

00:58:13.630 --> 00:58:16.339
Signal Angel: Wouldn’t the ability to
choose the key hash prefix give

00:58:16.339 --> 00:58:19.980
the ability to target specific .onions?

00:58:19.980 --> 00:58:23.680
Gareth: So you can only target one .onion
address at a time. Because of the way

00:58:23.680 --> 00:58:28.080
they are generated. So you wouldn’t be
able to say e.g. “Pick a key which targeted

00:58:28.080 --> 00:58:32.339
two or more .onion addresses.” You can
only target one .onion address at a time

00:58:32.339 --> 00:58:37.720
by positioning yourself at a particular
point on the distributed hash table.

00:58:37.720 --> 00:58:40.260
Herald: Another one
from the internet? … Okay.

00:58:40.260 --> 00:58:43.369
Then Microphone 3, please.

00:58:43.369 --> 00:58:47.780
Mic3: Hey. Thanks for this research.
I think it strengthens the network.

00:58:47.780 --> 00:58:54.300
So in the deem (?) I was wondering whether
you can donate this relays to be a part of

00:58:54.300 --> 00:58:59.500
non-malicious relays pool, basically
use them as regular relays afterwards?

00:58:59.500 --> 00:59:02.750
Gareth: Okay, so can I donate the relays
a rerun and at the Tor capacity (?) ?

00:59:02.750 --> 00:59:05.490
Unfortunately, I said they were run by
a student and they were donated for

00:59:05.490 --> 00:59:09.510
a fixed period of time. So we’ve given
those back to him. We are very grateful

00:59:09.510 --> 00:59:14.790
to him, he was very generous. In fact,
without his contribution donating these

00:59:14.790 --> 00:59:18.700
it would have been much more difficult
to collect as much data as we did.

00:59:18.700 --> 00:59:21.490
Herald: Good, next, Microphone 5, please!

00:59:21.490 --> 00:59:25.839
Mic5: Yeah hi, first of all thanks
for your talk. I think you’ve raised

00:59:25.839 --> 00:59:29.310
some real issues that need to be
considered very carerfully by everyone

00:59:29.310 --> 00:59:33.950
on the Tor Project. My question: I’d like
to go back to the issue with so many

00:59:33.950 --> 00:59:38.470
abuse related web sites running over
the Tor Project. I think it’s an important

00:59:38.470 --> 00:59:41.900
issue that really needs to be considered
because we don’t wanna be associated

00:59:41.900 --> 00:59:44.840
with that at the end of the day.
Anyone who uses Tor, who runs a relay

00:59:44.840 --> 00:59:51.250
or an exit node. And I understand it’s
a bit of a censored issue, and you don’t

00:59:51.250 --> 00:59:55.300
really have any say over whether it’s
implemented or not. But I’d like to get

00:59:55.300 --> 01:00:02.410
your opinion on the implementation
of a distributed block-deny system

01:00:02.410 --> 01:00:06.980
that would run in very much a similar way
to those of the directory authorities.

01:00:06.980 --> 01:00:08.950
I’d just like to see what
you think of that.

01:00:08.950 --> 01:00:13.200
Gareth: So you’re asking me whether I want
to support a particular blocking mechanism

01:00:13.200 --> 01:00:14.200
then?

01:00:14.200 --> 01:00:16.470
Mic5: I’d like to get your opinion on it.
<i>Gareth laughs</i>

01:00:16.470 --> 01:00:20.540
I know it’s a sensitive issue but I think,
like I said, I think something…

01:00:20.540 --> 01:00:25.700
I think it needs to be considered because
everyone running exit nodes and relays

01:00:25.700 --> 01:00:30.270
and people of the Tor Project don’t
want to be known or associated with

01:00:30.270 --> 01:00:34.790
these massive amount of abuse web sites
that currently exist within the Tor network.

01:00:34.790 --> 01:00:40.210
Gareth: I absolutely agree, and I think
the Tor Project are horrified as well that

01:00:40.210 --> 01:00:43.960
this problem exists, and they, in fact,
talked on it in previous years that

01:00:43.960 --> 01:00:48.690
they have a problem with this type of
content. I asked to what if anything is

01:00:48.690 --> 01:00:52.340
done about it, it’s very much up to them.
Could it be done in a distributed fashion?

01:00:52.340 --> 01:00:56.240
So the example I gave was a way which
it could be done by relay operators.

01:00:56.240 --> 01:00:59.770
So e.g. that would need the consensus of
a large number of relay operators to be

01:00:59.770 --> 01:01:02.890
effective. So that is done in
a distributed fashion. The question is:

01:01:02.890 --> 01:01:06.810
who gives the list of .onion addresses to
block to each of the relay operators?

01:01:06.810 --> 01:01:09.640
Clearly, the relay operators aren’t going
to collect themselves. It needs to be

01:01:09.640 --> 01:01:15.780
supplied by someone like the Tor Project,
e.g., or someone trustworthy. Yes, it can

01:01:15.780 --> 01:01:20.480
be done in a distributed fashion.
It can be done in an open fashion.

01:01:20.480 --> 01:01:21.710
Mic5: Who knows?
Gareth: Okay.

01:01:21.710 --> 01:01:23.750
Mic5: Thank you.

01:01:23.750 --> 01:01:27.260
Herald: Good. And another
question from the internet.

01:01:27.260 --> 01:01:31.210
Signal Angel: Apparently there’s an option
in the Tor client to collect statistics

01:01:31.210 --> 01:01:35.169
on hidden services. Do you know about
this, and how it relates to your research?

01:01:35.169 --> 01:01:38.551
Gareth: Yes, I believe they’re going to
be… the extent to which I know about it

01:01:38.551 --> 01:01:41.930
is they’re gonna be trying this next
month, to try and estimate how many

01:01:41.930 --> 01:01:46.490
hidden services there are. So keep
your eye on the Tor Project web site,

01:01:46.490 --> 01:01:50.340
I’m sure they’ll be publishing
their data in the coming months.

01:01:50.340 --> 01:01:55.090
Herald: And, sadly, we are running out of
time, so this will be the last question,

01:01:55.090 --> 01:01:56.980
so Microphone 4, please!

01:01:56.980 --> 01:02:01.250
Mic4: Hi, I’m just wondering if you could
sort of outline what ethical clearances

01:02:01.250 --> 01:02:04.510
you had to get from your university
to conduct this kind of research.

01:02:04.510 --> 01:02:07.260
Gareth: So we have to discuss these
types of things before undertaking

01:02:07.260 --> 01:02:11.970
any research. And we go through the steps
to make sure that we’re not e.g. storing

01:02:11.970 --> 01:02:16.370
sensitive information about particular
people. So yes, we are very mindful

01:02:16.370 --> 01:02:19.240
of that. And that’s why I made a
particular point of putting on the slides

01:02:19.240 --> 01:02:21.510
as to some of the things to consider.

01:02:21.510 --> 01:02:26.180
Mic4: So like… you outlined a potential
implementation of the traffic correlation

01:02:26.180 --> 01:02:29.500
attack. Are you saying that
you performed the attack? Or…

01:02:29.500 --> 01:02:33.180
Gareth: No, no no, absolutely not.
So the link I’m giving… absolutely not.

01:02:33.180 --> 01:02:34.849
We have not engaged in any…

01:02:34.849 --> 01:02:36.350
Mic4: It just wasn’t clear
from the slides.

01:02:36.350 --> 01:02:39.380
Gareth: I apologize. So it’s absolutely
clear on that. No, we’re not engaging

01:02:39.380 --> 01:02:42.860
in any deanonymisation research on the
Tor network. The research I showed

01:02:42.860 --> 01:02:46.079
is linked on the references, I think,
which I put at the end of the slides.

01:02:46.079 --> 01:02:52.000
You can read about it. But it’s done in
simulation. So e.g. there’s a way

01:02:52.000 --> 01:02:54.730
to do simulation of the Tor network on
a single computer. I can’t remember

01:02:54.730 --> 01:02:58.880
the name of the project, though.
Shadow! Yes, it’s a system

01:02:58.880 --> 01:03:02.170
called Shadow, we can run a large
number of Tor relays on a single computer

01:03:02.170 --> 01:03:04.579
and simulate the traffic between them.
If you’re going to do that type of research

01:03:04.579 --> 01:03:09.380
then you should use that. Okay,
thank you very much, everyone.

01:03:09.380 --> 01:03:17.985
<i>applause</i>

01:03:17.985 --> 01:03:22.071
<i>silent postroll titles</i>

01:03:22.071 --> 01:03:27.000
subtitles created by c3subtitles.de
Join, and help us!