1
00:00:00,000 --> 00:00:09,970
silent 31C3 preroll
2
00:00:09,970 --> 00:00:13,220
Dr. Gareth Owen: Hello. Can you hear me?
Yes. Okay. So my name is Gareth Owen.
3
00:00:13,220 --> 00:00:16,150
I’m from the University of Portsmouth.
I’m an academic
4
00:00:16,150 --> 00:00:19,320
and I’m going to talk to you about
an experiment that we did
5
00:00:19,320 --> 00:00:22,610
on the Tor hidden services,
trying to categorize them,
6
00:00:22,610 --> 00:00:25,230
estimate how many there were etc. etc.
7
00:00:25,230 --> 00:00:27,380
Well, as we go through the talk
I’m going to explain
8
00:00:27,380 --> 00:00:31,120
how Tor hidden services work internally,
and how the data was collected.
9
00:00:31,120 --> 00:00:35,320
And what sort of conclusions you can draw
from the data based on the way that we’ve
10
00:00:35,320 --> 00:00:39,950
collected it. Just so [that] I get
an idea: how many of you use Tor
11
00:00:39,950 --> 00:00:42,430
on a regular basis, could you
put your hand up for me?
12
00:00:42,430 --> 00:00:46,120
So quite a big number. Keep your hand
up if… or put your hand up if you’re
13
00:00:46,120 --> 00:00:48,320
a relay operator.
14
00:00:48,320 --> 00:00:51,470
Wow, that’s quite a significant number,
isn’t it? And then, put your hand up
15
00:00:51,470 --> 00:00:55,250
and/or keep it up if you
run a hidden service.
16
00:00:55,250 --> 00:00:59,530
Okay, so, a smaller number, but still
some people run hidden services.
17
00:00:59,530 --> 00:01:02,720
Okay, so, some of you may be very familiar
with the way Tor works, sort of,
18
00:01:02,720 --> 00:01:06,700
at a low level. But I am gonna go through
it for those who aren’t, so they understand
19
00:01:06,700 --> 00:01:10,380
just how they work. And as we go along,
because I’m explaining how
20
00:01:10,380 --> 00:01:14,030
the hidden services work, I’m going
to tag on information on how
21
00:01:14,030 --> 00:01:19,030
the Tor hidden services themselves can be
deanonymised and also how the users
22
00:01:19,030 --> 00:01:23,090
of those hidden services can be
deanonymised, if you put
23
00:01:23,090 --> 00:01:27,040
some strict criteria on what it is you
want to do with respect to them.
24
00:01:27,040 --> 00:01:30,920
So the things that I’m going to go over:
I wanna go over how Tor works,
25
00:01:30,920 --> 00:01:34,190
and then specifically how hidden services
work. I’m gonna talk about something
26
00:01:34,190 --> 00:01:37,889
called the “Tor Distributed Hash Table”
for hidden services. If you’ve heard
27
00:01:37,889 --> 00:01:40,560
that term and don’t know what
it means, don’t worry, I’ll explain
28
00:01:40,560 --> 00:01:44,010
what a distributed hash table is and
how it works. It’s not as complicated
29
00:01:44,010 --> 00:01:47,690
as it sounds. And then I wanna go over
Darknet data, so, data that we collected
30
00:01:47,690 --> 00:01:53,030
from Tor hidden services. And as I say,
as we go along I will sort of explain
31
00:01:53,030 --> 00:01:56,650
how you do deanonymisation of both the
services themselves and of the visitors
32
00:01:56,650 --> 00:02:02,400
to the service. And just
how complicated it is.
33
00:02:02,400 --> 00:02:07,370
So you may have seen this slide which
I think was from GCHQ, released last year
34
00:02:07,370 --> 00:02:12,099
as part of the Snowden leaks where they
said: “You can deanonymise some users
35
00:02:12,099 --> 00:02:15,560
some of the time but they’ve had
no success in deanonymising someone
36
00:02:15,560 --> 00:02:20,109
in response to a specific request.”
So, given all of you e.g., I may be able
37
00:02:20,109 --> 00:02:25,090
to deanonymise a small fraction of you
but I can’t choose precisely one person
38
00:02:25,090 --> 00:02:27,499
I want to deanonymise. That’s what
I’m gonna be explaining in relation
39
00:02:27,499 --> 00:02:30,940
to the deanonymisation attacks, how
you can deanonymise a section but
40
00:02:30,940 --> 00:02:38,629
you can’t necessarily choose which section
of the users that you will be deanonymising.
41
00:02:38,629 --> 00:02:42,740
Tor tries to solve a couple
of different problems. On one hand
42
00:02:42,740 --> 00:02:46,239
it allows you to bypass censorship. So if
you’re in a country like China, which
43
00:02:46,239 --> 00:02:51,010
blocks some types of traffic you can use
Tor to bypass their censorship blocks.
44
00:02:51,010 --> 00:02:55,541
It tries to give you privacy, so, at some
level in the network someone can’t see
45
00:02:55,541 --> 00:02:59,200
what you’re doing. And at another point
in the network people who don’t know
46
00:02:59,200 --> 00:03:02,540
who you are may nevertheless
be able to see what you’re doing.
47
00:03:02,540 --> 00:03:07,099
Now the traditional case
for this is to look at VPNs.
48
00:03:07,099 --> 00:03:10,669
With a VPN you have
sort of a single provider.
49
00:03:10,669 --> 00:03:14,689
You have lots of users connecting
to the VPN. The VPN has sort of
50
00:03:14,689 --> 00:03:18,240
a mixing effect from an outside or
a server’s point of view. And then
51
00:03:18,240 --> 00:03:22,499
out of the VPN you see requests
to Twitter, Wikipedia etc. etc.
52
00:03:22,499 --> 00:03:26,830
And if that traffic isn’t encrypted then
the VPN can also read the contents
53
00:03:26,830 --> 00:03:30,980
of the traffic. Now of course there is
a fundamental weakness with this.
54
00:03:30,980 --> 00:03:35,730
You have to trust the VPN provider: the VPN
provider knows both who you are
55
00:03:35,730 --> 00:03:39,629
and what you’re doing and can
link those two together with absolute
56
00:03:39,629 --> 00:03:43,580
certainty. So you don’t… whilst you do
get some of these properties, assuming
57
00:03:43,580 --> 00:03:48,069
you’ve got a trustworthy VPN provider
you don’t get them in the face of
58
00:03:48,069 --> 00:03:51,609
an untrustworthy VPN provider.
And of course: how do you trust the VPN
59
00:03:51,609 --> 00:03:59,319
provider? What sort of measure do
you use? That’s sort of an open question.
60
00:03:59,319 --> 00:04:03,729
So Tor tries to solve this problem
by distributing the trust. Tor is
61
00:04:03,729 --> 00:04:07,500
an open source project, so you can go
on to their Git repository, you can
62
00:04:07,500 --> 00:04:12,620
download the source code, and change it,
improve it, submit patches etc.
63
00:04:12,620 --> 00:04:17,108
As you heard earlier during Jacob and
Roger’s talk, they’re currently partly
64
00:04:17,108 --> 00:04:20,949
sponsored by the US Government which seems
a bit paradoxical, but they explained
65
00:04:20,949 --> 00:04:24,770
in that talk why that
doesn’t affect their judgment.
66
00:04:24,770 --> 00:04:28,540
And indeed, they do have some funding from
other sources, and they designed the system
67
00:04:28,540 --> 00:04:30,841
– which I’ll talk about a little bit
later – in a way where they don’t have
68
00:04:30,841 --> 00:04:34,230
to trust each other. So there’s sort of
some redundancy, and they’re trying
69
00:04:34,230 --> 00:04:39,650
to minimize these sort of trust issues
related to this. Now, Tor is
70
00:04:39,650 --> 00:04:43,310
a partially de-centralized network, which
means that it has some centralized
71
00:04:43,310 --> 00:04:47,870
components which are under the control of
the Tor Project and some de-centralized
72
00:04:47,870 --> 00:04:51,190
components which are normally the Tor
relays. If you run a relay you’re
73
00:04:51,190 --> 00:04:56,290
one of those de-centralized components.
There is, however, no single authority
74
00:04:56,290 --> 00:05:01,110
on the Tor network.
So no single server which is responsible,
75
00:05:01,110 --> 00:05:04,290
which you’re required to trust.
So the trust is somewhat distributed,
76
00:05:04,290 --> 00:05:12,000
but not entirely. When you establish
a circuit through Tor you, the user,
77
00:05:12,000 --> 00:05:15,500
download a list of all of the relays
inside the Tor network.
78
00:05:15,500 --> 00:05:19,070
And you get to pick – and I’ll tell you
how you do that – which relays
79
00:05:19,070 --> 00:05:22,750
you’re going to use to route your traffic
through. So here is a typical example:
80
00:05:22,750 --> 00:05:27,090
You’re here on the left hand side as the
user. You download a list of the relays
81
00:05:27,090 --> 00:05:32,010
inside the Tor network and you select from
that list three nodes, a guard node
82
00:05:32,010 --> 00:05:36,580
which is your entry into the Tor network,
a relay node which is a middle node.
83
00:05:36,580 --> 00:05:39,010
Essentially, it’s going to route your
traffic to a third hop. And then
84
00:05:39,010 --> 00:05:42,650
the third hop is the exit node where
your traffic essentially exits out
85
00:05:42,650 --> 00:05:46,840
on the internet. Now, looking at the
circuit. So this is a circuit through
86
00:05:46,840 --> 00:05:50,170
the Tor network through which you’re
going to route your traffic. There are
87
00:05:50,170 --> 00:05:52,540
three layers of encryption at the
beginning, so between you
88
00:05:52,540 --> 00:05:56,150
and the guard node. Your traffic
is encrypted three times.
89
00:05:56,150 --> 00:05:59,330
In the first instance it’s encrypted to the
guard, then it’s encrypted again,
90
00:05:59,330 --> 00:06:03,180
to the relay, and then encrypted
again to the exit, and as the traffic moves
91
00:06:03,180 --> 00:06:08,710
through the Tor network each of those
layers of encryption is peeled away
92
00:06:08,710 --> 00:06:17,300
from the data. The Guard here in this case
knows who you are, and the exit relay
93
00:06:17,300 --> 00:06:21,590
knows what you’re doing but neither know
both. And the middle relay doesn’t really
94
00:06:21,590 --> 00:06:26,710
know a lot, except for which relay is
her guard and which relay is her exit.
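[The layered encryption just described can be sketched with a toy model. XOR stands in for Tor's real per-hop cipher (Tor actually uses AES-CTR inside TLS), and the hop names and keys are invented:]

```python
# Toy model of onion routing's layered encryption.
# XOR with a short key stands in for the real per-hop cipher;
# names and keys are made up for illustration.

def xor_layer(data: bytes, key: bytes) -> bytes:
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

# The client shares one key with each hop in the circuit.
keys = {"guard": b"g", "middle": b"m", "exit": b"x"}

payload = b"GET / HTTP/1.1"

# The client wraps the payload once per hop, innermost layer for the exit.
cell = payload
for hop in ["exit", "middle", "guard"]:
    cell = xor_layer(cell, keys[hop])

# Each relay peels exactly one layer as the cell moves along the circuit.
after_guard = xor_layer(cell, keys["guard"])        # guard still sees ciphertext
after_middle = xor_layer(after_guard, keys["middle"])
after_exit = xor_layer(after_middle, keys["exit"])  # plaintext leaves the exit

assert after_guard != payload and after_exit == payload
```

[So the guard learns your address but sees only ciphertext, while the exit sees the plaintext but not your address.]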
95
00:06:26,710 --> 00:06:31,870
Who runs an exit relay? So if you run
an exit relay all of the traffic which
96
00:06:31,870 --> 00:06:36,210
users are sending out on the internet
appears to come from your IP address.
97
00:06:36,210 --> 00:06:41,360
So running an exit relay is potentially
risky because someone may do something
98
00:06:41,360 --> 00:06:45,590
through your relay which attracts attention.
And then, when law enforcement
99
00:06:45,590 --> 00:06:48,940
traces that back to an IP address it’s
going to come back to your address.
100
00:06:48,940 --> 00:06:51,790
So some relay operators have had trouble
with this, with law enforcement coming
101
00:06:51,790 --> 00:06:55,360
to them, and saying: “Hey we got this
traffic coming through your IP address
102
00:06:55,360 --> 00:06:57,950
and you have to go and explain it.”
So if you want to run an exit relay
103
00:06:57,950 --> 00:07:01,400
it’s a little bit risky, but we’re thankful
for those people that do run exit relays
104
00:07:01,400 --> 00:07:04,870
because ultimately if people didn’t run
an exit relay you wouldn’t be able
105
00:07:04,870 --> 00:07:08,000
to get out of the Tor network, and it
wouldn’t be terribly useful from this
106
00:07:08,000 --> 00:07:20,560
point of view. So, yes.
applause
107
00:07:20,560 --> 00:07:24,610
So every Tor relay, when you set up
a Tor relay you publish something called
108
00:07:24,610 --> 00:07:28,780
a descriptor which describes your Tor
relay and how to use it to a set
109
00:07:28,780 --> 00:07:33,430
of servers called the authorities. And the
trust in the Tor network is essentially
110
00:07:33,430 --> 00:07:38,610
split across these authorities. They’re run
by the core Tor Project members.
111
00:07:38,610 --> 00:07:42,639
And they maintain a list of all of the
relays in the network. And they observe
112
00:07:42,639 --> 00:07:46,010
them over a period of time. If the relays
exhibit certain properties they give
113
00:07:46,010 --> 00:07:50,480
the relays flags. If e.g. a relay allows
traffic to exit from the Tor network
114
00:07:50,480 --> 00:07:54,450
it will get the ‘Exit’ flag. If they’ve been
switched on for a certain period of time,
115
00:07:54,450 --> 00:07:58,400
or for a certain amount of traffic they’ll
be allowed to become the guard relay
116
00:07:58,400 --> 00:08:02,180
which is the first node in your circuit.
So when you build your circuit you
117
00:08:02,180 --> 00:08:07,230
download a list of these descriptors from
one of the Directory Authorities. You look
118
00:08:07,230 --> 00:08:10,120
at the flags which have been assigned to
each of the relays, and then you pick
119
00:08:10,120 --> 00:08:14,150
your route based on that. So you’ll pick
the guard node from a set of relays
120
00:08:14,150 --> 00:08:16,400
which have the ‘Guard’ flag, your exits
from the set of relays which have
121
00:08:16,400 --> 00:08:20,860
the ‘Exit’ flag etc. etc. Now, as of
a quick count this morning there are
122
00:08:20,860 --> 00:08:29,229
about 1500 guard relays, around 1000 exit
relays, and six relays flagged as ‘bad’ exits.
123
00:08:29,229 --> 00:08:34,360
What does a ‘bad exit’ mean?
waits for audience to respond
124
00:08:34,360 --> 00:08:37,759
That’s not good! That’s exactly
what it means! Yes! laughs
125
00:08:37,759 --> 00:08:40,450
applause
126
00:08:40,450 --> 00:08:45,569
So relays which have been flagged as ‘bad
exits’ your client will never choose to exit
127
00:08:45,569 --> 00:08:50,660
traffic through. And examples of things
which may get a relay flagged as an
128
00:08:50,660 --> 00:08:53,829
[bad] exit relay – if they’re fiddling with
the traffic which is coming out of
129
00:08:53,829 --> 00:08:57,019
the Tor relay. Or doing things like
man-in-the-middle attacks against
130
00:08:57,019 --> 00:09:01,629
SSL traffic. We’ve seen various things,
there have been relays man-in-the-middling
131
00:09:01,629 --> 00:09:07,050
SSL traffic, there has very recently
been an exit relay which was patching
132
00:09:07,050 --> 00:09:10,800
binaries that you downloaded from the
internet, inserting malware into the binaries.
133
00:09:10,800 --> 00:09:14,630
So you can do these things but the Tor
Project tries to scan for them. And if
134
00:09:14,630 --> 00:09:19,829
these things are detected then they’ll be
flagged as ‘Bad Exits’. It’s true to say
135
00:09:19,829 --> 00:09:24,610
that the scanning mechanism is not 100%
fool-proof by any stretch of the imagination.
136
00:09:24,610 --> 00:09:28,559
It tries to pick up common types
of attacks, so as a result
137
00:09:28,559 --> 00:09:32,480
it won’t pick up unknown attacks or
attacks which haven’t been seen or
138
00:09:32,480 --> 00:09:36,680
have not been known about beforehand.
139
00:09:36,680 --> 00:09:45,370
So looking at this, how do you deanonymise
the traffic travelling through the Tor
140
00:09:45,370 --> 00:09:49,449
network? Given some traffic coming out
of the exit relay, how do you know
141
00:09:49,449 --> 00:09:54,269
which user that corresponds to? What is
their IP address? You can’t actually
142
00:09:54,269 --> 00:09:58,279
modify the traffic because if any of the
relays tried to modify the traffic
143
00:09:58,279 --> 00:10:02,249
which they’re sending through the network
Tor will tear down the circuit through the relay.
144
00:10:02,249 --> 00:10:06,290
So there are these integrity checks at each
of the hops. And if you try to sort of
145
00:10:06,290 --> 00:10:09,870
– because you can’t decrypt the packet
you can’t modify it in any meaningful way,
146
00:10:09,870 --> 00:10:13,749
and because there’s an integrity check
at the next hop that means that you can’t
147
00:10:13,749 --> 00:10:17,019
modify the packet without being
detected. So you can’t place this sort of
148
00:10:17,019 --> 00:10:20,900
marker, and try and follow the marker
through the network. So instead
149
00:10:20,900 --> 00:10:26,699
what you can do if you control… so let me
give you two cases. In the worst case
150
00:10:26,699 --> 00:10:31,330
if the attacker controls all three of your
relays that you pick, which is an unlikely
151
00:10:31,330 --> 00:10:34,739
scenario, as that requires controlling quite
a big proportion of the network. Then
152
00:10:34,739 --> 00:10:39,550
it should be quite obvious that they can
work out who you are and also
153
00:10:39,550 --> 00:10:42,369
see what you’re doing because in that
case they can tag the traffic, and
154
00:10:42,369 --> 00:10:45,709
they can just discard these integrity
checks at each of the following hops.
155
00:10:45,709 --> 00:10:50,709
Now in a different case, if you control
the Guard relay and the exit relay
156
00:10:50,709 --> 00:10:54,160
but not the middle relay the Guard relay
can’t tamper with the traffic because
157
00:10:54,160 --> 00:10:57,660
this middle relay will close down the
circuit as soon as it happens.
158
00:10:57,660 --> 00:11:01,130
The exit relay can’t send stuff back down
the circuit to try and identify the user,
159
00:11:01,130 --> 00:11:05,030
either. Because again, the circuit will be
closed down. So what can you do?
160
00:11:05,030 --> 00:11:09,869
Well, you can count the number of packets
going through the Guard node. And you can
161
00:11:09,869 --> 00:11:14,690
measure the timing differences between
packets, and try and spot that pattern
162
00:11:14,690 --> 00:11:18,750
at the Exit relay. You’re looking at counts of
packets and the timing between those
163
00:11:18,750 --> 00:11:22,360
packets which are being sent, and
essentially trying to correlate them all.
164
00:11:22,360 --> 00:11:26,869
So if your user happens to pick you as
their Guard node, and then happens to pick
165
00:11:26,869 --> 00:11:31,850
your exit relay, then you can deanonymise
them with very high probability using
166
00:11:31,850 --> 00:11:35,649
this technique. You’re just correlating
the timings of packets and counting
167
00:11:35,649 --> 00:11:38,889
the number of packets going through.
And the attacks demonstrated in literature
168
00:11:38,889 --> 00:11:44,509
are very reliable for this. We heard
earlier from the Tor talk about the “relay
169
00:11:44,509 --> 00:11:50,739
early” tag which was the attack discovered
by the CERT researchers in the US.
170
00:11:50,739 --> 00:11:55,050
That attack didn’t rely on timing attacks.
Instead, what they were able to do was
171
00:11:55,050 --> 00:11:58,720
send a special type of cell containing
the data back down the circuit,
172
00:11:58,720 --> 00:12:01,889
essentially marking this data, and saying:
“This is the data we’re seeing
173
00:12:01,889 --> 00:12:06,149
at the Exit relay, or at the hidden
service”, and encoding into the messages
174
00:12:06,149 --> 00:12:10,049
travelling back down the circuit, what the
data was. And then you could pick
175
00:12:10,049 --> 00:12:14,269
those up at the Guard relay and tell
whether it’s this person that’s doing that.
176
00:12:14,269 --> 00:12:18,370
In fact, although this technique works,
and yeah it was a very nice attack,
177
00:12:18,370 --> 00:12:21,269
the traffic correlation attacks are
actually just as powerful.
178
00:12:21,269 --> 00:12:25,259
So although this bug has been fixed traffic
correlation attacks still work and are
179
00:12:25,259 --> 00:12:29,739
still fairly reliable. So the problem
still does exist. This is very much
180
00:12:29,739 --> 00:12:33,399
an open question. How do we solve this
problem? We don’t know, currently,
181
00:12:33,399 --> 00:12:40,040
how to solve this problem of trying
to tackle the traffic correlation.
182
00:12:40,040 --> 00:12:45,369
There are a couple of solutions.
But they’re not particularly…
183
00:12:45,369 --> 00:12:48,569
they’re not particularly reliable. Let me
just go through these, and I’ll skip back
184
00:12:48,569 --> 00:12:53,061
on the few things I’ve missed. The first
thing is, high-latency networks, so
185
00:12:53,061 --> 00:12:56,999
networks where packets are delayed
in their transit through the network.
186
00:12:56,999 --> 00:13:00,740
That throws away a lot of the timing
information. So they promise
187
00:13:00,740 --> 00:13:03,800
to potentially solve this problem.
But of course, if you want to visit
188
00:13:03,800 --> 00:13:06,779
Google’s home page, and you have to wait
five minutes for it, you’re simply
189
00:13:06,779 --> 00:13:11,910
just not going to use Tor. The whole point
is trying to make this technology usable.
190
00:13:11,910 --> 00:13:14,759
And if you got something which is very,
very slow then it doesn’t make it
191
00:13:14,759 --> 00:13:18,269
attractive to use. But of course,
this case does work slightly better
192
00:13:18,269 --> 00:13:22,059
for e-mail. If you think about it with
e-mail, you don’t mind if your e-mail
193
00:13:22,059 --> 00:13:25,399
– well, you may not mind, you may mind –
you don’t mind if your e-mail is delayed
194
00:13:25,399 --> 00:13:29,120
by some period of time. Which makes this
somewhat difficult. And as Roger said
195
00:13:29,120 --> 00:13:35,130
earlier, you can also introduce padding
into the circuit, so these are dummy cells.
196
00:13:35,130 --> 00:13:39,839
But, but… with a big caveat: some of the
research suggests that actually you’d
197
00:13:39,839 --> 00:13:43,439
need to introduce quite a lot of padding
to defeat these attacks, and that would
198
00:13:43,439 --> 00:13:47,179
overload the Tor network in its current
state. So, again, not a particular
199
00:13:47,179 --> 00:13:53,860
practical solution.
200
00:13:53,860 --> 00:13:58,279
How does Tor try to solve this problem?
Well, Tor makes it very difficult
201
00:13:58,279 --> 00:14:03,171
to become a user’s Guard relay. If you
can’t become a user’s Guard relay
202
00:14:03,171 --> 00:14:07,839
then you don’t know who the user is, quite
simply. And so by making it very hard
203
00:14:07,839 --> 00:14:13,249
to become the Guard relay, an attacker
can’t do this traffic correlation attack.
204
00:14:13,249 --> 00:14:17,579
So at the moment the Tor client chooses
one Guard relay and keeps it for a period
205
00:14:17,579 --> 00:14:22,259
of time. So if I want to sort of target
just one of you I would need to control
206
00:14:22,259 --> 00:14:26,259
the Guard relay that you were using at
that particular point in time. And in fact
207
00:14:26,259 --> 00:14:30,679
I’d also need to know what that Guard
relay is. So by making it very unlikely
208
00:14:30,679 --> 00:14:34,129
that you would select a particular malicious
Guard relay, where the number of malicious
209
00:14:34,129 --> 00:14:39,179
Guard relays is very small, that’s how Tor
tries to solve this problem. And
210
00:14:39,179 --> 00:14:43,280
at the moment your Guard relay is your
barrier of security. If the attacker can’t
211
00:14:43,280 --> 00:14:46,460
control the Guard relay then they won’t
know who you are. That doesn’t mean
212
00:14:46,460 --> 00:14:50,639
they can’t try other sort of side channel
attacks by messing with the traffic
213
00:14:50,639 --> 00:14:55,129
at the Exit relay etc. You know that you
may sort of e.g. download dodgy documents
214
00:14:55,129 --> 00:14:59,499
and open one on your computer, and those
sort of things. Now the alternative
215
00:14:59,499 --> 00:15:02,769
of course to having a Guard relay
and keeping it for a very long time
216
00:15:02,769 --> 00:15:06,029
will be to have a Guard relay and
to change it on a regular basis.
217
00:15:06,029 --> 00:15:09,929
Because you might think, well, just choosing
one Guard relay and sticking with it
218
00:15:09,929 --> 00:15:13,399
is probably a bad idea. But actually,
that’s not the case. If you pick
219
00:15:13,399 --> 00:15:18,370
the Guard relay, and assuming that the
chance of picking a Guard relay that is
220
00:15:18,370 --> 00:15:22,800
malicious is very low, then, when you
first use your Guard relay, if you’ve made
221
00:15:22,800 --> 00:15:27,420
a good choice, then your traffic is safe.
If you haven’t made a good choice then
222
00:15:27,420 --> 00:15:31,759
your traffic isn’t safe. Whereas if your
Tor client chooses a Guard relay
223
00:15:31,759 --> 00:15:35,610
every few minutes, or every hour, or
something along those lines, at some point
224
00:15:35,610 --> 00:15:39,179
you’re gonna pick a malicious Guard relay.
So they’re gonna have some of your traffic
225
00:15:39,179 --> 00:15:43,399
but not all of it. And so currently the
trade-off is that we make it very difficult
226
00:15:43,399 --> 00:15:48,490
for an attacker to control a Guard relay
and the user picks a Guard relay and
227
00:15:48,490 --> 00:15:52,449
keeps it for a long period of time. And
so it’s very difficult for the attackers
228
00:15:52,449 --> 00:15:58,939
to pick that Guard relay when they control
a very small proportion of the network.
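[The trade-off argued above can be checked with a quick simulation, under the made-up assumption that 1% of guard capacity is malicious:]

```python
import random

random.seed(31337)
P_BAD = 0.01     # assumed fraction of guard capacity that is malicious
ROTATIONS = 365  # e.g. a client that picks a fresh guard daily for a year

def ever_compromised(rotations: int) -> bool:
    # True if any of the guards picked over the period was malicious.
    return any(random.random() < P_BAD for _ in range(rotations))

trials = 10_000
sticky = sum(ever_compromised(1) for _ in range(trials)) / trials
rotating = sum(ever_compromised(ROTATIONS) for _ in range(trials)) / trials

# With one long-lived guard, only ~1% of users are ever exposed;
# rotating daily, almost every user is exposed at some point.
assert sticky < 0.05 < rotating
```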
229
00:15:58,939 --> 00:16:06,420
So this, currently, provides those
properties I described earlier, the privacy
230
00:16:06,420 --> 00:16:11,410
and the anonymity when you’re browsing the
web, when you’re accessing websites etc.
231
00:16:11,410 --> 00:16:16,519
But still you know who the website is. So
although you’re anonymous and the website
232
00:16:16,519 --> 00:16:20,730
doesn’t know who you are you know who the
website is. And there may be some cases
233
00:16:20,730 --> 00:16:25,499
where e.g. the website would also wish to
remain anonymous. You want the person
234
00:16:25,499 --> 00:16:29,970
accessing the website and the website
itself to be anonymous to each other.
235
00:16:29,970 --> 00:16:34,230
And you could think about people e.g.
being in countries where running
236
00:16:34,230 --> 00:16:39,730
a political blog e.g. might be a dangerous
activity. If you run that on a regular
237
00:16:39,730 --> 00:16:45,660
webserver you’re easily identified, whereas
if you’ve got some way where you as
238
00:16:45,660 --> 00:16:49,490
the webserver can be anonymous then
that allows you to do that activity without
239
00:16:49,490 --> 00:16:57,480
being targeted by your government. So
this is what hidden services try to solve.
240
00:16:57,480 --> 00:17:03,080
Now when you first think about the problem
you kind of think: “Hang on a second,
241
00:17:03,080 --> 00:17:06,429
the user doesn’t know who the website
is and the website doesn’t know
242
00:17:06,429 --> 00:17:09,890
who the user is. So how on earth do they
talk to each other?” Well, that’s essentially
243
00:17:09,890 --> 00:17:14,220
what the Tor hidden service protocol tries
to sort of set up. How do you identify and
244
00:17:14,220 --> 00:17:19,579
connect to each other. So at the moment
this is what happens: We’ve got Bob
245
00:17:19,579 --> 00:17:23,780
on the [right] hand side who is the hidden
service. And we got Alice on the left hand
246
00:17:23,780 --> 00:17:28,620
side here who is the user who wishes to
visit the hidden service. Now when Bob
247
00:17:28,620 --> 00:17:34,190
sets up his hidden service he picks three
nodes in the Tor network as introduction
248
00:17:34,190 --> 00:17:38,831
points and builds multi-hop circuits to
them. So the introduction points don’t know
249
00:17:38,831 --> 00:17:44,680
who Bob is. Bob has circuits to them. And
Bob says to each of these introduction points
250
00:17:44,680 --> 00:17:48,240
“Will you relay traffic to me if someone
connects to you asking for me?”
251
00:17:48,240 --> 00:17:53,030
And then those introduction points
do that. So then, once Bob has picked
252
00:17:53,030 --> 00:17:56,840
his introduction points he publishes
a descriptor describing the list of his
253
00:17:56,840 --> 00:18:01,310
introduction points for someone who wishes
to visit his website. And then Alice
254
00:18:01,310 --> 00:18:06,700
on the left hand side wishing to visit Bob
will pick a rendezvous point in the network
255
00:18:06,700 --> 00:18:10,030
and build a circuit to it. So this “RP”
here is the rendezvous point.
256
00:18:10,030 --> 00:18:14,530
And she will relay a message via one of
the introduction points saying to Bob:
257
00:18:14,530 --> 00:18:18,290
“Meet me at the rendezvous point”.
And then Bob will build a 3-hop-circuit
258
00:18:18,290 --> 00:18:22,870
to the rendezvous point. So now at this
stage we got Alice with a multi-hop circuit
259
00:18:22,870 --> 00:18:26,890
to the rendezvous point, and Bob with
a multi-hop circuit to the rendezvous point.
260
00:18:26,890 --> 00:18:32,550
Alice and Bob haven’t connected to one
another directly. The rendezvous point
261
00:18:32,550 --> 00:18:36,530
doesn’t know who Bob is, the rendezvous
point doesn’t know who Alice is.
262
00:18:36,530 --> 00:18:40,261
All they’re doing is forwarding the
traffic. And they can’t inspect the traffic,
263
00:18:40,261 --> 00:18:43,740
either, because the traffic itself
is encrypted.
264
00:18:43,740 --> 00:18:47,530
So that’s currently how you solve this
problem with trying to communicate
265
00:18:47,530 --> 00:18:50,820
with someone who you don’t know
who they are and vice versa.
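[The whole dance condenses into a sketch like this. The helper names are invented; the real protocol exchanges ESTABLISH_INTRO, INTRODUCE1 and RENDEZVOUS1 cells, and every step runs over a multi-hop circuit:]

```python
# Condensed sketch of the hidden-service rendezvous dance.
# Every message below travels over a multi-hop Tor circuit, so no
# party learns the IP address of Bob (the service) or Alice (the user).

log = []

def bob_setup(intro_points):
    for ip in intro_points:
        log.append(f"bob -> {ip}: relay requests for me?")  # ESTABLISH_INTRO
    log.append("bob -> dht: descriptor listing " + ",".join(intro_points))

def alice_visit(intro_point, rendezvous_point):
    log.append(f"alice -> {rendezvous_point}: wait here")   # ESTABLISH_RENDEZVOUS
    log.append(f"alice -> {intro_point}: tell bob to meet me at {rendezvous_point}")
    # Bob, on receiving the introduction, completes the connection:
    log.append(f"bob -> {rendezvous_point}: joining alice") # RENDEZVOUS1

bob_setup(["ip1", "ip2", "ip3"])
alice_visit("ip2", "rp")
assert log[-1] == "bob -> rp: joining alice"
```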
266
00:18:50,820 --> 00:18:55,740
drinks from the bottle
267
00:18:55,740 --> 00:18:58,870
The principal thing I’m going to talk
about today is this database.
268
00:18:58,870 --> 00:19:01,990
So I said, Bob, when he picks his
introduction points he builds this thing
269
00:19:01,990 --> 00:19:06,080
called a descriptor, describing who his
introduction points are, and he publishes
270
00:19:06,080 --> 00:19:10,390
them to a database. This database itself
is distributed throughout the Tor network.
271
00:19:10,390 --> 00:19:17,860
It’s not a single server. So both, Bob and
Alice need to be able to publish information
272
00:19:17,860 --> 00:19:22,040
to this database, and also retrieve
information from this database. And Tor
273
00:19:22,040 --> 00:19:24,820
currently uses something called
a distributed hash table, which I’m gonna
274
00:19:24,820 --> 00:19:27,930
give an example of what this means and
how it works. And then I’ll talk to you
275
00:19:27,930 --> 00:19:34,380
specifically how the Tor Distributed Hash
Table works itself. So let’s say e.g.
276
00:19:34,380 --> 00:19:39,830
you've got a set of servers. So here we've
got 26 servers and you’d like to store
277
00:19:39,830 --> 00:19:44,240
your files across these different servers
without having a single server responsible
278
00:19:44,240 --> 00:19:48,050
for deciding, “okay, that file is stored
on that server, and this file is stored
279
00:19:48,050 --> 00:19:53,050
on that server” etc. etc. Now here is my
list of files. You could take a very naive
280
00:19:53,050 --> 00:19:57,740
approach. And you could say: “Okay, I’ve
got 26 servers and all of these file names
281
00:19:57,740 --> 00:20:01,250
start with a letter of the alphabet.”
And I could say: “All of the files that begin
282
00:20:01,250 --> 00:20:05,450
with A are gonna go on server A; all of
the files that begin with B are gonna go
283
00:20:05,450 --> 00:20:09,900
on server B etc.” And then when you want
to retrieve a file you say: “Okay, what
284
00:20:09,900 --> 00:20:13,950
does my file name begin with?” And then
you know which server it’s stored on.
285
00:20:13,950 --> 00:20:17,750
Now of course you could have a lot of
servers – sorry – a lot of files
286
00:20:17,750 --> 00:20:22,780
which begin with a Z, an X or a Y etc. in
which case you’re gonna overload
287
00:20:22,780 --> 00:20:27,310
that server. You’re gonna have more files
stored on one server than on another server
288
00:20:27,310 --> 00:20:32,150
in your set. And if you have a lot of big
files, say, beginning with B, then
289
00:20:32,150 --> 00:20:35,520
rather than distributing your files across
all the servers you’re gonna just be
290
00:20:35,520 --> 00:20:39,060
overloading one or two of them. So to
solve this problem what we tend to do is:
291
00:20:39,060 --> 00:20:42,410
we take the file name, and we run it
through a cryptographic hash function.
292
00:20:42,410 --> 00:20:46,930
A hash function produces output which
looks random: very small changes
293
00:20:46,930 --> 00:20:50,740
in the input to a cryptographic hash
function produce a very large change
294
00:20:50,740 --> 00:20:55,240
in the output. And this change looks
random. So if I take all of my file names
295
00:20:55,240 --> 00:20:59,820
here, and assuming I have a lot more,
I take a hash of them, and then I use
296
00:20:59,820 --> 00:21:05,470
that hash to determine which server to
store the file on. Then, with high probability
297
00:21:05,470 --> 00:21:09,670
my files will be distributed evenly across
all of the servers. And then when I want
298
00:21:09,670 --> 00:21:12,990
to go and retrieve one of the files I take
my file name, I run it through the
299
00:21:12,990 --> 00:21:15,980
cryptographic hash function, that gives me
the hash, and then I use that hash
300
00:21:15,980 --> 00:21:19,740
to identify which server that particular
file is stored on. And then I go and
301
00:21:19,740 --> 00:21:25,990
retrieve it. So that’s the sort of a loose
idea of how a distributed hash table works.
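The hash-then-place scheme just described can be sketched in a few lines of Python. The file name and server count below are arbitrary illustrations, not anything from the talk:

```python
import hashlib

def server_for(filename: str, num_servers: int) -> int:
    """Map a file name to a server index via a cryptographic hash.

    Because the digest is effectively uniform, files spread evenly
    across servers no matter how skewed the file names are.
    """
    digest = hashlib.sha256(filename.encode()).digest()
    # Treat the digest as a big integer and reduce it modulo the
    # number of servers (a naive stand-in for placement on a ring).
    return int.from_bytes(digest, "big") % num_servers

# Retrieval repeats the same computation, so the same name always
# points at the same server.
print(server_for("holiday-photos.zip", 26))
```

Storing and looking up use the identical mapping, which is why no central index is needed.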
302
00:21:25,990 --> 00:21:29,340
There are a couple of problems with this.
What if you got a changing size, what
303
00:21:29,340 --> 00:21:34,700
if the number of servers you’ve got changes
over time, as it does in the Tor network?
304
00:21:34,700 --> 00:21:42,290
That’s a very brief overview of the theory.
So how does it apply for the Tor network?
305
00:21:42,290 --> 00:21:47,640
Well, the Tor network has a set of relays
and it has a set of hidden services.
306
00:21:47,640 --> 00:21:52,710
Now we take all of the relays, and they
have a hash identity which identifies them.
307
00:21:52,710 --> 00:21:57,460
And we map them onto a circle using that
hash value as an identifier. So you can
308
00:21:57,460 --> 00:22:03,230
imagine the hash value ranging from zero
to a very large number. We’ve got a zero point
309
00:22:03,230 --> 00:22:07,280
at the very top there. And that runs all
the way round to the very large number.
310
00:22:07,280 --> 00:22:12,130
So given the identity hash for a relay we
can map that to a particular point on
311
00:22:12,130 --> 00:22:19,070
the circle. And then all we have to do
is also do this for hidden services.
312
00:22:19,070 --> 00:22:22,320
So there’s a hidden service address,
something.onion, so this is
313
00:22:22,320 --> 00:22:27,750
one of the hidden websites that you might
visit. You take the – I’m not gonna describe
314
00:22:27,750 --> 00:22:33,980
in too much detail how this is done but –
the value is derived in such a way that
315
00:22:33,980 --> 00:22:38,020
it’s evenly distributed about the circle.
So your hidden service will have
316
00:22:38,020 --> 00:22:44,240
a particular point on the circle. And the
relays will also be mapped onto this circle.
317
00:22:44,240 --> 00:22:49,640
So there’s the relays. And the hidden
service. And in the case of Tor
318
00:22:49,640 --> 00:22:53,460
the hidden service actually maps to two
positions on the circle, and it publishes
319
00:22:53,460 --> 00:22:57,850
its descriptor to the three relays to the
right at one position, and the three relays
320
00:22:57,850 --> 00:23:01,600
to the right at another position. So there
are actually in total six places where
321
00:23:01,600 --> 00:23:05,060
this descriptor is published on the
circle. And then if I want to go and
322
00:23:05,060 --> 00:23:09,450
fetch and connect to a hidden service
I go and pull the hidden service descriptor
323
00:23:09,450 --> 00:23:13,780
down to identify what its introduction
points are. I take the hidden service
324
00:23:13,780 --> 00:23:17,200
address, I find out where it is on the
circle, I map all of the relays onto
325
00:23:17,200 --> 00:23:21,110
the circle, and then I identify which
relays on the circle are responsible
326
00:23:21,110 --> 00:23:24,031
for that particular hidden service. And
I just connect, then I say: “Do you have
327
00:23:24,031 --> 00:23:26,630
a copy of the descriptor for that
particular hidden service?”
328
00:23:26,630 --> 00:23:29,620
And if so then we’ve got our list of
introduction points. And we can go
329
00:23:29,620 --> 00:23:38,020
to the next steps to connect to our hidden
service. So I’m gonna explain how we
330
00:23:38,020 --> 00:23:41,320
sort of set up our experiments. What we
thought, or what we were interested to do,
331
00:23:41,320 --> 00:23:48,181
was collect publications of hidden
services. So every time a hidden service
332
00:23:48,181 --> 00:23:51,520
gets set up it publishes to this distributed
hash table. What we wanted to do was
333
00:23:51,520 --> 00:23:55,750
collect those publications so that we
get a complete list of all of the hidden
334
00:23:55,750 --> 00:23:59,280
services. And what we also wanted to do
is to find out how many times a particular
335
00:23:59,280 --> 00:24:06,300
hidden service is requested.
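The counting side of the experiment can be sketched very simply: each relay logs the addresses it is asked for, and a tally over those logs gives the popularity ranking. The onion addresses below are invented for illustration:

```python
from collections import Counter

# Hypothetical stream of descriptor requests seen at our relays;
# these onion addresses are made up for illustration.
observed = [
    "aaaaaaaaaaaaaaaa.onion",
    "bbbbbbbbbbbbbbbb.onion",
    "aaaaaaaaaaaaaaaa.onion",
    "cccccccccccccccc.onion",
]

# Publications tell us that a service exists; request counts tell
# us how often each one is asked for.
hits = Counter(observed)
print(hits.most_common(2))
```

Ranking by `most_common` is exactly the "which addresses got the most directory requests" question asked later in the talk.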
336
00:24:06,300 --> 00:24:10,540
Just one more point that
will become important later.
337
00:24:10,540 --> 00:24:14,230
The position at which the hidden service
appears on the circle changes
338
00:24:14,230 --> 00:24:18,950
every 24 hours. So there’s not
a fixed position every single day.
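A heavily simplified sketch of why the position rotates: in the v2 scheme the descriptor identifier mixes in the current day number, so the point on the ring moves every 24 hours. The real computation also involves a per-service time offset and an optional descriptor cookie, which are omitted here, and the service identifier below is a placeholder:

```python
import hashlib

def descriptor_id(service_id: bytes, unix_time: int, replica: int) -> bytes:
    """Simplified sketch: the DHT position depends on the day
    number, so it changes every 24 hours. (The real v2 scheme also
    mixes in a per-service offset and an optional cookie.)"""
    time_period = unix_time // 86400  # day number: changes daily
    secret = hashlib.sha1(
        time_period.to_bytes(4, "big") + bytes([replica])
    ).digest()
    return hashlib.sha1(service_id + secret).digest()

day = 1_419_811_200  # some moment in late December 2014
# Two replicas give two ring positions; with three relays to the
# right of each, six directories are responsible on a given day.
p0 = descriptor_id(b"exampleonionaddr", day, 0)
p1 = descriptor_id(b"exampleonionaddr", day, 1)
assert p0 != p1                                            # two positions
assert p0 != descriptor_id(b"exampleonionaddr", day + 86400, 0)  # moves daily
```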
339
00:24:18,950 --> 00:24:24,370
If we run 40 nodes over a long period of
time we will occupy positions within
340
00:24:24,370 --> 00:24:29,570
that distributed hash table. And we will be
able to collect publications and requests
341
00:24:29,570 --> 00:24:34,300
for hidden services that are located at
that position inside the distributed
342
00:24:34,300 --> 00:24:39,251
hash table. So in that case we ran 40 Tor
nodes, we had a student at university
343
00:24:39,251 --> 00:24:43,950
who said: “Hey, I run a hosting company,
I got loads of server capacity”, and
344
00:24:43,950 --> 00:24:46,580
we told him what we were doing, and he
said: “Well, you really helped us out,
345
00:24:46,580 --> 00:24:49,820
these last couple of years…”
and just gave us loads of server capacity
346
00:24:49,820 --> 00:24:55,500
to allow us to do this. So we spun up 40
Tor nodes. Each Tor node was required
347
00:24:55,500 --> 00:24:59,560
to advertise a certain amount of bandwidth
to become a part of that distributed
348
00:24:59,560 --> 00:25:02,200
hash table. It’s actually a very small
amount, so this didn’t matter too much.
349
00:25:02,200 --> 00:25:06,050
And then, after – this has changed
recently in the last few days,
350
00:25:06,050 --> 00:25:10,070
it used to be 25 hours, it’s just been
increased as a result of one of the
351
00:25:10,070 --> 00:25:14,570
attacks last week. But here… certainly
during our study it was 25 hours. You then
352
00:25:14,570 --> 00:25:18,300
appear at a particular point inside that
distributed hash table. And you’re then
353
00:25:18,300 --> 00:25:22,750
in a position to record publications of
hidden services and requests for hidden
354
00:25:22,750 --> 00:25:27,810
services. So not only can you get a full
list of the onion addresses you can also
355
00:25:27,810 --> 00:25:32,250
find out how many times each of the
onion addresses are requested.
356
00:25:32,250 --> 00:25:38,270
And so this is what we recorded. And then,
once we had a full list of… or once
357
00:25:38,270 --> 00:25:41,830
we had run for a long period of time to
collect a long list of .onion addresses
358
00:25:41,830 --> 00:25:46,850
we then built a custom crawler that would
visit each of the Tor hidden services
359
00:25:46,850 --> 00:25:51,450
in turn, and pull down the HTML contents,
the text content from the web page,
360
00:25:51,450 --> 00:25:54,760
so that we could go ahead and classify
the content. Now it’s really important
361
00:25:54,760 --> 00:25:59,250
to know here, and it will become obvious
why a little bit later, we only pulled down
362
00:25:59,250 --> 00:26:03,030
HTML content. We didn’t pull out images.
And there’s a very, very important reason
363
00:26:03,030 --> 00:26:09,980
for that which will become clear shortly.
364
00:26:09,980 --> 00:26:13,520
We had a lot of questions when we
first started this. No one really knew
365
00:26:13,520 --> 00:26:18,000
how many hidden services there were. It had
been suggested to us there was a very high
366
00:26:18,000 --> 00:26:21,250
turnover of hidden services. We wanted
to confirm whether that was true or not.
367
00:26:21,250 --> 00:26:24,530
And we also wanted to know:
what are the hidden services,
368
00:26:24,530 --> 00:26:30,140
how popular are they, etc. etc. etc. So
our estimate for how many hidden services
369
00:26:30,140 --> 00:26:34,770
there are, over the period in which we
ran our study: this is a graph plotting
370
00:26:34,770 --> 00:26:38,560
our estimate for each of the individual
days as to how many hidden services
371
00:26:38,560 --> 00:26:44,850
there were on that particular day. Now the
data is naturally noisy because we’re only
372
00:26:44,850 --> 00:26:48,590
a very small proportion of that circle.
So we’re only observing a very small
373
00:26:48,590 --> 00:26:53,250
proportion of the total publications and
requests every single day, for each of
374
00:26:53,250 --> 00:26:57,260
those hidden services. And if you
take a long term average for this
375
00:26:57,260 --> 00:27:02,720
there’s about 45,000 hidden services that
we think were present, on average,
376
00:27:02,720 --> 00:27:07,880
each day, during our entire study. Which
is a large number of hidden services.
377
00:27:07,880 --> 00:27:11,070
But over the entire length we
collected about 80,000 in total.
378
00:27:11,070 --> 00:27:14,270
Some came and went etc.
So the next question after how many
379
00:27:14,270 --> 00:27:17,750
hidden services there are is how long
a hidden service exists for.
380
00:27:17,750 --> 00:27:20,620
Does it exist for a very long period
of time, does it exist for a very short
381
00:27:20,620 --> 00:27:24,220
period of time etc. etc.
So what we did was, for every single
382
00:27:24,220 --> 00:27:30,260
.onion address we plotted how many times
we saw a publication for that particular
383
00:27:30,260 --> 00:27:34,160
hidden service during the six months.
How many times did we see it.
384
00:27:34,160 --> 00:27:38,100
If we saw it a lot of times that suggested
in general the hidden service existed
385
00:27:38,100 --> 00:27:42,180
for a very long period of time. If we saw
a very small number of publications
386
00:27:42,180 --> 00:27:45,760
for each hidden service then that
suggests that they were only present
387
00:27:45,760 --> 00:27:51,690
for a very short period of time. This is
our graph. By far the largest number
388
00:27:51,690 --> 00:27:55,890
of hidden services we saw only once during
the entire study. And we never saw them
389
00:27:55,890 --> 00:28:00,390
again. This suggests that there’s a very high
turnover of the hidden services, they
390
00:28:00,390 --> 00:28:04,520
don’t tend to exist, on average, for
a very long period of time.
391
00:28:04,520 --> 00:28:10,730
And then you can see the sort of
a tail here. If we plot just those
392
00:28:10,730 --> 00:28:16,390
hidden services which existed for a long
time, so e.g. we could take hidden services
393
00:28:16,390 --> 00:28:20,280
which have a high number of hit requests
and say: “Okay, those that have a high number
394
00:28:20,280 --> 00:28:24,800
of hits probably existed for a long time.”
That’s not absolutely certain, but probably.
395
00:28:24,800 --> 00:28:29,190
Then you see this sort of normal-shaped plot
around four or five, so we saw on average
396
00:28:29,190 --> 00:28:34,870
most hidden services four or five times
during the entire six months if they were
397
00:28:34,870 --> 00:28:40,530
popular and we’re using that as a proxy
measure for whether they existed
398
00:28:40,530 --> 00:28:48,160
for the entire time. Now, this study was
over 160 days, so almost six months.
399
00:28:48,160 --> 00:28:51,490
What we also wanted to do was to try
to confirm this over a longer period.
400
00:28:51,490 --> 00:28:56,310
So last year, in 2013, around February,
some researchers at the University
401
00:28:56,310 --> 00:29:00,350
of Luxembourg also ran a similar study,
but it ran over a very short period of time,
402
00:29:00,350 --> 00:29:05,060
about a day. But they did it in such
a way that it could collect descriptors
403
00:29:05,060 --> 00:29:08,590
across much of the circle during a single
day. That was because of a bug in the way
404
00:29:08,590 --> 00:29:12,020
Tor did some things, which has
now been fixed, so we can’t repeat that
405
00:29:12,020 --> 00:29:16,520
particular approach. So we got a list of
.onion addresses from February 2013
406
00:29:16,520 --> 00:29:18,960
from these researchers at the University
of Luxembourg. And then we got our list
407
00:29:18,960 --> 00:29:23,670
of .onion addresses from this six months
which was March to September of this year.
408
00:29:23,670 --> 00:29:26,700
And we wanted to say, okay, we’re given
these two sets of .onion addresses.
409
00:29:26,700 --> 00:29:30,740
Which .onion addresses existed in his set
but not ours and vice versa, and which
410
00:29:30,740 --> 00:29:39,740
.onion addresses existed in both sets?
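The comparison of the two collections is plain set arithmetic. The addresses below are invented stand-ins for the February 2013 and March–September 2014 lists, which were of course far larger:

```python
# Hypothetical address lists standing in for the two collections;
# the real lists were far larger.
feb_2013 = {"aaaa.onion", "bbbb.onion", "cccc.onion"}
mar_sep_2014 = {"cccc.onion", "dddd.onion", "eeee.onion"}

in_both = feb_2013 & mar_sep_2014   # survived the ~18 months
gone = feb_2013 - mar_sep_2014      # only in the older set
new = mar_sep_2014 - feb_2013       # only in the newer set
print(len(in_both), len(gone), len(new))
```

A small `in_both` relative to the union is what supports the high-turnover conclusion.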
411
00:29:39,740 --> 00:29:45,520
So as you can see a very small minority
of hidden service addresses existed
412
00:29:45,520 --> 00:29:50,000
in both sets. This is over an 18 month
period between these two collection points.
413
00:29:50,000 --> 00:29:54,430
A very small number of services existed
in both his data set and in
414
00:29:54,430 --> 00:29:58,390
our data set. Which again suggested
there’s a very high turnover of hidden
415
00:29:58,390 --> 00:30:02,920
services that don’t tend to exist
for a very long period of time.
416
00:30:02,920 --> 00:30:06,530
So the question is: why is that?
We’ll come on to that a little bit later.
417
00:30:06,530 --> 00:30:11,120
It’s a very valid question; we can’t answer
it 100%, but we have some inkling as to
418
00:30:11,120 --> 00:30:15,560
why that may be the case. So in terms
of popularity which hidden services
419
00:30:15,560 --> 00:30:19,700
did we see, or which .onion addresses
did we see requested the most?
420
00:30:19,700 --> 00:30:26,980
Which got the highest number of hits, or the
highest number of directory requests?
421
00:30:26,980 --> 00:30:30,120
So botnet Command & Control servers
– if you’re not familiar with what
422
00:30:30,120 --> 00:30:34,340
a botnet is, the idea is to infect lots of
people with a piece of malware.
423
00:30:34,340 --> 00:30:37,630
And this malware phones home to
a Command & Control server where
424
00:30:37,630 --> 00:30:41,500
the botnet master can give instructions
to each of the bots to do things.
425
00:30:41,500 --> 00:30:46,780
So it might be e.g. to collect passwords,
key strokes, banking details.
426
00:30:46,780 --> 00:30:51,010
Or it might be to do things like
Distributed Denial of Service attacks,
427
00:30:51,010 --> 00:30:55,220
or to send spam, those sorts of things.
And a couple of years ago someone gave
428
00:30:55,220 --> 00:31:00,720
a talk and said: “Well, the problem with
running a botnet is your C&C servers
429
00:31:00,720 --> 00:31:05,750
are vulnerable.” Once a C&C server is taken
down you no longer have control over
430
00:31:05,750 --> 00:31:10,030
your botnet. So it’s been a sort of arms
race between anti-virus companies and
431
00:31:10,030 --> 00:31:15,130
malware authors, trying to come up
with techniques to run C&C servers in a way
432
00:31:15,130 --> 00:31:18,490
in which they can’t be taken down. And
a couple of years ago someone gave a talk
433
00:31:18,490 --> 00:31:22,450
at a conference that said: “You know what?
It would be a really good idea if botnet
434
00:31:22,450 --> 00:31:25,809
C&C servers were run as Tor hidden
services because then no one knows
435
00:31:25,809 --> 00:31:29,370
where they are, and in theory they can’t
be taken down.” So, in fact, we see this:
436
00:31:29,370 --> 00:31:33,000
there are loads and loads and loads of
these addresses associated with several
437
00:31:33,000 --> 00:31:38,122
different botnets, such as ‘Sefnit’ and ‘Skynet’.
Now Skynet is the one I wanted to talk
438
00:31:38,122 --> 00:31:42,840
to you about because the guy that runs
Skynet had a twitter account, and he also
439
00:31:42,840 --> 00:31:47,210
did a Reddit AMA. If you’ve not heard
of a Reddit AMA before, that’s a Reddit
440
00:31:47,210 --> 00:31:51,500
ask-me-anything. You can go on the website
and ask the guy anything. So this guy
441
00:31:51,500 --> 00:31:54,790
wasn’t hiding in the shadows. He’d say:
“Hey, I’m running this massive botnet,
442
00:31:54,790 --> 00:31:58,180
here’s my Twitter account which I update
regularly, here is my Reddit AMA where
443
00:31:58,180 --> 00:32:01,620
you can ask me questions!” etc.
444
00:32:01,620 --> 00:32:04,590
He was arrested last year, which is not,
perhaps, a huge surprise.
445
00:32:04,590 --> 00:32:11,750
laughter and applause
446
00:32:11,750 --> 00:32:15,970
But… so he was arrested,
his C&C servers disappeared
447
00:32:15,970 --> 00:32:21,600
but there were still infected hosts trying
to connect with the C&C servers and
448
00:32:21,600 --> 00:32:24,490
request access to the C&C server.
449
00:32:24,490 --> 00:32:27,570
This is why we’re saying: “A large number
of hits.” So all of these requests are
450
00:32:27,570 --> 00:32:31,520
failed requests, i.e. we didn’t have
a descriptor for them because
451
00:32:31,520 --> 00:32:34,910
the hidden service had gone away but
there were still clients requesting each
452
00:32:34,910 --> 00:32:38,040
of the hidden services.
453
00:32:38,040 --> 00:32:41,980
And the next thing we wanted to do was
to try and categorize sites. So, as I said
454
00:32:41,980 --> 00:32:45,960
earlier, we crawled all of the hidden
services that we could, and we classified
455
00:32:45,960 --> 00:32:50,230
them into different categories based
on what the type of content was
456
00:32:50,230 --> 00:32:53,650
on the hidden service side. The first
graph I have is the number of sites
457
00:32:53,650 --> 00:32:58,040
in each of the categories. So you can see
down the bottom here we got lots of
458
00:32:58,040 --> 00:33:04,280
different categories. We got drugs, market
places, etc. on the bottom. And the graph
459
00:33:04,280 --> 00:33:07,360
shows the percentage of the hidden
services that we crawled that fit in
460
00:33:07,360 --> 00:33:12,680
to each of these categories. So e.g. looking
at this, the largest number of sites
461
00:33:12,680 --> 00:33:16,250
that we crawled were made up of
drugs-focused websites, followed by
462
00:33:16,250 --> 00:33:20,970
market places etc. There’s a couple of
questions you might have here,
463
00:33:20,970 --> 00:33:25,640
so which ones are gonna stick out, what
does ‘porn’ mean, well, you know
464
00:33:25,640 --> 00:33:31,060
what ‘porn’ means. There are some very
notorious porn sites on the Tor Darknet.
465
00:33:31,060 --> 00:33:34,470
There was one in particular which was
focused on revenge porn. It turns out
466
00:33:34,470 --> 00:33:37,520
that youngsters wish to take pictures
of themselves, and send it to their
467
00:33:37,520 --> 00:33:45,040
boyfriends or their girlfriends. And
when they get dumped they publish them
468
00:33:45,040 --> 00:33:49,750
on these websites. So there were several
of these sites on the main internet
469
00:33:49,750 --> 00:33:53,070
which have mostly been shut down.
And some of these sites were archived
470
00:33:53,070 --> 00:33:58,220
on the Darknet. The second one that
you should probably wonder about
471
00:33:58,220 --> 00:34:03,430
is ‘abuse’. Every single
site we classified in this category
472
00:34:03,430 --> 00:34:07,750
was a child abuse site. So they were in
some way facilitating child abuse.
473
00:34:07,750 --> 00:34:10,980
And how do we know that? Well, the data
that came back from the crawler
474
00:34:10,980 --> 00:34:14,789
made it completely unambiguous as to what
the content was in these sites. That was
475
00:34:14,789 --> 00:34:18,918
completely obvious, from the content, from
the crawler as to what was on these sites.
476
00:34:18,918 --> 00:34:23,449
And this is the principal reason why we
didn’t pull down images from sites.
477
00:34:23,449 --> 00:34:26,099
There are many countries where it
would be a criminal offense to do so.
478
00:34:26,099 --> 00:34:29,530
So our crawler only pulled down text
content from all of these sites, and that
479
00:34:29,530 --> 00:34:34,470
enabled us to classify them, based on
that. We didn’t pull down any images.
480
00:34:34,470 --> 00:34:37,880
So of course the next thing we’d like to do
is to say: “Okay, well, given each of these
481
00:34:37,880 --> 00:34:42,759
categories, what proportion of directory
requests went to each of the categories?”
482
00:34:42,759 --> 00:34:45,489
Now the next graph is going to need some
explaining as to precisely what it
483
00:34:45,489 --> 00:34:52,090
means, and I’m gonna give that. This is
the proportion of directory requests
484
00:34:52,090 --> 00:34:55,830
which we saw that went to each of the
categories of hidden service that we
485
00:34:55,830 --> 00:34:59,740
classified. As you can see, in fact, we
saw a very large number going to these
486
00:34:59,740 --> 00:35:05,010
abuse sites. And the rest sort of
distributed right there, at the bottom.
487
00:35:05,010 --> 00:35:07,230
And the question is: “What is it
we’re collecting here?”
488
00:35:07,230 --> 00:35:12,070
We’re collecting successful hidden service
directory requests. What does a hidden
489
00:35:12,070 --> 00:35:16,790
service directory request mean?
It probably loosely correlates with
490
00:35:16,790 --> 00:35:22,230
either a visit or a visitor. So somewhere
in between those two. Because when you
491
00:35:22,230 --> 00:35:26,790
want to visit a hidden service you make
a request for the hidden service descriptor
492
00:35:26,790 --> 00:35:31,080
and that allows you to connect to it
and browse through the web site.
493
00:35:31,080 --> 00:35:34,770
But there are cases where, e.g. if you
restart Tor, you’ll go back and you
494
00:35:34,770 --> 00:35:40,100
re-fetch the descriptor. So in that case
we’ll count twice, for example.
495
00:35:40,100 --> 00:35:43,050
What proportion of these are people,
and which proportion of them are
496
00:35:43,050 --> 00:35:46,619
something else? The answer to that is
we just simply don’t know.
497
00:35:46,619 --> 00:35:50,250
We've got directory requests but that doesn’t
tell us about what they’re doing on these
498
00:35:50,250 --> 00:35:55,130
sites, what they’re fetching, or who
indeed they are, or what it is they are.
499
00:35:55,130 --> 00:35:58,690
So these could be automated requests,
they could be human beings. We can’t
500
00:35:58,690 --> 00:36:03,750
distinguish between those two things.
501
00:36:03,750 --> 00:36:06,420
What are the limitations?
502
00:36:06,420 --> 00:36:12,170
A hidden service directory request doesn’t
exactly correlate to either a visit or a visitor.
503
00:36:12,170 --> 00:36:16,380
It’s probably somewhere in between.
So you can’t say whether it’s exactly one
504
00:36:16,380 --> 00:36:19,810
or the other. We cannot say whether
a hidden service directory request
505
00:36:19,810 --> 00:36:26,230
is a person or something automated.
We can’t distinguish between those two.
506
00:36:26,230 --> 00:36:31,890
Any type of site could be targeted by e.g.
DoS attacks, or by web crawlers, which would
507
00:36:31,890 --> 00:36:40,040
greatly inflate the figures. If you were
to do a DoS attack it’s likely you’d only
508
00:36:40,040 --> 00:36:44,700
request a small number of descriptors.
You’d actually be flooding the site itself
509
00:36:44,700 --> 00:36:47,740
rather than the directories. But, in
theory, you could flood the directories.
510
00:36:47,740 --> 00:36:52,840
But we didn’t see any sort of shutdown
of our directories based on flooding.
511
00:36:52,840 --> 00:36:58,720
Whilst we can’t rule that out, it doesn’t
seem to fit too well with what we’ve got.
512
00:36:58,720 --> 00:37:02,971
The other question is ‘crawlers’.
I obviously talked with the Tor Project
513
00:37:02,971 --> 00:37:08,570
about these results and they’ve suggested
that there are groups, e.g. the child
514
00:37:08,570 --> 00:37:12,740
protection agencies, that will crawl
these sites on a regular basis. And,
515
00:37:12,740 --> 00:37:15,879
again, that doesn’t necessarily correlate
with a human being. And that could
516
00:37:15,879 --> 00:37:19,830
inflate the figures. How many hidden service
directory requests would there be
517
00:37:19,830 --> 00:37:24,610
if a crawler was pointed at it? Typically,
if I crawl it on a single day, one request.
518
00:37:24,610 --> 00:37:27,850
But if they’ve got a large number of servers
doing the crawling then it could be
519
00:37:27,850 --> 00:37:32,840
a request per day for every single server.
So, again, I can’t give you a definitive
520
00:37:32,840 --> 00:37:37,930
“yes, this is human beings” or
“yes, this is automated requests”.
521
00:37:37,930 --> 00:37:43,300
The other important point is, these two
content graphs cover only hidden services
522
00:37:43,300 --> 00:37:48,550
offering web content. There are hidden
services that do other things, e.g. IRC,
523
00:37:48,550 --> 00:37:52,490
instant messaging, etc. Those aren’t
included in these figures. We’re only
524
00:37:52,490 --> 00:37:57,990
concentrating on hidden services offering
web sites. They’re HTTP services, or HTTPS
525
00:37:57,990 --> 00:38:01,640
services, because that allows us to easily
classify them. And, in fact, for some of
526
00:38:01,640 --> 00:38:06,080
the other types, like IRC and Jabber, the
results are probably not directly comparable
527
00:38:06,080 --> 00:38:08,920
with web sites. That’s sort of the use
case for using them, it’s probably
528
00:38:08,920 --> 00:38:16,490
slightly different. So I appreciate the
last graph is somewhat alarming.
529
00:38:16,490 --> 00:38:20,640
If you have any questions please ask
either me or the Tor developers
530
00:38:20,640 --> 00:38:24,810
as to how to interpret these results. It’s
not quite as straightforward as it may
531
00:38:24,810 --> 00:38:27,500
look when you look at the graph. You
might look at the graph and say: “Hey,
532
00:38:27,500 --> 00:38:30,980
that looks like there’s lots of people
visiting these sites”. It’s difficult
533
00:38:30,980 --> 00:38:40,240
to conclude that from the results.
534
00:38:40,240 --> 00:38:45,990
The next slide is gonna be very
contentious. I will prefix it with:
535
00:38:45,990 --> 00:38:50,970
“I’m not advocating -any- kind of
action whatsoever. I’m just trying
536
00:38:50,970 --> 00:38:56,130
to describe technically what could
be done. It’s not up to me to make decisions
537
00:38:56,130 --> 00:39:02,869
on these types of things.” So, of course,
when we found this out, frankly, I think
538
00:39:02,869 --> 00:39:06,190
we were stunned. I mean, it took us
several days, frankly, it just stunned us,
539
00:39:06,190 --> 00:39:09,610
“what the hell, this is not
what we expected at all.”
540
00:39:09,610 --> 00:39:13,210
So a natural step is, well, we think, most
of us think that Tor is a great thing,
541
00:39:13,210 --> 00:39:18,510
it seems. Could this problem be sorted out
while still keeping Tor as it is?
542
00:39:18,510 --> 00:39:21,510
And probably the next step is to say: “Well,
okay, could we just block this class
543
00:39:21,510 --> 00:39:26,060
of content and not other types of content?”
So could we block just hidden services
544
00:39:26,060 --> 00:39:29,630
that are associated with these sites and
not other types of hidden services?
545
00:39:29,630 --> 00:39:33,370
We thought there are three ways in which
we could block hidden services.
546
00:39:33,370 --> 00:39:36,960
And I’ll talk about whether these will
still be possible in the coming months,
547
00:39:36,960 --> 00:39:39,430
after explaining them. But during our
study these would have been possible,
548
00:39:39,430 --> 00:39:43,590
and presently they are possible.
549
00:39:43,590 --> 00:39:48,630
A single individual could shut down
a single hidden service by controlling
550
00:39:48,630 --> 00:39:53,640
all of the relays which are responsible
for receiving a publication request
551
00:39:53,640 --> 00:39:57,280
on that distributed hash table. It’s
possible to place one of your relays
552
00:39:57,280 --> 00:40:01,460
at a particular position on that circle
and therefore make yourself
553
00:40:01,460 --> 00:40:04,290
the responsible relay for
a particular hidden service.
554
00:40:04,290 --> 00:40:08,500
And if you control all of the six relays
which are responsible for a hidden service,
555
00:40:08,500 --> 00:40:11,390
when someone comes to you and says:
“Can I have a descriptor for that site”
556
00:40:11,390 --> 00:40:15,910
you can just say: “No, I haven’t got it”.
And provided you control those relays
557
00:40:15,910 --> 00:40:20,580
users won’t be able to fetch those sites.
558
00:40:20,580 --> 00:40:25,010
The second option is as a relay operator
(blocking by the Tor Project
559
00:40:25,010 --> 00:40:28,941
I’ll talk about in a second). Could I
560
00:40:28,941 --> 00:40:32,500
say: “Okay, I don’t want to carry
561
00:40:32,500 --> 00:40:35,930
this type of content, and I don’t want to
be responsible for serving up this type
562
00:40:35,930 --> 00:40:39,930
of content.” A relay operator could patch
his relay and say: “You know what,
563
00:40:39,930 --> 00:40:44,020
if anyone comes to this relay requesting
any one of these sites then, again, just
564
00:40:44,020 --> 00:40:48,740
refuse to do it”. The problem is a lot of
relay operators need to do it. So a very,
565
00:40:48,740 --> 00:40:51,990
very large number of the potential relay
operators would need to do that
566
00:40:51,990 --> 00:40:56,170
to effectively block these sites. The
final option is the Tor Project could
567
00:40:56,170 --> 00:41:00,740
modify the Tor program and actually embed
these addresses in the Tor program itself
568
00:41:00,740 --> 00:41:05,030
so that all relays by default both
block hidden service directory requests
569
00:41:05,030 --> 00:41:10,560
to these sites, and also clients themselves
would say: “Okay, if anyone’s requesting
570
00:41:10,560 --> 00:41:15,000
these block them at the client level.”
Now I hasten to add: I’m not advocating
571
00:41:15,000 --> 00:41:18,230
any kind of action; that is entirely up to
other people because, frankly, I think
572
00:41:18,230 --> 00:41:22,530
if I advocated blocking hidden services
I probably wouldn’t make it out alive,
573
00:41:22,530 --> 00:41:27,050
so I’m just saying: this is a description
of what technical measures could be used
574
00:41:27,050 --> 00:41:30,730
to block some classes of sites. And of
course there’s lots of questions here.
575
00:41:30,730 --> 00:41:35,150
If e.g. the Tor Project themselves decided:
“Okay, we’re gonna block these sites”
576
00:41:35,150 --> 00:41:38,490
that means they are essentially
in control of the block list.
577
00:41:38,490 --> 00:41:41,360
The block list would be somewhat public
so everyone would be able to inspect
578
00:41:41,360 --> 00:41:44,930
what the sites are that are being blocked
and they would be in control of some kind
579
00:41:44,930 --> 00:41:54,360
of block list. Which, you know, arguably
is against what the Tor Project is after.
580
00:41:54,360 --> 00:41:59,560
takes a sip, coughs
581
00:41:59,560 --> 00:42:05,480
So how about deanonymising visitors
to hidden service web sites?
582
00:42:05,480 --> 00:42:08,940
So in this case we got a user on the
left-hand side who is connected to
583
00:42:08,940 --> 00:42:12,630
a Guard node. We’ve got a hidden service
on the right-hand side who is connected
584
00:42:12,630 --> 00:42:17,530
to a Guard node and on the top we got
one of those directory servers which is
585
00:42:17,530 --> 00:42:21,850
responsible for serving up those
hidden service directory requests.
586
00:42:21,850 --> 00:42:28,660
Now, when you first want to connect to
a hidden service you connect through
587
00:42:28,660 --> 00:42:31,619
your Guard node and through a couple of hops
up to the hidden service directory and
588
00:42:31,619 --> 00:42:35,840
you request the descriptor from them.
So at this point if you are the attacker
589
00:42:35,840 --> 00:42:39,440
and you control one of the hidden service
directory nodes for a particular site
590
00:42:39,440 --> 00:42:43,100
you can send back down the circuit
a particular pattern of traffic.
591
00:42:43,100 --> 00:42:47,740
And if you control that user’s
Guard node – which is a big if –
592
00:42:47,740 --> 00:42:52,110
then you can spot that pattern of traffic
at the Guard node. The question is:
593
00:42:52,110 --> 00:42:56,940
“How do you control a particular user’s
Guard node?” That’s very, very hard.
594
00:42:56,940 --> 00:43:01,480
But if e.g. I run a hidden service and all
of you visit my hidden service, and
595
00:43:01,480 --> 00:43:05,670
I’m running a couple of dodgy Guard relays
then the probability is that some of you,
596
00:43:05,670 --> 00:43:09,760
certainly not all of you by any stretch will
select my dodgy Guard relay, and
597
00:43:09,760 --> 00:43:13,220
I could deanonymise you, but I couldn’t
deanonymise the rest of them.
598
00:43:13,220 --> 00:43:18,260
So what we’re saying here is that
you can deanonymise some of the users
599
00:43:18,260 --> 00:43:22,130
some of the time but you can’t pick which
users those are which you’re going to
600
00:43:22,130 --> 00:43:26,609
deanonymise. You can’t deanonymise someone
specific but you can deanonymise a fraction
601
00:43:26,609 --> 00:43:32,170
based on what fraction of the network you
control in terms of Guard capacity.
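The claim above — you deanonymise a fraction of users proportional to your share of Guard capacity — can be sketched with a toy simulation (hypothetical relay names and bandwidths, not Tor’s actual bandwidth-weighted path selection code):

```python
import random

def pick_guard(guards, rng):
    """Pick one guard, weighted by advertised bandwidth
    (a simplification of Tor's real path selection)."""
    total = sum(bw for _, bw in guards)
    r = rng.uniform(0, total)
    for name, bw in guards:
        r -= bw
        if r <= 0:
            return name
    return guards[-1][0]

rng = random.Random(1)
# 95 honest guards plus 5 attacker guards, equal bandwidth:
# the attacker holds 5% of total guard capacity.
guards = [(f"honest{i}", 10.0) for i in range(95)]
guards += [(f"evil{i}", 10.0) for i in range(5)]

users = 100_000
caught = sum(pick_guard(guards, rng).startswith("evil")
             for _ in range(users))
print(caught / users)  # close to 0.05, the attacker's capacity share
```

The attacker cannot choose *which* users land on the evil guards — only the expected fraction.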
602
00:43:32,170 --> 00:43:36,340
How about… so the attacker controls those
two – here’s a picture from research at
603
00:43:36,340 --> 00:43:40,200
the University of Luxembourg which
did this. And these are plots of
604
00:43:40,200 --> 00:43:45,270
taking the user’s IP address visiting
a C&C server, and then geolocating it
605
00:43:45,270 --> 00:43:48,480
and putting it on a map. So “where was the
user located when they called one of
606
00:43:48,480 --> 00:43:51,620
the Tor hidden services?” So, again,
this is a selection, a percentage
607
00:43:51,620 --> 00:43:58,060
of the users visiting C&C servers
using this technique.
608
00:43:58,060 --> 00:44:03,770
How about deanonymising hidden services
themselves? Well, again, you got a problem.
609
00:44:03,770 --> 00:44:08,340
You’re the user. You’re gonna connect
through your Guard into the Tor network.
610
00:44:08,340 --> 00:44:12,160
And then, eventually, through the hidden
service’s Guard node, and talk to
611
00:44:12,160 --> 00:44:16,740
the hidden service. As the attacker you
need to control the hidden service’s
612
00:44:16,740 --> 00:44:20,859
Guard node to do these traffic correlation
attacks. So again, it’s very difficult
613
00:44:20,859 --> 00:44:24,390
to deanonymise a specific Tor hidden
service. But if you think about, okay,
614
00:44:24,390 --> 00:44:30,200
there are 1,000 Tor hidden services, if you
can control a percentage of the Guard nodes
615
00:44:30,200 --> 00:44:34,230
then some hidden services will pick you
and then you’ll be able to deanonymise those.
616
00:44:34,230 --> 00:44:37,330
So provided you don’t care which hidden
services you gonna deanonymise
617
00:44:37,330 --> 00:44:41,400
then it becomes much more straightforward
to control the Guard nodes of some hidden
618
00:44:41,400 --> 00:44:44,910
services but you can’t pick exactly
what those are.
619
00:44:44,910 --> 00:44:51,040
So what sort of data can you see
traversing a relay?
620
00:44:51,040 --> 00:44:55,880
This is a modified Tor client which just
dumps cells which are coming…
621
00:44:55,880 --> 00:44:58,750
essentially packets travelling down
a circuit, and the information you can
622
00:44:58,750 --> 00:45:04,020
extract from them at a Guard node.
And this is done off the main Tor network.
623
00:45:04,020 --> 00:45:08,590
So I’ve got a client connected to
a “malicious” Guard relay
624
00:45:08,590 --> 00:45:14,040
and it logs every single packet – they’re
called ‘cells’ in the Tor protocol –
625
00:45:14,040 --> 00:45:17,619
coming through the Guard relay. We can’t
decrypt the packet because it’s encrypted
626
00:45:17,619 --> 00:45:21,780
three times. What we can record,
though, is the IP address of the user,
627
00:45:21,780 --> 00:45:25,070
the IP address of the next hop,
and we can count packets travelling
628
00:45:25,070 --> 00:45:29,240
in each direction down the circuit. And we
can also record the time at which those
629
00:45:29,240 --> 00:45:32,210
packets were sent. So of course, if you’re
doing the traffic correlation attacks
630
00:45:32,210 --> 00:45:37,970
you’re using that timing information
to try and work out whether you’re seeing
631
00:45:37,970 --> 00:45:42,370
traffic which you’ve sent and which
identifies a particular user or not.
632
00:45:42,370 --> 00:45:44,810
Or indeed traffic which they’ve sent
which you’ve seen at a different point
633
00:45:44,810 --> 00:45:49,100
in the network.
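What such a logging relay keeps can be sketched as a per-circuit record — no payload, since cells stay triply encrypted, only endpoints, directional counts and timestamps (an illustrative structure with made-up field names, not Tor’s actual code):

```python
import time
from dataclasses import dataclass, field

@dataclass
class CircuitLog:
    """Metadata a malicious Guard could record per circuit:
    no decrypted payload, only endpoints, counts and times."""
    user_ip: str           # previous hop: the client connecting to us
    next_hop_ip: str       # next relay in the circuit
    cells_in: int = 0
    cells_out: int = 0
    timestamps: list = field(default_factory=list)

    def record(self, outbound, now=None):
        # the 512-byte cell body itself stays opaque (encrypted 3 times)
        if outbound:
            self.cells_out += 1
        else:
            self.cells_in += 1
        self.timestamps.append(now if now is not None else time.time())

log = CircuitLog("198.51.100.7", "203.0.113.9")
log.record(outbound=True, now=0.00)
log.record(outbound=False, now=0.12)
log.record(outbound=False, now=0.13)
print(log.cells_in, log.cells_out)  # 2 1
```

Correlation attacks then compare these timestamp sequences against traffic seen (or injected) at another point in the network.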
634
00:45:49,100 --> 00:45:51,980
Moving on to my…
635
00:45:51,980 --> 00:45:55,760
…interesting problems,
research questions etc.
636
00:45:55,760 --> 00:45:59,250
Based on what I’ve said, I’ve said there’s
these directory authorities which are
637
00:45:59,250 --> 00:46:05,070
controlled by the core Tor members. If
e.g. they were malicious then they could
638
00:46:05,070 --> 00:46:08,990
manipulate the Tor… – if a big enough
chunk of them are malicious then
639
00:46:08,990 --> 00:46:12,700
they can manipulate the consensus
to direct you to particular nodes.
640
00:46:12,700 --> 00:46:15,920
I don’t think that’s the case, nor that
anyone thinks that’s the case.
641
00:46:15,920 --> 00:46:19,180
And Tor is designed in a way to tr…
I mean that you’d have to control
642
00:46:19,180 --> 00:46:22,480
a certain number of the authorities
to be able to do anything important.
643
00:46:22,480 --> 00:46:25,270
So the Tor people… I said this
to them a couple of days ago.
644
00:46:25,270 --> 00:46:28,780
I find it quite funny that you’d design
your system as if you don’t trust
645
00:46:28,780 --> 00:46:31,880
each other. To which their response was:
“No, we design our system so that
646
00:46:31,880 --> 00:46:35,620
we don’t have to trust each other.” Which
I think is a very good model to have,
647
00:46:35,620 --> 00:46:39,430
when you have this type of system.
So could we eliminate these sort of
648
00:46:39,430 --> 00:46:43,240
centralized servers? I think that’s
actually a very hard problem to do.
649
00:46:43,240 --> 00:46:46,340
There are lots of attacks which could
potentially be deployed against
650
00:46:46,340 --> 00:46:51,250
a decentralized network. At the moment the
Tor network is relatively well understood
651
00:46:51,250 --> 00:46:54,490
in terms of what types of attack it
is vulnerable to. So if we were to move
652
00:46:54,490 --> 00:46:58,880
to a new architecture then we may open it
to a whole new class of attacks.
653
00:46:58,880 --> 00:47:02,000
The Tor network has existed
for quite some time and it’s been
654
00:47:02,000 --> 00:47:06,820
very well studied. What about global
adversaries like the NSA, where you could
655
00:47:06,820 --> 00:47:10,980
monitor network links all across the
world? It’s very difficult to defend
656
00:47:10,980 --> 00:47:15,530
against that. Where they can monitor…
if they can identify which Guard relay
657
00:47:15,530 --> 00:47:18,760
you’re using, they can monitor traffic
going into and out of the Guard relay,
658
00:47:18,760 --> 00:47:23,259
and they log each of the subsequent hops
along. It’s very, very difficult to defend against
659
00:47:23,259 --> 00:47:26,470
these types of things. Do we know if
they’re doing it? The documents that were
660
00:47:26,470 --> 00:47:29,850
released yesterday – I’ve only had a very
brief look through them, but they suggest
661
00:47:29,850 --> 00:47:32,480
that they’re not presently doing it and
they haven’t had much success.
662
00:47:32,480 --> 00:47:36,450
I don’t know why, there are very powerful
attacks described in the academic literature
663
00:47:36,450 --> 00:47:40,830
which are very, very reliable and most
academic literature you can access for free
664
00:47:40,830 --> 00:47:43,960
so it’s not even as if they have to figure
out how to do it. They just have to read
665
00:47:43,960 --> 00:47:47,010
the academic literature and try and
implement some of these attacks.
666
00:47:47,010 --> 00:47:52,000
I don’t know what – why they’re not. The
next question is how to detect malicious
667
00:47:52,000 --> 00:47:57,760
relays. So in my case we’re running
40 relays. Our relays were on consecutive
668
00:47:57,760 --> 00:48:01,570
IP addresses, so we’re running 40
– well, most of them are on consecutive
669
00:48:01,570 --> 00:48:04,820
IP addresses in two blocks. So they’re
running on IP addresses numbered
670
00:48:04,820 --> 00:48:09,280
e.g. 1,2,3,4,…
We were running two relays per IP address,
671
00:48:09,280 --> 00:48:12,210
and every single relay had my name
plastered across it.
672
00:48:12,210 --> 00:48:14,740
So after I set up these 40 relays in
673
00:48:14,740 --> 00:48:17,420
a relatively short period of time
I expected someone from the Tor Project
674
00:48:17,420 --> 00:48:22,260
to come to me and say: “Hey Gareth, what
are you doing?” – no one noticed,
675
00:48:22,260 --> 00:48:26,090
no one noticed. So this is presently
an open question. On the Tor Project
676
00:48:26,090 --> 00:48:28,790
they’re quite open about this. They
acknowledged that, in fact, last year
677
00:48:28,790 --> 00:48:33,210
we had the CERT researchers launch many
more relays than that. The Tor Project
678
00:48:33,210 --> 00:48:36,510
spotted that large number of relays
but chose not to do anything about it
679
00:48:36,510 --> 00:48:40,119
and, in fact, they were deploying an
attack. But, as you know, it’s often very
680
00:48:40,119 --> 00:48:43,700
difficult to defend against unknown
attacks. So at the moment how to detect
681
00:48:43,700 --> 00:48:47,780
malicious relays is a bit of an open
question. Which, I think, is being
682
00:48:47,780 --> 00:48:50,720
discussed on the mailing list.
683
00:48:50,720 --> 00:48:54,230
The other one is defending against unknown
tampering at exits. If you take
684
00:48:54,230 --> 00:48:57,220
the exit relays – the exit relay
can tamper with the traffic.
685
00:48:57,220 --> 00:49:01,040
So we know particular types of attacks
doing SSL man-in-the-middles etc.
686
00:49:01,040 --> 00:49:05,350
We’ve seen recently binary patching.
How do we detect unknown tampering
687
00:49:05,350 --> 00:49:08,970
with traffic, other types of traffic? So
the binary tampering wasn’t spotted
688
00:49:08,970 --> 00:49:12,060
until it was spotted by someone who
told the Tor Project. So it wasn’t
689
00:49:12,060 --> 00:49:15,609
detected e.g. by the Tor Project
themselves, it was spotted by someone else
690
00:49:15,609 --> 00:49:20,500
and notified to them. And then the final
open one here is the Tor code review.
691
00:49:20,500 --> 00:49:25,400
So the Tor code is open source. We know
from OpenSSL that, although everyone
692
00:49:25,400 --> 00:49:29,260
can read source code, people don’t always
look at it. And OpenSSL has been
693
00:49:29,260 --> 00:49:32,230
a huge mess, and there’s been
lots of stuff disclosed about that
694
00:49:32,230 --> 00:49:35,880
over the last few days. There are
lots of eyes on the Tor code but I think
695
00:49:35,880 --> 00:49:41,519
always, more eyes are better. I’d say,
ideally if we can get people to look
696
00:49:41,519 --> 00:49:45,140
at the Tor code and look for
vulnerabilities then… I encourage people
697
00:49:45,140 --> 00:49:49,860
to do that. It’s a very useful thing to
do. There could be unknown vulnerabilities
698
00:49:49,860 --> 00:49:53,119
as we’ve seen with the “relay early” type
quite recently in the Tor code which
699
00:49:53,119 --> 00:49:56,990
could be quite serious. The truth is we
just don’t know until people do thorough
700
00:49:56,990 --> 00:50:02,500
code audits, and even then it’s very
difficult to know for certain.
701
00:50:02,500 --> 00:50:08,170
So my last point, I think, yes,
702
00:50:08,170 --> 00:50:11,130
is advice to future researchers.
So if you ever wanted, or are planning
703
00:50:11,130 --> 00:50:16,349
on doing a study in the future, e.g. on
Tor, do not do what the CERT researchers
704
00:50:16,349 --> 00:50:20,550
did and start deanonymising people on the
live Tor network and doing it in a way
705
00:50:20,550 --> 00:50:25,060
which is incredibly irresponsible. I don’t
think… I mean, I tend, myself, to give them
706
00:50:25,060 --> 00:50:28,510
the benefit of the doubt, I don’t think the
CERT researchers set out to be malicious.
707
00:50:28,510 --> 00:50:33,320
I think they’re just very naive.
That’s what they were doing.
708
00:50:33,320 --> 00:50:36,780
That was rapidly pointed out to them.
In my case we are running
709
00:50:36,780 --> 00:50:43,090
40 relays. Our Tor relays were forwarding
traffic, they were acting as good relays.
710
00:50:43,090 --> 00:50:45,970
The only thing that we were doing
was logging publication requests
711
00:50:45,970 --> 00:50:50,050
to the directories. Big question whether
that’s malicious or not – I don’t know.
712
00:50:50,050 --> 00:50:53,330
One thing that has been pointed out to me
is that the .onion addresses themselves
713
00:50:53,330 --> 00:50:58,270
could be considered sensitive information,
so any data we will be retaining
714
00:50:58,270 --> 00:51:01,840
from the study is the aggregated data.
So we won't be retaining information
715
00:51:01,840 --> 00:51:05,400
on individual .onion addresses because
that could potentially be considered
716
00:51:05,400 --> 00:51:08,900
sensitive information. If you think about
someone running an .onion address which
717
00:51:08,900 --> 00:51:11,240
contains something which they don’t want
other people knowing about. So we won’t
718
00:51:11,240 --> 00:51:15,060
be retaining that data, and
we’ll be destroying it.
719
00:51:15,060 --> 00:51:19,920
So I think that brings me now
to starting the questions.
720
00:51:19,920 --> 00:51:22,770
I want to say “Thanks” to a couple of
people. The student who donated
721
00:51:22,770 --> 00:51:26,820
the server to us. Nick Savage who is one
of my colleagues who was a sounding board
722
00:51:26,820 --> 00:51:30,510
during the entire study. Ivan Pustogarov
who is the researcher at the University
723
00:51:30,510 --> 00:51:34,700
of Luxembourg who sent us the large data
set of .onion addresses from last year.
724
00:51:34,700 --> 00:51:37,670
He’s also the chap who has demonstrated
those deanonymisation attacks
725
00:51:37,670 --> 00:51:41,500
that I talked about. A big "Thank you" to
Roger Dingledine who has frankly been…
726
00:51:41,500 --> 00:51:45,230
presented loads of questions to me over
the last couple of days and allowed me
727
00:51:45,230 --> 00:51:49,410
to bounce ideas back and forth.
That has been a very useful process.
728
00:51:49,410 --> 00:51:53,640
If you are doing future research I strongly
encourage you to contact the Tor Project
729
00:51:53,640 --> 00:51:57,040
at the earliest opportunity. You’ll find
them… certainly I found them to be
730
00:51:57,040 --> 00:51:59,460
extremely helpful.
731
00:51:59,460 --> 00:52:04,640
Donncha also did something similar,
so both Ivan and Donncha have done
732
00:52:04,640 --> 00:52:09,520
a similar study in trying to classify the
types of hidden services or work out
733
00:52:09,520 --> 00:52:13,520
how many hits there are to particular
types of hidden service. Ivan Pustogarov
734
00:52:13,520 --> 00:52:17,430
did it on a bigger scale
and found similar results to us.
735
00:52:17,430 --> 00:52:21,910
That is that these abuse sites
featured frequently
736
00:52:21,910 --> 00:52:26,740
in the top requested sites. That was done
over a year ago, and again, he was seeing
737
00:52:26,740 --> 00:52:31,109
similar sorts of pattern. There were these
abuse sites being requested frequently.
738
00:52:31,109 --> 00:52:35,450
So that also sort of corroborates
what we’re saying.
739
00:52:35,450 --> 00:52:38,540
The data I put online is at this address,
there will probably be the slides,
740
00:52:38,540 --> 00:52:41,609
something called ‘The Tor Research
Framework’ which is an implementation
741
00:52:41,609 --> 00:52:47,510
of a Java client, so an implementation
of a Tor client in Java specifically aimed
742
00:52:47,510 --> 00:52:52,080
at researchers. So if e.g. you wanna pull
out data from a consensus you can do.
743
00:52:52,080 --> 00:52:55,290
If you want to build custom routes
through the network you can do.
744
00:52:55,290 --> 00:52:58,230
If you want to build routes through the
network and start sending padding traffic
745
00:52:58,230 --> 00:53:01,720
down them you can do etc.
The code is designed in a way which is
746
00:53:01,720 --> 00:53:06,000
intended to be easily modifiable
for testing lots of these things.
747
00:53:06,000 --> 00:53:10,580
There is also a link to the Tor FBI
exploit which they deployed against
748
00:53:10,580 --> 00:53:16,230
visitors to some Tor hidden services last
year. They exploited a Mozilla Firefox bug
749
00:53:16,230 --> 00:53:20,540
and then ran code on users who were
visiting these hidden services, and ran
750
00:53:20,540 --> 00:53:24,619
code on their computer to identify them.
At this address there is a link to that
751
00:53:24,619 --> 00:53:29,250
including a copy of the shell code and an
analysis of exactly what it was doing.
752
00:53:29,250 --> 00:53:31,670
And then of course a list of references,
with papers and things.
753
00:53:31,670 --> 00:53:34,260
So I’m quite happy to take questions now.
754
00:53:34,260 --> 00:53:46,960
applause
755
00:53:46,960 --> 00:53:50,880
Herald: Thanks for the nice talk!
Do we have any questions
756
00:53:50,880 --> 00:53:57,000
from the internet?
757
00:53:57,000 --> 00:53:59,740
Signal Angel: One question. It’s very hard
to block addresses since creating them
758
00:53:59,740 --> 00:54:03,620
is cheap, and they can be generated
for each user, and rotated often. So
759
00:54:03,620 --> 00:54:07,510
can you think of any other way
for doing the blocking?
760
00:54:07,510 --> 00:54:09,799
Gareth: That is absolutely true, so, yes.
If you were to block a particular .onion
761
00:54:09,799 --> 00:54:13,060
address they can just say: “I want another
.onion address.” So I don’t know of
762
00:54:13,060 --> 00:54:16,760
any way to counter that now.
763
00:54:16,760 --> 00:54:18,510
Herald: Another one from the internet?
inaudible answer from Signal Angel
764
00:54:18,510 --> 00:54:22,030
Okay, then, Microphone 1, please!
765
00:54:22,030 --> 00:54:26,359
Question: Thank you, that’s fascinating
research. You mentioned that it is
766
00:54:26,359 --> 00:54:32,200
possible to influence the hash of your
relay node in a sense that you could
767
00:54:32,200 --> 00:54:35,970
be choosing which service you are
advertising, or which hidden service
768
00:54:35,970 --> 00:54:38,050
you are responsible for. Is that right?
Gareth: Yeah, correct!
769
00:54:38,050 --> 00:54:40,390
Question: So could you elaborate
on how this is possible?
770
00:54:40,390 --> 00:54:44,740
Gareth: So e.g. you just keep regenerating
a public key for your relay,
771
00:54:44,740 --> 00:54:48,140
you’ll get closer and closer to the point
where you’ll be the responsible relay
772
00:54:48,140 --> 00:54:51,160
for that particular hidden service. That’s
just – you keep regenerating your identity
773
00:54:51,160 --> 00:54:54,720
hash until you’re at that particular point
in the relay. That’s not particularly
774
00:54:54,720 --> 00:55:00,490
computationally intensive to do.
That was it?
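The key-grinding Gareth describes can be sketched as follows (a toy stand-in: real relay identities are RSA keys and HSDir responsibility is decided by ring order on the distributed hash table, not an exact prefix match, but the brute-force loop is the same idea):

```python
import hashlib
import os

def fingerprint(key_bytes):
    # Tor relay fingerprints are the SHA-1 of the public identity key
    return hashlib.sha1(key_bytes).digest()

def mine_fingerprint(descriptor_id, prefix_len=2):
    """Keep generating fresh identity keys until the fingerprint
    shares a leading prefix with the target descriptor ID, i.e.
    the relay would sort right next to it on the DHT ring."""
    tries = 0
    while True:
        tries += 1
        key = os.urandom(128)  # stand-in for generating a fresh RSA key
        fp = fingerprint(key)
        if fp[:prefix_len] == descriptor_id[:prefix_len]:
            return fp, tries

# Hypothetical target: the descriptor ID of some hidden service.
target = hashlib.sha1(b"exampleonion-descriptor").digest()
fp, tries = mine_fingerprint(target)
assert fp[:2] == target[:2]  # ~65,536 expected tries for a 2-byte match
```

This is why the attack is cheap: each attempt is one key generation plus one hash, so positioning a relay next to a chosen descriptor ID takes seconds to minutes, not serious compute.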
775
00:55:00,490 --> 00:55:04,740
Herald: Okay, next question
from Microphone 5, please.
776
00:55:04,740 --> 00:55:09,490
Question: Hi, I was wondering for the
attacks where you identify a certain number
777
00:55:09,490 --> 00:55:15,170
of users using a hidden service. Have
those attacks been used, or is there
778
00:55:15,170 --> 00:55:18,880
any evidence there, and is there
any way of protecting against that?
779
00:55:18,880 --> 00:55:22,260
Gareth: That’s a very interesting question,
is there any way to detect these types
780
00:55:22,260 --> 00:55:24,970
of attacks? So some of the attacks,
if you’re going to generate particular
781
00:55:24,970 --> 00:55:29,030
traffic patterns, one way to do that is to
use the padding cells. The padding cells
782
00:55:29,030 --> 00:55:32,070
aren’t used at the moment by the official
Tor client. So the detection of those
783
00:55:32,070 --> 00:55:36,510
could be indicative but it doesn’t…
it’s not conclusive evidence in itself.
784
00:55:36,510 --> 00:55:40,050
Question: And is there any way of
protecting against a government
785
00:55:40,050 --> 00:55:46,510
or something trying to denial-of-service
hidden services?
786
00:55:46,510 --> 00:55:48,180
Gareth: So I… trying to… did not…
787
00:55:48,180 --> 00:55:52,500
Question: Is it possible to protect
against this kind of attack?
788
00:55:52,500 --> 00:55:56,180
Gareth: Not that I’m aware of. The Tor
Project are currently revising how they
789
00:55:56,180 --> 00:55:59,500
do the hidden service protocol which will
make e.g. what I did, enumerating
790
00:55:59,500 --> 00:56:03,230
the hidden services, much more difficult.
And also to position yourself on the
791
00:56:03,230 --> 00:56:07,470
distributed hash table in advance
for a particular hidden service.
792
00:56:07,470 --> 00:56:10,510
So they are at the moment trying to change
the way it’s done, and make some of
793
00:56:10,510 --> 00:56:15,270
these things more difficult.
794
00:56:15,270 --> 00:56:20,290
Herald: Good. Next question
from Microphone 2, please.
795
00:56:20,290 --> 00:56:27,220
Mic2: Hi. I’m running the Tor2Web abuse,
and so I used to see a lot of abuse requests
796
00:56:27,220 --> 00:56:31,130
concerning the Tor hidden service
being exposed on the internet through
797
00:56:31,130 --> 00:56:37,270
the Tor2Web.org domain name. And I just
wanted to comment on, like you said,
798
00:56:37,270 --> 00:56:45,410
the number of abuse requests. I used
to speak with some of the child protection
799
00:56:45,410 --> 00:56:50,070
agencies that reported abuse at
Tor2Web.org, and they are effectively
800
00:56:50,070 --> 00:56:55,570
using crawlers that periodically look for
changes in order to get new images to be
801
00:56:55,570 --> 00:57:00,190
put in the database. And what I was able
to understand is that the German agency
802
00:57:00,190 --> 00:57:07,440
doing that is crawling the same sites that
the Italian agencies are crawling, too.
803
00:57:07,440 --> 00:57:11,890
So it’s likely that in most of the
countries there are the child protection
804
00:57:11,890 --> 00:57:16,790
agencies that are crawling those few
numbers of Tor hidden services that
805
00:57:16,790 --> 00:57:22,760
contain child porn. And I saw it also
a bit from the statistics of Tor2Web
806
00:57:22,760 --> 00:57:28,500
where the amount of abuse relating to
that kind of content is relatively low.
807
00:57:28,500 --> 00:57:30,000
Just as contribution!
808
00:57:30,000 --> 00:57:33,500
Gareth: Yes, that’s very interesting,
thank you for that!
809
00:57:33,500 --> 00:57:37,260
applause
810
00:57:37,260 --> 00:57:39,560
Herald: Next, Microphone 4, please.
811
00:57:39,560 --> 00:57:45,260
Mic4: You then attacked or deanonymised
users with an infected or a modified Guard
812
00:57:45,260 --> 00:57:51,810
relay? Is it required to modify the Guard
relay if I control the entry point
813
00:57:51,810 --> 00:57:57,360
of the user to the internet?
If I’m his ISP?
814
00:57:57,360 --> 00:58:01,900
Gareth: Yes, if you observe traffic
travelling into a Guard relay without
815
00:58:01,900 --> 00:58:04,570
controlling the Guard relay itself.
Mic4: Yeah.
816
00:58:04,570 --> 00:58:07,500
Gareth: In theory, yes. I wouldn’t be able
to tell you how reliable that is
817
00:58:07,500 --> 00:58:10,500
off the top of my head.
Mic4: Thanks!
818
00:58:10,500 --> 00:58:13,630
Herald: So another question
from the internet!
819
00:58:13,630 --> 00:58:16,339
Signal Angel: Wouldn’t the ability to
choose the key hash prefix give
820
00:58:16,339 --> 00:58:19,980
the ability to target specific .onions?
821
00:58:19,980 --> 00:58:23,680
Gareth: So you can only target one .onion
address at a time. Because of the way
822
00:58:23,680 --> 00:58:28,080
they are generated. So you wouldn’t be
able to say e.g. “Pick a key which targeted
823
00:58:28,080 --> 00:58:32,339
two or more .onion addresses.” You can
only target one .onion address at a time
824
00:58:32,339 --> 00:58:37,720
by positioning yourself at a particular
point on the distributed hash table.
825
00:58:37,720 --> 00:58:40,260
Herald: Another one
from the internet? … Okay.
826
00:58:40,260 --> 00:58:43,369
Then Microphone 3, please.
827
00:58:43,369 --> 00:58:47,780
Mic3: Hey. Thanks for this research.
I think it strengthens the network.
828
00:58:47,780 --> 00:58:54,300
So in the deem (?) I was wondering whether
you can donate these relays to be a part of
829
00:58:54,300 --> 00:58:59,500
non-malicious relays pool, basically
use them as regular relays afterwards?
830
00:58:59,500 --> 00:59:02,750
Gareth: Okay, so can I donate the relays
to rerun and add to the Tor capacity (?)?
831
00:59:02,750 --> 00:59:05,490
Unfortunately, I said they were run by
a student and they were donated for
832
00:59:05,490 --> 00:59:09,510
a fixed period of time. So we’ve given
those back to him. We are very grateful
833
00:59:09,510 --> 00:59:14,790
to him, he was very generous. In fact,
without his contribution donating these
834
00:59:14,790 --> 00:59:18,700
it would have been much more difficult
to collect as much data as we did.
835
00:59:18,700 --> 00:59:21,490
Herald: Good, next, Microphone 5, please!
836
00:59:21,490 --> 00:59:25,839
Mic5: Yeah hi, first of all thanks
for your talk. I think you’ve raised
837
00:59:25,839 --> 00:59:29,310
some real issues that need to be
considered very carefully by everyone
838
00:59:29,310 --> 00:59:33,950
on the Tor Project. My question: I’d like
to go back to the issue with so many
839
00:59:33,950 --> 00:59:38,470
abuse related web sites running over
the Tor Project. I think it’s an important
840
00:59:38,470 --> 00:59:41,900
issue that really needs to be considered
because we don’t wanna be associated
841
00:59:41,900 --> 00:59:44,840
with that at the end of the day.
Anyone who uses Tor, who runs a relay
842
00:59:44,840 --> 00:59:51,250
or an exit node. And I understand it’s
a bit of a sensitive issue, and you don’t
843
00:59:51,250 --> 00:59:55,300
really have any say over whether it’s
implemented or not. But I’d like to get
844
00:59:55,300 --> 01:00:02,410
your opinion on the implementation
of a distributed block-deny system
845
01:00:02,410 --> 01:00:06,980
that would run in very much a similar way
to those of the directory authorities.
846
01:00:06,980 --> 01:00:08,950
I’d just like to see what
you think of that.
847
01:00:08,950 --> 01:00:13,200
Gareth: So you’re asking me whether I want
to support a particular blocking mechanism
848
01:00:13,200 --> 01:00:14,200
then?
849
01:00:14,200 --> 01:00:16,470
Mic5: I’d like to get your opinion on it.
Gareth laughs
850
01:00:16,470 --> 01:00:20,540
I know it’s a sensitive issue but I think,
like I said, I think something…
851
01:00:20,540 --> 01:00:25,700
I think it needs to be considered because
everyone running exit nodes and relays
852
01:00:25,700 --> 01:00:30,270
and people of the Tor Project don’t
want to be known or associated with
853
01:00:30,270 --> 01:00:34,790
this massive amount of abuse web sites
that currently exist within the Tor network.
854
01:00:34,790 --> 01:00:40,210
Gareth: I absolutely agree, and I think
the Tor Project are horrified as well that
855
01:00:40,210 --> 01:00:43,960
this problem exists, and they, in fact,
talked on it in previous years that
856
01:00:43,960 --> 01:00:48,690
they have a problem with this type of
content. As to what, if anything, is
857
01:00:48,690 --> 01:00:52,340
done about it, it’s very much up to them.
Could it be done in a distributed fashion?
858
01:00:52,340 --> 01:00:56,240
So the example I gave was a way which
it could be done by relay operators.
859
01:00:56,240 --> 01:00:59,770
So e.g. that would need the consensus of
a large number of relay operators to be
860
01:00:59,770 --> 01:01:02,890
effective. So that is done in
a distributed fashion. The question is:
861
01:01:02,890 --> 01:01:06,810
who gives the list of .onion addresses to
block to each of the relay operators?
862
01:01:06,810 --> 01:01:09,640
Clearly, the relay operators aren’t going
to collect it themselves. It needs to be
863
01:01:09,640 --> 01:01:15,780
supplied by someone like the Tor Project,
e.g., or someone trustworthy. Yes, it can
864
01:01:15,780 --> 01:01:20,480
be done in a distributed fashion.
It can be done in an open fashion.
865
01:01:20,480 --> 01:01:21,710
Mic5: Who knows?
Gareth: Okay.
866
01:01:21,710 --> 01:01:23,750
Mic5: Thank you.
867
01:01:23,750 --> 01:01:27,260
Herald: Good. And another
question from the internet.
868
01:01:27,260 --> 01:01:31,210
Signal Angel: Apparently there’s an option
in the Tor client to collect statistics
869
01:01:31,210 --> 01:01:35,169
on hidden services. Do you know about
this, and how it relates to your research?
870
01:01:35,169 --> 01:01:38,551
Gareth: Yes, I believe they’re going to
be… the extent to which I know about it
871
01:01:38,551 --> 01:01:41,930
is they’re gonna be trying this next
month, to try and estimate how many
872
01:01:41,930 --> 01:01:46,490
hidden services there are. So keep
your eye on the Tor Project web site,
873
01:01:46,490 --> 01:01:50,340
I’m sure they’ll be publishing
their data in the coming months.
874
01:01:50,340 --> 01:01:55,090
Herald: And, sadly, we are running out of
time, so this will be the last question,
875
01:01:55,090 --> 01:01:56,980
so Microphone 4, please!
876
01:01:56,980 --> 01:02:01,250
Mic4: Hi, I’m just wondering if you could
sort of outline what ethical clearances
877
01:02:01,250 --> 01:02:04,510
you had to get from your university
to conduct this kind of research.
878
01:02:04,510 --> 01:02:07,260
Gareth: So we have to discuss these
types of things before undertaking
879
01:02:07,260 --> 01:02:11,970
any research. And we go through the steps
to make sure that we’re not e.g. storing
880
01:02:11,970 --> 01:02:16,370
sensitive information about particular
people. So yes, we are very mindful
881
01:02:16,370 --> 01:02:19,240
of that. And that’s why I made a
particular point of putting on the slides
882
01:02:19,240 --> 01:02:21,510
as to some of the things to consider.
883
01:02:21,510 --> 01:02:26,180
Mic4: So like… you outlined a potential
implementation of the traffic correlation
884
01:02:26,180 --> 01:02:29,500
attack. Are you saying that
you performed the attack? Or…
885
01:02:29,500 --> 01:02:33,180
Gareth: No, no no, absolutely not.
So the link I’m giving… absolutely not.
886
01:02:33,180 --> 01:02:34,849
We have not engaged in any…
887
01:02:34,849 --> 01:02:36,350
Mic4: It just wasn’t clear
from the slides.
888
01:02:36,350 --> 01:02:39,380
Gareth: I apologize. So it’s absolutely
clear on that. No, we’re not engaging
889
01:02:39,380 --> 01:02:42,860
in any deanonymisation research on the
Tor network. The research I showed
890
01:02:42,860 --> 01:02:46,079
is linked on the references, I think,
which I put at the end of the slides.
891
01:02:46,079 --> 01:02:52,000
You can read about it. But it’s done in
simulation. So e.g. there’s a way
892
01:02:52,000 --> 01:02:54,730
to do simulation of the Tor network on
a single computer. I can’t remember
893
01:02:54,730 --> 01:02:58,880
the name of the project, though.
Shadow! Yes, it’s a system
894
01:02:58,880 --> 01:03:02,170
called Shadow, we can run a large
number of Tor relays on a single computer
895
01:03:02,170 --> 01:03:04,579
and simulate the traffic between them.
If you’re going to do that type of research
896
01:03:04,579 --> 01:03:09,380
then you should use that. Okay,
thank you very much, everyone.
897
01:03:09,380 --> 01:03:17,985
applause
898
01:03:17,985 --> 01:03:22,071
silent postroll titles
899
01:03:22,071 --> 01:03:27,000
subtitles created by c3subtitles.de
Join, and help us!