WEBVTT 00:00:00.000 --> 00:00:09.970 silent 31C3 preroll 00:00:09.970 --> 00:00:13.220 Dr. Gareth Owen: Hello. Can you hear me? Yes. Okay. So my name is Gareth Owen. 00:00:13.220 --> 00:00:16.150 I’m from the University of Portsmouth. I’m an academic 00:00:16.150 --> 00:00:19.320 and I’m going to talk to you about an experiment that we did 00:00:19.320 --> 00:00:22.610 on the Tor hidden services, trying to categorize them, 00:00:22.610 --> 00:00:25.230 estimate how many there were etc. etc. 00:00:25.230 --> 00:00:27.380 Well, as we go through the talk I’m going to explain 00:00:27.380 --> 00:00:31.120 how Tor hidden services work internally, and how the data was collected. 00:00:31.120 --> 00:00:35.320 So what sort of conclusions you can draw from the data based on the way that we’ve 00:00:35.320 --> 00:00:39.950 collected it. Just so [that] I get an idea: how many of you use Tor 00:00:39.950 --> 00:00:42.430 on a regular basis, could you put your hand up for me? 00:00:42.430 --> 00:00:46.120 So quite a big number. Keep your hand up if… or put your hand up if you’re 00:00:46.120 --> 00:00:48.320 a relay operator. 00:00:48.320 --> 00:00:51.470 Wow, that’s quite a significant number, isn’t it? And then, put your hand up 00:00:51.470 --> 00:00:55.250 and/or keep it up if you run a hidden service. 00:00:55.250 --> 00:00:59.530 Okay, so, a smaller number, but still some people run hidden services. 00:00:59.530 --> 00:01:02.720 Okay, so, some of you may be very familiar with the way Tor works, sort of, 00:01:02.720 --> 00:01:06.700 at a low level. But I am gonna go through it for those who aren’t, so they understand 00:01:06.700 --> 00:01:10.380 just how they work. And as we go along, because I’m explaining how 00:01:10.380 --> 00:01:14.030 the hidden services work, I’m going to tag on information on how 00:01:14.030 --> 00:01:19.030 the Tor hidden services themselves can be deanonymised and also how the users 00:01:19.030 --> 00:01:23.090 of those hidden services can be deanonymised, if you put 00:01:23.090 --> 00:01:27.040 some strict criteria on what it is you want to do with respect to them. 00:01:27.040 --> 00:01:30.920 So the things that I’m going to go over: I wanna go over how Tor works, 00:01:30.920 --> 00:01:34.190 and then specifically how hidden services work. I’m gonna talk about something 00:01:34.190 --> 00:01:37.889 called the “Tor Distributed Hash Table” for hidden services. If you’ve heard 00:01:37.889 --> 00:01:40.560 that term and don’t know what it means, don’t worry, I’ll explain 00:01:40.560 --> 00:01:44.010 what a distributed hash table is and how it works. It’s not as complicated 00:01:44.010 --> 00:01:47.690 as it sounds. And then I wanna go over Darknet data, so, data that we collected 00:01:47.690 --> 00:01:53.030 from Tor hidden services. And as I say, as we go along I will sort of explain 00:01:53.030 --> 00:01:56.650 how you do deanonymisation of both the services themselves and of the visitors 00:01:56.650 --> 00:02:02.400 to the service. And just how complicated it is.
00:02:02.400 --> 00:02:07.370 So you may have seen this slide which I think was from GCHQ, released last year 00:02:07.370 --> 00:02:12.099 as part of the Snowden leaks where they said: “You can deanonymise some users 00:02:12.099 --> 00:02:15.560 some of the time but they’ve had no success in deanonymising someone 00:02:15.560 --> 00:02:20.109 in response to a specific request.” So, given all of you e.g., I may be able 00:02:20.109 --> 00:02:25.090 to deanonymise a small fraction of you but I can’t choose precisely one person 00:02:25.090 --> 00:02:27.499 I want to deanonymise. That’s what I’m gonna be explaining in relation 00:02:27.499 --> 00:02:30.940 to the deanonymisation attacks, how you can deanonymise a section but 00:02:30.940 --> 00:02:38.629 you can’t necessarily choose which section of the users that you will be deanonymising. 00:02:38.629 --> 00:02:42.740 Tor tries to solve a couple of different problems. On the one hand 00:02:42.740 --> 00:02:46.239 it allows you to bypass censorship. So if you’re in a country like China, which 00:02:46.239 --> 00:02:51.010 blocks some types of traffic you can use Tor to bypass their censorship blocks. 00:02:51.010 --> 00:02:55.541 It tries to give you privacy, so, at some point in the network someone can’t see 00:02:55.541 --> 00:02:59.200 what you’re doing. And at another point in the network there are people who don’t know 00:02:59.200 --> 00:03:02.540 who you are but may be able to see what you’re doing. 00:03:02.540 --> 00:03:07.099 Now the traditional case for this is to look at VPNs. 00:03:07.099 --> 00:03:10.669 With a VPN you have sort of a single provider. 00:03:10.669 --> 00:03:14.689 You have lots of users connecting to the VPN. The VPN has sort of 00:03:14.689 --> 00:03:18.240 a mixing effect from an outside or a server’s point of view. And then 00:03:18.240 --> 00:03:22.499 out of the VPN you see requests to Twitter, Wikipedia etc. etc. 00:03:22.499 --> 00:03:26.830 And if that traffic isn’t encrypted then the VPN can also read the contents 00:03:26.830 --> 00:03:30.980 of the traffic. Now of course there is a fundamental weakness with this. 00:03:30.980 --> 00:03:35.730 You have to trust the VPN provider: the VPN provider knows both who you are 00:03:35.730 --> 00:03:39.629 and what you’re doing and can link those two together with absolute 00:03:39.629 --> 00:03:43.580 certainty. So you don’t… whilst you do get some of these properties, assuming 00:03:43.580 --> 00:03:48.069 you’ve got a trustworthy VPN provider, you don’t get them in the face of 00:03:48.069 --> 00:03:51.609 an untrustworthy VPN provider. And of course: how do you trust the VPN 00:03:51.609 --> 00:03:59.319 provider? What sort of measure do you use? That’s sort of an open question. 00:03:59.319 --> 00:04:03.729 So Tor tries to solve this problem by distributing the trust. Tor is 00:04:03.729 --> 00:04:07.500 an open source project, so you can go on to their Git repository, you can 00:04:07.500 --> 00:04:12.620 download the source code, and change it, improve it, submit patches etc. 00:04:12.620 --> 00:04:17.108 As you heard earlier, during Jacob and Roger’s talk they’re currently partly 00:04:17.108 --> 00:04:20.949 sponsored by the US Government which seems a bit paradoxical, but they explained 00:04:20.949 --> 00:04:24.770 in that talk, many of the… that doesn’t affect their judgment.
00:04:24.770 --> 00:04:28.540 And indeed, they do have some funding from other sources, and they design that system 00:04:28.540 --> 00:04:30.841 – which I’ll talk about a little bit later – in a way where they don’t have 00:04:30.841 --> 00:04:34.230 to trust each other. So there’s sort of some redundancy, and they’re trying 00:04:34.230 --> 00:04:39.650 to minimize these sort of trust issues related to this. Now, Tor is 00:04:39.650 --> 00:04:43.310 a partially de-centralized network, which means that it has some centralized 00:04:43.310 --> 00:04:47.870 components which are under the control of the Tor Project and some de-centralized 00:04:47.870 --> 00:04:51.190 components which are normally the Tor relays. If you run a relay you’re 00:04:51.190 --> 00:04:56.290 one of those de-centralized components. There is, however, no single authority 00:04:56.290 --> 00:05:01.110 on the Tor network. So no single server which is responsible, 00:05:01.110 --> 00:05:04.290 which you’re required to trust. So the trust is somewhat distributed, 00:05:04.290 --> 00:05:12.000 but not entirely. When you establish a circuit through Tor you, the user, 00:05:12.000 --> 00:05:15.500 download a list of all of the relays inside the Tor network. 00:05:15.500 --> 00:05:19.070 And you get to pick – and I’ll tell you how you do that – which relays 00:05:19.070 --> 00:05:22.750 you’re going to use to route your traffic through. So here is a typical example: 00:05:22.750 --> 00:05:27.090 You’re here on the left hand side as the user. You download a list of the relays 00:05:27.090 --> 00:05:32.010 inside the Tor network and you select from that list three nodes, a guard node 00:05:32.010 --> 00:05:36.580 which is your entry into the Tor network, a relay node which is a middle node. 00:05:36.580 --> 00:05:39.010 Essentially, it’s going to route your traffic to a third hop. And then 00:05:39.010 --> 00:05:42.650 the third hop is the exit node where your traffic essentially exits out 00:05:42.650 --> 00:05:46.840 on the internet. Now, looking at the circuit. So this is a circuit through 00:05:46.840 --> 00:05:50.170 the Tor network through which you’re going to route your traffic. There are 00:05:50.170 --> 00:05:52.540 three layers of encryption at the beginning, so between you 00:05:52.540 --> 00:05:56.150 and the guard node. Your traffic is encrypted three times. 00:05:56.150 --> 00:05:59.330 In the first instance it’s encrypted to the guard, and then it’s encrypted again, 00:05:59.330 --> 00:06:03.180 to the relay, and then encrypted again to the exit, and as the traffic moves 00:06:03.180 --> 00:06:08.710 through the Tor network each of those layers of encryption is unpeeled 00:06:08.710 --> 00:06:17.300 from the data. The Guard here in this case knows who you are, and the exit relay 00:06:17.300 --> 00:06:21.590 knows what you’re doing but neither knows both. And the middle relay doesn’t really 00:06:21.590 --> 00:06:26.710 know a lot, except for which relay is the guard and which relay is the exit. 00:06:26.710 --> 00:06:31.870 Who runs an exit relay? So if you run an exit relay all of the traffic which 00:06:31.870 --> 00:06:36.210 users are sending out on the internet appears to come from your IP address. 00:06:36.210 --> 00:06:41.360 So running an exit relay is potentially risky because someone may do something 00:06:41.360 --> 00:06:45.590 through your relay which attracts attention.
And then, when law enforcement 00:06:45.590 --> 00:06:48.940 traces that back to an IP address it’s going to come back to your address. 00:06:48.940 --> 00:06:51.790 So some relay operators have had trouble with this, with law enforcement coming 00:06:51.790 --> 00:06:55.360 to them, and saying: “Hey we got this traffic coming through your IP address 00:06:55.360 --> 00:06:57.950 and you have to go and explain it.” So if you want to run an exit relay 00:06:57.950 --> 00:07:01.400 it’s a little bit risky, but we’re thankful for those people that do run exit relays 00:07:01.400 --> 00:07:04.870 because ultimately if people didn’t run an exit relay you wouldn’t be able 00:07:04.870 --> 00:07:08.000 to get out of the Tor network, and it wouldn’t be terribly useful from this 00:07:08.000 --> 00:07:20.560 point of view. So, yes. applause 00:07:20.560 --> 00:07:24.610 So every Tor relay, when you set up a Tor relay you publish something called 00:07:24.610 --> 00:07:28.780 a descriptor which describes your Tor relay and how to use it to a set 00:07:28.780 --> 00:07:33.430 of servers called the authorities. And the trust in the Tor network is essentially 00:07:33.430 --> 00:07:38.610 split across these authorities. They’re run by the core Tor Project members. 00:07:38.610 --> 00:07:42.639 And they maintain a list of all of the relays in the network. And they observe 00:07:42.639 --> 00:07:46.010 them over a period of time. If the relays exhibit certain properties they give 00:07:46.010 --> 00:07:50.480 the relays flags. If e.g. a relay allows traffic to exit from the Tor network 00:07:50.480 --> 00:07:54.450 it will get the ‘Exit’ flag. If they’ve been switched on for a certain period of time, 00:07:54.450 --> 00:07:58.400 or carried a certain amount of traffic, they’ll be allowed to become a guard relay 00:07:58.400 --> 00:08:02.180 which is the first node in your circuit. So when you build your circuit you 00:08:02.180 --> 00:08:07.230 download a list of these descriptors from one of the Directory Authorities. You look 00:08:07.230 --> 00:08:10.120 at the flags which have been assigned to each of the relays, and then you pick 00:08:10.120 --> 00:08:14.150 your route based on that. So you’ll pick the guard node from a set of relays 00:08:14.150 --> 00:08:16.400 which have the ‘Guard’ flag, your exit from the set of relays which have 00:08:16.400 --> 00:08:20.860 the ‘Exit’ flag etc. etc. Now, as of a quick count this morning there are 00:08:20.860 --> 00:08:29.229 about 1500 guard relays, around 1000 exit relays, and six relays flagged as ‘bad’ exits. 00:08:29.229 --> 00:08:34.360 What does a ‘bad exit’ mean? waits for audience to respond 00:08:34.360 --> 00:08:37.759 That’s not good! That’s exactly what it means! Yes! laughs 00:08:37.759 --> 00:08:40.450 applause 00:08:40.450 --> 00:08:45.569 So relays which have been flagged as ‘bad exits’ your client will never choose to exit 00:08:45.569 --> 00:08:50.660 traffic through. And examples of things which may get a relay flagged as a 00:08:50.660 --> 00:08:53.829 [bad] exit relay – if they’re fiddling with the traffic which is coming out of 00:08:53.829 --> 00:08:57.019 the Tor relay. Or doing things like man-in-the-middle attacks against 00:08:57.019 --> 00:09:01.629 SSL traffic.
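To make the flag-based path selection and the layered encryption just described a little more concrete, here is a minimal conceptual sketch in Python. It is not how the Tor client is implemented: the relay list, the flags and the string-based “encryption” are invented placeholders, and real Tor also weights choices by bandwidth, runs a key exchange with each hop, and encrypts with per-hop symmetric keys.

```python
import random

# Hypothetical relay list; a real client downloads a signed consensus
# from the directory authorities, with flags assigned as described above.
RELAYS = [
    {"nick": "relayA", "flags": {"Guard", "Fast", "Stable"}},
    {"nick": "relayB", "flags": {"Fast"}},
    {"nick": "relayC", "flags": {"Exit", "Fast"}},
    {"nick": "relayD", "flags": {"Guard", "Exit", "Fast"}},
    {"nick": "relayE", "flags": {"Fast", "Stable"}},
]

def pick_path(relays):
    """Pick guard, middle and exit from appropriately flagged relays.
    (Real Tor also weights by bandwidth, avoids BadExit, same-family
    and same-network relays, and so on.)"""
    guard  = random.choice([r for r in relays if "Guard" in r["flags"]])
    exit_  = random.choice([r for r in relays
                            if "Exit" in r["flags"] and r is not guard])
    middle = random.choice([r for r in relays if r not in (guard, exit_)])
    return guard, middle, exit_

def onion_wrap(payload, hops):
    """Wrap the payload once per hop; the innermost layer is the exit's."""
    for hop in reversed(hops):                 # exit first, guard last
        payload = f"enc[{hop['nick']}]({payload})"
    return payload

def onion_unwrap(cell, hops):
    """Each hop peels exactly one layer as the cell moves along the circuit."""
    for hop in hops:
        prefix = f"enc[{hop['nick']}]("
        assert cell.startswith(prefix)
        cell = cell[len(prefix):-1]
    return cell

guard, middle, exit_ = pick_path(RELAYS)
cell = onion_wrap("GET / HTTP/1.1", [guard, middle, exit_])
print(cell)                                    # three nested layers
print(onion_unwrap(cell, [guard, middle, exit_]))
```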
We’ve seen various things, there have been relays man-in-the-middling 00:09:01.629 --> 00:09:07.050 SSL traffic, there has very, very recently been an exit relay which was patching 00:09:07.050 --> 00:09:10.800 binaries that you downloaded from the internet, inserting malware into the binaries. 00:09:10.800 --> 00:09:14.630 So you can do these things but the Tor Project tries to scan for them. And if 00:09:14.630 --> 00:09:19.829 these things are detected then they’ll be flagged as ‘Bad Exits’. It’s true to say 00:09:19.829 --> 00:09:24.610 that the scanning mechanism is not 100% fool-proof by any stretch of the imagination. 00:09:24.610 --> 00:09:28.559 It tries to pick up common types of attacks, so as a result 00:09:28.559 --> 00:09:32.480 it won’t pick up unknown attacks or attacks which haven’t been seen or 00:09:32.480 --> 00:09:36.680 have not been known about beforehand. 00:09:36.680 --> 00:09:45.370 So looking at this, how do you deanonymise the traffic travelling through the Tor 00:09:45.370 --> 00:09:49.449 network? Given some traffic coming out of the exit relay, how do you know 00:09:49.449 --> 00:09:54.269 which user that corresponds to? What is their IP address? You can’t actually 00:09:54.269 --> 00:09:58.279 modify the traffic because if any of the relays tried to modify the traffic 00:09:58.279 --> 00:10:02.249 which they’re sending through the network Tor will tear down the circuit through the relay. 00:10:02.249 --> 00:10:06.290 So there are these integrity checks at each of the hops. And if you try to sort of 00:10:06.290 --> 00:10:09.870 – because you can’t decrypt the packet you can’t modify it in any meaningful way, 00:10:09.870 --> 00:10:13.749 and because there’s an integrity check at the next hop that means that you can’t 00:10:13.749 --> 00:10:17.019 modify the packet because otherwise it’s detected. So you can’t do this sort of 00:10:17.019 --> 00:10:20.900 marker, and try and follow the marker through the network. So instead 00:10:20.900 --> 00:10:26.699 what you can do if you control… so let me give you two cases. In the worst case 00:10:26.699 --> 00:10:31.330 if the attacker controls all three of your relays that you pick, which is an unlikely 00:10:31.330 --> 00:10:34.739 scenario as they’d need to control quite a big proportion of the network. Then 00:10:34.739 --> 00:10:39.550 it should be quite obvious that they can work out who you are and also 00:10:39.550 --> 00:10:42.369 see what you’re doing because in that case they can tag the traffic, and 00:10:42.369 --> 00:10:45.709 they can just discard these integrity checks at each of the following hops. 00:10:45.709 --> 00:10:50.709 Now in a different case, if you control the Guard relay and the exit relay 00:10:50.709 --> 00:10:54.160 but not the middle relay the Guard relay can’t tamper with the traffic because 00:10:54.160 --> 00:10:57.660 this middle relay will close down the circuit as soon as it happens. 00:10:57.660 --> 00:11:01.130 The exit relay can’t send stuff back down the circuit to try and identify the user, 00:11:01.130 --> 00:11:05.030 either. Because again, the circuit will be closed down. So what can you do? 00:11:05.030 --> 00:11:09.869 Well, you can count the number of packets going through the Guard node. And you can 00:11:09.869 --> 00:11:14.690 measure the timing differences between packets, and try and spot that pattern 00:11:14.690 --> 00:11:18.750 at the Exit relays.
You’re looking at counts of packets and the timing between those 00:11:18.750 --> 00:11:22.360 packets which are being sent, and essentially trying to correlate them all. 00:11:22.360 --> 00:11:26.869 So if a user happens to pick you as their Guard node, and then happens to pick 00:11:26.869 --> 00:11:31.850 your exit relay, then you can deanonymise them with very high probability using 00:11:31.850 --> 00:11:35.649 this technique. You’re just correlating the timings of packets and counting 00:11:35.649 --> 00:11:38.889 the number of packets going through. And the attacks demonstrated in literature 00:11:38.889 --> 00:11:44.509 are very reliable for this. We heard earlier from the Tor talk about the “relay 00:11:44.509 --> 00:11:50.739 early” tag which was the attack discovered by the CERT researchers in the US. 00:11:50.739 --> 00:11:55.050 That attack didn’t rely on timing attacks. Instead, what they were able to do was 00:11:55.050 --> 00:11:58.720 send a special type of cell containing the data back down the circuit, 00:11:58.720 --> 00:12:01.889 essentially marking this data, and saying: “This is the data we’re seeing 00:12:01.889 --> 00:12:06.149 at the Exit relay, or at the hidden service”, and encode into the messages 00:12:06.149 --> 00:12:10.049 travelling back down the circuit, what the data was. And then you could pick 00:12:10.049 --> 00:12:14.269 those up at the Guard relay and say, okay, whether it’s this person that’s doing that. 00:12:14.269 --> 00:12:18.370 In fact, although this technique works, and yeah it was a very nice attack, 00:12:18.370 --> 00:12:21.269 the traffic correlation attacks are actually just as powerful. 00:12:21.269 --> 00:12:25.259 So although this bug has been fixed traffic correlation attacks still work and are 00:12:25.259 --> 00:12:29.739 still fairly, fairly reliable. So the problem still does exist. This is very much 00:12:29.739 --> 00:12:33.399 an open question. How do we solve this problem? We don’t know, currently, 00:12:33.399 --> 00:12:40.040 how to solve this problem of trying to tackle the traffic correlation. 00:12:40.040 --> 00:12:45.369 There are a couple of solutions. But they’re not particularly… 00:12:45.369 --> 00:12:48.569 they’re not particularly reliable. Let me just go through these, and I’ll skip back 00:12:48.569 --> 00:12:53.061 on the few things I’ve missed. The first thing is, high-latency networks, so 00:12:53.061 --> 00:12:56.999 networks where packets are delayed in their transit through the network. 00:12:56.999 --> 00:13:00.740 That throws away a lot of the timing information. So they promise 00:13:00.740 --> 00:13:03.800 to potentially solve this problem. But of course, if you want to visit 00:13:03.800 --> 00:13:06.779 Google’s home page, and you have to wait five minutes for it, you’re simply 00:13:06.779 --> 00:13:11.910 just not going to use Tor. The whole point is trying to make this technology usable. 00:13:11.910 --> 00:13:14.759 And if you’ve got something which is very, very slow then it doesn’t make it 00:13:14.759 --> 00:13:18.269 attractive to use. But of course, this case does work slightly better 00:13:18.269 --> 00:13:22.059 for e-mail. If you think about it with e-mail, you don’t mind if your e-mail 00:13:22.059 --> 00:13:25.399 – well, you may not mind, you may mind – you don’t mind if your e-mail is delayed 00:13:25.399 --> 00:13:29.120 by some period of time. Which makes this somewhat difficult.
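The packet-count and inter-packet-timing correlation described above can be illustrated with a toy sketch: an observer at a malicious guard and at a malicious exit each record the gaps between packets and cross-correlate the two sequences. All of the traffic here is synthetic, and real attacks use far more robust statistics than a plain Pearson correlation.

```python
import random

random.seed(1)

def flow(n):
    """Synthetic inter-packet gaps (seconds) for one circuit."""
    return [random.expovariate(20) for _ in range(n)]

def jitter(gaps, sigma=0.002):
    """Roughly what the same flow looks like after crossing the middle relay."""
    return [max(0.0, g + random.gauss(0, sigma)) for g in gaps]

def correlation(a, b):
    """Plain Pearson correlation of two equal-length gap sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a) ** 0.5
    vb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (va * vb)

alice_at_guard = flow(500)                 # gaps the malicious guard records
alice_at_exit  = jitter(alice_at_guard)    # the same flow seen at the exit
unrelated_user = flow(500)                 # some other circuit at the exit

print(correlation(alice_at_guard, alice_at_exit))    # close to 1.0
print(correlation(alice_at_guard, unrelated_user))   # close to 0.0
```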
And as Roger said 00:13:29.120 --> 00:13:35.130 earlier, you can also introduce padding into the circuit, so these are dummy cells. 00:13:35.130 --> 00:13:39.839 But, but… with a big caveat: some of the research suggests that actually you’d 00:13:39.839 --> 00:13:43.439 need to introduce quite a lot of padding to defeat these attacks, and that would 00:13:43.439 --> 00:13:47.179 overload the Tor network in its current state. So, again, not a particularly 00:13:47.179 --> 00:13:53.860 practical solution. 00:13:53.860 --> 00:13:58.279 How does Tor try to solve this problem? Well, Tor makes it very difficult 00:13:58.279 --> 00:14:03.171 to become a user’s Guard relay. If you can’t become a user’s Guard relay 00:14:03.171 --> 00:14:07.839 then you don’t know who the user is, quite simply. And so by making it very hard 00:14:07.839 --> 00:14:13.249 to become the Guard relay, you can’t do this traffic correlation attack. 00:14:13.249 --> 00:14:17.579 So at the moment the Tor client chooses one Guard relay and keeps it for a period 00:14:17.579 --> 00:14:22.259 of time. So if I want to sort of target just one of you I would need to control 00:14:22.259 --> 00:14:26.259 the Guard relay that you were using at that particular point in time. And in fact 00:14:26.259 --> 00:14:30.679 I’d also need to know what that Guard relay is. So by making it very unlikely 00:14:30.679 --> 00:14:34.129 that you would select a particular malicious Guard relay, where the number of malicious 00:14:34.129 --> 00:14:39.179 Guard relays is very small, that’s how Tor tries to solve this problem. And 00:14:39.179 --> 00:14:43.280 at the moment your Guard relay is your barrier of security. If the attacker can’t 00:14:43.280 --> 00:14:46.460 control the Guard relay then they won’t know who you are. That doesn’t mean 00:14:46.460 --> 00:14:50.639 they can’t try other sort of side channel attacks by messing with the traffic 00:14:50.639 --> 00:14:55.129 at the Exit relay etc. You know that you may sort of e.g. download dodgy documents 00:14:55.129 --> 00:14:59.499 and open one on your computer, and those sort of things. Now the alternative 00:14:59.499 --> 00:15:02.769 of course to having a Guard relay and keeping it for a very long time 00:15:02.769 --> 00:15:06.029 would be to have a Guard relay and to change it on a regular basis. 00:15:06.029 --> 00:15:09.929 Because you might think, well, just choosing one Guard relay and sticking with it 00:15:09.929 --> 00:15:13.399 is probably a bad idea. But actually, that’s not the case. If you pick 00:15:13.399 --> 00:15:18.370 the Guard relay, and assuming that the chance of picking a Guard relay that is 00:15:18.370 --> 00:15:22.800 malicious is very low, then, when you first use your Guard relay, if you’ve got 00:15:22.800 --> 00:15:27.420 a good choice, then your traffic is safe. If you haven’t got a good choice then 00:15:27.420 --> 00:15:31.759 your traffic isn’t safe. Whereas if your Tor client chooses a Guard relay 00:15:31.759 --> 00:15:35.610 every few minutes, or every hour, or something along those lines, at some point 00:15:35.610 --> 00:15:39.179 you’re gonna pick a malicious Guard relay. So they’re gonna have some of your traffic 00:15:39.179 --> 00:15:43.399 but not all of it. And so currently the trade-off is that we make it very difficult 00:15:43.399 --> 00:15:48.490 for an attacker to control a Guard relay and the user picks a Guard relay and 00:15:48.490 --> 00:15:52.449 keeps it for a long period of time.
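A rough worked example of the guard trade-off just described, with assumed numbers only (1% malicious guard capacity, a one-year window): a long-lived guard is an all-or-nothing gamble, while rotating guards almost guarantees that some of your traffic eventually passes through a bad one.

```python
# Assumed, illustrative figures: the attacker controls a fraction c of
# guard capacity; the client either keeps one guard for a year or picks
# a fresh guard every day for 365 days.
c = 0.01
days = 365

# Long-lived guard: either fully safe or fully compromised.
p_ever_compromised_fixed = c                        # ~1% of users lose everything

# Daily rotation: nearly every user hits a bad guard on *some* days.
p_ever_compromised_rotating = 1 - (1 - c) ** days   # ≈ 0.97
expected_bad_days = c * days                        # ≈ 3.65 days of exposed traffic

print(p_ever_compromised_fixed)
print(round(p_ever_compromised_rotating, 3))
print(expected_bad_days)
```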
And so it’s very difficult for the attackers 00:15:52.449 --> 00:15:58.939 to pick that Guard relay when they control a very small proportion of the network. 00:15:58.939 --> 00:16:06.420 So this, currently, provides those properties I described earlier, the privacy 00:16:06.420 --> 00:16:11.410 and the anonymity when you’re browsing the web, when you’re accessing websites etc. 00:16:11.410 --> 00:16:16.519 But still you know who the website is. So although you’re anonymous and the website 00:16:16.519 --> 00:16:20.730 doesn’t know who you are you know who the website is. And there may be some cases 00:16:20.730 --> 00:16:25.499 where e.g. the website would also wish to remain anonymous. You want the person 00:16:25.499 --> 00:16:29.970 accessing the website and the website itself to be anonymous to each other. 00:16:29.970 --> 00:16:34.230 And you could think about people e.g. being in countries where running 00:16:34.230 --> 00:16:39.730 a political blog e.g. might be a dangerous activity. If you run that on a regular 00:16:39.730 --> 00:16:45.660 webserver you’re easily identified whereas, if you got some way where you as 00:16:45.660 --> 00:16:49.490 the webserver can be anonymous then that allows you to do that activity without 00:16:49.490 --> 00:16:57.480 being targeted by your government. So this is what hidden services try to solve. 00:16:57.480 --> 00:17:03.080 Now when you first think about a problem you kind of think: “Hang on a second, 00:17:03.080 --> 00:17:06.429 the user doesn’t know who the website is and the website doesn’t know 00:17:06.429 --> 00:17:09.890 who the user is. So how on earth do they talk to each other?” Well, that’s essentially 00:17:09.890 --> 00:17:14.220 what the Tor hidden service protocol tries to sort of set up. How do you identify and 00:17:14.220 --> 00:17:19.579 connect to each other. So at the moment this is what happens: We’ve got Bob 00:17:19.579 --> 00:17:23.780 on the [right] hand side who is the hidden service. And we got Alice on the left hand 00:17:23.780 --> 00:17:28.620 side here who is the user who wishes to visit the hidden service. Now when Bob 00:17:28.620 --> 00:17:34.190 sets up his hidden service he picks three nodes in the Tor network as introduction 00:17:34.190 --> 00:17:38.831 points and builds several hop circuits to them. So the introduction points don’t know 00:17:38.831 --> 00:17:44.680 who Bob is. Bob has circuits to them. And Bob says to each of these introduction points 00:17:44.680 --> 00:17:48.240 “Will you relay traffic to me if someone connects to you asking for me?” 00:17:48.240 --> 00:17:53.030 And then those introduction points do that. So then, once Bob has picked 00:17:53.030 --> 00:17:56.840 his introduction points he publishes a descriptor describing the list of his 00:17:56.840 --> 00:18:01.310 introduction points for someone who wishes to come onto his websites. And then Alice 00:18:01.310 --> 00:18:06.700 on the left hand side wishing to visit Bob will pick a rendezvous point in the network 00:18:06.700 --> 00:18:10.030 and build a circuit to it. So this “RP” here is the rendezvous point. 00:18:10.030 --> 00:18:14.530 And she will relay a message via one of the introduction points saying to Bob: 00:18:14.530 --> 00:18:18.290 “Meet me at the rendezvous point”. And then Bob will build a 3-hop-circuit 00:18:18.290 --> 00:18:22.870 to the rendezvous point. 
So now at this stage we got Alice with a multi-hop circuit 00:18:22.870 --> 00:18:26.890 to the rendezvous point, and Bob with a multi-hop circuit to the rendezvous point. 00:18:26.890 --> 00:18:32.550 Alice and Bob haven’t connected to one another directly. The rendezvous point 00:18:32.550 --> 00:18:36.530 doesn’t know who Bob is, the rendezvous point doesn’t know who Alice is. 00:18:36.530 --> 00:18:40.261 All they’re doing is forwarding the traffic. And they can’t inspect the traffic, 00:18:40.261 --> 00:18:43.740 either, because the traffic itself is encrypted. 00:18:43.740 --> 00:18:47.530 So that’s currently how you solve this problem with trying to communicate 00:18:47.530 --> 00:18:50.820 with someone who you don’t know who they are and vice versa. 00:18:50.820 --> 00:18:55.740 drinks from the bottle 00:18:55.740 --> 00:18:58.870 The principal thing I’m going to talk about today is this database. 00:18:58.870 --> 00:19:01.990 So I said, Bob, when he picks his introduction points he builds this thing 00:19:01.990 --> 00:19:06.080 called a descriptor, describing who his introduction points are, and he publishes 00:19:06.080 --> 00:19:10.390 it to a database. This database itself is distributed throughout the Tor network. 00:19:10.390 --> 00:19:17.860 It’s not a single server. So both Bob and Alice need to be able to publish information 00:19:17.860 --> 00:19:22.040 to this database, and also retrieve information from this database. And Tor 00:19:22.040 --> 00:19:24.820 currently uses something called a distributed hash table, and I’m gonna 00:19:24.820 --> 00:19:27.930 give an example of what this means and how it works. And then I’ll talk to you 00:19:27.930 --> 00:19:34.380 specifically about how the Tor Distributed Hash Table itself works. So let’s say e.g. 00:19:34.380 --> 00:19:39.830 you’ve got a set of servers. So here we’ve got 26 servers and you’d like to store 00:19:39.830 --> 00:19:44.240 your files across these different servers without having a single server responsible 00:19:44.240 --> 00:19:48.050 for deciding, “okay, that file is stored on that server, and this file is stored 00:19:48.050 --> 00:19:53.050 on that server” etc. etc. Now here is my list of files. You could take a very naive 00:19:53.050 --> 00:19:57.740 approach. And you could say: “Okay, I’ve got 26 servers, I’ve got all of these file names 00:19:57.740 --> 00:20:01.250 and they each start with a letter of the alphabet.” And I could say: “All of the files that begin 00:20:01.250 --> 00:20:05.450 with A are gonna go on server A; all of the files that begin with B are gonna go 00:20:05.450 --> 00:20:09.900 on server B etc.” And then when you want to retrieve a file you say: “Okay, what 00:20:09.900 --> 00:20:13.950 does my file name begin with?” And then you know which server it’s stored on. 00:20:13.950 --> 00:20:17.750 Now of course you could have a lot of servers – sorry – a lot of files 00:20:17.750 --> 00:20:22.780 which begin with a Z, an X or a Y etc. in which case you’re gonna overload 00:20:22.780 --> 00:20:27.310 that server. You’re gonna have more files stored on one server than on another server 00:20:27.310 --> 00:20:32.150 in your set. And if you have a lot of big files, say e.g. beginning with B then 00:20:32.150 --> 00:20:35.520 rather than distributing your files across all the servers you’re gonna just be 00:20:35.520 --> 00:20:39.060 overloading one or two of them.
So to solve this problem what we tend to do is: 00:20:39.060 --> 00:20:42.410 we take the file name, and we run it through a cryptographic hash function. 00:20:42.410 --> 00:20:46.930 A hash function produces output which looks random; very small changes 00:20:46.930 --> 00:20:50.740 in the input to a cryptographic hash function produce a very large change 00:20:50.740 --> 00:20:55.240 in the output. And this change looks random. So if I take all of my file names 00:20:55.240 --> 00:20:59.820 here, and assuming I have a lot more, I take a hash of them, and then I use 00:20:59.820 --> 00:21:05.470 that hash to determine which server to store the file on. Then, with high probability 00:21:05.470 --> 00:21:09.670 my files will be distributed evenly across all of the servers. And then when I want 00:21:09.670 --> 00:21:12.990 to go and retrieve one of the files I take my file name, I run it through the 00:21:12.990 --> 00:21:15.980 cryptographic hash function, that gives me the hash, and then I use that hash 00:21:15.980 --> 00:21:19.740 to identify which server that particular file is stored on. And then I go and 00:21:19.740 --> 00:21:25.990 retrieve it. So that’s the sort of a loose idea of how a distributed hash table works. 00:21:25.990 --> 00:21:29.340 There are a couple of problems with this. What if the number of servers 00:21:29.340 --> 00:21:34.700 you’ve got changes in size, as it does in the Tor network? 00:21:34.700 --> 00:21:42.290 That’s a very brief overview of the theory. So how does it apply for the Tor network? 00:21:42.290 --> 00:21:47.640 Well, the Tor network has a set of relays and it has a set of hidden services. 00:21:47.640 --> 00:21:52.710 Now we take all of the relays, and they have a hash identity which identifies them. 00:21:52.710 --> 00:21:57.460 And we map them onto a circle using that hash value as an identifier. So you can 00:21:57.460 --> 00:22:03.230 imagine the hash value ranging from Zero to a very large number. We got a Zero point 00:22:03.230 --> 00:22:07.280 at the very top there. And that runs all the way round to the very large number. 00:22:07.280 --> 00:22:12.130 So given the identity hash for a relay we can map that to a particular point on 00:22:12.130 --> 00:22:19.070 the circle. And then all we have to do is also do this for hidden services. 00:22:19.070 --> 00:22:22.320 So there’s a hidden service address, something.onion, so this is 00:22:22.320 --> 00:22:27.750 one of the hidden websites that you might visit. You take the – I’m not gonna describe 00:22:27.750 --> 00:22:33.980 in too much detail how this is done but – the value is computed in such a way that 00:22:33.980 --> 00:22:38.020 it’s evenly distributed about the circle. So your hidden service will have 00:22:38.020 --> 00:22:44.240 a particular point on the circle. And the relays will also be mapped onto this circle. 00:22:44.240 --> 00:22:49.640 So there’s the relays. And the hidden service. And in the case of Tor 00:22:49.640 --> 00:22:53.460 the hidden service actually maps to two positions on the circle, and it publishes 00:22:53.460 --> 00:22:57.850 its descriptor to the three relays to the right at one position, and the three relays 00:22:57.850 --> 00:23:01.600 to the right at another position. So there are actually in total six places where 00:23:01.600 --> 00:23:05.060 this descriptor is published on the circle.
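A schematic of the ring lookup just described, in Python: relay identities and a service’s two replica points are hashed onto a circle, and each replica is stored on the three relays immediately clockwise of it, giving six responsible relays in total. The relay names and the onion address are made up, and the real scheme derives the two points from the service’s key, the current day and the replica index (which is why the positions move every 24 hours), not from the bare address as done here.

```python
import hashlib
from bisect import bisect_right

def point(data: bytes) -> int:
    """Map arbitrary data to a position on the circle (160-bit SHA-1)."""
    return int.from_bytes(hashlib.sha1(data).digest(), "big")

# Hypothetical directory relays (in reality, HSDir-flagged relays).
relays = [f"relay{i}".encode() for i in range(40)]
ring = sorted((point(r), r) for r in relays)
positions = [p for p, _ in ring]

def successors(target: int, k: int = 3):
    """The k relays immediately clockwise of a target point on the circle."""
    i = bisect_right(positions, target)
    return [ring[(i + j) % len(ring)][1] for j in range(k)]

# Two replica points for one made-up onion address; a real client would
# compute descriptor IDs that also depend on the date and replica number.
onion = b"abcdefghijklmnop.onion"
for replica in (0, 1):
    desc_id = point(onion + bytes([replica]))
    print(replica, successors(desc_id))   # three relays each -> six in total
```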
And then if I want to go and 00:23:05.060 --> 00:23:09.450 connect to a hidden service I go and pull this hidden service descriptor 00:23:09.450 --> 00:23:13.780 down to identify what its introduction points are. I take the hidden service 00:23:13.780 --> 00:23:17.200 address, I find out where it is on the circle, I map all of the relays onto 00:23:17.200 --> 00:23:21.110 the circle, and then I identify which relays on the circle are responsible 00:23:21.110 --> 00:23:24.031 for that particular hidden service. And I just connect, then I say: “Do you have 00:23:24.031 --> 00:23:26.630 a copy of the descriptor for that particular hidden service?” 00:23:26.630 --> 00:23:29.620 And if so then we’ve got our list of introduction points. And we can go 00:23:29.620 --> 00:23:38.020 to the next steps to connect to our hidden service. So I’m gonna explain how we 00:23:38.020 --> 00:23:41.320 sort of set up our experiments. What we thought, or what we were interested to do, 00:23:41.320 --> 00:23:48.181 was collect publications of hidden services. So every time a hidden service 00:23:48.181 --> 00:23:51.520 gets set up it publishes to this distributed hash table. What we wanted to do was 00:23:51.520 --> 00:23:55.750 collect those publications so that we get a complete list of all of the hidden 00:23:55.750 --> 00:23:59.280 services. And what we also wanted to do is to find out how many times a particular 00:23:59.280 --> 00:24:06.300 hidden service is requested. 00:24:06.300 --> 00:24:10.540 Just one more point that will become important later. 00:24:10.540 --> 00:24:14.230 The position at which the hidden service appears on the circle changes 00:24:14.230 --> 00:24:18.950 every 24 hours. So there’s not a fixed position every single day. 00:24:18.950 --> 00:24:24.370 If we run 40 nodes over a long period of time we will occupy positions within 00:24:24.370 --> 00:24:29.570 that distributed hash table. And we will be able to collect publications and requests 00:24:29.570 --> 00:24:34.300 for hidden services that are located at that position inside the distributed 00:24:34.300 --> 00:24:39.251 hash table. So in that case we ran 40 Tor nodes, we had a student at university 00:24:39.251 --> 00:24:43.950 who said: “Hey, I run a hosting company, I got loads of server capacity”, and 00:24:43.950 --> 00:24:46.580 we told him what we were doing, and he said: “Well, you really helped us out, 00:24:46.580 --> 00:24:49.820 these last couple of years…” and just gave us loads of server capacity 00:24:49.820 --> 00:24:55.500 to allow us to do this. So we spun up 40 Tor nodes. Each Tor node was required 00:24:55.500 --> 00:24:59.560 to advertise a certain amount of bandwidth to become a part of that distributed 00:24:59.560 --> 00:25:02.200 hash table. It’s actually a very small amount, so this didn’t matter too much. 00:25:02.200 --> 00:25:06.050 And then, after a certain time – this has changed recently in the last few days, 00:25:06.050 --> 00:25:10.070 it used to be 25 hours, it’s just been increased as a result of one of the 00:25:10.070 --> 00:25:14.570 attacks last week, but certainly during our study it was 25 hours – you then 00:25:14.570 --> 00:25:18.300 appear at a particular point inside that distributed hash table. And you’re then 00:25:18.300 --> 00:25:22.750 in a position to record publications of hidden services and requests for hidden 00:25:22.750 --> 00:25:27.810 services.
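A back-of-the-envelope calculation for how much of the distributed hash table such a collection sees, under assumed numbers only: suppose there are D directory-capable relays in total, m of them are ours, and every descriptor lands on six of them each day at effectively random positions relative to ours. The figures below are illustrative, not the study’s.

```python
from math import comb

# Assumed figures purely for illustration.
D = 3000   # total directory-capable relays (assumed)
m = 40     # relays we run
k = 6      # copies of each descriptor per day (2 replicas x 3 successors)

# Chance that at least one of the six responsible relays is ours on a
# given day (hypergeometric complement).
p_seen_per_day = 1 - comb(D - m, k) / comb(D, k)
print(round(p_seen_per_day, 3))          # ~0.077 with these made-up numbers

# Assuming (roughly) independent positions each day, a service that is
# up for the whole collection period is very likely seen at least once.
days = 160
print(round(1 - (1 - p_seen_per_day) ** days, 3))   # close to 1.0
```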
So not only can you get a full list of the onion addresses you can also 00:25:27.810 --> 00:25:32.250 find out how many times each of the onion addresses is requested. 00:25:32.250 --> 00:25:38.270 And so this is what we recorded. And then, once we had a full list of… or once 00:25:38.270 --> 00:25:41.830 we had run for a long period of time to collect a long list of .onion addresses 00:25:41.830 --> 00:25:46.850 we then built a custom crawler that would visit each of the Tor hidden services 00:25:46.850 --> 00:25:51.450 in turn, and pull down the HTML contents, the text content from the web page, 00:25:51.450 --> 00:25:54.760 so that we could go ahead and classify the content. Now it’s really important 00:25:54.760 --> 00:25:59.250 to know here, and it will become obvious why a little bit later, we only pulled down 00:25:59.250 --> 00:26:03.030 HTML content. We didn’t pull down images. And there’s a very, very important reason 00:26:03.030 --> 00:26:09.980 for that which will become clear shortly. 00:26:09.980 --> 00:26:13.520 We had a lot of questions when we first started this. No one really knew 00:26:13.520 --> 00:26:18.000 how many hidden services there were. It had been suggested to us there was a very high 00:26:18.000 --> 00:26:21.250 turn-over of hidden services. We wanted to confirm whether that was true or not. 00:26:21.250 --> 00:26:24.530 And we also wanted to ask: what are the hidden services, 00:26:24.530 --> 00:26:30.140 how popular are they, etc. etc. etc. So our estimate for how many hidden services 00:26:30.140 --> 00:26:34.770 there are, over the period which we ran our study, this is a graph plotting 00:26:34.770 --> 00:26:38.560 our estimate for each of the individual days as to how many hidden services 00:26:38.560 --> 00:26:44.850 there were on that particular day. Now the data is naturally noisy because we’re only 00:26:44.850 --> 00:26:48.590 a very small proportion of that circle. So we’re only observing a very small 00:26:48.590 --> 00:26:53.250 proportion of the total publications and requests every single day, for each of 00:26:53.250 --> 00:26:57.260 those hidden services. And if you take a long term average for this 00:26:57.260 --> 00:27:02.720 there’s about 45,000 hidden services that we think were present, on average, 00:27:02.720 --> 00:27:07.880 each day, during our entire study. Which is a large number of hidden services. 00:27:07.880 --> 00:27:11.070 But over the entire length we collected about 80,000, in total. 00:27:11.070 --> 00:27:14.270 Some came and went etc. So the next question after how many 00:27:14.270 --> 00:27:17.750 hidden services there are is how long the hidden service exists for. 00:27:17.750 --> 00:27:20.620 Does it exist for a very long period of time, does it exist for a very short 00:27:20.620 --> 00:27:24.220 period of time etc. etc. So what we did was, for every single 00:27:24.220 --> 00:27:30.260 .onion address we plotted how many times we saw a publication for that particular 00:27:30.260 --> 00:27:34.160 hidden service during the six months. How many times did we see it? 00:27:34.160 --> 00:27:38.100 If we saw it a lot of times that suggested in general the hidden service existed 00:27:38.100 --> 00:27:42.180 for a very long period of time. If we saw a very small number of publications 00:27:42.180 --> 00:27:45.760 for each hidden service then that suggests that they were only present 00:27:45.760 --> 00:27:51.690 for a very short period of time. This is our graph.
By far the largest number 00:27:51.690 --> 00:27:55.890 of hidden services we saw only once during the entire study. And we never saw them 00:27:55.890 --> 00:28:00.390 again. That suggests that there’s a very high turnover of the hidden services: they 00:28:00.390 --> 00:28:04.520 don’t tend, on average, to exist for a very long period of time. 00:28:04.520 --> 00:28:10.730 And then you can see the sort of a tail here. If we plot just those 00:28:10.730 --> 00:28:16.390 hidden services which existed for a long time, so e.g. we could take hidden services 00:28:16.390 --> 00:28:20.280 which have a high number of hit requests and say: “Okay, those that have a high number 00:28:20.280 --> 00:28:24.800 of hits probably existed for a long time.” That’s not absolutely certain, but probably. 00:28:24.800 --> 00:28:29.190 Then you see this sort of normal distribution around 4 or 5, so we saw on average 00:28:29.190 --> 00:28:34.870 most hidden services four or five times during the entire six months if they were 00:28:34.870 --> 00:28:40.530 popular and we’re using that as a proxy measure for whether they existed 00:28:40.530 --> 00:28:48.160 for the entire time. Now, this study was over 160 days, so almost six months. 00:28:48.160 --> 00:28:51.490 What we also wanted to do was try to confirm this over a longer period. 00:28:51.490 --> 00:28:56.310 So last year, in 2013, about February time some researchers of the University 00:28:56.310 --> 00:29:00.350 of Luxemburg also ran a similar study but it ran over a very short period of time, 00:29:00.350 --> 00:29:05.060 a single day. But they did it in such a way it could collect descriptors 00:29:05.060 --> 00:29:08.590 across much of the circle during a single day. That was because of a bug in the way 00:29:08.590 --> 00:29:12.020 Tor did some things, which has now been fixed, so we can’t repeat that 00:29:12.020 --> 00:29:16.520 particular approach. So we got a list of .onion addresses from February 2013 00:29:16.520 --> 00:29:18.960 from these researchers at the University of Luxemburg. And then we got our list 00:29:18.960 --> 00:29:23.670 of .onion addresses from this six months which was March to September of this year. 00:29:23.670 --> 00:29:26.700 And we wanted to say, okay, we’re given these two sets of .onion addresses. 00:29:26.700 --> 00:29:30.740 Which .onion addresses existed in his set but not ours and vice versa, and which 00:29:30.740 --> 00:29:39.740 .onion addresses existed in both sets? 00:29:39.740 --> 00:29:45.520 So as you can see a very small minority of hidden service addresses existed 00:29:45.520 --> 00:29:50.000 in both sets. This is over an 18 month period between these two collection points. 00:29:50.000 --> 00:29:54.430 A very small number of services existed in both his data set and in 00:29:54.430 --> 00:29:58.390 our data set. Which again suggested there’s a very high turnover of hidden 00:29:58.390 --> 00:30:02.920 services that don’t tend to exist for a very long period of time. 00:30:02.920 --> 00:30:06.530 So the question is why is that? Which we’ll come on to a little bit later. 00:30:06.530 --> 00:30:11.120 It’s a very valid question, can’t answer it 100%, we have some inklings as to 00:30:11.120 --> 00:30:15.560 why that may be the case. So in terms of popularity which hidden services 00:30:15.560 --> 00:30:19.700 did we see, or which .onion addresses did we see requested the most? 00:30:19.700 --> 00:30:26.980 Which got the largest number of hits? Or the largest number of directory requests.
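The two analyses just described – counting how often each address was published, and intersecting the 2013 and 2014 address sets – reduce to simple counting and set operations. A toy sketch with invented addresses and a made-up log, just to show the shape of the computation:

```python
from collections import Counter

# Hypothetical collection log: (day, onion_address) publication records;
# the real data obviously looks nothing like this.
log_2014 = [
    (1, "aaaa.onion"), (2, "aaaa.onion"), (90, "aaaa.onion"),
    (3, "bbbb.onion"),
    (5, "cccc.onion"), (6, "cccc.onion"),
]
seen_2013 = {"aaaa.onion", "zzzz.onion"}   # the earlier, single-day snapshot

# How many times did we see each address published? Addresses seen only
# once are the short-lived bulk described above.
counts = Counter(onion for _, onion in log_2014)
print(counts)
print(sum(1 for c in counts.values() if c == 1))   # addresses seen exactly once

# Which addresses survived the ~18 months between the two snapshots?
print(seen_2013 & set(counts))                     # {'aaaa.onion'}
```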
00:30:26.980 --> 00:30:30.120 So botnet Command & Control servers – if you’re not familiar with what 00:30:30.120 --> 00:30:34.340 a botnet is, the idea is to infect lots of people with a piece of malware. 00:30:34.340 --> 00:30:37.630 And this malware phones home to a Command & Control server where 00:30:37.630 --> 00:30:41.500 the botnet master can give instructions to each of the bots to do things. 00:30:41.500 --> 00:30:46.780 So it might be e.g. to collect passwords, key strokes, banking details. 00:30:46.780 --> 00:30:51.010 Or it might be to do things like Distributed Denial of Service attacks, 00:30:51.010 --> 00:30:55.220 or to send spam, those sorts of things. And a couple of years ago someone gave 00:30:55.220 --> 00:31:00.720 a talk and said: “Well, the problem with running a botnet is your C&C servers 00:31:00.720 --> 00:31:05.750 are vulnerable.” Once a C&C server is taken down you no longer have control over 00:31:05.750 --> 00:31:10.030 your botnet. So it’s been a sort of arms race between anti-virus companies and 00:31:10.030 --> 00:31:15.130 malware authors to try and come up with techniques to run C&C servers in a way 00:31:15.130 --> 00:31:18.490 which they can’t be taken down. And a couple of years ago someone gave a talk 00:31:18.490 --> 00:31:22.450 at a conference that said: “You know what? It would be a really good idea if botnet 00:31:22.450 --> 00:31:25.809 C&C servers were run as Tor hidden services because then no one knows 00:31:25.809 --> 00:31:29.370 where they are, and in theory they can’t be taken down.” So in fact we do see this: 00:31:29.370 --> 00:31:33.000 there are loads and loads and loads of these addresses associated with several 00:31:33.000 --> 00:31:38.122 different botnets, ‘Sefnit’ and ‘Skynet’. Now Skynet is the one I wanted to talk 00:31:38.122 --> 00:31:42.840 to you about because the guy that runs Skynet had a Twitter account, and he also 00:31:42.840 --> 00:31:47.210 did a Reddit AMA. If you’ve not heard of a Reddit AMA before, that’s a Reddit 00:31:47.210 --> 00:31:51.500 ask-me-anything. You can go on the website and ask the guy anything. So this guy 00:31:51.500 --> 00:31:54.790 wasn’t hiding in the shadows. He’d say: “Hey, I’m running this massive botnet, 00:31:54.790 --> 00:31:58.180 here’s my Twitter account which I update regularly, here is my Reddit AMA where 00:31:58.180 --> 00:32:01.620 you can ask me questions!” etc. 00:32:01.620 --> 00:32:04.590 He was arrested last year, which is not, perhaps, a huge surprise. 00:32:04.590 --> 00:32:11.750 laughter and applause 00:32:11.750 --> 00:32:15.970 But… so he was arrested, his C&C servers disappeared 00:32:15.970 --> 00:32:21.600 but there were still infected hosts trying to connect with the C&C servers and 00:32:21.600 --> 00:32:24.490 request access to the C&C server. 00:32:24.490 --> 00:32:27.570 This is why we’re seeing a large number of hits. So all of these requests are 00:32:27.570 --> 00:32:31.520 failed requests, i.e. we didn’t have a descriptor for them because 00:32:31.520 --> 00:32:34.910 the hidden service had gone away but there were still clients requesting each 00:32:34.910 --> 00:32:38.040 of the hidden services. 00:32:38.040 --> 00:32:41.980 And the next thing we wanted to do was to try and categorize sites.
So, as I said 00:32:41.980 --> 00:32:45.960 earlier, we crawled all of the hidden services that we could, and we classified 00:32:45.960 --> 00:32:50.230 them into different categories based on what the type of content was 00:32:50.230 --> 00:32:53.650 on the hidden service site. The first graph I have is the number of sites 00:32:53.650 --> 00:32:58.040 in each of the categories. So you can see down the bottom here we got lots of 00:32:58.040 --> 00:33:04.280 different categories. We got drugs, market places, etc. on the bottom. And the graph 00:33:04.280 --> 00:33:07.360 shows the percentage of the hidden services that we crawled that fit in 00:33:07.360 --> 00:33:12.680 to each of these categories. So e.g. looking at this, drugs, the largest number of sites 00:33:12.680 --> 00:33:16.250 that we crawled were drugs-focused websites, followed by 00:33:16.250 --> 00:33:20.970 market places etc. There’s a couple of questions you might have here 00:33:20.970 --> 00:33:25.640 about the ones which stick out. What does ‘porn’ mean? Well, you know 00:33:25.640 --> 00:33:31.060 what ‘porn’ means. There are some very notorious porn sites on the Tor Darknet. 00:33:31.060 --> 00:33:34.470 There was one in particular which was focused on revenge porn. It turns out 00:33:34.470 --> 00:33:37.520 that youngsters like to take pictures of themselves and send them to their 00:33:37.520 --> 00:33:45.040 boyfriends or their girlfriends. And when they get dumped those pictures get published 00:33:45.040 --> 00:33:49.750 on these websites. So there were several of these sites on the main internet 00:33:49.750 --> 00:33:53.070 which have mostly been shut down. And some of these sites were archived 00:33:53.070 --> 00:33:58.220 on the Darknet. The second one that you should probably wonder about 00:33:58.220 --> 00:34:03.430 is ‘abuse’. Abuse was… every single site we classified in this category 00:34:03.430 --> 00:34:07.750 was a child abuse site. So they were in some way facilitating child abuse. 00:34:07.750 --> 00:34:10.980 And how do we know that? Well, the data that came back from the crawler 00:34:10.980 --> 00:34:14.789 made it completely unambiguous as to what the content was in these sites. That was 00:34:14.789 --> 00:34:18.918 completely obvious, from the content from the crawler, as to what was on these sites. 00:34:18.918 --> 00:34:23.449 And this is the principal reason why we didn’t pull down images from sites. 00:34:23.449 --> 00:34:26.099 In many countries it would be a criminal offense to do so. 00:34:26.099 --> 00:34:29.530 So our crawler only pulled down text content from all of these sites, and that 00:34:29.530 --> 00:34:34.470 enabled us to classify them, based on that. We didn’t pull down any images. 00:34:34.470 --> 00:34:37.880 So of course the next thing we wanted to do was to say: “Okay, well, given each of these 00:34:37.880 --> 00:34:42.759 categories, what proportion of directory requests went to each of the categories?” 00:34:42.759 --> 00:34:45.489 Now the next graph is going to need some explaining as to precisely what it 00:34:45.489 --> 00:34:52.090 means, and I’m gonna give that. This is the proportion of directory requests 00:34:52.090 --> 00:34:55.830 which we saw that went to each of the categories of hidden service that we 00:34:55.830 --> 00:34:59.740 classified. As you can see, in fact, we saw a very large number going to these 00:34:59.740 --> 00:35:05.010 abuse sites. And the rest sort of distributed right there, at the bottom.
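As a rough idea of the text-only crawling and classification described above (the study’s actual crawler and category definitions are not reproduced here), something along these lines would work, assuming a local Tor client with its SOCKS port on 9050 and the requests package installed with SOCKS support (requests[socks]). The keyword lists and the example address are invented, and a real classifier would be far more careful than simple keyword counting.

```python
import requests

# Route requests through a local Tor client's SOCKS port; the "socks5h"
# scheme makes the proxy resolve the .onion name rather than the local DNS.
PROXIES = {"http":  "socks5h://127.0.0.1:9050",
           "https": "socks5h://127.0.0.1:9050"}

# Illustrative keyword lists only; not the categories used in the study.
CATEGORIES = {
    "drugs":       ["cannabis", "mdma", "pharmacy"],
    "marketplace": ["escrow", "vendor", "bitcoin"],
    "blog":        ["posted by", "comments", "archive"],
}

def fetch_text(onion_url: str) -> str:
    """Pull down HTML text only (never images), as in the study."""
    r = requests.get(onion_url, proxies=PROXIES, timeout=60)
    return r.text.lower()

def classify(text: str) -> str:
    """Assign the category whose keywords appear most often, else 'other'."""
    scores = {cat: sum(text.count(w) for w in words)
              for cat, words in CATEGORIES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] else "other"

# Example (hypothetical address):
# print(classify(fetch_text("http://exampleonionaddress.onion/")))
```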
00:35:05.010 --> 00:35:07.230 And the question is: “What is it we’re collecting here?” 00:35:07.230 --> 00:35:12.070 We’re collecting successful hidden service directory requests. What does a hidden 00:35:12.070 --> 00:35:16.790 service directory request mean? It probably loosely correlates with 00:35:16.790 --> 00:35:22.230 either a visit or a visitor. So somewhere in between those two. Because when you 00:35:22.230 --> 00:35:26.790 want to visit a hidden service you make a request for the hidden service descriptor 00:35:26.790 --> 00:35:31.080 and that allows you to connect to it and browse through the web site. 00:35:31.080 --> 00:35:34.770 But there are cases where, e.g. if you restart Tor, you’ll go back and you 00:35:34.770 --> 00:35:40.100 re-fetch the descriptor. So in that case we’ll count twice, for example. 00:35:40.100 --> 00:35:43.050 What proportion of these are people, and which proportion of them are 00:35:43.050 --> 00:35:46.619 something else? The answer to that is we just simply don’t know. 00:35:46.619 --> 00:35:50.250 We’ve got directory requests but that doesn’t tell us about what they’re doing on these 00:35:50.250 --> 00:35:55.130 sites, what they’re fetching, or who indeed they are, or what it is they are. 00:35:55.130 --> 00:35:58.690 So these could be automated requests, they could be human beings. We can’t 00:35:58.690 --> 00:36:03.750 distinguish between those two things. 00:36:03.750 --> 00:36:06.420 What are the limitations? 00:36:06.420 --> 00:36:12.170 A hidden service directory request correlates exactly with neither a visit nor a visitor. 00:36:12.170 --> 00:36:16.380 It’s probably somewhere in between. So you can’t say whether it’s exactly one 00:36:16.380 --> 00:36:19.810 or the other. We cannot say whether a hidden service directory request 00:36:19.810 --> 00:36:26.230 is a person or something automated. We can’t distinguish between those two. 00:36:26.230 --> 00:36:31.890 Any type of site could be targeted by e.g. DoS attacks or by web crawlers, which would 00:36:31.890 --> 00:36:40.040 greatly inflate the figures. If you were to do a DoS attack it’s likely you’d only 00:36:40.040 --> 00:36:44.700 request a small number of descriptors. You’d actually be flooding the site itself 00:36:44.700 --> 00:36:47.740 rather than the directories. But, in theory, you could flood the directories. 00:36:47.740 --> 00:36:52.840 But we didn’t see any sort of shutdown of our directories based on flooding, e.g. 00:36:52.840 --> 00:36:58.720 Whilst we can’t rule that out, it doesn’t seem to fit too well with what we’ve got. 00:36:58.720 --> 00:37:02.971 The other question is ‘crawlers’. I obviously talked with the Tor Project 00:37:02.971 --> 00:37:08.570 about these results and they’ve suggested that there are groups, such as the child 00:37:08.570 --> 00:37:12.740 protection agencies, that will crawl these sites on a regular basis. And, 00:37:12.740 --> 00:37:15.879 again, that doesn’t necessarily correlate with a human being. And that could 00:37:15.879 --> 00:37:19.830 inflate the figures. How many hidden service directory requests would there be 00:37:19.830 --> 00:37:24.610 if a crawler was pointed at a site? Typically, if I crawl it on a single day, one request. 00:37:24.610 --> 00:37:27.850 But if they’ve got a large number of servers doing the crawling then it could be 00:37:27.850 --> 00:37:32.840 a request per day from every single server. So, again, I can’t give you a definitive 00:37:32.840 --> 00:37:37.930 “yes, this is human beings” or “yes, this is automated requests”.
The other important point is, these two content graphs are only hidden services 00:37:43.300 --> 00:37:48.550 offering web content. There are hidden services that do other things, e.g. IRC, 00:37:48.550 --> 00:37:52.490 instant messaging etc. Those aren’t included in these figures. We’re only 00:37:52.490 --> 00:37:57.990 concentrating on hidden services offering web sites. They’re HTTP services, or HTTPS 00:37:57.990 --> 00:38:01.640 services. Because that allows us to easily classify them. And, in fact, for some of 00:38:01.640 --> 00:38:06.080 the other types, like IRC and Jabber, the results would probably not be directly comparable 00:38:06.080 --> 00:38:08.920 with web sites. The use case for using them is probably 00:38:08.920 --> 00:38:16.490 slightly different. So I appreciate the last graph is somewhat alarming. 00:38:16.490 --> 00:38:20.640 If you have any questions please ask either me or the Tor developers 00:38:20.640 --> 00:38:24.810 as to how to interpret these results. It’s not quite as straight-forward as it may 00:38:24.810 --> 00:38:27.500 look when you look at the graph. You might look at the graph and say: “Hey, 00:38:27.500 --> 00:38:30.980 that looks like there’s lots of people visiting these sites”. It’s difficult 00:38:30.980 --> 00:38:40.240 to conclude that from the results. 00:38:40.240 --> 00:38:45.990 The next slide is gonna be very contentious. I will prefix it with: 00:38:45.990 --> 00:38:50.970 “I’m not advocating -any- kind of action whatsoever. I’m just trying 00:38:50.970 --> 00:38:56.130 to describe technically as to what could be done. It’s not up to me to make decisions 00:38:56.130 --> 00:39:02.869 on these types of things.” So, of course, when we found this out, frankly, I think 00:39:02.869 --> 00:39:06.190 we were stunned. I mean, it took us several days, frankly, it just stunned us, 00:39:06.190 --> 00:39:09.610 “what the hell, this is not what we expected at all.” 00:39:09.610 --> 00:39:13.210 So a natural step is, well, we think, most of us think that Tor is a great thing, 00:39:13.210 --> 00:39:18.510 it seems. Could this problem be sorted out while still keeping Tor as it is? 00:39:18.510 --> 00:39:21.510 And probably the next step is to say: “Well, okay, could we just block this class 00:39:21.510 --> 00:39:26.060 of content and not other types of content?” So could we block just hidden services 00:39:26.060 --> 00:39:29.630 that are associated with these sites and not other types of hidden services? 00:39:29.630 --> 00:39:33.370 We thought of three ways in which we could block hidden services. 00:39:33.370 --> 00:39:36.960 And I’ll talk about whether these will still be possible in the coming months, 00:39:36.960 --> 00:39:39.430 after explaining them. During our study these would have been possible, 00:39:39.430 --> 00:39:43.590 and presently they are possible. 00:39:43.590 --> 00:39:48.630 A single individual could shut down a single hidden service by controlling 00:39:48.630 --> 00:39:53.640 all of the relays which are responsible for receiving a publication request 00:39:53.640 --> 00:39:57.280 on that distributed hash table. It’s possible to place one of your relays 00:39:57.280 --> 00:40:01.460 at a particular position on that circle and therefore make yourself 00:40:01.460 --> 00:40:04.290 the responsible relay for a particular hidden service.
00:40:04.290 --> 00:40:08.500 And if you control all of the six relays which are responsible for a hidden service, 00:40:08.500 --> 00:40:11.390 when someone comes to you and says: “Can I have a descriptor for that site?” 00:40:11.390 --> 00:40:15.910 you can just say: “No, I haven’t got it”. And provided you control those relays 00:40:15.910 --> 00:40:20.580 users won’t be able to fetch those sites. 00:40:20.580 --> 00:40:25.010 The second option is at the relay operator level – the Tor Project blocking these 00:40:25.010 --> 00:40:28.941 is something I’ll talk about in a second. Could I 00:40:28.941 --> 00:40:32.500 as a relay operator say: “Okay, I don’t want to carry 00:40:32.500 --> 00:40:35.930 this type of content, and I don’t want to be responsible for serving up this type 00:40:35.930 --> 00:40:39.930 of content.” A relay operator could patch his relay and say: “You know what, 00:40:39.930 --> 00:40:44.020 if anyone comes to this relay requesting any one of these sites then, again, just 00:40:44.020 --> 00:40:48.740 refuse to do it”. The problem is a lot of relay operators would need to do it. So a very, 00:40:48.740 --> 00:40:51.990 very large number of relay operators would need to do that 00:40:51.990 --> 00:40:56.170 to effectively block these sites. The final option is the Tor Project could 00:40:56.170 --> 00:41:00.740 modify the Tor program and actually embed these addresses in the Tor program itself 00:41:00.740 --> 00:41:05.030 so that all relays by default both block hidden service directory requests 00:41:05.030 --> 00:41:10.560 to these sites, and also clients themselves would say: “Okay, if anyone’s requesting 00:41:10.560 --> 00:41:15.000 these, block them at the client level.” Now I hasten to add: I’m not advocating 00:41:15.000 --> 00:41:18.230 any kind of action; that is entirely up to other people, because, frankly, I think 00:41:18.230 --> 00:41:22.530 if I advocated blocking hidden services I probably wouldn’t make it out alive, 00:41:22.530 --> 00:41:27.050 so I’m just saying: this is a description of what technical measures could be used 00:41:27.050 --> 00:41:30.730 to block some classes of sites. And of course there’s lots of questions here. 00:41:30.730 --> 00:41:35.150 If e.g. the Tor Project themselves decided: “Okay, we’re gonna block these sites” 00:41:35.150 --> 00:41:38.490 that means they are essentially in control of the block list. 00:41:38.490 --> 00:41:41.360 The block list would be somewhat public so everyone would be able to inspect 00:41:41.360 --> 00:41:44.930 what sites are being blocked and they would be in control of some kind 00:41:44.930 --> 00:41:54.360 of block list. Which, you know, arguably is against what the Tor Project is after. 00:41:54.360 --> 00:41:59.560 takes a sip, coughs 00:41:59.560 --> 00:42:05.480 So how about deanonymising visitors to hidden service web sites? 00:42:05.480 --> 00:42:08.940 So in this case we’ve got a user on the left-hand side who is connected to 00:42:08.940 --> 00:42:12.630 a Guard node. We’ve got a hidden service on the right-hand side which is connected 00:42:12.630 --> 00:42:17.530 to a Guard node and on the top we’ve got one of those directory servers which is 00:42:17.530 --> 00:42:21.850 responsible for serving up those hidden service directory requests.
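Going back to the relay-operator option above, here is a minimal sketch of the check a patched directory relay could make before answering a descriptor fetch. Everything here is hypothetical – the block list entries, the function name, and the simplification of looking descriptors up by onion address rather than by descriptor ID – it only illustrates the “No, I haven’t got it” behaviour.

# Hypothetical block list; in practice someone trusted would have to supply it.
BLOCKED_ONIONS = {
    "blockedexampleaaa.onion",
    "blockedexamplebbb.onion",
}

def handle_descriptor_fetch(onion_address: str, descriptor_store: dict):
    """Return the stored descriptor, or None as if we simply did not have it."""
    if onion_address in BLOCKED_ONIONS:
        return None                      # "No, I haven't got it"
    return descriptor_store.get(onion_address)

As the talk notes, this only works if essentially all six responsible directories for a service, and whichever relays rotate into that role later, apply the same list.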
00:42:21.850 --> 00:42:28.660 Now, when you first want to connect to a hidden service you connect through 00:42:28.660 --> 00:42:31.619 your Guard node and through a couple of hops up to the hidden service directory and 00:42:31.619 --> 00:42:35.840 you request the descriptor off of them. So at this point if you are the attacker 00:42:35.840 --> 00:42:39.440 and you control one of the hidden service directory nodes for a particular site 00:42:39.440 --> 00:42:43.100 you can send back down the circuit a particular pattern of traffic. 00:42:43.100 --> 00:42:47.740 And if you control that user’s Guard node – which is a big if – 00:42:47.740 --> 00:42:52.110 then you can spot that pattern of traffic at the Guard node. The question is: 00:42:52.110 --> 00:42:56.940 “How do you control a particular user’s Guard node?” That’s very, very hard. 00:42:56.940 --> 00:43:01.480 But if e.g. I run a hidden service and all of you visit my hidden service, and 00:43:01.480 --> 00:43:05.670 I’m running a couple of dodgy Guard relays then the probability is that some of you, 00:43:05.670 --> 00:43:09.760 certainly not all of you by any stretch, will select my dodgy Guard relay, and 00:43:09.760 --> 00:43:13.220 I could deanonymise you, but I couldn’t deanonymise the rest of you. 00:43:13.220 --> 00:43:18.260 So what we’re saying here is that you can deanonymise some of the users 00:43:18.260 --> 00:43:22.130 some of the time but you can’t pick which users those are that you’re going to 00:43:22.130 --> 00:43:26.609 deanonymise. You can’t deanonymise someone specific but you can deanonymise a fraction 00:43:26.609 --> 00:43:32.170 based on what fraction of the network you control in terms of Guard capacity. 00:43:32.170 --> 00:43:36.340 How about… so the attacker controls those two – here’s a picture from researchers at 00:43:36.340 --> 00:43:40.200 the University of Luxembourg who did this. And these are plots made by 00:43:40.200 --> 00:43:45.270 taking the IP address of a user visiting a C&C server, and then geolocating it 00:43:45.270 --> 00:43:48.480 and putting it on a map. So “where was the user located when they called one of 00:43:48.480 --> 00:43:51.620 the Tor hidden services?” So, again, this is a selection, a percentage 00:43:51.620 --> 00:43:58.060 of the users visiting C&C servers using this technique. 00:43:58.060 --> 00:44:03.770 How about deanonymising hidden services themselves? Well, again, you’ve got a problem. 00:44:03.770 --> 00:44:08.340 You’re the user. You’re gonna connect through your Guard into the Tor network. 00:44:08.340 --> 00:44:12.160 And then, eventually, through the hidden service’s Guard node, and talk to 00:44:12.160 --> 00:44:16.740 the hidden service. As the attacker you need to control the hidden service’s 00:44:16.740 --> 00:44:20.859 Guard node to do these traffic correlation attacks. So again, it’s very difficult 00:44:20.859 --> 00:44:24.390 to deanonymise a specific Tor hidden service. But if you think about, okay, 00:44:24.390 --> 00:44:30.200 there are 1,000 Tor hidden services, if you can control a percentage of the Guard nodes 00:44:30.200 --> 00:44:34.230 then some hidden services will pick you and then you’ll be able to deanonymise those. 00:44:34.230 --> 00:44:37.330 So provided you don’t care which hidden services you’re gonna deanonymise 00:44:37.330 --> 00:44:41.400 then it becomes much more straight-forward to control the Guard nodes of some hidden 00:44:41.400 --> 00:44:44.910 services but you can’t pick exactly what those are.
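A back-of-envelope sketch of that last point. The numbers are assumptions for illustration (a 2% share of guard capacity, three guards per client, and the 1,000 hidden services from the example above); real guard selection is bandwidth-weighted and has rotation rules, so this only shows the shape of the argument: the attacker catches whichever clients or services happen to land on its relays, not a chosen target.

# Simplified model: each client or hidden service independently picks its guards
# roughly in proportion to bandwidth, so an attacker running a given fraction of
# guard capacity deanonymises "some, some of the time", but cannot choose which.
malicious_guard_fraction = 0.02   # assumed: attacker runs ~2% of guard capacity
guards_per_client = 3             # clients at the time kept a small set of guards

p_bad_guard = 1 - (1 - malicious_guard_fraction) ** guards_per_client
print(f"chance a given client or service uses a malicious guard: {p_bad_guard:.1%}")

hidden_services = 1000            # the order of magnitude used in the example above
print(f"expected services exposed: about {hidden_services * p_bad_guard:.0f} of {hidden_services}")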
00:44:44.910 --> 00:44:51.040 So what sort of data can you see traversing a relay? 00:44:51.040 --> 00:44:55.880 This is a modified Tor client which just dumps cells which are coming… 00:44:55.880 --> 00:44:58.750 essentially packets travelling down a circuit, and the information you can 00:44:58.750 --> 00:45:04.020 extract from them at a Guard node. And this is done off the main Tor network. 00:45:04.020 --> 00:45:08.590 So I’ve got a client connected to a “malicious” Guard relay 00:45:08.590 --> 00:45:14.040 and it logs every single packet – they’re called ‘cells’ in the Tor protocol – 00:45:14.040 --> 00:45:17.619 coming through the Guard relay. We can’t decrypt the packet because it’s encrypted 00:45:17.619 --> 00:45:21.780 three times. What we can record, though, is the IP address of the user, 00:45:21.780 --> 00:45:25.070 the IP address of the next hop, and we can count packets travelling 00:45:25.070 --> 00:45:29.240 in each direction down the circuit. And we can also record the time at which those 00:45:29.240 --> 00:45:32.210 packets were sent. So of course, if you’re doing the traffic correlation attacks 00:45:32.210 --> 00:45:37.970 you’re using that timing information to try and work out whether you’re seeing 00:45:37.970 --> 00:45:42.370 traffic which you’ve sent and which identifies a particular user or not. 00:45:42.370 --> 00:45:44.810 Or indeed traffic which they’ve sent which you’ve seen at a different point 00:45:44.810 --> 00:45:49.100 in the network. 00:45:49.100 --> 00:45:51.980 Moving on to my… 00:45:51.980 --> 00:45:55.760 interesting problems, research questions etc. 00:45:55.760 --> 00:45:59.250 Based on what I’ve said: there are these directory authorities which are 00:45:59.250 --> 00:46:05.070 controlled by the core Tor members. If e.g. they were malicious – 00:46:05.070 --> 00:46:08.990 if a big enough chunk of them are malicious then 00:46:08.990 --> 00:46:12.700 they can manipulate the consensus to direct you to particular nodes. 00:46:12.700 --> 00:46:15.920 I don’t think that’s the case, and I don’t think anyone thinks that’s the case. 00:46:15.920 --> 00:46:19.180 And Tor is designed in a way that you’d have to control 00:46:19.180 --> 00:46:22.480 a certain number of the authorities to be able to do anything important. 00:46:22.480 --> 00:46:25.270 So the Tor people… I said this to them a couple of days ago. 00:46:25.270 --> 00:46:28.780 I find it quite funny that you’d design your system as if you don’t trust 00:46:28.780 --> 00:46:31.880 each other. To which their response was: “No, we design our system so that 00:46:31.880 --> 00:46:35.620 we don’t have to trust each other.” Which I think is a very good model to have, 00:46:35.620 --> 00:46:39.430 when you have this type of system. So could we eliminate these sort of 00:46:39.430 --> 00:46:43.240 centralized servers? I think that’s actually a very hard problem to solve. 00:46:43.240 --> 00:46:46.340 There are lots of attacks which could potentially be deployed against 00:46:46.340 --> 00:46:51.250 a decentralized network. At the moment the Tor network is relatively well understood 00:46:51.250 --> 00:46:54.490 in terms of what types of attack it is vulnerable to. So if we were to move 00:46:54.490 --> 00:46:58.880 to a new architecture then we may open it to a whole new class of attacks. 00:46:58.880 --> 00:47:02.000 The Tor network has existed for quite some time and it’s been 00:47:02.000 --> 00:47:06.820 very well studied.
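A minimal sketch of the per-cell record such a modified guard can keep without decrypting anything, matching the fields listed above (the user’s IP address, the next hop, the direction, and the time), plus a deliberately naive timing comparison between two observation points. The comparison function is an assumption for illustration; real correlation attacks use far more robust statistics.

from dataclasses import dataclass

@dataclass
class CellRecord:
    timestamp: float   # when the cell passed the relay
    prev_ip: str       # where the cell came from (at a guard: the user)
    next_ip: str       # where it was forwarded to (the middle relay)
    outbound: bool     # direction of travel along the circuit

def interarrival_times(records):
    ts = sorted(r.timestamp for r in records)
    return [b - a for a, b in zip(ts, ts[1:])]

def looks_correlated(log_a, log_b, tolerance=0.05):
    """Crude check: do two circuits show a similar cell-timing pattern?
    E.g. compare the pattern injected at a malicious directory with what a
    malicious guard recorded for one of its circuits."""
    a, b = interarrival_times(log_a), interarrival_times(log_b)
    return len(a) == len(b) and bool(a) and all(
        abs(x - y) <= tolerance for x, y in zip(a, b)
    )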
What about global adversaries like the NSA, which can 00:47:06.820 --> 00:47:10.980 monitor network links all across the world? It’s very difficult to defend 00:47:10.980 --> 00:47:15.530 against that. If they can identify which Guard relay 00:47:15.530 --> 00:47:18.760 you’re using, they can monitor traffic going into and out of the Guard relay, 00:47:18.760 --> 00:47:23.259 and they can log each of the subsequent hops along the way. It’s very, very difficult to defend against 00:47:23.259 --> 00:47:26.470 these types of things. Do we know if they’re doing it? The documents that were 00:47:26.470 --> 00:47:29.850 released yesterday – I’ve only had a very brief look through them, but they suggest 00:47:29.850 --> 00:47:32.480 that they’re not presently doing it and they haven’t had much success. 00:47:32.480 --> 00:47:36.450 I don’t know why; there are very powerful attacks described in the academic literature 00:47:36.450 --> 00:47:40.830 which are very, very reliable, and most academic literature you can access for free, 00:47:40.830 --> 00:47:43.960 so it’s not even as if they have to figure out how to do it. They just have to read 00:47:43.960 --> 00:47:47.010 the academic literature and try and implement some of these attacks. 00:47:47.010 --> 00:47:52.000 I don’t know why they’re not. The next question is how to detect malicious 00:47:52.000 --> 00:47:57.760 relays. So in our case we were running 40 relays. Our relays were on consecutive 00:47:57.760 --> 00:48:01.570 IP addresses, so we were running 40 – well, most of them were on consecutive 00:48:01.570 --> 00:48:04.820 IP addresses in two blocks. So they were running on IP addresses numbered 00:48:04.820 --> 00:48:09.280 e.g. 1, 2, 3, 4, … We were running two relays per IP address, 00:48:09.280 --> 00:48:12.210 and every single relay had my name plastered across it. 00:48:12.210 --> 00:48:14.740 So after I set up these 40 relays in 00:48:14.740 --> 00:48:17.420 a relatively short period of time I expected someone from the Tor Project 00:48:17.420 --> 00:48:22.260 to come to me and say: “Hey Gareth, what are you doing?” – no one noticed, 00:48:22.260 --> 00:48:26.090 no one noticed. So this is presently an open question. The Tor Project 00:48:26.090 --> 00:48:28.790 are quite open about this. They acknowledged that, in fact, last year 00:48:28.790 --> 00:48:33.210 we had the CERT researchers launch many more relays than that. The Tor Project 00:48:33.210 --> 00:48:36.510 spotted that large number of relays but chose not to do anything about it 00:48:36.510 --> 00:48:40.119 and, in fact, those relays were deploying an attack. But, as you know, it’s often very 00:48:40.119 --> 00:48:43.700 difficult to defend against unknown attacks. So at the moment how to detect 00:48:43.700 --> 00:48:47.780 malicious relays is a bit of an open question. Which I think is being 00:48:47.780 --> 00:48:50.720 discussed on the mailing list. 00:48:50.720 --> 00:48:54.230 The other one is defending against unknown tampering at exits. If you take 00:48:54.230 --> 00:48:57.220 the exit relays – the exit relay can tamper with the traffic. 00:48:57.220 --> 00:49:01.040 So we know about particular types of attacks, doing SSL man-in-the-middle etc. 00:49:01.040 --> 00:49:05.350 We’ve recently seen binary patching. How do we detect unknown tampering 00:49:05.350 --> 00:49:08.970 with traffic, other types of traffic? So the binary tampering wasn’t spotted 00:49:08.970 --> 00:49:12.060 until it was spotted by someone who told the Tor Project.
So it wasn’t 00:49:12.060 --> 00:49:15.609 detected e.g. by the Tor Project themselves, it was spotted by someone else 00:49:15.609 --> 00:49:20.500 who notified them. And then the final open one on here is Tor code review. 00:49:20.500 --> 00:49:25.400 So the Tor code is open source. We know from OpenSSL that, although everyone 00:49:25.400 --> 00:49:29.260 can read source code, people don’t always look at it. And OpenSSL has been 00:49:29.260 --> 00:49:32.230 a huge mess, and there’s been lots of stuff disclosed about that 00:49:32.230 --> 00:49:35.880 recently. There are lots of eyes on the Tor code but I think, 00:49:35.880 --> 00:49:41.519 as always, more eyes are better. I’d say, ideally if we can get people to look 00:49:41.519 --> 00:49:45.140 at the Tor code and look for vulnerabilities then… I encourage people 00:49:45.140 --> 00:49:49.860 to do that. It’s a very useful thing to do. There could be unknown vulnerabilities, 00:49:49.860 --> 00:49:53.119 as we’ve seen with the “relay early” issue quite recently in the Tor code, which 00:49:53.119 --> 00:49:56.990 could be quite serious. The truth is we just don’t know until people do thorough 00:49:56.990 --> 00:50:02.500 code audits, and even then it’s very difficult to know for certain. 00:50:02.500 --> 00:50:08.170 So my last point, I think, yes, 00:50:08.170 --> 00:50:11.130 is advice to future researchers. So if you ever wanted, or are planning 00:50:11.130 --> 00:50:16.349 on doing a study in the future, e.g. on Tor, do not do what the CERT researchers 00:50:16.349 --> 00:50:20.550 did and start deanonymising people on the live Tor network, doing it in a way 00:50:20.550 --> 00:50:25.060 which is incredibly irresponsible. I don’t think… I mean, I tend, myself, to give them 00:50:25.060 --> 00:50:28.510 the benefit of the doubt; I don’t think the CERT researchers set out to be malicious. 00:50:28.510 --> 00:50:33.320 I think they were just very naive in what they were doing. 00:50:33.320 --> 00:50:36.780 That was rapidly pointed out to them. In our case we were running 00:50:36.780 --> 00:50:43.090 40 relays. Our Tor relays were forwarding traffic; they were acting as good relays. 00:50:43.090 --> 00:50:45.970 The only thing that we were doing was logging publication requests 00:50:45.970 --> 00:50:50.050 to the directories. Big question whether that’s malicious or not – I don’t know. 00:50:50.050 --> 00:50:53.330 One thing that has been pointed out to me is that the .onion addresses themselves 00:50:53.330 --> 00:50:58.270 could be considered sensitive information, so the only data we will be retaining 00:50:58.270 --> 00:51:01.840 from the study is the aggregated data. So we won’t be retaining information 00:51:01.840 --> 00:51:05.400 on individual .onion addresses because that could potentially be considered 00:51:05.400 --> 00:51:08.900 sensitive information. If you think about someone running an .onion address which 00:51:08.900 --> 00:51:11.240 contains something which they don’t want other people knowing about. So we won’t 00:51:11.240 --> 00:51:15.060 be retaining that data, and we’ll be destroying it. 00:51:15.060 --> 00:51:19.920 So I think that brings me now to starting the questions. 00:51:19.920 --> 00:51:22.770 I want to say “Thanks” to a couple of people. The student who donated 00:51:22.770 --> 00:51:26.820 the server to us. Nick Savage who is one of my colleagues who was a sounding board 00:51:26.820 --> 00:51:30.510 during the entire study.
Ivan Pustogarov who is the researcher at the University 00:51:30.510 --> 00:51:34.700 of Luxembourg who sent us the large data set of .onion addresses from last year. 00:51:34.700 --> 00:51:37.670 He’s also the chap who has demonstrated those deanonymisation attacks 00:51:37.670 --> 00:51:41.500 that I talked about. A big “Thank you” to Roger Dingledine who has, frankly, 00:51:41.500 --> 00:51:45.230 presented loads of questions to me over the last couple of days and allowed me 00:51:45.230 --> 00:51:49.410 to bounce ideas back and forth. That has been a very useful process. 00:51:49.410 --> 00:51:53.640 If you are doing future research I strongly encourage you to contact the Tor Project 00:51:53.640 --> 00:51:57.040 at the earliest opportunity. You’ll find them… certainly I found them to be 00:51:57.040 --> 00:51:59.460 extremely helpful. 00:51:59.460 --> 00:52:04.640 Donncha also did something similar, so both Ivan and Donncha have done 00:52:04.640 --> 00:52:09.520 a similar study in trying to classify the types of hidden services or work out 00:52:09.520 --> 00:52:13.520 how many hits there are to particular types of hidden service. Ivan Pustogarov 00:52:13.520 --> 00:52:17.430 did it on a bigger scale and found similar results to us. 00:52:17.430 --> 00:52:21.910 That is, these abuse sites featured frequently 00:52:21.910 --> 00:52:26.740 among the top requested sites. That was done over a year ago, and again, he was seeing 00:52:26.740 --> 00:52:31.109 similar sorts of patterns. There were these abuse sites being requested frequently. 00:52:31.109 --> 00:52:35.450 So that also sort of corroborates what we’re saying. 00:52:35.450 --> 00:52:38.540 The data I put online is at this address; there will probably be the slides, 00:52:38.540 --> 00:52:41.609 something called ‘The Tor Research Framework’, which is an implementation 00:52:41.609 --> 00:52:47.510 of a Tor client in Java specifically aimed 00:52:47.510 --> 00:52:52.080 at researchers. So if e.g. you wanna pull out data from a consensus you can do. 00:52:52.080 --> 00:52:55.290 If you want to build custom routes through the network you can do. 00:52:55.290 --> 00:52:58.230 If you want to build routes through the network and start sending padding traffic 00:52:58.230 --> 00:53:01.720 down them you can do etc. The code is 00:53:01.720 --> 00:53:06.000 designed to be easily modifiable for testing lots of these things. 00:53:06.000 --> 00:53:10.580 There is also a link to the FBI’s Tor exploit which they deployed against 00:53:10.580 --> 00:53:16.230 visitors to some Tor hidden services last year. They exploited a Mozilla Firefox bug 00:53:16.230 --> 00:53:20.540 and then ran code on the computers of users who were visiting these hidden services, 00:53:20.540 --> 00:53:24.619 to identify them. At this address there is a link to that 00:53:24.619 --> 00:53:29.250 including a copy of the shellcode and an analysis of exactly what it was doing. 00:53:29.250 --> 00:53:31.670 And then of course a list of references, with papers and things. 00:53:31.670 --> 00:53:34.260 So I’m quite happy to take questions now. 00:53:34.260 --> 00:53:46.960 applause 00:53:46.960 --> 00:53:50.880 Herald: Thanks for the nice talk! Do we have any questions 00:53:50.880 --> 00:53:57.000 from the internet? 00:53:57.000 --> 00:53:59.740 Signal Angel: One question.
It’s very hard to block addresses since creating them 00:53:59.740 --> 00:54:03.620 is cheap, and they can be generated for each user, and rotated often. So 00:54:03.620 --> 00:54:07.510 can you think of any other way for doing the blocking? 00:54:07.510 --> 00:54:09.799 Gareth: That is absolutely true, so, yes. If you were to block a particular .onion 00:54:09.799 --> 00:54:13.060 address they can just say: “I want another .onion address.” So I don’t know of 00:54:13.060 --> 00:54:16.760 any way to counter that now. 00:54:16.760 --> 00:54:18.510 Herald: Another one from the internet? inaudible answer from Signal Angel 00:54:18.510 --> 00:54:22.030 Okay, then, Microphone 1, please! 00:54:22.030 --> 00:54:26.359 Question: Thank you, that’s fascinating research. You mentioned that it is 00:54:26.359 --> 00:54:32.200 possible to influence the hash of your relay node in a sense that you could 00:54:32.200 --> 00:54:35.970 be choosing which service you are advertising, or which hidden service 00:54:35.970 --> 00:54:38.050 you are responsible for. Is that right? Gareth: Yeah, correct! 00:54:38.050 --> 00:54:40.390 Question: So could you elaborate on how this is possible? 00:54:40.390 --> 00:54:44.740 Gareth: So e.g. you just keep regenerating a public key for your relay, 00:54:44.740 --> 00:54:48.140 you’ll get closer and closer to the point where you’ll be the responsible relay 00:54:48.140 --> 00:54:51.160 for that particular hidden service. That’s just – you keep regenerating your identity 00:54:51.160 --> 00:54:54.720 hash until you’re at that particular point on the circle. That’s not particularly 00:54:54.720 --> 00:55:00.490 computationally intensive to do. That was it? 00:55:00.490 --> 00:55:04.740 Herald: Okay, next question from Microphone 5, please. 00:55:04.740 --> 00:55:09.490 Question: Hi, I was wondering about the attacks where you identify a certain number 00:55:09.490 --> 00:55:15.170 of users using a hidden service. Have those attacks been used, or is there 00:55:15.170 --> 00:55:18.880 any evidence there, and is there any way of protecting against that? 00:55:18.880 --> 00:55:22.260 Gareth: That’s a very interesting question, is there any way to detect these types 00:55:22.260 --> 00:55:24.970 of attacks? So some of the attacks, if you’re going to generate particular 00:55:24.970 --> 00:55:29.030 traffic patterns, one way to do that is to use the padding cells. The padding cells 00:55:29.030 --> 00:55:32.070 aren’t used at the moment by the official Tor client. So the detection of those 00:55:32.070 --> 00:55:36.510 could be indicative but it doesn’t… it’s not conclusive evidence in our tool. 00:55:36.510 --> 00:55:40.050 Question: And is there any way of protecting against a government 00:55:40.050 --> 00:55:46.510 or something trying to denial-of-service hidden services? 00:55:46.510 --> 00:55:48.180 Gareth: So I… trying to… did not… 00:55:48.180 --> 00:55:52.500 Question: Is it possible to protect against this kind of attack? 00:55:52.500 --> 00:55:56.180 Gareth: Not that I’m aware of. The Tor Project are currently revising how they 00:55:56.180 --> 00:55:59.500 do the hidden service protocol which will make e.g. what I did, enumerating 00:55:59.500 --> 00:56:03.230 the hidden services, much more difficult. And it will also make it harder to position yourself on the 00:56:03.230 --> 00:56:07.470 distributed hash table in advance for a particular hidden service.
00:56:07.470 --> 00:56:10.510 So they are at the moment trying to change the way it’s done, and make some of 00:56:10.510 --> 00:56:15.270 these things more difficult. 00:56:15.270 --> 00:56:20.290 Herald: Good. Next question from Microphone 2, please. 00:56:20.290 --> 00:56:27.220 Mic2: Hi. I’m handling the Tor2Web abuse, and so I used to see a lot of abuse requests 00:56:27.220 --> 00:56:31.130 concerning Tor hidden services being exposed on the internet through 00:56:31.130 --> 00:56:37.270 the Tor2Web.org domain name. And I just wanted to comment on, like you said, 00:56:37.270 --> 00:56:45.410 the number of abuse requests. I used to speak with some of the child protection 00:56:45.410 --> 00:56:50.070 agencies that reported abuse at Tor2Web.org, and they are effectively 00:56:50.070 --> 00:56:55.570 using crawlers that periodically look for changes in order to get new images to be 00:56:55.570 --> 00:57:00.190 put in the database. And what I was able to understand is that the German agency 00:57:00.190 --> 00:57:07.440 doing that is crawling the same sites that the Italian agencies are crawling, too. 00:57:07.440 --> 00:57:11.890 So it’s likely that in most of the countries there are child protection 00:57:11.890 --> 00:57:16.790 agencies that are crawling those few Tor hidden services that 00:57:16.790 --> 00:57:22.760 contain child porn. And I saw it also a bit from the statistics of Tor2Web 00:57:22.760 --> 00:57:28.500 where the amount of abuse relating to that kind of content is relatively low. 00:57:28.500 --> 00:57:30.000 Just as a contribution! 00:57:30.000 --> 00:57:33.500 Gareth: Yes, that’s very interesting, thank you for that! 00:57:33.500 --> 00:57:37.260 applause 00:57:37.260 --> 00:57:39.560 Herald: Next, Microphone 4, please. 00:57:39.560 --> 00:57:45.260 Mic4: You then attacked or deanonymised users with an infected or a modified Guard 00:57:45.260 --> 00:57:51.810 relay? Is it required to modify the Guard relay if I control the entry point 00:57:51.810 --> 00:57:57.360 of the user to the internet? If I’m his ISP? 00:57:57.360 --> 00:58:01.900 Gareth: Yes, if you observe traffic travelling into a Guard relay without 00:58:01.900 --> 00:58:04.570 controlling the Guard relay itself. Mic4: Yeah. 00:58:04.570 --> 00:58:07.500 Gareth: In theory, yes. I wouldn’t be able to tell you how reliable that is 00:58:07.500 --> 00:58:10.500 off the top of my head. Mic4: Thanks! 00:58:10.500 --> 00:58:13.630 Herald: So another question from the internet! 00:58:13.630 --> 00:58:16.339 Signal Angel: Wouldn’t the ability to choose the key hash prefix give 00:58:16.339 --> 00:58:19.980 the ability to target specific .onions? 00:58:19.980 --> 00:58:23.680 Gareth: So you can only target one .onion address at a time. Because of the way 00:58:23.680 --> 00:58:28.080 they are generated. So you wouldn’t be able to say e.g. “Pick a key which targeted 00:58:28.080 --> 00:58:32.339 two or more .onion addresses.” You can only target one .onion address at a time 00:58:32.339 --> 00:58:37.720 by positioning yourself at a particular point on the distributed hash table. 00:58:37.720 --> 00:58:40.260 Herald: Another one from the internet? … Okay. 00:58:40.260 --> 00:58:43.369 Then Microphone 3, please. 00:58:43.369 --> 00:58:47.780 Mic3: Hey. Thanks for this research. I think it strengthens the network. 00:58:47.780 --> 00:58:54.300 So in the deem (?)
I was wondering whether you can donate these relays to be a part of 00:58:54.300 --> 00:58:59.500 the non-malicious relay pool, basically use them as regular relays afterwards? 00:58:59.500 --> 00:59:02.750 Gareth: Okay, so can I donate the relays to be rerun and add to the Tor capacity? 00:59:02.750 --> 00:59:05.490 Unfortunately, as I said, they were run by a student and they were donated for 00:59:05.490 --> 00:59:09.510 a fixed period of time. So we’ve given those back to him. We are very grateful 00:59:09.510 --> 00:59:14.790 to him, he was very generous. In fact, without his contribution in donating these 00:59:14.790 --> 00:59:18.700 it would have been much more difficult to collect as much data as we did. 00:59:18.700 --> 00:59:21.490 Herald: Good, next, Microphone 5, please! 00:59:21.490 --> 00:59:25.839 Mic5: Yeah hi, first of all thanks for your talk. I think you’ve raised 00:59:25.839 --> 00:59:29.310 some real issues that need to be considered very carefully by everyone 00:59:29.310 --> 00:59:33.950 on the Tor Project. My question: I’d like to go back to the issue with so many 00:59:33.950 --> 00:59:38.470 abuse-related web sites running over the Tor network. I think it’s an important 00:59:38.470 --> 00:59:41.900 issue that really needs to be considered because we don’t wanna be associated 00:59:41.900 --> 00:59:44.840 with that at the end of the day. Anyone who uses Tor, who runs a relay 00:59:44.840 --> 00:59:51.250 or an exit node. And I understand it’s a bit of a sensitive issue, and you don’t 00:59:51.250 --> 00:59:55.300 really have any say over whether it’s implemented or not. But I’d like to get 00:59:55.300 --> 01:00:02.410 your opinion on the implementation of a distributed block-deny system 01:00:02.410 --> 01:00:06.980 that would run in very much a similar way to that of the directory authorities. 01:00:06.980 --> 01:00:08.950 I’d just like to see what you think of that. 01:00:08.950 --> 01:00:13.200 Gareth: So you’re asking me whether I want to support a particular blocking mechanism 01:00:13.200 --> 01:00:14.200 then? 01:00:14.200 --> 01:00:16.470 Mic5: I’d like to get your opinion on it. Gareth laughs 01:00:16.470 --> 01:00:20.540 I know it’s a sensitive issue but I think, like I said, I think something… 01:00:20.540 --> 01:00:25.700 I think it needs to be considered because everyone running exit nodes and relays 01:00:25.700 --> 01:00:30.270 and people of the Tor Project don’t want to be known or associated with 01:00:30.270 --> 01:00:34.790 this massive number of abuse web sites that currently exist within the Tor network. 01:00:34.790 --> 01:00:40.210 Gareth: I absolutely agree, and I think the Tor Project are horrified as well that 01:00:40.210 --> 01:00:43.960 this problem exists, and they have, in fact, talked about it in previous years, that 01:00:43.960 --> 01:00:48.690 they have a problem with this type of content. As to what, if anything, is 01:00:48.690 --> 01:00:52.340 done about it, that’s very much up to them. Could it be done in a distributed fashion? 01:00:52.340 --> 01:00:56.240 So the example I gave was a way in which it could be done by relay operators. 01:00:56.240 --> 01:00:59.770 So e.g. that would need the consensus of a large number of relay operators to be 01:00:59.770 --> 01:01:02.890 effective. So that is done in a distributed fashion. The question is: 01:01:02.890 --> 01:01:06.810 who gives the list of .onion addresses to block to each of the relay operators? 01:01:06.810 --> 01:01:09.640 Clearly, the relay operators aren’t going to collect it themselves.
It needs to be 01:01:09.640 --> 01:01:15.780 supplied by someone like the Tor Project, e.g., or someone trustworthy. Yes, it can 01:01:15.780 --> 01:01:20.480 be done in a distributed fashion. It can be done in an open fashion. 01:01:20.480 --> 01:01:21.710 Mic5: Who knows? Gareth: Okay. 01:01:21.710 --> 01:01:23.750 Mic5: Thank you. 01:01:23.750 --> 01:01:27.260 Herald: Good. And another question from the internet. 01:01:27.260 --> 01:01:31.210 Signal Angel: Apparently there’s an option in the Tor client to collect statistics 01:01:31.210 --> 01:01:35.169 on hidden services. Do you know about this, and how it relates to your research? 01:01:35.169 --> 01:01:38.551 Gareth: Yes, I believe they’re going to be… the extent to which I know about it 01:01:38.551 --> 01:01:41.930 is they’re gonna be trying this next month, to try and estimate how many 01:01:41.930 --> 01:01:46.490 hidden services there are. So keep your eye on the Tor Project web site, 01:01:46.490 --> 01:01:50.340 I’m sure they’ll be publishing their data in the coming months. 01:01:50.340 --> 01:01:55.090 Herald: And, sadly, we are running out of time, so this will be the last question, 01:01:55.090 --> 01:01:56.980 so Microphone 4, please! 01:01:56.980 --> 01:02:01.250 Mic4: Hi, I’m just wondering if you could sort of outline what ethical clearances 01:02:01.250 --> 01:02:04.510 you had to get from your university to conduct this kind of research. 01:02:04.510 --> 01:02:07.260 Gareth: So we have to discuss these types of things before undertaking 01:02:07.260 --> 01:02:11.970 any research. And we go through the steps to make sure that we’re not e.g. storing 01:02:11.970 --> 01:02:16.370 sensitive information about particular people. So yes, we are very mindful 01:02:16.370 --> 01:02:19.240 of that. And that’s why I made a particular point of putting on the slides 01:02:19.240 --> 01:02:21.510 as to some of the things to consider. 01:02:21.510 --> 01:02:26.180 Mic4: So like… you outlined a potential implementation of the traffic correlation 01:02:26.180 --> 01:02:29.500 attack. Are you saying that you performed the attack? Or… 01:02:29.500 --> 01:02:33.180 Gareth: No, no no, absolutely not. So the link I’m giving… absolutely not. 01:02:33.180 --> 01:02:34.849 We have not engaged in any… 01:02:34.849 --> 01:02:36.350 Mic4: It just wasn’t clear from the slides. 01:02:36.350 --> 01:02:39.380 Gareth: I apologize. So it’s absolutely clear on that. No, we’re not engaging 01:02:39.380 --> 01:02:42.860 in any deanonymisation research on the Tor network. The research I showed 01:02:42.860 --> 01:02:46.079 is linked on the references, I think, which I put at the end of the slides. 01:02:46.079 --> 01:02:52.000 You can read about it. But it’s done in simulation. So e.g. there’s a way 01:02:52.000 --> 01:02:54.730 to do simulation of the Tor network on a single computer. I can’t remember 01:02:54.730 --> 01:02:58.880 the name of the project, though. Shadow! Yes, it’s a system 01:02:58.880 --> 01:03:02.170 called Shadow, we can run a large number of Tor relays on a single computer 01:03:02.170 --> 01:03:04.579 and simulate the traffic between them. If you’re going to do that type of research 01:03:04.579 --> 01:03:09.380 then you should use that. Okay, thank you very much, everyone. 01:03:09.380 --> 01:03:17.985 applause 01:03:17.985 --> 01:03:22.071 silent postroll titles 01:03:22.071 --> 01:03:27.000 subtitles created by c3subtitles.de Join, and help us!