silent 31C3 preroll Dr. Gareth Owen: Hello. Can you hear me? Yes. Okay. So my name is Gareth Owen. I’m from the University of Portsmouth. I’m an academic and I’m going to talk to you about an experiment that we did on the Tor hidden services, trying to categorize them, estimate how many there were, and so on. As we go through the talk I’m going to explain how Tor hidden services work internally, how the data was collected, and what sort of conclusions you can draw from the data given the way that we collected it. Just so I get an idea: how many of you use Tor on a regular basis, could you put your hand up for me? So quite a big number. Keep your hand up if… or put your hand up if you’re a relay operator. Wow, that’s quite a significant number, isn’t it? And then, put your hand up, or keep it up, if you run a hidden service. Okay, so, a smaller number, but still some people run hidden services. Okay, so, some of you may be very familiar with the way Tor works at a low level. But I am gonna go through it for those who aren’t, so they understand just how it works. And as I explain how the hidden services work, I’m going to tag on information on how the hidden services themselves can be deanonymised, and also how the users of those hidden services can be deanonymised, if you put some strict criteria on what it is you want to do with respect to them. So the things that I’m going to go over: I wanna go over how Tor works, and then specifically how hidden services work. I’m gonna talk about something called the “Tor Distributed Hash Table” for hidden services. If you’ve heard that term and don’t know what it means, don’t worry, I’ll explain what a distributed hash table is and how it works. It’s not as complicated as it sounds. And then I wanna go over Darknet data, so, data that we collected from Tor hidden services.
And as I say, as we go along I will explain how you do deanonymisation of both the services themselves and of the visitors to the services, and just how complicated it is. So you may have seen this slide, which I think was from GCHQ, released last year as part of the Snowden leaks, where they said: “We can deanonymise some users some of the time”, but they’ve had no success in deanonymising someone in response to a specific request. So, for example, given all of you, I may be able to deanonymise a small fraction of you, but I can’t choose precisely one person I want to deanonymise. That’s what I’m gonna be explaining in relation to the deanonymisation attacks: how you can deanonymise a section of the users, but you can’t necessarily choose which section you will be deanonymising. Tor tries to solve a couple of different problems. On one hand it allows you to bypass censorship. So if you’re in a country like China, which blocks some types of traffic, you can use Tor to bypass their censorship blocks. It tries to give you privacy, so at one point in the network someone may know who you are but can’t see what you’re doing, and at another point in the network someone may be able to see what you’re doing but doesn’t know who you are. Now the traditional case for this is to look at VPNs. With a VPN you have a single provider. You have lots of users connecting to the VPN. The VPN has a sort of mixing effect from an outsider’s or a server’s point of view. And then out of the VPN you see requests to Twitter, Wikipedia etc. And if that traffic isn’t encrypted then the VPN can also read the contents of the traffic. Now of course there is a fundamental weakness with this: you have to trust the VPN provider, because the VPN provider knows both who you are and what you’re doing and can link those two together with absolute certainty.
So whilst you do get some of these properties, assuming you’ve got a trustworthy VPN provider, you don’t get them in the face of an untrustworthy VPN provider. And of course: how do you trust the VPN provider? What sort of measure do you use? That’s sort of an open question. So Tor tries to solve this problem by distributing the trust. Tor is an open source project, so you can go on to their Git repository, you can download the source code, and change it, improve it, submit patches etc. As you heard earlier, during Jacob and Roger’s talk, they’re currently partly sponsored by the US Government, which seems a bit paradoxical, but as they explained in that talk, that doesn’t affect their judgement. And indeed, they do have some funding from other sources, and they design the system – which I’ll talk about a little bit later – in a way where they don’t have to trust each other. So there’s some redundancy, and they’re trying to minimize these sorts of trust issues. Now, Tor is a partially de-centralized network, which means that it has some centralized components which are under the control of the Tor Project, and some de-centralized components which are normally the Tor relays. If you run a relay you’re one of those de-centralized components. There is, however, no single authority on the Tor network – no single server which is responsible, which you’re required to trust. So the trust is somewhat distributed, but not entirely. When you establish a circuit through Tor you, the user, download a list of all of the relays inside the Tor network. And you get to pick – and I’ll tell you how you do that – which relays you’re going to use to route your traffic through. So here is a typical example: You’re here on the left hand side as the user. You download a list of the relays inside the Tor network and you select from that list three nodes: a guard node which is your entry into the Tor network, a relay node which is a middle node.
Essentially, it’s going to route your traffic to a third hop. And then the third hop is the exit node, where your traffic essentially exits out onto the internet. Now, looking at the circuit – so this is a circuit through the Tor network through which you’re going to route your traffic – there are three layers of encryption at the beginning, between you and the guard node. Your traffic is encrypted three times: in the first instance encrypted to the guard, then encrypted again to the relay, and then encrypted again to the exit. And as the traffic moves through the Tor network each of those layers of encryption is peeled off the data. The guard here in this case knows who you are, and the exit relay knows what you’re doing, but neither knows both. And the middle relay doesn’t really know a lot, except which relay is the guard and which relay is the exit. Who runs an exit relay? If you run an exit relay, all of the traffic which users are sending out onto the internet appears to come from your IP address. So running an exit relay is potentially risky, because someone may do something through your relay which attracts attention. And then, when law enforcement trace that back to an IP address, it’s going to come back to your address. So some relay operators have had trouble with this, with law enforcement coming to them and saying: “Hey, we got this traffic coming through your IP address and you have to go and explain it.” So if you want to run an exit relay it’s a little bit risky, but we’re thankful for those people that do run exit relays, because ultimately if people didn’t run exit relays you wouldn’t be able to get out of the Tor network, and it wouldn’t be terribly useful from that point of view. So, yes. applause So when you set up a Tor relay, you publish something called a descriptor – which describes your Tor relay and how to use it – to a set of servers called the authorities.
And the trust in the Tor network is essentially split across these authorities. They’re run by the core Tor Project members. And they maintain a list of all of the relays in the network. And they observe them over a period of time. If the relays exhibit certain properties they give the relays flags. If e.g. a relay allows traffic to exit from the Tor network it will get the ‘Exit’ flag. If a relay has been switched on for a certain period of time, and carried a certain amount of traffic, it will be allowed to become a guard relay, which is the first node in your circuit. So when you build your circuit you download a list of these descriptors from one of the Directory Authorities. You look at the flags which have been assigned to each of the relays, and then you pick your route based on that. So you’ll pick your guard node from the set of relays which have the ‘Guard’ flag, your exit from the set of relays which have the ‘Exit’ flag, etc. Now, as of a quick count this morning there are about 1500 guard relays, around 1000 exit relays, and six relays flagged as ‘bad’ exits. What does a ‘bad exit’ mean? waits for audience to respond That’s not good! That’s exactly what it means! Yes! laughs applause Relays which have been flagged as ‘bad exits’ your client will never choose to exit traffic through. Examples of things which may get a relay flagged as a bad exit: fiddling with the traffic which is coming out of the relay, or doing things like man-in-the-middle attacks against SSL traffic. We’ve seen various things: there have been relays man-in-the-middling SSL traffic, and very recently there was an exit relay which was patching binaries that you downloaded from the internet, inserting malware into the binaries. So you can do these things, but the Tor Project tries to scan for them. And if these things are detected then the relays will be flagged as ‘bad exits’.
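The flag-based route selection just described can be sketched roughly as follows. The relay list here is invented, and a real Tor client also weights its choice by bandwidth and applies further constraints; this is only a toy illustration of filtering by flags, including the rule that a ‘BadExit’ relay is never chosen as the exit:

```python
import random

# Invented relay list; a real client parses the consensus it
# downloads from a directory authority.
relays = [
    {"nick": "relayA", "flags": {"Guard", "Fast", "Running"}},
    {"nick": "relayB", "flags": {"Exit", "Fast", "Running"}},
    {"nick": "relayC", "flags": {"Fast", "Running"}},
    {"nick": "relayD", "flags": {"Exit", "BadExit", "Running"}},
]

def pick(need, avoid=frozenset()):
    """Pick one relay carrying all `need` flags and none of `avoid`."""
    candidates = [r for r in relays
                  if need <= r["flags"] and not (avoid & r["flags"])]
    return random.choice(candidates)

guard = pick({"Guard"})
exit_relay = pick({"Exit"}, avoid={"BadExit"})  # bad exits never selected
middle = pick({"Running"})                      # any running relay will do
print(guard["nick"], middle["nick"], exit_relay["nick"])
```

In this toy consensus only relayA can be the guard and only relayB the exit, since relayD carries the ‘BadExit’ flag and is filtered out.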
It’s true to say that the scanning mechanism is not 100% fool-proof by any stretch of the imagination. It tries to pick up common types of attacks, so as a result it won’t pick up unknown attacks, or attacks which haven’t been seen or known about beforehand. So looking at this, how do you deanonymise the traffic travelling through the Tor network? Given some traffic coming out of the exit relay, how do you know which user that corresponds to? What is their IP address? You can’t actually modify the traffic, because if any of the relays tries to modify the traffic which they’re sending through the network, Tor will tear down the circuit. There are integrity checks at each of the hops. Because you can’t decrypt the packet you can’t modify it in any meaningful way, and because there’s an integrity check at the next hop any modification is detected. So you can’t insert a marker and try to follow the marker through the network. So instead, what can you do? Let me give you two cases. In the worst case, the attacker controls all three of the relays that you pick, which is an unlikely scenario because they’d need to control quite a big proportion of the network. Then it should be quite obvious that they can work out who you are and also see what you’re doing, because in that case they can tag the traffic and just discard these integrity checks at each of the following hops. Now in a different case, if you control the guard relay and the exit relay but not the middle relay: the guard relay can’t tamper with the traffic, because the middle relay will close down the circuit as soon as that happens. The exit relay can’t send stuff back down the circuit to try and identify the user, either, because again the circuit will be closed down. So what can you do? Well, you can count the number of packets going through the guard node.
And you can measure the timing differences between packets, and try and spot that pattern at the exit relay. You’re looking at counts of packets and the timing between those packets which are being sent, and essentially trying to correlate them. So if the user happens to pick your guard node, and then happens to pick your exit relay, then you can deanonymise them with very high probability using this technique. You’re just correlating the timings of packets and counting the number of packets going through. And the attacks demonstrated in the literature are very reliable at this. We heard earlier in the Tor talk about the “relay early” tag, which was the attack discovered by the CERT researchers in the US. That attack didn’t rely on timing. Instead, what they were able to do was send a special type of cell back down the circuit, essentially marking the data, saying: “This is the data we’re seeing at the exit relay, or at the hidden service”, and encoding into the messages travelling back down the circuit what the data was. And then you could pick those up at the guard relay and say, okay, this is the person requesting that data. In fact, although this technique works, and it was a very nice attack, the traffic correlation attacks are actually just as powerful. So although this bug has been fixed, traffic correlation attacks still work and are still fairly reliable. So the problem still exists. This is very much an open question. We don’t currently know how to solve this problem of traffic correlation. There are a couple of proposed solutions, but they’re not particularly practical. Let me just go through these, and I’ll skip back on the few things I’ve missed. The first thing is high-latency networks, so networks where packets are delayed in their transit through the network. That throws away a lot of the timing information.
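The correlation idea just described can be sketched very crudely. The timestamps below are invented, and real attacks use proper statistical correlation over large flows, but the principle is the same: compare packet counts and inter-packet delays seen at the guard with those seen at candidate exits, and the matching flow stands out:

```python
# Toy sketch of a traffic-correlation attack: an observer who sees
# packet timestamps at both a guard and an exit scores how well two
# flows line up.  All timestamps here are invented.

def inter_packet_delays(timestamps):
    return [b - a for a, b in zip(timestamps, timestamps[1:])]

def similarity(guard_ts, exit_ts):
    """Crude score: same packet count plus similar delay pattern."""
    if len(guard_ts) != len(exit_ts):
        return 0.0
    g = inter_packet_delays(guard_ts)
    e = inter_packet_delays(exit_ts)
    # Sum of absolute differences between the delay patterns;
    # lower error means a more similar flow.
    error = sum(abs(a - b) for a, b in zip(g, e))
    return 1.0 / (1.0 + error)

guard_flow = [0.00, 0.10, 0.25, 0.90]            # flow seen at the guard
exit_match = [0.30, 0.40, 0.55, 1.20]            # same pattern, shifted by transit delay
exit_other = [0.30, 0.80, 0.85, 1.20]            # unrelated flow

print(similarity(guard_flow, exit_match))        # near 1.0: the flows line up
print(similarity(guard_flow, exit_other))        # much lower
```

Note the shift by a constant transit delay does not matter, because only the gaps between packets are compared.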
So they promise to potentially solve this problem. But of course, if you want to visit Google’s home page and you have to wait five minutes for it, you’re simply just not going to use Tor. The whole point is trying to make this technology usable, and if you’ve got something which is very, very slow then it’s not attractive to use. This case does work slightly better for e-mail, though. If you think about it, with e-mail you don’t mind – well, you may not mind, you may mind – you don’t mind if your e-mail is delayed by some period of time. And as Roger said earlier, you can also introduce padding into the circuit, so these are dummy cells. But, with a big caveat: some of the research suggests that actually you’d need to introduce quite a lot of padding to defeat these attacks, and that would overload the Tor network in its current state. So, again, not a particularly practical solution. How does Tor try to solve this problem? Well, Tor makes it very difficult to become a user’s guard relay. If you can’t become a user’s guard relay then you don’t know who the user is, quite simply. And so by making it very hard to become the guard relay, Tor prevents this traffic correlation attack. At the moment the Tor client chooses one guard relay and keeps it for a period of time. So if I wanted to target just one of you, I would need to control the guard relay that you were using at that particular point in time. And in fact I’d also need to know what that guard relay is. So by making it very unlikely that you would select a particular malicious guard relay – where the number of malicious guard relays is very small – that’s how Tor tries to solve this problem. And at the moment your guard relay is your barrier of security. If the attacker can’t control the guard relay then they won’t know who you are.
That doesn’t mean they can’t try other sorts of side channel attacks, by messing with the traffic at the exit relay etc. You know, you may e.g. download dodgy documents and open one on your computer, those sorts of things. Now the alternative, of course, to having a guard relay and keeping it for a very long time would be to have a guard relay and change it on a regular basis. Because you might think, well, just choosing one guard relay and sticking with it is probably a bad idea. But actually, that’s not the case. Assuming that the chance of picking a malicious guard relay is very low: when you pick your guard relay, if you made a good choice then your traffic is safe; if you didn’t make a good choice then your traffic isn’t safe. Whereas if your Tor client chooses a new guard relay every few minutes, or every hour, or something along those lines, at some point you’re gonna pick a malicious guard relay. So they’re gonna have some of your traffic, but not all of it. And so currently the trade-off is that we make it very difficult for an attacker to control a guard relay, and the user picks a guard relay and keeps it for a long period of time. That makes it very difficult for attackers controlling a very small proportion of the network to end up as your guard relay. So this currently provides those properties I described earlier, the privacy and the anonymity, when you’re browsing the web, when you’re accessing websites etc. But you still know who the website is. So although you’re anonymous, and the website doesn’t know who you are, you know who the website is. And there may be some cases where the website would also wish to remain anonymous. You want the person accessing the website and the website itself to be anonymous to each other. And you could think about people e.g. in countries where running a political blog might be a dangerous activity.
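The trade-off just described (one long-lived guard versus frequent rotation) can be put in rough numbers. The 1% attacker share and the daily rotation schedule below are invented, purely for illustration:

```python
# Suppose (invented figure) an attacker controls 1% of guard capacity,
# so each fresh guard pick is malicious with probability p.
p = 0.01

# Strategy 1: pick one guard and keep it.  The chance that any of
# your traffic is ever exposed is just p.
keep_one = p

# Strategy 2: rotate to a fresh guard daily for a year.  The chance
# that at least one pick is malicious, exposing some of your traffic:
rotations = 365
rotate_daily = 1 - (1 - p) ** rotations

print(f"fixed guard:  {keep_one:.1%} chance of any exposure")
print(f"daily rotate: {rotate_daily:.1%} chance of some exposure")
```

Under these made-up numbers the rotating client is almost certain to hand some traffic to the attacker within a year, while the fixed-guard client is exposed only 1% of the time, which is the intuition behind keeping one guard.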
If you run such a blog on a regular webserver you’re easily identified, whereas if you’ve got some way where you, as the webserver, can be anonymous, then that allows you to do that activity without being targeted by your government. So this is what hidden services try to solve. Now when you first think about the problem you kind of think: “Hang on a second, the user doesn’t know who the website is and the website doesn’t know who the user is. So how on earth do they talk to each other?” Well, that’s essentially what the Tor hidden service protocol tries to set up: how do you identify and connect to each other? So at the moment this is what happens: We’ve got Bob on the [right] hand side, who is the hidden service. And we’ve got Alice on the left hand side here, who is the user who wishes to visit the hidden service. Now when Bob sets up his hidden service he picks three nodes in the Tor network as introduction points and builds multi-hop circuits to them. So the introduction points don’t know who Bob is; Bob has circuits to them. And Bob says to each of these introduction points: “Will you relay traffic to me if someone connects to you asking for me?” And those introduction points do that. Then, once Bob has picked his introduction points, he publishes a descriptor listing his introduction points, for anyone who wishes to visit his website. And then Alice, on the left hand side, wishing to visit Bob, will pick a rendezvous point in the network and build a circuit to it. So this “RP” here is the rendezvous point. And she will relay a message to Bob via one of the introduction points saying: “Meet me at the rendezvous point”. And then Bob will build a 3-hop circuit to the rendezvous point. So now at this stage we’ve got Alice with a multi-hop circuit to the rendezvous point, and Bob with a multi-hop circuit to the rendezvous point. Alice and Bob haven’t connected to one another directly.
The rendezvous point doesn’t know who Bob is, and the rendezvous point doesn’t know who Alice is. All it’s doing is forwarding the traffic. And it can’t inspect the traffic, either, because the traffic itself is encrypted. So that’s how you currently solve this problem of communicating with someone when you don’t know who they are and vice versa. drinks from the bottle The principal thing I’m going to talk about today is this database. As I said, when Bob picks his introduction points he builds this thing called a descriptor, describing who his introduction points are, and he publishes it to a database. This database itself is distributed throughout the Tor network – it’s not a single server. Both Bob and Alice need to be able to publish information to this database, and also to retrieve information from it. And Tor currently uses something called a distributed hash table. I’m gonna give an example of what this means and how it works, and then I’ll talk about how the Tor Distributed Hash Table itself works. So let’s say e.g. you’ve got a set of servers. Here we’ve got 26 servers, and you’d like to store your files across these different servers without having a single server responsible for deciding “okay, that file is stored on that server, and this file is stored on that server” etc. Now here is my list of files. You could take a very naive approach. You could say: “Okay, I’ve got 26 servers, and all of these file names start with a letter of the alphabet. All of the files that begin with A are gonna go on server A; all of the files that begin with B are gonna go on server B; etc.” And then when you want to retrieve a file you say: “Okay, what does my file name begin with?” And then you know which server it’s stored on. Now of course you could have a lot of servers – sorry – a lot of files which begin with a Z, an X or a Y etc., in which case you’re gonna overload that server.
You’re gonna have more files stored on one server than on another server in your set. And if you have a lot of big files, say e.g. beginning with B, then rather than distributing your files across all the servers you’re gonna just be overloading one or two of them. So to solve this problem what we tend to do is: we take the file name, and we run it through a cryptographic hash function. A cryptographic hash function produces output which looks random: very small changes in the input produce a very large change in the output, and this change looks random. So if I take all of my file names here – and assume I have a lot more – I take a hash of each one, and then I use that hash to determine which server to store the file on. Then, with high probability, my files will be distributed evenly across all of the servers. And when I want to go and retrieve one of the files, I take my file name, I run it through the cryptographic hash function, that gives me the hash, and then I use that hash to identify which server that particular file is stored on. And then I go and retrieve it. So that’s a loose idea of how a distributed hash table works. There are a couple of problems with this: what if the set of servers changes in size, as it does in the Tor network? But that’s a very brief overview of the theory. So how does it apply to the Tor network? Well, the Tor network has a set of relays and it has a set of hidden services. Now we take all of the relays, and each has an identity hash which identifies it. And we map them onto a circle using that hash value as an identifier. So you can imagine the hash value ranging from zero to a very large number. We’ve got a zero point at the very top there, and that runs all the way round to the very large number. So given the identity hash for a relay we can map that to a particular point on the circle. And then all we have to do is also do this for hidden services.
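The hash-based file placement described a moment ago can be sketched in a few lines. The file name is invented, and SHA-1 is used here only as an example of a cryptographic hash:

```python
import hashlib

SERVERS = 26  # the 26 servers from the example above

def server_for(filename: str) -> int:
    """Hash the file name and reduce it to a server index; the hash
    spreads names evenly, whatever letter they happen to start with."""
    digest = hashlib.sha1(filename.encode()).digest()
    return int.from_bytes(digest, "big") % SERVERS

# Storing and retrieving use the same computation, so any client can
# work out where a file lives without asking a central coordinator:
idx = server_for("holiday-photos.zip")
print(idx)
```

Because the hash output is effectively random, hashing a large batch of names lands roughly the same number of files on each of the 26 servers, which is exactly the load-balancing property the lettered scheme lacked.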
So there’s a hidden service address, something.onion – this is one of the hidden websites that you might visit. You take the – I’m not gonna describe in too much detail how this is done, but – the address is hashed in such a way that it’s evenly distributed around the circle. So your hidden service will have a particular point on the circle. And the relays will also be mapped onto this circle. So there are the relays, and the hidden service. And in the case of Tor the hidden service actually maps to two positions on the circle, and it publishes its descriptor to the three relays to the right at one position, and the three relays to the right at the other position. So there are in total six places where this descriptor is published on the circle. And then if I want to go and connect to a hidden service, I go and pull this descriptor down to identify what its introduction points are. I take the hidden service address, I find out where it is on the circle, I map all of the relays onto the circle, and then I identify which relays on the circle are responsible for that particular hidden service. And then I just connect and say: “Do you have a copy of the descriptor for that particular hidden service?” And if so, then we’ve got our list of introduction points, and we can go through the next steps to connect to our hidden service. So, I’m gonna explain how we set up our experiments. What we were interested in doing was collecting publications of hidden services. Every time a hidden service gets set up, it publishes to this distributed hash table. What we wanted to do was collect those publications so that we get a complete list of all of the hidden services. And we also wanted to find out how many times a particular hidden service is requested. Just one more point that will become important later: the position at which the hidden service appears on the circle changes every 24 hours.
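The “three relays to the right” rule can be sketched as follows. The identities here are invented, and real Tor derives each descriptor’s position from the onion address together with a replica number and the date (which is why the position moves every day); this only illustrates the successor lookup on the circle:

```python
import hashlib

def position(identity: bytes) -> int:
    """Map an identity onto the circle as a number in [0, 2**160)."""
    return int.from_bytes(hashlib.sha1(identity).digest(), "big")

def responsible(descriptor_pos, relay_positions, n=3):
    """The n relays clockwise ('to the right') of the descriptor's
    position, wrapping past the circle's zero point if necessary."""
    ring = sorted(relay_positions)
    successors = [r for r in ring if r >= descriptor_pos]
    return (successors + ring)[:n]

# Invented relay identities and a made-up descriptor position:
relay_ring = [position(f"relay-{i}".encode()) for i in range(20)]
desc = position(b"example descriptor, replica 0")
print(responsible(desc, relay_ring))
```

A client fetching the descriptor runs the same computation over the relay list from the consensus, so publisher and fetcher independently agree on which six relays (three per replica) hold the descriptor.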
So it’s not in a fixed position every single day. If we run 40 nodes over a long period of time, we will occupy positions within that distributed hash table, and we will be able to collect publications of, and requests for, hidden services that are located at those positions inside the distributed hash table. So we ran 40 Tor nodes. We had a student at university who said: “Hey, I run a hosting company, I’ve got loads of server capacity”, and we told him what we were doing, and he said: “Well, you really helped us out these last couple of years…” and just gave us loads of server capacity to allow us to do this. So we spun up 40 Tor nodes. Each Tor node was required to advertise a certain amount of bandwidth to become a part of that distributed hash table. It’s actually a very small amount, so this didn’t matter too much. And then, after 25 hours – this has changed recently, in the last few days; it’s just been increased as a result of one of the attacks last week, but certainly during our study it was 25 hours – you appear at a particular point inside that distributed hash table. And you’re then in a position to record publications of hidden services and requests for hidden services. So not only can you get a full list of the onion addresses, you can also find out how many times each of the onion addresses is requested. And so this is what we recorded. And then, once we had run for a long period of time to collect a long list of .onion addresses, we built a custom crawler that would visit each of the Tor hidden services in turn, and pull down the HTML content – the text content from the web page – so that we could go ahead and classify the content. Now it’s really important to note here, and it will become obvious why a little bit later: we only pulled down HTML content. We didn’t pull down images. And there’s a very, very important reason for that which will become clear shortly.
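Our actual crawler isn’t reproduced here, but the text-only idea can be sketched with Python’s standard HTML parser: extract the page text and the links to follow, and simply never issue a request for an image resource. The example page is invented:

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collects page text and outgoing links from an HTML document.
    It records nothing for <img> tags and never fetches them, so no
    image data ever reaches the crawler."""
    def __init__(self):
        super().__init__()
        self.text = []
        self.links = []

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

p = TextOnly()
p.feed('<html><body><h1>Hello</h1><img src="pic.png">'
       '<a href="/next">more</a></body></html>')
print(p.text)    # the page text only
print(p.links)   # links the crawler could follow next
```

The `<img>` tag passes through `handle_starttag` without being recorded, so the crawl frontier only ever contains page URLs, never image URLs.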
We had a lot of questions when we first started this. No one really knew how many hidden services there were. It had been suggested to us that there was a very high turnover of hidden services. We wanted to confirm whether that was true or not. And we also wanted to know what the hidden services are, how popular they are, etc. So, our estimate for how many hidden services there are: over the period which we ran our study, this is a graph plotting our estimate, for each individual day, of how many hidden services there were on that particular day. Now the data is naturally noisy, because we’re only a very small proportion of that circle. So we’re only observing a very small proportion of the total publications and requests every single day, for each of those hidden services. If you take a long-term average of this, there are about 45,000 hidden services that we think were present, on average, each day, during our entire study. Which is a large number of hidden services. But over the entire length we collected about 80,000 in total. Some came and went, etc. So the next question, after how many hidden services there are, is how long a hidden service exists for. Does it exist for a very long period of time, does it exist for a very short period of time, etc.? So what we did was, for every single .onion address, we plotted how many times we saw a publication for that particular hidden service during the six months. How many times did we see it? If we saw it a lot of times, that suggests the hidden service existed for a long period of time. If we saw only a small number of publications for a hidden service, that suggests it was only present for a short period of time. This is our graph. By far the largest number of hidden services we saw only once during the entire study, and we never saw them again. That suggests there’s a very high turnover of hidden services; they don’t tend to exist, on average,
for a very long period of time. And then you can see the sort of tail here. If we plot just those hidden services which existed for a long time – so e.g. we could take hidden services which have a high number of hit requests and say: “Okay, those that have a high number of hits probably existed for a long time”; that’s not absolutely certain, but probable – then you see this sort of normal distribution around 4 or 5. So we saw most hidden services four or five times, on average, during the entire six months if they were popular, and we’re using that as a proxy measure for whether they existed the entire time. Now, this study ran over 160 days, so almost six months. What we also wanted to do was try to confirm this over a longer period. Last year, in 2013, around February, some researchers at the University of Luxembourg also ran a similar study, but it ran over a very short period of time – about a day. They did it in such a way that they could collect descriptors across much of the circle during a single day. That was possible because of a bug in the way Tor did some things, which has now been fixed, so we can’t repeat that particular approach. So we got a list of .onion addresses from February 2013 from these researchers at the University of Luxembourg. And then we got our list of .onion addresses from our six months, which was March to September of this year. And we wanted to say: okay, given these two sets of .onion addresses, which .onion addresses existed in their set but not ours, and vice versa, and which .onion addresses existed in both sets? As you can see, a very small minority of hidden service addresses existed in both sets. This is over an 18-month period between these two collection points. A very small number of services existed both in their data set and in our data set. Which again suggests there’s a very high turnover of hidden services, and that they don’t tend to exist for a very long period of time. So the question is: why is that?
Which we’ll come on to a little bit later. It’s a very valid question; we can’t answer it 100%, but we have some inklings as to why that may be the case. So in terms of popularity, which hidden services, or which .onion addresses, did we see requested the most? Which got the largest number of hits, or the largest number of directory requests? Botnet Command & Control servers. If you’re not familiar with what a botnet is: the idea is to infect lots of people with a piece of malware. And this malware phones home to a Command & Control server, where the botnet master can give instructions to each of the bots to do things. That might be e.g. to collect passwords, key strokes, banking details. Or it might be to do things like Distributed Denial of Service attacks, or to send spam, those sorts of things. Now, the problem with running a botnet is that your C&C servers are vulnerable: once a C&C server is taken down, you no longer have control over your botnet. So there’s been a sort of arms race between anti-virus companies and malware authors, with the authors trying to come up with techniques to run C&C servers in a way in which they can’t be taken down. And a couple of years ago someone gave a talk at a conference that said: “You know what? It would be a really good idea if botnet C&C servers were run as Tor hidden services, because then no one knows where they are, and in theory they can’t be taken down.” And in fact that’s what we see: there are loads and loads of these addresses associated with several different botnets, ‘Sefnit’ and ‘Skynet’. Now, Skynet is the one I wanted to talk to you about, because the guy that runs Skynet had a Twitter account, and he also did a Reddit AMA. If you’ve not heard of a Reddit AMA before, that’s a Reddit ask-me-anything: you can go on the website and ask the guy anything. So this guy wasn’t hiding in the shadows.
He’d say: “Hey, I’m running this massive botnet, here’s my Twitter account which I update regularly, here’s my Reddit AMA where you can ask me questions!” etc. He was arrested last year, which is not, perhaps, a huge surprise. laughter and applause But… so he was arrested and his C&C servers disappeared, but there were still infected hosts trying to connect to the C&C servers and requesting access to them. This is why we saw such a large number of hits: all of these requests were failed requests, i.e. we didn’t have a descriptor for them because the hidden service had gone away, but there were still clients requesting each of the hidden services. The next thing we wanted to do was to try to categorize sites. As I said earlier, we crawled all of the hidden services that we could, and we classified them into different categories based on the type of content on the hidden service site. The first graph I have is the number of sites in each of the categories. You can see down the bottom here we’ve got lots of different categories – drugs, marketplaces, etc. – and the graph shows the percentage of the hidden services we crawled that fit into each of these categories. So, e.g., looking at this, the largest number of sites that we crawled were drugs-focused websites, followed by marketplaces, etc. There are a couple of categories here which stick out and need explaining. What does ‘porn’ mean? Well, you know what ‘porn’ means. There are some very notorious porn sites on the Tor Darknet. There was one in particular which was focused on revenge porn. It turns out that youngsters like to take pictures of themselves and send them to their boyfriends or their girlfriends, and when they get dumped, those pictures get published on these websites. There were several of these sites on the main internet which have mostly been shut down, and some of them were archived on the Darknet.
The second one you’re probably wondering about is ‘abuse’. Every single site we classified in this category was a child abuse site: they were in some way facilitating child abuse. And how do we know that? Well, the data that came back from the crawler made it completely unambiguous what the content on these sites was. It was completely obvious, from the text content alone, what was on these sites. And this is the principal reason why we didn’t pull down images from sites: in many countries it would be a criminal offense to do so. So our crawler only pulled down text content from all of these sites, and that enabled us to classify them. We didn’t pull down any images. Of course, the next thing we’d like to do is to say: “Okay, given each of these categories, what proportion of directory requests went to each of them?” Now, the next graph is going to need some explaining as to precisely what it means, and I’m going to give that. This is the proportion of directory requests we saw that went to each of the categories of hidden service that we classified. As you can see, we in fact saw a very large number going to these abuse sites, with the rest distributed down there at the bottom. And the question is: what is it we’re collecting here? We’re collecting successful hidden service directory requests. What does a hidden service directory request mean? It probably loosely correlates with either a visit or a visitor – somewhere in between those two. Because when you want to visit a hidden service, you make a request for the hidden service descriptor, and that allows you to connect to it and browse the web site. But there are cases where, e.g. if you restart Tor, you’ll go back and re-fetch the descriptor – in that case we’d count you twice, for example. What proportion of these are people, and what proportion are something else?
The answer to that is we simply don’t know. We’ve got directory requests, but that doesn’t tell us what they’re doing on these sites, what they’re fetching, who they are, or indeed what they are. These could be automated requests, or they could be human beings; we can’t distinguish between those two. So what are the limitations? A hidden service directory request correlates exactly with neither a visit nor a visitor – it’s probably somewhere in between, so you can’t say it’s exactly one or the other. We cannot say whether a hidden service directory request comes from a person or something automated. And any type of site could be targeted by, e.g., DoS attacks or web crawlers, which would greatly inflate the figures. If you were to do a DoS attack, it’s likely you’d only request a small number of descriptors – you’d actually be flooding the site itself rather than the directories. In theory you could flood the directories, but we didn’t see any sort of shutdown of our directories from flooding, for example. Whilst we can’t rule that out, it doesn’t seem to fit too well with what we’ve got. The other question is crawlers. I obviously talked with the Tor Project about these results, and they’ve suggested that there are groups – child protection agencies, e.g. – that will crawl these sites on a regular basis. Again, that doesn’t necessarily correlate with a human being, and it could inflate the figures. How many hidden service directory requests would there be if a crawler were pointed at a site? Typically, if I crawl it on a single day, one request. But if they’ve got a large number of servers doing the crawling, then it could be one request per day for every single server. So, again, I can’t give you a definitive “yes, this is human beings” or “yes, this is automated requests”. The other important point is that these two content graphs only cover hidden services offering web content.
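As an aside, the text-only classification described earlier could be sketched as a simple keyword scorer. The talk doesn’t detail the actual method used in the study, so the categories and keywords below are invented purely for illustration:

```python
# Toy keyword-based categoriser for crawled page text.
# Categories and keywords are illustrative assumptions, not the
# actual classifier used in the study.
CATEGORY_KEYWORDS = {
    "drugs": {"cannabis", "mdma", "pharmacy", "gram"},
    "marketplace": {"escrow", "vendor", "listing", "shipping"},
}

def classify(page_text):
    """Assign the category whose keyword set overlaps the page text most."""
    words = set(page_text.lower().split())
    scores = {cat: len(words & kw) for cat, kw in CATEGORY_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "other"

print(classify("trusted vendor with escrow and tracked shipping"))  # marketplace
print(classify("a page about kittens"))                             # other
```

A real classifier would need much more care (tokenisation, weighting, manual review), but the point stands that text content alone was sufficient to categorise sites.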
There are hidden services that do other things, e.g. IRC, instant messaging, etc. Those aren’t included in these figures. We’re only concentrating on hidden services offering web sites – HTTP or HTTPS services – because that allows us to easily classify them. And in fact, for some of the other types, like IRC and Jabber, the results would probably not be directly comparable with web sites: the use case for using them is probably slightly different. So, I appreciate the last graph is somewhat alarming. If you have any questions, please ask either me or the Tor developers as to how to interpret these results; it’s not quite as straightforward as it may look. You might look at the graph and say: “Hey, that looks like there’s lots of people visiting these sites.” It’s difficult to conclude that from the results. The next slide is going to be very contentious, and I will prefix it with this: I’m not advocating any kind of action whatsoever. I’m just trying to describe, technically, what could be done. It’s not up to me to make decisions on these types of things. So, of course, when we found this out, frankly, I think we were stunned. It took us several days – “what the hell, this is not what we expected at all.” A natural step is – well, most of us think that Tor is a great thing, it seems – could this problem be sorted out while still keeping Tor as it is? And probably the next step is to say: “Well, okay, could we block just this class of content and not other types of content?” So could we block just the hidden services that are associated with these sites, and not other types of hidden services? We thought there are three ways in which hidden services could be blocked. I’ll talk about whether these will remain possible in the coming months after explaining them; during our study, and at present, they are possible.
A single individual could shut down a single hidden service by controlling all of the relays which are responsible for receiving publication requests for it on the distributed hash table. It’s possible to place one of your relays at a particular position on that circle, and thereby make yourself the responsible relay for a particular hidden service. And if you control all six of the relays which are responsible for a hidden service, then when someone comes to you and says “Can I have the descriptor for that site?”, you can just say “No, I haven’t got it.” Provided you control those relays, users won’t be able to fetch those sites. The second option concerns relay operators. Could I, as a relay operator, say: “I don’t want to carry this type of content, and I don’t want to be responsible for serving up this type of content”? A relay operator could patch his relay so that if anyone comes to it requesting any of these sites, it simply refuses. The problem is that a lot of relay operators would need to do it – a very, very large number of relay operators would need to do that to effectively block these sites. The final option is that the Tor Project could modify the Tor program and embed these addresses in the program itself, so that all relays by default block hidden service directory requests to these sites, and clients themselves would also block requests for them at the client level. Now, I hasten to add: I’m not advocating any kind of action; that is entirely up to other people – because, frankly, I think if I advocated blocking hidden services I probably wouldn’t make it out of here alive. I’m just describing what technical measures could be used to block some classes of sites. And of course there are lots of questions here. If, e.g.,
the Tor Project themselves decided “Okay, we’re going to block these sites”, that means they are essentially in control of the block list. The block list would be somewhat public, so everyone would be able to inspect what sites are being blocked, but they would be in control of some kind of block list – which, you know, arguably is against what the Tor Project is after. takes a sip, coughs So how about deanonymising visitors to hidden service web sites? In this case we’ve got a user on the left-hand side who is connected to a Guard node; we’ve got a hidden service on the right-hand side which is connected to its Guard node; and at the top we’ve got one of those directory servers which is responsible for serving up hidden service directory requests. Now, when you first want to connect to a hidden service, you connect through your Guard node and through a couple of hops up to the hidden service directory, and you request the descriptor from it. So at this point, if you are the attacker and you control one of the hidden service directory nodes for a particular site, you can send a particular pattern of traffic back down the circuit. And if you control that user’s Guard node – which is a big if – then you can spot that pattern of traffic at the Guard node. The question is: how do you control a particular user’s Guard node? That’s very, very hard. But if, e.g., I run a hidden service and all of you visit my hidden service, and I’m running a couple of dodgy Guard relays, then the probability is that some of you – certainly not all of you by any stretch – will select my dodgy Guard relay, and I could deanonymise you, but I couldn’t deanonymise the rest. So what we’re saying here is that you can deanonymise some of the users some of the time, but you can’t pick which users those are.
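The “some users, some of the time” point can be illustrated with a toy simulation: if guards are picked roughly in proportion to bandwidth, an attacker holding a fraction f of guard capacity catches about a fraction f of users, without getting to choose which ones. (This is a simplification – real guard selection involves bandwidth weights, guard rotation periods, etc.)

```python
import random

# Toy model: each user independently picks a malicious guard with
# probability equal to the attacker's share of guard capacity.
def fraction_caught(num_users, malicious_share, seed=1234):
    rng = random.Random(seed)
    caught = sum(rng.random() < malicious_share for _ in range(num_users))
    return caught / num_users

# Controlling 5% of guard capacity deanonymises roughly 5% of users --
# but a *random* 5%, not any chosen target.
print(round(fraction_caught(100_000, 0.05), 3))
```

This is exactly the asymmetry in the GCHQ slide mentioned at the start: a fraction can be deanonymised, but not a specific person on request.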
You can’t deanonymise someone specific, but you can deanonymise a fraction, based on what fraction of the network you control in terms of Guard capacity. So, the attacker controls those two – here’s a picture from researchers at the University of Luxembourg who did this. These are plots of taking the IP addresses of users visiting a C&C server, geolocating them and putting them on a map. So: “where was the user located when they visited one of these Tor hidden services?” Again, this is a selection – a percentage of the users visiting C&C servers, obtained using this technique. How about deanonymising hidden services themselves? Well, again, you’ve got a problem. You’re the user. You connect through your Guard into the Tor network, and then eventually through the hidden service’s Guard node, and talk to the hidden service. As the attacker, you need to control the hidden service’s Guard node to do these traffic correlation attacks. So again, it’s very difficult to deanonymise a specific Tor hidden service. But if you think about it – okay, there are 1,000 Tor hidden services; if you control a percentage of the Guard nodes, then some hidden services will pick you, and you’ll be able to deanonymise those. So provided you don’t care which hidden services you’re going to deanonymise, it becomes much more straightforward – you’ll control the Guard nodes of some hidden services, but you can’t pick exactly which ones. So what sort of data can you see traversing a relay? This is a modified Tor client which just dumps cells – essentially packets travelling down a circuit – and the information you can extract from them at a Guard node. This is done off the main Tor network: I’ve got a client connected to a “malicious” Guard relay, and it logs every single packet – they’re called ‘cells’ in the Tor protocol – coming through the Guard relay. We can’t decrypt the packets, because they’re encrypted three times.
What we can record, though, is the IP address of the user, the IP address of the next hop, and a count of the packets travelling in each direction down the circuit. We can also record the time at which those packets were sent. Of course, if you’re doing the traffic correlation attacks, you’re using that timing information to try to work out whether you’re seeing traffic which you’ve sent and which identifies a particular user – or indeed traffic which they’ve sent and which you’ve seen at a different point in the network. Moving on to my… interesting problems, research questions, etc. Based on what I’ve said: there are these directory authorities which are controlled by the core Tor members. If, e.g., a big enough chunk of them were malicious, then they could manipulate the consensus to direct you to particular nodes. I don’t think that’s the case, and I don’t think anyone thinks that’s the case. And Tor is designed in a way… I mean, you’d have to control a certain number of the authorities to be able to do anything important. So – the Tor people, I said this to them a couple of days ago – I find it quite funny that you’d design your system as if you don’t trust each other. To which their response was: “No, we design our system so that we don’t have to trust each other.” Which I think is a very good model to have for this type of system. So could we eliminate these sorts of centralized servers? I think that’s actually a very hard problem. There are lots of attacks which could potentially be deployed against a decentralized network. At the moment the Tor network is relatively well understood in terms of what types of attack it is vulnerable to; if we were to move to a new architecture, we might open it up to a whole new class of attacks. The Tor network has existed for quite some time and has been very well studied.
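To illustrate the timing-correlation idea from a moment ago: a sketch with toy numbers (and a far cruder measure than the attacks in the literature) that bins the cell timestamps logged at two observation points and compares the resulting count vectors.

```python
import math

# Bin cell timestamps into fixed-width windows and count cells per bin.
def bin_counts(timestamps, bin_size=0.5, n_bins=10):
    counts = [0] * n_bins
    for t in timestamps:
        i = int(t // bin_size)
        if i < n_bins:
            counts[i] += 1
    return counts

# Cosine similarity between two count vectors: near 1.0 for matching flows.
def similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

sent = [0.10, 0.15, 1.20, 1.25, 3.00, 3.05]   # pattern injected at one point
seen = [t + 0.02 for t in sent]               # same flow observed at the guard
other = [0.50, 1.70, 2.20, 2.80, 3.90, 4.40]  # an unrelated circuit

base = bin_counts(sent)
print(similarity(base, bin_counts(seen)))    # high: likely the same flow
print(similarity(base, bin_counts(other)))   # low: different flow
```

Published attacks use much more robust statistics and handle network jitter and cover traffic, but the principle is the same: match the timing pattern seen at one relay against the pattern seen at another.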
What about global adversaries like the NSA, which can monitor network links all across the world? It’s very difficult to defend against that. If they can identify which Guard relay you’re using, they can monitor traffic going into and out of the Guard relay, and they can log each of the subsequent hops along the circuit. It’s very, very difficult to defend against these types of things. Do we know if they’re doing it? The documents that were released yesterday – I’ve only had a very brief look through them, but they suggest that they’re not presently doing it and that they haven’t had much success. I don’t know why: there are very powerful attacks described in the academic literature which are very, very reliable, and most academic literature you can access for free, so it’s not even as if they have to figure out how to do it. They just have to read the literature and implement some of these attacks. I don’t know why they’re not. The next question is how to detect malicious relays. In my case we were running 40 relays. Most of them were on consecutive IP addresses, in two blocks – so running on IP addresses numbered e.g. 1, 2, 3, 4, … We were running two relays per IP address, and every single relay had my name plastered across it. So after I set up these 40 relays in a relatively short period of time, I expected someone from the Tor Project to come to me and say: “Hey Gareth, what are you doing?” No one noticed. No one noticed. So this is presently an open question, and the Tor Project are quite open about this. They acknowledged that, in fact, last year the CERT researchers launched many more relays than that. The Tor Project spotted that large number of relays but chose not to do anything about it – and in fact those relays were deploying an attack. But, as you know, it’s often very difficult to defend against unknown attacks.
So at the moment, how to detect malicious relays is a bit of an open question – which I think is being discussed on the mailing list. The other one is defending against unknown tampering at exits. If you take the exit relays: an exit relay can tamper with the traffic. We know about particular types of attack – SSL man-in-the-middle, etc. – and we’ve seen binary patching recently. How do we detect unknown tampering with other types of traffic? The binary tampering wasn’t detected by the Tor Project themselves; it was spotted by someone else, who notified them. And the final open question here is Tor code review. The Tor code is open source. We know from OpenSSL that, although everyone can read source code, people don’t always look at it. OpenSSL has been a huge mess, and a lot about that has come out recently. There are lots of eyes on the Tor code, but I think more eyes are always better. Ideally we’d get more people to look at the Tor code and look for vulnerabilities – I encourage people to do that, it’s a very useful thing to do. There could be unknown vulnerabilities, as we’ve seen quite recently with the “relay early” bug in the Tor code, which could be quite serious. The truth is we just don’t know until people do thorough code audits, and even then it’s very difficult to know for certain. So my last point, I think, yes, is advice to future researchers. If you’re planning on doing a study in the future, e.g. on Tor, do not do what the CERT researchers did and start deanonymising people on the live Tor network, doing it in a way which is incredibly irresponsible. I mean, I tend, myself, to give them the benefit of the doubt: I don’t think the CERT researchers set out to be malicious. I think they were just very naive in what they were doing.
That was rapidly pointed out to them. In my case we were running 40 relays. Our Tor relays were forwarding traffic; they were acting as good relays. The only thing we were doing was logging publication requests to the directories. It’s a big question whether that’s malicious or not – I don’t know. One thing that has been pointed out to me is that the .onion addresses themselves could be considered sensitive information, so the only data we will be retaining from the study is the aggregated data. We won’t be retaining information on individual .onion addresses, because that could potentially be considered sensitive – if you think about someone running an .onion address which contains something they don’t want other people knowing about. So we won’t be retaining that data; we’ll be destroying it. So I think that brings me now to the questions, but first I want to say thanks to a couple of people. The student who donated the server to us. Nick Savage, one of my colleagues, who was a sounding board during the entire study. Ivan Pustogarov, the researcher at the University of Luxembourg who sent us the large data set of .onion addresses from last year – he’s also the chap who demonstrated those deanonymisation attacks I talked about. A big “thank you” to Roger Dingledine, who has frankly… presented loads of questions to me over the last couple of days and allowed me to bounce ideas back and forth; that has been a very useful process. If you are doing future research, I strongly encourage you to contact the Tor Project at the earliest opportunity – certainly I found them to be extremely helpful. Donncha also did something similar: both Ivan and Donncha have done similar studies trying to classify the types of hidden services, or work out how many hits there are to particular types of hidden service. Ivan Pustogarov did it on a bigger scale and found similar results to ours.
That is, these abuse sites featured frequently among the top requested sites. That was done over a year ago and, again, he was seeing similar sorts of patterns: these abuse sites were being requested frequently. So that also corroborates what we’re saying. The data I’ve put online is at this address. There will probably be the slides; something called ‘The Tor Research Framework’, which is an implementation of a Tor client in Java aimed specifically at researchers – so if, e.g., you want to pull data out of a consensus, you can do; if you want to build custom routes through the network, you can do; if you want to build routes through the network and start sending padding traffic down them, you can do, etc. The code is designed to be easily modifiable for testing lots of these things. There is also a link to the FBI’s Tor exploit, which they deployed against visitors to some Tor hidden services last year. They exploited a Mozilla Firefox bug and ran code on the computers of users who were visiting these hidden services, in order to identify them. At this address there is a link to that, including a copy of the shellcode and an analysis of exactly what it was doing. And then of course a list of references, with papers and things. So I’m quite happy to take questions now. applause Herald: Thanks for the nice talk! Do we have any questions from the internet? Signal Angel: One question. It’s very hard to block addresses, since creating them is cheap, and they can be generated for each user and rotated often. So can you think of any other way of doing the blocking? Gareth: That is absolutely true, yes. If you were to block a particular .onion address, they can simply generate another .onion address. So I don’t know of any way to counter that right now. Herald: Another one from the internet? inaudible answer from Signal Angel Okay, then, Microphone 1, please!
Question: Thank you, that’s fascinating research. You mentioned that it is possible to influence the hash of your relay node, in the sense that you could choose which hidden service you are responsible for. Is that right? Gareth: Yeah, correct! Question: So could you elaborate on how this is possible? Gareth: So, e.g., you just keep regenerating the public key for your relay, and you’ll get closer and closer to the point where you’ll be the responsible relay for that particular hidden service. You keep regenerating your identity hash until you’re at that particular point on the circle. That’s not particularly computationally intensive to do. Was that it? Herald: Okay, next question from Microphone 5, please. Question: Hi, I was wondering, for the attacks where you identify a certain number of users using a hidden service: have those attacks been used, is there any evidence of that, and is there any way of protecting against them? Gareth: That’s a very interesting question – is there any way to detect these types of attacks? For some of the attacks, if you’re going to generate particular traffic patterns, one way to do that is to use the padding cells. The padding cells aren’t used at the moment by the official Tor client, so the detection of those could be indicative, but it’s not conclusive evidence on its own. Question: And is there any way of protecting against a government or something trying to denial-of-service hidden services? Gareth: So I… trying to… sorry? Question: Is it possible to protect against this kind of attack? Gareth: Not that I’m aware of. The Tor Project are currently revising how they do the hidden service protocol, which will make, e.g., what I did – enumerating the hidden services – much more difficult, and likewise positioning yourself on the distributed hash table in advance for a particular hidden service.
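The key-grinding Gareth describes here can be sketched as a brute-force search. This is a simplification: real Tor relay identities are RSA keys, and the responsible-relay rule is “the fingerprints closest after the descriptor ID on the circle”, not a literal prefix match – the prefix here just stands in for landing at a chosen position.

```python
import hashlib
import os

# Keep generating random 'keys' until the identity hash starts with the
# target hex prefix -- a stand-in for landing at a chosen DHT position.
def grind_identity(target_prefix, max_tries=1_000_000):
    for _ in range(max_tries):
        key = os.urandom(32)                      # stand-in for a real keypair
        ident = hashlib.sha1(key).hexdigest()
        if ident.startswith(target_prefix):
            return key, ident
    raise RuntimeError("no match within budget")

# A 2-hex-digit prefix needs ~256 tries on average; each extra digit
# multiplies the work by 16. The search only hits one target position,
# which is why you can only aim at one hidden service at a time.
_, ident = grind_identity("ab")
print(ident[:2])  # ab
```

This also shows why the attack scales per-service: grinding toward one point on the circle says nothing about any other point.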
So they are at the moment trying to change the way it’s done, and make some of these things more difficult. Herald: Good. Next question from Microphone 2, please. Mic2: Hi. I run the Tor2Web abuse desk, so I used to see a lot of abuse requests concerning Tor hidden services being exposed on the internet through the Tor2Web.org domain name. And I just wanted to comment on, like you said, the number of abuse requests. I’ve spoken with some of the child protection agencies that reported abuse to Tor2Web.org, and they are effectively using crawlers that periodically look for changes, in order to get new images to put in their database. And what I was able to understand is that the German agency doing that is crawling the same sites that the Italian agencies are crawling, too. So it’s likely that in most countries the child protection agencies are crawling that small number of Tor hidden services that contain child porn. And I also saw from the statistics of Tor2Web that the amount of abuse relating to that kind of content is relatively low. Just as a contribution! Gareth: Yes, that’s very interesting, thank you for that! applause Herald: Next, Microphone 4, please. Mic4: You deanonymised users with an infected or modified Guard relay. Is it required to modify the Guard relay, if I control the entry point of the user to the internet – if I’m his ISP? Gareth: So – if you observe traffic travelling into a Guard relay without controlling the Guard relay itself? Mic4: Yeah. Gareth: In theory, yes. I wouldn’t be able to tell you how reliable that is off the top of my head. Mic4: Thanks! Herald: So, another question from the internet! Signal Angel: Wouldn’t the ability to choose the key hash prefix give the ability to target specific .onions? Gareth: So, you can only target one .onion address at a time, because of the way they are generated. You wouldn’t be able to say, e.g.:
“Pick a key which targets two or more .onion addresses.” You can only target one .onion address at a time by positioning yourself at a particular point on the distributed hash table. Herald: Another one from the internet? … Okay. Then Microphone 3, please. Mic3: Hey. Thanks for this research – I think it strengthens the network. So in the deem (?) I was wondering whether you can donate these relays to be part of the non-malicious relay pool – basically use them as regular relays afterwards? Gareth: Okay, so can I donate the relays a rerun and at the Tor capacity (?)? Unfortunately, as I said, they were run on hardware donated by a student for a fixed period of time, so we’ve given it back to him. We are very grateful to him; he was very generous. In fact, without his contribution it would have been much more difficult to collect as much data as we did. Herald: Good, next, Microphone 5, please! Mic5: Yeah, hi, first of all thanks for your talk. I think you’ve raised some real issues that need to be considered very carefully by everyone on the Tor Project. My question: I’d like to go back to the issue of so many abuse-related web sites running over the Tor network. I think it’s an important issue that really needs to be considered, because we don’t want to be associated with that at the end of the day – anyone who uses Tor, who runs a relay or an exit node. And I understand it’s a bit of a sensitive issue, and you don’t really have any say over whether it’s implemented or not. But I’d like to get your opinion on the implementation of a distributed block/deny system that would run in very much a similar way to the directory authorities. I’d just like to see what you think of that. Gareth: So you’re asking me whether I support a particular blocking mechanism, then? Mic5: I’d like to get your opinion on it.
Gareth laughs I know it’s a sensitive issue, but like I said, I think it needs to be considered, because everyone running exit nodes and relays, and the people of the Tor Project, don’t want to be known for or associated with this massive number of abuse web sites that currently exists within the Tor network. Gareth: I absolutely agree, and I think the Tor Project are horrified as well that this problem exists – they have in fact talked in previous years about having a problem with this type of content. As to what, if anything, is done about it, that’s very much up to them. Could it be done in a distributed fashion? The example I gave was a way in which it could be done by relay operators – that would need the consensus of a large number of relay operators to be effective, so that is done in a distributed fashion. The question is: who supplies the list of .onion addresses to block to each of the relay operators? Clearly, the relay operators aren’t going to collect it themselves. It needs to be supplied by someone like the Tor Project, e.g., or someone trustworthy. Yes, it can be done in a distributed fashion. It can be done in an open fashion. Mic5: Who knows? Gareth: Okay. Mic5: Thank you. Herald: Good. And another question from the internet. Signal Angel: Apparently there’s an option in the Tor client to collect statistics on hidden services. Do you know about this, and how does it relate to your research? Gareth: Yes – the extent to which I know about it is that they’re going to be trying this next month, to try to estimate how many hidden services there are. So keep your eye on the Tor Project web site; I’m sure they’ll be publishing their data in the coming months. Herald: And, sadly, we are running out of time, so this will be the last question. Microphone 4, please!
Mic4: Hi, I’m just wondering if you could outline what ethical clearances you had to get from your university to conduct this kind of research. Gareth: So, we have to discuss these types of things before undertaking any research, and we go through steps to make sure that we’re not, e.g., storing sensitive information about particular people. So yes, we are very mindful of that, and that’s why I made a particular point of putting some of the things to consider on the slides. Mic4: So, like… you outlined a potential implementation of the traffic correlation attack. Are you saying that you performed the attack? Or… Gareth: No, no, no, absolutely not. The link I’m giving… absolutely not. We have not engaged in any… Mic4: It just wasn’t clear from the slides. Gareth: I apologize. So, to be absolutely clear on that: no, we did not engage in any deanonymisation research on the Tor network. The research I showed is linked in the references, which I put at the end of the slides – you can read about it, but it was done in simulation. There’s a way to do simulation of the Tor network on a single computer… I can’t remember the name of the project. Shadow! Yes, it’s a system called Shadow: you can run a large number of Tor relays on a single computer and simulate the traffic between them. If you’re going to do that type of research, then you should use that. Okay, thank you very much, everyone. applause silent postroll titles subtitles created by c3subtitles.de Join, and help us!