WEBVTT 00:00:03.200 --> 00:00:10.040 The world we live in is awash with data that comes pouring in from everywhere around us. 00:00:10.040 --> 00:00:14.520 On its own this data is just noise and confusion. 00:00:14.520 --> 00:00:22.520 To make sense of data, to find the meaning in it, we need the powerful branch of science - statistics. 00:00:22.520 --> 00:00:26.040 Believe me there's nothing boring about statistics. 00:00:26.040 --> 00:00:29.400 Especially not today when we can make the data sing. 00:00:29.400 --> 00:00:33.400 With statistics we can really make sense of the world. 00:00:33.400 --> 00:00:35.040 And there's more. 00:00:35.040 --> 00:00:40.440 With statistics, the data deluge, as it's being called, is leading us 00:00:40.440 --> 00:00:46.240 to an ever greater understanding of life on Earth and the universe beyond. 00:00:46.240 --> 00:00:50.760 And thanks to the incredible power of today's computers, 00:00:50.760 --> 00:00:57.040 it may fundamentally transform the process of scientific discovery. 00:00:57.040 --> 00:01:02.560 I kid you not, statistics is now the sexiest subject around. 00:01:23.000 --> 00:01:25.600 Did you know that there is one million boats in Sweden? 00:01:25.600 --> 00:01:27.960 That's one boat per nine people! 00:01:27.960 --> 00:01:31.080 It's the highest number of boats per person in Europe! 00:01:41.080 --> 00:01:45.760 Being a statistician, you don't like telling your profession at dinner parties. 00:01:45.760 --> 00:01:48.440 But really, statisticians shouldn't be shy 00:01:48.440 --> 00:01:51.320 because everyone wants to understand what's going on. 00:01:51.320 --> 00:01:56.480 And statistics gives us a perspective on the world we live in 00:01:56.480 --> 00:01:59.320 that we can't get in any other way. 00:02:03.520 --> 00:02:09.000 Statistics tells us whether the things we think and believe are actually true. 00:02:19.960 --> 00:02:25.440 And statistics are far more useful than we usually like to admit. 
00:02:25.440 --> 00:02:29.600 In the last recession there was this famous call-in to a talk radio station. 00:02:29.600 --> 00:02:37.280 The man complained, "In times like this when unemployment rates are up to 13%, income has fallen by 5%, 00:02:37.280 --> 00:02:41.360 "and suicide rates are climbing, and I get so angry that the government 00:02:41.360 --> 00:02:45.520 "is wasting money on things like collection of statistics." 00:02:48.240 --> 00:02:50.360 I'm not officially a statistician. 00:02:50.360 --> 00:02:55.280 Strictly speaking, my field is global health. 00:02:58.120 --> 00:03:03.280 But I got really obsessed with stats when I realised how much people 00:03:03.280 --> 00:03:06.240 in Sweden just don't know about the rest of the world. 00:03:06.240 --> 00:03:10.800 I started in our medical university, Karolinska Institutet, 00:03:10.800 --> 00:03:13.960 an undergraduate course called Global Health. 00:03:13.960 --> 00:03:17.360 These students coming to us actually have the highest grade you can get 00:03:17.360 --> 00:03:18.840 in the Swedish college system, 00:03:18.840 --> 00:03:22.040 so I thought, "Maybe they know everything I'm going to teach them." 00:03:22.040 --> 00:03:25.680 So I did a pre-test when they came, and one of the questions 00:03:25.680 --> 00:03:28.160 from which I learned a lot was this one - 00:03:28.160 --> 00:03:32.360 which country has the highest child mortality of these five pairs? 00:03:32.360 --> 00:03:34.920 I won't put you at test here, but it is Turkey 00:03:34.920 --> 00:03:37.000 which is highest there, Poland, 00:03:37.000 --> 00:03:40.760 Russia, Pakistan, and South Africa. 00:03:40.760 --> 00:03:43.080 And these were the result of the Swedish students. 00:03:43.080 --> 00:03:44.760 A 1.8 right answer out of five possible. 00:03:44.760 --> 00:03:49.920 And that means there was a place for a professor of International Health and for my course. 
00:03:49.920 --> 00:03:56.360 But one late night when I was compiling the report, I really realised my discovery. 00:03:56.360 --> 00:04:01.160 I had shown that Swedish top students know statistically 00:04:01.160 --> 00:04:04.480 significantly less about the world than the chimpanzees. 00:04:06.000 --> 00:04:09.840 Because the chimpanzees would score half right. 00:04:09.840 --> 00:04:12.320 If I gave them two bananas with Sri Lanka and Turkey, 00:04:12.320 --> 00:04:15.600 they would be right half of the cases, but the students are not there. 00:04:15.600 --> 00:04:20.200 I did also an unethical study of the professors of the Karolinska Institutet, 00:04:20.200 --> 00:04:25.520 that hands out the Nobel Prize for medicine, and they are on par with the chimpanzees there. 00:04:28.160 --> 00:04:32.680 Today there's more information accessible than ever before. 00:04:32.680 --> 00:04:35.760 'And I work with my team at the Gapminder Foundation 00:04:35.760 --> 00:04:41.600 'using new tools that help everyone make sense of the changing world. 00:04:41.600 --> 00:04:45.320 'We draw on the masses of data that's now freely available 00:04:45.320 --> 00:04:49.720 'from international institutions like the UN and the World Bank. 00:04:49.720 --> 00:04:53.640 'And it's become my mission to share the insights 00:04:53.640 --> 00:05:00.200 'from this data with anyone who'll listen, and to reveal how statistics is nothing to be frightened of.' 00:05:02.440 --> 00:05:05.040 I'm going to provide you a view of 00:05:05.040 --> 00:05:09.000 the global health situation across mankind. 00:05:09.000 --> 00:05:14.160 And I'm going to do that in hopefully an enjoyable way, so relax. 00:05:14.160 --> 00:05:17.120 So we did this software which displays it like this. 00:05:17.120 --> 00:05:19.320 Every bubble here is a country - 00:05:19.320 --> 00:05:21.320 this is China, this is India. 00:05:21.320 --> 00:05:23.560 The size of the bubble is the population. 
00:05:23.560 --> 00:05:27.600 I'm going to stage a race between this sort of yellowish Ford here 00:05:27.600 --> 00:05:32.760 and the red Toyota down there and the brownish Volvo. 00:05:32.760 --> 00:05:36.440 The Toyota has a very bad start down here, and United States, 00:05:36.440 --> 00:05:38.280 Ford is going off-road there, 00:05:38.280 --> 00:05:40.480 and the Volvo is doing quite fine, this is the war. 00:05:40.480 --> 00:05:43.680 The Toyota got off track, now Toyota is on the healthier side of Sweden. 00:05:43.680 --> 00:05:46.800 That's about where I sold the Volvo and bought the Toyota. 00:05:46.800 --> 00:05:47.960 AUDIENCE LAUGH 00:05:47.960 --> 00:05:50.840 This is the Great Leap Forward, when China fell down. 00:05:50.840 --> 00:05:53.080 It was the central planning by Mao Zedong. 00:05:53.080 --> 00:05:56.680 China recovered and said, "Never more stupid central planning," 00:05:56.680 --> 00:05:57.800 but they went up here. 00:05:57.800 --> 00:06:02.560 No, there is one more inequity, look there - United States. 00:06:02.560 --> 00:06:07.480 They broke my frame. Washington DC is so rich over there, 00:06:07.480 --> 00:06:13.040 but they are not as healthy as Kerala in India. It's quite interesting, isn't it? 00:06:13.040 --> 00:06:14.600 LAUGHTER AND APPLAUSE 00:06:20.360 --> 00:06:25.520 Welcome to the USA, world leaders in big cars 00:06:25.520 --> 00:06:28.480 and free data. 00:06:28.480 --> 00:06:35.880 There are many here who share my vision of making public data accessible and useful for everyone. 00:06:35.880 --> 00:06:43.440 The city of San Francisco is in the lead, opening up its data on everything. 00:06:43.440 --> 00:06:47.480 Even the police department is releasing all its crime reports. 00:06:47.480 --> 00:06:50.840 This official crime data has been turned 00:06:50.840 --> 00:06:55.960 into a wonderful interactive map by two of the city's computer whizzes. 00:06:55.960 --> 00:06:58.920 It's community statistics in action. 
00:07:09.400 --> 00:07:13.320 Crimespotting is a map of crime reports from the San Francisco Police Department 00:07:13.320 --> 00:07:16.120 showing dots on maps for citizens to be able to see 00:07:16.120 --> 00:07:19.320 patterns of crime around their neighbourhoods in San Francisco. 00:07:19.320 --> 00:07:25.080 The map is not just about individual crimes but about broader patterns that show you where crime is 00:07:25.080 --> 00:07:27.760 clustered around the city, which areas have high crime, 00:07:27.760 --> 00:07:30.320 and which areas have relatively low crime. 00:07:36.840 --> 00:07:41.440 We're here at the top of Jones Street on Nob Hill... 00:07:42.960 --> 00:07:45.280 ..quite a nice neighbourhood. 00:07:45.280 --> 00:07:49.600 What the crime maps show us is the relationship between 00:07:49.600 --> 00:07:51.360 topography and crime. 00:07:51.360 --> 00:07:54.520 Basically the higher up the hill, the less crime there is. 00:07:56.200 --> 00:07:58.640 You cross over the border 00:07:58.640 --> 00:08:00.240 into the flats... 00:08:02.800 --> 00:08:09.240 Essentially as soon as you get into the lower lying areas of Jones Street the crime just skyrockets. 00:08:20.240 --> 00:08:24.160 We're here in the uptown Tenderloin district. 00:08:26.040 --> 00:08:30.320 It's one of the oldest and densest neighbourhoods in San Francisco. 00:08:30.320 --> 00:08:32.400 This is where you go to buy drugs. 00:08:32.400 --> 00:08:33.919 Right around here. 00:08:37.200 --> 00:08:41.640 We see lots of aggravated assaults, lots of auto thefts. 00:08:41.640 --> 00:08:48.520 Basically a huge part of the crime that happens in the city happens in this five or six block radius. 00:08:55.640 --> 00:08:58.920 If you've been hearing police sirens in your neighbourhood, 00:08:58.920 --> 00:09:02.000 you can use the map to find out why. 
00:09:02.000 --> 00:09:05.680 If you're out at night in an unfamiliar part of town, 00:09:05.680 --> 00:09:09.240 you can check the map for streets to avoid. 00:09:09.240 --> 00:09:12.400 If a neighbour gets burgled, you can see - 00:09:12.400 --> 00:09:16.520 is it a one-off or has there been a spike in local crime? 00:09:16.520 --> 00:09:19.480 If you commute through a neighbourhood and you're worried 00:09:19.480 --> 00:09:23.080 about its safety, the fact that we have the ability to turn off all 00:09:23.080 --> 00:09:25.360 the night-time and middle-of-the-day crimes 00:09:25.360 --> 00:09:28.280 and show you just the things that are happening during the commute, 00:09:28.280 --> 00:09:32.880 it is a statistical operation. But I think to people that are interacting with the thing 00:09:32.880 --> 00:09:38.000 it feels very much more like they're just sort of browsing a website or shopping on Amazon. 00:09:38.000 --> 00:09:43.520 They're looking at data and they don't realise they're doing statistics. 00:09:43.520 --> 00:09:47.840 What's most exciting for me is that public statistics 00:09:47.840 --> 00:09:52.640 is making citizens more powerful and the authorities more accountable. 00:10:02.360 --> 00:10:04.760 We have community meetings that the police attend 00:10:04.760 --> 00:10:08.880 and what citizens are now doing are bringing printouts 00:10:08.880 --> 00:10:12.240 of the maps that show where crimes are taking place, 00:10:12.240 --> 00:10:16.120 and they're demanding services from the police department 00:10:16.120 --> 00:10:20.520 and the police department is now having to change how they police, 00:10:20.520 --> 00:10:22.960 how they provide policing services, 00:10:22.960 --> 00:10:27.040 because the data is showing what is working and what is not. 00:10:28.560 --> 00:10:31.960 People in San Francisco are also using public data 00:10:31.960 --> 00:10:35.800 to map social inequalities and see how to improve society. 
00:10:35.800 --> 00:10:39.720 And the possibilities are endless. 00:10:39.720 --> 00:10:43.160 I think our dream government data analysis project 00:10:43.160 --> 00:10:46.240 would really be focused on live information, 00:10:46.240 --> 00:10:51.240 on stuff that was being reported and pushed out to the world over the internet as it was happening. 00:10:51.240 --> 00:10:55.040 You know, trash pickups, traffic accidents, buses, 00:10:55.040 --> 00:10:57.680 and I think through the kind of stats-gathering power 00:10:57.680 --> 00:11:02.520 of the internet it's possible to really begin to see the workings of the city 00:11:02.520 --> 00:11:04.760 displayed as a unified interface. 00:11:07.320 --> 00:11:09.960 So that's where we are heading. 00:11:09.960 --> 00:11:14.760 Towards a world of free data with all the statistical insights that come from it, 00:11:14.760 --> 00:11:21.800 accessible to everyone, empowering us as citizens and letting us hold our rulers to account. 00:11:21.800 --> 00:11:26.920 It's a long way from where statistics began. 00:11:26.920 --> 00:11:32.880 Statistics are essential to us to monitor our governments and our societies. 00:11:32.880 --> 00:11:36.760 But it was our rulers up there who started 00:11:36.760 --> 00:11:40.840 the collection of statistics in the first place in order to monitor us! 00:11:46.880 --> 00:11:51.440 In fact the word 'statistics' comes from 'the state'. 00:11:51.440 --> 00:11:55.600 Modern statistics began two centuries ago. 00:11:55.600 --> 00:11:59.080 Once it got going, it spread and never stopped. 00:11:59.080 --> 00:12:01.640 And guess who was first! 00:12:03.280 --> 00:12:07.560 The Chinese have Confucius, the Italians have da Vinci, 00:12:07.560 --> 00:12:10.240 and the British have Shakespeare. 00:12:10.240 --> 00:12:12.440 And we have the Tabellverket - 00:12:12.440 --> 00:12:16.400 the first ever systematic collection of statistics! 
00:12:16.400 --> 00:12:21.640 Since the year 1749 we have collected data 00:12:21.640 --> 00:12:26.920 on every birth, marriage and death, and we are proud of it! 00:12:29.120 --> 00:12:32.000 The Tabellverket recorded information 00:12:32.000 --> 00:12:34.040 from every parish in Sweden. 00:12:34.040 --> 00:12:39.080 It was a huge quantity of data and it was the first time any government 00:12:39.080 --> 00:12:41.800 could get an accurate picture of its people. 00:12:49.360 --> 00:12:53.360 Sweden had been the greatest military power in Northern Europe, 00:12:53.360 --> 00:12:58.200 but by 1749 our star was really fading 00:12:58.200 --> 00:13:00.920 and other countries were growing stronger. 00:13:00.920 --> 00:13:03.600 At least we were a large power, 00:13:03.600 --> 00:13:09.960 thought to have 20 million people, enough to rival Britain and France. 00:13:13.400 --> 00:13:18.160 But we were in for a nasty surprise. 00:13:18.160 --> 00:13:20.680 The first analysis of the Tabellverket 00:13:20.680 --> 00:13:24.080 revealed that Sweden only had two million inhabitants. 00:13:24.080 --> 00:13:30.680 Sweden was not just a power in decline, it also had a very small population. 00:13:30.680 --> 00:13:36.080 The government was horrified by this finding - what if the enemy found out? 00:13:37.840 --> 00:13:44.560 But the Tabellverket also showed that many women died in childbirth and many children died young. 00:13:44.560 --> 00:13:48.640 So government took action to improve the health of the people. 00:13:48.640 --> 00:13:52.440 This was the beginning of modern Sweden. 00:13:53.960 --> 00:13:59.000 It took more than 50 years before the Austrians, Belgians, Danes, 00:13:59.000 --> 00:14:02.320 Dutch, French, Germans, Italians 00:14:02.320 --> 00:14:08.600 and, finally, the British, caught up with Sweden in collecting and using statistics. 00:14:24.640 --> 00:14:29.640 It was called political arithmetic. It was a lovely phrase that was used for statistics. 
00:14:29.640 --> 00:14:33.160 Governments could have much more control and understanding of 00:14:33.160 --> 00:14:36.840 the society - how it was working, how it was developing 00:14:36.840 --> 00:14:40.240 and essentially so they could control it better. 00:14:43.360 --> 00:14:47.960 It wasn't just governments who woke up to the power of statistics. 00:14:47.960 --> 00:14:54.600 Right across Europe, 19th century society went mad for facts. 00:14:54.600 --> 00:14:57.600 And, despite its late start, Britain, 00:14:57.600 --> 00:15:01.400 with its Royal Statistical Society in London, 00:15:01.400 --> 00:15:04.000 was soon a statisticians' nirvana. 00:15:05.920 --> 00:15:09.960 I love looking at old copies of the Royal Statistical Society journal 00:15:09.960 --> 00:15:11.760 because it's full of such odd stuff. 00:15:11.760 --> 00:15:14.840 There's a wonderful paper from the 1840s 00:15:14.840 --> 00:15:19.200 which shows a map of England and the rates of bastardy in each county. 00:15:19.200 --> 00:15:23.560 So you can identify very quickly the areas with high rates of bastardy. 00:15:23.560 --> 00:15:27.240 Being in East Anglia it always makes me slightly laugh that Norfolk 00:15:27.240 --> 00:15:30.720 seems to top the "bastardy league" in the 1840s. 00:15:30.720 --> 00:15:36.800 One of the founders of the Royal Statistical Society 00:15:36.800 --> 00:15:42.120 was the great Victorian mathematician and inventor Charles Babbage. 00:15:42.120 --> 00:15:50.120 In 1842 he read the latest poem by an equally great Victorian, Alfred Tennyson. 00:15:50.120 --> 00:15:53.120 Vision of Sin contained the lines: 00:15:53.120 --> 00:15:55.800 "Fill the cup, and fill the can 00:15:55.800 --> 00:15:58.160 "Have a rouse before the morn 00:15:58.160 --> 00:16:03.720 "Every moment dies a man Every moment one is born." 00:16:03.720 --> 00:16:07.360 So keen a statistician was Babbage that he could not contain himself. 
00:16:07.360 --> 00:16:09.360 He dashed off a letter to Tennyson 00:16:09.360 --> 00:16:12.200 explaining that because of population growth, 00:16:12.200 --> 00:16:13.640 the line should read, 00:16:13.640 --> 00:16:18.640 "Every moment dies a man and one and a 16th is born." 00:16:18.640 --> 00:16:22.480 I may add that the exact figure is 1.067, 00:16:22.480 --> 00:16:27.200 but something must be conceded to the laws of metre. 00:16:31.840 --> 00:16:36.640 In the 19th century, scholars all over Europe did amazing work 00:16:36.640 --> 00:16:39.000 in measuring their societies. 00:16:39.000 --> 00:16:42.600 They were hoovering up data on almost everything. 00:16:42.600 --> 00:16:46.040 But numbers alone don't tell you anything. 00:16:46.040 --> 00:16:51.320 You have to analyse them, and that's what makes statistics. 00:16:55.760 --> 00:16:59.200 When the first statisticians began to get to grips with 00:16:59.200 --> 00:17:00.400 analysing their data 00:17:00.400 --> 00:17:05.760 they seized upon the average, and they took the average of everything. 00:17:09.720 --> 00:17:13.760 What's so great about an average is that 00:17:13.760 --> 00:17:18.640 you can take a whole mass of data and reduce it to a single number. 00:17:21.880 --> 00:17:26.119 And though each of us is unique, our collective lives produce 00:17:26.119 --> 00:17:29.880 averages that can characterise whole populations. 00:17:41.280 --> 00:17:45.360 I looked in my local newspaper one week and saw a pensioner 00:17:45.360 --> 00:17:49.440 had accidentally put her foot on the accelerator 00:17:49.440 --> 00:17:52.560 and crushed her friend against a wall. 00:17:52.560 --> 00:17:56.360 Devastating, hideous, horrible thing to happen. 
00:17:56.360 --> 00:18:01.400 And then there was a second one about a young man who didn't have 00:18:01.400 --> 00:18:07.040 a driving licence, was driving a car under the influence of drugs and alcohol 00:18:07.040 --> 00:18:10.320 and he bashed into a pedestrian and killed him. 00:18:10.320 --> 00:18:15.560 What's remarkable, absolutely remarkable, if you look at the number 00:18:15.560 --> 00:18:22.880 of people who die each year in traffic crashes, it's nearly a constant. 00:18:22.880 --> 00:18:24.480 What? 00:18:24.480 --> 00:18:31.680 All these individual events, somehow when you sum them all up there's the same number every year. 00:18:31.680 --> 00:18:35.080 And every year, two and a half times as many men 00:18:35.080 --> 00:18:38.880 die in traffic crashes as women, and it's a constant. 00:18:38.880 --> 00:18:44.320 And every year the rate in Belgium is double the rate in England. 00:18:44.320 --> 00:18:47.160 There are these remarkable regularities. 00:18:47.160 --> 00:18:54.800 So that these individual particular events sum up into a social phenomenon. 00:18:56.560 --> 00:18:58.120 Let's see what Sweden have done. 00:18:58.120 --> 00:19:01.560 We used to boast about fast social progress, that's where we were.... 00:19:01.560 --> 00:19:05.240 'In my lectures, to tell stories about the changing world, 00:19:05.240 --> 00:19:08.120 'I use the averages from entire countries, 00:19:08.120 --> 00:19:12.160 'whether the average of income, child mortality, family size 00:19:12.160 --> 00:19:13.360 'or carbon output.' 00:19:13.360 --> 00:19:16.200 OK, I give you Singapore. The year I was born, 00:19:16.200 --> 00:19:20.560 Singapore had twice the child mortality of Sweden, the most tropical country in the world, 00:19:20.560 --> 00:19:22.920 a marshland on the Equator, and here we go. 
00:19:22.920 --> 00:19:25.160 It took a little time for them to get independent, 00:19:25.160 --> 00:19:27.160 but then they started to grow their economy, 00:19:27.160 --> 00:19:29.840 and they made the social investment, they got away malaria, 00:19:29.840 --> 00:19:33.360 they got a magnificent health system that beat both US and Sweden. 00:19:33.360 --> 00:19:37.600 We never thought it would happen that they would win over Sweden! 00:19:37.600 --> 00:19:40.520 LAUGHTER AND APPLAUSE 00:19:40.520 --> 00:19:46.400 But useful as averages are, they don't tell you the whole story. 00:19:48.800 --> 00:19:53.040 On average, Swedish people have slightly less than two legs. 00:19:53.040 --> 00:19:57.560 This is because few people only have one leg or no legs, 00:19:57.560 --> 00:19:59.760 and no-one has three legs. 00:19:59.760 --> 00:20:06.240 So almost everybody in Sweden has more than the average number of legs. 00:20:06.240 --> 00:20:10.840 The variation in data is just as important as the average. 00:20:16.800 --> 00:20:19.400 But how do you get a handle on variation? 00:20:19.400 --> 00:20:23.000 For this, you transform numbers into shapes. 00:20:23.000 --> 00:20:26.320 Let's look again at the number of adult women in Sweden 00:20:26.320 --> 00:20:27.800 for different heights. 00:20:27.800 --> 00:20:31.800 Plotting the data as a shape shows how much their heights 00:20:31.800 --> 00:20:36.400 vary from the average and how wide that variation is. 00:20:36.400 --> 00:20:41.520 The shape a set of data makes is called its distribution. 00:20:41.520 --> 00:20:46.080 This is the income distribution of China, 1970. 00:20:46.080 --> 00:20:51.000 This is the income distribution of the United States, 1970. 00:20:51.000 --> 00:20:54.080 Almost no overlap, and what has happened? 00:20:54.080 --> 00:20:56.880 China is growing, it's not so equal any longer, 00:20:56.880 --> 00:21:01.120 and it's appearing here overlooking the United States. 
00:21:01.120 --> 00:21:03.480 Almost like a ghost, isn't it? 00:21:03.480 --> 00:21:05.160 It's pretty scary. 00:21:05.160 --> 00:21:06.680 Rrrr! 00:21:06.680 --> 00:21:08.200 LAUGHTER 00:21:17.160 --> 00:21:21.280 The statisticians who first explored distribution 00:21:21.280 --> 00:21:25.760 discovered one shape that turned up again and again. 00:21:25.760 --> 00:21:28.120 The Victorian scholar Francis Galton 00:21:28.120 --> 00:21:32.400 was so fascinated he built a machine that could reproduce it, 00:21:32.400 --> 00:21:36.080 and he found it fitted so many different sets of measurements 00:21:36.080 --> 00:21:38.640 that he named it the normal distribution. 00:21:38.640 --> 00:21:45.600 Whether it was people's arm spans, lung capacities, 00:21:45.600 --> 00:21:47.400 or even their exam results, 00:21:47.400 --> 00:21:51.360 the normal distribution shape recurred time and time again. 00:21:51.360 --> 00:21:56.360 Other statisticians soon found many other regular shapes, 00:21:56.360 --> 00:22:01.360 each produced by particular kinds of natural or social processes. 00:22:01.360 --> 00:22:05.400 And every statistician has their favourite. 00:22:05.400 --> 00:22:09.280 The Poisson distribution, the Poisson shape is my favourite distribution. 00:22:09.280 --> 00:22:11.120 I think it's an absolute cracker. 00:22:15.760 --> 00:22:18.720 The Poisson shape describes how likely it is 00:22:18.720 --> 00:22:21.680 that out-of-the-ordinary things will happen. 00:22:21.680 --> 00:22:24.520 Imagine a London bus stop where we know that on average 00:22:24.520 --> 00:22:26.280 we'll get three buses in an hour. 00:22:26.280 --> 00:22:29.280 We won't always get three buses, of course. 00:22:29.280 --> 00:22:33.480 Amazingly, the Poisson shape will show us the probability 00:22:33.480 --> 00:22:37.200 that in any given hour we will get four, five, or six buses, 00:22:37.200 --> 00:22:39.440 or no buses at all. 
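The bus-stop example can be worked through numerically. What follows is only a minimal sketch, assuming buses arrive independently of one another at a steady average of three per hour (the usual Poisson assumptions, which real timetabled buses may not satisfy). The function name `poisson_pmf` and the constant `MEAN_BUSES_PER_HOUR` are illustrative and do not come from the programme.

```python
import math


def poisson_pmf(k: int, mean: float) -> float:
    """Probability of seeing exactly k events when the average count is `mean`."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


# Average number of buses per hour, taken from the bus-stop example.
MEAN_BUSES_PER_HOUR = 3

# Probability of seeing 0, 1, ..., 6 buses in any given hour.
for k in range(7):
    probability = poisson_pmf(k, MEAN_BUSES_PER_HOUR)
    print(f"P({k} buses in an hour) = {probability:.3f}")
```

Under these assumptions, the chance of no buses at all in a given hour is e^-3, roughly 5%. Changing `MEAN_BUSES_PER_HOUR` changes the shape of the whole distribution.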
00:22:40.720 --> 00:22:43.480 The exact shape changes with the average. 00:22:43.480 --> 00:22:46.920 But whether it's how many people will win the lottery jackpot 00:22:46.920 --> 00:22:48.000 each week, 00:22:48.000 --> 00:22:51.200 or how many people will phone a call centre each minute, 00:22:51.200 --> 00:22:54.120 the Poisson shape will give the probabilities. 00:22:57.240 --> 00:23:01.240 The wonderful example where this was applied to in the late 19th century 00:23:01.240 --> 00:23:04.400 was to count each year the number of Prussian officers, 00:23:04.400 --> 00:23:07.520 cavalry officers, who were kicked to death by their horses. 00:23:07.520 --> 00:23:10.240 Now, some years there were none, some years there were one, 00:23:10.240 --> 00:23:13.880 some years there were two, up to seven, I think, one particularly bad year. 00:23:13.880 --> 00:23:16.680 But with this distribution, however many years there were 00:23:16.680 --> 00:23:19.640 with nought, one, two, three, four Prussian cavalry officers 00:23:19.640 --> 00:23:23.880 kicked to death by their horses, beautifully obeyed the Poisson distribution. 00:23:42.800 --> 00:23:48.520 So statisticians use shapes to reveal the patterns in the data. 00:23:48.520 --> 00:23:51.000 But we also use images of all kinds 00:23:51.000 --> 00:23:54.480 to communicate statistics to a wider public. 00:23:54.480 --> 00:23:57.320 Because if the story in the numbers 00:23:57.320 --> 00:24:02.920 is told by a beautiful and clever image, then everyone understands. 00:24:02.920 --> 00:24:09.640 Of the pioneers of statistical graphics, my favourite is Florence Nightingale. 00:24:24.280 --> 00:24:27.120 There are not many people who realise that she was known 00:24:27.120 --> 00:24:30.520 as a passionate statistician and not just the Lady of the Lamp. 00:24:30.520 --> 00:24:34.720 She said that "to understand God's thoughts, we must study statistics, 00:24:34.720 --> 00:24:37.080 "for these are the measure of His purpose." 
00:24:37.080 --> 00:24:40.520 Statistics was for her a religious duty and moral imperative. 00:24:42.080 --> 00:24:45.360 When Florence was nine years old she started collecting data. 00:24:45.360 --> 00:24:48.320 Her data was different fruits and vegetables she found. 00:24:48.320 --> 00:24:50.080 Put them into different tables. 00:24:50.080 --> 00:24:52.640 Trying to organise them in some standard form. 00:24:52.640 --> 00:24:55.640 And so we have one of Nightingale's first statistical tables 00:24:55.640 --> 00:24:57.440 at the age of nine. 00:25:04.360 --> 00:25:11.440 In the mid 1850s Florence Nightingale went to the Crimea to care for British casualties of war. 00:25:11.440 --> 00:25:14.400 She was horrified by what she discovered. 00:25:14.400 --> 00:25:19.920 For all the soldiers being blown to bits on the battlefield, there were many, many more soldiers 00:25:19.920 --> 00:25:25.200 dying from diseases they caught in the army's filthy hospitals. 00:25:25.200 --> 00:25:29.120 So Florence Nightingale began counting the dead. 00:25:29.120 --> 00:25:34.920 For two years she recorded mortality data in meticulous detail. 00:25:34.920 --> 00:25:39.120 When the war was over she persuaded the government to set up 00:25:39.120 --> 00:25:41.360 a Royal Commission of Inquiry, 00:25:41.360 --> 00:25:44.680 and gathered her data in a devastating report. 00:25:44.680 --> 00:25:48.480 What has cemented her place in the statistical history books 00:25:48.480 --> 00:25:50.120 are the graphics she used. 00:25:50.120 --> 00:25:53.960 And one in particular, the polar area graph. 00:25:53.960 --> 00:25:58.680 For each month of the war, a huge blue wedge represented 00:25:58.680 --> 00:26:02.200 the soldiers who had died from preventable diseases. 00:26:02.200 --> 00:26:05.560 The much smaller red wedges were deaths from wounds, 00:26:05.560 --> 00:26:10.600 and the black wedges were deaths from accidents and other causes. 
00:26:10.600 --> 00:26:17.040 Nightingale's graphics were so clear they were impossible to ignore. 00:26:17.040 --> 00:26:19.360 The usual thing around Florence Nightingale's time 00:26:19.360 --> 00:26:23.920 was just to produce tables and tables of figures - absolutely really tedious stuff that, 00:26:23.920 --> 00:26:26.320 unless you're an absolutely dedicated statistician, 00:26:26.320 --> 00:26:29.240 it's really quite difficult to spot the patterns quite naturally. 00:26:29.240 --> 00:26:33.480 But visualisations, they tell a story, they tell a story immediately. 00:26:33.480 --> 00:26:38.480 And the use of colour and the use of shape can really tell a powerful story. 00:26:38.480 --> 00:26:41.280 And nowadays of course we can make things move as well. 00:26:41.280 --> 00:26:44.320 Florence Nightingale would have loved to have played with... 00:26:44.320 --> 00:26:48.800 She would have produced wonderful animations, I'm absolutely certain of it. 00:26:50.880 --> 00:26:54.800 Today, 150 years on, Nightingale's graphics 00:26:54.800 --> 00:26:57.800 are rightly regarded as a classic. 00:26:57.800 --> 00:27:00.600 They led to a revolution in nursing, health care 00:27:00.600 --> 00:27:05.880 and hygiene in hospitals worldwide, which saved innumerable lives. 00:27:07.400 --> 00:27:11.040 And statistical graphics has become an art form of its very own, 00:27:11.040 --> 00:27:16.280 led by designers who are passionate about visualising data. 00:27:24.640 --> 00:27:27.120 This is the Billion Pound-O-Gram. 00:27:27.120 --> 00:27:29.120 This image arose out of frustration 00:27:29.120 --> 00:27:32.280 with the reporting of billion pound amounts in the media. 00:27:32.280 --> 00:27:34.400 £500 billion for this war. 00:27:34.400 --> 00:27:36.000 £50 billion for this oil spill. 00:27:36.000 --> 00:27:39.440 It doesn't make sense - the numbers are too enormous to get your mind round. 
00:27:39.440 --> 00:27:43.520 So I scraped all this data from various news sources and created this diagram. 00:27:43.520 --> 00:27:48.680 So the squares here are scaled according to the billion pound amounts. 00:27:48.680 --> 00:27:51.840 When you see numbers visualised like this 00:27:51.840 --> 00:27:54.240 you start to have a different relationship with them. 00:27:54.240 --> 00:27:56.840 You can start to see the patterns, and the scale of them. 00:27:56.840 --> 00:27:59.600 Here in the corner, this little square - £37 billion. 00:27:59.600 --> 00:28:02.800 This was the predicted cost of the Iraq war in 2003. 00:28:02.800 --> 00:28:06.480 As you can see it's grown exponentially over the last few years 00:28:06.480 --> 00:28:10.560 and the total cost now is around about £2,500 billion. 00:28:10.560 --> 00:28:13.000 It's funny because when you visualise statistics 00:28:13.000 --> 00:28:15.360 you understand them, and when you understand them 00:28:15.360 --> 00:28:18.400 you can really start to put things in perspective. 00:28:23.960 --> 00:28:27.880 Visualisation is right at the heart of my own work too. 00:28:27.880 --> 00:28:30.160 I teach global health. 00:28:30.160 --> 00:28:33.840 And I know having the data is not enough - 00:28:33.840 --> 00:28:39.160 I have to show it in ways people both enjoy and understand. 00:28:39.160 --> 00:28:42.960 Now I'm going to try something I've never done before. 00:28:42.960 --> 00:28:45.960 Animating the data in real space, 00:28:45.960 --> 00:28:50.480 with a bit of technical assistance from the crew. 00:28:50.480 --> 00:28:52.240 So here we go. 00:28:52.240 --> 00:28:54.200 First, an axis for health. 00:28:54.200 --> 00:28:58.920 Life expectancy from 25 years to 75 years. 00:28:58.920 --> 00:29:01.440 And down here an axis for wealth. 00:29:01.440 --> 00:29:06.720 Income per person - 400, 4,000, 40,000. 00:29:06.720 --> 00:29:10.480 So down here is poor and sick. 
00:29:10.480 --> 00:29:14.280 And up here is rich and healthy. 00:29:14.280 --> 00:29:18.320 Now I'm going to show you the world 00:29:18.320 --> 00:29:21.080 200 years ago, in 1810. 00:29:21.080 --> 00:29:22.880 Here come all the countries. 00:29:22.880 --> 00:29:26.200 Europe, brown; Asia, red; Middle East, green; 00:29:26.200 --> 00:29:29.440 Africa south of the Sahara, blue; and the Americas, yellow. 00:29:29.440 --> 00:29:33.760 And the size of the country bubble shows the size of the population. 00:29:33.760 --> 00:29:37.560 In 1810, it was pretty crowded down there, wasn't it? 00:29:37.560 --> 00:29:39.760 All countries were sick and poor. 00:29:39.760 --> 00:29:43.360 Life expectancy was below 40 in all countries. 00:29:43.360 --> 00:29:48.680 And only UK and the Netherlands were slightly better off. But not much. 00:29:48.680 --> 00:29:52.520 And now I start the world. 00:29:52.520 --> 00:29:56.840 The industrial revolution makes countries in Europe and elsewhere 00:29:56.840 --> 00:29:59.040 move away from the rest. 00:29:59.040 --> 00:30:02.280 But the colonized countries in Asia and Africa, 00:30:02.280 --> 00:30:04.040 they are stuck down there. 00:30:04.040 --> 00:30:08.200 And eventually the Western countries get healthier and healthier. 00:30:08.200 --> 00:30:13.320 And now we slow down to show the impact of the First World War 00:30:13.320 --> 00:30:15.880 and the Spanish flu epidemic. 00:30:15.880 --> 00:30:18.320 What a catastrophe! 00:30:18.320 --> 00:30:22.640 And now I speed up through the 1920s and the 1930s and, 00:30:22.640 --> 00:30:24.400 in spite of the Great Depression, 00:30:24.400 --> 00:30:27.800 Western countries forge on towards greater wealth and health. 00:30:27.800 --> 00:30:29.880 Japan and some others try to follow. 00:30:29.880 --> 00:30:32.560 But most countries stay down here. 
00:30:32.560 --> 00:30:35.640 And after the tragedies of the Second World War, 00:30:35.640 --> 00:30:39.400 we stop a bit to look at the world in 1948. 00:30:39.400 --> 00:30:42.080 1948 was a great year. 00:30:42.080 --> 00:30:43.280 The war was over, 00:30:43.280 --> 00:30:48.000 Sweden topped the medal table at the Winter Olympics and I was born. 00:30:48.000 --> 00:30:51.280 But the differences between the countries of the world 00:30:51.280 --> 00:30:52.680 was wider than ever. 00:30:52.680 --> 00:30:54.960 United States was in the front. 00:30:54.960 --> 00:30:56.840 Japan was catching up. 00:30:56.840 --> 00:30:58.400 Brazil was way behind, 00:30:58.400 --> 00:31:03.040 Iran was getting a little richer from oil but still had short lives. 00:31:03.040 --> 00:31:05.160 And the Asian giants... 00:31:05.160 --> 00:31:08.720 China, India, Pakistan, Bangladesh, and Indonesia, 00:31:08.720 --> 00:31:11.360 they were still poor and sick down here. 00:31:11.360 --> 00:31:14.360 But look what was about to happen! Here we go again. 00:31:14.360 --> 00:31:18.640 In my lifetime, former colonies gained independence and then finally 00:31:18.640 --> 00:31:22.640 they started to get healthier and healthier and healthier. 00:31:22.640 --> 00:31:26.080 And in the 1970s, then countries in Asia and Latin America 00:31:26.080 --> 00:31:28.960 started to catch up with the Western countries. 00:31:28.960 --> 00:31:31.240 They became the emerging economies. 00:31:31.240 --> 00:31:32.640 Some in Africa follows, 00:31:32.640 --> 00:31:36.440 some Africans were stuck in civil war, and others were hit by HIV. 00:31:36.440 --> 00:31:41.840 And now we can see the world in the most up-to-date statistics. 00:31:42.840 --> 00:31:45.480 Most people today live in the middle. 00:31:45.480 --> 00:31:48.080 But there is huge difference at the same time 00:31:48.080 --> 00:31:51.520 between the best-off countries and the worst-off countries. 
00:31:51.520 --> 00:31:54.520 And there are also huge inequalities within countries. 00:31:54.520 --> 00:31:59.000 These bubbles show country averages but I can split them. 00:31:59.000 --> 00:32:02.120 Take China. I can split it into provinces. 00:32:02.120 --> 00:32:05.120 There goes Shanghai... 00:32:05.120 --> 00:32:08.000 It has the same health and wealth as Italy today. 00:32:08.000 --> 00:32:11.240 And there is the poor inland province Guizhou, 00:32:11.240 --> 00:32:12.680 it is like Pakistan. 00:32:12.680 --> 00:32:18.800 And if I split it further, the rural parts are like Ghana in Africa. 00:32:19.800 --> 00:32:23.160 And yet, despite the enormous disparities today, 00:32:23.160 --> 00:32:27.240 we have seen 200 years of remarkable progress! 00:32:27.240 --> 00:32:31.720 That huge historical gap between the west and the rest is now closing. 00:32:31.720 --> 00:32:35.640 We have become an entirely new, converging world. 00:32:35.640 --> 00:32:37.960 And I see a clear trend into the future. 00:32:37.960 --> 00:32:40.840 With aid, trade, green technology and peace, 00:32:40.840 --> 00:32:43.720 it's fully possible that everyone can make it 00:32:43.720 --> 00:32:45.640 to the healthy, wealthy corner. 00:32:48.000 --> 00:32:51.360 Well, what you've just seen in the last few minutes 00:32:51.360 --> 00:32:56.520 is a story of 200 countries shown over 200 years and beyond. 00:32:56.520 --> 00:33:00.960 It involved plotting 120,000 numbers. 00:33:00.960 --> 00:33:02.560 Pretty neat, huh? 00:33:07.960 --> 00:33:13.120 So, with statistics, we can begin to see things as they really are. 00:33:13.120 --> 00:33:18.200 From tables of data to averages, distributions and visualisations, 00:33:18.200 --> 00:33:22.640 statistics gives us a clear description of the world. 
00:33:22.640 --> 00:33:28.200 But, with statistics, we can not only discover WHAT is happening 00:33:28.200 --> 00:33:30.520 but also explore WHY, 00:33:30.520 --> 00:33:34.480 by using the powerful analytical method - correlation. 00:33:35.480 --> 00:33:38.400 Just looking at one thing at a time doesn't tell you very much. 00:33:38.400 --> 00:33:41.280 You've got to look at the relationships between things, 00:33:41.280 --> 00:33:43.360 how they change, how they vary together. 00:33:43.360 --> 00:33:45.360 That's what correlation is about. 00:33:45.360 --> 00:33:48.320 That's how you start trying to understand the processes 00:33:48.320 --> 00:33:50.960 that are really going on in the world and society. 00:33:52.480 --> 00:33:57.000 Most of us today would recognise that crime correlates to poverty, 00:33:57.000 --> 00:34:00.200 that infection correlates to poor sanitation, 00:34:00.200 --> 00:34:02.600 and that knowledge of statistics correlates 00:34:02.600 --> 00:34:05.040 to being great at dancing! 00:34:06.560 --> 00:34:10.199 Correlations can be very tricky. 00:34:10.199 --> 00:34:12.960 I got a joke about silly correlations. 00:34:12.960 --> 00:34:15.840 There was this American who was afraid of heart attack. 00:34:15.840 --> 00:34:19.920 He found out that the Japanese ate very little fat 00:34:19.920 --> 00:34:22.320 and almost didn't drink wine, 00:34:22.320 --> 00:34:25.520 but they had much less heart attacks than the Americans. 00:34:25.520 --> 00:34:28.639 But, on the other hand, he also found out that the French 00:34:28.639 --> 00:34:35.080 eat as much fat as the Americans and they drink much more wine but they also have less heart attacks. 00:34:35.080 --> 00:34:40.840 So he concluded that what kills you is speaking English. 00:34:40.840 --> 00:34:43.920 # Smoke, smoke, smoke that cigarette 00:34:43.920 --> 00:34:48.000 # Puff, puff, puff and if you smoke yourself to death... # 00:34:48.000 --> 00:34:51.920 The time, the pace, the cigarette. 
Weights Tilt. 00:34:51.920 --> 00:34:56.199 The best example of a really ground-breaking correlation 00:34:56.199 --> 00:35:01.640 is the link that was established in the 1950s between smoking and lung cancer. 00:35:01.640 --> 00:35:07.040 Not long after the Second World War, a British doctor, Richard Doll, 00:35:07.040 --> 00:35:11.040 investigated lung cancer patients in 20 London hospitals. 00:35:11.040 --> 00:35:15.400 And he became certain that the only thing they had in common was smoking. 00:35:15.400 --> 00:35:18.280 So certain, that he stopped smoking himself. 00:35:18.280 --> 00:35:22.160 But other people weren't so sure. 00:35:22.160 --> 00:35:25.400 A lot of the discussion of the early data, 00:35:25.400 --> 00:35:29.120 linking smoking to lung cancer, said, "It's not the smoking, surely, 00:35:29.120 --> 00:35:32.600 "that thing we've done all our lives, that can't be bad for you. 00:35:32.600 --> 00:35:35.000 "Maybe it's genes. 00:35:35.000 --> 00:35:39.080 "Maybe people who are genetically predisposed to get lung cancer 00:35:39.080 --> 00:35:43.840 "are also genetically predisposed to smoke." 00:35:43.840 --> 00:35:47.360 "Maybe it's not the smoking, maybe it's air pollution - 00:35:47.360 --> 00:35:52.520 "that smokers are somehow more exposed to air pollution than non-smokers. 00:35:52.520 --> 00:35:56.280 "Maybe it's not smoking, maybe it's poverty." 00:35:56.280 --> 00:36:00.720 So now we've got three alternative explanations, apart from chance. 00:36:02.240 --> 00:36:06.760 To verify his correlation did imply cause and effect, 00:36:06.760 --> 00:36:10.680 Richard Doll created the biggest statistical study of smoking yet.
00:36:10.680 --> 00:36:14.680 He began tracking the lives of 40,000 British doctors, 00:36:14.680 --> 00:36:17.000 some of whom smoked and some of whom didn't, 00:36:17.000 --> 00:36:19.440 and gathered enough data 00:36:19.440 --> 00:36:22.000 to correlate the amount the doctors smoked 00:36:22.000 --> 00:36:24.920 with their likelihood of getting cancer. 00:36:24.920 --> 00:36:30.120 Eventually, he not only showed a correlation between smoking and lung cancer, 00:36:30.120 --> 00:36:35.800 but also a correlation between stopping smoking and reducing the risk. 00:36:35.800 --> 00:36:37.760 This was science at its best. 00:36:39.760 --> 00:36:44.000 What correlations do not replace is human thought. 00:36:44.000 --> 00:36:46.760 You've got to think about what it means. 00:36:46.760 --> 00:36:50.480 What a good scientist does, if he comes with a correlation, 00:36:50.480 --> 00:36:55.960 is try as hard as she or he possibly can to disprove it, 00:36:55.960 --> 00:37:00.200 to break it down, to get rid of it, to try and refute it. 00:37:00.200 --> 00:37:05.440 And if it withstands all those efforts at demolishing it 00:37:05.440 --> 00:37:10.760 and it is still standing up then, cautiously, you say, "We really might have something here." 00:37:26.720 --> 00:37:32.840 However brilliant the scientist, data is still the oxygen of science. 00:37:32.840 --> 00:37:39.320 The good news is that the more we have, the more correlations we'll find, the more theories we'll test, 00:37:39.320 --> 00:37:42.240 and the more discoveries we're likely to make. 00:37:46.160 --> 00:37:53.440 And history shows how our total sum of information grows in huge leaps as we develop new technologies. 00:37:53.440 --> 00:38:00.000 The invention of the printing press kicked off the first data and information explosion. 
00:38:00.000 --> 00:38:06.000 If you piled up all the books that had been printed by the year 1700, 00:38:06.000 --> 00:38:11.200 they would make 60 stacks each as high as Mount Everest. 00:38:12.880 --> 00:38:15.360 Then, starting in the 19th century, 00:38:15.360 --> 00:38:19.880 there came a second information revolution with the telegraph, 00:38:19.880 --> 00:38:23.960 gramophone and camera. And later radio and TV. 00:38:23.960 --> 00:38:28.200 The total amount of information exploded. 00:38:28.200 --> 00:38:35.200 And by the 1950s the information available to us all had multiplied 6,000 times. 00:38:35.200 --> 00:38:41.440 Then, thanks to the computer and later the internet, we went digital. 00:38:41.440 --> 00:38:47.200 And the amount of data we have now is unimaginably vast. 00:38:49.920 --> 00:38:55.080 A single letter printed in a book is equivalent to a byte of data. 00:38:55.080 --> 00:38:58.720 A printed page equals a kilobyte or two. 00:39:01.960 --> 00:39:06.240 Five megabytes is enough for the complete works of Shakespeare. 00:39:08.000 --> 00:39:11.680 10 gigabytes - that's a DVD movie. 00:39:16.840 --> 00:39:23.360 Two terabytes is the tens of millions of photos added to Facebook every day. 00:39:24.880 --> 00:39:32.200 Ten petabytes is the data recorded every second by the world's largest particle accelerator. 00:39:32.200 --> 00:39:35.800 So much only a tiny fraction is kept. 00:39:35.800 --> 00:39:43.440 Six exabytes is what you'd have if you sequenced the genomes of every single person on Earth. 00:39:48.680 --> 00:39:50.520 But really, that's nothing. 00:39:50.520 --> 00:39:55.080 In 2009, the internet added up to 500 exabytes. 00:39:55.080 --> 00:40:02.120 In 2010, in just one year, that will double to more than one zettabyte! 00:40:06.360 --> 00:40:14.000 Back in the real world, if we turned all this data into print it would make 90 stacks of books, 00:40:14.000 --> 00:40:18.560 each reaching from here all the way to the sun! 
00:40:18.560 --> 00:40:23.600 The data deluge is staggering, but, with today's computers 00:40:23.600 --> 00:40:28.200 and statistics, I'm confident we can handle it. 00:40:28.200 --> 00:40:31.400 When it comes to all the data on the internet, 00:40:31.400 --> 00:40:33.760 the powerhouse of statistical analysis 00:40:33.760 --> 00:40:37.560 is the Silicon Valley giant Google. 00:40:44.000 --> 00:40:50.600 The average person over their lifetime is exposed to about 100 million words of conversation. 00:40:50.600 --> 00:40:54.840 And so if you multiply that by the six billion people on the planet, 00:40:54.840 --> 00:40:58.040 that amount of words is about equal to the number of words 00:40:58.040 --> 00:41:01.080 that Google has available at any one instant in time. 00:41:03.480 --> 00:41:08.680 Google's computers hoover up and file away every document, web page, and image they can find. 00:41:08.680 --> 00:41:14.640 They then hunt for patterns and correlations in all this data, 00:41:14.640 --> 00:41:17.760 doing statistics on a massive scale. 00:41:17.760 --> 00:41:25.560 And, for me, Google has one project that's particularly exciting - statistical language translation. 00:41:25.560 --> 00:41:30.880 We wanted to provide access to all the web's information, no matter what language you spoke. 00:41:30.880 --> 00:41:33.520 There's just so much information on the internet, 00:41:33.520 --> 00:41:37.880 you couldn't hope to translate it all by hand into every possible language. 00:41:37.880 --> 00:41:41.560 We figured we'd have to be able to do machine translation. 00:41:44.280 --> 00:41:47.360 In the past, programmers tried to teach their computers 00:41:47.360 --> 00:41:53.320 to see each language as a set of grammatical rules - much like the way languages are taught at school. 00:41:53.320 --> 00:41:58.760 But this didn't work because no set of rules could capture a language 00:41:58.760 --> 00:42:01.480 in all its subtlety and ambiguity.
00:42:01.480 --> 00:42:05.840 "Having eaten our lunch the coach departed." 00:42:05.840 --> 00:42:07.920 Well, that's obviously incorrect. 00:42:07.920 --> 00:42:12.000 Written like that it would imply that the coach has eaten the lunch. 00:42:12.000 --> 00:42:15.160 It would be far better to say... 00:42:15.160 --> 00:42:19.920 "having eaten our lunch we departed in the coach." 00:42:19.920 --> 00:42:26.320 Those rules are helpful and they are useful most of the time, but they don't turn out to be true all the time. 00:42:26.320 --> 00:42:30.320 And the insight of using statistical machine translation is saying, 00:42:30.320 --> 00:42:35.280 "If you've got to have all these exceptions anyways, maybe you can get by without having any of the rules. 00:42:35.280 --> 00:42:39.480 "Maybe you can treat everything as an exception." And that's essentially what we've done. 00:42:48.840 --> 00:42:52.640 What the computer is doing when he's learning how to translate 00:42:52.640 --> 00:42:55.160 is to learn correlations between words 00:42:55.160 --> 00:42:57.240 and correlations between phrases. 00:42:57.240 --> 00:43:00.840 So we feed the system very large amounts of data 00:43:00.840 --> 00:43:04.720 and then the system is seeing that a certain word or a certain phrase 00:43:04.720 --> 00:43:07.600 correlates very often to the other language. 00:43:09.800 --> 00:43:15.800 Google's website currently offers translation between any of 57 different languages. 00:43:15.800 --> 00:43:22.680 It does this purely statistically, having correlated a huge collection of multilingual texts. 00:43:22.680 --> 00:43:25.600 The people that built the system don't need to know Chinese 00:43:25.600 --> 00:43:29.800 in order to build the Chinese-to-English system, or they don't need to know Arabic.
00:43:29.800 --> 00:43:33.040 But the expertise that's needed is basically knowledge of statistics, 00:43:33.040 --> 00:43:35.840 knowledge of computer science, knowledge of infrastructure 00:43:35.840 --> 00:43:40.880 to build those very large computational systems that we are building for doing that. 00:43:42.880 --> 00:43:48.360 I hooked up with Google from my office in Stockholm to try the translator for myself. 00:43:48.360 --> 00:43:51.760 'I will type... some Swedish sentences.' 00:43:51.760 --> 00:43:53.080 OK. 00:43:53.080 --> 00:43:55.240 Sveriges... 00:43:55.240 --> 00:43:59.280 ..guldring i örat. 00:44:00.920 --> 00:44:07.400 OK. So it says, "Sweden's finance minister has a ponytail and a gold ring in your ear." 00:44:07.400 --> 00:44:11.520 I guess it probably means in his ear. 'That's exactly correct, it's amazing! 00:44:11.520 --> 00:44:15.400 'He comes from the Conservative party, that's the kind of Sweden we have today. 00:44:15.400 --> 00:44:18.520 'I will type one more sentence.' 00:44:18.520 --> 00:44:22.080 'I sitt samkönade...' 00:44:22.080 --> 00:44:25.600 partnerskap... 00:44:25.600 --> 00:44:28.280 nya biskop. 00:44:28.280 --> 00:44:35.200 "In his same-sex partnership has Stockholm's new bishop and his partners a three-year son." 00:44:35.200 --> 00:44:38.120 It's almost perfect, there's one important thing - 00:44:38.120 --> 00:44:41.800 it's HER, it's a lesbian partnership. 00:44:41.800 --> 00:44:46.760 OK, so those kinds of words his and her are one of the challenges 00:44:46.760 --> 00:44:49.080 in translation to get really those right. 00:44:49.080 --> 00:44:51.920 Especially when it comes to bishops one can excuse it! 00:44:51.920 --> 00:44:53.640 'Right, right.' 00:44:53.640 --> 00:44:58.520 I guess more often than not it would probably be a "his". 'I will write one more sentence.' 00:44:58.520 --> 00:45:01.720 'När Sverige deltar i olympiader är målet 00:45:01.720 --> 00:45:03.720 'inte att vinna utan att slå Norge.'
00:45:06.400 --> 00:45:11.960 OK. "When Sweden is taking part in Olympic goal is not to win but to beat Norway." 00:45:11.960 --> 00:45:13.640 'Yes! This is what it is! 00:45:13.640 --> 00:45:17.920 'But they are very good in Winter Olympics, so we can't make it, but we are trying.' 00:45:17.920 --> 00:45:19.960 Ah, very good, very good. 00:45:19.960 --> 00:45:24.960 'This is absolutely amazing, you know, and I was especially impressed 00:45:24.960 --> 00:45:30.520 'that it picks up words like "same-sex partnership" which are very new to the language.' 00:45:30.520 --> 00:45:36.920 'The translator is good, but if they succeed with what's next, that'll be remarkable.' 00:45:36.920 --> 00:45:38.440 One of the exciting possibilities 00:45:38.440 --> 00:45:42.720 is combining the machine translation technology with the speech recognition technology. 00:45:42.720 --> 00:45:45.480 Now, both of these are statistical in nature. 00:45:45.480 --> 00:45:51.360 The machine translation relies on the statistics of mapping from one language to another, 00:45:51.360 --> 00:45:57.840 and similarly speech recognition relies on the statistics of mapping from a sound form to the words. 00:45:57.840 --> 00:45:59.520 When we put them together, 00:45:59.520 --> 00:46:03.200 now we have the capability of having instant conversation 00:46:03.200 --> 00:46:06.760 between two people that don't speak a common language. 00:46:06.760 --> 00:46:08.680 I can talk to you in my language, 00:46:08.680 --> 00:46:11.880 you hear me in your language and you can answer back. 00:46:11.880 --> 00:46:15.000 And in real time we can make that translation, 00:46:15.000 --> 00:46:18.800 we can bring two people together and allow them to speak. 00:46:31.400 --> 00:46:39.040 The internet is just one of many technologies created to gather massive amounts of data.
00:46:39.040 --> 00:46:43.640 Scientists studying our earth and our environment 00:46:43.640 --> 00:46:47.440 now use an incredible range of instruments 00:46:47.440 --> 00:46:50.920 to measure the processes of our planet. 00:46:52.760 --> 00:47:00.360 All around us are sensors continuously measuring temperature, water flow, and ocean currents. 00:47:00.360 --> 00:47:06.800 And high in orbit are satellites busy imaging cloud formations, forest growth and snow cover. 00:47:06.800 --> 00:47:11.360 Scientists speak of "instrumenting the earth". 00:47:13.320 --> 00:47:20.160 And pointing up to the skies above are powerful new telescopes mapping the universe. 00:47:30.280 --> 00:47:34.760 What's happening in astronomy is typical of how profoundly 00:47:34.760 --> 00:47:39.760 this new torrent of data is transforming science. 00:47:39.760 --> 00:47:45.280 Astronomers are now addressing many enduring mysteries of the cosmos 00:47:45.280 --> 00:47:49.600 by applying statistical methods to all this new data. 00:47:59.800 --> 00:48:03.360 The galaxy is a very big place and it's got billions of stars in it, 00:48:03.360 --> 00:48:09.400 and so to put together a coherent picture of the whole galaxy requires having an enormous amount of data. 00:48:09.400 --> 00:48:13.720 And before you could do a large sky survey with sensitive, digital detectors 00:48:13.720 --> 00:48:16.880 that meant that you could map many, many stars all at once, 00:48:16.880 --> 00:48:20.680 it was very difficult to build up enough data on enough of the galaxy. 00:48:24.600 --> 00:48:28.560 In the past, large surveys of the night sky had to be done 00:48:28.560 --> 00:48:32.400 by exposing thousands of large photographic plates. 00:48:32.400 --> 00:48:37.200 But these surveys could take 25 years or more to complete. 00:48:39.040 --> 00:48:44.680 Then, in the 1990s, came digital astronomy and a huge increase 00:48:44.680 --> 00:48:49.600 in both the amount and the accessibility of data. 
00:48:49.600 --> 00:48:55.960 The Sloan Sky Survey is the world's biggest yet, using a massive digital sensor 00:48:55.960 --> 00:49:00.840 mounted on the back of a custom-built telescope in New Mexico. 00:49:00.840 --> 00:49:05.240 It's scanned the sky night after night for eight years, 00:49:05.240 --> 00:49:09.800 building up a composite picture in unprecedented resolution. 00:49:09.800 --> 00:49:14.840 The Sloan is some of the best, deepest survey data that we have in astronomy. 00:49:14.840 --> 00:49:18.760 Both on our own galaxy and on galaxies further away from ours. 00:49:24.080 --> 00:49:27.320 All the Sloan data is on the internet, 00:49:27.320 --> 00:49:34.120 and with it astronomers have identified millions of hitherto unknown stars and galaxies. 00:49:34.120 --> 00:49:37.480 They also comb the database for statistical patterns 00:49:37.480 --> 00:49:42.800 which will prove, disprove, or even suggest new theories. 00:49:42.800 --> 00:49:49.160 So we have this idea that galaxies grow, they become large galaxies like the one we live in, the Milky Way, 00:49:49.160 --> 00:49:55.880 not all at once, or not smoothly, but by continuously incorporating, 00:49:55.880 --> 00:49:59.160 basically cannibalising, smaller galaxies. 00:49:59.160 --> 00:50:04.000 They dissolve them and they become part of the bigger galaxy as it grows. 00:50:06.040 --> 00:50:12.520 It's a startling idea, and, in the Sloan data, is the evidence to support it. 00:50:12.520 --> 00:50:16.280 Groups of stars that came from cannibalised galaxies 00:50:16.280 --> 00:50:21.240 stand out in the Sloan data as statistically different from other stars 00:50:21.240 --> 00:50:24.280 because they move at a different velocity. 00:50:24.280 --> 00:50:28.680 Each big spike on one of these distribution graphs 00:50:28.680 --> 00:50:35.120 means Professor Rockosi has found a group of stars all travelling in a different way to the rest.
00:50:35.120 --> 00:50:38.360 They are the telltale patterns she's looking for. 00:50:40.240 --> 00:50:44.960 The evidence is accumulating that, in fact, this really is how galaxies grow, 00:50:44.960 --> 00:50:47.440 or an important way in which how galaxies grow. 00:50:47.440 --> 00:50:53.000 And so this is an important part of understanding how galaxies form, not only ours but every galaxy. 00:50:56.360 --> 00:51:00.400 The more data there is, the more discoveries can be made. 00:51:00.400 --> 00:51:03.320 And the technology is getting better all the time. 00:51:03.320 --> 00:51:07.560 The next big survey telescope starts its work in 2015. 00:51:07.560 --> 00:51:10.760 It will leave Sloan in the dust! 00:51:10.760 --> 00:51:16.160 Sloan has taken eight years to cover one quarter of the night sky. 00:51:17.680 --> 00:51:25.680 The new telescope will scan the entire sky, in even greater resolution, every three days! 00:51:34.120 --> 00:51:41.000 The vast amounts of data we have today allows researchers in all sorts of fields 00:51:41.000 --> 00:51:46.280 to test their theories on a previously unimaginable scale. 00:51:46.280 --> 00:51:53.600 But more than this, it may even change the fundamental way science is done. 00:51:53.600 --> 00:51:58.560 With the power of today's computers applied to all this data, 00:51:58.560 --> 00:52:03.880 the machines might even be able to guide the researchers. 00:52:14.600 --> 00:52:17.920 We're at a potentially profoundly important 00:52:17.920 --> 00:52:22.560 and potentially one of the most significant points in science, 00:52:22.560 --> 00:52:24.680 and certainly one of the most exciting, 00:52:24.680 --> 00:52:32.080 where the potential to transform not just how scientists do science but even what science is possible. 
00:52:32.080 --> 00:52:34.680 And what will power that transformation 00:52:34.680 --> 00:52:38.400 of both how science is done and even what science is possible 00:52:38.400 --> 00:52:40.120 is going to be computation. 00:52:41.800 --> 00:52:49.440 Many of the dynamics of the natural world, like the interplay between the rainforests and the atmosphere, 00:52:49.440 --> 00:52:53.560 are so complex that we don't as yet really understand them. 00:52:53.560 --> 00:52:59.280 But now computers are generating literally tens of thousands of different simulations 00:52:59.280 --> 00:53:03.480 of how these biological systems might work. 00:53:03.480 --> 00:53:07.840 It's like creating thousands of hypothetical parallel worlds. 00:53:07.840 --> 00:53:10.640 Each and every one of these simulations 00:53:10.640 --> 00:53:18.360 is analysed with statistics to see if any are a good match for what is observed in nature. 00:53:18.360 --> 00:53:21.840 The computers can now automatically generate, 00:53:21.840 --> 00:53:26.240 test and discard hypotheses with scarcely a human in sight. 00:53:28.240 --> 00:53:35.120 This new application of statistics will become absolutely vital for the future of science. 00:53:35.120 --> 00:53:39.400 It's creating a new paradigm, if you like, 00:53:39.400 --> 00:53:42.640 in science, in the way in which we can do science, 00:53:42.640 --> 00:53:45.280 which is increasingly... 00:53:45.280 --> 00:53:51.160 Which one might characterise as... data-centric or data driven 00:53:51.160 --> 00:53:55.000 rather than being hypothesis-driven or experimentally-driven. 00:53:55.000 --> 00:53:58.240 So, it's exciting times in terms of the science, 00:53:58.240 --> 00:54:02.200 in terms of the computation and in terms of the statistics. 00:54:08.800 --> 00:54:15.480 Now, if all that sounds a bit abstract and theoretical to you, how about one final frontier? 00:54:15.480 --> 00:54:19.040 Could statistics even make sense of your feelings? 
00:54:21.200 --> 00:54:25.800 In California - where else? - one computer scientist 00:54:25.800 --> 00:54:32.680 is harvesting the internet to try to divine the patterns of our innermost thoughts and emotions. 00:54:44.800 --> 00:54:46.360 This is the madness movement. 00:54:46.360 --> 00:54:50.960 The madness movement represents a skyscraper view of the world. 00:54:50.960 --> 00:54:54.880 Each of these brightly coloured dots is an individual feeling 00:54:54.880 --> 00:54:58.720 expressed by someone out there in a blog or a tweet. 00:54:58.720 --> 00:55:04.480 And when you click on the dot it explodes to reveal the underlying feeling of that person. 00:55:04.480 --> 00:55:07.080 This is what people say they're feeling today. 00:55:07.720 --> 00:55:10.160 Better...safe... 00:55:10.160 --> 00:55:12.040 crappy... 00:55:12.040 --> 00:55:14.560 well... 00:55:14.560 --> 00:55:18.440 pretty...special... 00:55:18.440 --> 00:55:20.800 sorry...alone... 00:55:25.560 --> 00:55:29.040 So, every minute, We Feel Fine crawls the world's blogs, 00:55:29.040 --> 00:55:34.120 takes all the sentences that start with the words "I feel" or "I am feeling", 00:55:34.120 --> 00:55:35.920 and puts them in a database. 00:55:35.920 --> 00:55:40.080 We collect all the feelings and we count the most common. 00:55:40.080 --> 00:55:43.320 They are better...bad... 00:55:43.320 --> 00:55:45.640 good...right... 00:55:45.640 --> 00:55:48.520 guilty...sick... 00:55:48.520 --> 00:55:51.680 the same...like shit... 00:55:51.680 --> 00:55:54.720 sorry...well... 00:55:54.720 --> 00:55:56.240 and so on. 00:55:58.320 --> 00:56:01.760 And we can take a look at any one feeling and analyse it. 00:56:01.760 --> 00:56:04.800 Right now a lot of people are feeling happy. 00:56:04.800 --> 00:56:11.320 We can take a look at all the people who are happy and break it down by age, gender or location. 
00:56:11.320 --> 00:56:16.840 Since bloggers have public profiles we have that information and so we can ask questions like, 00:56:16.840 --> 00:56:21.400 "Are women happier than men?" or, "Is England happier than the United States?" 00:56:30.240 --> 00:56:33.120 We find that, as people get older, they get happier. 00:56:33.120 --> 00:56:40.560 And, moreover, we find that for younger people they associate happiness more with excitement, 00:56:40.560 --> 00:56:47.000 and, as people get older, they associate happiness more with peacefulness. 00:56:51.240 --> 00:56:57.760 And we also find that women feel loved more often than men, but also more guilty. 00:56:57.760 --> 00:57:02.480 While men feel good more often than women, but also more alone. 00:57:06.640 --> 00:57:12.480 As people lead more and more of their lives online, they leave behind digital traces, 00:57:12.480 --> 00:57:19.840 and with these digital traces we can begin to statistically analyse what it means to be human. 00:57:51.280 --> 00:57:54.480 So where does all of this leave us? 00:57:54.480 --> 00:58:00.160 We generate unimaginable quantities of data about everything you can think of. 00:58:00.160 --> 00:58:02.800 We analyse it to reveal the patterns. 00:58:02.800 --> 00:58:10.480 And now not only experts but all of us can understand the stories in the numbers. 00:58:18.160 --> 00:58:21.080 Instead of being led astray by prejudice, 00:58:21.080 --> 00:58:28.160 with statistics at our fingertips, our eyes can be open for a fact-based view of the world. 00:58:28.160 --> 00:58:33.760 So, more than ever before, we can become authors of our own destiny. 00:58:33.760 --> 00:58:36.800 And that's pretty exciting isn't it?! 
00:58:37.680 --> 00:58:44.200 # 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20 00:58:44.200 --> 00:58:50.800 # 1, 22, 3, 24, 25, 26, 27, 28, 9, 30, 31, 32, 3, 34, 35, 36, 7 00:58:50.800 --> 00:58:54.440 # 38, 39, 40, 41, 42, 3, 44, 45, 46, 47 00:58:54.440 --> 00:58:58.680 LYRICS DEGENERATE INTO GIBBERISH 00:59:08.680 --> 00:59:13.400 GIBBERISH DEGENERATES INTO NOISE 00:59:13.400 --> 00:59:14.440 # 100. #