What we learned from 5 million books
-
0:00 - 0:02Erez Lieberman Aiden: Everyone knows
-
0:02 - 0:05that a picture is worth a thousand words.
-
0:07 - 0:09But we at Harvard
-
0:09 - 0:12were wondering if this was really true.
-
0:12 - 0:14(Laughter)
-
0:14 - 0:18So we assembled a team of experts,
-
0:18 - 0:20spanning Harvard, MIT,
-
0:20 - 0:23The American Heritage Dictionary, The Encyclopedia Britannica
-
0:23 - 0:25and even our proud sponsors,
-
0:25 - 0:28the Google.
-
0:28 - 0:30And we cogitated about this
-
0:30 - 0:32for about four years.
-
0:32 - 0:37And we came to a startling conclusion.
-
0:37 - 0:40Ladies and gentlemen, a picture is not worth a thousand words.
-
0:40 - 0:42In fact, we found some pictures
-
0:42 - 0:47that are worth 500 billion words.
-
0:47 - 0:49Jean-Baptiste Michel: So how did we get to this conclusion?
-
0:49 - 0:51So Erez and I were thinking about ways
-
0:51 - 0:53to get a big picture of human culture
-
0:53 - 0:56and human history: change over time.
-
0:56 - 0:58So many books actually have been written over the years.
-
0:58 - 1:00So we were thinking, well the best way to learn from them
-
1:00 - 1:02is to read all of these millions of books.
-
1:02 - 1:05Now of course, if there's a scale for how awesome that is,
-
1:05 - 1:08that has to rank extremely, extremely high.
-
1:08 - 1:10Now the problem is there's an X-axis for that,
-
1:10 - 1:12which is the practical axis.
-
1:12 - 1:14This is very, very low.
-
1:14 - 1:17(Applause)
-
1:17 - 1:20Now people tend to use an alternative approach,
-
1:20 - 1:22which is to take a few sources and read them very carefully.
-
1:22 - 1:24This is extremely practical, but not so awesome.
-
1:24 - 1:27What you really want to do
-
1:27 - 1:30is to get to the awesome yet practical part of this space.
-
1:30 - 1:33So it turns out there was a company across the river called Google
-
1:33 - 1:35who had started a digitization project a few years back
-
1:35 - 1:37that might just enable this approach.
-
1:37 - 1:39They have digitized millions of books.
-
1:39 - 1:42So what that means is, one could use computational methods
-
1:42 - 1:44to read all of the books in a click of a button.
-
1:44 - 1:47That's very practical and extremely awesome.
-
1:48 - 1:50ELA: Let me tell you a little bit about where books come from.
-
1:50 - 1:53Since time immemorial, there have been authors.
-
1:53 - 1:56These authors have been striving to write books.
-
1:56 - 1:58And this became considerably easier
-
1:58 - 2:00with the development of the printing press some centuries ago.
-
2:00 - 2:03Since then, the authors have won
-
2:03 - 2:05on 129 million distinct occasions,
-
2:05 - 2:07publishing books.
-
2:07 - 2:09Now if those books are not lost to history,
-
2:09 - 2:11then they are somewhere in a library,
-
2:11 - 2:14and many of those books have been getting retrieved from the libraries
-
2:14 - 2:16and digitized by Google,
-
2:16 - 2:18which has scanned 15 million books to date.
-
2:18 - 2:21Now when Google digitizes a book, they put it into a really nice format.
-
2:21 - 2:23Now we've got the data, plus we have metadata.
-
2:23 - 2:26We have information about things like where was it published,
-
2:26 - 2:28who was the author, when was it published.
-
2:28 - 2:31And what we do is go through all of those records
-
2:31 - 2:35and exclude everything that's not the highest quality data.
-
2:35 - 2:37What we're left with
-
2:37 - 2:40is a collection of five million books,
-
2:40 - 2:43500 billion words,
-
2:43 - 2:45a string of characters a thousand times longer
-
2:45 - 2:48than the human genome --
-
2:48 - 2:50a text which, when written out,
-
2:50 - 2:52would stretch from here to the Moon and back
-
2:52 - 2:54 10 times over --
-
2:54 - 2:58a veritable shard of our cultural genome.
-
2:58 - 3:00Of course what we did
-
3:00 - 3:03when faced with such outrageous hyperbole ...
-
3:03 - 3:05(Laughter)
-
3:05 - 3:08was what any self-respecting researchers
-
3:08 - 3:11would have done.
-
3:11 - 3:13We took a page out of XKCD,
-
3:13 - 3:15and we said, "Stand back.
-
3:15 - 3:17We're going to try science."
-
3:17 - 3:19(Laughter)
-
3:19 - 3:21JM: Now of course, we were thinking,
-
3:21 - 3:23well let's just first put the data out there
-
3:23 - 3:25for people to do science to it.
-
3:25 - 3:27Now we're thinking, what data can we release?
-
3:27 - 3:29Well of course, you want to take the books
-
3:29 - 3:31and release the full text of these five million books.
-
3:31 - 3:33Now Google, and Jon Orwant in particular,
-
3:33 - 3:35told us a little equation that we should learn.
-
3:35 - 3:38So you have five million, that is, five million authors
-
3:38 - 3:41and five million plaintiffs is a massive lawsuit.
-
3:41 - 3:43So, although that would be really, really awesome,
-
3:43 - 3:46again, that's extremely, extremely impractical.
-
3:46 - 3:48(Laughter)
-
3:48 - 3:50Now again, we kind of caved in,
-
3:50 - 3:53and we did the very practical approach, which was a bit less awesome.
-
3:53 - 3:55We said, well instead of releasing the full text,
-
3:55 - 3:57we're going to release statistics about the books.
-
3:57 - 3:59So take for instance "A gleam of happiness."
-
3:59 - 4:01It's four words; we call that a four-gram.
-
4:01 - 4:03We're going to tell you how many times a particular four-gram
-
4:03 - 4:05appeared in books in 1801, 1802, 1803,
-
4:05 - 4:07all the way up to 2008.
-
4:07 - 4:09That gives us a time series
-
4:09 - 4:11of how frequently this particular sentence was used over time.
-
4:11 - 4:14We do that for all the words and phrases that appear in those books,
-
4:14 - 4:17and that gives us a big table of two billion lines
-
4:17 - 4:19that tell us about the way culture has been changing.
-
4:19 - 4:21ELA: So those two billion lines,
-
4:21 - 4:23we call them two billion n-grams.
-
4:23 - 4:25What do they tell us?
-
4:25 - 4:27Well the individual n-grams measure cultural trends.
-
4:27 - 4:29Let me give you an example.
-
4:29 - 4:31Let's suppose that I am thriving,
-
4:31 - 4:33then tomorrow I want to tell you about how well I did.
-
4:33 - 4:36And so I might say, "Yesterday, I throve."
-
4:36 - 4:39Alternatively, I could say, "Yesterday, I thrived."
-
4:39 - 4:42Well which one should I use?
-
4:42 - 4:44How to know?
-
4:44 - 4:46As of about six months ago,
-
4:46 - 4:48the state of the art in this field
-
4:48 - 4:50is that you would, for instance,
-
4:50 - 4:52go up to the following psychologist with fabulous hair,
-
4:52 - 4:54and you'd say,
-
4:54 - 4:57"Steve, you're an expert on the irregular verbs.
-
4:57 - 4:59What should I do?"
-
4:59 - 5:01And he'd tell you, "Well most people say thrived,
-
5:01 - 5:04but some people say throve."
-
5:04 - 5:06And you also knew, more or less,
-
5:06 - 5:09that if you were to go back in time 200 years
-
5:09 - 5:12and ask the following statesman with equally fabulous hair,
-
5:12 - 5:15(Laughter)
-
5:15 - 5:17"Tom, what should I say?"
-
5:17 - 5:19He'd say, "Well, in my day, most people throve,
-
5:19 - 5:22but some thrived."
-
5:22 - 5:24So now what I'm just going to show you is raw data.
-
5:24 - 5:28Two rows from this table of two billion entries.
-
5:28 - 5:30What you're seeing is year by year frequency
-
5:30 - 5:33of "thrived" and "throve" over time.
-
5:34 - 5:36Now this is just two
-
5:36 - 5:39out of two billion rows.
-
5:39 - 5:41So the entire data set
-
5:41 - 5:44is a billion times more awesome than this slide.
-
5:44 - 5:46(Laughter)
-
5:46 - 5:50(Applause)
-
5:50 - 5:52JM: Now there are many other pictures that are worth 500 billion words.
-
5:52 - 5:54For instance, this one.
-
5:54 - 5:56If you just take influenza,
-
5:56 - 5:58you will see peaks at the time where you knew
-
5:58 - 6:01big flu epidemics were killing people around the globe.
-
6:01 - 6:04ELA: If you were not yet convinced,
-
6:04 - 6:06sea levels are rising,
-
6:06 - 6:09so is atmospheric CO2 and global temperature.
-
6:09 - 6:12JM: You might also want to have a look at this particular n-gram,
-
6:12 - 6:15and that's to tell Nietzsche that God is not dead,
-
6:15 - 6:18although you might agree that he might need a better publicist.
-
6:18 - 6:20(Laughter)
-
6:20 - 6:23ELA: You can get at some pretty abstract concepts with this sort of thing.
-
6:23 - 6:25For instance, let me tell you the history
-
6:25 - 6:27of the year 1950.
-
6:27 - 6:29Pretty much for the vast majority of history,
-
6:29 - 6:31no one gave a damn about 1950.
-
6:31 - 6:33In 1700, in 1800, in 1900,
-
6:33 - 6:36no one cared.
-
6:37 - 6:39Through the 30s and 40s,
-
6:39 - 6:41no one cared.
-
6:41 - 6:43Suddenly, in the mid-40s,
-
6:43 - 6:45there started to be a buzz.
-
6:45 - 6:47People realized that 1950 was going to happen,
-
6:47 - 6:49and it could be big.
-
6:49 - 6:52(Laughter)
-
6:52 - 6:55But nothing got people interested in 1950
-
6:55 - 6:58like the year 1950.
-
6:58 - 7:01(Laughter)
-
7:01 - 7:03People were walking around obsessed.
-
7:03 - 7:05They couldn't stop talking
-
7:05 - 7:08about all the things they did in 1950,
-
7:08 - 7:11all the things they were planning to do in 1950,
-
7:11 - 7:16all the dreams of what they wanted to accomplish in 1950.
-
7:16 - 7:18In fact, 1950 was so fascinating
-
7:18 - 7:20that for years thereafter,
-
7:20 - 7:23people just kept talking about all the amazing things that happened,
-
7:23 - 7:25in '51, '52, '53.
-
7:25 - 7:27Finally in 1954,
-
7:27 - 7:29someone woke up and realized
-
7:29 - 7:33that 1950 had gotten somewhat passé.
-
7:33 - 7:35(Laughter)
-
7:35 - 7:37And just like that, the bubble burst.
-
7:37 - 7:39(Laughter)
-
7:39 - 7:41And the story of 1950
-
7:41 - 7:43is the story of every year that we have on record,
-
7:43 - 7:46with a little twist, because now we've got these nice charts.
-
7:46 - 7:49And because we have these nice charts, we can measure things.
-
7:49 - 7:51We can say, "Well how fast does the bubble burst?"
-
7:51 - 7:54And it turns out that we can measure that very precisely.
-
7:54 - 7:57Equations were derived, graphs were produced,
-
7:57 - 7:59and the net result
-
7:59 - 8:02is that we find that the bubble bursts faster and faster
-
8:02 - 8:04with each passing year.
-
8:04 - 8:09We are losing interest in the past more rapidly.
-
8:09 - 8:11JM: Now a little piece of career advice.
-
8:11 - 8:13So for those of you who seek to be famous,
-
8:13 - 8:15we can learn from the 25 most famous political figures,
-
8:15 - 8:17authors, actors and so on.
-
8:17 - 8:20So if you want to become famous early on, you should be an actor,
-
8:20 - 8:22because then fame starts rising by the end of your 20s --
-
8:22 - 8:24you're still young, it's really great.
-
8:24 - 8:26Now if you can wait a little bit, you should be an author,
-
8:26 - 8:28because then you rise to very great heights,
-
8:28 - 8:30like Mark Twain, for instance: extremely famous.
-
8:30 - 8:32But if you want to reach the very top,
-
8:32 - 8:34you should delay gratification
-
8:34 - 8:36and, of course, become a politician.
-
8:36 - 8:38So here you will become famous by the end of your 50s,
-
8:38 - 8:40and become very, very famous afterward.
-
8:40 - 8:43So scientists also tend to get famous when they're much older.
-
8:43 - 8:45Like for instance, biologists and physicists
-
8:45 - 8:47tend to be almost as famous as actors.
-
8:47 - 8:50One mistake you should not do is become a mathematician.
-
8:50 - 8:52(Laughter)
-
8:52 - 8:54If you do that,
-
8:54 - 8:57you might think, "Oh great. I'm going to do my best work when I'm in my 20s."
-
8:57 - 8:59But guess what, nobody will really care.
-
8:59 - 9:02(Laughter)
-
9:02 - 9:04ELA: There are more sobering notes
-
9:04 - 9:06among the n-grams.
-
9:06 - 9:08For instance, here's the trajectory of Marc Chagall,
-
9:08 - 9:10an artist born in 1887.
-
9:10 - 9:13And this looks like the normal trajectory of a famous person.
-
9:13 - 9:17He gets more and more and more famous,
-
9:17 - 9:19except if you look in German.
-
9:19 - 9:21If you look in German, you see something completely bizarre,
-
9:21 - 9:23something you pretty much never see,
-
9:23 - 9:25which is he becomes extremely famous
-
9:25 - 9:27and then all of a sudden plummets,
-
9:27 - 9:30going through a nadir between 1933 and 1945,
-
9:30 - 9:33before rebounding afterward.
-
9:33 - 9:35And of course, what we're seeing
-
9:35 - 9:38is the fact Marc Chagall was a Jewish artist
-
9:38 - 9:40in Nazi Germany.
-
9:40 - 9:42Now these signals
-
9:42 - 9:44are actually so strong
-
9:44 - 9:47that we don't need to know that someone was censored.
-
9:47 - 9:49We can actually figure it out
-
9:49 - 9:51using really basic signal processing.
-
9:51 - 9:53Here's a simple way to do it.
-
9:53 - 9:55Well, a reasonable expectation
-
9:55 - 9:57is that somebody's fame in a given period of time
-
9:57 - 9:59should be roughly the average of their fame before
-
9:59 - 10:01and their fame after.
-
10:01 - 10:03So that's sort of what we expect.
-
10:03 - 10:06And we compare that to the fame that we observe.
-
10:06 - 10:08And we just divide one by the other
-
10:08 - 10:10to produce something we call a suppression index.
-
10:10 - 10:13If the suppression index is very, very, very small,
-
10:13 - 10:15then you very well might be being suppressed.
-
10:15 - 10:18If it's very large, maybe you're benefiting from propaganda.
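The suppression index described above can be sketched as follows. This is a rough illustration under stated assumptions, not the speakers' actual computation: the talk only says that expected fame is roughly the average of fame before and after the period, and that observed and expected are divided. The direction of the ratio (observed over expected) is inferred from the remark that a very small index suggests suppression. The 10-year comparison `window`, the function name, and the fame values are hypothetical.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    if not values:
        raise ValueError("no data points in the requested range")
    return sum(values) / len(values)


def suppression_index(fame_by_year, start, end, window=10):
    """Observed fame in [start, end] divided by expected fame.

    Expected fame is the average of mean fame in the `window` years before
    `start` and mean fame in the `window` years after `end`. The window
    length is an assumption of this sketch, not a value from the talk.
    """
    during = [f for y, f in fame_by_year.items() if start <= y <= end]
    before = [f for y, f in fame_by_year.items() if start - window <= y < start]
    after = [f for y, f in fame_by_year.items() if end < y <= end + window]
    expected = (mean(before) + mean(after)) / 2
    return mean(during) / expected


# Invented fame values (mentions per million words), for illustration only.
fame = {}
fame.update({year: 4.0 for year in range(1923, 1933)})  # before the period
fame.update({year: 1.0 for year in range(1933, 1946)})  # during the period
fame.update({year: 6.0 for year in range(1946, 1956)})  # after the period

index = suppression_index(fame, 1933, 1945)
print(index)
```

With these made-up numbers the expected fame is 5.0 and the observed fame is 1.0, so the index comes out well below one, the pattern the talk associates with suppression. A value near one would mean that what you expect is basically what you observe.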
-
10:19 - 10:21JM: Now you can actually look at
-
10:21 - 10:24the distribution of suppression indexes over whole populations.
-
10:24 - 10:26So for instance, here --
-
10:26 - 10:28this suppression index is for 5,000 people
-
10:28 - 10:30picked in English books where there's no known suppression --
-
10:30 - 10:32it would be like this, basically tightly centered on one.
-
10:32 - 10:34What you expect is basically what you observe.
-
10:34 - 10:36This is distribution as seen in Germany --
-
10:36 - 10:38very different, it's shifted to the left.
-
10:38 - 10:41People talked about it twice less as it should have been.
-
10:41 - 10:43But much more importantly, the distribution is much wider.
-
10:43 - 10:46There are many people who end up on the far left on this distribution
-
10:46 - 10:49who are talked about 10 times fewer than they should have been.
-
10:49 - 10:51But then also many people on the far right
-
10:51 - 10:53who seem to benefit from propaganda.
-
10:53 - 10:56This picture is the hallmark of censorship in the book record.
-
10:56 - 10:58ELA: So culturomics
-
10:58 - 11:00is what we call this method.
-
11:00 - 11:02It's kind of like genomics.
-
11:02 - 11:04Except genomics is a lens on biology
-
11:04 - 11:07through the window of the sequence of bases in the human genome.
-
11:07 - 11:09Culturomics is similar.
-
11:09 - 11:12It's the application of massive-scale data collection analysis
-
11:12 - 11:14to the study of human culture.
-
11:14 - 11:16Here, instead of through the lens of a genome,
-
11:16 - 11:19through the lens of digitized pieces of the historical record.
-
11:19 - 11:21The great thing about culturomics
-
11:21 - 11:23is that everyone can do it.
-
11:23 - 11:25Why can everyone do it?
-
11:25 - 11:27Everyone can do it because three guys,
-
11:27 - 11:30Jon Orwant, Matt Gray and Will Brockman over at Google,
-
11:30 - 11:32saw the prototype of the Ngram Viewer,
-
11:32 - 11:34and they said, "This is so fun.
-
11:34 - 11:37We have to make this available for people."
-
11:37 - 11:39So in two weeks flat -- the two weeks before our paper came out --
-
11:39 - 11:42they coded up a version of the Ngram Viewer for the general public.
-
11:42 - 11:45And so you too can type in any word or phrase that you're interested in
-
11:45 - 11:47and see its n-gram immediately --
-
11:47 - 11:49also browse examples of all the various books
-
11:49 - 11:51in which your n-gram appears.
-
11:51 - 11:53JM: Now this was used over a million times on the first day,
-
11:53 - 11:55and this is really the best of all the queries.
-
11:55 - 11:58So people want to be their best, put their best foot forward.
-
11:58 - 12:01But it turns out in the 18th century, people didn't really care about that at all.
-
12:01 - 12:04They didn't want to be their best, they wanted to be their beft.
-
12:04 - 12:07So what happened is, of course, this is just a mistake.
-
12:07 - 12:09It's not that they strove for mediocrity,
-
12:09 - 12:12it's just that the S used to be written differently, kind of like an F.
-
12:12 - 12:15Now of course, Google didn't pick this up at the time,
-
12:15 - 12:18so we reported this in the Science article that we wrote.
-
12:18 - 12:20But it turns out this is just a reminder
-
12:20 - 12:22that, although this is a lot of fun,
-
12:22 - 12:24when you interpret these graphs, you have to be very careful,
-
12:24 - 12:27and you have to adopt the base standards in the sciences.
-
12:27 - 12:30ELA: People have been using this for all kinds of fun purposes.
-
12:30 - 12:37(Laughter)
-
12:37 - 12:39Actually, we're not going to have to talk,
-
12:39 - 12:42we're just going to show you all the slides and remain silent.
-
12:42 - 12:45This person was interested in the history of frustration.
-
12:45 - 12:48There's various types of frustration.
-
12:48 - 12:51If you stub your toe, that's a one A "argh."
-
12:51 - 12:53If the planet Earth is annihilated by the Vogons
-
12:53 - 12:55to make room for an interstellar bypass,
-
12:55 - 12:57that's an eight A "aaaaaaaargh."
-
12:57 - 12:59This person studies all the "arghs,"
-
12:59 - 13:01from one through eight A's.
-
13:01 - 13:03And it turns out
-
13:03 - 13:05that the less-frequent "arghs"
-
13:05 - 13:08are, of course, the ones that correspond to things that are more frustrating --
-
13:08 - 13:11except, oddly, in the early 80s.
-
13:11 - 13:13We think that might have something to do with Reagan.
-
13:13 - 13:15(Laughter)
-
13:15 - 13:18JM: There are many usages of this data,
-
13:18 - 13:21but the bottom line is that the historical record is being digitized.
-
13:21 - 13:23Google has started to digitize 15 million books.
-
13:23 - 13:25That's 12 percent of all the books that have ever been published.
-
13:25 - 13:28It's a sizable chunk of human culture.
-
13:28 - 13:31There's much more in culture: there's manuscripts, there's newspapers,
-
13:31 - 13:33there's things that are not text, like art and paintings.
-
13:33 - 13:35These all happen to be on our computers,
-
13:35 - 13:37on computers across the world.
-
13:37 - 13:40And when that happens, that will transform the way we have
-
13:40 - 13:42to understand our past, our present and human culture.
-
13:42 - 13:44Thank you very much.
-
13:44 - 13:47(Applause)
- Title:
- What we learned from 5 million books
- Speaker:
- Jean-Baptiste Michel + Erez Lieberman Aiden
- Description:
-
Have you played with Google Labs' Ngram Viewer? It's an addicting tool that lets you search for words and ideas in a database of 5 million books from across centuries. Erez Lieberman Aiden and Jean-Baptiste Michel show us how it works, and a few of the surprising things we can learn from 500 billion words.
- Video Language:
- English
- Team:
closed TED
- Project:
- TEDTalks
- Duration:
- 13:48