What we learned from 5 million books

  • 0:00 - 0:02
    Erez Lieberman Aiden: Everyone knows
  • 0:02 - 0:05
    that a picture is worth a thousand words.
  • 0:07 - 0:09
    But we at Harvard
  • 0:09 - 0:12
    were wondering if this was really true.
  • 0:12 - 0:14
    (Laughter)
  • 0:14 - 0:18
    So we assembled a team of experts,
  • 0:18 - 0:20
    spanning Harvard, MIT,
  • 0:20 - 0:23
    The American Heritage Dictionary, The Encyclopedia Britannica
  • 0:23 - 0:25
    and even our proud sponsors,
  • 0:25 - 0:28
    the Google.
  • 0:28 - 0:30
    And we cogitated about this
  • 0:30 - 0:32
    for about four years.
  • 0:32 - 0:37
    And we came to a startling conclusion.
  • 0:37 - 0:40
    Ladies and gentlemen, a picture is not worth a thousand words.
  • 0:40 - 0:42
    In fact, we found some pictures
  • 0:42 - 0:47
    that are worth 500 billion words.
  • 0:47 - 0:49
    Jean-Baptiste Michel: So how did we get to this conclusion?
  • 0:49 - 0:51
    So Erez and I were thinking about ways
  • 0:51 - 0:53
    to get a big picture of human culture
  • 0:53 - 0:56
    and human history: change over time.
  • 0:56 - 0:58
    So many books actually have been written over the years.
  • 0:58 - 1:00
    So we were thinking, well the best way to learn from them
  • 1:00 - 1:02
    is to read all of these millions of books.
  • 1:02 - 1:05
    Now of course, if there's a scale for how awesome that is,
  • 1:05 - 1:08
    that has to rank extremely, extremely high.
  • 1:08 - 1:10
    Now the problem is there's an X-axis for that,
  • 1:10 - 1:12
    which is the practical axis.
  • 1:12 - 1:14
    This is very, very low.
  • 1:14 - 1:17
    (Applause)
  • 1:17 - 1:20
    Now people tend to use an alternative approach,
  • 1:20 - 1:22
    which is to take a few sources and read them very carefully.
  • 1:22 - 1:24
    This is extremely practical, but not so awesome.
  • 1:24 - 1:27
    What you really want to do
  • 1:27 - 1:30
    is to get to the awesome yet practical part of this space.
  • 1:30 - 1:33
    So it turns out there was a company across the river called Google
  • 1:33 - 1:35
    who had started a digitization project a few years back
  • 1:35 - 1:37
    that might just enable this approach.
  • 1:37 - 1:39
    They have digitized millions of books.
  • 1:39 - 1:42
    So what that means is, one could use computational methods
  • 1:42 - 1:44
    to read all of the books in a click of a button.
  • 1:44 - 1:47
    That's very practical and extremely awesome.
  • 1:48 - 1:50
    ELA: Let me tell you a little bit about where books come from.
  • 1:50 - 1:53
    Since time immemorial, there have been authors.
  • 1:53 - 1:56
    These authors have been striving to write books.
  • 1:56 - 1:58
    And this became considerably easier
  • 1:58 - 2:00
    with the development of the printing press some centuries ago.
  • 2:00 - 2:03
    Since then, the authors have won
  • 2:03 - 2:05
    on 129 million distinct occasions,
  • 2:05 - 2:07
    publishing books.
  • 2:07 - 2:09
    Now if those books are not lost to history,
  • 2:09 - 2:11
    then they are somewhere in a library,
  • 2:11 - 2:14
    and many of those books have been getting retrieved from the libraries
  • 2:14 - 2:16
    and digitized by Google,
  • 2:16 - 2:18
    which has scanned 15 million books to date.
  • 2:18 - 2:21
    Now when Google digitizes a book, they put it into a really nice format.
  • 2:21 - 2:23
    Now we've got the data, plus we have metadata.
  • 2:23 - 2:26
    We have information about things like where was it published,
  • 2:26 - 2:28
    who was the author, when was it published.
  • 2:28 - 2:31
    And what we do is go through all of those records
  • 2:31 - 2:35
    and exclude everything that's not the highest quality data.
  • 2:35 - 2:37
    What we're left with
  • 2:37 - 2:40
    is a collection of five million books,
  • 2:40 - 2:43
    500 billion words,
  • 2:43 - 2:45
    a string of characters a thousand times longer
  • 2:45 - 2:48
    than the human genome --
  • 2:48 - 2:50
    a text which, when written out,
  • 2:50 - 2:52
    would stretch from here to the Moon and back
  • 2:52 - 2:54
    10 times over --
  • 2:54 - 2:58
    a veritable shard of our cultural genome.
  • 2:58 - 3:00
    Of course what we did
  • 3:00 - 3:03
    when faced with such outrageous hyperbole ...
  • 3:03 - 3:05
    (Laughter)
  • 3:05 - 3:08
    was what any self-respecting researchers
  • 3:08 - 3:11
    would have done.
  • 3:11 - 3:13
    We took a page out of XKCD,
  • 3:13 - 3:15
    and we said, "Stand back.
  • 3:15 - 3:17
    We're going to try science."
  • 3:17 - 3:19
    (Laughter)
  • 3:19 - 3:21
    JM: Now of course, we were thinking,
  • 3:21 - 3:23
    well let's just first put the data out there
  • 3:23 - 3:25
    for people to do science to it.
  • 3:25 - 3:27
    Now we're thinking, what data can we release?
  • 3:27 - 3:29
    Well of course, you want to take the books
  • 3:29 - 3:31
    and release the full text of these five million books.
  • 3:31 - 3:33
    Now Google, and Jon Orwant in particular,
  • 3:33 - 3:35
    told us a little equation that we should learn.
  • 3:35 - 3:38
    So you have five million, that is, five million authors
  • 3:38 - 3:41
    and five million plaintiffs is a massive lawsuit.
  • 3:41 - 3:43
    So, although that would be really, really awesome,
  • 3:43 - 3:46
    again, that's extremely, extremely impractical.
  • 3:46 - 3:48
    (Laughter)
  • 3:48 - 3:50
    Now again, we kind of caved in,
  • 3:50 - 3:53
    and we did the very practical approach, which was a bit less awesome.
  • 3:53 - 3:55
    We said, well instead of releasing the full text,
  • 3:55 - 3:57
    we're going to release statistics about the books.
  • 3:57 - 3:59
    So take for instance "A gleam of happiness."
  • 3:59 - 4:01
    It's four words; we call that a four-gram.
  • 4:01 - 4:03
    We're going to tell you how many times a particular four-gram
  • 4:03 - 4:05
    appeared in books in 1801, 1802, 1803,
  • 4:05 - 4:07
    all the way up to 2008.
  • 4:07 - 4:09
    That gives us a time series
  • 4:09 - 4:11
    of how frequently this particular sentence was used over time.
  • 4:11 - 4:14
    We do that for all the words and phrases that appear in those books,
  • 4:14 - 4:17
    and that gives us a big table of two billion lines
  • 4:17 - 4:19
    that tell us about the way culture has been changing.
  • 4:19 - 4:21
    ELA: So those two billion lines,
  • 4:21 - 4:23
    we call them two billion n-grams.
  • 4:23 - 4:25
    What do they tell us?
  • 4:25 - 4:27
    Well the individual n-grams measure cultural trends.
  • 4:27 - 4:29
    Let me give you an example.
  • 4:29 - 4:31
    Let's suppose that I am thriving,
  • 4:31 - 4:33
    then tomorrow I want to tell you about how well I did.
  • 4:33 - 4:36
    And so I might say, "Yesterday, I throve."
  • 4:36 - 4:39
    Alternatively, I could say, "Yesterday, I thrived."
  • 4:39 - 4:42
    Well which one should I use?
  • 4:42 - 4:44
    How to know?
  • 4:44 - 4:46
    As of about six months ago,
  • 4:46 - 4:48
    the state of the art in this field
  • 4:48 - 4:50
    is that you would, for instance,
  • 4:50 - 4:52
    go up to the following psychologist with fabulous hair,
  • 4:52 - 4:54
    and you'd say,
  • 4:54 - 4:57
    "Steve, you're an expert on the irregular verbs.
  • 4:57 - 4:59
    What should I do?"
  • 4:59 - 5:01
    And he'd tell you, "Well most people say thrived,
  • 5:01 - 5:04
    but some people say throve."
  • 5:04 - 5:06
    And you also knew, more or less,
  • 5:06 - 5:09
    that if you were to go back in time 200 years
  • 5:09 - 5:12
    and ask the following statesman with equally fabulous hair,
  • 5:12 - 5:15
    (Laughter)
  • 5:15 - 5:17
    "Tom, what should I say?"
  • 5:17 - 5:19
    He'd say, "Well, in my day, most people throve,
  • 5:19 - 5:22
    but some thrived."
  • 5:22 - 5:24
    So now what I'm just going to show you is raw data.
  • 5:24 - 5:28
    Two rows from this table of two billion entries.
  • 5:28 - 5:30
    What you're seeing is year by year frequency
  • 5:30 - 5:33
    of "thrived" and "throve" over time.
  • 5:34 - 5:36
    Now this is just two
  • 5:36 - 5:39
    out of two billion rows.
  • 5:39 - 5:41
    So the entire data set
  • 5:41 - 5:44
    is a billion times more awesome than this slide.
  • 5:44 - 5:46
    (Laughter)
  • 5:46 - 5:50
    (Applause)
  • 5:50 - 5:52
    JM: Now there are many other pictures that are worth 500 billion words.
  • 5:52 - 5:54
    For instance, this one.
  • 5:54 - 5:56
    If you just take influenza,
  • 5:56 - 5:58
    you will see peaks at the time where you knew
  • 5:58 - 6:01
    big flu epidemics were killing people around the globe.
  • 6:01 - 6:04
    ELA: If you were not yet convinced,
  • 6:04 - 6:06
    sea levels are rising,
  • 6:06 - 6:09
    so is atmospheric CO2 and global temperature.
  • 6:09 - 6:12
    JM: You might also want to have a look at this particular n-gram,
  • 6:12 - 6:15
    and that's to tell Nietzsche that God is not dead,
  • 6:15 - 6:18
    although you might agree that he might need a better publicist.
  • 6:18 - 6:20
    (Laughter)
  • 6:20 - 6:23
    ELA: You can get at some pretty abstract concepts with this sort of thing.
  • 6:23 - 6:25
    For instance, let me tell you the history
  • 6:25 - 6:27
    of the year 1950.
  • 6:27 - 6:29
    Pretty much for the vast majority of history,
  • 6:29 - 6:31
    no one gave a damn about 1950.
  • 6:31 - 6:33
    In 1700, in 1800, in 1900,
  • 6:33 - 6:36
    no one cared.
  • 6:37 - 6:39
    Through the 30s and 40s,
  • 6:39 - 6:41
    no one cared.
  • 6:41 - 6:43
    Suddenly, in the mid-40s,
  • 6:43 - 6:45
    there started to be a buzz.
  • 6:45 - 6:47
    People realized that 1950 was going to happen,
  • 6:47 - 6:49
    and it could be big.
  • 6:49 - 6:52
    (Laughter)
  • 6:52 - 6:55
    But nothing got people interested in 1950
  • 6:55 - 6:58
    like the year 1950.
  • 6:58 - 7:01
    (Laughter)
  • 7:01 - 7:03
    People were walking around obsessed.
  • 7:03 - 7:05
    They couldn't stop talking
  • 7:05 - 7:08
    about all the things they did in 1950,
  • 7:08 - 7:11
    all the things they were planning to do in 1950,
  • 7:11 - 7:16
    all the dreams of what they wanted to accomplish in 1950.
  • 7:16 - 7:18
    In fact, 1950 was so fascinating
  • 7:18 - 7:20
    that for years thereafter,
  • 7:20 - 7:23
    people just kept talking about all the amazing things that happened,
  • 7:23 - 7:25
    in '51, '52, '53.
  • 7:25 - 7:27
    Finally in 1954,
  • 7:27 - 7:29
    someone woke up and realized
  • 7:29 - 7:33
    that 1950 had gotten somewhat passé.
  • 7:33 - 7:35
    (Laughter)
  • 7:35 - 7:37
    And just like that, the bubble burst.
  • 7:37 - 7:39
    (Laughter)
  • 7:39 - 7:41
    And the story of 1950
  • 7:41 - 7:43
    is the story of every year that we have on record,
  • 7:43 - 7:46
    with a little twist, because now we've got these nice charts.
  • 7:46 - 7:49
    And because we have these nice charts, we can measure things.
  • 7:49 - 7:51
    We can say, "Well how fast does the bubble burst?"
  • 7:51 - 7:54
    And it turns out that we can measure that very precisely.
  • 7:54 - 7:57
    Equations were derived, graphs were produced,
  • 7:57 - 7:59
    and the net result
  • 7:59 - 8:02
    is that we find that the bubble bursts faster and faster
  • 8:02 - 8:04
    with each passing year.
  • 8:04 - 8:09
    We are losing interest in the past more rapidly.
  • 8:09 - 8:11
    JM: Now a little piece of career advice.
  • 8:11 - 8:13
    So for those of you who seek to be famous,
  • 8:13 - 8:15
    we can learn from the 25 most famous political figures,
  • 8:15 - 8:17
    authors, actors and so on.
  • 8:17 - 8:20
    So if you want to become famous early on, you should be an actor,
  • 8:20 - 8:22
    because then fame starts rising by the end of your 20s --
  • 8:22 - 8:24
    you're still young, it's really great.
  • 8:24 - 8:26
    Now if you can wait a little bit, you should be an author,
  • 8:26 - 8:28
    because then you rise to very great heights,
  • 8:28 - 8:30
    like Mark Twain, for instance: extremely famous.
  • 8:30 - 8:32
    But if you want to reach the very top,
  • 8:32 - 8:34
    you should delay gratification
  • 8:34 - 8:36
    and, of course, become a politician.
  • 8:36 - 8:38
    So here you will become famous by the end of your 50s,
  • 8:38 - 8:40
    and become very, very famous afterward.
  • 8:40 - 8:43
    So scientists also tend to get famous when they're much older.
  • 8:43 - 8:45
    Like for instance, biologists and physicists
  • 8:45 - 8:47
    tend to be almost as famous as actors.
  • 8:47 - 8:50
    One mistake you should not do is become a mathematician.
  • 8:50 - 8:52
    (Laughter)
  • 8:52 - 8:54
    If you do that,
  • 8:54 - 8:57
    you might think, "Oh great. I'm going to do my best work when I'm in my 20s."
  • 8:57 - 8:59
    But guess what, nobody will really care.
  • 8:59 - 9:02
    (Laughter)
  • 9:02 - 9:04
    ELA: There are more sobering notes
  • 9:04 - 9:06
    among the n-grams.
  • 9:06 - 9:08
    For instance, here's the trajectory of Marc Chagall,
  • 9:08 - 9:10
    an artist born in 1887.
  • 9:10 - 9:13
    And this looks like the normal trajectory of a famous person.
  • 9:13 - 9:17
    He gets more and more and more famous,
  • 9:17 - 9:19
    except if you look in German.
  • 9:19 - 9:21
    If you look in German, you see something completely bizarre,
  • 9:21 - 9:23
    something you pretty much never see,
  • 9:23 - 9:25
    which is he becomes extremely famous
  • 9:25 - 9:27
    and then all of a sudden plummets,
  • 9:27 - 9:30
    going through a nadir between 1933 and 1945,
  • 9:30 - 9:33
    before rebounding afterward.
  • 9:33 - 9:35
    And of course, what we're seeing
  • 9:35 - 9:38
    is the fact Marc Chagall was a Jewish artist
  • 9:38 - 9:40
    in Nazi Germany.
  • 9:40 - 9:42
    Now these signals
  • 9:42 - 9:44
    are actually so strong
  • 9:44 - 9:47
    that we don't need to know that someone was censored.
  • 9:47 - 9:49
    We can actually figure it out
  • 9:49 - 9:51
    using really basic signal processing.
  • 9:51 - 9:53
    Here's a simple way to do it.
  • 9:53 - 9:55
    Well, a reasonable expectation
  • 9:55 - 9:57
    is that somebody's fame in a given period of time
  • 9:57 - 9:59
    should be roughly the average of their fame before
  • 9:59 - 10:01
    and their fame after.
  • 10:01 - 10:03
    So that's sort of what we expect.
  • 10:03 - 10:06
    And we compare that to the fame that we observe.
  • 10:06 - 10:08
    And we just divide one by the other
  • 10:08 - 10:10
    to produce something we call a suppression index.
  • 10:10 - 10:13
    If the suppression index is very, very, very small,
  • 10:13 - 10:15
    then you very well might be being suppressed.
  • 10:15 - 10:18
    If it's very large, maybe you're benefiting from propaganda.
  • 10:19 - 10:21
    JM: Now you can actually look at
  • 10:21 - 10:24
    the distribution of suppression indexes over whole populations.
  • 10:24 - 10:26
    So for instance, here --
  • 10:26 - 10:28
    this suppression index is for 5,000 people
  • 10:28 - 10:30
    picked in English books where there's no known suppression --
  • 10:30 - 10:32
    it would be like this, basically tightly centered on one.
  • 10:32 - 10:34
    What you expect is basically what you observe.
  • 10:34 - 10:36
    This is distribution as seen in Germany --
  • 10:36 - 10:38
    very different, it's shifted to the left.
  • 10:38 - 10:41
    People talked about it twice less as it should have been.
  • 10:41 - 10:43
    But much more importantly, the distribution is much wider.
  • 10:43 - 10:46
    There are many people who end up on the far left on this distribution
  • 10:46 - 10:49
    who are talked about 10 times fewer than they should have been.
  • 10:49 - 10:51
    But then also many people on the far right
  • 10:51 - 10:53
    who seem to benefit from propaganda.
  • 10:53 - 10:56
    This picture is the hallmark of censorship in the book record.
  • 10:56 - 10:58
    ELA: So culturomics
  • 10:58 - 11:00
    is what we call this method.
  • 11:00 - 11:02
    It's kind of like genomics.
  • 11:02 - 11:04
    Except genomics is a lens on biology
  • 11:04 - 11:07
    through the window of the sequence of bases in the human genome.
  • 11:07 - 11:09
    Culturomics is similar.
  • 11:09 - 11:12
    It's the application of massive-scale data collection analysis
  • 11:12 - 11:14
    to the study of human culture.
  • 11:14 - 11:16
    Here, instead of through the lens of a genome,
  • 11:16 - 11:19
    through the lens of digitized pieces of the historical record.
  • 11:19 - 11:21
    The great thing about culturomics
  • 11:21 - 11:23
    is that everyone can do it.
  • 11:23 - 11:25
    Why can everyone do it?
  • 11:25 - 11:27
    Everyone can do it because three guys,
  • 11:27 - 11:30
    Jon Orwant, Matt Gray and Will Brockman over at Google,
  • 11:30 - 11:32
    saw the prototype of the Ngram Viewer,
  • 11:32 - 11:34
    and they said, "This is so fun.
  • 11:34 - 11:37
    We have to make this available for people."
  • 11:37 - 11:39
    So in two weeks flat -- the two weeks before our paper came out --
  • 11:39 - 11:42
    they coded up a version of the Ngram Viewer for the general public.
  • 11:42 - 11:45
    And so you too can type in any word or phrase that you're interested in
  • 11:45 - 11:47
    and see its n-gram immediately --
  • 11:47 - 11:49
    also browse examples of all the various books
  • 11:49 - 11:51
    in which your n-gram appears.
  • 11:51 - 11:53
    JM: Now this was used over a million times on the first day,
  • 11:53 - 11:55
    and this is really the best of all the queries.
  • 11:55 - 11:58
    So people want to be their best, put their best foot forward.
  • 11:58 - 12:01
    But it turns out in the 18th century, people didn't really care about that at all.
  • 12:01 - 12:04
    They didn't want to be their best, they wanted to be their beft.
  • 12:04 - 12:07
    So what happened is, of course, this is just a mistake.
  • 12:07 - 12:09
    It's not that they strove for mediocrity,
  • 12:09 - 12:12
    it's just that the S used to be written differently, kind of like an F.
  • 12:12 - 12:15
    Now of course, Google didn't pick this up at the time,
  • 12:15 - 12:18
    so we reported this in the Science article that we wrote.
  • 12:18 - 12:20
    But it turns out this is just a reminder
  • 12:20 - 12:22
    that, although this is a lot of fun,
  • 12:22 - 12:24
    when you interpret these graphs, you have to be very careful,
  • 12:24 - 12:27
    and you have to adopt the base standards in the sciences.
  • 12:27 - 12:30
    ELA: People have been using this for all kinds of fun purposes.
  • 12:30 - 12:37
    (Laughter)
  • 12:37 - 12:39
    Actually, we're not going to have to talk,
  • 12:39 - 12:42
    we're just going to show you all the slides and remain silent.
  • 12:42 - 12:45
    This person was interested in the history of frustration.
  • 12:45 - 12:48
    There's various types of frustration.
  • 12:48 - 12:51
    If you stub your toe, that's a one A "argh."
  • 12:51 - 12:53
    If the planet Earth is annihilated by the Vogons
  • 12:53 - 12:55
    to make room for an interstellar bypass,
  • 12:55 - 12:57
    that's an eight A "aaaaaaaargh."
  • 12:57 - 12:59
    This person studies all the "arghs,"
  • 12:59 - 13:01
    from one through eight A's.
  • 13:01 - 13:03
    And it turns out
  • 13:03 - 13:05
    that the less-frequent "arghs"
  • 13:05 - 13:08
    are, of course, the ones that correspond to things that are more frustrating --
  • 13:08 - 13:11
    except, oddly, in the early 80s.
  • 13:11 - 13:13
    We think that might have something to do with Reagan.
  • 13:13 - 13:15
    (Laughter)
  • 13:15 - 13:18
    JM: There are many usages of this data,
  • 13:18 - 13:21
    but the bottom line is that the historical record is being digitized.
  • 13:21 - 13:23
    Google has started to digitize 15 million books.
  • 13:23 - 13:25
    That's 12 percent of all the books that have ever been published.
  • 13:25 - 13:28
    It's a sizable chunk of human culture.
  • 13:28 - 13:31
    There's much more in culture: there's manuscripts, there's newspapers,
  • 13:31 - 13:33
    there's things that are not text, like art and paintings.
  • 13:33 - 13:35
    These will all happen to be on our computers,
  • 13:35 - 13:37
    on computers across the world.
  • 13:37 - 13:40
    And when that happens, that will transform the way we have
  • 13:40 - 13:42
    to understand our past, our present and human culture.
  • 13:42 - 13:44
    Thank you very much.
  • 13:44 - 13:47
    (Applause)
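
Editor's sketch: the talk describes two computations in words, the per-year n-gram table ("how many times a particular four-gram appeared in books" each year) and the suppression index (expected fame as "roughly the average of their fame before and their fame after", compared with observed fame). The Python below is a minimal illustration of both under stated assumptions, not the authors' actual pipeline. The function names, the toy corpus, the `window` parameter and the toy fame values are all hypothetical. The direction of the division (observed over expected) is inferred from the remark that a very small index suggests suppression.

```python
from collections import Counter, defaultdict


def ngrams(text, n):
    """Yield every run of n consecutive whitespace-separated words."""
    words = text.lower().split()
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])


def yearly_counts(corpus, n):
    """Build a table {n-gram: {year: count}} from (year, text) pairs."""
    table = defaultdict(Counter)
    for year, text in corpus:
        for gram in ngrams(text, n):
            table[gram][year] += 1
    return table


def suppression_index(series, start, end, window=2):
    """Observed fame in [start, end] divided by expected fame.

    Expected fame is the average of the mean fame in the `window` years
    before the period and the mean fame in the `window` years after it.
    `window` is an assumption of this sketch; the talk does not give one.
    """
    def mean(years):
        years = list(years)
        return sum(series.get(y, 0.0) for y in years) / len(years)

    observed = mean(range(start, end + 1))
    before = mean(range(start - window, start))
    after = mean(range(end + 1, end + 1 + window))
    expected = (before + after) / 2
    if expected == 0:
        raise ValueError("no fame before or after the period; index undefined")
    return observed / expected


# Toy corpus (hypothetical): one short "book" per year.
corpus = [
    (1801, "a gleam of happiness"),
    (1802, "a gleam of happiness and a gleam of happiness"),
]
table = yearly_counts(corpus, 4)
print(dict(table["a gleam of happiness"]))

# Toy fame series (hypothetical values): mentions per year for one person.
fame = {1931: 8.0, 1932: 8.0, 1933: 1.0, 1934: 1.0, 1935: 1.0,
        1936: 12.0, 1937: 12.0}
print(suppression_index(fame, 1933, 1935))
```

An index near one means observed fame matches what the surrounding years predict. Values far below one are the pattern the talk associates with suppression, and values far above one the pattern it associates with propaganda.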
Title:
What we learned from 5 million books
Speaker:
Jean-Baptiste Michel + Erez Lieberman Aiden
Description:

Have you played with Google Labs' Ngram Viewer? It's an addicting tool that lets you search for words and ideas in a database of 5 million books from across centuries. Erez Lieberman Aiden and Jean-Baptiste Michel show us how it works, and a few of the surprising things we can learn from 500 billion words.

Video Language:
English
Team:
closed TED
Project:
TEDTalks
Duration:
13:48