Irene Eleta: Multilingual Users of Twitter: Social Ties Across Language Borders or How a Story Could Travel the World

  • 0:00 - 0:02
    Good morning, everyone.
  • 0:03 - 0:06
    Thank you for coming here
    [unclear] of the semester.
  • 0:08 - 0:10
    So, I'm going to start.
  • 0:11 - 0:14
    Access to the internet
    is greater than ever before
  • 0:14 - 0:17
    and as a consequence,
    it's becoming more multilingual.
  • 0:19 - 0:23
    However, there's evidence of segmentation
    of cyberspace
  • 0:23 - 0:25
    due to language and national borders.
  • 0:28 - 0:31
    This image serves to illustrate that.
  • 0:32 - 0:36
    This is the language communities
    of Twitter in Europe.
  • 0:37 - 0:41
    So, what you can see are tweets
    geolocated over a map of Europe
  • 0:41 - 0:44
    and the different colors
    represent the different languages.
  • 0:45 - 0:51
    You can even see regional languages
    like Catalan in the Catalan region of Spain
  • 0:52 - 0:56
    And this is going to be useful
    for an example I'm going to use later.
  • 1:02 - 1:04
    I'm interested in Twitter in particular,
  • 1:04 - 1:07
    because of the speed
    of information dissemination
  • 1:07 - 1:11
    and that most of this information
    is publicly accessible.
  • 1:14 - 1:19
    I'm going to illustrate this
    with a capture
  • 1:19 - 1:22
    of a dynamic visualization
    you can find on the Twitter blog
  • 1:22 - 1:25
    by Miguel Rios.
  • 1:25 - 1:29
    And what you can see here
    is the global flow of tweets
  • 1:29 - 1:31
    after the earthquake in Japan.
  • 1:32 - 1:35
    In pink, there are the tweets
    coming out of Japan
  • 1:35 - 1:37
    and, in green, the retweets
    all over the world.
  • 1:39 - 1:45
    This illustrates that in Twitter
    information is spreading across countries.
  • 1:46 - 1:48
    But how can this happen?
  • 1:49 - 1:55
    Expatriates, migrants, minorities,
    diaspora communities, language learners
  • 1:55 - 1:59
    all play an important role
    in building transnational networks
  • 1:59 - 2:03
    and cultural bridges
    between nations and communities.
  • 2:04 - 2:06
    They are the multilingual users
    on the internet.
  • 2:08 - 2:11
    The overarching research question is:
  • 2:11 - 2:17
    how are multilingual users of Twitter
    connecting different language groups?
  • 2:22 - 2:27
    In 2009, the Berkman Center for Internet
    and Society at Harvard University
  • 2:27 - 2:30
    mapped the Arabic blogosphere
  • 2:30 - 2:33
    and they described a key concept
    for my research.
  • 2:35 - 2:41
    They discovered an English bridge
    and a French bridge of bloggers
  • 2:41 - 2:46
    that were writing in their native
    Arabic language and in English or French.
  • 2:47 - 2:52
    And they were connecting the different
    national blogospheres
  • 2:52 - 2:53
    with the international one.
  • 2:55 - 3:00
    This might have played a role in the Arab
    popular uprisings in 2011
  • 3:00 - 3:03
    for reaching out to the world.
  • 3:05 - 3:09
    And this is connected with a concept
    that first appeared in 2008
  • 3:09 - 3:12
    of the bridge bloggers.
  • 3:14 - 3:16
    So, bridge bloggers are bloggers
  • 3:16 - 3:20
    that are trying to connect
    their local communities
  • 3:20 - 3:23
    to a wider global audience.
  • 3:25 - 3:29
    The image you can see here
    is actually the visualization they created
  • 3:29 - 3:33
    of mapping the Arabic blogosphere.
  • 3:35 - 3:37
    Each dot is a blogger, or a blog.
  • 3:39 - 3:43
    The size represents their popularity,
    so how many incoming links they have
  • 3:43 - 3:45
    and they grouped them--
  • 3:45 - 3:48
    the neighborhoods they created
  • 3:48 - 3:52
    in relation to the linking
    between them.
  • 3:53 - 3:56
    So, the ones that are grouped together
    are linking among each other.
  • 3:57 - 3:59
    The colors are a different question.
  • 3:59 - 4:06
    The colors represent "attentive clusters",
    that's what they call them.
  • 4:06 - 4:12
    And they looked at the online resources
    and media outlets
  • 4:12 - 4:14
    these blogs were linking to.
  • 4:15 - 4:19
    So, blogs of the same colors
    are following the same media outlets
  • 4:19 - 4:21
    and online resources.
  • 4:21 - 4:25
    And they did human coding
    to label those groups.
  • 4:26 - 4:30
    And here is where we see
    the label English bridge
  • 4:30 - 4:32
    the responses from Cuba
    in English
  • 4:32 - 4:34
    and up there, there's [unclear] France.
  • 4:36 - 4:41
    And so I think it's important to retain
    the concept of attentive clusters.
  • 4:44 - 4:49
    Now, let's go back to 2011
    during the Arab popular uprisings.
  • 4:50 - 4:55
    And I'll show you a visualization
    of the influence network
  • 4:55 - 4:57
    of Twitter users in Egypt.
  • 4:58 - 5:00
    So, what you're seeing here
  • 5:00 - 5:04
    just imagine people down the street
    at Tahrir Square
  • 5:04 - 5:08
    tweeting in Arabic about what's going on
    on the ground.
  • 5:08 - 5:12
    And those are the people in red.
  • 5:13 - 5:18
    So, these red dots represent users
    that are tweeting in Arabic.
  • 5:18 - 5:20
    Then we have the international community
  • 5:20 - 5:26
    or even Americans, British and so on
    tweeting in English.
  • 5:26 - 5:28
    And they are in blue,
    those blue dots.
  • 5:29 - 5:33
    And then, interestingly, we have
    people in between them,
  • 5:33 - 5:38
    which are illustrated in different
    degrees of violet, or violet shades.
  • 5:39 - 5:43
    This represents the fact that they
    are tweeting in both Arabic and English.
  • 5:45 - 5:47
    So, what we're seeing
    is the bridge Twitters
  • 5:48 - 5:53
    because, like Ethan Zuckerman called them
    "bridge bloggers".
  • 5:56 - 5:59
    So, another context.
  • 5:59 - 6:05
    The same year, 2011, a lot
    of big protests were going on in Europe.
  • 6:05 - 6:07
    And in particular, in Spain.
  • 6:07 - 6:12
    They started on May 15th 2011
    there were massive protests.
  • 6:13 - 6:17
    And because of this context,
    this situation
  • 6:17 - 6:23
    new attentive clusters were emerging
    in the social media landscape of Spain.
  • 6:28 - 6:34
    Now, this is a visualization you can find
    on the SocialFlow blog, a research blog
  • 6:34 - 6:35
    on social networks.
  • 6:35 - 6:41
    And what it is, is it tracks the origin
    and the initial spread
  • 6:41 - 6:45
    of the hashtag #occupywallstreet
    in Twitter.
  • 6:47 - 6:52
    They detected that one of the first uses
    of the hashtag #occupywallstreet
  • 6:52 - 6:57
    was on July 13th 2011, linking to a blog
    post of Adbusters.
  • 6:58 - 7:02
    So you have the Twitter account
    of Adbusters there, very big
  • 7:02 - 7:05
    because it's being retweeted a lot.
  • 7:06 - 7:07
    And mentioned a lot.
  • 7:08 - 7:14
    And they collected these mentions
    and the tweets that had these mentions
  • 7:14 - 7:17
    and these retweets with the hashtag
    during July 13th.
  • 7:18 - 7:21
    From July 13th to July 23rd.
  • 7:21 - 7:24
    So, from the first 10 days
    of the use of this hashtag
  • 7:24 - 7:27
    it was from the very beginning of the use
    of this hashtag on Twitter.
  • 7:31 - 7:32
    They just mapped the accounts
  • 7:33 - 7:39
    and the series of posts with the hashtag
    and mentions with the hashtag
  • 7:41 - 7:43
    and the users that were connecting
  • 7:43 - 7:45
    because of these mentions
    and retweets.
  • 7:46 - 7:49
    Now the interesting thing
    in this visualization
  • 7:49 - 7:51
    is that they
  • 7:51 - 7:54
    the SocialFlow people
    particularly in [inaudible]
  • 7:55 - 8:00
    detected that this Spanish branch
    of users
  • 8:00 - 8:03
    were forming an attentive cluster.
  • 8:06 - 8:09
    Mentioning and retweeting about it
    in Spanish
  • 8:09 - 8:13
    using the hashtag in their messages
    in Spanish.
  • 8:14 - 8:17
    And they point out in the blog
  • 8:18 - 8:20
    that this Spanish contingent
  • 8:20 - 8:24
    helped post and spread the word
    about Occupy Wall Street
  • 8:24 - 8:29
    even before most of the United States
    was aware of it.
  • 8:32 - 8:34
    So, I found that very interesting.
  • 8:34 - 8:37
    And it was due to the context
    in Spain at that moment
  • 8:38 - 8:45
    with big protests and new clusters
    forming in the social media landscape.
  • 8:57 - 9:01
    Now I have shown you the importance
    of these multilingual users
  • 9:01 - 9:06
    in connecting language communities
    and spreading information
  • 9:06 - 9:09
    across countries, acting as mediators.
  • 9:11 - 9:16
    But let's focus on another aspect
    of connecting language groups
  • 9:16 - 9:17
    which is language choice.
  • 9:18 - 9:23
    So I'm going to devote a moment
    to speak about languages
  • 9:23 - 9:24
    and language choice.
  • 9:28 - 9:31
    To understand languages in the world
  • 9:31 - 9:33
    I'm going to use a telescope.
  • 9:37 - 9:40
    So de Swaan...
  • 9:41 - 9:45
    ...proposed a theory called
    the world language system
  • 9:45 - 9:46
    back in the 1990s
  • 9:47 - 9:51
    to explain the languages in the world.
  • 9:52 - 9:55
    And he used a very beautiful metaphor,
    the constellation.
  • 9:57 - 10:02
    So, in his theory there's about a dozen
    languages in the world
  • 10:02 - 10:05
    that are the hearts of the system,
    or the suns.
  • 10:06 - 10:07
    The suns of the system.
  • 10:08 - 10:11
    For instance, English, French, Spanish,
    Arabic and more.
  • 10:12 - 10:17
    And then there are hundreds,
    maybe more than 100, 200...
  • 10:17 - 10:23
    national languages that are orbiting
    around these suns like planets.
  • 10:24 - 10:28
    And finally we have regional
    and minority languages
  • 10:28 - 10:32
    that are orbiting these planets
    like satellites.
  • 10:33 - 10:38
    And he used this metaphor
    to explain the power relationships
  • 10:38 - 10:40
    between languages.
  • 10:40 - 10:43
    This is a theory of what he called
  • 10:43 - 10:47
    "communication potential
    and language competition"
  • 10:48 - 10:51
    A key point he made
  • 10:52 - 10:55
    is that the system holds together
  • 10:55 - 10:59
    thanks to multilingual people
    and interpreters.
  • 11:00 - 11:03
    This is what's providing cohesion
    to the system.
  • 11:04 - 11:07
    He also made a controversial proposal
  • 11:07 - 11:11
    about the communication potential
    of a language.
  • 11:12 - 11:15
    So, he proposed a formula,
    a mathematical formula
  • 11:15 - 11:20
    where he could estimate the communication
    potential of a language
  • 11:20 - 11:25
    and supposedly a person would choose
    what to learn and use
  • 11:25 - 11:28
    based on the communication potential of that.
  • 11:28 - 11:34
    For example, a person might decide
    to learn English and use English
  • 11:35 - 11:41
    because not only does it provide
    communication with English native speakers
  • 11:42 - 11:46
    but also, adding to that, it provides
    the possibility to communicate
  • 11:46 - 11:50
    with all the second-language learners
    of English
  • 11:50 - 11:53
    from many different languages,
    many different countries.
  • 11:53 - 11:56
    So, supposedly, in his theory
  • 11:56 - 12:00
    English provides
    the greatest communication potential.
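
Editor's sketch of the kind of formula described above. It assumes the measure meant here is what de Swaan calls the Q-value: a language's prevalence (the share of all speakers who know it) multiplied by its centrality (the share of multilingual speakers who know it). The `q_value` function and the `SPEAKERS` repertoires are invented for illustration and do not come from the talk.

```python
def q_value(language, repertoires):
    """Q-value sketch: prevalence times centrality.

    repertoires: list of sets, one set of language codes per speaker.
    Assumed definition, not taken from the talk.
    """
    total = len(repertoires)
    multilinguals = [r for r in repertoires if len(r) > 1]
    if total == 0 or not multilinguals:
        return 0.0
    # Prevalence: share of all speakers who know the language.
    prevalence = sum(language in r for r in repertoires) / total
    # Centrality: share of multilingual speakers who know the language.
    centrality = sum(language in r for r in multilinguals) / len(multilinguals)
    return prevalence * centrality


# Hypothetical repertoires of six speakers.
SPEAKERS = [{"en"}, {"en"}, {"en", "es"}, {"es"}, {"es", "ca"}, {"en", "ca"}]

for lang in ("en", "es", "ca"):
    print(lang, round(q_value(lang, SPEAKERS), 3))
```

In this toy data English scores highest because it is both widely known and shared by most of the multilingual speakers, which mirrors the argument about second-language learners.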
  • 12:01 - 12:05
    And he received some criticism,
    because of the central role of English
  • 12:05 - 12:07
    in his theory
  • 12:07 - 12:10
    He said it was the central hub
    of all the system.
  • 12:13 - 12:20
    There's also the language ecology paradigm
    first proposed by Haugen in 1972
  • 12:22 - 12:25
    and there's this idea of an ecosystem
    of languages
  • 12:25 - 12:30
    and, again, it's using another metaphor
  • 12:30 - 12:32
    and because of this metaphor
  • 12:32 - 12:34
    also appeared the idea
    of endangered languages.
  • 12:36 - 12:40
    I'm going to briefly just read
    the definition.
  • 12:40 - 12:43
    He defined the language ecology as:
  • 12:43 - 12:47
    "the study of interactions between
    any given language and its environment"
  • 12:48 - 12:49
    and what I think is very important:
  • 12:49 - 12:54
    "language exists only in the minds
    of its users"
  • 12:57 - 13:00
    which leads me to point at my research.
  • 13:02 - 13:06
    In my research, I'm using a microscope
    to see the cells
  • 13:06 - 13:09
    and my cells in my study
    are the Twitter users.
  • 13:12 - 13:13
    Why is that?
  • 13:16 - 13:20
    Because as Haugen explains,
    there's a psychological dimension
  • 13:20 - 13:22
    to language ecology
  • 13:22 - 13:25
    where language interacts
    with other languages
  • 13:25 - 13:28
    in the minds of multilingual people.
  • 13:29 - 13:32
    And there's a sociological dimension
    to language ecology
  • 13:32 - 13:38
    where we use language to communicate
    and interact with other people.
  • 13:38 - 13:44
    And this language ecology generates
    because of the people
  • 13:44 - 13:46
    that decide to use that language
  • 13:46 - 13:50
    learning and interacting
    with people using it.
  • 13:51 - 13:55
    And this is the point
    of language choice in languages.
  • 13:56 - 14:00
    So, I focus on the connections of people
    and the language choice.
  • 14:04 - 14:08
    So, these are the four points
    I'm going to be speaking about.
  • 14:08 - 14:13
    But actually the main focus
    is going to be the first point
  • 14:13 - 14:19
    Social network analysis and the taxonomy
    of intersections between language groups
  • 14:20 - 14:23
    This is where I'm going to be spending
    most of the time.
  • 14:23 - 14:27
    And then very briefly,
    just for completeness
  • 14:27 - 14:30
    I'm going to speak about another
    small study that I did
  • 14:30 - 14:31
    the factor analysis
  • 14:31 - 14:34
    looking at the influence
    of the social network
  • 14:34 - 14:39
    in the language choices of the users.
  • 14:39 - 14:42
    So, how the social network
    influences language choice
  • 14:42 - 14:43
    of our multilingual users.
  • 14:45 - 14:50
    And then I'm going to briefly also talk
    about the last study of my dissertation
  • 14:50 - 14:52
    that is still ongoing.
  • 14:53 - 14:55
    So, I still have new research
    to talk about.
  • 14:55 - 14:58
    And it's content analysis
  • 14:58 - 15:00
    and in this case I'm focusing on
    intrinsic factors
  • 15:00 - 15:03
    intrinsic to the messages
  • 15:03 - 15:06
    about the topic,
    and the type of exchange.
  • 15:06 - 15:08
    If it's a reply,
    if it's a public post
  • 15:08 - 15:10
    and how that influences
    the language choice as well.
  • 15:11 - 15:12
    And finally I will...
  • 15:14 - 15:17
    I'm going to give you my reflections
  • 15:18 - 15:23
    so I can invite your thoughts
    and suggestions and discussions about it.
  • 15:27 - 15:29
    Briefly, I'm going to start
    with the sampling
  • 15:29 - 15:32
    so I can talk about the rest
    of the research.
  • 15:35 - 15:37
    So my focus is on multilingual users.
  • 15:37 - 15:40
    How did I identify multilingual users
    on Twitter?
  • 15:42 - 15:44
    It was giving me a headache.
  • 15:44 - 15:47
    Finally what we decided...
  • 15:49 - 15:50
    this research has been--
  • 15:51 - 15:54
    I have always had the help
    of Jennifer Golbeck,
  • 15:54 - 15:55
    she was my adviser.
  • 15:55 - 15:57
    And I did this with her help.
  • 15:58 - 16:02
    So what we did, was gather a list
    of what are called stopwords.
  • 16:03 - 16:05
    From different languages
    and you have a list over there.
  • 16:06 - 16:10
    And then the stopword lists
    you can find them on the internet.
  • 16:10 - 16:13
    They are created
    for computational linguistics
  • 16:13 - 16:16
    so they use it for filtering purposes.
  • 16:17 - 16:19
    And they are common words
    in a language.
  • 16:20 - 16:21
    Very common words in a language.
  • 16:21 - 16:26
    So, sometimes they're used precisely
    for eliminating them from texts
  • 16:26 - 16:29
    when they're in, for example,
    searches in Google
  • 16:29 - 16:33
    they eliminate the stopwords,
    the stopwords that you type
  • 16:33 - 16:35
    in the search.
  • 16:35 - 16:38
    But in this case I wanted
    to find the stopwords
  • 16:38 - 16:41
    that are very common in the language
    to represent the language.
  • 16:41 - 16:44
    And so we had to select words
    that were not written the same
  • 16:44 - 16:46
    as in another language.
  • 16:46 - 16:48
    Sometimes they could be confusing
    and ambiguous.
  • 16:51 - 16:54
    Then I typed in Google...
  • 16:56 - 16:59
    one word in one language
    and one word in another language.
  • 16:59 - 17:01
    Usually I was always using
    one English word
  • 17:01 - 17:04
    and one word in a different language.
  • 17:05 - 17:08
    And I looked in the Twitter domain.
  • 17:09 - 17:12
    So the search results from Google
    will give me the profiles
  • 17:13 - 17:20
    of people on Twitter that in theory
    wrote messages in both languages.
  • 17:20 - 17:24
    We had to do a lot of hand-combing
    to actually see if it was in two languages
  • 17:25 - 17:28
    or it was just that they were mentioning
    an English song
  • 17:28 - 17:32
    the title of an English song
    but they had no English in the rest.
  • 17:32 - 17:36
    So we had to ensure
    that they were authoring tweets
  • 17:36 - 17:38
    in two languages.
  • 17:38 - 17:40
    So writing them, not just retweeting them
  • 17:40 - 17:43
    they were not just automatic postings
    from Facebook.
  • 17:43 - 17:48
    So we had a long set of criteria
    a lot of manual combing
  • 17:48 - 17:53
    and then finally we selected
    92 multilingual users
  • 17:53 - 17:58
    and in total they used 19 languages,
    2 or 3 languages per person.
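
Editor's sketch of the first sampling steps described above: keep only stopwords that are not spelled the same in another language, then pair one English word with one word from another language in a search restricted to the Twitter domain. The stopword excerpts, the function names, and the exact query format are assumptions for illustration; the manual checking that followed is not modeled.

```python
from collections import Counter

# Tiny hypothetical excerpts of stopword lists, not the lists used in the study.
STOPWORDS = {
    "en": {"the", "and", "with", "a", "no"},
    "es": {"el", "y", "con", "a", "no"},
    "ca": {"el", "els", "amb", "a", "no"},
}


def unambiguous_stopwords(stopwords):
    """Keep only the words that appear in exactly one language's list."""
    counts = Counter(w for words in stopwords.values() for w in words)
    return {
        lang: {w for w in words if counts[w] == 1}
        for lang, words in stopwords.items()
    }


def bilingual_query(word_a, word_b, domain="twitter.com"):
    """Build a search string pairing two words, limited to one domain.

    The quoting and the site: restriction are an assumed query format.
    """
    return f'"{word_a}" "{word_b}" site:{domain}'


unique = unambiguous_stopwords(STOPWORDS)
print(sorted(unique["es"]))
print(bilingual_query("with", "con"))
```

Words such as "a" or "no" drop out because they are written identically in several languages, which is the ambiguity mentioned in the talk.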
  • 18:01 - 18:05
    Now, I don't know if you want to ask
    some questions about the sampling
  • 18:05 - 18:08
    because there's a lot of details about it.
  • 18:13 - 18:15
    No doubts?
  • 18:15 - 18:17
    Or maybe they'll come later!
  • 18:19 - 18:22
    Now, how do I do
    the social network analysis?
  • 18:23 - 18:28
    Well, now I have my 92 multilingual users
    technically they are called the ego
  • 18:28 - 18:30
    of an egocentric network.
  • 18:31 - 18:33
    This is the cell of my study.
  • 18:34 - 18:36
    It started with the nucleus of the cell
  • 18:36 - 18:38
    which is my multilingual user
  • 18:38 - 18:40
    and then I go to Twitter
  • 18:40 - 18:43
    and first of all I have extracted--
  • 18:45 - 18:47
    so in this case my ego
    is called the Painter
  • 18:48 - 18:54
    and I have extracted the last 50 messages
    that he posted on Twitter
  • 18:54 - 18:57
    to see the languages
    this person used-- is using.
  • 18:57 - 19:02
    And I see that he is using English,
    Spanish and Catalan.
  • 19:03 - 19:05
    Catalan is a regional language in Spain
  • 19:06 - 19:08
    and I have shown you on the map
    the region before
  • 19:08 - 19:09
    where the region was.
  • 19:09 - 19:12
    And they speak both Catalan
    and Spanish.
  • 19:14 - 19:17
    So, this person is tweeting
    in a minority language
  • 19:17 - 19:18
    a national language
  • 19:18 - 19:21
    and also international.
  • 19:27 - 19:32
    So, I already found the Painter
    and I know what languages this person speaks
  • 19:32 - 19:34
    well, uses on Twitter,
  • 19:34 - 19:36
    and then I extract
    all the social networks.
  • 19:36 - 19:38
    So, the followers on Twitter
  • 19:38 - 19:40
    you know that on Twitter
    you have followers
  • 19:40 - 19:41
    and you follow people.
  • 19:41 - 19:43
    I extracted both.
  • 19:43 - 19:48
    The followers of the Painter
    the people that are following him on Twitter
  • 19:48 - 19:52
    and also how the friends
    are connecting to each other.
  • 19:52 - 19:57
    So, all of them, all of these dots
    are the followers
  • 19:57 - 19:59
    the people following the Painter
    on Twitter
  • 19:59 - 20:03
    and also I see how they connect
    among each other, ok?
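
Editor's sketch of the egocentric network construction just described: start from the ego, collect the accounts that follow the ego and the accounts the ego follows, then keep the ties among those accounts. The `FOLLOWS` data is invented; only the names Painter and Eduard come from the talk, and the real data came from Twitter rather than a hand-written dictionary.

```python
def ego_network(ego, follows):
    """Return (alters, ties) for one ego.

    follows: dict mapping each user to the set of users they follow.
    alters:  followers of the ego plus the accounts the ego follows.
    ties:    directed follow links between two alters (ego excluded).
    """
    followers = {u for u, targets in follows.items() if ego in targets}
    friends = set(follows.get(ego, set()))
    alters = (followers | friends) - {ego}
    ties = {
        (u, v)
        for u in alters
        for v in follows.get(u, set())
        if v in alters and v != u
    }
    return alters, ties


# Hypothetical follow data.
FOLLOWS = {
    "painter": {"eduard", "anna"},
    "eduard": {"anna"},
    "anna": {"eduard", "painter"},
    "joan": {"painter", "eduard"},
    "bob": {"carol"},
}

alters, ties = ego_network("painter", FOLLOWS)
print(sorted(alters))
print(sorted(ties))
```

Accounts with no link to the ego (here "bob" and "carol") never enter the network, so each of the 92 networks stays centered on its own multilingual user.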
  • 20:05 - 20:09
    So the Painter follows Eduard
    in the center
  • 20:10 - 20:12
    and it seems he's very popular.
  • 20:14 - 20:17
    And then I extract the last 30 posts
    of Eduard--
  • 20:17 - 20:19
    there's a reason for that
  • 20:19 - 20:22
    but in particular
    it's mostly economic reasons!
  • 20:25 - 20:26
    I will tell you why!
  • 20:26 - 20:29
    So I extracted the last 30 posts of Eduard
  • 20:29 - 20:32
    and then I do
    automatic language identification
  • 20:32 - 20:37
    with the Google API
    for language identification
  • 20:39 - 20:40
    which costs money.
  • 20:41 - 20:43
    So you have to really think
    about how many posts you want to send
  • 20:43 - 20:46
    to Google and how much money
    you have available
  • 20:46 - 20:48
    and what is the accuracy
    you're going to have
  • 20:48 - 20:51
    according to how many posts you send.
  • 20:51 - 20:53
    There's a lot of testing going on there.
  • 20:54 - 20:58
    I do the same with everybody
    in the social network.
  • 20:59 - 21:00
    I extract the last 30 posts
  • 21:00 - 21:02
    use the Google identification
  • 21:02 - 21:08
    build an algorithm that decides
    based on the languages of these 30 posts
  • 21:08 - 21:12
    is this person monolingual?
    Is this person multilingual?
  • 21:12 - 21:13
    Which languages?
  • 21:13 - 21:15
    And then I labeled them, ok.
  • 21:17 - 21:19
    This is just a visualization behind the--
  • 21:20 - 21:27
    Perhaps person 1 is monolingual,
    or bilingual in two languages.
  • 21:32 - 21:36
    Now that I have all the friends
    of the Painter
  • 21:36 - 21:37
    how they connect,
  • 21:37 - 21:41
    I color code them
    depending on the languages they are using.
  • 21:42 - 21:45
    And here, what you can see
    is very interesting.
  • 21:46 - 21:49
    I don't know if you can distinguish
    the colors well
  • 21:49 - 21:54
    because up here, this area,
    that is like a triangle
  • 21:54 - 21:58
    there's a group of users
    writing in English.
  • 21:59 - 22:01
    And it's pink.
    Sort of pinkish.
  • 22:01 - 22:05
    And then, down here
    there's this Spanish group
  • 22:05 - 22:07
    in light green.
  • 22:08 - 22:12
    And, in the middle, the one
    that perhaps doesn't distinguish as well
  • 22:12 - 22:15
    from the English,
    is the Catalan group.
  • 22:16 - 22:19
    So the users writing in Catalan
    in dark blue.
  • 22:20 - 22:22
    And then there's a set of violets
    in between
  • 22:22 - 22:26
    and these violets represent
    the bilingual users
  • 22:26 - 22:29
    either English and Catalan
    or English and Spanish.
  • 22:30 - 22:33
    And then there's darker green
    around here,
  • 22:33 - 22:36
    they are using both Catalan and Spanish.
  • 22:36 - 22:38
    So there's a lot of bilinguals
    going on.
  • 22:38 - 22:40
    And there are interesting dynamics
  • 22:40 - 22:43
    in that you have this English group
    up there
  • 22:43 - 22:44
    and the Spanish group down here
  • 22:44 - 22:46
    and the Catalan group in the middle.
  • 22:46 - 22:49
    And this Catalan group is very mixed up
    with the Spanish group
  • 22:50 - 22:52
    which makes sense,
    because it's a bilingual community.
  • 23:01 - 23:07
    So, this is how I built the egocentric
    network of my 92 multilingual users.
  • 23:09 - 23:11
    The Painter is just one of them.
    I have 92.
  • 23:11 - 23:17
    I have 92 cells or egocentric networks
    that I studied with my microscope.
  • 23:18 - 23:22
    Do you want to ask some questions
    about this process
  • 23:22 - 23:23
    or this visualization?
  • 23:25 - 23:30
    (person 1) Of the bilingual units,
    are they users or tweets?
  • 23:31 - 23:32
    They are users, yeah.
  • 23:32 - 23:36
    So, the dots represent people.
  • 23:36 - 23:40
    So, like Eduard here.
    They represent people.
  • 23:42 - 23:45
    Now, for each dot, to determine the language
    and the color
  • 23:45 - 23:48
    I extracted 30 posts
  • 23:48 - 23:53
    So, it's an interesting question
    because the 30 posts
  • 23:53 - 23:56
    have different language labels
    assigned to them
  • 23:56 - 23:57
    especially if they were bilingual
  • 23:57 - 24:02
    and I had to decide which language label
    I was going to assign to the user.
  • 24:02 - 24:05
    So, I had to build an algorithm
    with a set of rules
  • 24:10 - 24:11
    basically saying--
  • 24:11 - 24:17
    the Google identification system
    would give me a language
  • 24:17 - 24:18
    and a confidence level
  • 24:18 - 24:19
    So if the confidence level was very low
  • 24:19 - 24:24
    I would say "discard that"
    because I had a series of heuristics
  • 24:24 - 24:30
    based on both the number of tweets
    using a particular language
  • 24:30 - 24:33
    and also on the confidence level.
  • 24:34 - 24:38
    And there are a lot
    of technical challenges there as well.
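
Editor's sketch of the kind of rule set described in this answer: each post comes back from the language identifier with a language and a confidence level, low-confidence results are discarded, and a language counts for the user only if enough posts use it. The thresholds `min_confidence` and `min_posts` and the function name are hypothetical; the talk does not give the actual rules or values.

```python
from collections import Counter


def label_user(detections, min_confidence=0.5, min_posts=3):
    """Assign a language label to a user from per-post detections.

    detections: list of (language, confidence) pairs, one per post.
    Both thresholds are illustrative assumptions.
    """
    # Discard detections the identifier was not confident about.
    counts = Counter(lang for lang, conf in detections if conf >= min_confidence)
    # Keep languages used in enough posts to count as a real habit.
    langs = sorted(lang for lang, n in counts.items() if n >= min_posts)
    if not langs:
        return ("unknown", [])
    kind = "monolingual" if len(langs) == 1 else "multilingual"
    return (kind, langs)


# Hypothetical detections for the last 30 posts of one user.
POSTS = [("ca", 0.9)] * 12 + [("es", 0.8)] * 10 + [("en", 0.9)] * 2 + [("pt", 0.2)] * 6

print(label_user(POSTS))
```

Here the two English posts fall below the post threshold, so a user who only quotes an English song title now and then is not labeled as an English user.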
  • 24:40 - 24:42
    (woman) So, it's possible
    that some of these posts
  • 24:42 - 24:46
    many of these posts would be multilingual,
    I'm sorry monolingual in one language or the other?
  • 24:46 - 24:52
    So it's also possible that some
    of these individual posts
  • 24:52 - 24:54
    would mix languages?
  • 24:55 - 24:57
    Yes, it is possible.
    It's very possible!
  • 24:57 - 25:00
    It's very challenging
    for the automatic system!
  • 25:02 - 25:04
    (woman) Right, ok.
    I just wanted to be clear--
  • 25:04 - 25:05
    Yes, exactly.
  • 25:05 - 25:11
    So it's not as frequent as I expected,
    having bilingual posts
  • 25:11 - 25:13
    as I would call them.
  • 25:13 - 25:14
    But it's happening.
  • 25:15 - 25:21
    And so, for a series of tests,
    I had to do manual combing
  • 25:21 - 25:23
    and I saw that sometimes
    it was the case
  • 25:23 - 25:27
    that they were doing some sort
    of translation in the same tweet
  • 25:27 - 25:32
    and sometimes it was just the case
    that they were mentioning titles of things
  • 25:32 - 25:34
    or places in a different language.
  • 25:35 - 25:39
    So, there's a lot of issues
    surrounding the automatic handling of this
  • 25:39 - 25:44
    but you are dealing with 92 networks
  • 25:44 - 25:51
    and they have between 30
    and 5,000 nodes in them.
  • 25:53 - 25:56
    So, I don't remember the numbers exactly,
  • 25:56 - 25:59
    but I'm talking about
    around 80,000 people.
  • 26:01 - 26:05
    So detecting the language of 80,000 people
    and this is small-scale.
  • 26:05 - 26:08
    If you go to millions,
    you need an automatic system.
  • 26:08 - 26:11
    And one of the things I'm having
    to write up in my dissertation
  • 26:11 - 26:14
    is what are the challenges.
  • 26:14 - 26:18
    You have to be prepared for them,
    to solve those problems.
  • 26:19 - 26:22
    And one of them is what do you do
    with bilingual posts
  • 26:22 - 26:24
    which language do you assign to that post?
  • 26:24 - 26:28
    Automatic posts, spam...
    there's a lot of problems.
  • 26:30 - 26:31
    Challenges, I mean.
  • 26:31 - 26:35
    That's what makes it interesting
    because you cannot do manual combing
  • 26:35 - 26:36
    on these scales.
  • 26:39 - 26:41
    Do you have another question?
  • 26:45 - 26:48
    So, now, what am I doing with this?
  • 26:51 - 26:56
    I'm going to classify my social networks,
    looking at the patterns
  • 26:56 - 26:59
    of overlaps between the language groups.
  • 27:00 - 27:02
    And overlaps or intersections.
  • 27:03 - 27:08
    I'm looking specifically at the networks
    that have only two language groups
  • 27:08 - 27:12
    I had five of these networks
    that were trilingual
  • 27:12 - 27:16
    so I put them aside to go simple
    first with just two language groups
  • 27:16 - 27:18
    to see how they interconnect.
  • 27:19 - 27:21
    And then I classified them
  • 27:22 - 27:24
    first following a qualitative analysis
  • 27:24 - 27:29
    and then I used network statistics
    that I developed with my adviser
  • 27:29 - 27:30
    for this purpose.
  • 27:31 - 27:34
    And I will talk later a little more
    about it.
  • 27:34 - 27:38
    So, I tried to provide
    more robust measures for that.
  • 27:39 - 27:44
    I classified them and I came up
    with some types.
  • 27:46 - 27:50
    This is what I call the gatekeeper
    language bridge type.
  • 27:51 - 27:53
    And there's some variants of it,
    obviously.
  • 27:54 - 27:56
    What you can see here
    is the network of a person
  • 27:56 - 28:00
    and I'm going to assume this person
    is in the United States
  • 28:00 - 28:02
    and speaks both Spanish and English.
  • 28:04 - 28:06
    Let's call her Maria.
  • 28:06 - 28:12
    So she's Maria and she has two groups
    of friends using Spanish on Twitter
  • 28:13 - 28:16
    and then that big group of friends
    using English.
  • 28:17 - 28:20
    And, as you can see,
    there's just a few nodes
  • 28:20 - 28:22
    connecting the two language groups.
  • 28:22 - 28:28
    You can see that the social structure
    can be different from the language groups
  • 28:29 - 28:32
    so you can have maybe a group of friends
    and a group of coworkers
  • 28:32 - 28:36
    inside the same language group,
    so it can be more complex
  • 28:36 - 28:41
    than just dividing the social network
    by language groups.
  • 28:41 - 28:46
    There can be more grouping
    because of other social reasons.
  • 28:47 - 28:51
    But the interesting thing is that
    there are only a few nodes
  • 28:51 - 28:53
    where people are connecting
    holding together these Twitters.
  • 28:55 - 29:01
    I think this was friends
    with English here.
  • 29:01 - 29:05
    You can see, in this case, it seems
    like the two groups
  • 29:05 - 29:08
    are holding closely together
  • 29:09 - 29:14
    because there are many more links
    holding the two groups together.
  • 29:15 - 29:18
    Of course, this is going to depend
    on the size of the networks
  • 29:18 - 29:23
    so I had to account for the size
    when coming up with measures
  • 29:23 - 29:26
    with network connections
  • 29:26 - 29:28
    I had to provide ratios.
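
Editor's sketch of one size-independent measure of this kind: the share of links in a network that join users of different languages. The talk does not define the actual network statistics, so this ratio, the function name, and the toy data are assumptions. Each user carries a single language label here, which ignores the bilingual users discussed earlier.

```python
def cross_language_ratio(edges, language_of):
    """Share of links that connect users labeled with different languages.

    edges:       iterable of (user, user) links.
    language_of: dict mapping each user to one language label.
    Links with an unlabeled endpoint are ignored.
    """
    labeled = [(u, v) for u, v in edges if u in language_of and v in language_of]
    if not labeled:
        return 0.0
    cross = sum(language_of[u] != language_of[v] for u, v in labeled)
    return cross / len(labeled)


# Hypothetical network: two Spanish users, three English users, one bridge link.
LANGUAGE_OF = {"a": "es", "b": "es", "c": "en", "d": "en", "e": "en"}
EDGES = [("a", "b"), ("c", "d"), ("d", "e"), ("c", "e"), ("b", "c")]

print(cross_language_ratio(EDGES, LANGUAGE_OF))
```

Dividing by the number of links lets a network of 30 people be compared with one of 5,000; a low ratio would sit toward the gatekeeper end and a high ratio toward the more integrated types.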
  • 29:28 - 29:32
    Now, the ratio of cross-language linking
    here and here
  • 29:32 - 29:34
    and you have these types--
  • 29:36 - 29:40
    These types are not just clear-cut.
  • 29:40 - 29:42
    There's an evolution.
  • 29:42 - 29:43
    There's people that have
    very few connections
  • 29:43 - 29:45
    with the language groups
  • 29:45 - 29:47
    and then progressively there's people
    with more and more.
  • 29:48 - 29:49
    And this increases.
  • 29:49 - 29:52
    Which points to the fact
    that my cells are there.
  • 29:53 - 29:57
    Which means I don't see the evolution
    over time, ok?
  • 29:58 - 30:00
    This is a limitation of my research.
  • 30:00 - 30:05
    I just see how the social network
    of this person looked
  • 30:05 - 30:07
    at a particular point in time.
  • 30:08 - 30:10
    I don't know how it evolves over time.
  • 30:10 - 30:13
    So, for myself, it's just there.
  • 30:14 - 30:19
    It would be interesting
    to see these different patterns
  • 30:19 - 30:21
    that I have been observing.
  • 30:21 - 30:27
    Maybe over time these connections
    between languages may be increasing.
  • 30:29 - 30:32
    Now we have the integration
    and union type
  • 30:33 - 30:37
    where in this case you have a person
    from an Arab country
  • 30:37 - 30:41
    and green represents the friends
    that are using Arabic
  • 30:41 - 30:45
    and the friends using English are in pink,
    but there's also violet
  • 30:45 - 30:47
    those are bilinguals.
  • 30:47 - 30:52
    That means there's a group
    of English users
  • 30:52 - 30:57
    and bilingual English - Arabic users
    inserted in the group of Arabic, inside.
  • 31:00 - 31:01
    That's the integration,
    so they're integrated.
  • 31:02 - 31:08
    And then I have a Greek guy,
    who uses Greek and English
  • 31:08 - 31:09
    and his Arabic friends.
  • 31:09 - 31:12
    And in this case, you can see
    it's sort of light blue
  • 31:12 - 31:17
    representing Greek, so the friends
    that tweet in Greek.
  • 31:17 - 31:21
    Pink again represents people tweeting
    in English
  • 31:21 - 31:23
    and there's a lot of bilinguals.
  • 31:23 - 31:27
    So these kind of dark blues
    represent the bilinguals.
  • 31:27 - 31:29
    And these are two groups
  • 31:29 - 31:33
    that if you've seen before,
    the gatekeeper and the language bridge
  • 31:33 - 31:35
    progressively getting closer and closer
  • 31:35 - 31:41
    with more and more links
    across languages.
  • 31:41 - 31:43
    In this case, this is like the extreme.
  • 31:43 - 31:46
    The links between the two languages
    are so dense
  • 31:46 - 31:51
    that you almost cannot distinguish
    where the border is
  • 31:51 - 31:53
    between the two language groups.
  • 31:53 - 31:59
    And, interestingly, the border might be
    even only noticeable
  • 31:59 - 32:01
    because there's a lot of bilinguals
    around it.
  • 32:02 - 32:05
    And this is the union type
    where they unite.
  • 32:07 - 32:10
    And finally, the peripheral language type.
  • 32:10 - 32:14
    This is a Brazilian guy,
    the network of a Brazilian guy
  • 32:15 - 32:17
    where you have--
  • 32:17 - 32:19
    probably he lives in the United States
    or something like that--
  • 32:19 - 32:23
    because this guy has mostly
    all this big group of friends
  • 32:23 - 32:25
    tweeting in English.
  • 32:27 - 32:32
    And then there's the side tentacle
    running outside, using Portuguese.
  • 32:35 - 32:36
    And this is like a peripheral language.
  • 32:36 - 32:39
    So, in the periphery there's a small group
    of Portuguese language.
  • 32:40 - 32:45
    Now, I forgot to mention that there's dots
    that are light yellow or white.
  • 32:45 - 32:48
    Those are the ones that have no data.
  • 32:49 - 32:51
    So, I don't know
    the language they're using
  • 32:51 - 32:53
    because either their accounts are closed
  • 32:53 - 32:58
    or for some reason, in between the collection
    of data they closed the account.
  • 32:59 - 33:03
    Mostly, the reason
    is that they're private accounts
  • 33:04 - 33:06
    where you cannot get the data from.
  • 33:06 - 33:09
    I think somewhere I read
    it was about 5 percent.
  • 33:09 - 33:10
    I'm not sure.
  • 33:10 - 33:14
    But for one reason or another,
    I don't have that information.
  • 33:17 - 33:21
    Now, why am I classifying them?
    These networks?
  • 33:23 - 33:26
    Well, the reason is that--
  • 33:26 - 33:29
    well, there are some studies
    that demonstrate that the social structure
  • 33:29 - 33:34
    the structure of the social networks
    influences the spread of information.
  • 33:34 - 33:36
    How information disseminates
    in the network.
  • 33:39 - 33:43
    So, I'm just assuming
    that these different structures
  • 33:43 - 33:46
    are going to influence the spread
    of information.
  • 33:47 - 33:50
    But this is a study that has to be done.
  • 33:50 - 33:53
    I cannot demonstrate that one
    of these types
  • 33:53 - 33:56
    facilitates the spread of information.
  • 33:56 - 34:02
    I can only say that I am assuming,
    so that potential study
  • 34:04 - 34:09
    could just look at, for example,
    if gatekeeper and language bridges
  • 34:11 - 34:16
    are not as good for spreading information
    as union and integration types.
  • 34:20 - 34:25
    Right, we can just assume
    because of the cross-language links
  • 34:28 - 34:33
    so, how many links there are
    or the ratio of cross-language links
  • 34:33 - 34:38
    may potentially facilitate information
    diffusion in these cases.
  • 34:40 - 34:43
    So, that study needs to be done.
  • 34:43 - 34:45
    I cannot say what's going to happen!
  • 34:45 - 34:47
    I just assume it's going to be like that.
  • 34:49 - 34:52
    So that is the reason why I classify them.
  • 34:52 - 34:55
    I have some network statistics.
  • 34:56 - 35:01
    We've made about an 80 percent accuracy
    guess, which is quite good,
  • 35:01 - 35:02
    but the sample is small.
  • 35:08 - 35:11
    So now, do you have any more questions
    before I move on to the next study?
  • 35:14 - 35:15
    (man) I was curious as to how many--
  • 35:15 - 35:19
    what was the selection process like
    to find the 92 users?
  • 35:20 - 35:23
    Well, this is what I was explaining
    at the beginning
  • 35:23 - 35:27
    about just using two stopwords
    from two different languages
  • 35:27 - 35:31
    typing that in the search box in Google
    and searching Twitter
  • 35:31 - 35:33
    and then once--
  • 35:33 - 35:36
    Basically you just go through
    the list of results
  • 35:36 - 35:42
    and start opening the profile,
    counting the tweets.
  • 35:42 - 35:45
    How many in this language,
    how many in the other.
  • 35:45 - 35:47
    And we put a threshold of 10 percent
  • 35:47 - 35:53
    they had to have written 10 percent
    of the tweets in a second language
  • 35:53 - 35:57
    and you couldn't count retweets
    or automatic posting.
  • 35:58 - 36:00
    We also had to manually discard
    these spammers.
  • 36:02 - 36:04
    So, that was the process.
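The 10 percent rule from this answer can be sketched in a few lines. The counting in the study was done by hand, so this is only an illustration under assumed inputs: the dictionary keys `lang`, `is_retweet` and `is_automatic` and the function name are hypothetical.

```python
from collections import Counter


def is_multilingual(tweets, threshold=0.10):
    """Return True if the second most used language reaches the threshold.

    tweets -- list of dicts with hypothetical keys 'lang', 'is_retweet'
              and 'is_automatic'
    """
    # Retweets and automatic postings do not count, as described in the talk.
    own = [t["lang"] for t in tweets
           if not t["is_retweet"] and not t["is_automatic"]]
    if not own:
        return False
    ranked = Counter(own).most_common()
    if len(ranked) < 2:
        return False  # only one language was ever used
    second_language_count = ranked[1][1]
    return second_language_count / len(own) >= threshold


sample = [{"lang": "en", "is_retweet": False, "is_automatic": False}] * 9
sample += [{"lang": "es", "is_retweet": False, "is_automatic": False}]
print(is_multilingual(sample))
```

Spam accounts would still need to be discarded separately, since nothing in this check detects them.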
  • 36:06 - 36:10
    (woman) And that's a paid search
    through Google?
  • 36:10 - 36:13
    No, that we did manually
  • 36:13 - 36:14
    and then once--
  • 36:14 - 36:20
    So the other thing you can say is you can
    use these core multilingual users
  • 36:21 - 36:24
    and then do what I did for behavior
    in these social networks
  • 36:24 - 36:29
    which is once you extract the friends
    and extract the messages of the friends
  • 36:31 - 36:34
    and automatically find the language
  • 36:34 - 36:37
    then you can say "Oh, this person
    is multilingual" automatically.
  • 36:37 - 36:41
    You just process it and you can detect
    a lot more multilingual people
  • 36:41 - 36:43
    through that process.
  • 36:43 - 36:46
    The paid process was sending these posts
  • 36:46 - 36:49
    to the Google language
    identification tool.
  • 36:50 - 36:55
    So, what I did was clean each message
    automatically.
  • 36:56 - 37:00
    Basically, eliminating the hashtags
  • 37:01 - 37:05
    and the mentions
    that had an @ in front,
  • 37:05 - 37:10
    symbols, URLs, all those things
    I would automatically eliminate them
  • 37:10 - 37:14
    and then with the rest of the message,
    I'd send that to the Google API
  • 37:14 - 37:16
    for language identification
  • 37:16 - 37:22
    and the Google API would give me
    a language label and a confidence value.
  • 37:22 - 37:23
    And that for each message.
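A minimal sketch of the cleaning step described above, assuming plain regular expressions are enough. The patterns and the function name `clean_tweet` are illustrative choices, not the ones used in the study, and the call to the language identification service is deliberately left out.

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")


def clean_tweet(text):
    """Strip URLs, @mentions, #hashtags and symbols before language detection."""
    text = URL_RE.sub(" ", text)       # URLs first, since they may contain @ or #
    text = MENTION_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    # Replace punctuation (Unicode category P*) and symbols (S*) with spaces;
    # letters in any script are kept.
    text = "".join(
        " " if unicodedata.category(ch)[0] in "PS" else ch for ch in text
    )
    return " ".join(text.split())      # collapse leftover whitespace


print(clean_tweet("@maria Hola mundo! #viernes http://t.co/abc"))
```

Removing these tokens first matters because handles, hashtags and links are often in a different language from the message, or in no language at all, and would otherwise mislead the detector.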
  • 37:23 - 37:26
    And then I built the algorithm
    with the help of Jen Golbeck
  • 37:26 - 37:31
    to decide, well I have 30 messages,
    500 English
  • 37:31 - 37:35
    10 million Spanish and then one in Swahili
    which is unlikely
  • 37:37 - 37:40
    and you had to decide
    the confidence value--
  • 37:40 - 37:43
    So I used rules, defined rules
  • 37:43 - 37:46
    but it could be done
    statistically I think.
  • 37:46 - 37:48
    And write some statistical method
    to decide
  • 37:48 - 37:52
    "well this person actually is bilingual"
    or whatever.
  • 37:53 - 37:54
    That's the process.
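The talk does not spell out the rules themselves, so the following is only a guess at what a rule-based decision could look like. The thresholds `min_confidence`, `min_share` and `min_messages`, and the function name, are assumptions made for illustration.

```python
from collections import Counter


def user_languages(detections, min_confidence=0.5, min_share=0.10,
                   min_messages=2):
    """Decide which languages a user really writes in.

    detections -- list of (language_label, confidence) pairs, one per message
    All three thresholds are hypothetical values, not the study's rules.
    """
    # Rule 1: ignore detections the identification tool was unsure about.
    confident = [lang for lang, conf in detections if conf >= min_confidence]
    if not confident:
        return []
    counts = Counter(confident)
    total = len(confident)
    # Rule 2: a language must appear in enough messages, both in absolute
    # terms and as a share, so that a single stray detection is dropped.
    return [lang for lang, n in counts.most_common()
            if n >= min_messages and n / total >= min_share]


sample = [("en", 0.9)] * 18 + [("es", 0.8)] * 11 + [("sw", 0.9)]
print(user_languages(sample))
```

A user for whom this returns two or more languages would be treated as bilingual or multilingual; a statistical method, as suggested in the talk, could replace the fixed thresholds.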
  • 37:54 - 37:56
    It's long!
  • 37:56 - 37:57
    Yes.
  • 37:58 - 38:00
    (woman) Hi, I understand
    that you did it manually
  • 38:00 - 38:05
    but currently in existing research field
    is there any software
  • 38:05 - 38:08
    that we can use to capture,
  • 38:08 - 38:12
    to have access to all
    these different tweets?
  • 38:12 - 38:15
    And to capture the different categories?
    [inaudible]
  • 38:15 - 38:18
    Ok, so you mean the extraction?
  • 38:19 - 38:20
    (woman) Yeah.
  • 38:20 - 38:21
    No, I didn't do it manually.
  • 38:21 - 38:23
    (woman) And the other,
    I think the other part
  • 38:23 - 38:26
    of your data presentation
    is visualizations coming out
  • 38:26 - 38:27
    like this graph.
  • 38:27 - 38:33
    Can you show us what kind of research
    do we have for social scientists
  • 38:33 - 38:35
    to present the data in a visual form?
  • 38:35 - 38:37
    This is a tool I would recommend.
  • 38:37 - 38:39
    [inaudible]
  • 38:39 - 38:41
    So, the first question.
  • 38:43 - 38:46
    All the extraction from Twitter,
    it was automatic.
  • 38:46 - 38:49
    I didn't copy the tweets,
    it was automatic.
  • 38:49 - 38:51
    I used the Twitter API.
  • 38:51 - 38:55
    They have a process
    for registered developers
  • 38:55 - 38:57
    and I extracted it automatically.
  • 39:02 - 39:06
    Now, the tools, and I forgot
    to put that in this slide
  • 39:06 - 39:09
    but in the beginning,
    when I showed you the first visualization
  • 39:09 - 39:12
    I put the name of the tool in--
  • 39:13 - 39:18
    I don't know if I translate well,
    but I think it's G-E--
  • 39:18 - 39:24
    You can see here, G-E-P-H-I,
    I don't know how to pronounce it!
  • 39:24 - 39:27
    ["Jefy" I think...]
  • 39:28 - 39:32
    So, this is the one I've used
    for the visualizations
  • 39:34 - 39:37
    and it's good because you can use it
    on any platform.
  • 39:37 - 39:42
    So both on a Mac or a PC or Linux.
  • 39:45 - 39:47
    Now, it has limitations for...
  • 39:47 - 39:51
    mostly for network statistics
    in my opinion.
  • 39:54 - 39:57
    The other one, that is very popular
    is NodeXL.
  • 39:57 - 40:01
    And in fact it was developed
    here in the HCI lab.
  • 40:02 - 40:04
    In the lab where I work.
  • 40:05 - 40:07
    So, they collaborated with Microsoft.
  • 40:07 - 40:10
    It's a template for Excel
  • 40:11 - 40:13
    and it allows--
  • 40:13 - 40:18
    In fact they are still adding new features
    and there's two people working on it
  • 40:18 - 40:20
    in the lab.
  • 40:20 - 40:24
    But the reason I haven't used it here,
    is because I have a Mac
  • 40:24 - 40:29
    and also there's another reason
    I like this positioning algorithm
  • 40:31 - 40:33
    and this is...
  • 40:33 - 40:37
    this is another issue
    I haven't talked about
  • 40:37 - 40:40
    is how you actually place the dots.
  • 40:40 - 40:47
    And actually these algorithms for layout
    use force-directed schemes
  • 40:49 - 40:51
    like in physics science.
  • 40:51 - 40:54
    So if a node has a lot of links
    with another node
  • 40:54 - 40:57
    they put it closer,
    so it's like there's forces
  • 40:57 - 41:00
    or strings attaching the nodes.
  • 41:01 - 41:04
    And depending on how many strings
    there are, they're closer or farther.
  • 41:05 - 41:08
    There's physics science rules
    for placing them.
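The idea of forces and strings can be illustrated with a small, generic force-directed layout in the style of Fruchterman and Reingold. This is a sketch for intuition only: it is not the Gephi algorithm used for the visualizations in this talk, and the function name and parameters are hypothetical.

```python
import math
import random


def force_directed_layout(nodes, edges, iterations=200, seed=0):
    """Return {node: (x, y)} from a simple spring-and-repulsion scheme."""
    nodes = list(nodes)
    rng = random.Random(seed)
    pos = {n: [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for n in nodes}
    k = 1.0 / math.sqrt(max(len(nodes), 1))  # preferred length of a link

    for step in range(iterations):
        disp = {n: [0.0, 0.0] for n in nodes}

        # Every pair of nodes pushes apart (repulsion).
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                dist = math.hypot(dx, dy) or 1e-9
                push = k * k / dist
                disp[a][0] += dx / dist * push
                disp[a][1] += dy / dist * push
                disp[b][0] -= dx / dist * push
                disp[b][1] -= dy / dist * push

        # Every link pulls its two endpoints together (the "strings").
        # Listing a pair several times makes the pull stronger.
        for a, b in edges:
            dx = pos[a][0] - pos[b][0]
            dy = pos[a][1] - pos[b][1]
            dist = math.hypot(dx, dy) or 1e-9
            pull = dist * dist / k
            disp[a][0] -= dx / dist * pull
            disp[a][1] -= dy / dist * pull
            disp[b][0] += dx / dist * pull
            disp[b][1] += dy / dist * pull

        # Move each node, capped by a "temperature" that cools towards zero.
        temp = 0.1 * (1.0 - step / iterations)
        for n in nodes:
            dx, dy = disp[n]
            length = math.hypot(dx, dy) or 1e-9
            scale = min(length, temp) / length
            pos[n][0] += dx * scale
            pos[n][1] += dy * scale

    return {n: (p[0], p[1]) for n, p in pos.items()}


layout = force_directed_layout(["a", "b", "c"], [("a", "b")])
print(layout)
```

In the small example, the linked nodes "a" and "b" settle near each other while the unlinked node "c" drifts away, which is the effect that makes densely connected language groups appear as separate clusters.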
  • 41:08 - 41:10
    But there's different algorithms
  • 41:10 - 41:15
    but the other reason I chose Gephi
    is that it has an algorithm
  • 41:15 - 41:21
    specifically in this tool
    that places my language groups separately
  • 41:21 - 41:24
    more than any other algorithm
    that I could use in NodeXL.
  • 41:24 - 41:29
    And it was more useful
    to see the groups separated.
  • 41:30 - 41:33
    But you can use both
    depending on what you want to do.
  • 41:33 - 41:36
    They both have weaknesses and strengths,
  • 41:36 - 41:39
    different depending
    on what you have to do.
  • 41:41 - 41:47
    NodeXL has more features
    for processing many networks
  • 41:48 - 41:51
    and extracting network statistics
    for many networks at the same time.
  • 41:52 - 41:57
    And it has a lot of interesting features,
    maybe this is more manual.
  • 41:59 - 42:00
    I don't know.
  • 42:00 - 42:05
    Somebody called it
    "the Photoshop of visualization".
  • 42:09 - 42:14
    So I'm going to briefly comment
    on the factor analysis.
  • 42:14 - 42:19
    The point here, what I want to see
    is multilingual users of Twitter
  • 42:21 - 42:24
    are aware of their audience in a way.
  • 42:25 - 42:29
    And they somehow perceive
    how many followers
  • 42:29 - 42:32
    of this language or the other they have.
  • 42:33 - 42:36
    Maybe not very consciously,
  • 42:38 - 42:40
    but they perceive something.
  • 42:40 - 42:42
    So, I wanted to see how this social network
  • 42:42 - 42:47
    the fact that there's many languages
    or just one in the social network
  • 42:48 - 42:53
    can affect the choice of language in this person,
    the ego person.
  • 42:55 - 42:58
    So, I actually did a lot of testing,
    different variables,
  • 42:58 - 43:01
    but I'm just going to focus
    on the essence,
  • 43:01 - 43:06
    which is I have my dependent variable
    which is the proportion of English
  • 43:06 - 43:11
    used by the ego. Say the ego has 50 posts,
    maybe 60 percent of them are in English
  • 43:12 - 43:14
    and 40 percent in Spanish,
    I don't know.
  • 43:15 - 43:19
    And then they have the factor
    of how many users in the network
  • 43:19 - 43:21
    are in English
    and how many are using other languages.
  • 43:22 - 43:24
    And then the multilingual index
    of the network
  • 43:24 - 43:26
    - and this is my favorite part -
  • 43:26 - 43:30
    because it's basically saying
  • 43:30 - 43:36
    "is multilingualism encouraging English
    as a lingua franca?"
  • 43:37 - 43:42
    especially on Twitter, where we have these
    public posts that anybody can read.
  • 43:43 - 43:47
    So anyway... I'm not going to go
    into the technical details
  • 43:48 - 43:51
    of bi-nodal statistical interpretation.
  • 43:51 - 43:55
    What I wanted to do is
    that in these combined effects
  • 43:56 - 44:00
    of the factors,
    which one was more important?
  • 44:01 - 44:03
    Was heavier than the others?
  • 44:03 - 44:07
    Had more weight in defining these
    proportional [inaudible] used by the ego.
  • 44:09 - 44:11
    I tried other factors,
  • 44:11 - 44:14
    I also looked at the use
    of non-English language
  • 44:15 - 44:18
    In the end... there are certain,
  • 44:20 - 44:21
    I mean, they're obvious somehow.
  • 44:21 - 44:24
    I think it's more interesting the process
    of what I've learned
  • 44:24 - 44:26
    than the results themselves.
  • 44:27 - 44:30
    Because basically what I've learned
    is that, yeah,
  • 44:31 - 44:33
    the English use of the network
  • 44:33 - 44:36
    is encouraged by the use
    of English by the ego
  • 44:36 - 44:41
    and in a certain way it's so important
    that any other factor
  • 44:41 - 44:44
    is really not that important.
  • 44:45 - 44:49
    And even the second most important,
    the multilingual index
  • 44:50 - 44:55
    was so light compared with
    the heavy impact of English
  • 44:56 - 44:57
    used in the network.
  • 44:58 - 45:00
    But what I thought was really interesting
  • 45:00 - 45:03
    was how do you define
    the multilinguality of a network?
  • 45:04 - 45:07
    And with this I got help
    from Jordan Boyd-Graber
  • 45:07 - 45:09
    who is also in the iSchool
  • 45:09 - 45:14
    and in the Computational Linguistics
    and Information Processing lab
  • 45:14 - 45:15
    here in Maryland.
  • 45:15 - 45:18
    He helped me
    with all these technical aspects.
  • 45:18 - 45:21
    And he was the one suggesting
    "Well, why don't you look--"
  • 45:21 - 45:25
    "instead of just looking at the number
    of languages in the network...
  • 45:25 - 45:29
    "because sometimes you get
    wrongly detected languages...
  • 45:29 - 45:30
    like Swahili." Well, no one was really
    speaking Swahili in this network.
  • 45:33 - 45:37
    There were technical challenges,
    like I explained to you.
  • 45:38 - 45:42
    So maybe there's a high number
    of languages in the network
  • 45:42 - 45:44
    but the network is mostly monolingual.
  • 45:44 - 45:49
    Mostly everybody uses English
    and just a few people maybe use others
  • 45:50 - 45:52
    or maybe just it got wrongly detected.
  • 45:52 - 45:55
    And maybe you're just saying
  • 45:55 - 45:57
    "Oh yeah, there's ten languages
    in the network!"
  • 45:57 - 46:00
    and actually it's not
    a very multilingual network at all.
  • 46:00 - 46:03
    So, we came up with this, the entropy.
  • 46:03 - 46:06
    And this is a physics concept
    that measures the disorder
  • 46:06 - 46:08
    in a system.
  • 46:08 - 46:11
    And in this case, the entropy
    would be my multilingual index
  • 46:11 - 46:17
    and what it's doing is providing a value
    between 0 and 1
  • 46:17 - 46:23
    So, with 0 it's a very homogeneous system
    everyone speaks the same language
  • 46:24 - 46:27
    and if it's closer to 1,
    it's really heterogeneous
  • 46:27 - 46:29
    and it places an importance
  • 46:29 - 46:32
    in how many people
    are using each language.
  • 46:32 - 46:36
    So, this is the equation,
    just to show you it.
  • 46:38 - 46:41
    And it takes into account the number
    of languages in the network
  • 46:41 - 46:45
    and then one of the variables
    is how many nodes in that language
  • 46:45 - 46:48
    that there are divided by the total number
  • 46:48 - 46:51
    and this is what gives the proportion
    for example.
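The equation on the slide is not reproduced in this transcript. One form consistent with the description (a value between 0 and 1 that uses the number of languages and each language's share of nodes) is the normalized Shannon entropy, H = -(sum of p_i * log p_i) / log N, where p_i is the proportion of nodes using language i and N is the number of languages. The sketch below assumes that form; the function name is hypothetical.

```python
import math
from collections import Counter


def multilingual_index(node_languages):
    """Normalized Shannon entropy of the languages used in a network.

    node_languages -- one language label per node (friend) in the network
    Returns 0.0 for a monolingual network and 1.0 when every detected
    language is used by the same number of nodes.
    """
    counts = Counter(node_languages)
    total = sum(counts.values())
    n_languages = len(counts)
    if n_languages < 2:
        return 0.0  # everyone uses the same language (or there is no data)
    # p is the proportion of nodes using a given language.
    entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
    return entropy / math.log(n_languages)  # rescale to the range 0..1


print(multilingual_index(["en"] * 98 + ["es", "sw"]))
print(multilingual_index(["en"] * 5 + ["es"] * 5))
```

The first example has three detected languages but stays close to 0, because almost every node uses English; this is why the index is more robust to stray, wrongly detected languages than a plain count of languages.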
  • 46:53 - 46:57
    So just to let you know
    that there's interesting lessons
  • 46:57 - 46:58
    from this study.
  • 46:58 - 47:00
    Despite the research not being exciting!
  • 47:01 - 47:03
    And this is what I'm doing right now.
  • 47:05 - 47:08
    So, the intrinsic characteristic
    of the message
  • 47:08 - 47:11
    how that influences the language choice.
  • 47:11 - 47:16
    First, I'm wondering,
    because I just saw it in the content
  • 47:19 - 47:22
    are replies encouraging people
    to use their native language?
  • 47:23 - 47:27
    And public posts encouraging people
    to use English as a lingua franca?
  • 47:28 - 47:30
    This is one that showed up the same.
  • 47:30 - 47:34
    And I changed the handle,
    for privacy reasons...
  • 47:35 - 47:38
    So this is the reply to somebody
    and it's in Arabic.
  • 47:38 - 47:41
    And this is a public posting
    and it's in English.
  • 47:42 - 47:46
    Now, the thing I'm looking at
    is topic analysis
  • 47:46 - 47:50
    and I'm considering with Jordan
    to do some automatic topic analysis
  • 47:51 - 47:54
    because there's many languages,
    so I cannot decode it all
  • 47:55 - 47:57
    in many of them.
  • 47:57 - 47:58
    Only in three, maybe four...
  • 48:00 - 48:01
    So, I'm wondering,
  • 48:01 - 48:04
    are technology topics favoring
    the use of English?
  • 48:05 - 48:10
    And other topics,
    international news maybe?
  • 48:11 - 48:16
    Whereas other topics
    like national news or songs
  • 48:16 - 48:19
    they might be encouraging the use
    of native languages.
  • 48:21 - 48:23
    And then I'm looking
    if there's translations
  • 48:23 - 48:27
    or if there's cross-cultural words
    that you can detect.
  • 48:27 - 48:29
    For instance, this person
    is writing in English
  • 48:29 - 48:33
    but is recommending a visit to a museum
    in the city of Lille in France.
  • 48:34 - 48:39
    So this person knows the city in France,
    knows that to visit the museum
  • 48:39 - 48:41
    you go there.
  • 48:41 - 48:43
    And this is what I call
    cross-cultural words.
  • 48:44 - 48:49
    [What I kind of found] is that surprisingly
    there's not many translation behaviors
  • 48:49 - 48:53
    going on, despite these people
    being multilingual.
  • 48:53 - 48:56
    And this is what is going to trigger
    some reflections.
  • 49:00 - 49:02
    How am I doing on time?
  • 49:04 - 49:06
    (woman) 1:22.
  • 49:06 - 49:10
    (man) Umm, it's usually an hour long...
  • 49:10 - 49:14
    So, I will go on with my reflections
  • 49:14 - 49:18
    to encourage some thoughts.
  • 49:18 - 49:22
    So the greatest connecting power
    is the will of users who want
  • 49:22 - 49:23
    to be connected.
  • 49:23 - 49:28
    This is a really nice quality,
    because the communities of interest
  • 49:28 - 49:32
    in social media, in Twitter
    is what is bringing people
  • 49:32 - 49:34
    from different countries, together.
  • 49:35 - 49:41
    And also experiences,
    like the Voluntweeters,
  • 49:42 - 49:46
    so after the earthquake in Haiti,
    there were these spontaneous
  • 49:46 - 49:49
    self-organizations of Twitter users
    for translating tweets
  • 49:50 - 49:54
    and they called themselves Voluntweeters,
    there's a paper about that--
  • 49:54 - 49:59
    So this is the triggering
    of social connections
  • 50:01 - 50:04
    across countries, across borders
    and across languages.
  • 50:07 - 50:10
    But even when the social structure
    could potentially facilitate
  • 50:10 - 50:13
    information diffusion
    and cross-language linking
  • 50:15 - 50:17
    this condition is not sufficient.
  • 50:17 - 50:20
    There are other factors
    like the design of the interfaces
  • 50:20 - 50:22
    and the design of systems
    that can influence...
  • 50:23 - 50:27
    can promote, or not translation behaviors
    and cross-cultural awareness.
  • 50:28 - 50:32
    And the Wikipedia
    of cross-language linking
  • 50:32 - 50:35
    you have links for many languages
    for every article.
  • 50:37 - 50:41
    We also still acknowledge the dynamic
    language preferences of multilingual users
  • 50:42 - 50:44
    so they could address their messages
    to the appropriate audience.
  • 50:44 - 50:47
    I like the solution of Google+
    with their circles
  • 50:48 - 50:52
    where I can put my friends and family
    in Spain in a circle
  • 50:52 - 50:55
    and write them in Spanish.
  • 50:55 - 51:01
    And then the recommendation of people
    based on language profile
  • 51:01 - 51:04
    would be useful for this spontaneous
    self-organization.
  • 51:06 - 51:08
    So, these are some of the things.
  • 51:08 - 51:10
    The impact of mediation.
  • 51:11 - 51:13
    Global Voices is
    an international community of bloggers
  • 51:13 - 51:18
    that connects bloggers and citizens
    from around the world
  • 51:19 - 51:21
    in different languages.
  • 51:21 - 51:23
    And Scott Hale
  • 51:23 - 51:27
    a student from Oxford University
    led a very interesting study
  • 51:27 - 51:34
    after the earthquake in Haiti about blogs
    in Spanish, Japanese and English
  • 51:36 - 51:39
    and he looked
    at the cross-language linking
  • 51:39 - 51:41
    and focusing on this topic
    over time.
  • 51:41 - 51:45
    And he discovered that 50 percent
    of the cross-language linking
  • 51:45 - 51:48
    was happening through this platform,
    Global Voices.
  • 51:49 - 51:52
    So, it had a very big impact
    in the language links.
  • 51:54 - 51:58
    And finally, social media,
    big media outlets,
  • 51:58 - 52:02
    people are interconnected
    in these complex networks
  • 52:05 - 52:09
    and underlying is this language ecosystem.
  • 52:09 - 52:13
    So we have the language ecosystem,
    and on top of that
  • 52:13 - 52:15
    we have the social media ecosystem.
  • 52:15 - 52:20
    People would share a video from YouTube
    on Twitter, or news on Facebook.
  • 52:21 - 52:26
    What happens if we integrate
    in this ecosystem
  • 52:27 - 52:31
    these platforms, like Global Voices,
    like Universal Subtitles
  • 52:31 - 52:34
    which is a platform
    for crowdsourcing subtitling of videos
  • 52:34 - 52:37
    and translation of subtitles
    for videos.
  • 52:38 - 52:42
    If you integrate that and this
    starts connecting, starts building paths
  • 52:42 - 52:46
    between languages,
    that didn't exist before.
  • 52:46 - 52:51
    So I think we should make it easy
    for multilingual people to translate
  • 52:51 - 52:55
    and subtitle all the content they like,
    their favorite content
  • 52:56 - 53:00
    and share it with the appropriate audience
    so they can start connecting
  • 53:00 - 53:03
    the language islands of the internet.
  • 53:03 - 53:06
    And that way stories will travel
    all over the world.
  • 53:09 - 53:12
    Particularly I would like to thank
    Jen Golbeck, my adviser
  • 53:12 - 53:14
    and Fulbright for supporting
    this research.
  • 53:14 - 53:19
    And then I open the space
    for questions and your ideas
  • 53:19 - 53:22
    if this has triggered some thoughts.
  • 53:24 - 53:26
    (woman) I have a question
    about how this relates
  • 53:26 - 53:28
    to your Yahoo award.
  • 53:29 - 53:35
    Well, they have the Internet Experiences
    lab in California.
  • 53:35 - 53:36
    And they--
  • 53:36 - 53:40
    So, we tend to think
    maybe it's a super tiny place
  • 53:40 - 53:43
    but actually there are fields
  • 53:43 - 53:45
    and I applied for the social systems.
  • 53:45 - 53:49
    The social systems are a category.
  • 53:49 - 53:55
    And I think that was embedded
    in the Internet Experience lab
  • 53:57 - 53:58
    and yeah, they liked it.
  • 53:59 - 54:02
    (man) But is it this
    work that they are interested in?
  • 54:02 - 54:03
    Yes.
  • 54:03 - 54:04
    - The languages?
    - Yes.
  • 54:04 - 54:08
    Well, now I have results,
    because I wrote up reports
  • 54:09 - 54:12
    about what my work was about.
  • 54:17 - 54:18
    Great.
  • 54:22 - 54:23
    Yes?
  • 54:23 - 54:26
    (woman) I was thinking about
    if you analyzed the place...
  • 54:26 - 54:31
    like if there's any relationship
    between tweeters and tweets
  • 54:31 - 54:34
    and the place that the people are.
  • 54:36 - 54:40
    I mean, because it's not the same
    being a Brazilian in Brazil
  • 54:40 - 54:43
    and tweeting in Portuguese
    or being Brazilian in the US
  • 54:43 - 54:45
    and tweeting in Portuguese--
  • 54:46 - 54:49
    There's many, many factors
    that I haven't looked at.
  • 54:50 - 54:52
    (woman) It's not part of your study?
  • 54:52 - 54:54
    But because I had to scope it somehow.
  • 54:54 - 54:56
    There's so many factors.
  • 54:57 - 55:00
    Geography was one that I was originally
    intending to look at
  • 55:00 - 55:04
    but I found there were so many problems
    to actually get the right geography
  • 55:04 - 55:07
    the right geolocation.
  • 55:08 - 55:12
    The problem is that I didn't originally
    collect the geolocation.
  • 55:12 - 55:16
    I think only a small percentage
    of messages have...
  • 55:16 - 55:18
    geolocated information.
  • 55:19 - 55:21
    I'm not sure about the percentage there.
  • 55:21 - 55:25
    So there's only a small percentage
    of messages that have geolocation.
  • 55:25 - 55:28
    There's issues with the accuracy...
  • 55:28 - 55:31
    What I have collected is the information
    in their profile
  • 55:32 - 55:35
    they can put the information
    about the place,
  • 55:35 - 55:40
    but sometimes it's more
    or less trustworthy,
  • 55:40 - 55:43
    sometimes there's nothing,
    and sometimes there's just crazy stuff.
  • 55:43 - 55:45
    (audience laughs)
  • 55:47 - 55:50
    So, something absolutely has to be there.
  • 55:50 - 55:55
    If I wanted to expand this,
    geography would be a nice place to go!
  • 55:55 - 55:57
    (woman) Ok.
  • 56:00 - 56:01
    Yes?
  • 56:01 - 56:02
    (man) Could you say a little bit more
  • 56:02 - 56:05
    I think you said about the visualization
    choices you made?
  • 56:05 - 56:06
    Oh yes, well...
  • 56:08 - 56:11
    I tried this tool, NodeXL,
  • 56:11 - 56:13
    I used both NodeXL and Gephi.
  • 56:14 - 56:15
    There's more...
  • 56:16 - 56:20
    I think there's, I don't remember the name
    there's one that was developed
  • 56:20 - 56:22
    here in Maryland
  • 56:22 - 56:24
    but it's not as user-friendly.
  • 56:26 - 56:30
    But I've forgotten the name,
    I will have to look it up.
  • 56:30 - 56:34
    And there's a lot of tools
    that are for really technical people
  • 56:35 - 56:37
    that are handling millions of nodes.
  • 56:38 - 56:41
    Because with these tools,
    for social scientists or humanists
  • 56:41 - 56:42
    maybe they are not.
  • 56:42 - 56:49
    Some tools can have maybe 300-400 nodes
    and still be understandable.
  • 56:51 - 56:56
    But if you go beyond that,
    actually visualizations get crazy
  • 56:56 - 57:02
    and even for more technical tools
    for more technical people
  • 57:03 - 57:07
    there are hundreds or millions,
    they cannot do visualizations
  • 57:08 - 57:12
    at some point they just give you
    statistical measures.
  • 57:14 - 57:15
    I have to leave it out.
  • 57:15 - 57:17
    I have a list of tools and that
  • 57:17 - 57:21
    but if I need the names,
    I need to go through everything.
  • 57:23 - 57:25
    (woman) But yours was Mac-accessible?
  • 57:25 - 57:32
    Yes, this Gephi tool is Mac-accessible,
    you can use it with Microsoft
  • 57:32 - 57:34
    with Mac and with Linux.
  • 57:36 - 57:38
    And I forgot to say,
    it's open source.
  • 57:43 - 57:49
    (woman) Did you find
    studying languages and internet
  • 57:49 - 57:53
    was like a place, unexplored?
  • 57:53 - 57:55
    Like here in the United States?
  • 57:55 - 58:00
    Like when you began studying
    or analyzing this
  • 58:00 - 58:04
    you felt that a lot of people
    are doing this
  • 58:04 - 58:06
    or nobody is doing this
  • 58:06 - 58:08
    and I'm the first one trying to--
  • 58:08 - 58:13
    I'm not the first one,
    but it's a very new area
  • 58:13 - 58:15
    to be exploring.
  • 58:15 - 58:17
    So, it's very exciting
    because of that.
  • 58:17 - 58:19
    Because there's so many
    unanswered questions
  • 58:19 - 58:24
    and I find that surprisingly enough
    the United States is not paying so much attention
  • 58:24 - 58:26
    to multilinguality issues.
  • 58:26 - 58:31
    And I think that language policies
    are very monolingual-oriented
  • 58:31 - 58:33
    but it's terrible
  • 58:33 - 58:37
    because there's a whole lot
    of multilinguality in this country.
  • 58:37 - 58:41
    There's so many people
    speaking different languages
  • 58:43 - 58:45
    that I'm so amazed
    about that contradiction.
  • 58:46 - 58:49
    Because in Europe,
    it's an obvious challenge for us
  • 58:49 - 58:52
    because we need to understand each other
    between all these countries
  • 58:52 - 58:54
    of the European Union.
  • 58:54 - 58:58
    And there's a lot of money invested
    in research that relates to multilinguality
  • 58:59 - 59:01
    and communication in languages
  • 59:01 - 59:05
    and technology in particular,
    cross-language systems
  • 59:05 - 59:09
    and in libraries there's a lot of work
    going on.
  • 59:09 - 59:14
    There's investment in the research.
  • 59:15 - 59:18
    So yeah, maybe in terms of investment
  • 59:18 - 59:22
    the European Union is
    not a bad place to be.
  • 59:22 - 59:24
    Better than the United States!
  • 59:24 - 59:27
    But at the same time,
    what I find interesting
  • 59:27 - 59:33
    is that here when I talk about it
    people are really interested
  • 59:35 - 59:38
    and interested in the subject
    and excited about it.
  • 59:38 - 59:41
    Maybe in Europe it looks more
    like old news.
  • 59:41 - 59:44
    Like "yeah, we already know that."
  • 59:44 - 59:46
    (audience laughs)
  • 59:46 - 59:50
    So I find that it's exciting
    to be seeing the audience
  • 59:50 - 59:52
    like "Oh yeah!"
    It's so new.
  • 59:53 - 59:54
    (woman) Yes.
  • 59:59 - 60:03
    (woman) As the emerging view
    of research in the United States
  • 60:03 - 60:10
    can you show me which institutions
    or which area of academic institutions
  • 60:12 - 60:15
    actually have more invested
    in this topic in the US?
  • 60:16 - 60:19
    I'm not sure about the institutions.
  • 60:21 - 60:26
    What I know, particularly,
    in Indiana there's work
  • 60:27 - 60:29
    because Susan Herring
    is a researcher there.
  • 60:31 - 60:33
    She has inspired my work.
  • 60:33 - 60:36
    She published a book,
    "The Multilingual Internet,"
  • 60:36 - 60:41
    and she has done research on blogs,
    also communities
  • 60:42 - 60:45
    of different languages connecting blogs
    in the blogosphere.
  • 60:45 - 60:51
    So she has been one of the ones,
    one of the first tackling these issues
  • 60:51 - 60:55
    and she's still going
    and she's doing something.
  • 60:55 - 60:59
    So, it's the University of Indiana,
    I think.
  • 61:01 - 61:03
    Yeah, Susan Herring.
    Look for her!
  • 61:06 - 61:09
    And also at the same university
    there's Paolillo.
  • 61:10 - 61:13
    He's also doing research
    in this area
  • 61:13 - 61:19
    and he actually published for UNESCO
    for research on language diversity
  • 61:19 - 61:20
    on the internet.
  • 61:22 - 61:23
    So Susan Herring and Paolillo,
  • 61:23 - 61:25
    they are at the same university.
  • 61:27 - 61:30
    Those are my inspiring ones.
  • 61:34 - 61:37
    Well, at Harvard, the Berkman Center
    for Internet and Society also did
  • 61:37 - 61:39
    this mapping of the blogs.
  • 61:39 - 61:41
    But they don't focus on languages.
  • 61:42 - 61:45
    But there's a tangential thing
    around there.
  • 61:49 - 61:51
    (man) One more question?
  • 61:54 - 61:55
    Well, thank you very much!
  • 61:55 - 61:56
    Thanks!
  • 61:56 - 61:58
    (audience applauds)
Title:
Irene Eleta: Multilingual Users of Twitter: Social Ties Across Language Borders or How a Story Could Travel the World
Video Language:
English
Team:
MITH Captions (Amara)
Project:
BATCH 1
