WEBVTT

00:00:00.384 --> 00:00:01.891
Good morning, everyone.

00:00:02.858 --> 00:00:05.650
Thank you for coming here
[unclear] of the semester.

00:00:08.370 --> 00:00:10.217
So, I'm going to start.

00:00:11.001 --> 00:00:13.901
Access to the internet
is greater than ever before

00:00:14.183 --> 00:00:17.166
and as a consequence,
it's becoming more multilingual.

00:00:18.706 --> 00:00:22.613
However, there's evidence of segmentation
of cyberspace

00:00:22.614 --> 00:00:25.170
due to language and national borders.

00:00:28.011 --> 00:00:30.812
This image serves to illustrate that.

00:00:31.684 --> 00:00:35.656
This is the language communities
of Twitter in Europe.

00:00:36.562 --> 00:00:40.656
So, what you can see are tweets
geolocated over a map of Europe

00:00:40.657 --> 00:00:44.047
and the different colors
represent the different languages.

00:00:45.272 --> 00:00:50.905
You can even see regional languages
like Catalan in the Catalan region of Spain.

00:00:52.286 --> 00:00:56.037
And this is going to be useful
for an example I'm going to use later.

00:01:01.958 --> 00:01:04.209
I'm interested in Twitter in particular,

00:01:04.209 --> 00:01:07.312
because of the speed
of information dissemination

00:01:07.313 --> 00:01:10.744
and that most of this information
is publicly accessible.

00:01:13.743 --> 00:01:18.780
I'm going to illustrate this
with a capture

00:01:18.780 --> 00:01:22.280
of a dynamic visualization
you can find on the Twitter blog

00:01:22.281 --> 00:01:24.783
by Miguel Rios.

00:01:25.345 --> 00:01:28.981
And what you can see here
is the global flow of tweets

00:01:28.982 --> 00:01:31.205
after the earthquake in Japan.

00:01:32.147 --> 00:01:34.793
In pink, there are the tweets
coming out of Japan

00:01:34.794 --> 00:01:37.400
and, in green, the retweets
all over the world.

00:01:39.258 --> 00:01:44.966
This illustrates that in Twitter
information is spreading across countries.

00:01:46.180 --> 00:01:47.987
But how can this happen?

00:01:49.380 --> 00:01:55.018
Expatriates, migrants, minorities,
diaspora communities, language learners

00:01:55.028 --> 00:01:59.484
all play an important role
in building transnational networks

00:01:59.484 --> 00:02:02.549
and cultural bridges
between nations and communities.

00:02:03.847 --> 00:02:06.169
They are the multilingual users
on the internet.

00:02:07.512 --> 00:02:11.045
The overarching research question is:

00:02:11.047 --> 00:02:16.833
how are multilingual users of Twitter
connecting different language groups?

00:02:21.961 --> 00:02:27.157
In 2009, the Berkman Center for Internet
and Society at Harvard University

00:02:27.158 --> 00:02:29.773
mapped the Arabic blogosphere

00:02:29.774 --> 00:02:33.134
and they described a key concept
for my research.

00:02:35.020 --> 00:02:40.855
They discovered an English bridge
and a French bridge of bloggers

00:02:40.856 --> 00:02:45.780
that were writing in their native
Arabic language and in English or French.

00:02:47.430 --> 00:02:51.551
And they were connecting the different
national blogospheres

00:02:51.552 --> 00:02:53.034
with the international one.

00:02:55.387 --> 00:03:00.351
This might have played a role in the Arab
popular uprisings in 2011

00:03:00.352 --> 00:03:02.773
for reaching out to the world.

00:03:04.582 --> 00:03:09.223
And this is connected with a concept
that first appeared in 2008

00:03:09.224 --> 00:03:11.697
of the bridge bloggers.

00:03:13.601 --> 00:03:16.010
So, bridge bloggers are bloggers

00:03:16.011 --> 00:03:19.991
that are trying to connect
their local communities

00:03:19.992 --> 00:03:23.255
to a wider global audience.

00:03:25.190 --> 00:03:29.398
The image you can see here
is actually the visualization they created

00:03:29.399 --> 00:03:32.617
of mapping the Arabic blogosphere.

00:03:35.341 --> 00:03:37.370
Each dot is a blogger, or a blog.

00:03:38.763 --> 00:03:42.930
The size represents their popularity,
so how many incoming links they have

00:03:42.931 --> 00:03:45.457
and they grouped them--

00:03:45.457 --> 00:03:47.666
the neighborhoods they created

00:03:47.667 --> 00:03:51.582
in relation to the linking
between them.

00:03:52.587 --> 00:03:56.420
So, the ones that are grouped together
are linking among each other.

00:03:57.235 --> 00:03:59.454
The colors are a different question.

00:03:59.489 --> 00:04:06.067
The colors represent "attentive clusters";
that's what they call them.

00:04:06.067 --> 00:04:11.793
And they looked at the online resources
and media outlets

00:04:12.065 --> 00:04:13.911
these blogs were linking to.

00:04:15.055 --> 00:04:19.346
So, blogs of the same color
are following the same media outlets

00:04:19.346 --> 00:04:21.100
and online resources.

00:04:21.165 --> 00:04:25.057
And they did human coding
to label those groups.

00:04:25.640 --> 00:04:30.282
And here is where we see
the label English bridge

00:04:30.283 --> 00:04:32.035
the responses from Cuba
in English

00:04:32.036 --> 00:04:33.950
and up there, there's [unclear] France.

00:04:35.774 --> 00:04:40.537
And so I think it's important to retain
the concept of attentive clusters.

00:04:43.797 --> 00:04:49.005
Now, let's go back to 2011
during the Arab popular uprisings.

00:04:50.444 --> 00:04:55.207
And I'll show you a visualization
of the influence network

00:04:55.208 --> 00:04:57.069
of Twitter users in Egypt.

00:04:58.210 --> 00:04:59.868
So, what you're seeing here

00:04:59.869 --> 00:05:04.348
just imagine people down the street
at Tahrir Square

00:05:04.349 --> 00:05:08.426
tweeting in Arabic about what's going on
on the ground.

00:05:08.427 --> 00:05:12.188
And those are the people in red.

00:05:12.652 --> 00:05:17.525
So, these red dots represent users
that are tweeting in Arabic.

00:05:18.309 --> 00:05:20.372
Then we have the international community

00:05:20.373 --> 00:05:25.515
or even Americans, British and so on
tweeting in English.

00:05:26.075 --> 00:05:28.287
And they are in blue,
those blue dots.

00:05:28.808 --> 00:05:33.052
And then, interestingly, we have
people in between them,

00:05:33.052 --> 00:05:38.152
which are illustrated in different
degrees of violet, or violet shades.

00:05:39.393 --> 00:05:43.417
This represents the fact that they
are tweeting in both Arabic and English.

00:05:45.006 --> 00:05:47.204
So, what we're seeing
is the bridge tweeters

00:05:47.996 --> 00:05:52.754
because, like Ethan Zuckerman called them
"bridge bloggers".

00:05:55.642 --> 00:05:58.520
So, another context.

00:05:59.333 --> 00:06:04.649
The same year, 2011, a lot
of big protests were going on in Europe.

00:06:05.172 --> 00:06:06.780
And in particular, in Spain.

00:06:06.781 --> 00:06:11.868
They started on May 15th, 2011;
there were massive protests.

00:06:13.002 --> 00:06:16.640
And because of this context,
this situation

00:06:16.640 --> 00:06:22.974
new attentive clusters were emerging
in the social media landscape of Spain.

00:06:28.240 --> 00:06:33.711
Now, this is a visualization you can find
in the <i>Socialflow</i> blog, research blog

00:06:33.806 --> 00:06:35.216
on social networks.

00:06:35.216 --> 00:06:41.235
And what it is, is it tracks the origin
and the initial spread

00:06:41.236 --> 00:06:45.347
of the hashtag <i>#occupywallstreet</i>
in Twitter.

00:06:46.987 --> 00:06:51.803
They detected that one of the first uses
of the hashtag <i>#occupywallstreet</i>

00:06:51.804 --> 00:06:57.093
was on July 13th 2011, linking to a blog
post of Adbusters.

00:06:58.344 --> 00:07:02.250
So you have the Twitter account
of Adbusters there, very big

00:07:02.251 --> 00:07:05.023
because it's being retweeted a lot.

00:07:05.966 --> 00:07:06.966
And mentioned a lot.

00:07:07.884 --> 00:07:13.514
And they collected these mentions
and the tweets that had these mentions

00:07:13.515 --> 00:07:17.298
and these retweets with the hashtag
during July 13th.

00:07:18.188 --> 00:07:20.501
From July 13th to July 23rd.

00:07:21.163 --> 00:07:23.523
So, from the first 10 days
of the use of this hashtag

00:07:23.524 --> 00:07:26.810
it was from the very beginning of the use
of this hashtag on Twitter.

00:07:31.012 --> 00:07:32.320
They just mapped the accounts

00:07:33.462 --> 00:07:39.459
and the series of posts with the hashtag
and mentions with the hashtag

00:07:40.652 --> 00:07:42.997
and the users that were connecting

00:07:42.998 --> 00:07:45.329
because of these mentions
and retweets.

00:07:46.136 --> 00:07:49.237
Now the interesting thing
in this visualization

00:07:49.238 --> 00:07:50.517
is that they

00:07:50.518 --> 00:07:54.348
the <i>Socialflow</i> people
particularly in [inaudible]

00:07:54.873 --> 00:07:59.885
detected this Spanish brand
of users

00:07:59.886 --> 00:08:03.146
forming an attentive cluster.

00:08:05.948 --> 00:08:09.282
Mentioning and retweeting about it
in Spanish

00:08:09.283 --> 00:08:13.042
using the hashtag in their messages
in Spanish.

00:08:14.104 --> 00:08:17.164
And they point out in the blog

00:08:17.865 --> 00:08:19.524
that this Spanish contingent

00:08:19.525 --> 00:08:24.189
helped post and spread the word
about Occupy Wall Street

00:08:24.190 --> 00:08:29.358
even before most of the United States
was aware of it.

00:08:32.239 --> 00:08:34.043
So, I found that very interesting.

00:08:34.043 --> 00:08:37.438
And it was due to the context
in Spain at that moment

00:08:38.287 --> 00:08:45.449
with big protests and new clusters
forming in the social media landscape.

00:08:56.673 --> 00:09:01.497
Now I have shown you the importance
of these multilingual users

00:09:01.498 --> 00:09:06.470
in connecting language communities
and spreading information

00:09:06.471 --> 00:09:09.228
across countries, acting as mediators.

00:09:11.172 --> 00:09:15.831
But let's focus on another aspect
of connecting language groups

00:09:15.832 --> 00:09:17.170
which is language choice.

00:09:17.881 --> 00:09:23.261
So I'm going to devote a moment
to speak about languages

00:09:23.262 --> 00:09:24.262
and language choice.

00:09:27.743 --> 00:09:31.064
To understand languages in the world

00:09:31.065 --> 00:09:33.478
I'm going to use a telescope.

00:09:37.358 --> 00:09:39.954
So de Swaan...

00:09:41.244 --> 00:09:44.565
...proposed a theory called
the world language system

00:09:44.566 --> 00:09:45.815
back in the 1990s

00:09:47.268 --> 00:09:50.867
to explain the languages in the world.

00:09:52.127 --> 00:09:55.428
And he used a very beautiful metaphor,
the constellation.

00:09:56.906 --> 00:10:01.510
So, in his theory there's about a dozen
languages in the world

00:10:01.511 --> 00:10:05.469
that are the hearts of the system,
or the suns.

00:10:06.493 --> 00:10:07.493
The suns of the system.

00:10:07.604 --> 00:10:11.015
For instance, English, French, Spanish,
Arabic and more.

00:10:12.324 --> 00:10:16.765
And then there are hundreds,
maybe more than 100, 200...

00:10:16.766 --> 00:10:22.707
national languages that are orbiting
around these suns like planets.

00:10:24.497 --> 00:10:28.393
And finally we have regional
and minority languages

00:10:28.394 --> 00:10:31.664
that are orbiting these planets
like satellites.

00:10:32.826 --> 00:10:37.606
And he used this metaphor
to explain the power relationships

00:10:37.607 --> 00:10:39.729
between languages.

00:10:40.172 --> 00:10:43.096
This is a theory of what he called

00:10:43.097 --> 00:10:46.659
"communication potential
and language competition".

00:10:48.408 --> 00:10:50.539
A key point he made

00:10:51.979 --> 00:10:55.379
is that the system holds together

00:10:55.440 --> 00:10:58.793
thanks to multilingual people
and interpreters.

00:11:00.291 --> 00:11:03.173
This is what's providing cohesion
to the system.

00:11:03.956 --> 00:11:06.959
He also made a controversial proposal

00:11:06.960 --> 00:11:11.340
about the communication potential
of a language.

00:11:12.290 --> 00:11:14.868
So, he proposed a formula,
a mathematical formula

00:11:14.869 --> 00:11:19.910
where he could estimate the communication
potential of a language

00:11:19.911 --> 00:11:24.889
and supposedly a person would choose
a language for learning and usage

00:11:24.889 --> 00:11:27.554
based on its communication potential.

00:11:28.108 --> 00:11:34.206
For example, a person might decide
to learn English and use English

00:11:35.137 --> 00:11:41.414
because not only does it provide
communication with English native speakers

00:11:42.395 --> 00:11:46.377
but also, adding to that, it provides
the possibility to communicate

00:11:46.378 --> 00:11:50.073
with all the second-language learners
of English

00:11:50.074 --> 00:11:53.101
from many different languages,
many different countries.

00:11:53.102 --> 00:11:55.818
So, supposedly, in his theory

00:11:55.819 --> 00:11:59.686
English provides
the greatest communication potential.

00:12:01.464 --> 00:12:05.106
And he received some criticism,
because of the central role of English

00:12:05.107 --> 00:12:06.785
in his theory.

00:12:06.806 --> 00:12:10.139
He said it was the central hub
of all the system.

00:12:12.959 --> 00:12:20.043
There's also the language ecology paradigm
first proposed by Haugen in 1972

00:12:21.825 --> 00:12:25.243
and there's this idea of an ecosystem
of languages

00:12:25.244 --> 00:12:29.695
and, again, it's using another metaphor

00:12:29.696 --> 00:12:31.608
and because of this metaphor

00:12:31.609 --> 00:12:34.466
there also appeared the idea
of endangered languages.

00:12:36.353 --> 00:12:39.880
I'm going to briefly just read
the definition.

00:12:39.881 --> 00:12:42.787
He defined the language ecology as:

00:12:42.788 --> 00:12:47.235
"the study of interactions between
any given language and its environment"

00:12:47.985 --> 00:12:49.207
and what I think is very important:

00:12:49.208 --> 00:12:53.690
"language exists only in the minds
of its users"

00:12:56.694 --> 00:12:59.956
which leads me to point at my research.

00:13:01.528 --> 00:13:05.808
In my research, I'm using a microscope
to see the cells

00:13:05.809 --> 00:13:09.337
and my cells in my study
are the Twitter users.

00:13:12.192 --> 00:13:13.452
Why is that?

00:13:15.616 --> 00:13:19.906
Because as Haugen explains,
there's a psychological dimension

00:13:19.907 --> 00:13:21.811
to language ecology

00:13:21.812 --> 00:13:25.070
where language interacts
with other languages

00:13:25.071 --> 00:13:27.697
in the minds of multilingual people.

00:13:28.860 --> 00:13:31.574
And there's a sociological dimension
to language ecology

00:13:32.425 --> 00:13:38.379
where we use language to communicate
and interact with other people.

00:13:38.380 --> 00:13:43.860
And this language ecology is generated
by the people

00:13:43.861 --> 00:13:46.493
that decide to use that language,

00:13:46.494 --> 00:13:50.337
learning it and interacting
with people using it.

00:13:51.139 --> 00:13:54.758
And this is the point
of language choice in languages.

00:13:55.643 --> 00:13:59.827
So, I focus on the connections of people
and the language choice.

00:14:04.231 --> 00:14:07.595
So, these are the four points
I'm going to be speaking about.

00:14:07.862 --> 00:14:13.015
But actually the main focus
is going to be the first point:

00:14:13.483 --> 00:14:19.196
social network analysis and the taxonomy
of intersections between language groups.

00:14:20.267 --> 00:14:23.376
This is where I'm going to be spending
most of the time.

00:14:23.377 --> 00:14:26.820
And then very briefly,
just for completeness,

00:14:26.821 --> 00:14:29.551
I'm going to speak about another
small study that I did

00:14:29.552 --> 00:14:30.861
the factor analysis

00:14:30.862 --> 00:14:34.485
looking at the influence
of the social network

00:14:34.486 --> 00:14:39.007
in the language choices of the users.

00:14:39.236 --> 00:14:41.737
So, how the social network
influences language choice

00:14:41.738 --> 00:14:43.400
of our multilingual users.

00:14:45.128 --> 00:14:50.399
And then I'm going to briefly also talk
about the last study of my dissertation

00:14:50.400 --> 00:14:52.492
that is still ongoing.

00:14:52.992 --> 00:14:55.215
So, I still have new research
to talk about.

00:14:55.216 --> 00:14:57.622
And it's content analysis

00:14:57.623 --> 00:15:00.221
and in this case I'm focusing on
intrinsic factors

00:15:00.222 --> 00:15:02.666
intrinsic to the messages

00:15:02.667 --> 00:15:05.506
about the topic,
and the type of exchange.

00:15:05.973 --> 00:15:07.513
If it's a reply,
if it's a public post

00:15:07.514 --> 00:15:09.923
and how that influences
the language choice as well.

00:15:11.331 --> 00:15:12.331
And finally I will...

00:15:14.194 --> 00:15:17.193
I'm going to give you my reflections

00:15:17.996 --> 00:15:23.204
so I can invite your thoughts
and suggestions and discussions about it.

00:15:27.097 --> 00:15:29.098
Briefly, I'm going to start
with the sampling

00:15:29.099 --> 00:15:32.077
so I can talk about the rest
of the research.

00:15:34.959 --> 00:15:36.654
So my focus is on multilingual users.

00:15:36.655 --> 00:15:39.782
How did I identify multilingual users
on Twitter?

00:15:42.044 --> 00:15:43.784
It was giving me a headache.

00:15:44.058 --> 00:15:46.691
Finally what we decided...

00:15:48.542 --> 00:15:50.240
this research has been--

00:15:51.458 --> 00:15:54.070
I have always had the help
of Jennifer Golbeck,

00:15:54.071 --> 00:15:55.077
she was my adviser.

00:15:55.078 --> 00:15:57.296
And I did this with her help.

00:15:57.978 --> 00:16:01.715
So what we did, was gather a list
of what are called <i>stopwords</i>.

00:16:02.813 --> 00:16:05.419
From different languages
and you have a list over there.

00:16:06.176 --> 00:16:10.451
And then the <i>stopword</i> lists
you can find them on the internet.

00:16:10.452 --> 00:16:13.055
They are created
for computational linguistics

00:16:13.056 --> 00:16:15.560
so they use it for filtering purposes.

00:16:16.783 --> 00:16:19.103
And they are common words
in a language.

00:16:19.791 --> 00:16:21.356
Very common words in a language.

00:16:21.357 --> 00:16:25.729
So, sometimes they're used precisely
for eliminating them from texts

00:16:25.730 --> 00:16:28.689
when they're in, for example,
searches in Google

00:16:28.690 --> 00:16:32.952
they eliminate the stopwords,
the stopwords that you type

00:16:32.952 --> 00:16:34.612
in the search.

00:16:35.314 --> 00:16:37.797
But in this case I wanted
to find the stopwords

00:16:37.798 --> 00:16:41.003
that are very common in the language
to represent the language.

00:16:41.004 --> 00:16:44.103
And so we had to select words
that were not written the same

00:16:44.104 --> 00:16:45.734
as in another language.

00:16:46.121 --> 00:16:48.425
Sometimes they could be confusing
and ambiguous.

00:16:51.491 --> 00:16:54.029
Then I typed in Google...

00:16:55.954 --> 00:16:59.029
one word in one language
and one word in another language.

00:16:59.030 --> 00:17:01.372
Usually I was always using
one English word

00:17:01.373 --> 00:17:04.006
and one word in a different language.

00:17:04.729 --> 00:17:08.310
And I looked in the Twitter domain.

00:17:08.901 --> 00:17:12.185
So the search results from Google
will give me the profiles

00:17:13.105 --> 00:17:19.866
of people on Twitter that in theory
wrote messages in both languages.
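
NOTE
A minimal sketch of the query-building step described above, in Python.
Everything concrete here is an assumption made for illustration: the tiny
stopword lists, the function names, and the exact "site:twitter.com" query
form are hypothetical, since the talk gives neither the real lists nor the
query syntax. It only shows the idea of dropping words shared across
languages and pairing one English word with one word from another language.
```python
from itertools import product
# Tiny hypothetical stopword lists; real lists are much longer.
STOPWORDS = {
    "en": {"the", "and", "with", "no"},
    "es": {"el", "y", "con", "no"},
    "de": {"der", "und", "mit"},
}
def unambiguous(stopwords):
    """Keep only words that occur in exactly one language's list."""
    counts = {}
    for words in stopwords.values():
        for w in words:
            counts[w] = counts.get(w, 0) + 1
    return {lang: sorted(w for w in words if counts[w] == 1)
            for lang, words in stopwords.items()}
def build_queries(stopwords, pivot="en", site="twitter.com"):
    """Pair one pivot-language word with one word from each other language."""
    clean = unambiguous(stopwords)
    queries = []
    for lang, words in clean.items():
        if lang == pivot:
            continue
        for a, b in product(clean[pivot], words):
            # Assumed query form: restrict to one site, require both words.
            queries.append(f'site:{site} "{a}" "{b}"')
    return queries
```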

00:17:20.123 --> 00:17:24.242
We had to do a lot of hand-combing
to actually see if it was in two languages

00:17:24.542 --> 00:17:27.563
or it was just that they were mentioning
an English song

00:17:28.352 --> 00:17:31.654
the title of an English song
but they had no English in the rest.

00:17:31.655 --> 00:17:35.662
So we had to ensure
that they were authoring tweets

00:17:35.663 --> 00:17:37.575
in two languages.

00:17:37.911 --> 00:17:39.996
So writing them, not just retweeting them

00:17:39.997 --> 00:17:43.202
they were not just automatic postings
from Facebook.

00:17:43.203 --> 00:17:48.275
So we had a long set of criteria
a lot of manual combing

00:17:48.276 --> 00:17:52.949
and then finally we selected
92 multilingual users

00:17:52.950 --> 00:17:57.694
and in total they used 19 languages,
2 or 3 languages per person.

00:18:00.989 --> 00:18:04.956
Now, I don't know if you want to ask
some questions about the sampling

00:18:04.957 --> 00:18:07.607
because there's a lot of details about it.

00:18:13.392 --> 00:18:14.602
No doubts?

00:18:15.119 --> 00:18:16.585
Or maybe they'll come later!

00:18:19.137 --> 00:18:21.929
Now, how do I do
the social network analysis?

00:18:22.743 --> 00:18:27.869
Well, now I have my 92 multilingual users
technically they are called <i>the ego</i>

00:18:28.277 --> 00:18:30.190
of an <i>egocentric</i> network.

00:18:31.084 --> 00:18:33.005
This is the cell of my study.

00:18:33.836 --> 00:18:35.984
It started with the nucleus of the cell

00:18:35.985 --> 00:18:37.550
which is my multilingual user

00:18:37.551 --> 00:18:40.478
and then I go to Twitter

00:18:40.478 --> 00:18:43.439
and first of all I have extracted--

00:18:44.878 --> 00:18:47.416
so in this case my <i>ego</i>
is called <i>the Painter</i>

00:18:48.111 --> 00:18:53.500
and I have extracted the last 50 messages
that he posted on Twitter

00:18:53.501 --> 00:18:56.560
to see the languages
this person used-- is using.

00:18:57.156 --> 00:19:01.943
And I see that he is using English,
Spanish and Catalan.

00:19:02.945 --> 00:19:05.479
Catalan is a regional language in Spain

00:19:05.605 --> 00:19:07.736
and I have shown you on the map
the region before

00:19:07.737 --> 00:19:09.247
where the region was.

00:19:09.275 --> 00:19:12.474
And they speak both Catalan
and Spanish.

00:19:13.727 --> 00:19:16.825
So, this person is tweeting
in a minority language

00:19:16.826 --> 00:19:18.488
a national language

00:19:18.489 --> 00:19:20.779
and also an international one.

00:19:26.808 --> 00:19:31.754
So, I already found <i>the Painter</i>
and I know what languages this person speaks

00:19:31.754 --> 00:19:33.542
well, uses on Twitter,

00:19:33.543 --> 00:19:36.178
and then I extract
all the social networks.

00:19:36.179 --> 00:19:37.932
So, the followers on Twitter

00:19:37.933 --> 00:19:39.716
you know that on Twitter
you have followers

00:19:39.717 --> 00:19:41.160
and you follow people.

00:19:41.163 --> 00:19:42.513
I extracted both.

00:19:42.546 --> 00:19:48.323
The followers of <i>the Painter</i>
the people that are following him on Twitter

00:19:48.323 --> 00:19:52.118
and also how the friends
are connecting to each other.

00:19:52.289 --> 00:19:56.671
So, all of them, all of these dots
are the followers

00:19:56.672 --> 00:19:58.707
the people following <i>the Painter</i>
on Twitter

00:19:58.708 --> 00:20:02.788
and also I see how they connect
among each other, ok?

00:20:04.542 --> 00:20:08.837
So <i>the Painter</i> follows <i>Eduard</i>
in the center

00:20:10.153 --> 00:20:12.245
and it seems he's very popular.

00:20:13.567 --> 00:20:17.042
And then I extract the last 30 posts
of <i>Eduard--</i>

00:20:17.048 --> 00:20:18.509
there's a reason for that

00:20:18.510 --> 00:20:21.961
but it's
mostly a question of economics!

00:20:24.717 --> 00:20:25.717
I will tell you why!

00:20:25.718 --> 00:20:28.857
So I extracted the last 30 posts of <i>Eduard</i>

00:20:28.858 --> 00:20:31.966
and then I do
automatic language identification

00:20:31.967 --> 00:20:36.734
with the Google API
for language identification

00:20:38.548 --> 00:20:39.548
which costs money.

00:20:40.527 --> 00:20:43.282
So you have to really think
about how many posts you want to send

00:20:43.283 --> 00:20:45.580
to Google and how much money
you have available

00:20:45.581 --> 00:20:48.178
and what is the accuracy
you're going to have

00:20:48.179 --> 00:20:51.125
according to how many posts you send.

00:20:51.348 --> 00:20:53.268
There's a lot of testing going on there.

00:20:54.271 --> 00:20:58.482
I do the same with everybody
in the social network.

00:20:58.700 --> 00:20:59.893
I extract the last 30 posts

00:20:59.894 --> 00:21:02.340
use the Google identification

00:21:02.341 --> 00:21:08.086
build an algorithm that decides
based on the languages of these 30 posts

00:21:08.087 --> 00:21:11.929
is this person monolingual?
Is this person multilingual?

00:21:11.929 --> 00:21:13.221
Which languages?

00:21:13.222 --> 00:21:15.379
And then I labeled them, ok.

00:21:16.572 --> 00:21:18.746
This is just a visualization behind the--

00:21:20.280 --> 00:21:27.315
Perhaps person 1 is monolingual,
or bilingual in two languages.

00:21:31.985 --> 00:21:35.782
Now that I have all the friends
of <i>the Painter</i>

00:21:35.918 --> 00:21:37.392
how they connect,

00:21:37.392 --> 00:21:40.854
I color code them
depending on the languages they are using.

00:21:42.020 --> 00:21:44.669
And here, what you can see
is very interesting.

00:21:46.076 --> 00:21:48.735
I don't know if you can distinguish
the colors well

00:21:48.736 --> 00:21:53.949
because up here, this area,
that is like a triangle

00:21:53.950 --> 00:21:57.896
there's a group of users
writing in English.

00:21:58.743 --> 00:22:00.753
And it's pink.
Sort of pinkish.

00:22:00.753 --> 00:22:04.547
And then, down here
there's this Spanish group

00:22:04.548 --> 00:22:06.792
in light green.

00:22:07.544 --> 00:22:12.407
And, in the middle, the one
that perhaps can't be distinguished as well

00:22:12.408 --> 00:22:15.464
from the English,
is the Catalan group.

00:22:15.935 --> 00:22:18.962
So the users writing in Catalan
in dark blue.

00:22:19.776 --> 00:22:21.870
And then there's a set of violets
in between

00:22:21.871 --> 00:22:26.319
and these violets represent
the bilingual users

00:22:26.319 --> 00:22:29.292
either English and Catalan
or English and Spanish.

00:22:29.963 --> 00:22:33.031
And then there's darker green
around here,

00:22:33.031 --> 00:22:36.498
they are using both Catalan and Spanish.

00:22:36.498 --> 00:22:38.252
So there's a lot of bilinguals
going on.

00:22:38.252 --> 00:22:39.736
And there's an interesting dynamic

00:22:39.737 --> 00:22:42.710
in that you have this English group
up there

00:22:42.711 --> 00:22:44.060
and the Spanish group down here

00:22:44.061 --> 00:22:46.200
and the Catalan group in the middle.

00:22:46.201 --> 00:22:49.147
And this Catalan group is very mixed up
with the Spanish group

00:22:49.744 --> 00:22:52.184
which makes sense,
because it's a bilingual community.

00:23:01.121 --> 00:23:06.529
So, this is how I built the egocentric
network of my 92 multilingual users.

00:23:08.601 --> 00:23:10.987
<i>The Painter</i> is just one of them.
I have 92.

00:23:10.988 --> 00:23:16.575
I have 92 cells or <i>egocentric</i> networks
that I studied with my microscope.

00:23:17.868 --> 00:23:21.817
Do you want to ask some questions
about this process

00:23:21.818 --> 00:23:23.419
or this visualization?

00:23:25.051 --> 00:23:29.982
<i>(person 1) Of the bilingual units,
are they users or tweets?</i>

00:23:30.894 --> 00:23:32.056
They are users, yeah.

00:23:32.400 --> 00:23:35.560
So, the dots represent people.

00:23:35.561 --> 00:23:40.014
So, like <i>Eduard</i> here.
They represent people.

00:23:42.250 --> 00:23:45.317
Now, for each dot, to determine the language
and the color

00:23:45.318 --> 00:23:47.931
I extracted 30 posts.

00:23:48.434 --> 00:23:52.797
So, it's an interesting question
because the 30 posts

00:23:52.798 --> 00:23:55.958
have different language labels
assigned to them

00:23:56.096 --> 00:23:57.130
especially if they were bilingual

00:23:57.131 --> 00:24:01.643
and I had to decide which language label
I was going to assign to the user.

00:24:01.644 --> 00:24:05.383
So, I had to build an algorithm
with a set of rules

00:24:10.279 --> 00:24:11.346
basically saying--

00:24:11.347 --> 00:24:16.651
the Google identification system
would give me a language

00:24:16.652 --> 00:24:17.882
and a confidence level

00:24:17.882 --> 00:24:19.496
So if the confidence level was very low

00:24:19.497 --> 00:24:23.838
I would say "discard that"
because I had a series of heuristics

00:24:23.858 --> 00:24:30.113
based on both the number of tweets
using a particular language

00:24:30.113 --> 00:24:32.685
and also on the confidence level.

00:24:33.655 --> 00:24:38.267
And there are a lot
of technical challenges there as well.
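
NOTE
A minimal sketch of the kind of rule set described above, in Python.
It assumes each post has already been given a (language, confidence)
pair by some language-identification service; no real API is called here.
The function names and both thresholds (min_conf, min_share) are
hypothetical and are not the values used in the study. The sketch only
shows the two signals mentioned in the talk: discard low-confidence
detections, then decide from the number of posts per language.
```python
from collections import Counter
def label_user(detections, min_conf=0.5, min_share=0.2):
    """Return the sorted list of languages assigned to one user.
    detections: list of (language_code, confidence) pairs, one per post.
    min_conf and min_share are hypothetical thresholds, not the study's.
    """
    # Discard low-confidence detections, as described in the talk.
    kept = [lang for lang, conf in detections if conf >= min_conf]
    if not kept:
        return []
    counts = Counter(kept)
    total = len(kept)
    # A language counts only if it covers a large enough share of posts.
    return sorted(l for l, n in counts.items() if n / total >= min_share)
def describe(languages):
    """Map a list of assigned languages to a coarse user category."""
    if not languages:
        return "unknown"
    return "monolingual" if len(languages) == 1 else "multilingual"
```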

00:24:39.973 --> 00:24:41.948
<i>(woman) So, it's possible
that some of these posts</i>

00:24:41.949 --> 00:24:45.824
<i>many of these posts would be multilingual,
I'm sorry, monolingual in one language or the other?</i>

00:24:46.498 --> 00:24:51.988
<i>So it's also possible that some
of these individual posts</i>

00:24:51.989 --> 00:24:54.184
<i>would mix languages?</i>

00:24:54.623 --> 00:24:56.733
Yes, it is possible.
It's very possible!

00:24:57.063 --> 00:25:00.360
It's very challenging
for the automatic system!

00:25:01.915 --> 00:25:03.743
<i>(woman) Right, ok.
I just wanted to be clear--</i>

00:25:03.744 --> 00:25:05.185
Yes, exactly.

00:25:05.186 --> 00:25:11.303
So it's not as frequent as I expected,
having bilingual posts

00:25:11.304 --> 00:25:12.740
as I would call them.

00:25:12.741 --> 00:25:14.431
But it's happening.

00:25:15.058 --> 00:25:20.539
And so, for a series of tests,
I had to do manual combing

00:25:20.540 --> 00:25:23.263
and I saw that sometimes
it was the case

00:25:23.264 --> 00:25:26.718
that they were doing some sort
of translation in the same tweet

00:25:26.719 --> 00:25:31.585
and sometimes it was just the case
that they were mentioning titles of things

00:25:31.586 --> 00:25:34.206
or places in a different language.

00:25:34.563 --> 00:25:39.470
So, there's a lot of issues
surrounding the automatic handling of this

00:25:39.471 --> 00:25:44.478
but you are dealing with 92 networks

00:25:44.479 --> 00:25:50.864
and they have between 30
and 5,000 nodes in them.

00:25:52.708 --> 00:25:55.841
So, I don't remember the numbers exactly,

00:25:55.867 --> 00:25:59.148
but I'm talking about
around 80,000 people.

00:26:01.132 --> 00:26:04.527
So detecting the language of 80,000 people
and this is small-scale.

00:26:04.913 --> 00:26:08.286
If you go to millions,
you need an automatic system.

00:26:08.287 --> 00:26:11.291
And one of the things I'm having
to write up in my dissertation

00:26:11.292 --> 00:26:13.832
is what the challenges are.

00:26:13.833 --> 00:26:17.984
You have to be prepared for them,
to solve those problems.

00:26:18.551 --> 00:26:21.851
And one of them is what do you do
with bilingual posts

00:26:21.852 --> 00:26:23.920
which language do you assign to that post?

00:26:23.921 --> 00:26:28.287
Automatic posts, spam...
there's a lot of problems.

00:26:29.862 --> 00:26:31.219
Challenges, I mean.

00:26:31.220 --> 00:26:34.766
That's what makes it interesting
because you cannot do manual combing

00:26:34.766 --> 00:26:36.046
on these scales.

00:26:39.073 --> 00:26:41.013
Do you have another question?

00:26:44.501 --> 00:26:48.025
So, now, what am I doing with this?

00:26:50.562 --> 00:26:56.178
I'm going to classify my social networks,
looking at the patterns

00:26:56.179 --> 00:26:59.094
of overlaps between the language groups.

00:26:59.720 --> 00:27:01.953
And overlaps or intersections.

00:27:02.547 --> 00:27:07.878
I'm looking specifically at the networks
that have only two language groups.

00:27:08.219 --> 00:27:11.860
I had five of these networks
that were trilingual

00:27:12.284 --> 00:27:16.020
so I put them aside to go simple
first with just two language groups

00:27:16.021 --> 00:27:18.361
to see how they interconnect.

00:27:19.369 --> 00:27:21.272
And then I classified them

00:27:21.936 --> 00:27:24.198
first following a qualitative analysis

00:27:24.198 --> 00:27:28.822
and then I used network statistics
that I developed with my adviser

00:27:28.823 --> 00:27:30.386
for this purpose.

00:27:31.338 --> 00:27:33.693
And I will talk later a little more
about it.

00:27:34.341 --> 00:27:37.980
So, I tried to provide
more robust measures for that.

00:27:39.428 --> 00:27:44.074
I classified them and I came up
with some types.

00:27:45.922 --> 00:27:49.631
This is what I call <i>the gatekeeper</i>
language bridge type.

00:27:50.526 --> 00:27:52.995
And there's some variants of it,
obviously.

00:27:53.624 --> 00:27:55.990
What you can see here
is the network of a person

00:27:55.991 --> 00:28:00.092
and I'm going to assume this person
is in the United States

00:28:00.093 --> 00:28:02.350
and speaks both Spanish and English.

00:28:04.043 --> 00:28:05.684
Let's call her <i>Maria</i>.

00:28:05.927 --> 00:28:11.581
So she's Maria and she has two groups
of friends using Spanish on Twitter

00:28:12.531 --> 00:28:15.768
and then that big group of friends
using English.

00:28:17.320 --> 00:28:19.528
And, as you can see,
there's just a few nodes

00:28:19.529 --> 00:28:22.003
connecting the two language groups.

00:28:22.004 --> 00:28:27.869
You can see that the social structure
can be different from the language groups

00:28:29.391 --> 00:28:32.174
so you can have maybe a group of friends
and a group of coworkers

00:28:32.175 --> 00:28:36.424
inside the same language group,
so it can be more complex

00:28:36.425 --> 00:28:41.205
than just dividing the social network
by language groups.

00:28:41.206 --> 00:28:45.522
There can be more grouping
because of other social resources.

00:28:46.811 --> 00:28:50.572
But the interesting thing is that
there are only a few nodes

00:28:50.573 --> 00:28:53.455
where people are connecting
holding together these clusters.

00:28:55.058 --> 00:29:00.675
I think this was friends
with English here.

00:29:00.676 --> 00:29:05.461
You can see, in this case, it seems
like the two groups

00:29:05.462 --> 00:29:08.089
are holding closely together

00:29:08.809 --> 00:29:13.833
because there are many more links
holding the two groups together.

00:29:14.663 --> 00:29:18.246
Of course, this is going to depend
on the size of the networks

00:29:18.247 --> 00:29:23.067
so I had to account for the size
when coming up with measures

00:29:23.068 --> 00:29:25.943
with network connections

00:29:25.944 --> 00:29:28.257
I had to provide ratios.

00:29:28.258 --> 00:29:32.340
Now, the ratio of cross-language linking
here and here

00:29:32.341 --> 00:29:34.312
and you have these types--
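
NOTE
A minimal sketch of the kind of size-independent ratio described here, assuming
it means the share of links that join users of different languages. The function
name, the toy users and the rule of skipping users with no language data are
illustrative assumptions, not the speaker's actual measure.
```python
from typing import Dict, Iterable, Tuple
def cross_language_ratio(edges: Iterable[Tuple[str, str]],
                         language: Dict[str, str]) -> float:
    """Share of links that join two nodes with different known languages."""
    total = 0
    crossing = 0
    for a, b in edges:
        # Skip links that touch a node with no language data.
        if a not in language or b not in language:
            continue
        total += 1
        if language[a] != language[b]:
            crossing += 1
    return crossing / total if total else 0.0
# Hypothetical toy network: two Spanish users, two English users, one bridge.
language = {"ana": "es", "luis": "es", "bob": "en", "sue": "en"}
edges = [("ana", "luis"), ("bob", "sue"), ("luis", "bob"), ("ana", "zed")]
ratio = cross_language_ratio(edges, language)  # "zed" has no data and is skipped
```
Because the count of crossing links is divided by the count of all usable links,
networks of different sizes can be compared on the same scale.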

00:29:36.477 --> 00:29:40.266
These types are not just clear-cut.

00:29:40.346 --> 00:29:41.696
There's an evolution.

00:29:41.700 --> 00:29:43.337
There's people that have
very few connections

00:29:43.338 --> 00:29:44.653
with the language groups

00:29:44.654 --> 00:29:46.943
and then progressively there's people
with more and more.

00:29:47.704 --> 00:29:49.037
And this increases.

00:29:49.037 --> 00:29:52.048
Which points to the fact
that my sample is static.

00:29:52.735 --> 00:29:57.001
Which means I don't see the evolution
over time, ok?

00:29:57.819 --> 00:29:59.724
This is a limitation of my research.

00:29:59.725 --> 00:30:04.594
I just see the social network
of this person as it looked

00:30:04.594 --> 00:30:07.491
at a particular point in time.

00:30:07.925 --> 00:30:10.057
I don't know how it evolves over time.

00:30:10.058 --> 00:30:13.130
So, my sample, it's just static.

00:30:13.508 --> 00:30:18.702
It would be interesting
to see these different patterns

00:30:18.702 --> 00:30:20.771
that I have been observing.

00:30:20.771 --> 00:30:26.632
Maybe over time these connections
between languages may be increasing.

00:30:28.862 --> 00:30:32.131
Now we have the <i>integration
and union</i> type

00:30:32.693 --> 00:30:37.128
where in this case you have a person
from an Arab country

00:30:37.129 --> 00:30:40.778
and green represents the friends
that are using Arabic

00:30:40.779 --> 00:30:45.155
and the friends using English are in pink,
but there's also violet

00:30:45.156 --> 00:30:46.837
those are bilinguals.

00:30:47.196 --> 00:30:51.534
That means there's a group
of English users

00:30:51.535 --> 00:30:57.187
and bilingual English - Arabic users
inserted in the group of Arabic, inside.

00:30:59.530 --> 00:31:01.289
That's the integration,
so they're integrated.

00:31:02.419 --> 00:31:07.726
And then I have a Greek guy,
who uses Greek and English

00:31:07.726 --> 00:31:09.446
and his Arabic friends.

00:31:09.446 --> 00:31:11.935
And in this case, you can see
it's sort of light blue

00:31:11.936 --> 00:31:16.788
representing Greek, so the friends
that tweet in Greek

00:31:16.789 --> 00:31:20.729
Pink again represents people tweeting
in English

00:31:21.353 --> 00:31:23.426
and there's a lot of bilinguals.

00:31:23.449 --> 00:31:26.994
So these kind of dark blues
represent the bilinguals.

00:31:26.995 --> 00:31:28.604
And these are two groups

00:31:28.605 --> 00:31:32.741
that if you've seen before,
<i>the gatekeeper</i> and the language bridge

00:31:32.742 --> 00:31:35.281
progressively getting closer and closer

00:31:35.282 --> 00:31:40.990
with more and more links
across languages.

00:31:41.184 --> 00:31:42.815
In this case, this is like the extreme.

00:31:42.816 --> 00:31:46.016
The links between the two languages
are so dense

00:31:46.017 --> 00:31:51.021
that you almost cannot distinguish
where the border is

00:31:51.021 --> 00:31:53.128
between the two language groups.

00:31:53.164 --> 00:31:58.534
And, interestingly, the border might
even be noticeable only

00:31:58.534 --> 00:32:01.406
because there's a lot of bilinguals
around it.

00:32:02.091 --> 00:32:04.924
And this is the union type
where they unite.

00:32:07.201 --> 00:32:09.806
And finally, the <i>peripheral</i> language type.

00:32:09.807 --> 00:32:13.690
This is a Brazilian guy,
the network of a Brazilian guy

00:32:15.324 --> 00:32:16.892
where you have--

00:32:16.893 --> 00:32:18.885
probably he lives in the United States
or something like that--

00:32:18.886 --> 00:32:23.192
because this guy has mostly
all this big group of friends

00:32:23.226 --> 00:32:24.850
tweeting in English.

00:32:26.532 --> 00:32:31.978
And then there's the side tentacle
running outside, using Portuguese.

00:32:34.702 --> 00:32:36.399
And this is like a peripheral language.

00:32:36.400 --> 00:32:39.137
So, in the periphery there's a small group
of Portuguese language.

00:32:39.893 --> 00:32:45.233
Now, I forgot to mention that there's dots
that are light yellow or white.

00:32:45.286 --> 00:32:48.100
Those are the ones that have no data.

00:32:49.074 --> 00:32:51.270
So, I don't know
the language they're using

00:32:51.271 --> 00:32:53.382
because either their accounts are closed

00:32:53.383 --> 00:32:57.803
or for some reason, in between the collection
of data they closed the account.

00:32:59.307 --> 00:33:03.059
Mostly, the reason
is that they're private accounts

00:33:03.570 --> 00:33:05.640
where you cannot get the data from.

00:33:06.442 --> 00:33:08.755
I think somewhere I read
it was about 5 percent.

00:33:08.756 --> 00:33:10.216
I'm not sure.

00:33:10.216 --> 00:33:14.010
But for one reason or another,
I don't have that information.

00:33:16.563 --> 00:33:20.976
Now, why am I classifying them?
These networks?

00:33:22.785 --> 00:33:26.088
Well, the reason is that--

00:33:26.089 --> 00:33:28.793
well, there are some studies
that demonstrate that the social structure

00:33:28.794 --> 00:33:33.539
the structure of the social networks
influences the spread of information.

00:33:34.096 --> 00:33:36.457
How information disseminates
in the network.

00:33:38.553 --> 00:33:42.909
So, I'm just assuming
that these different structures

00:33:42.910 --> 00:33:46.382
are going to influence the spread
of information.

00:33:47.292 --> 00:33:49.750
But this is a study that has to be done.

00:33:49.929 --> 00:33:52.944
I cannot demonstrate that one
of these types

00:33:52.945 --> 00:33:55.681
facilitates the spread of information.

00:33:55.682 --> 00:34:02.330
I can only say that I am assuming,
so that potential study

00:34:04.200 --> 00:34:09.400
could just look at, for example,
if <i>gatekeeper</i> and <i>language bridges</i>

00:34:10.551 --> 00:34:16.231
are not as good for spreading information
as <i>union and integration</i> types.

00:34:20.178 --> 00:34:25.022
Right, we can just assume
because of the cross-language links

00:34:28.295 --> 00:34:33.380
so, how many links there are
or the ratio of cross-language links

00:34:33.380 --> 00:34:38.331
may potentially facilitate information
diffusion in these cases.

00:34:39.944 --> 00:34:42.557
So, that study needs to be done.

00:34:42.607 --> 00:34:44.732
I cannot say what's going to happen!

00:34:44.732 --> 00:34:47.123
I just assume it's going to be like that.

00:34:49.178 --> 00:34:52.009
So that is the reason why I classify them.

00:34:52.498 --> 00:34:54.599
I have some network statistics.

00:34:55.969 --> 00:35:00.753
We've made about an 80 percent accuracy
guess, which is quite good,

00:35:00.753 --> 00:35:02.453
but the sample is small.

00:35:08.014 --> 00:35:10.961
So now, do you have any more questions
before I move on to the next study?

00:35:13.726 --> 00:35:15.444
<i>(man) I was curious as to how many--</i>

00:35:15.444 --> 00:35:19.144
<i>what was the selection process like
to find the 92 users?</i>

00:35:20.324 --> 00:35:22.891
Well, this is what I was explaining
at the beginning,

00:35:22.892 --> 00:35:26.690
about just using two stopwords
from two different languages

00:35:26.691 --> 00:35:31.482
typing that in the search box in Google
and searching Twitter

00:35:31.482 --> 00:35:32.875
and then once--

00:35:32.876 --> 00:35:36.192
Basically you just go through
the list of results

00:35:36.193 --> 00:35:41.540
and start opening the profile,
counting the tweets.

00:35:42.327 --> 00:35:44.536
How many in this language,
how many in the other.

00:35:44.601 --> 00:35:46.640
And we put a threshold of 10 percent

00:35:46.640 --> 00:35:53.026
they had to have written 10 percent
of the tweets in a second language

00:35:53.228 --> 00:35:56.742
and you couldn't count retweets
or automatic posting.

00:35:57.937 --> 00:36:00.296
We also had to manually discard
these spammers.

00:36:01.535 --> 00:36:03.733
So, that was the process.
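
NOTE
A rough sketch of the 10 percent selection rule, assuming each tweet has already
been labelled by hand. The field names (lang, retweet, automatic) and the helper
tweet() are hypothetical; only the threshold and the exclusion of retweets and
automatic posts come from the talk.
```python
from collections import Counter
from typing import Dict, List
def tweet(lang: str, retweet: bool = False, automatic: bool = False) -> Dict:
    """Build one hand-labelled tweet record (hypothetical structure)."""
    return {"lang": lang, "retweet": retweet, "automatic": automatic}
def is_multilingual(tweets: List[Dict], threshold: float = 0.10) -> bool:
    """True if a second language covers at least `threshold` of original tweets."""
    # Retweets and automatic posts do not count towards either language.
    original = [t for t in tweets if not t["retweet"] and not t["automatic"]]
    if not original:
        return False
    counts = Counter(t["lang"] for t in original).most_common()
    if len(counts) < 2:
        return False
    second_language_count = counts[1][1]
    return second_language_count / len(original) >= threshold
bilingual = [tweet("en")] * 9 + [tweet("es")]
mostly_english = ([tweet("en")] * 19 + [tweet("es")]
                  + [tweet("es", retweet=True)] * 5)
```
In the second list the Spanish retweets are ignored, so Spanish covers only 1 of
20 original tweets and the profile falls below the threshold.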

00:36:06.151 --> 00:36:09.536
<i>(woman) And that's a paid search
through Google?</i>

00:36:10.131 --> 00:36:12.601
No, that we did manually

00:36:12.717 --> 00:36:14.087
and then once--

00:36:14.088 --> 00:36:20.392
So the other thing you can say is you can
use these core multilingual users

00:36:20.938 --> 00:36:23.929
and then do what I did for behavior
in these social networks

00:36:23.929 --> 00:36:29.363
which is once you extract the friends
and extract the messages of the friends

00:36:30.669 --> 00:36:33.559
and automatically find the language

00:36:34.035 --> 00:36:36.522
then you can say "Oh, this person
is multilingual" automatically.

00:36:36.522 --> 00:36:41.099
You just process it and you can detect
a lot more multilingual people

00:36:41.183 --> 00:36:42.756
through that process.

00:36:42.757 --> 00:36:46.101
The paid process was sending these posts

00:36:46.101 --> 00:36:49.075
to the Google language
identification tool.

00:36:49.885 --> 00:36:55.010
So, what I did was clean each message
automatically.

00:36:55.544 --> 00:37:00.387
Basically, eliminating the hashtags

00:37:01.437 --> 00:37:05.230
and the mentions
that had an <i>@</i> in front,

00:37:05.230 --> 00:37:10.074
symbols, URLs, all those things
I would automatically eliminate them

00:37:10.392 --> 00:37:13.777
and then with the rest of the message,
I'd send that to the Google API

00:37:14.125 --> 00:37:15.849
for language identification

00:37:16.009 --> 00:37:21.726
and the Google API would give me
a language label and a confidence value.

00:37:21.726 --> 00:37:23.476
And that for each message.
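
NOTE
A minimal sketch of the cleaning step, assuming simple regular expressions are
enough. The patterns and the function name are illustrative, not the speaker's
code, and the call to the language identification service is left out.
```python
import re
URL = re.compile(r"https?://\S+|www\.\S+")
MENTION_OR_HASHTAG = re.compile(r"[@#]\w+")
SYMBOLS = re.compile(r"[^\w\s']")
def clean_tweet(text: str) -> str:
    """Remove URLs, @mentions, #hashtags and symbols, then collapse whitespace."""
    text = URL.sub(" ", text)                 # URLs first, before symbols go
    text = MENTION_OR_HASHTAG.sub(" ", text)  # @user and #topic tokens
    text = SYMBOLS.sub(" ", text)             # punctuation and other symbols
    return " ".join(text.split())
cleaned = clean_tweet("@maria Buenos días a todos! #lunes http://t.co/abc123")
```
Only the remaining words would then be sent on for language identification.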

00:37:23.485 --> 00:37:26.371
And then I built the algorithm
with the help of Jen Golbeck

00:37:26.372 --> 00:37:30.688
to decide, well I have 30 messages,
500 English

00:37:30.714 --> 00:37:35.420
10 million Spanish and then one in Swahili
which is unlikely

00:37:36.728 --> 00:37:39.954
and you had to decide
the confidence value--

00:37:39.955 --> 00:37:42.935
So I used rules, defined rules

00:37:42.936 --> 00:37:45.559
but it could be done
statistically I think.

00:37:46.097 --> 00:37:48.388
And write some statistical method
to decide

00:37:48.389 --> 00:37:51.869
"well this person actually is bilingual"
or whatever.

00:37:52.779 --> 00:37:54.429
That's the process.
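
NOTE
One way the hand-written rules could look, as a sketch only. The two thresholds
and the example counts are hypothetical; the talk says only that rules were
defined to turn per-message labels and confidence values into a decision per
person.
```python
from collections import Counter
from typing import List, Tuple
def user_languages(detections: List[Tuple[str, float]],
                   min_confidence: float = 0.5,
                   min_share: float = 0.10) -> List[str]:
    """Languages a user is assumed to write in, from (label, confidence) pairs."""
    # Rule 1 (assumed): ignore detections the service was not confident about.
    trusted = [lang for lang, conf in detections if conf >= min_confidence]
    if not trusted:
        return []
    counts = Counter(trusted)
    # Rule 2 (assumed): keep a language only if it covers enough messages,
    # so a single stray label such as Swahili is dropped.
    return sorted(lang for lang, n in counts.items()
                  if n / len(trusted) >= min_share)
# Hypothetical user with 30 messages: 18 English, 11 Spanish, 1 labelled Swahili.
detections = [("en", 0.9)] * 18 + [("es", 0.8)] * 11 + [("sw", 0.9)]
languages = user_languages(detections)
```
A statistical method, as suggested in the talk, would replace the two fixed
thresholds with values estimated from data.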

00:37:54.477 --> 00:37:55.597
It's long!

00:37:55.788 --> 00:37:56.788
Yes.

00:37:58.026 --> 00:38:00.487
<i>(woman) Hi, I understand
that you did it manually</i>

00:38:00.488 --> 00:38:05.265
<i>but currently in existing research field
is there any software</i>

00:38:05.265 --> 00:38:08.489
<i>that we can use to capture,</i>

00:38:08.489 --> 00:38:11.935
<i>to have access to all
these different tweets?</i>

00:38:11.983 --> 00:38:15.400
<i>And to capture the different categories?
[inaudible]</i>

00:38:15.400 --> 00:38:18.472
Ok, so you mean the extraction?

00:38:18.912 --> 00:38:19.983
<i>(woman) Yeah.</i>

00:38:19.983 --> 00:38:21.226
No, I didn't do it manually.

00:38:21.227 --> 00:38:22.705
<i>(woman) And the other,
I think the other part</i>

00:38:22.706 --> 00:38:25.570
<i>of your data presentation
is visualizations coming out</i>

00:38:25.571 --> 00:38:27.132
<i>like this graph.</i>

00:38:27.132 --> 00:38:32.610
<i>Can you show us what kind of research
do we have for social scientists</i>

00:38:33.250 --> 00:38:35.478
<i>to present the data in a visual form?</i>

00:38:35.479 --> 00:38:37.461
This is a tool I would recommend.

00:38:37.461 --> 00:38:39.123
[inaudible]

00:38:39.123 --> 00:38:41.427
So, the first question.

00:38:42.572 --> 00:38:45.748
All the extraction from Twitter,
it was automatic.

00:38:46.265 --> 00:38:48.638
I didn't copy the tweets,
it was automatic.

00:38:48.855 --> 00:38:50.707
I used the Twitter API.

00:38:51.286 --> 00:38:54.849
They have a process
for registered developers

00:38:54.850 --> 00:38:57.205
and I extracted it automatically.

00:39:01.925 --> 00:39:05.777
Now, the tools, and I forgot
to put that in this slide

00:39:05.847 --> 00:39:09.444
but in the beginning,
when I showed you the first visualization

00:39:09.445 --> 00:39:11.605
I put the name of the tool in--

00:39:12.703 --> 00:39:17.644
I don't know if I translate well,
but I think it's G-E--

00:39:17.644 --> 00:39:23.785
You can see here, G-E-P-H-I,
I don't know how to pronounce it!

00:39:23.785 --> 00:39:26.997
["Jefy" I think...]

00:39:28.201 --> 00:39:32.216
So, this is the one I've used
for the visualizations

00:39:33.709 --> 00:39:36.871
and it's good because you can use it
on any platform.

00:39:36.872 --> 00:39:41.911
So both on a Mac or a PC or Linux.

00:39:44.829 --> 00:39:46.696
Now, it has limitations for...

00:39:47.209 --> 00:39:50.778
mostly for network statistics
in my opinion.

00:39:54.237 --> 00:39:57.061
The other one that is very popular
is NodeXL.

00:39:57.062 --> 00:40:00.548
And in fact it was developed
here in the HCI lab.

00:40:01.773 --> 00:40:04.092
In the lab where I work.

00:40:05.190 --> 00:40:06.937
So, they collaborated with Microsoft.

00:40:06.938 --> 00:40:09.867
It's a template for Excel

00:40:11.076 --> 00:40:12.552
and it allows--

00:40:12.553 --> 00:40:17.849
In fact they are still adding new features
and there's two people working on it

00:40:18.235 --> 00:40:19.665
in the lab.

00:40:19.739 --> 00:40:23.984
But the reason I haven't used it here,
is because I have a Mac

00:40:24.264 --> 00:40:29.166
and also there's another reason:
I like this positioning algorithm

00:40:31.302 --> 00:40:32.807
and this is...

00:40:32.808 --> 00:40:37.014
this is another issue
I haven't talked about

00:40:37.124 --> 00:40:40.476
is how you actually place the dots.

00:40:40.476 --> 00:40:47.182
And actually these algorithms for layout
use force-directed schemes

00:40:48.820 --> 00:40:50.507
like in physics.

00:40:50.584 --> 00:40:53.598
So if a node has a lot of links
with another node

00:40:53.599 --> 00:40:56.980
they put it closer,
so it's like there's forces

00:40:56.981 --> 00:41:00.276
or strings attaching the nodes.

00:41:00.858 --> 00:41:04.293
And depending on how many strings
there are, they're closer or farther.

00:41:04.605 --> 00:41:07.933
There are physics rules
for placing them.
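
NOTE
A toy spring-style layout to illustrate the idea, loosely following the
Fruchterman-Reingold scheme (repulsion k*k/d between all pairs, attraction d*d/k
along links). It is not the algorithm used in Gephi or NodeXL; the parameter
values and starting positions are arbitrary assumptions.
```python
import math
from typing import Dict, List, Tuple
Point = Tuple[float, float]
def distance(p: Point, q: Point) -> float:
    return math.hypot(p[0] - q[0], p[1] - q[1])
def force_layout(start: Dict[str, Point], edges: List[Tuple[str, str]],
                 k: float = 1.0, iterations: int = 50) -> Dict[str, Point]:
    """Every pair of nodes repels; linked nodes are also pulled together."""
    pos = {name: list(point) for name, point in start.items()}
    nodes = list(pos)
    for step in range(iterations):
        # Largest move allowed this round; it shrinks so the layout settles.
        max_move = 0.1 * (1 - step / iterations)
        push = {name: [0.0, 0.0] for name in nodes}
        def apply(a: str, b: str, attract: bool) -> None:
            dx, dy = pos[a][0] - pos[b][0], pos[a][1] - pos[b][1]
            d = math.hypot(dx, dy) or 1e-9
            # Attraction grows with distance, repulsion fades with it.
            force = -(d * d / k) if attract else (k * k / d)
            for name, sign in ((a, 1.0), (b, -1.0)):
                push[name][0] += sign * force * dx / d
                push[name][1] += sign * force * dy / d
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                apply(a, b, attract=False)
        for a, b in edges:
            apply(a, b, attract=True)  # the "strings" between linked nodes
        for name in nodes:
            length = math.hypot(push[name][0], push[name][1]) or 1e-9
            scale = min(length, max_move) / length
            pos[name][0] += push[name][0] * scale
            pos[name][1] += push[name][1] * scale
    return {name: (p[0], p[1]) for name, p in pos.items()}
# Three nodes, one link: "a" and "b" are tied together, "c" is unconnected.
start = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (0.5, 0.8)}
final = force_layout(start, [("a", "b")])
```
The linked pair stays close while the unconnected node drifts away, which is the
effect that separates loosely linked language groups on screen.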

00:41:07.959 --> 00:41:09.508
But there's different algorithms

00:41:09.509 --> 00:41:14.981
but the other reason I chose Gephi
is that it has an algorithm

00:41:15.336 --> 00:41:20.899
specifically in this tool
that places my language groups separately

00:41:20.943 --> 00:41:24.338
more than any other algorithm
that I could use in NodeXL.

00:41:24.339 --> 00:41:29.142
And it was more useful
to see the groups separated.

00:41:30.407 --> 00:41:33.186
But you can use both
depending on what you want to do.

00:41:33.187 --> 00:41:35.905
They both have weaknesses and strengths,

00:41:35.931 --> 00:41:38.847
different depending
on what you have to do.

00:41:40.592 --> 00:41:46.628
NodeXL has more features
for processing many networks

00:41:48.068 --> 00:41:51.147
and extracting network statistics
for many networks at the same time.

00:41:52.217 --> 00:41:57.372
And it has a lot of interesting features,
maybe this is more manual.

00:41:58.528 --> 00:41:59.998
I don't know.

00:42:00.215 --> 00:42:04.670
Somebody called it
"the Photoshop of visualization".

00:42:09.125 --> 00:42:13.580
So I'm going to briefly comment
on the factor analysis.

00:42:13.892 --> 00:42:18.627
The point here, what I want to see
is whether multilingual users of Twitter

00:42:20.784 --> 00:42:23.663
are aware of their audience in a way.

00:42:24.848 --> 00:42:29.480
And they somehow perceive
how many followers

00:42:29.480 --> 00:42:32.205
of this language or the other they have.

00:42:32.761 --> 00:42:35.501
Maybe not very consciously,

00:42:37.641 --> 00:42:39.763
but they perceive something.

00:42:39.932 --> 00:42:42.468
So, I wanted to see how this social network

00:42:42.469 --> 00:42:46.691
the fact that there's many languages
or just one in the social network

00:42:47.628 --> 00:42:52.814
can affect the choice of language in this person,
the <i>ego</i> person.

00:42:54.638 --> 00:42:57.734
So, I actually did a lot of testing,
different variables,

00:42:57.735 --> 00:43:01.434
but I'm just going to focus
on the essence,

00:43:01.434 --> 00:43:05.729
which is I have my dependent variable
which is the proportion of English

00:43:05.730 --> 00:43:11.064
used by the ego. Say the ego has 50 posts,
maybe 60 percent of them are in English

00:43:11.883 --> 00:43:14.409
and 40 percent in Spanish,
I don't know.

00:43:14.693 --> 00:43:18.630
And then they have the factor
of how many users in the network

00:43:18.631 --> 00:43:21.381
are in English
and how many are using other languages.

00:43:21.597 --> 00:43:24.274
And then the multilingual index
of the network

00:43:24.275 --> 00:43:26.153
- and this is my favorite part -

00:43:26.153 --> 00:43:29.674
because it's basically saying

00:43:29.774 --> 00:43:35.900
"is multilingualism encouraging English
as a lingua franca?"

00:43:37.026 --> 00:43:41.693
especially on Twitter, where we have these
public posts that anybody can read.

00:43:43.339 --> 00:43:47.418
So anyway... I'm not going to go
into the technical details

00:43:47.940 --> 00:43:50.516
of bi-nodal statistical interpretation.

00:43:50.517 --> 00:43:55.415
What I wanted to know is,
in these combined effects

00:43:56.046 --> 00:44:00.500
of the factors,
which one was more important?

00:44:00.998 --> 00:44:03.208
Was heavier than the others?

00:44:03.289 --> 00:44:07.340
Had more weight in defining this
proportion of English used by the ego.

00:44:08.750 --> 00:44:11.242
I tried other factors,

00:44:11.243 --> 00:44:14.237
I also looked at the use
of non-English language

00:44:15.370 --> 00:44:18.137
In the end... there are certain,

00:44:19.620 --> 00:44:21.423
I mean, they're obvious somehow.

00:44:21.424 --> 00:44:23.602
I think the process of what I've learned
is more interesting

00:44:23.603 --> 00:44:25.908
than the results themselves.

00:44:27.166 --> 00:44:30.031
Because basically what I've learned
is that, yeah,

00:44:31.040 --> 00:44:32.931
the use of English by the ego

00:44:32.931 --> 00:44:36.338
is encouraged by the use
of English in the network

00:44:36.338 --> 00:44:40.756
and in a certain way it's so important
that any other factor

00:44:40.757 --> 00:44:44.029
is really not that important.

00:44:45.231 --> 00:44:48.980
And even the second most important,
the multilingual index

00:44:49.770 --> 00:44:54.830
was so light compared with
the heavy impact of English

00:44:55.575 --> 00:44:57.107
used in the network.

00:44:57.608 --> 00:45:00.294
But what I thought was really interesting

00:45:00.295 --> 00:45:03.329
was how do you define
the multilinguality of a network?

00:45:03.968 --> 00:45:07.295
And with this I got help
from Jordan Boyd-Graber

00:45:07.296 --> 00:45:09.336
who is also in the iSchool

00:45:09.337 --> 00:45:14.331
and in the computational linguistics
and information processing lab

00:45:14.332 --> 00:45:15.332
here in Maryland.

00:45:15.333 --> 00:45:17.556
He helped me
with all these technical aspects.

00:45:18.183 --> 00:45:20.590
And he was the one suggesting
"Well, why don't you look--"

00:45:20.590 --> 00:45:24.620
"instead of just looking at the number
of languages in the network...

00:45:24.620 --> 00:45:28.694
"because sometimes you get
wrongly detected languages...

00:45:28.695 --> 00:45:30.231
like Swahili. Well, no one was really
speaking Swahili in this network.

00:45:33.201 --> 00:45:37.029
There were technical challenges,
like I explained to you.

00:45:38.122 --> 00:45:42.248
So maybe there's a high number
of languages in the network

00:45:42.249 --> 00:45:44.189
but the network is mostly monolingual.

00:45:44.190 --> 00:45:49.064
Mostly everybody uses English
and just a few people maybe use others

00:45:49.633 --> 00:45:52.337
or maybe it just got wrongly detected.

00:45:52.338 --> 00:45:54.810
And maybe you're just saying

00:45:54.811 --> 00:45:57.047
"Oh yeah, there's ten languages
in the network!"

00:45:57.048 --> 00:45:59.548
and actually it's not
a very multilingual network at all.

00:45:59.549 --> 00:46:02.650
So, we came up with this, the entropy.

00:46:03.390 --> 00:46:06.495
And this is a physics concept
that measures the disorder

00:46:06.496 --> 00:46:07.866
in a system.

00:46:07.866 --> 00:46:11.452
And in this case, the entropy
would be my multilingual index

00:46:11.453 --> 00:46:17.104
and what it's doing is providing a value
between 0 and 1

00:46:17.364 --> 00:46:23.105
So, with 0 it's a very homogeneous system
everyone speaks the same language

00:46:23.549 --> 00:46:26.900
and if it's closer to 1,
it's really heterogeneous

00:46:26.972 --> 00:46:28.911
and it places an importance

00:46:28.912 --> 00:46:31.823
on how many people
are using each language.

00:46:32.235 --> 00:46:36.480
So, this is the equation,
just to show you it.

00:46:38.009 --> 00:46:40.641
And it takes into account the number
of languages in the network

00:46:40.642 --> 00:46:45.427
and then one of the variables
is how many nodes there are in that language

00:46:45.498 --> 00:46:48.337
divided by the total number

00:46:48.338 --> 00:46:50.971
and this is what gives the proportion
for example.
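
NOTE
A minimal sketch of an entropy-based index, assuming the equation on the slide is
Shannon entropy over the language proportions, divided by the logarithm of the
number of languages so the value stays between 0 and 1. That normalization is an
assumption; the slide itself is not in the transcript.
```python
import math
from collections import Counter
from typing import Iterable
def multilingual_index(node_languages: Iterable[str]) -> float:
    """Normalized Shannon entropy of the language distribution, in [0, 1]."""
    counts = Counter(node_languages)
    total = sum(counts.values())
    if len(counts) < 2:
        return 0.0  # one language (or no data): fully homogeneous
    # Proportion of each language = nodes in that language / total nodes.
    entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
    return entropy / math.log(len(counts))
monolingual = multilingual_index(["en"] * 10)
balanced = multilingual_index(["en"] * 5 + ["es"] * 5)
stray_labels = multilingual_index(["en"] * 98 + ["es"] + ["sw"])
```
With 98 English nodes and two stray labels the index stays at about 0.1, even
though three languages were detected, which is the case a plain count of
languages gets wrong.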

00:46:52.889 --> 00:46:56.556
So just to let you know
that there's interesting lessons

00:46:56.557 --> 00:46:57.977
from this study.

00:46:57.982 --> 00:47:00.479
Despite the research not being exciting!

00:47:00.549 --> 00:47:02.881
And this is what I'm doing right now.

00:47:04.816 --> 00:47:08.002
So, the intrinsic characteristic
of the message

00:47:08.484 --> 00:47:11.038
how that influences the language choice.

00:47:11.062 --> 00:47:16.370
First, I'm wondering,
because I just saw it in the content

00:47:19.070 --> 00:47:22.495
are replies encouraging people
to use their native language?

00:47:22.992 --> 00:47:27.150
And public posts encouraging people
to use English as a lingua franca?

00:47:27.759 --> 00:47:30.251
This is one that showed up the same.

00:47:30.252 --> 00:47:34.151
And I changed the handle,
for privacy reasons...

00:47:34.549 --> 00:47:37.709
So this is the reply to somebody
and it's in Arabic.

00:47:38.443 --> 00:47:41.001
And this is a public posting
and it's in English.

00:47:42.414 --> 00:47:45.501
Now, the thing I'm looking at
is topic analysis

00:47:45.502 --> 00:47:50.314
and I'm considering with Jordan
to do some automatic topic analysis

00:47:50.706 --> 00:47:54.215
because there's many languages,
so I cannot code it all

00:47:54.782 --> 00:47:56.503
in many of them.

00:47:56.507 --> 00:47:58.459
Only in three, maybe four...

00:47:59.910 --> 00:48:01.406
So, I'm wondering,

00:48:01.407 --> 00:48:04.213
are technology topics favoring
the use of English?

00:48:04.600 --> 00:48:10.072
And other topics,
international news maybe?

00:48:11.308 --> 00:48:16.147
Whereas other topics
like national news or songs

00:48:16.148 --> 00:48:19.407
they might be encouraging the use
of native languages.

00:48:20.566 --> 00:48:22.904
And then I'm looking
if there's translations

00:48:22.904 --> 00:48:26.845
or if there's cross-cultural words
that you can detect.

00:48:27.324 --> 00:48:29.111
For instance, this person
is writing in English

00:48:29.112 --> 00:48:33.313
but is recommending a visit to a museum
in the city of Lille in France.

00:48:33.767 --> 00:48:38.830
So this person knows the city in France,
knows that to visit the museum

00:48:38.987 --> 00:48:40.556
you go there.

00:48:40.559 --> 00:48:43.089
And this is what I call
<i>cross-cultural</i> words.

00:48:44.239 --> 00:48:49.095
[What I kind of found] is that surprisingly
there's not many translation behaviors

00:48:49.096 --> 00:48:52.589
going on, despite these people
being multilingual.

00:48:53.001 --> 00:48:56.264
And this is what is going to trigger
some reflections.

00:49:00.289 --> 00:49:02.085
How am I doing on time?

00:49:04.172 --> 00:49:05.646
<i>(woman) 1:22.</i>

00:49:05.646 --> 00:49:10.050
<i>(man) Umm, it's usually an hour long...</i>

00:49:10.450 --> 00:49:14.358
So, I will go on with my reflections

00:49:14.358 --> 00:49:18.266
to encourage some thoughts.

00:49:18.266 --> 00:49:22.027
So the greatest connecting power
is the will of users who want

00:49:22.027 --> 00:49:23.317
to be connected.

00:49:23.317 --> 00:49:28.201
This is a really nice quality,
because the communities of interest

00:49:28.290 --> 00:49:32.012
in social media, in Twitter
are what is bringing people

00:49:32.013 --> 00:49:33.701
from different countries together.

00:49:34.794 --> 00:49:41.151
And also experiences,
like <i>the Voluntweeters</i>,

00:49:42.095 --> 00:49:45.815
so after the earthquake in Haiti,
there were these spontaneous

00:49:45.816 --> 00:49:48.972
self-organizations of Twitter users
for translating tweets

00:49:50.213 --> 00:49:53.755
and they called themselves <i>Voluntweeters</i>,
there's a paper about that--

00:49:53.826 --> 00:49:59.151
So this is the triggering
of social connections

00:50:00.820 --> 00:50:04.486
across countries, across borders
and across languages.

00:50:06.759 --> 00:50:10.300
But even when the social structure
could potentially facilitate

00:50:10.301 --> 00:50:13.375
information diffusion
and cross-language linking

00:50:14.558 --> 00:50:16.731
this condition is not sufficient.

00:50:16.732 --> 00:50:19.720
There are other factors
like the design of the interfaces

00:50:19.721 --> 00:50:22.479
and the design of systems
that can influence...

00:50:23.145 --> 00:50:27.438
can promote, or not, translation behaviors
and cross-cultural awareness.

00:50:28.293 --> 00:50:31.503
And in Wikipedia,
with cross-language linking,

00:50:31.504 --> 00:50:35.113
you have links for many languages
for every article.

00:50:37.257 --> 00:50:41.061
We should also acknowledge the dynamic
language preferences of multilingual users

00:50:41.790 --> 00:50:44.145
so they could address their messages
to the appropriate audience.

00:50:44.146 --> 00:50:47.187
I like the solution of Google+
with their circles

00:50:47.880 --> 00:50:51.890
where I can put my friends and family
in Spain in a circle

00:50:51.891 --> 00:50:54.559
and write to them in Spanish.

00:50:54.739 --> 00:51:00.633
And then the recommendation of people
based on language profile

00:51:01.437 --> 00:51:04.134
would be useful for this spontaneous
self-organization.

00:51:05.708 --> 00:51:08.057
So, these are some of the things.

00:51:08.143 --> 00:51:10.455
The impact of mediation.

00:51:10.782 --> 00:51:13.206
Global Voices is
an international community of bloggers

00:51:13.207 --> 00:51:18.303
that connects bloggers and citizens
from around the world

00:51:18.814 --> 00:51:20.504
in different languages.

00:51:21.171 --> 00:51:22.580
And Scott Hale

00:51:22.581 --> 00:51:27.353
a student from Oxford University
led a very interesting study

00:51:27.354 --> 00:51:33.960
after the earthquake in Haiti about blogs
in Spanish, Japanese and English

00:51:35.561 --> 00:51:38.542
and he looked
at the cross-language linking

00:51:38.543 --> 00:51:41.388
and focusing on this topic
over time.

00:51:41.488 --> 00:51:45.495
And he discovered that 50 percent
of the cross-language linking

00:51:45.496 --> 00:51:48.304
was happening through this platform,
Global Voices.

00:51:49.062 --> 00:51:51.941
So, it had a very big impact
on the language links.

00:51:54.170 --> 00:51:57.857
And finally, social media,
big media outlets,

00:51:57.858 --> 00:52:01.592
people are interconnected
in these complex networks

00:52:04.693 --> 00:52:08.945
and underlying is this language ecosystem.

00:52:09.058 --> 00:52:12.786
So we have the language ecosystem,
and on top of that

00:52:12.787 --> 00:52:15.296
we have the social media ecosystem.

00:52:15.305 --> 00:52:20.200
People would share a video from YouTube
on Twitter, or news on Facebook.

00:52:21.302 --> 00:52:26.011
What happens if we integrate
in this ecosystem

00:52:26.517 --> 00:52:30.518
these platforms, like Global Voices,
like Universal Subtitles

00:52:30.519 --> 00:52:34.327
which is a platform
for crowdsourcing subtitling of videos

00:52:34.328 --> 00:52:37.108
and translation of subtitles
for videos.

00:52:38.050 --> 00:52:42.222
If you integrate that and this
starts connecting, starts building paths

00:52:42.223 --> 00:52:45.743
between languages,
that didn't exist before.

00:52:45.744 --> 00:52:50.955
So I think we should make it easy
for multilingual people to translate

00:52:50.955 --> 00:52:55.187
and subtitle all the content they like,
their favorite content

00:52:56.003 --> 00:53:00.326
and share it with the appropriate audience
so they can start connecting

00:53:00.327 --> 00:53:03.114
the language islands of the internet.

00:53:03.145 --> 00:53:06.219
And that way stories will travel
all over the world.

00:53:09.204 --> 00:53:11.950
Particularly I would like to thank
Jen Golbeck, my adviser

00:53:11.951 --> 00:53:14.337
and Fulbright for supporting
this research.

00:53:14.477 --> 00:53:19.206
And then I open the space
for questions and your ideas

00:53:19.488 --> 00:53:21.780
if this has triggered some thoughts.

00:53:24.140 --> 00:53:25.972
<i>(woman) I have a question
about how this relates</i>

00:53:25.973 --> 00:53:28.112
<i>to your Yahoo award.</i>

00:53:29.468 --> 00:53:35.076
Well, they have the Internet Experiences
lab in California.

00:53:35.078 --> 00:53:36.428
And they--

00:53:36.460 --> 00:53:40.213
So, we tend to think
maybe it's a super tiny place

00:53:40.213 --> 00:53:42.630
but actually there are fields

00:53:42.631 --> 00:53:44.818
and I applied for the social systems.

00:53:45.121 --> 00:53:48.967
The social systems are a category.

00:53:49.068 --> 00:53:54.686
And I think that was embedded
in the Internet Experiences lab

00:53:56.739 --> 00:53:58.452
and yeah, they liked it.

00:53:58.516 --> 00:54:01.530
<i>(man) But is it this
work that they are interested in?</i>

00:54:01.813 --> 00:54:02.883
Yes.

00:54:02.884 --> 00:54:04.022
- <i>The languages?</i>
- Yes.

00:54:04.022 --> 00:54:07.726
Well, now I have results,
because I wrote up reports

00:54:09.496 --> 00:54:11.548
about what my work was about.

00:54:16.758 --> 00:54:17.968
<i>Great.</i>

00:54:22.055 --> 00:54:22.879
Yes?

00:54:22.879 --> 00:54:25.682
<i>(woman) I was thinking about
if you analyzed the place...</i>

00:54:25.682 --> 00:54:30.689
<i>like if there's any relationship
between tweeters and tweets</i>

00:54:31.056 --> 00:54:33.624
<i>and the place that the people are.</i>

00:54:35.883 --> 00:54:39.760
<i>I mean, because it's not the same
being a Brazilian in Brazil</i>

00:54:39.761 --> 00:54:43.197
<i>and tweeting in Portuguese
or being Brazilian in the US</i>

00:54:43.198 --> 00:54:45.330
<i>and tweeting in Portuguese--</i>

00:54:45.950 --> 00:54:49.249
There's many, many factors
that I haven't looked at.

00:54:50.126 --> 00:54:51.971
<i>It's not part of your study?</i>

00:54:52.300 --> 00:54:54.447
But because I had to scope it somehow.

00:54:54.448 --> 00:54:56.108
There's so many factors.

00:54:56.710 --> 00:54:59.993
Geography was one that I was originally
intending to look at

00:55:00.097 --> 00:55:04.458
but I found there were so many problems
to actually get the right geography

00:55:04.459 --> 00:55:06.652
the right geolocation.

00:55:08.154 --> 00:55:12.136
The problem is that I didn't originally
collect the geolocation.

00:55:12.137 --> 00:55:15.898
I think only a small percentage
of messages have...

00:55:16.457 --> 00:55:18.297
geolocated information.

00:55:18.902 --> 00:55:20.795
I'm not sure about the percentage there.

00:55:20.796 --> 00:55:24.690
So there's only a small percentage
of messages that have geolocation.

00:55:25.173 --> 00:55:27.604
There's issues with the accuracy...

00:55:28.041 --> 00:55:31.147
What I have collected is the information
in their profile

00:55:31.931 --> 00:55:35.462
they can put the information
about the place,

00:55:35.493 --> 00:55:39.572
but sometimes it's more
or less trustworthy,

00:55:39.573 --> 00:55:42.828
sometimes there's nothing,
and sometimes there's just crazy stuff.

00:55:43.210 --> 00:55:44.710
(audience laughs)

00:55:46.545 --> 00:55:49.735
So, something absolutely has to be there.

00:55:50.419 --> 00:55:55.249
If I wanted to expand this,
geography would be a nice place to go!

00:55:55.279 --> 00:55:56.609
<i>(woman) Ok.</i>

00:55:59.863 --> 00:56:00.631
Yes?

00:56:00.631 --> 00:56:01.710
<i>(man) Could you say a little bit more</i>

00:56:01.710 --> 00:56:04.946
<i>I think you said about the visualization
choices you made?</i>

00:56:04.964 --> 00:56:06.224
Oh yes, well...

00:56:08.033 --> 00:56:11.117
I tried this tool, NodeXL,

00:56:11.118 --> 00:56:13.284
I used both NodeXL and Gephi.

00:56:13.522 --> 00:56:14.522
There's more...

00:56:16.109 --> 00:56:20.202
I think there's, I don't remember the name
there's one that was developed

00:56:20.202 --> 00:56:21.854
here in Maryland

00:56:21.854 --> 00:56:24.163
but it's not as user-friendly.

00:56:26.108 --> 00:56:29.563
But I've forgotten the name,
I will have to look it up.

00:56:29.895 --> 00:56:33.872
And there's a lot of tools
that are for really technical people

00:56:34.696 --> 00:56:37.156
that are handling millions of nodes.

00:56:37.528 --> 00:56:40.615
Because with these tools,
for social scientists or humanists

00:56:40.615 --> 00:56:42.295
maybe they are not.

00:56:42.316 --> 00:56:48.685
Some tools can have maybe 300-400 nodes
and still be understandable.

00:56:51.115 --> 00:56:55.622
But if you go beyond that,
actually visualizations get crazy

00:56:56.058 --> 00:57:02.088
and even for more technical tools
for more technical people

00:57:02.563 --> 00:57:07.061
there are hundreds or millions,
they cannot do visualizations

00:57:08.349 --> 00:57:11.870
at some point they just give you
statistical measures.

00:57:13.729 --> 00:57:15.156
I have to leave it out.

00:57:15.156 --> 00:57:17.051
I have a list of tools and that

00:57:17.051 --> 00:57:20.598
but if I need the names,
I need to go through everything.

00:57:22.596 --> 00:57:25.479
<i>(woman) But yours was Mac-accessible?</i>

00:57:25.479 --> 00:57:31.585
Yes, this Gephi tool is Mac-accessible,
you can use it with Microsoft

00:57:31.792 --> 00:57:34.446
with Mac and with Linux.

00:57:35.905 --> 00:57:37.979
And I forgot to say,
it's open source.

00:57:43.480 --> 00:57:48.839
<i>(woman) Did you find
studying languages and internet</i>

00:57:48.840 --> 00:57:52.681
<i>was like a place, unexplored?</i>

00:57:52.948 --> 00:57:55.208
<i>Like here in the United States?</i>

00:57:55.378 --> 00:58:00.001
<i>Like when you began studying
or analyzing this</i>

00:58:00.002 --> 00:58:04.303
<i>you felt that a lot of people
are doing this</i>

00:58:04.303 --> 00:58:06.200
<i>or nobody is doing this</i>

00:58:06.200 --> 00:58:08.352
<i>and I'm the first one trying to--</i>

00:58:08.435 --> 00:58:13.114
I'm not the first one,
but it's a very new area

00:58:13.114 --> 00:58:14.971
to be exploring.

00:58:15.033 --> 00:58:16.983
So, it's very exciting
because of that.

00:58:17.012 --> 00:58:18.797
Because there's so many
unanswered questions

00:58:18.798 --> 00:58:23.785
and I find that surprisingly enough
the United States is not paying so much attention

00:58:23.786 --> 00:58:26.053
to multilinguality issues.

00:58:26.053 --> 00:58:31.002
And I think that language policies
are very monolingual-oriented

00:58:31.003 --> 00:58:32.948
but it's terrible

00:58:33.043 --> 00:58:37.182
because there's a whole lot
of multilinguality in this country.

00:58:37.183 --> 00:58:41.270
There's so many people
speaking different languages

00:58:42.548 --> 00:58:45.290
that I'm so amazed
about that contradiction.

00:58:45.780 --> 00:58:48.727
Because in Europe,
it's an obvious challenge for us

00:58:49.388 --> 00:58:51.907
because we need to understand each other
between all these countries

00:58:51.907 --> 00:58:53.567
of the European Union.

00:58:53.567 --> 00:58:58.499
And there's a lot of money invested
in research that relates to multilinguality

00:58:58.691 --> 00:59:00.738
and communication in languages

00:59:00.738 --> 00:59:04.557
and technology in particular,
cross-language systems

00:59:04.558 --> 00:59:09.030
and in libraries there's a lot of work
going on.

00:59:09.400 --> 00:59:13.942
There's investment in the research.

00:59:14.565 --> 00:59:18.405
So yeah, maybe in terms of investment

00:59:18.405 --> 00:59:22.115
the European Union is
not a bad place to be.

00:59:22.322 --> 00:59:24.109
Better than the United States!

00:59:24.110 --> 00:59:27.445
But at the same time,
what I find interesting

00:59:27.446 --> 00:59:33.323
is that here when I talk about it
people are really interested

00:59:35.313 --> 00:59:38.376
and interested in the subject
and excited about it.

00:59:38.458 --> 00:59:41.294
Maybe in Europe it looks more
like old news.

00:59:41.294 --> 00:59:43.796
Like "yeah, we already know that."

00:59:44.135 --> 00:59:45.665
(audience laughs)

00:59:45.674 --> 00:59:49.580
So I find that it's exciting
to be seeing the audience

00:59:49.629 --> 00:59:52.226
like "Oh yeah!"
It's so new.

00:59:52.666 --> 00:59:54.026
<i>(woman) Yes.</i>

00:59:58.653 --> 01:00:03.146
<i>(woman) As the emerging view
of research in the United States</i>

01:00:03.146 --> 01:00:09.892
<i>can you show me which institutions
or which area of academic institutions</i>

01:00:11.798 --> 01:00:14.748
<i>actually have more invested
in this topic in the US?</i>

01:00:16.262 --> 01:00:18.916
I'm not sure about the institutions.

01:00:20.572 --> 01:00:25.978
What I know, particularly,
in Indiana there's work

01:00:26.510 --> 01:00:29.107
because Susan Herring
is a researcher there.

01:00:30.797 --> 01:00:32.891
She has inspired my work.

01:00:32.891 --> 01:00:35.607
She published a book
<i>The Multilingual Internet</i>

01:00:35.687 --> 01:00:40.953
and she has done research on blogs,
also communities

01:00:41.891 --> 01:00:45.251
of different languages connecting blogs
in the blogosphere.

01:00:45.251 --> 01:00:51.058
So she has been one of the ones,
one of the first tackling these issues

01:00:51.144 --> 01:00:54.720
and she's still going
and she's doing something.

01:00:54.896 --> 01:00:59.399
So, it's the University of Indiana,
I think.

01:01:00.914 --> 01:01:03.348
Yeah, Susan Herring.
Look for her!

01:01:06.095 --> 01:01:09.181
And also at the same university
there's Paolillo.

01:01:10.156 --> 01:01:12.793
He's also doing research
in this area

01:01:12.826 --> 01:01:18.869
and he actually published for UNESCO
for research on language diversity

01:01:18.945 --> 01:01:20.275
on the internet.

01:01:21.785 --> 01:01:23.479
So Susan Herring and Paolillo,

01:01:23.480 --> 01:01:25.444
they are at the same university.

01:01:26.736 --> 01:01:30.058
Those are my inspiring ones.

01:01:33.682 --> 01:01:37.270
Well, at Harvard, the Berkman Center
for Internet and Society also did

01:01:37.270 --> 01:01:38.639
this mapping of the blogs.

01:01:38.640 --> 01:01:40.649
But they don't focus on languages.

01:01:41.700 --> 01:01:45.279
But there's a tangential thing
around there.

01:01:49.387 --> 01:01:51.428
<i>(man) One more question?</i>

01:01:53.560 --> 01:01:54.748
<i>Well, thank you very much!</i>

01:01:54.749 --> 01:01:55.749
Thanks!

01:01:55.759 --> 01:01:57.661
(audience applauds)