1
00:00:00,384 --> 00:00:01,891
Good morning, everyone.

2
00:00:02,858 --> 00:00:05,650
Thank you for coming here
[unclear] of the semester.

3
00:00:08,370 --> 00:00:10,217
So, I'm going to start.

4
00:00:11,001 --> 00:00:13,901
Access to the internet
is greater than ever before

5
00:00:14,183 --> 00:00:17,166
and as a consequence,
it's becoming more multilingual.

6
00:00:18,706 --> 00:00:22,613
However, there's evidence of segmentation
of cyberspace

7
00:00:22,614 --> 00:00:25,170
due to language and national borders.

8
00:00:28,011 --> 00:00:30,812
This image serves to illustrate that.

9
00:00:31,684 --> 00:00:35,656
This is the language communities
of Twitter in Europe.

10
00:00:36,562 --> 00:00:40,656
So, what you can see are tweets
geolocated over a map of Europe

11
00:00:40,657 --> 00:00:44,047
and the different colors
represent the different languages.

12
00:00:45,272 --> 00:00:50,905
You can even see regional languages
like Catalan in the Catalan region of Spain.

13
00:00:52,286 --> 00:00:56,037
And this is going to be useful
for an example I'm going to use later.

14
00:01:01,958 --> 00:01:04,209
I'm interested in Twitter in particular,

15
00:01:04,209 --> 00:01:07,312
because of the speed
of information dissemination

16
00:01:07,313 --> 00:01:10,744
and that most of this information
is publicly accessible.

17
00:01:13,743 --> 00:01:18,780
I'm going to illustrate this
with a capture

18
00:01:18,780 --> 00:01:22,280
of a dynamic visualization
you can find on the Twitter blog

19
00:01:22,281 --> 00:01:24,783
by Miguel Rios.

20
00:01:25,345 --> 00:01:28,981
And what you can see here
is the global flow of tweets

21
00:01:28,982 --> 00:01:31,205
after the earthquake in Japan.

22
00:01:32,147 --> 00:01:34,793
In pink, there are the tweets
coming out of Japan

23
00:01:34,794 --> 00:01:37,400
and, in green, the retweets
all over the world.

24
00:01:39,258 --> 00:01:44,966
This illustrates that in Twitter
information is spreading across countries.

25
00:01:46,180 --> 00:01:47,987
But how can this happen?

26
00:01:49,380 --> 00:01:55,018
Expatriates, migrants, minorities,
diaspora communities, language learners

27
00:01:55,028 --> 00:01:59,484
all play an important role
in building transnational networks

28
00:01:59,484 --> 00:02:02,549
and cultural bridges
between nations and communities.

29
00:02:03,847 --> 00:02:06,169
They are the multilingual users
on the internet.

30
00:02:07,512 --> 00:02:11,045
The overarching research question is:

31
00:02:11,047 --> 00:02:16,833
how are multilingual users of Twitter
connecting different language groups?

32
00:02:21,961 --> 00:02:27,157
In 2009, the Berkman Center for Internet
and Society at Harvard University

33
00:02:27,158 --> 00:02:29,773
mapped the Arabic blogosphere

34
00:02:29,774 --> 00:02:33,134
and they described a key concept
for my research.

35
00:02:35,020 --> 00:02:40,855
They discovered an English bridge
and a French bridge of bloggers

36
00:02:40,856 --> 00:02:45,780
that were writing in their native
Arabic language and in English or French.

37
00:02:47,430 --> 00:02:51,551
And they were connecting the different
national blogospheres

38
00:02:51,552 --> 00:02:53,034
with the international one.

39
00:02:55,387 --> 00:03:00,351
This might have played a role in the Arab
popular uprisings in 2011

40
00:03:00,352 --> 00:03:02,773
for reaching out to the world.

41
00:03:04,582 --> 00:03:09,223
And this is connected with a concept
that first appeared in 2008

42
00:03:09,224 --> 00:03:11,697
of the bridge bloggers.

43
00:03:13,601 --> 00:03:16,010
So, bridge bloggers are bloggers

44
00:03:16,011 --> 00:03:19,991
that are trying to connect
their local communities

45
00:03:19,992 --> 00:03:23,255
to a wider global audience.

46
00:03:25,190 --> 00:03:29,398
The image you can see here
is actually the visualization they created

47
00:03:29,399 --> 00:03:32,617
of mapping the Arabic blogosphere.

48
00:03:35,341 --> 00:03:37,370
Each dot is a blogger, or a blog.

49
00:03:38,763 --> 00:03:42,930
The size represents their popularity,
so how many incoming links they have

50
00:03:42,931 --> 00:03:45,457
and they grouped them--

51
00:03:45,457 --> 00:03:47,666
the neighborhoods they created

52
00:03:47,667 --> 00:03:51,582
in relation to the linking
between them.

53
00:03:52,587 --> 00:03:56,420
So, the ones that are grouped together
are linking among each other.

54
00:03:57,235 --> 00:03:59,454
The colors are a different question.

55
00:03:59,489 --> 00:04:06,067
The colors represent "attentive clusters",
that's what they call them.

56
00:04:06,067 --> 00:04:11,793
And they looked at the online resources
and media outlets

57
00:04:12,065 --> 00:04:13,911
these blogs were linking to.

58
00:04:15,055 --> 00:04:19,346
So, blogs of the same colors
are following the same media outlets

59
00:04:19,346 --> 00:04:21,100
and online resources.

60
00:04:21,165 --> 00:04:25,057
And they did human coding
to label those groups.

61
00:04:25,640 --> 00:04:30,282
And here is where we see
the label English bridge

62
00:04:30,283 --> 00:04:32,035
the responses from Cuba
in English

63
00:04:32,036 --> 00:04:33,950
and up there, there's [unclear] France.

64
00:04:35,774 --> 00:04:40,537
And so I think it's important to retain
the concept of attentive clusters.

65
00:04:43,797 --> 00:04:49,005
Now, let's go back to 2011
during the Arab popular uprisings.

66
00:04:50,444 --> 00:04:55,207
And I'll show you a visualization
of the influence network

67
00:04:55,208 --> 00:04:57,069
of Twitter users in Egypt.

68
00:04:58,210 --> 00:04:59,868
So, what you're seeing here

69
00:04:59,869 --> 00:05:04,348
just imagine people down the street
at Tahrir Square

70
00:05:04,349 --> 00:05:08,426
tweeting in Arabic about what's going on
on the ground.

71
00:05:08,427 --> 00:05:12,188
And those are the people in red.

72
00:05:12,652 --> 00:05:17,525
So, these red dots represent users
that are tweeting in Arabic.

73
00:05:18,309 --> 00:05:20,372
Then we have the international community

74
00:05:20,373 --> 00:05:25,515
or even Americans, British and so on
tweeting in English.

75
00:05:26,075 --> 00:05:28,287
And they are in blue,
those blue dots.

76
00:05:28,808 --> 00:05:33,052
And then, interestingly, we have
people in between them.

77
00:05:33,052 --> 00:05:38,152
which are illustrated in different
degrees of violet, or violet shades.

78
00:05:39,393 --> 00:05:43,417
This represents the fact that they
are tweeting in both Arabic and English.

79
00:05:45,006 --> 00:05:47,204
So, what we're seeing
is the bridge tweeters

80
00:05:47,996 --> 00:05:52,754
because, like Ethan Zuckerman called them
"bridge bloggers".

81
00:05:55,642 --> 00:05:58,520
So, another context.

82
00:05:59,333 --> 00:06:04,649
The same year, 2011, a lot
of big protests were going on in Europe.

83
00:06:05,172 --> 00:06:06,780
And in particular, in Spain.

84
00:06:06,781 --> 00:06:11,868
They started on May 15th, 2011;
there were massive protests.

85
00:06:13,002 --> 00:06:16,640
And because of this context,
this situation

86
00:06:16,640 --> 00:06:22,974
new attentive clusters were emerging
in the social media landscape of Spain.

87
00:06:28,240 --> 00:06:33,711
Now, this is a visualization you can find
in the <i>Socialflow</i> blog, research blog

88
00:06:33,806 --> 00:06:35,216
on social networks.

89
00:06:35,216 --> 00:06:41,235
And what it is, is it tracks the origin
and the initial spread

90
00:06:41,236 --> 00:06:45,347
of the hashtag <i>#occupywallstreet</i>
in Twitter.

91
00:06:46,987 --> 00:06:51,803
They detected that one of the first uses
of the hashtag <i>#occupywallstreet</i>

92
00:06:51,804 --> 00:06:57,093
was on July 13th 2011, linking to a blog
post of Adbusters.

93
00:06:58,344 --> 00:07:02,250
So you have the Twitter account
of Adbusters there, very big

94
00:07:02,251 --> 00:07:05,023
because it's being retweeted a lot.

95
00:07:05,966 --> 00:07:06,966
And mentioned a lot.

96
00:07:07,884 --> 00:07:13,514
And they collected these mentions
and the tweets that had these mentions

97
00:07:13,515 --> 00:07:17,298
and these retweets with the hashtag
during July 13th.

98
00:07:18,188 --> 00:07:20,501
From July 13th to July 23rd.

99
00:07:21,163 --> 00:07:23,523
So, from the first 10 days
of the use of this hashtag

100
00:07:23,524 --> 00:07:26,810
it was from the very beginning of the use
of this hashtag on Twitter.

101
00:07:31,012 --> 00:07:32,320
They just mapped the accounts

102
00:07:33,462 --> 00:07:39,459
and the series of posts with the hashtag
and mentions with the hashtag

103
00:07:40,652 --> 00:07:42,997
and the users that were connecting

104
00:07:42,998 --> 00:07:45,329
because of these mentions
and retweets.

105
00:07:46,136 --> 00:07:49,237
Now the interesting thing
in this visualization

106
00:07:49,238 --> 00:07:50,517
is that they

107
00:07:50,518 --> 00:07:54,348
the <i>Socialflow</i> people
particularly in [inaudible]

108
00:07:54,873 --> 00:07:59,885
detected that this Spanish branch
of users

109
00:07:59,886 --> 00:08:03,146
were forming an attentive cluster.

110
00:08:05,948 --> 00:08:09,282
Mentioning and retweeting about it
in Spanish

111
00:08:09,283 --> 00:08:13,042
using the hashtag in their messages
in Spanish.

112
00:08:14,104 --> 00:08:17,164
And they point out in the blog

113
00:08:17,865 --> 00:08:19,524
that this Spanish contingent

114
00:08:19,525 --> 00:08:24,189
helped post and spread the word
about Occupy Wall Street

115
00:08:24,190 --> 00:08:29,358
even before most of the United States
was aware of it.

116
00:08:32,239 --> 00:08:34,043
So, I found that very interesting.

117
00:08:34,043 --> 00:08:37,438
And it was due to the context
in Spain at that moment

118
00:08:38,287 --> 00:08:45,449
with big protests and new clusters
forming in the social media landscape.

119
00:08:56,673 --> 00:09:01,497
Now I have shown you the importance
of these multilingual users

120
00:09:01,498 --> 00:09:06,470
in connecting language communities
and spreading information

121
00:09:06,471 --> 00:09:09,228
across countries, acting as mediators.

122
00:09:11,172 --> 00:09:15,831
But let's focus on another aspect
of connecting language groups

123
00:09:15,832 --> 00:09:17,170
which is language choice.

124
00:09:17,881 --> 00:09:23,261
So I'm going to devote a moment
to speak about languages

125
00:09:23,262 --> 00:09:24,262
and language choice.

126
00:09:27,743 --> 00:09:31,064
To understand languages in the world

127
00:09:31,065 --> 00:09:33,478
I'm going to use a telescope.

128
00:09:37,358 --> 00:09:39,954
So de Swaan...

129
00:09:41,244 --> 00:09:44,565
...proposed a theory called
the world language system

130
00:09:44,566 --> 00:09:45,815
back in the 1990s

131
00:09:47,268 --> 00:09:50,867
to explain the languages in the world.

132
00:09:52,127 --> 00:09:55,428
And he used a very beautiful metaphor,
the constellation.

133
00:09:56,906 --> 00:10:01,510
So, in his theory there's about a dozen
languages in the world

134
00:10:01,511 --> 00:10:05,469
that are the hearts of the system,
or the suns.

135
00:10:06,493 --> 00:10:07,493
The suns of the system.

136
00:10:07,604 --> 00:10:11,015
For instance, English, French, Spanish,
Arabic and more.

137
00:10:12,324 --> 00:10:16,765
And then there are hundreds,
maybe more than 100, 200...

138
00:10:16,766 --> 00:10:22,707
national languages that are orbiting
around these suns like planets.

139
00:10:24,497 --> 00:10:28,393
And finally we have regional
and minority languages

140
00:10:28,394 --> 00:10:31,664
that are orbiting these planets
like satellites.

141
00:10:32,826 --> 00:10:37,606
And he used this metaphor
to explain the power relationships

142
00:10:37,607 --> 00:10:39,729
between languages.

143
00:10:40,172 --> 00:10:43,096
This is a theory of what he called

144
00:10:43,097 --> 00:10:46,659
"communication potential
and language competition"

145
00:10:48,408 --> 00:10:50,539
A key point he made

146
00:10:51,979 --> 00:10:55,379
is that the system holds together

147
00:10:55,440 --> 00:10:58,793
thanks to multilingual people
and interpreters.

148
00:11:00,291 --> 00:11:03,173
This is what's providing cohesion
to the system.

149
00:11:03,956 --> 00:11:06,959
He also made a controversial proposal

150
00:11:06,960 --> 00:11:11,340
about the communication potential
of a language.

151
00:11:12,290 --> 00:11:14,868
So, he proposed a formula,
a mathematical formula

152
00:11:14,869 --> 00:11:19,910
where he could estimate the communication
potential of a language

153
00:11:19,911 --> 00:11:24,889
and supposedly a person would choose
which language to learn and use

154
00:11:24,889 --> 00:11:27,554
based on its communication potential.

155
00:11:28,108 --> 00:11:34,206
For example, a person might decide
to learn English and use English

156
00:11:35,137 --> 00:11:41,414
because not only does it provide
communication with English native speakers

157
00:11:42,395 --> 00:11:46,377
but also, adding to that, it provides
the possibility to communicate

158
00:11:46,378 --> 00:11:50,073
with all the second-language learners
of English

159
00:11:50,074 --> 00:11:53,101
from many different languages,
many different countries.

160
00:11:53,102 --> 00:11:55,818
So, supposedly, in his theory

161
00:11:55,819 --> 00:11:59,686
English provides
the greatest communication potential.

162
00:12:01,464 --> 00:12:05,106
And he received some criticism,
because of the central role of English

163
00:12:05,107 --> 00:12:06,785
in his theory.

164
00:12:06,806 --> 00:12:10,139
He said it was the central hub
of all the system.

165
00:12:12,959 --> 00:12:20,043
There's also the language ecology paradigm
first proposed by Haugen in 1972

166
00:12:21,825 --> 00:12:25,243
and there's this idea of an ecosystem
of languages

167
00:12:25,244 --> 00:12:29,695
and, again, it's using another metaphor

168
00:12:29,696 --> 00:12:31,608
and because of this metaphor

169
00:12:31,609 --> 00:12:34,466
also appeared the idea
of endangered languages.

170
00:12:36,353 --> 00:12:39,880
I'm going to briefly just read
the definition.

171
00:12:39,881 --> 00:12:42,787
He defined the language ecology as:

172
00:12:42,788 --> 00:12:47,235
"the study of interactions between
any given language and its environment"

173
00:12:47,985 --> 00:12:49,207
and what I think is very important:

174
00:12:49,208 --> 00:12:53,690
"language exists only in the minds
of its users"

175
00:12:56,694 --> 00:12:59,956
which leads me to the point of my research.

176
00:13:01,528 --> 00:13:05,808
In my research, I'm using a microscope
to see the cells

177
00:13:05,809 --> 00:13:09,337
and my cells in my study
are the Twitter users.

178
00:13:12,192 --> 00:13:13,452
Why is that?

179
00:13:15,616 --> 00:13:19,906
Because as Haugen explains,
there's a psychological dimension

180
00:13:19,907 --> 00:13:21,811
to language ecology

181
00:13:21,812 --> 00:13:25,070
where language interacts
with other languages

182
00:13:25,071 --> 00:13:27,697
in the minds of multilingual people.

183
00:13:28,860 --> 00:13:31,574
And there's a sociological dimension
to language ecology

184
00:13:32,425 --> 00:13:38,379
where we use language to communicate
and interact with other people.

185
00:13:38,380 --> 00:13:43,860
And this language ecology is generated
by the people

186
00:13:43,861 --> 00:13:46,493
that decide to use that language

187
00:13:46,494 --> 00:13:50,337
learning and interacting
with people using it.

188
00:13:51,139 --> 00:13:54,758
And this is the point
of language choice in languages.

189
00:13:55,643 --> 00:13:59,827
So, I focus on the connections of people
and the language choice.

190
00:14:04,231 --> 00:14:07,595
So, these are the four points
I'm going to be speaking about.

191
00:14:07,862 --> 00:14:13,015
But actually the main focus
is going to be the first point:

192
00:14:13,483 --> 00:14:19,196
social network analysis and the taxonomy
of intersections between language groups.

193
00:14:20,267 --> 00:14:23,376
This is where I'm going to be spending
most of the time.

194
00:14:23,377 --> 00:14:26,820
And then very briefly,
just for completeness

195
00:14:26,821 --> 00:14:29,551
I'm going to speak about another
small study that I did

196
00:14:29,552 --> 00:14:30,861
the factor analysis

197
00:14:30,862 --> 00:14:34,485
looking at the influence
of the social network

198
00:14:34,486 --> 00:14:39,007
in the language choices of the users.

199
00:14:39,236 --> 00:14:41,737
So, how the social network
influences language choice

200
00:14:41,738 --> 00:14:43,400
of our multilingual users.

201
00:14:45,128 --> 00:14:50,399
And then I'm going to briefly also talk
about the last study of my dissertation

202
00:14:50,400 --> 00:14:52,492
that is still ongoing.

203
00:14:52,992 --> 00:14:55,215
So, I still have new research
to talk about.

204
00:14:55,216 --> 00:14:57,622
And it's content analysis

205
00:14:57,623 --> 00:15:00,221
and in this case I'm focusing on
intrinsic factors

206
00:15:00,222 --> 00:15:02,666
intrinsic to the messages

207
00:15:02,667 --> 00:15:05,506
about the topic,
and the type of exchange.

208
00:15:05,973 --> 00:15:07,513
If it's a reply,
if it's a public post

209
00:15:07,514 --> 00:15:09,923
and how that influences
the language choice as well.

210
00:15:11,331 --> 00:15:12,331
And finally I will...

211
00:15:14,194 --> 00:15:17,193
I'm going to give you my reflections

212
00:15:17,996 --> 00:15:23,204
so I can invite your thoughts
and suggestions and discussions about it.

213
00:15:27,097 --> 00:15:29,098
Briefly, I'm going to start
with the sampling

214
00:15:29,099 --> 00:15:32,077
so I can talk about the rest
of the research.

215
00:15:34,959 --> 00:15:36,654
So my focus is on multilingual users,

216
00:15:36,655 --> 00:15:39,782
how did I identify multilingual users
on Twitter?

217
00:15:42,044 --> 00:15:43,784
It was giving me a headache.

218
00:15:44,058 --> 00:15:46,691
Finally what we decided...

219
00:15:48,542 --> 00:15:50,240
this research has been--

220
00:15:51,458 --> 00:15:54,070
I have always had the help
of Jennifer Golbeck,

221
00:15:54,071 --> 00:15:55,077
she was my adviser.

222
00:15:55,078 --> 00:15:57,296
And I did this with her help.

223
00:15:57,978 --> 00:16:01,715
So what we did was gather a list
of what are called <i>stopwords</i>.

224
00:16:02,813 --> 00:16:05,419
From different languages
and you have a list over there.

225
00:16:06,176 --> 00:16:10,451
And then the <i>stopword</i> lists
you can find them on the internet.

226
00:16:10,452 --> 00:16:13,055
They are created
for computational linguistics

227
00:16:13,056 --> 00:16:15,560
so they use it for filtering purposes.

228
00:16:16,783 --> 00:16:19,103
And they are common words
in a language.

229
00:16:19,791 --> 00:16:21,356
Very common words in a language.

230
00:16:21,357 --> 00:16:25,729
So, sometimes they're used precisely
for eliminating them from texts

231
00:16:25,730 --> 00:16:28,689
when they're in, for example,
searches in Google

232
00:16:28,690 --> 00:16:32,952
they eliminate the stopwords,
the stopwords that you type

233
00:16:32,952 --> 00:16:34,612
in the search.

234
00:16:35,314 --> 00:16:37,797
But in this case I wanted
to find the stopwords

235
00:16:37,798 --> 00:16:41,003
that are very common in the language
to represent the language.

236
00:16:41,004 --> 00:16:44,103
And so we had to select words
that were not written the same

237
00:16:44,104 --> 00:16:45,734
as in another language.

238
00:16:46,121 --> 00:16:48,425
Sometimes they could be confusing
and ambiguous.

239
00:16:51,491 --> 00:16:54,029
Then I typed in Google...

240
00:16:55,954 --> 00:16:59,029
one word in one language
and one word in another language.

241
00:16:59,030 --> 00:17:01,372
Usually I was always using
one English word

242
00:17:01,373 --> 00:17:04,006
and one word in a different language.

243
00:17:04,729 --> 00:17:08,310
And I looked in the Twitter domain.

244
00:17:08,901 --> 00:17:12,185
So the search results from Google
will give me the profiles

245
00:17:13,105 --> 00:17:19,866
of people on Twitter that in theory
wrote messages in both languages.

246
00:17:20,123 --> 00:17:24,242
We had to do a lot of hand-combing
to actually see if it was in two languages

247
00:17:24,542 --> 00:17:27,563
or it was just that they were mentioning
an English song

248
00:17:28,352 --> 00:17:31,654
the title of an English song
but they had no English in the rest.

249
00:17:31,655 --> 00:17:35,662
So we had to ensure
that they were authoring tweets

250
00:17:35,663 --> 00:17:37,575
in two languages.

251
00:17:37,911 --> 00:17:39,996
So writing them, not just retweeting them

252
00:17:39,997 --> 00:17:43,202
they were not just automatic postings
from Facebook.

253
00:17:43,203 --> 00:17:48,275
So we had a long set of criteria,
a lot of manual combing

254
00:17:48,276 --> 00:17:52,949
and then finally we selected
92 multilingual users

255
00:17:52,950 --> 00:17:57,694
and in total they used 19 languages,
2 or 3 languages per person.

256
00:18:00,989 --> 00:18:04,956
Now, I don't know if you want to ask
some questions about the sampling

257
00:18:04,957 --> 00:18:07,607
because there's a lot of details about it.

258
00:18:13,392 --> 00:18:14,602
No doubts?

259
00:18:15,119 --> 00:18:16,585
Or maybe they'll come later!

260
00:18:19,137 --> 00:18:21,929
Now, how do I do
the social network analysis?

261
00:18:22,743 --> 00:18:27,869
Well, now I have my 92 multilingual users;
technically they are called <i>the ego</i>

262
00:18:28,277 --> 00:18:30,190
of an <i>egocentric</i> network.

263
00:18:31,084 --> 00:18:33,005
This is the cell of my study.

264
00:18:33,836 --> 00:18:35,984
It started with the nucleus of the cell

265
00:18:35,985 --> 00:18:37,550
which is my multilingual user

266
00:18:37,551 --> 00:18:40,478
and then I go to Twitter

267
00:18:40,478 --> 00:18:43,439
and first of all I have extracted--

268
00:18:44,878 --> 00:18:47,416
so in this case my <i>ego</i>
is called <i>the Painter</i>

269
00:18:48,111 --> 00:18:53,500
and I have extracted the last 50 messages
that he posted on Twitter

270
00:18:53,501 --> 00:18:56,560
to see the languages
this person used-- is using.

271
00:18:57,156 --> 00:19:01,943
And I see that he is using English,
Spanish and Catalan.

272
00:19:02,945 --> 00:19:05,479
Catalan is a regional language in Spain

273
00:19:05,605 --> 00:19:07,736
and I have shown you on the map
the region before

274
00:19:07,737 --> 00:19:09,247
where the region was.

275
00:19:09,275 --> 00:19:12,474
And they speak both Catalan
and Spanish.

276
00:19:13,727 --> 00:19:16,825
So, this person is tweeting
in a minority language

277
00:19:16,826 --> 00:19:18,488
a national language

278
00:19:18,489 --> 00:19:20,779
and also international.

279
00:19:26,808 --> 00:19:31,754
So, I already found <i>the Painter</i>
and I know what languages this person speaks

280
00:19:31,754 --> 00:19:33,542
well, uses on Twitter,

281
00:19:33,543 --> 00:19:36,178
and then I extract
all the social networks.

282
00:19:36,179 --> 00:19:37,932
So, the followers on Twitter

283
00:19:37,933 --> 00:19:39,716
you know that on Twitter
you have followers

284
00:19:39,717 --> 00:19:41,160
and you follow people.

285
00:19:41,163 --> 00:19:42,513
I extracted both.

286
00:19:42,546 --> 00:19:48,323
The followers of <i>the Painter</i>
the people that are following him on Twitter

287
00:19:48,323 --> 00:19:52,118
and also how the friends
are connecting to each other.

288
00:19:52,289 --> 00:19:56,671
So, all of them, all of these dots
are the followers

289
00:19:56,672 --> 00:19:58,707
the people following <i>the Painter</i>
on Twitter

290
00:19:58,708 --> 00:20:02,788
and also I see how they connect
among each other, ok?

291
00:20:04,542 --> 00:20:08,837
So <i>the Painter</i> follows <i>Eduard</i>
in the center

292
00:20:10,153 --> 00:20:12,245
and it seems he's very popular.

293
00:20:13,567 --> 00:20:17,042
And then I extract the last 30 posts
of <i>Eduard--</i>

294
00:20:17,048 --> 00:20:18,509
there's a reason for that

295
00:20:18,510 --> 00:20:21,961
but it is mostly
a question of economics!

296
00:20:24,717 --> 00:20:25,717
I will tell you why!

297
00:20:25,718 --> 00:20:28,857
So I extracted the last 30 posts of <i>Eduard</i>

298
00:20:28,858 --> 00:20:31,966
and then I do
automatic language identification

299
00:20:31,967 --> 00:20:36,734
with the Google API
for language identification

300
00:20:38,548 --> 00:20:39,548
which costs money.

301
00:20:40,527 --> 00:20:43,282
So you have to really think
about how many posts you want to send

302
00:20:43,283 --> 00:20:45,580
to Google and how much money
you have available

303
00:20:45,581 --> 00:20:48,178
and what is the accuracy
you're going to have

304
00:20:48,179 --> 00:20:51,125
according to how many posts you send.

305
00:20:51,348 --> 00:20:53,268
There's a lot of testing going on there.

306
00:20:54,271 --> 00:20:58,482
I do the same with everybody
in the social network.

307
00:20:58,700 --> 00:20:59,893
I extract the last 30 posts

308
00:20:59,894 --> 00:21:02,340
use the Google identification

309
00:21:02,341 --> 00:21:08,086
build an algorithm that decides
based on the languages of these 30 posts

310
00:21:08,087 --> 00:21:11,929
is this person monolingual?
Is this person multilingual?

311
00:21:11,929 --> 00:21:13,221
Which languages?

312
00:21:13,222 --> 00:21:15,379
And then I labeled them, ok.

313
00:21:16,572 --> 00:21:18,746
This is just a visualization behind the--

314
00:21:20,280 --> 00:21:27,315
Perhaps person 1 is monolingual,
or bilingual in two languages.

315
00:21:31,985 --> 00:21:35,782
Now that I have all the friends
of <i>the Painter</i>

316
00:21:35,918 --> 00:21:37,392
how they connect,

317
00:21:37,392 --> 00:21:40,854
I color code them
depending on the languages they are using.

318
00:21:42,020 --> 00:21:44,669
And here, what you can see
is very interesting.

319
00:21:46,076 --> 00:21:48,735
I don't know if you can distinguish
the colors well

320
00:21:48,736 --> 00:21:53,949
because up here, this area,
that is like a triangle

321
00:21:53,950 --> 00:21:57,896
there's a group of users
writing in English.

322
00:21:58,743 --> 00:22:00,753
And it's pink.
Sort of pinkish.

323
00:22:00,753 --> 00:22:04,547
And then, down here
there's this Spanish group

324
00:22:04,548 --> 00:22:06,792
in light green.

325
00:22:07,544 --> 00:22:12,407
And, in the middle, the one
that perhaps doesn't distinguish as well

326
00:22:12,408 --> 00:22:15,464
from the English,
is the Catalan group.

327
00:22:15,935 --> 00:22:18,962
So the users writing in Catalan
in dark blue.

328
00:22:19,776 --> 00:22:21,870
And then there's a set of violets
in between

329
00:22:21,871 --> 00:22:26,319
and these violets represent
the bilingual users

330
00:22:26,319 --> 00:22:29,292
either English and Catalan
or English and Spanish.

331
00:22:29,963 --> 00:22:33,031
And then there's darker green
around here,

332
00:22:33,031 --> 00:22:36,498
they are using both Catalan and Spanish.

333
00:22:36,498 --> 00:22:38,252
So there's a lot of bilinguals
going on.

334
00:22:38,252 --> 00:22:39,736
And there's an interesting dynamic

335
00:22:39,737 --> 00:22:42,710
in that you have this English group
up there

336
00:22:42,711 --> 00:22:44,060
and the Spanish group down here

337
00:22:44,061 --> 00:22:46,200
and the Catalan group in the middle.

338
00:22:46,201 --> 00:22:49,147
And this Catalan group is very mixed up
with the Spanish group

339
00:22:49,744 --> 00:22:52,184
which makes sense,
because it's a bilingual community.

340
00:23:01,121 --> 00:23:06,529
So, this is how I built the egocentric
network of my 92 multilingual users.

341
00:23:08,601 --> 00:23:10,987
<i>The Painter</i> is just one of them.
I have 92.

342
00:23:10,988 --> 00:23:16,575
I have 92 cells or <i>egocentric</i> networks
that I studied with my microscope.

343
00:23:17,868 --> 00:23:21,817
Do you want to ask some questions
about this process

344
00:23:21,818 --> 00:23:23,419
or this visualization?

345
00:23:25,051 --> 00:23:29,982
<i>(person 1) Of the bilingual units,
are they users or tweets?</i>

346
00:23:30,894 --> 00:23:32,056
They are users, yeah.

347
00:23:32,400 --> 00:23:35,560
So, the dots represent people.

348
00:23:35,561 --> 00:23:40,014
So, like <i>Eduard</i> here.
They represent people.

349
00:23:42,250 --> 00:23:45,317
Now, for each dot, to determine the language
and the color,

350
00:23:45,318 --> 00:23:47,931
I extracted 30 posts.

351
00:23:48,434 --> 00:23:52,797
So, it's an interesting question
because the 30 posts

352
00:23:52,798 --> 00:23:55,958
have different language labels
assigned to them

353
00:23:56,096 --> 00:23:57,130
especially if they were bilingual

354
00:23:57,131 --> 00:24:01,643
and I had to decide which language label
I was going to assign to the user.

355
00:24:01,644 --> 00:24:05,383
So, I had to build an algorithm
with a set of rules

356
00:24:10,279 --> 00:24:11,346
basically saying--

357
00:24:11,347 --> 00:24:16,651
the Google identification system
would give me a language

358
00:24:16,652 --> 00:24:17,882
and a confidence level.

359
00:24:17,882 --> 00:24:19,496
So if the confidence level was very low

360
00:24:19,497 --> 00:24:23,838
I would say "discard that"
because I had a series of heuristics

361
00:24:23,858 --> 00:24:30,113
based on both the number of tweets
using a particular language

362
00:24:30,113 --> 00:24:32,685
and also on the confidence level.

363
00:24:33,655 --> 00:24:38,267
And there are a lot
of technical challenges there as well.

364
00:24:39,973 --> 00:24:41,948
<i>(woman) So, it's possible
that some of these posts</i>

365
00:24:41,949 --> 00:24:45,824
<i>many of these posts would be multilingual, 
I'm sorry, monolingual in one language or the other?</i>

366
00:24:46,498 --> 00:24:51,988
<i>So it's also possible that some
of these individual posts</i>

367
00:24:51,989 --> 00:24:54,184
<i>would mix languages?</i>

368
00:24:54,623 --> 00:24:56,733
Yes, it is possible.
It's very possible!

369
00:24:57,063 --> 00:25:00,360
It's very challenging
for the automatic system!

370
00:25:01,915 --> 00:25:03,743
<i>(woman) Right, ok.
I just wanted to be clear--</i>

371
00:25:03,744 --> 00:25:05,185
Yes, exactly.

372
00:25:05,186 --> 00:25:11,303
So it's not as frequent as I expected,
having bilingual posts

373
00:25:11,304 --> 00:25:12,740
that I would call.

374
00:25:12,741 --> 00:25:14,431
But it's happening.

375
00:25:15,058 --> 00:25:20,539
And so, for a series of tests,
I had to do manual coding

376
00:25:20,540 --> 00:25:23,263
and I saw that sometimes
it was the case

377
00:25:23,264 --> 00:25:26,718
that they were doing some sort
of translation in the same tweet

378
00:25:26,719 --> 00:25:31,585
and sometimes it was just the case
that they were mentioning titles of things

379
00:25:31,586 --> 00:25:34,206
or places in a different language.

380
00:25:34,563 --> 00:25:39,470
So, there's a lot of issues
surrounding the automatic handling of this

381
00:25:39,471 --> 00:25:44,478
but you are dealing with 92 networks

382
00:25:44,479 --> 00:25:50,864
and they have between 30
and 5,000 nodes in them.

383
00:25:52,708 --> 00:25:55,841
So, I don't remember the numbers exactly,

384
00:25:55,867 --> 00:25:59,148
but I'm talking about
around 80,000 people.

385
00:26:01,132 --> 00:26:04,527
So detecting the language of 80,000 people
and this is small-scale.

386
00:26:04,913 --> 00:26:08,286
If you go to millions,
you need an automatic system.

387
00:26:08,287 --> 00:26:11,291
And one of the things I'm having
to write up in my dissertation

388
00:26:11,292 --> 00:26:13,832
is what are the challenges.

389
00:26:13,833 --> 00:26:17,984
You have to be prepared for them,
to solve those problems.

390
00:26:18,551 --> 00:26:21,851
And one of them is what do you do
with bilingual posts

391
00:26:21,852 --> 00:26:23,920
which language do you assign to that post?

392
00:26:23,921 --> 00:26:28,287
Automatic posts, spam...
there's a lot of problems.

393
00:26:29,862 --> 00:26:31,219
Challenges, I mean.

394
00:26:31,220 --> 00:26:34,766
That's what makes it interesting
because you cannot do manual coding

395
00:26:34,766 --> 00:26:36,046
on these scales.

396
00:26:39,073 --> 00:26:41,013
Do you have another question?

397
00:26:44,501 --> 00:26:48,025
So, now, what am I doing with this?

398
00:26:50,562 --> 00:26:56,178
I'm going to classify my social networks,
looking at the patterns

399
00:26:56,179 --> 00:26:59,094
of overlaps between the language groups.

400
00:26:59,720 --> 00:27:01,953
And overlaps or intersections.

401
00:27:02,547 --> 00:27:07,878
I'm looking specifically at the networks
that have only two language groups

402
00:27:08,219 --> 00:27:11,860
I had five of these networks
that were trilingual

403
00:27:12,284 --> 00:27:16,020
so I put them aside to go simple
first with just two language groups

404
00:27:16,021 --> 00:27:18,361
to see how they interconnect.

405
00:27:19,369 --> 00:27:21,272
And then I classified them

406
00:27:21,936 --> 00:27:24,198
first following a qualitative analysis

407
00:27:24,198 --> 00:27:28,822
and then I used network statistics
that I developed with my adviser

408
00:27:28,823 --> 00:27:30,386
for this purpose.

409
00:27:31,338 --> 00:27:33,693
And I will talk later a little more
about it.

410
00:27:34,341 --> 00:27:37,980
So, I tried to provide
more robust measures for that.

411
00:27:39,428 --> 00:27:44,074
I classified them and I came up
with some types.

412
00:27:45,922 --> 00:27:49,631
This is what I call <i>the gatekeeper</i>
language bridge type.

413
00:27:50,526 --> 00:27:52,995
And there's some variants of it,
obviously.

414
00:27:53,624 --> 00:27:55,990
What you can see here
is the network of a person

415
00:27:55,991 --> 00:28:00,092
and I'm going to assume this person
is in the United States

416
00:28:00,093 --> 00:28:02,350
and speaks both Spanish and English.

417
00:28:04,043 --> 00:28:05,684
Let's call her <i>Maria</i>.

418
00:28:05,927 --> 00:28:11,581
So she's Maria and she has two groups
of friends using Spanish on Twitter

419
00:28:12,531 --> 00:28:15,768
and then that big group of friends
using English.

420
00:28:17,320 --> 00:28:19,528
And, as you can see,
there's just a few nodes

421
00:28:19,529 --> 00:28:22,003
connecting the two language groups.

422
00:28:22,004 --> 00:28:27,869
You can see that the social structure
can be different from the language groups

423
00:28:29,391 --> 00:28:32,174
so you can have maybe a group of friends
and a group of coworkers

424
00:28:32,175 --> 00:28:36,424
inside the same language group,
so it can be more complex

425
00:28:36,425 --> 00:28:41,205
than just dividing the social network
by language groups.

426
00:28:41,206 --> 00:28:45,522
There can be more grouping
because of other social reasons.

427
00:28:46,811 --> 00:28:50,572
But the interesting thing is that
there are only a few nodes

428
00:28:50,573 --> 00:28:53,455
where people are connecting
holding together these Twitters.

429
00:28:55,058 --> 00:29:00,675
I think this was friends
with English here.

430
00:29:00,676 --> 00:29:05,461
You can see, in this case, it seems
like the two groups

431
00:29:05,462 --> 00:29:08,089
are holding closely together

432
00:29:08,809 --> 00:29:13,833
because there are many more links
holding the two groups together.

433
00:29:14,663 --> 00:29:18,246
Of course, this is going to depend
on the size of the networks

434
00:29:18,247 --> 00:29:23,067
so I had to account for the size
when coming up with measures

435
00:29:23,068 --> 00:29:25,943
with network connections

436
00:29:25,944 --> 00:29:28,257
I had to provide ratios.
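
One way to make such a measure independent of network size is to divide the number of cross-language links by the total number of links. The sketch below is only an illustration under that assumption; the function name and the data layout are hypothetical, and the statistics actually used in the study are not spelled out in the talk.

```python
def cross_language_ratio(edges, language_of):
    """Share of links that connect nodes with different language labels.

    edges: iterable of (node_a, node_b) pairs.
    language_of: dict mapping node -> language label.
    Links touching a node without a label (no data) are ignored.
    """
    labeled = [(a, b) for a, b in edges if a in language_of and b in language_of]
    if not labeled:
        return 0.0
    crossing = sum(1 for a, b in labeled if language_of[a] != language_of[b])
    return crossing / len(labeled)


language_of = {"a": "en", "b": "en", "c": "es", "d": "es"}
edges = [("a", "b"), ("c", "d"), ("b", "c"), ("a", "x")]  # "x" has no data
print(cross_language_ratio(edges, language_of))
```

A low ratio would correspond to a gatekeeper-like network, and a high ratio to a more integrated one.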

437
00:29:28,258 --> 00:29:32,340
Now, the ratio of cross-language linking
here and here

438
00:29:32,341 --> 00:29:34,312
and you have these types--

439
00:29:36,477 --> 00:29:40,266
These types are not just clear-cut.

440
00:29:40,346 --> 00:29:41,696
There's an evolution.

441
00:29:41,700 --> 00:29:43,337
There's people that have
very few connections

442
00:29:43,338 --> 00:29:44,653
with the language groups

443
00:29:44,654 --> 00:29:46,943
and then progressively there's people
with more and more.

444
00:29:47,704 --> 00:29:49,037
And this increases.

445
00:29:49,037 --> 00:29:52,048
Which points to the fact,
that my cells are there.

446
00:29:52,735 --> 00:29:57,001
Which means I don't see the evolution
over time, ok?

447
00:29:57,819 --> 00:29:59,724
This is a limitation of my research.

448
00:29:59,725 --> 00:30:04,594
I just see how the social network
of this person looked

449
00:30:04,594 --> 00:30:07,491
at a particular point in time.

450
00:30:07,925 --> 00:30:10,057
I don't know how it evolves over time.

451
00:30:10,058 --> 00:30:13,130
So, for myself, it's just there.

452
00:30:13,508 --> 00:30:18,702
It would be interesting
to see these different patterns

453
00:30:18,702 --> 00:30:20,771
that I have been observing.

454
00:30:20,771 --> 00:30:26,632
Over time, these connections
between languages may be increasing.

455
00:30:28,862 --> 00:30:32,131
Now we have the <i>integration
and union</i> type

456
00:30:32,693 --> 00:30:37,128
where in this case you have a person
from an Arab country

457
00:30:37,129 --> 00:30:40,778
and green represents the friends
that are using Arabic

458
00:30:40,779 --> 00:30:45,155
and the friends using English are in pink,
but there's also violet

459
00:30:45,156 --> 00:30:46,837
those are the bilinguals.

460
00:30:47,196 --> 00:30:51,534
That means there's a group
of English users

461
00:30:51,535 --> 00:30:57,187
and bilingual English - Arabic users
inserted in the group of Arabic, inside.

462
00:30:59,530 --> 00:31:01,289
That's the integration,
so they're integrated.

463
00:31:02,419 --> 00:31:07,726
And then I have a Greek guy,
who uses Greek and English

464
00:31:07,726 --> 00:31:09,446
and his Arabic friends.

465
00:31:09,446 --> 00:31:11,935
And in this case, you can see
it's sort of light blue

466
00:31:11,936 --> 00:31:16,788
representing Greek, so the friends
that tweet in Greek

467
00:31:16,789 --> 00:31:20,729
Pink again represents people tweeting
in English

468
00:31:21,353 --> 00:31:23,426
and there's a lot of bilinguals.

469
00:31:23,449 --> 00:31:26,994
So these kind of dark blues
represent the bilinguals.

470
00:31:26,995 --> 00:31:28,604
And these are two groups

471
00:31:28,605 --> 00:31:32,741
that if you've seen before,
<i>the gatekeeper</i> and the language bridge

472
00:31:32,742 --> 00:31:35,281
progressively getting closer and closer

473
00:31:35,282 --> 00:31:40,990
with more and more links
across languages.

474
00:31:41,184 --> 00:31:42,815
In this case, this is like the extreme.

475
00:31:42,816 --> 00:31:46,016
The links between the two languages
are so dense

476
00:31:46,017 --> 00:31:51,021
that you almost cannot distinguish
where the border is

477
00:31:51,021 --> 00:31:53,128
between the two language groups.

478
00:31:53,164 --> 00:31:58,534
And, interestingly, the border might even be
only noticeable

479
00:31:58,534 --> 00:32:01,406
because there's a lot of bilinguals
around it.

480
00:32:02,091 --> 00:32:04,924
And this is the union type
where they unite.

481
00:32:07,201 --> 00:32:09,806
And finally, the <i>peripheral</i> language type.

482
00:32:09,807 --> 00:32:13,690
This is a Brazilian guy,
the network of a Brazilian guy

483
00:32:15,324 --> 00:32:16,892
where you have--

484
00:32:16,893 --> 00:32:18,885
probably he lives in the United States
or something like that--

485
00:32:18,886 --> 00:32:23,192
because this guy has mostly
all this big group of friends

486
00:32:23,226 --> 00:32:24,850
tweeting in English.

487
00:32:26,532 --> 00:32:31,978
And then there's the side tentacle
running outside, using Portuguese.

488
00:32:34,702 --> 00:32:36,399
And this is the peripheral language type.

489
00:32:36,400 --> 00:32:39,137
So, in the periphery there's a small group
of Portuguese language.

490
00:32:39,893 --> 00:32:45,233
Now, I forgot to mention that there's dots
that are light yellow or white.

491
00:32:45,286 --> 00:32:48,100
Those are the ones that have no data.

492
00:32:49,074 --> 00:32:51,270
So, I don't know
the language they're using

493
00:32:51,271 --> 00:32:53,382
because either their accounts are closed

494
00:32:53,383 --> 00:32:57,803
or for some reason, in between the collection
of data they closed the account.

495
00:32:59,307 --> 00:33:03,059
Mostly, the reason
is that they're private accounts

496
00:33:03,570 --> 00:33:05,640
where you cannot get the data from.

497
00:33:06,442 --> 00:33:08,755
I think somewhere I read
it was about 5 percent.

498
00:33:08,756 --> 00:33:10,216
I'm not sure.

499
00:33:10,216 --> 00:33:14,010
But for one reason or another,
I don't have that information.

500
00:33:16,563 --> 00:33:20,976
Now, why am I classifying them?
These networks?

501
00:33:22,785 --> 00:33:26,088
Well, the reason is that--

502
00:33:26,089 --> 00:33:28,793
well, there are some studies
that demonstrate that the social structure

503
00:33:28,794 --> 00:33:33,539
the structure of the social networks
influences the spread of information.

504
00:33:34,096 --> 00:33:36,457
How information disseminates
in the network.

505
00:33:38,553 --> 00:33:42,909
So, I'm just assuming
that these different structures

506
00:33:42,910 --> 00:33:46,382
are going to influence the spread
of information.

507
00:33:47,292 --> 00:33:49,750
But this is a study that has to be done.

508
00:33:49,929 --> 00:33:52,944
I cannot demonstrate that one
of these types

509
00:33:52,945 --> 00:33:55,681
facilitates the spread of information.

510
00:33:55,682 --> 00:34:02,330
I can only say that I am assuming,
so that potential study

511
00:34:04,200 --> 00:34:09,400
could just look at, for example,
if <i>gatekeeper</i> and <i>language bridges</i>

512
00:34:10,551 --> 00:34:16,231
are not as good for spreading information
as <i>union and integration</i> types.

513
00:34:20,178 --> 00:34:25,022
Right, we can just assume
because of the cross-language links

514
00:34:28,295 --> 00:34:33,380
so, how many links there are
or the ratio of discourse language

515
00:34:33,380 --> 00:34:38,331
may potentially facilitate information
diffusion in these cases.

516
00:34:39,944 --> 00:34:42,557
So, that study needs to be done.

517
00:34:42,607 --> 00:34:44,732
I cannot say what's going to happen!

518
00:34:44,732 --> 00:34:47,123
I just assume it's going to be like that.

519
00:34:49,178 --> 00:34:52,009
So that is the reason why I classify them.

520
00:34:52,498 --> 00:34:54,599
I have some network statistics.

521
00:34:55,969 --> 00:35:00,753
We've made about an 80 percent accuracy
guess, which is quite good,

522
00:35:00,753 --> 00:35:02,453
but the sample is small.

523
00:35:08,014 --> 00:35:10,961
So now, do you have any more questions
before I move on to the next study?

524
00:35:13,726 --> 00:35:15,444
<i>(man) I was curious as to how many--</i>

525
00:35:15,444 --> 00:35:19,144
<i>what was the selection process like 
to find the 92 users?</i>

526
00:35:20,324 --> 00:35:22,891
Well, this is what I was explaining
at the beginning,

527
00:35:22,892 --> 00:35:26,690
about just using two stopwords
from two different languages

528
00:35:26,691 --> 00:35:31,482
typing that in the search box in Google
and searching Twitter

529
00:35:31,482 --> 00:35:32,875
and then once--

530
00:35:32,876 --> 00:35:36,192
Basically you just go through 
the list of results

531
00:35:36,193 --> 00:35:41,540
and start opening the profile,
counting the tweets.

532
00:35:42,327 --> 00:35:44,536
How many in this language,
how many in the other.

533
00:35:44,601 --> 00:35:46,640
And we put a threshold of 10 percent

534
00:35:46,640 --> 00:35:53,026
they had to have written 10 percent
of the tweets in a second language

535
00:35:53,228 --> 00:35:56,742
and you couldn't count retweets
or automatic posting.
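
In the study this check was done by hand, but the same 10 percent rule can be written down as a small sketch. The post fields (`lang`, `is_retweet`, `is_automatic`) and the function name are hypothetical, chosen only for illustration.

```python
from collections import Counter


def is_multilingual(posts, threshold=0.10):
    """Return True if the second most used language reaches `threshold`.

    posts: iterable of dicts with keys 'lang', 'is_retweet', 'is_automatic'.
    Retweets and automatic posts are excluded before counting.
    """
    own = [
        p["lang"] for p in posts
        if not p["is_retweet"] and not p["is_automatic"]
    ]
    ranked = Counter(own).most_common()
    if len(ranked) < 2:
        return False  # no posts at all, or only one language
    second_share = ranked[1][1] / len(own)
    return second_share >= threshold


def post(lang, is_retweet=False, is_automatic=False):
    return {"lang": lang, "is_retweet": is_retweet, "is_automatic": is_automatic}


timeline = [post("en")] * 8 + [post("es")] * 2
print(is_multilingual(timeline))
```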

536
00:35:57,937 --> 00:36:00,296
We also had to manually discard
these spammers.

537
00:36:01,535 --> 00:36:03,733
So, that was the process.

538
00:36:06,151 --> 00:36:09,536
<i>(woman) And that's a paid search
through Google?</i>

539
00:36:10,131 --> 00:36:12,601
No, that we did manually

540
00:36:12,717 --> 00:36:14,087
and then once--

541
00:36:14,088 --> 00:36:20,392
So the other thing you can say is you can
use these core multilingual users

542
00:36:20,938 --> 00:36:23,929
and then do what I did for behavior
in these social networks

543
00:36:23,929 --> 00:36:29,363
which is once you extract the friends
and extract the messages of the friends

544
00:36:30,669 --> 00:36:33,559
and automatically find the language

545
00:36:34,035 --> 00:36:36,522
then you can say "Oh, this person
is multilingual" automatically.

546
00:36:36,522 --> 00:36:41,099
You just process it and you can detect
a lot more multilingual people

547
00:36:41,183 --> 00:36:42,756
through that process.

548
00:36:42,757 --> 00:36:46,101
The paid process was sending these posts


549
00:36:46,101 --> 00:36:49,075
to the Google language 
identification tool.

550
00:36:49,885 --> 00:36:55,010
So, what I did was clean each message
automatically.

551
00:36:55,544 --> 00:37:00,387
Basically, eliminating the hashtags


552
00:37:01,437 --> 00:37:05,230
and the mentions
that had an <i>@</i> in front,

553
00:37:05,230 --> 00:37:10,074
symbols, URLs, all those things
I would automatically eliminate them

554
00:37:10,392 --> 00:37:13,777
and then with the rest of the message,
I'd send that to the Google API

555
00:37:14,125 --> 00:37:15,849
for language identification

556
00:37:16,009 --> 00:37:21,726
and the Google API would give me
a language label and a confidence value.

557
00:37:21,726 --> 00:37:23,476
And that for each message.
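
The cleaning step described here can be sketched with a few regular expressions. This is a minimal illustration, not the code used in the study; the patterns and the function name are assumptions, and the call to the language identification service is left out on purpose.

```python
import re

# Hypothetical patterns for the elements removed before language detection.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
SYMBOL_RE = re.compile(r"[^\w\s']")  # anything that is not a letter, digit, space or apostrophe


def clean_tweet(text):
    """Strip URLs, @mentions, #hashtags and symbols, then collapse whitespace."""
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE, SYMBOL_RE):
        text = pattern.sub(" ", text)
    return " ".join(text.split())


print(clean_tweet("@maria Hola, nos vemos en #Madrid! http://t.co/xyz"))
```

Only the remaining text would then be sent for language identification, so that handles and links do not confuse the detector.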

558
00:37:23,485 --> 00:37:26,371
And then I built the algorithm
with the help of Jen Golbeck

559
00:37:26,372 --> 00:37:30,688
to decide, well I have 30 messages,
500 English

560
00:37:30,714 --> 00:37:35,420
10 million Spanish and then one in Swahili
which is unlikely

561
00:37:36,728 --> 00:37:39,954
and you had to decide
the confidence value--

562
00:37:39,955 --> 00:37:42,935
So I used rules, defined rules

563
00:37:42,936 --> 00:37:45,559
but it could be done
statistically I think.

564
00:37:46,097 --> 00:37:48,388
And write some statistical method
to decide

565
00:37:48,389 --> 00:37:51,869
"well this person actually is bilingual"
or whatever.

566
00:37:52,779 --> 00:37:54,429
That's the process.

567
00:37:54,477 --> 00:37:55,597
It's long!

568
00:37:55,788 --> 00:37:56,788
Yes.

569
00:37:58,026 --> 00:38:00,487
<i>(woman) Hi, I understand
that you did it manually</i>

570
00:38:00,488 --> 00:38:05,265
<i>but currently in existing research field
is there any software</i>

571
00:38:05,265 --> 00:38:08,489
<i>that we can use to capture,</i>

572
00:38:08,489 --> 00:38:11,935
<i>to have access to all
these different tweets?</i>

573
00:38:11,983 --> 00:38:15,400
<i>And to capture the different categories?
[inaudible]</i>

574
00:38:15,400 --> 00:38:18,472
Ok, so you mean the extraction?

575
00:38:18,912 --> 00:38:19,983
<i>(woman) Yeah.</i>

576
00:38:19,983 --> 00:38:21,226
No, I didn't do it manually.

577
00:38:21,227 --> 00:38:22,705
<i>(woman) And the other,
I think the other part</i>

578
00:38:22,706 --> 00:38:25,570
<i>of your data presentation
is visualizations coming out</i>

579
00:38:25,571 --> 00:38:27,132
<i>like this graph.</i>

580
00:38:27,132 --> 00:38:32,610
<i>Can you show us what kind of research
do we have for social scientists</i>

581
00:38:33,250 --> 00:38:35,478
<i>to present the data in a visual form?</i>

582
00:38:35,479 --> 00:38:37,461
This is a tool I would recommend.

583
00:38:37,461 --> 00:38:39,123
[inaudible]

584
00:38:39,123 --> 00:38:41,427
So, the first question.

585
00:38:42,572 --> 00:38:45,748
All the extraction from Twitter,
it was automatic.

586
00:38:46,265 --> 00:38:48,638
I didn't copy the tweets,
it was automatic.

587
00:38:48,855 --> 00:38:50,707
I used the Twitter API.

588
00:38:51,286 --> 00:38:54,849
They have a process
for registered developers

589
00:38:54,850 --> 00:38:57,205
and I extracted it automatically.

590
00:39:01,925 --> 00:39:05,777
Now, the tools, and I forgot
to put that in this slide

591
00:39:05,847 --> 00:39:09,444
but in the beginning,
when I showed you the first visualization

592
00:39:09,445 --> 00:39:11,605
I put the name of the tool in--

593
00:39:12,703 --> 00:39:17,644
I don't know if I translate well,
but I think it's G-E--

594
00:39:17,644 --> 00:39:23,785
You can see here, G-E-P-H-I,
I don't know how to pronounce it!

595
00:39:23,785 --> 00:39:26,997
["Jefy" I think...]

596
00:39:28,201 --> 00:39:32,216
So, this is the one I've used
for the visualizations

597
00:39:33,709 --> 00:39:36,871
and it's good because you can use it
on any platform.

598
00:39:36,872 --> 00:39:41,911
So on a Mac, a PC, or Linux.

599
00:39:44,829 --> 00:39:46,696
Now, it has limitations for...

600
00:39:47,209 --> 00:39:50,778
mostly for network statistics
in my opinion.

601
00:39:54,237 --> 00:39:57,061
The other one that is very popular
is NodeXL.

602
00:39:57,062 --> 00:40:00,548
And in fact it was developed
here in the HCIL lab.

603
00:40:01,773 --> 00:40:04,092
In the lab where I work.

604
00:40:05,190 --> 00:40:06,937
So, they collaborated with Microsoft.

605
00:40:06,938 --> 00:40:09,867
It's a template for Excel

606
00:40:11,076 --> 00:40:12,552
and it allows--

607
00:40:12,553 --> 00:40:17,849
In fact they are still adding new features
and there's two people working on it

608
00:40:18,235 --> 00:40:19,665
in the lab.

609
00:40:19,739 --> 00:40:23,984
But the reason I haven't used it here,
is because I have a Mac

610
00:40:24,264 --> 00:40:29,166
and also there's another reason
I like this positioning algorithm

611
00:40:31,302 --> 00:40:32,807
and this is...

612
00:40:32,808 --> 00:40:37,014
this is another issue
I haven't talked about

613
00:40:37,124 --> 00:40:40,476
is how you actually place the dots.

614
00:40:40,476 --> 00:40:47,182
And actually these algorithms for layout
use force-directed schemes

615
00:40:48,820 --> 00:40:50,507
like in physics.

616
00:40:50,584 --> 00:40:53,598
So if a node has a lot of links
with another node

617
00:40:53,599 --> 00:40:56,980
they put it closer,
so it's like there's forces

618
00:40:56,981 --> 00:41:00,276
or strings attaching the nodes.

619
00:41:00,858 --> 00:41:04,293
And depending on how many strings
there are, they're closer or farther.

620
00:41:04,605 --> 00:41:07,933
There are physics rules
for placing them.

621
00:41:07,959 --> 00:41:09,508
But there's different algorithms

622
00:41:09,509 --> 00:41:14,981
but the other reason I chose Gephi
is that it has an algorithm

623
00:41:15,336 --> 00:41:20,899
specifically in this tool
that places my language groups separately

624
00:41:20,943 --> 00:41:24,338
more than any other algorithm
that I could use in NodeXL.

625
00:41:24,339 --> 00:41:29,142
And it was more useful
to see the groups separated.

626
00:41:30,407 --> 00:41:33,186
But you can use both
depending on what you want to do.

627
00:41:33,187 --> 00:41:35,905
They both have weaknesses and strengths,

628
00:41:35,931 --> 00:41:38,847
different depending
on what you have to do.

629
00:41:40,592 --> 00:41:46,628
NodeXL has more features
for processing many networks

630
00:41:48,068 --> 00:41:51,147
and extracting network statistics
for many networks at the same time.

631
00:41:52,217 --> 00:41:57,372
And it has a lot of interesting features,
maybe this is more manual.

632
00:41:58,528 --> 00:41:59,998
I don't know.

633
00:42:00,215 --> 00:42:04,670
Somebody called it
"the Photoshop of visualization".

634
00:42:09,125 --> 00:42:13,580
So I'm going to briefly comment
on the factor analysis.

635
00:42:13,892 --> 00:42:18,627
The point here, what I want to say,
is that multilingual users of Twitter

636
00:42:20,784 --> 00:42:23,663
are aware of their audience in a way.

637
00:42:24,848 --> 00:42:29,480
And they somehow perceive
how many followers

638
00:42:29,480 --> 00:42:32,205
of this language or the other they have.

639
00:42:32,761 --> 00:42:35,501
Maybe not very consciously,

640
00:42:37,641 --> 00:42:39,763
but they perceive something.

641
00:42:39,932 --> 00:42:42,468
So, I wanted to see how this social network,

642
00:42:42,469 --> 00:42:46,691
the fact that there's many languages
or just one in the social network

643
00:42:47,628 --> 00:42:52,814
can affect the choice of language in this person,
the <i>ego</i> person.

644
00:42:54,638 --> 00:42:57,734
So, I actually did a lot of testing,
different variables,

645
00:42:57,735 --> 00:43:01,434
but I'm just going to focus
on the essence,


646
00:43:01,434 --> 00:43:05,729
which is I have my dependent variable
which is the proportion of English

647
00:43:05,730 --> 00:43:11,064
used by the ego: say the ego has 50 posts,
maybe 60 percent of them are in English

648
00:43:11,883 --> 00:43:14,409
and 40 percent in Spanish,
I don't know.

649
00:43:14,693 --> 00:43:18,630
And then they have the factor
of how many users in the network

650
00:43:18,631 --> 00:43:21,381
are in English
and how many are using other languages.

651
00:43:21,597 --> 00:43:24,274
And then the multilingual index
of the network

652
00:43:24,275 --> 00:43:26,153
- and this is my favorite part -

653
00:43:26,153 --> 00:43:29,674
because it's basically saying

654
00:43:29,774 --> 00:43:35,900
"is multilingualism encouraging English
as a lingua franca?"

655
00:43:37,026 --> 00:43:41,693
especially on Twitter, where we have these
public posts that anybody can read.

656
00:43:43,339 --> 00:43:47,418
So anyway... I'm not going to go
into the technical details

657
00:43:47,940 --> 00:43:50,516
of binomial statistical interpretation.

658
00:43:50,517 --> 00:43:55,415
What I wanted to know is,
in these combined effects

659
00:43:56,046 --> 00:44:00,500
of the factors,
which one was more important?

660
00:44:00,998 --> 00:44:03,208
Was heavier than the others?

661
00:44:03,289 --> 00:44:07,340
Had more weight in defining this
proportion of English used by the ego?

662
00:44:08,750 --> 00:44:11,242
I tried other factors,

663
00:44:11,243 --> 00:44:14,237
I also looked at the use
of non-English language

664
00:44:15,370 --> 00:44:18,137
In the end... there are certain,

665
00:44:19,620 --> 00:44:21,423
I mean, they're obvious somehow.

666
00:44:21,424 --> 00:44:23,602
I think the process of what I've learned
is more interesting

667
00:44:23,603 --> 00:44:25,908
than the results themselves.

668
00:44:27,166 --> 00:44:30,031
Because basically what I've learned
is that, yeah,

669
00:44:31,040 --> 00:44:32,931
the use of English in the network

670
00:44:32,931 --> 00:44:36,338
encourages the use
of English by the ego

671
00:44:36,338 --> 00:44:40,756
and in a certain way it's so important
that any other factor

672
00:44:40,757 --> 00:44:44,029
is really not that important.

673
00:44:45,231 --> 00:44:48,980
And even the second most important,
the multilingual index

674
00:44:49,770 --> 00:44:54,830
was so light compared with
the heavy impact of English

675
00:44:55,575 --> 00:44:57,107
used in the network.

676
00:44:57,608 --> 00:45:00,294
But what I thought was really interesting

677
00:45:00,295 --> 00:45:03,329
was how do you define
the multilinguality of a network?

678
00:45:03,968 --> 00:45:07,295
And with this I got help
from Jordan Boyd-Graber

679
00:45:07,296 --> 00:45:09,336
who is also in the iSchool

680
00:45:09,337 --> 00:45:14,331
and in the computational linguistics
and information processing lab

681
00:45:14,332 --> 00:45:15,332
here in Maryland.

682
00:45:15,333 --> 00:45:17,556
He helped me
with all these technical aspects.

683
00:45:18,183 --> 00:45:20,590
And he was the one suggesting
"Well, why don't you look--"

684
00:45:20,590 --> 00:45:24,620
"instead of just looking at the number
of languages in the network...

685
00:45:24,620 --> 00:45:28,694
"because sometimes you get
wrongly detected languages...

686
00:45:28,695 --> 00:45:30,231
like Swahili." Well, no one was really
speaking Swahili in this network.

687
00:45:33,201 --> 00:45:37,029
There were technical challenges,
like I explained to you.

688
00:45:38,122 --> 00:45:42,248
So maybe there's a high number
of languages in the network

689
00:45:42,249 --> 00:45:44,189
but the network is mostly monolingual.

690
00:45:44,190 --> 00:45:49,064
Mostly everybody uses English
and just a few people maybe use others

691
00:45:49,633 --> 00:45:52,337
or maybe it just got wrongly detected.

692
00:45:52,338 --> 00:45:54,810
And maybe you're just saying

693
00:45:54,811 --> 00:45:57,047
"Oh yeah, there's ten languages
in the network!"

694
00:45:57,048 --> 00:45:59,548
and actually it's not
a very multilingual network at all.

695
00:45:59,549 --> 00:46:02,650
So, we came up with this, the entropy.

696
00:46:03,390 --> 00:46:06,495
And this is a physics concept
that measures the disorder

697
00:46:06,496 --> 00:46:07,866
in a system.

698
00:46:07,866 --> 00:46:11,452
And in this case, the entropy
would be my multilingual index

699
00:46:11,453 --> 00:46:17,104
and what it's doing is providing a value
between 0 and 1

700
00:46:17,364 --> 00:46:23,105
So, with 0 it's a very homogeneous system
everyone speaks the same language

701
00:46:23,549 --> 00:46:26,900
and if it's closer to 1,
it's really heterogeneous

702
00:46:26,972 --> 00:46:28,911
and it places importance

703
00:46:28,912 --> 00:46:31,823
on how many people
are using each language.

704
00:46:32,235 --> 00:46:36,480
So, this is the equation,
just to show you it.

705
00:46:38,009 --> 00:46:40,641
And it takes into account the number
of languages in the network

706
00:46:40,642 --> 00:46:45,427
and then one of the variables
is how many nodes there are in that language

707
00:46:45,498 --> 00:46:48,337
divided by the total number

708
00:46:48,338 --> 00:46:50,971
and this is what gives the proportion
for example.
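
The equation on the slide is not captured in this transcript, but an index of this kind can be sketched as a normalized Shannon entropy over the language proportions. Dividing by the logarithm of the number of languages is an assumption made here so that the value falls between 0 and 1; the function name is also hypothetical.

```python
import math
from collections import Counter


def multilingual_index(node_languages):
    """Normalized Shannon entropy of the languages used in a network.

    node_languages: one language label per node.
    Returns 0.0 when everyone uses the same language and 1.0 when all
    detected languages are used by equally many nodes.
    """
    counts = Counter(node_languages)
    if len(counts) < 2:
        return 0.0
    total = sum(counts.values())
    entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
    return entropy / math.log(len(counts))


# Three languages detected, but the network is almost entirely English.
mostly_english = ["en"] * 98 + ["sw", "fr"]
print(round(multilingual_index(mostly_english), 3))
```

Unlike a plain count of languages, this value stays low when one or two stray detections sit next to a large monolingual majority.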

709
00:46:52,889 --> 00:46:56,556
So just to let you know
that there's interesting lessons

710
00:46:56,557 --> 00:46:57,977
from this study.

711
00:46:57,982 --> 00:47:00,479
Despite the results not being exciting!

712
00:47:00,549 --> 00:47:02,881
And this is what I'm doing right now.

713
00:47:04,816 --> 00:47:08,002
So, the intrinsic characteristic
of the message

714
00:47:08,484 --> 00:47:11,038
how that influences the language choice.

715
00:47:11,062 --> 00:47:16,370
First, I'm wondering,
because I just saw it in the content

716
00:47:19,070 --> 00:47:22,495
are replies encouraging people
to use their native language?

717
00:47:22,992 --> 00:47:27,150
And public posts encouraging people
to use English as a lingua franca?

718
00:47:27,759 --> 00:47:30,251
This is one that showed up the same.

719
00:47:30,252 --> 00:47:34,151
And I changed the handle,
for privacy reasons...

720
00:47:34,549 --> 00:47:37,709
So this is the reply to somebody
and it's in Arabic.

721
00:47:38,443 --> 00:47:41,001
And this is a public posting
and it's in English.

722
00:47:42,414 --> 00:47:45,501
Now, the thing I'm looking at
is topic analysis

723
00:47:45,502 --> 00:47:50,314
and I'm considering with Jordan
to do some automatic topic analysis

724
00:47:50,706 --> 00:47:54,215
because there's many languages,
so I cannot code it all

725
00:47:54,782 --> 00:47:56,503
in many of them.

726
00:47:56,507 --> 00:47:58,459
Only in three, maybe four...

727
00:47:59,910 --> 00:48:01,406
So, I'm wondering,

728
00:48:01,407 --> 00:48:04,213
are technology topics favoring
the use of English?

729
00:48:04,600 --> 00:48:10,072
And other topics,
international news maybe?

730
00:48:11,308 --> 00:48:16,147
Whereas other topics
like national news or songs

731
00:48:16,148 --> 00:48:19,407
they might be encouraging the use
of native languages.

732
00:48:20,566 --> 00:48:22,904
And then I'm looking
if there's translations

733
00:48:22,904 --> 00:48:26,845
or if there's cross-cultural words
that you can detect.

734
00:48:27,324 --> 00:48:29,111
For instance, this person
is writing in English

735
00:48:29,112 --> 00:48:33,313
but is recommending a visit to a museum
in the city of Lille in France.

736
00:48:33,767 --> 00:48:38,830
So this person knows the city in France,
knows that to visit the museum

737
00:48:38,987 --> 00:48:40,556
you go there.

738
00:48:40,559 --> 00:48:43,089
And this is what I call
<i>cross-cultural</i> words.

739
00:48:44,239 --> 00:48:49,095
[What I kind of found] is that surprisingly
there's not many translation behaviors

740
00:48:49,096 --> 00:48:52,589
going on, despite these people
being multilingual.

741
00:48:53,001 --> 00:48:56,264
And this is what is going to trigger
some reflections.

742
00:49:00,289 --> 00:49:02,085
How am I doing on time?

743
00:49:04,172 --> 00:49:05,646
<i>(woman) 1:22.</i>

744
00:49:05,646 --> 00:49:10,050
<i>(man) Umm, it's usually an hour long...</i>

745
00:49:10,450 --> 00:49:14,358
So, I will go on with my reflections

746
00:49:14,358 --> 00:49:18,266
to encourage some thoughts.

747
00:49:18,266 --> 00:49:22,027
So the greatest connecting power
is the will of users who want

748
00:49:22,027 --> 00:49:23,317
to be connected.

749
00:49:23,317 --> 00:49:28,201
This is a really nice quality,
because the communities of interest

750
00:49:28,290 --> 00:49:32,012
in social media, in Twitter
is what is bringing people

751
00:49:32,013 --> 00:49:33,701
from different countries together.

752
00:49:34,794 --> 00:49:41,151
And also experiences,
like <i>the Voluntweeters</i>,

753
00:49:42,095 --> 00:49:45,815
so after the earthquake in Haiti,
there were these spontaneous

754
00:49:45,816 --> 00:49:48,972
self-organizations of Twitter users
for translating tweets

755
00:49:50,213 --> 00:49:53,755
and they called themselves <i>Voluntweeters</i>,
there's a paper about that--

756
00:49:53,826 --> 00:49:59,151
So this is the triggering
of social connections

757
00:50:00,820 --> 00:50:04,486
across countries, across borders
and across languages.

758
00:50:06,759 --> 00:50:10,300
But even when the social structure
could potentially facilitate

759
00:50:10,301 --> 00:50:13,375
information diffusion
and cross-language linking

760
00:50:14,558 --> 00:50:16,731
this condition is not sufficient.

761
00:50:16,732 --> 00:50:19,720
There are other factors
like the design of the interfaces

762
00:50:19,721 --> 00:50:22,479
and the design of systems
that can influence...

763
00:50:23,145 --> 00:50:27,438
can promote, or not, translation behaviors
and cross-cultural awareness.

764
00:50:28,293 --> 00:50:31,503
And the Wikipedia
of cross-language linking

765
00:50:31,504 --> 00:50:35,113
you have links for many languages
for every article.

766
00:50:37,257 --> 00:50:41,061
We also should acknowledge the dynamic
language preferences of multilingual users

767
00:50:41,790 --> 00:50:44,145
so they could address their messages
to the appropriate audience.

768
00:50:44,146 --> 00:50:47,187
I like the solution of Google+
with their circles

769
00:50:47,880 --> 00:50:51,890
where I can put my friends and family
in Spain in a circle

770
00:50:51,891 --> 00:50:54,559
and write to them in Spanish.

771
00:50:54,739 --> 00:51:00,633
And then the recommendation of people
based on language profile

772
00:51:01,437 --> 00:51:04,134
would be useful for this spontaneous
self-organization.

773
00:51:05,708 --> 00:51:08,057
So, these are some of the things.

774
00:51:08,143 --> 00:51:10,455
The impact of mediation.

775
00:51:10,782 --> 00:51:13,206
Global Voices is
an international community of bloggers

776
00:51:13,207 --> 00:51:18,303
that connects bloggers and citizens
from around the world

777
00:51:18,814 --> 00:51:20,504
in different languages.

778
00:51:21,171 --> 00:51:22,580
And Scott Hale

779
00:51:22,581 --> 00:51:27,353
a student from Oxford University
led a very interesting study

780
00:51:27,354 --> 00:51:33,960
after the earthquake in Haiti about blogs
in Spanish, Japanese and English

781
00:51:35,561 --> 00:51:38,542
and he looked
at the cross-language linking

782
00:51:38,543 --> 00:51:41,388
and focusing on this topic
over time.

783
00:51:41,488 --> 00:51:45,495
And he discovered that 50 percent
of the cross-language linking

784
00:51:45,496 --> 00:51:48,304
was happening through this platform,
Global Voices.

785
00:51:49,062 --> 00:51:51,941
So, it had a very big impact
on the language links.

786
00:51:54,170 --> 00:51:57,857
And finally, social media,
big media outlets,

787
00:51:57,858 --> 00:52:01,592
people are interconnected
in these complex networks

788
00:52:04,693 --> 00:52:08,945
and underlying is this language ecosystem.

789
00:52:09,058 --> 00:52:12,786
So we have the language ecosystem,
and on top of that

790
00:52:12,787 --> 00:52:15,296
we have the social media ecosystem.

791
00:52:15,305 --> 00:52:20,200
People would share a video from YouTube
on Twitter, or news on Facebook.

792
00:52:21,302 --> 00:52:26,011
What happens if we integrate
in this ecosystem

793
00:52:26,517 --> 00:52:30,518
these platforms, like Global Voices,
like Universal Subtitles

794
00:52:30,519 --> 00:52:34,327
which is a platform
for crowdsourcing subtitling of videos

795
00:52:34,328 --> 00:52:37,108
and translation of subtitles
for videos.

796
00:52:38,050 --> 00:52:42,222
If you integrate that and this
starts connecting, starts building paths

797
00:52:42,223 --> 00:52:45,743
between languages
that didn't exist before.

798
00:52:45,744 --> 00:52:50,955
So I think we should make it easy
for multilingual people to translate

799
00:52:50,955 --> 00:52:55,187
and subtitle all the content they like,
their favorite content

800
00:52:56,003 --> 00:53:00,326
and share it with the appropriate audience
so they can start connecting

801
00:53:00,327 --> 00:53:03,114
the language islands of the internet.

802
00:53:03,145 --> 00:53:06,219
And that way stories will travel
all over the world.

803
00:53:09,204 --> 00:53:11,950
Particularly I would like to thank
Jen Golbeck, my adviser

804
00:53:11,951 --> 00:53:14,337
and Fulbright for supporting
this research.

805
00:53:14,477 --> 00:53:19,206
And then I open the space
for questions and your ideas

806
00:53:19,488 --> 00:53:21,780
if this has triggered some thoughts.

807
00:53:24,140 --> 00:53:25,972
<i>(woman) I have a question
about how this relates</i>

808
00:53:25,973 --> 00:53:28,112
<i>to your Yahoo award.</i>

809
00:53:29,468 --> 00:53:35,076
Well, they have the Internet Experiences
lab in California.

810
00:53:35,078 --> 00:53:36,428
And they--

811
00:53:36,460 --> 00:53:40,213
So, we tend to think
maybe it's a super tiny place

812
00:53:40,213 --> 00:53:42,630
but actually there are fields

813
00:53:42,631 --> 00:53:44,818
and I applied for the social systems.

814
00:53:45,121 --> 00:53:48,967
The social systems are a category.

815
00:53:49,068 --> 00:53:54,686
And I think that was embedded
in the Internet Experience lab

816
00:53:56,739 --> 00:53:58,452
and yeah, they liked it.

817
00:53:58,516 --> 00:54:01,530
<i>(man) But is it this
work that they are interested in?</i>

818
00:54:01,813 --> 00:54:02,883
Yes.

819
00:54:02,884 --> 00:54:04,022
- <i>The languages?</i>
- Yes.

820
00:54:04,022 --> 00:54:07,726
Well, now I have results,
because I wrote up reports

821
00:54:09,496 --> 00:54:11,548
about what my work was about.

822
00:54:16,758 --> 00:54:17,968
<i>Great.</i>

823
00:54:22,055 --> 00:54:22,879
Yes?

824
00:54:22,879 --> 00:54:25,682
<i>(woman) I was thinking about
if you analyzed the place...</i>

825
00:54:25,682 --> 00:54:30,689
<i>like if there's any relationship
between tweeters and tweets</i>

826
00:54:31,056 --> 00:54:33,624
<i>and the place that the people are.</i>

827
00:54:35,883 --> 00:54:39,760
<i>I mean, because it's not the same
being a Brazilian in Brazil</i>

828
00:54:39,761 --> 00:54:43,197
<i>and tweeting in Portuguese
or being Brazilian in the US</i>

829
00:54:43,198 --> 00:54:45,330
<i>and tweeting in Portuguese--</i>

830
00:54:45,950 --> 00:54:49,249
There's many, many factors
that I haven't looked at.

831
00:54:50,126 --> 00:54:51,971
<i>It's not part of your study?</i>

832
00:54:52,300 --> 00:54:54,447
But because I had to scope it somehow.

833
00:54:54,448 --> 00:54:56,108
There's so many factors.

834
00:54:56,710 --> 00:54:59,993
Geography was one that I was originally
intending to look at

835
00:55:00,097 --> 00:55:04,458
but I found there were so many problems
to actually get the right geography

836
00:55:04,459 --> 00:55:06,652
the right geolocation.

837
00:55:08,154 --> 00:55:12,136
The problem is that I didn't originally
collect the geolocation.

838
00:55:12,137 --> 00:55:15,898
I think only a small percentage
of messages have...

839
00:55:16,457 --> 00:55:18,297
geolocated information.

840
00:55:18,902 --> 00:55:20,795
I'm not sure about the percentage there.

841
00:55:20,796 --> 00:55:24,690
So there's only a small percentage
of messages that have geolocation.

842
00:55:25,173 --> 00:55:27,604
There's issues with the accuracy...

843
00:55:28,041 --> 00:55:31,147
What I have collected is the information
in their profile

844
00:55:31,931 --> 00:55:35,462
they can put the information
about the place,

845
00:55:35,493 --> 00:55:39,572
but sometimes it's more
or less trustworthy,

846
00:55:39,573 --> 00:55:42,828
sometimes there's nothing,
and sometimes there's just crazy stuff.

847
00:55:43,210 --> 00:55:44,710
(audience laughs)

848
00:55:46,545 --> 00:55:49,735
So, something absolutely has to be there.

849
00:55:50,419 --> 00:55:55,249
If I wanted to expand this,
geography would be a nice place to go!

850
00:55:55,279 --> 00:55:56,609
<i>(woman) Ok.</i>

851
00:55:59,863 --> 00:56:00,631
Yes?

852
00:56:00,631 --> 00:56:01,710
<i>(man) Could you say a little bit more</i>

853
00:56:01,710 --> 00:56:04,946
<i>I think you said about the visualization
choices you made?</i>

854
00:56:04,964 --> 00:56:06,224
Oh yes, well...

855
00:56:08,033 --> 00:56:11,117
I tried this tool, NodeXL,

856
00:56:11,118 --> 00:56:13,284
I used both NodeXL and Gephi.

857
00:56:13,522 --> 00:56:14,522
There's more...

858
00:56:16,109 --> 00:56:20,202
I think there's, I don't remember the name
there's one that was developed

859
00:56:20,202 --> 00:56:21,854
here in Maryland

860
00:56:21,854 --> 00:56:24,163
but it's not as user-friendly.

861
00:56:26,108 --> 00:56:29,563
But I've forgotten the name,
I will have to look it up.

862
00:56:29,895 --> 00:56:33,872
And there's a lot of tools
that are for really technical people

863
00:56:34,696 --> 00:56:37,156
that are handling millions of nodes.

864
00:56:37,528 --> 00:56:40,615
Because with these tools,
for social scientists or humanists

865
00:56:40,615 --> 00:56:42,295
maybe they are not.

866
00:56:42,316 --> 00:56:48,685
Some tools can have maybe 300-400 nodes
and still be understandable.

867
00:56:51,115 --> 00:56:55,622
But if you go beyond that,
actually visualizations get crazy

868
00:56:56,058 --> 00:57:02,088
and even for more technical tools
for more technical people

869
00:57:02,563 --> 00:57:07,061
there are hundreds or millions,
they cannot do visualizations

870
00:57:08,349 --> 00:57:11,870
at some point they just give you
statistical measures.

871
00:57:13,729 --> 00:57:15,156
I have to leave it out.

872
00:57:15,156 --> 00:57:17,051
I have a list of tools and that

873
00:57:17,051 --> 00:57:20,598
but if I need the names,
I need to go through everything.

874
00:57:22,596 --> 00:57:25,479
<i>(woman) But yours was Mac-accessible?</i>

875
00:57:25,479 --> 00:57:31,585
Yes, this Gephi tool is Mac-accessible,
you can use it with Microsoft

876
00:57:31,792 --> 00:57:34,446
with Mac and with Linux.

877
00:57:35,905 --> 00:57:37,979
And I forgot to say,
it's open source.

878
00:57:43,480 --> 00:57:48,839
<i>(woman) Did you find
studying languages and internet</i>

879
00:57:48,840 --> 00:57:52,681
<i>was like a place, unexplored?</i>

880
00:57:52,948 --> 00:57:55,208
<i>Like here in the United States?</i>

881
00:57:55,378 --> 00:58:00,001
<i>Like when you began studying
or analyzing this</i>

882
00:58:00,002 --> 00:58:04,303
<i>you felt that a lot of people
are doing this</i>

883
00:58:04,303 --> 00:58:06,200
<i>or nobody is doing this</i>

884
00:58:06,200 --> 00:58:08,352
<i>and I'm the first one trying to--</i>

885
00:58:08,435 --> 00:58:13,114
I'm not the first one,
but it's a very new area

886
00:58:13,114 --> 00:58:14,971
to be exploring.

887
00:58:15,033 --> 00:58:16,983
So, it's very exciting
because of that.

888
00:58:17,012 --> 00:58:18,797
Because there's so many
unanswered questions

889
00:58:18,798 --> 00:58:23,785
and I find that surprisingly enough
the United States is not paying so much attention

890
00:58:23,786 --> 00:58:26,053
to multilinguality issues.

891
00:58:26,053 --> 00:58:31,002
And I think that language policies
are very monolingual-oriented

892
00:58:31,003 --> 00:58:32,948
but it's terrible

893
00:58:33,043 --> 00:58:37,182
because there's a whole lot
of multilinguality in this country.

894
00:58:37,183 --> 00:58:41,270
There's so many people
speaking different languages

895
00:58:42,548 --> 00:58:45,290
that I'm so amazed
about that contradiction.

896
00:58:45,780 --> 00:58:48,727
Because in Europe,
it's an obvious challenge for us

897
00:58:49,388 --> 00:58:51,907
because we need to understand each other
between all these countries

898
00:58:51,907 --> 00:58:53,567
of the European Union.

899
00:58:53,567 --> 00:58:58,499
And there's a lot of money invested
in research that relates to multilinguality

900
00:58:58,691 --> 00:59:00,738
and communication in languages

901
00:59:00,738 --> 00:59:04,557
and technology in particular,
cross-language systems

902
00:59:04,558 --> 00:59:09,030
and in libraries there's a lot of work
going on.

903
00:59:09,400 --> 00:59:13,942
There's investment in the research.

904
00:59:14,565 --> 00:59:18,405
So yeah, maybe in terms of investment

905
00:59:18,405 --> 00:59:22,115
the European Union is
not a bad place to be.

906
00:59:22,322 --> 00:59:24,109
Better than the United States!

907
00:59:24,110 --> 00:59:27,445
But at the same time,
what I find interesting

908
00:59:27,446 --> 00:59:33,323
is that here when I talk about it
people are really interested

909
00:59:35,313 --> 00:59:38,376
and interested in the subject
and excited about it.

910
00:59:38,458 --> 00:59:41,294
Maybe in Europe it looks more
like old news.

911
00:59:41,294 --> 00:59:43,796
Like "yeah, we already know that."

912
00:59:44,135 --> 00:59:45,665
(audience laughs)

913
00:59:45,674 --> 00:59:49,580
So I find that it's exciting
to be seeing the audience

914
00:59:49,629 --> 00:59:52,226
like "Oh yeah!"
It's so new.

915
00:59:52,666 --> 00:59:54,026
<i>(woman) Yes.</i>

916
00:59:58,653 --> 01:00:03,146
<i>(woman) As the emerging view
of research in the United States</i>

917
01:00:03,146 --> 01:00:09,892
<i>can you show me which institutions
or which area of academic institutions</i>

918
01:00:11,798 --> 01:00:14,748
<i>actually have more invested
in this topic in the US?</i>

919
01:00:16,262 --> 01:00:18,916
I'm not sure about the institutions.

920
01:00:20,572 --> 01:00:25,978
What I know, particularly,
in Indiana there's work

921
01:00:26,510 --> 01:00:29,107
because Susan Herring
is a researcher there.

922
01:00:30,797 --> 01:00:32,891
She has inspired my work.

923
01:00:32,891 --> 01:00:35,607
She published a book
<i>The Multilingual Internet</i>

924
01:00:35,687 --> 01:00:40,953
and she has done research on blogs,
also communities

925
01:00:41,891 --> 01:00:45,251
of different languages connecting blogs
in the blogosphere.

926
01:00:45,251 --> 01:00:51,058
So she has been one of the ones,
one of the first tackling these issues

927
01:00:51,144 --> 01:00:54,720
and she's still going
and she's doing something.

928
01:00:54,896 --> 01:00:59,399
So, it's the University of Indiana,
I think.

929
01:01:00,914 --> 01:01:03,348
Yeah, Susan Herring.
Look for her!

930
01:01:06,095 --> 01:01:09,181
And also at the same university
there's Paolillo.

931
01:01:10,156 --> 01:01:12,793
He's also doing research
in this area

932
01:01:12,826 --> 01:01:18,869
and he actually published for UNESCO
for research on language diversity

933
01:01:18,945 --> 01:01:20,275
on the internet.

934
01:01:21,785 --> 01:01:23,479
So Susan Herring and Paolillo,

935
01:01:23,480 --> 01:01:25,444
they are at the same university.

936
01:01:26,736 --> 01:01:30,058
Those are my inspiring ones.

937
01:01:33,682 --> 01:01:37,270
Well, at Harvard, the Berkman Center
for Internet and Society also did

938
01:01:37,270 --> 01:01:38,639
this mapping of the blogs.

939
01:01:38,640 --> 01:01:40,649
But they don't focus on languages.

940
01:01:41,700 --> 01:01:45,279
But there's a tangential thing
around there.

941
01:01:49,387 --> 01:01:51,428
<i>(man) One more question?</i>

942
01:01:53,560 --> 01:01:54,748
<i>Well, thank you very much!</i>

943
01:01:54,749 --> 01:01:55,749
Thanks!

944
01:01:55,759 --> 01:01:57,661
(audience applauds)