WEBVTT 00:00:00.384 --> 00:00:01.891 Good morning, everyone. 00:00:02.858 --> 00:00:05.650 Thank you for coming here [unclear] of the semester. 00:00:08.370 --> 00:00:10.217 So, I'm going to start. 00:00:11.001 --> 00:00:13.901 Access to the internet is greater than ever before 00:00:14.183 --> 00:00:17.166 and as a consequence, it's becoming more multilingual. 00:00:18.706 --> 00:00:22.613 However, there's evidence of segmentation of cyberspace 00:00:22.614 --> 00:00:25.170 due to language and national borders. 00:00:28.011 --> 00:00:30.812 This image serves to illustrate that. 00:00:31.684 --> 00:00:35.656 This is the language communities of Twitter in Europe. 00:00:36.562 --> 00:00:40.656 So, what you can see are tweets geolocated over a map of Europe 00:00:40.657 --> 00:00:44.047 and the different colors represent the different languages. 00:00:45.272 --> 00:00:50.905 You can even see regional languages like Catalan in the Catalan region of Spain 00:00:52.286 --> 00:00:56.037 And this is going to be useful for an example I'm going to use later. 00:01:01.958 --> 00:01:04.209 I'm interested in Twitter in particular, 00:01:04.209 --> 00:01:07.312 because of the speed of information dissemination 00:01:07.313 --> 00:01:10.744 and that most of this information is publicly accessible. 00:01:13.743 --> 00:01:18.780 I'm going to illustrate this with a capture 00:01:18.780 --> 00:01:22.280 of a dynamic visualization you can find on the Twitter blog 00:01:22.281 --> 00:01:24.783 by Miguel Rios. 00:01:25.345 --> 00:01:28.981 And what you can see here is the global flow of tweets 00:01:28.982 --> 00:01:31.205 after the earthquake in Japan. 00:01:32.147 --> 00:01:34.793 In pink, there are the tweets coming out of Japan 00:01:34.794 --> 00:01:37.400 and, in green, the retweets all over the world. 00:01:39.258 --> 00:01:44.966 This illustrates that in Twitter information is spreading across countries. 00:01:46.180 --> 00:01:47.987 But how can this happen? 
00:01:49.380 --> 00:01:55.018 Expatriates, migrants, minorities, diaspora communities, language learners 00:01:55.028 --> 00:01:59.484 all play an important role in building transnational networks 00:01:59.484 --> 00:02:02.549 and cultural bridges between nations and communities. 00:02:03.847 --> 00:02:06.169 They are the multilingual users on the internet. 00:02:07.512 --> 00:02:11.045 The overarching research question is: 00:02:11.047 --> 00:02:16.833 how are multilingual users of Twitter connecting different language groups? 00:02:21.961 --> 00:02:27.157 In 2009, the Berkman Center for Internet & Society at Harvard University 00:02:27.158 --> 00:02:29.773 mapped the Arabic blogosphere 00:02:29.774 --> 00:02:33.134 and they described a key concept for my research. 00:02:35.020 --> 00:02:40.855 They discovered an English bridge and a French bridge of bloggers 00:02:40.856 --> 00:02:45.780 that were writing in their native Arabic language and in English or French. 00:02:47.430 --> 00:02:51.551 And they were connecting the different national blogospheres 00:02:51.552 --> 00:02:53.034 with the international one. 00:02:55.387 --> 00:03:00.351 This might have played a role in the Arab popular uprisings in 2011 00:03:00.352 --> 00:03:02.773 for reaching out to the world. 00:03:04.582 --> 00:03:09.223 And this is connected with a concept that first appeared in 2008, 00:03:09.224 --> 00:03:11.697 that of the bridge bloggers. 00:03:13.601 --> 00:03:16.010 So, bridge bloggers are bloggers 00:03:16.011 --> 00:03:19.991 that are trying to connect their local communities 00:03:19.992 --> 00:03:23.255 to a wider global audience. 00:03:25.190 --> 00:03:29.398 The image you can see here is actually the visualization they created 00:03:29.399 --> 00:03:32.617 when mapping the Arabic blogosphere. 00:03:35.341 --> 00:03:37.370 Each dot is a blogger, or a blog.
00:03:38.763 --> 00:03:42.930 The size represents their popularity, so how many incoming links they have 00:03:42.931 --> 00:03:45.457 and they grouped them-- 00:03:45.457 --> 00:03:47.666 the neighborhoods they created 00:03:47.667 --> 00:03:51.582 in relation to the linking between them. 00:03:52.587 --> 00:03:56.420 So, the ones that are grouped together are linking to each other. 00:03:57.235 --> 00:03:59.454 The colors are a different question. 00:03:59.489 --> 00:04:06.067 The colors represent "attentive clusters", that's what they call them. 00:04:06.067 --> 00:04:11.793 And they looked at the online resources and media outlets 00:04:12.065 --> 00:04:13.911 these blogs were linking to. 00:04:15.055 --> 00:04:19.346 So, blogs of the same color are following the same media outlets 00:04:19.346 --> 00:04:21.100 and online resources. 00:04:21.165 --> 00:04:25.057 And they did human coding to label those groups. 00:04:25.640 --> 00:04:30.282 And here is where we see the label "English bridge" 00:04:30.283 --> 00:04:32.035 the responses from Cuba in English 00:04:32.036 --> 00:04:33.950 and up there, there's [unclear] France. 00:04:35.774 --> 00:04:40.537 And so I think it's important to retain the concept of attentive clusters. 00:04:43.797 --> 00:04:49.005 Now, let's go back to 2011 during the Arab popular uprisings. 00:04:50.444 --> 00:04:55.207 And I'll show you a visualization of the influence network 00:04:55.208 --> 00:04:57.069 of Twitter users in Egypt. 00:04:58.210 --> 00:04:59.868 So, for what you're seeing here, 00:04:59.869 --> 00:05:04.348 just imagine people down the street at Tahrir Square 00:05:04.349 --> 00:05:08.426 tweeting in Arabic about what's going on on the ground. 00:05:08.427 --> 00:05:12.188 And those are the people in red. 00:05:12.652 --> 00:05:17.525 So, these red dots represent users that are tweeting in Arabic.
00:05:18.309 --> 00:05:20.372 Then we have the international community 00:05:20.373 --> 00:05:25.515 or even Americans, British and so on tweeting in English. 00:05:26.075 --> 00:05:28.287 And they are in blue, those blue dots. 00:05:28.808 --> 00:05:33.052 And then, interestingly, we have people in between them, 00:05:33.052 --> 00:05:38.152 which are illustrated in different degrees of violet, or violet shades. 00:05:39.393 --> 00:05:43.417 This represents the fact that they are tweeting in both Arabic and English. 00:05:45.006 --> 00:05:47.204 So, what we're seeing is the bridge tweeters, 00:05:47.996 --> 00:05:52.754 much like what Ethan Zuckerman called "bridge bloggers". 00:05:55.642 --> 00:05:58.520 So, another context. 00:05:59.333 --> 00:06:04.649 The same year, 2011, a lot of big protests were going on in Europe. 00:06:05.172 --> 00:06:06.780 And in particular, in Spain. 00:06:06.781 --> 00:06:11.868 Starting on May 15th 2011, there were massive protests. 00:06:13.002 --> 00:06:16.640 And because of this context, this situation, 00:06:16.640 --> 00:06:22.974 new attentive clusters were emerging in the social media landscape of Spain. 00:06:28.240 --> 00:06:33.711 Now, this is a visualization you can find in the Socialflow blog, a research blog 00:06:33.806 --> 00:06:35.216 on social networks. 00:06:35.216 --> 00:06:41.235 What it does is track the origin and the initial spread 00:06:41.236 --> 00:06:45.347 of the hashtag #occupywallstreet on Twitter. 00:06:46.987 --> 00:06:51.803 They detected that one of the first uses of the hashtag #occupywallstreet 00:06:51.804 --> 00:06:57.093 was on July 13th 2011, linking to a blog post by Adbusters. 00:06:58.344 --> 00:07:02.250 So you have the Twitter account of Adbusters there, very big 00:07:02.251 --> 00:07:05.023 because it's being retweeted a lot. 00:07:05.966 --> 00:07:06.966 And mentioned a lot.
00:07:07.884 --> 00:07:13.514 And they collected these mentions and the tweets that had these mentions 00:07:13.515 --> 00:07:17.298 and these retweets with the hashtag during July 13th. 00:07:18.188 --> 00:07:20.501 From July 13th to July 23rd. 00:07:21.163 --> 00:07:23.523 So, for the first 10 days of the use of this hashtag, 00:07:23.524 --> 00:07:26.810 it was from the very beginning of the use of this hashtag on Twitter. 00:07:31.012 --> 00:07:32.320 They just mapped the accounts 00:07:33.462 --> 00:07:39.459 and the series of posts with the hashtag and mentions with the hashtag 00:07:40.652 --> 00:07:42.997 and the users that were connecting 00:07:42.998 --> 00:07:45.329 because of these mentions and retweets. 00:07:46.136 --> 00:07:49.237 Now the interesting thing in this visualization 00:07:49.238 --> 00:07:50.517 is that they, 00:07:50.518 --> 00:07:54.348 the Socialflow people, particularly in [inaudible], 00:07:54.873 --> 00:07:59.885 detected that this Spanish branch of users 00:07:59.886 --> 00:08:03.146 was forming an attentive cluster. 00:08:05.948 --> 00:08:09.282 Mentioning and retweeting about it in Spanish, 00:08:09.283 --> 00:08:13.042 using the hashtag in their messages in Spanish. 00:08:14.104 --> 00:08:17.164 And they point out in the blog 00:08:17.865 --> 00:08:19.524 that this Spanish contingent 00:08:19.525 --> 00:08:24.189 helped post and spread the word about Occupy Wall Street 00:08:24.190 --> 00:08:29.358 even before most of the United States was aware of it. 00:08:32.239 --> 00:08:34.043 So, I found that very interesting. 00:08:34.043 --> 00:08:37.438 And it was due to the context in Spain at that moment 00:08:38.287 --> 00:08:45.449 with big protests and new clusters forming in the social media landscape.
00:08:56.673 --> 00:09:01.497 Now I have shown you the importance of these multilingual users 00:09:01.498 --> 00:09:06.470 in connecting language communities and spreading information 00:09:06.471 --> 00:09:09.228 across countries, acting as mediators. 00:09:11.172 --> 00:09:15.831 But let's focus on another aspect of connecting language groups 00:09:15.832 --> 00:09:17.170 which is language choice. 00:09:17.881 --> 00:09:23.261 So I'm going to devote a moment to speak about languages 00:09:23.262 --> 00:09:24.262 and language choice. 00:09:27.743 --> 00:09:31.064 To understand languages in the world 00:09:31.065 --> 00:09:33.478 I'm going to use a telescope. 00:09:37.358 --> 00:09:39.954 So de Swaan... 00:09:41.244 --> 00:09:44.565 ...proposed a theory called the world language system 00:09:44.566 --> 00:09:45.815 back in the 1990s. 00:09:47.268 --> 00:09:50.867 to explain the languages in the world. 00:09:52.127 --> 00:09:55.428 And he used a very beautiful metaphor, the constellation. 00:09:56.906 --> 00:10:01.510 So, in his theory there's about a dozen languages in the world 00:10:01.511 --> 00:10:05.469 that are the hearts of the system, or the suns. 00:10:06.493 --> 00:10:07.493 The suns of the system. 00:10:07.604 --> 00:10:11.015 For instance, English, French, Spanish, Arabic and more. 00:10:12.324 --> 00:10:16.765 And then there are hundreds, maybe more than 100, 200... 00:10:16.766 --> 00:10:22.707 national languages that are orbiting around these suns like planets. 00:10:24.497 --> 00:10:28.393 And finally we have regional and minority languages 00:10:28.394 --> 00:10:31.664 that are orbiting these planets like satellites. 00:10:32.826 --> 00:10:37.606 And he used this metaphor to explain the power relationships 00:10:37.607 --> 00:10:39.729 between languages. 
00:10:40.172 --> 00:10:43.096 This is a theory of what he called 00:10:43.097 --> 00:10:46.659 "communication potential and language competition". 00:10:48.408 --> 00:10:50.539 A key point he made 00:10:51.979 --> 00:10:55.379 is that the system holds together 00:10:55.440 --> 00:10:58.793 thanks to multilingual people and interpreters. 00:11:00.291 --> 00:11:03.173 This is what's providing cohesion to the system. 00:11:03.956 --> 00:11:06.959 He also made a controversial proposal 00:11:06.960 --> 00:11:11.340 about the communication potential of a language. 00:11:12.290 --> 00:11:14.868 So, he proposed a formula, a mathematical formula 00:11:14.869 --> 00:11:19.910 where he could estimate the communication potential of a language 00:11:19.911 --> 00:11:24.889 and supposedly a person would choose what to learn and use 00:11:24.889 --> 00:11:27.554 based on the communication potential of that language. 00:11:28.108 --> 00:11:34.206 For example, a person might decide to learn English and use English 00:11:35.137 --> 00:11:41.414 because not only does it provide communication with English native speakers 00:11:42.395 --> 00:11:46.377 but also, adding to that, it provides the possibility to communicate 00:11:46.378 --> 00:11:50.073 with all the second-language learners of English 00:11:50.074 --> 00:11:53.101 from many different languages, many different countries. 00:11:53.102 --> 00:11:55.818 So, supposedly, in his theory 00:11:55.819 --> 00:11:59.686 English provides the greatest communication potential. 00:12:01.464 --> 00:12:05.106 And he received some criticism, because of the central role of English 00:12:05.107 --> 00:12:06.785 in his theory. 00:12:06.806 --> 00:12:10.139 He said it was the central hub of the whole system.
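For reference, the measure de Swaan proposes is generally known as the Q-value. The talk does not spell the formula out, so the notation below is an assumption. It follows the form in which the measure is usually summarized, as a language's prevalence multiplied by its centrality within a given constellation of languages.

```latex
% Q-value of language i within a constellation S (notation assumed for illustration)
%   P_i : number of speakers of language i in S
%   N   : number of all speakers in S
%   C_i : number of multilingual speakers in S who speak language i
%   M   : number of all multilingual speakers in S
\[
  Q_i \;=\; p_i \cdot c_i \;=\; \frac{P_i}{N} \cdot \frac{C_i}{M}
\]
```

Read this way, a language scores highly both when many people speak it (prevalence) and when many multilingual people have it in their repertoire (centrality), which matches the second-language-learner argument in the English example.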
00:12:12.959 --> 00:12:20.043 There's also the language ecology paradigm first proposed by Haugen in 1972 00:12:21.825 --> 00:12:25.243 and there's this idea of an ecosystem of languages 00:12:25.244 --> 00:12:29.695 and, again, it's using another metaphor 00:12:29.696 --> 00:12:31.608 and because of this metaphor 00:12:31.609 --> 00:12:34.466 also appeared the idea of endangered languages. 00:12:36.353 --> 00:12:39.880 I'm going to briefly just read the definition. 00:12:39.881 --> 00:12:42.787 He defined the language ecology as: 00:12:42.788 --> 00:12:47.235 "the study of interactions between any given language and its environment" 00:12:47.985 --> 00:12:49.207 and what I think is very important: 00:12:49.208 --> 00:12:53.690 "language exists only in the minds of its users" 00:12:56.694 --> 00:12:59.956 which leads me to point at my research. 00:13:01.528 --> 00:13:05.808 In my research, I'm using a microscope to see the cells 00:13:05.809 --> 00:13:09.337 and my cells in my study are the Twitter users. 00:13:12.192 --> 00:13:13.452 Why is that? 00:13:15.616 --> 00:13:19.906 Because as Haugen explains, there's a psychological dimension 00:13:19.907 --> 00:13:21.811 to language ecology 00:13:21.812 --> 00:13:25.070 where language interacts with other languages 00:13:25.071 --> 00:13:27.697 in the minds of multilingual people. 00:13:28.860 --> 00:13:31.574 And there's a sociological dimension to language ecology 00:13:32.425 --> 00:13:38.379 where we use language to communicate and interact with other people. 00:13:38.380 --> 00:13:43.860 And this language ecology generates because of the people 00:13:43.861 --> 00:13:46.493 that decide to use that language 00:13:46.494 --> 00:13:50.337 learning and interacting with people using it. 00:13:51.139 --> 00:13:54.758 And this is the point of language choice in languages. 00:13:55.643 --> 00:13:59.827 So, I focus on the connections of people and the language choice. 
00:14:04.231 --> 00:14:07.595 So, these are the four points I'm going to be speaking about. 00:14:07.862 --> 00:14:13.015 But actually the main focus is going to be the first point 00:14:13.483 --> 00:14:19.196 Social network analysis and the taxonomy of intersections between language groups 00:14:20.267 --> 00:14:23.376 This is where I'm going to be spending most of the time. 00:14:23.377 --> 00:14:26.820 And then very briefly, just for compilation purposes 00:14:26.821 --> 00:14:29.551 I'm going to speak about another small study that I did 00:14:29.552 --> 00:14:30.861 the factor analysis 00:14:30.862 --> 00:14:34.485 looking at the influence of the social network 00:14:34.486 --> 00:14:39.007 in the language choices of the users. 00:14:39.236 --> 00:14:41.737 So, how the social network influences language choice 00:14:41.738 --> 00:14:43.400 of our multilingual users. 00:14:45.128 --> 00:14:50.399 And then I'm going to briefly also talk about the last study of my dissertation 00:14:50.400 --> 00:14:52.492 that is still ongoing. 00:14:52.992 --> 00:14:55.215 So, I still have new research to talk about. 00:14:55.216 --> 00:14:57.622 And it's content analysis 00:14:57.623 --> 00:15:00.221 and in this case I'm focusing on intrinsic factors 00:15:00.222 --> 00:15:02.666 intrinsic to the messages 00:15:02.667 --> 00:15:05.506 about the topic, and the type of exchange. 00:15:05.973 --> 00:15:07.513 If it's a reply, if it's a public post 00:15:07.514 --> 00:15:09.923 and how that influences the language choice as well. 00:15:11.331 --> 00:15:12.331 And finally I will... 00:15:14.194 --> 00:15:17.193 I'm going to give you my reflections 00:15:17.996 --> 00:15:23.204 so I can invite your thoughts and suggestions and discussions about it. 00:15:27.097 --> 00:15:29.098 Briefly, I'm going to start with the sampling 00:15:29.099 --> 00:15:32.077 so I can talk about the rest of the research. 
00:15:34.959 --> 00:15:36.654 So my focus is on multilingual users. 00:15:36.655 --> 00:15:39.782 How did I identify multilingual users on Twitter? 00:15:42.044 --> 00:15:43.784 It was giving me a headache. 00:15:44.058 --> 00:15:46.691 Finally what we decided... 00:15:48.542 --> 00:15:50.240 this research has been-- 00:15:51.458 --> 00:15:54.070 I have always had the help of Jennifer Golbeck, 00:15:54.071 --> 00:15:55.077 she was my adviser. 00:15:55.078 --> 00:15:57.296 And I did this with her help. 00:15:57.978 --> 00:16:01.715 So what we did was gather lists of what are called stopwords. 00:16:02.813 --> 00:16:05.419 From different languages, and you have a list over there. 00:16:06.176 --> 00:16:10.451 You can find the stopword lists on the internet. 00:16:10.452 --> 00:16:13.055 They are created for computational linguistics 00:16:13.056 --> 00:16:15.560 so they use them for filtering purposes. 00:16:16.783 --> 00:16:19.103 And they are common words in a language. 00:16:19.791 --> 00:16:21.356 Very common words in a language. 00:16:21.357 --> 00:16:25.729 So, sometimes they're used precisely for eliminating them from texts 00:16:25.730 --> 00:16:28.689 when, for example, in searches in Google 00:16:28.690 --> 00:16:32.952 they eliminate the stopwords, the stopwords that you type 00:16:32.952 --> 00:16:34.612 in the search. 00:16:35.314 --> 00:16:37.797 But in this case I wanted to find the stopwords 00:16:37.798 --> 00:16:41.003 that are very common in the language to represent the language. 00:16:41.004 --> 00:16:44.103 And so we had to select words that were not written the same 00:16:44.104 --> 00:16:45.734 as in another language. 00:16:46.121 --> 00:16:48.425 Sometimes they could be confusing and ambiguous. 00:16:51.491 --> 00:16:54.029 Then I typed in Google... 00:16:55.954 --> 00:16:59.029 one word in one language and one word in another language.
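The stopword selection described here can be sketched roughly as follows. This is only an illustration under stated assumptions: the tiny word lists are toy examples rather than the lists used in the study, and the helper names `unambiguous_stopwords` and `query_pairs` are hypothetical.

```python
from collections import Counter


def unambiguous_stopwords(stopwords_by_language):
    """Keep only stopwords that are spelled the same in no other language."""
    # Count in how many languages each spelling occurs.
    seen = Counter(
        word
        for words in stopwords_by_language.values()
        for word in set(words)
    )
    return {
        lang: sorted(word for word in set(words) if seen[word] == 1)
        for lang, words in stopwords_by_language.items()
    }


def query_pairs(unique, anchor="en"):
    """Pair every anchor-language word with every word of each other language."""
    return [
        f"{anchor_word} {other_word}"
        for lang, words in unique.items()
        if lang != anchor
        for anchor_word in unique[anchor]
        for other_word in words
    ]


# Toy lists for illustration only (not the study's data).
toy_lists = {
    "en": ["the", "with"],
    "es": ["el", "con", "pero"],
    "ca": ["el", "amb", "però"],
}
unique = unambiguous_stopwords(toy_lists)
pairs = query_pairs(unique)
```

In the toy data, "el" is dropped because it appears in both the Spanish and the Catalan list, which is the kind of ambiguity the selection step is meant to avoid.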
00:16:59.030 --> 00:17:01.372 Usually I was always using one English word 00:17:01.373 --> 00:17:04.006 and one word in a different language. 00:17:04.729 --> 00:17:08.310 And I looked in the Twitter domain. 00:17:08.901 --> 00:17:12.185 So the search results from Google will give me the profiles 00:17:13.105 --> 00:17:19.866 of people on Twitter that in theory wrote messages in both languages. 00:17:20.123 --> 00:17:24.242 We had to do a lot of hand-combing to actually see if it was in two languages 00:17:24.542 --> 00:17:27.563 or it was just that they were mentioning an English song 00:17:28.352 --> 00:17:31.654 the title of an English song but they had no English in the rest. 00:17:31.655 --> 00:17:35.662 So we had to ensure that they were authoring tweets 00:17:35.663 --> 00:17:37.575 in two languages. 00:17:37.911 --> 00:17:39.996 So writing them, not just retweeting them 00:17:39.997 --> 00:17:43.202 they were not just automatic postings from Facebook. 00:17:43.203 --> 00:17:48.275 So we had a long set of criteria a lot of manual combing 00:17:48.276 --> 00:17:52.949 and then finally we selected 92 multilingual users 00:17:52.950 --> 00:17:57.694 and in total they used 19 languages, 2 or 3 languages per person. 00:18:00.989 --> 00:18:04.956 Now, I don't know if you want to ask some questions about the sampling 00:18:04.957 --> 00:18:07.607 because there's a lot of details about it. 00:18:13.392 --> 00:18:14.602 No doubts? 00:18:15.119 --> 00:18:16.585 Or maybe they'll come later! 00:18:19.137 --> 00:18:21.929 Now, how do I do the social networks analysis? 00:18:22.743 --> 00:18:27.869 Well, now I have my 92 multilingual users technically they are called the ego 00:18:28.277 --> 00:18:30.190 of an egocentric network. 00:18:31.084 --> 00:18:33.005 This is the cell of my study. 
00:18:33.836 --> 00:18:35.984 It starts with the nucleus of the cell 00:18:35.985 --> 00:18:37.550 which is my multilingual user 00:18:37.551 --> 00:18:40.478 and then I go to Twitter 00:18:40.478 --> 00:18:43.439 and first of all I have extracted-- 00:18:44.878 --> 00:18:47.416 so in this case my ego is called the Painter 00:18:48.111 --> 00:18:53.500 and I have extracted the last 50 messages that he posted on Twitter 00:18:53.501 --> 00:18:56.560 to see the languages this person used-- is using. 00:18:57.156 --> 00:19:01.943 And I see that he is using English, Spanish and Catalan. 00:19:02.945 --> 00:19:05.479 Catalan is a regional language in Spain 00:19:05.605 --> 00:19:07.736 and I showed you on the map before 00:19:07.737 --> 00:19:09.247 where the region was. 00:19:09.275 --> 00:19:12.474 And they speak both Catalan and Spanish. 00:19:13.727 --> 00:19:16.825 So, this person is tweeting in a minority language, 00:19:16.826 --> 00:19:18.488 a national language 00:19:18.489 --> 00:19:20.779 and also an international language. 00:19:26.808 --> 00:19:31.754 So, I already found the Painter and I know what languages this person speaks, 00:19:31.754 --> 00:19:33.542 well, uses on Twitter, 00:19:33.543 --> 00:19:36.178 and then I extract the whole social network. 00:19:36.179 --> 00:19:37.932 So, the followers on Twitter, 00:19:37.933 --> 00:19:39.716 you know that on Twitter you have followers 00:19:39.717 --> 00:19:41.160 and you follow people. 00:19:41.163 --> 00:19:42.513 I extracted both. 00:19:42.546 --> 00:19:48.323 The followers of the Painter, the people that are following him on Twitter, 00:19:48.323 --> 00:19:52.118 and also how the friends are connecting to each other. 00:19:52.289 --> 00:19:56.671 So, all of them, all of these dots are the followers, 00:19:56.672 --> 00:19:58.707 the people following the Painter on Twitter 00:19:58.708 --> 00:20:02.788 and also I see how they connect to each other, ok?
00:20:04.542 --> 00:20:08.837 So the Painter follows Eduard in the center 00:20:10.153 --> 00:20:12.245 and it seems he's very popular. 00:20:13.567 --> 00:20:17.042 And then I extract the last 30 posts of Eduard-- 00:20:17.048 --> 00:20:18.509 there's a reason for that 00:20:18.510 --> 00:20:21.961 but in particular it's mostly a question of economy! 00:20:24.717 --> 00:20:25.717 I will tell you why! 00:20:25.718 --> 00:20:28.857 So I extracted the last 30 posts of Eduard 00:20:28.858 --> 00:20:31.966 and then I do automatic language identification 00:20:31.967 --> 00:20:36.734 with the Google API for language identification 00:20:38.548 --> 00:20:39.548 which costs money. 00:20:40.527 --> 00:20:43.282 So you have to really think about how many posts you want to send 00:20:43.283 --> 00:20:45.580 to Google and how much money you have available 00:20:45.581 --> 00:20:48.178 and what accuracy you're going to have 00:20:48.179 --> 00:20:51.125 according to how many posts you send. 00:20:51.348 --> 00:20:53.268 There's a lot of testing going on there. 00:20:54.271 --> 00:20:58.482 I do the same with everybody in the social network. 00:20:58.700 --> 00:20:59.893 I extract the last 30 posts, 00:20:59.894 --> 00:21:02.340 use the Google identification, 00:21:02.341 --> 00:21:08.086 build an algorithm that decides, based on the languages of these 30 posts: 00:21:08.087 --> 00:21:11.929 is this person monolingual? Is this person multilingual? 00:21:11.929 --> 00:21:13.221 Which languages? 00:21:13.222 --> 00:21:15.379 And then I labeled them, ok. 00:21:16.572 --> 00:21:18.746 This is just a visualization behind the-- 00:21:20.280 --> 00:21:27.315 Perhaps person 1 is monolingual, or bilingual in two languages. 00:21:31.985 --> 00:21:35.782 Now that I have all the friends of the Painter 00:21:35.918 --> 00:21:37.392 and how they connect, 00:21:37.392 --> 00:21:40.854 I color code them depending on the languages they are using.
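The per-user decision described here can be sketched roughly as follows. It is a minimal illustration, not the rule set used in the study: the function name `classify_user`, the 0.5 confidence cutoff and the 20% share threshold are all assumptions, and the `(language, confidence)` pairs stand in for whatever a language-identification service returns for each post.

```python
from collections import Counter


def classify_user(detections, min_confidence=0.5, min_share=0.2):
    """Label a user from per-post (language, confidence) detections.

    Both thresholds are hypothetical values chosen for illustration.
    """
    # Discard detections the identifier was unsure about.
    kept = [lang for lang, conf in detections if conf >= min_confidence]
    if not kept:
        return "unknown", []

    counts = Counter(kept)
    # Keep languages that account for a meaningful share of the posts.
    languages = sorted(
        lang for lang, n in counts.items() if n / len(kept) >= min_share
    )
    if not languages:
        # No language reaches the share threshold: fall back to the most common.
        languages = [counts.most_common(1)[0][0]]

    label = "monolingual" if len(languages) == 1 else "multilingual"
    return label, languages


# 30 hypothetical detections for one user: 18 Catalan, 10 Spanish,
# and 2 low-confidence English guesses that get discarded.
posts = [("ca", 0.9)] * 18 + [("es", 0.8)] * 10 + [("en", 0.3)] * 2
result = classify_user(posts)
```

With these assumed thresholds the example user comes out as multilingual in Catalan and Spanish. The two low-confidence English detections, which could be something like a quoted song title, do not count.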
00:21:42.020 --> 00:21:44.669 And here, what you can see is very interesting. 00:21:46.076 --> 00:21:48.735 I don't know if you can distinguish the colors well 00:21:48.736 --> 00:21:53.949 because up here, this area, that is like a triangle 00:21:53.950 --> 00:21:57.896 there's a group of users writing in English. 00:21:58.743 --> 00:22:00.753 And it's pink. Sort of pinkish. 00:22:00.753 --> 00:22:04.547 And then, down here there's this Spanish group 00:22:04.548 --> 00:22:06.792 in light green. 00:22:07.544 --> 00:22:12.407 And, in the middle, the one that perhaps doesn't distinguish as well 00:22:12.408 --> 00:22:15.464 from the English, is the Catalan group. 00:22:15.935 --> 00:22:18.962 So the users writing in Catalan in dark blue. 00:22:19.776 --> 00:22:21.870 And then there's a set of violets in between 00:22:21.871 --> 00:22:26.319 and these violets represent the bilingual users 00:22:26.319 --> 00:22:29.292 either English and Catalan or English and Spanish. 00:22:29.963 --> 00:22:33.031 And then there's darker green around here, 00:22:33.031 --> 00:22:36.498 they are using both Catalan and Spanish. 00:22:36.498 --> 00:22:38.252 So there's a lot of bilinguals going on. 00:22:38.252 --> 00:22:39.736 And there's an interesting dynamics 00:22:39.737 --> 00:22:42.710 in that you have this English group up there 00:22:42.711 --> 00:22:44.060 and the Spanish group up here 00:22:44.061 --> 00:22:46.200 and the Catalan group in the middle. 00:22:46.201 --> 00:22:49.147 And this Catalan group is very mixed up with the Spanish group 00:22:49.744 --> 00:22:52.184 which makes sense, because it's a bilingual community. 00:23:01.121 --> 00:23:06.529 So, this is how I built the egocentric network of my 92 multilingual users. 00:23:08.601 --> 00:23:10.987 The Painter is just one of them. I have 92. 00:23:10.988 --> 00:23:16.575 I have 92 cells or egocentric networks that I studied with my microscope. 
00:23:17.868 --> 00:23:21.817 Do you want to ask some questions about this process 00:23:21.818 --> 00:23:23.419 or this visualization? 00:23:25.051 --> 00:23:29.982 (person 1) Of the bilingual units, are they users or tweets? 00:23:30.894 --> 00:23:32.056 They are users, yeah. 00:23:32.400 --> 00:23:35.560 So, the dots represent people. 00:23:35.561 --> 00:23:40.014 So, like Eduard here. They represent people. 00:23:42.250 --> 00:23:45.317 Now, for each dot, to determine the language and the color 00:23:45.318 --> 00:23:47.931 I extracted 30 posts. 00:23:48.434 --> 00:23:52.797 So, it's an interesting question because the 30 posts 00:23:52.798 --> 00:23:55.958 have different language labels assigned to them, 00:23:56.096 --> 00:23:57.130 especially if they were bilingual 00:23:57.131 --> 00:24:01.643 and I had to decide which language label I was going to assign to the user. 00:24:01.644 --> 00:24:05.383 So, I had to build an algorithm with a set of rules 00:24:10.279 --> 00:24:11.346 basically saying-- 00:24:11.347 --> 00:24:16.651 the Google identification system would give me a language 00:24:16.652 --> 00:24:17.882 and a confidence level. 00:24:17.882 --> 00:24:19.496 So if the confidence level was very low 00:24:19.497 --> 00:24:23.838 I would say "discard that". So I had a series of heuristics 00:24:23.858 --> 00:24:30.113 based on both the number of tweets using a particular language 00:24:30.113 --> 00:24:32.685 and also on the confidence level. 00:24:33.655 --> 00:24:38.267 And there are a lot of technical challenges there as well. 00:24:39.973 --> 00:24:41.948 (woman) So, it's possible that some of these posts, 00:24:41.949 --> 00:24:45.824 many of these posts would be multilingual, I'm sorry, monolingual in one language or the other? 00:24:46.498 --> 00:24:51.988 So it's also possible that some of these individual posts 00:24:51.989 --> 00:24:54.184 would mix languages? 00:24:54.623 --> 00:24:56.733 Yes, it is possible. It's very possible!
00:24:57.063 --> 00:25:00.360 It's very challenging for the automatic system! 00:25:01.915 --> 00:25:03.743 (woman) Right, ok. I just wanted to be clear-- 00:25:03.744 --> 00:25:05.185 Yes, exactly. 00:25:05.186 --> 00:25:11.303 So it's not as frequent as I expected, having bilingual posts 00:25:11.304 --> 00:25:12.740 that I would call. 00:25:12.741 --> 00:25:14.431 But it's happening. 00:25:15.058 --> 00:25:20.539 And so, for a series of tests, I had to do manual combing 00:25:20.540 --> 00:25:23.263 and I saw that sometimes it was the case 00:25:23.264 --> 00:25:26.718 that they were doing some sort of translation in the same tweet 00:25:26.719 --> 00:25:31.585 and sometimes it was just the case that they were mentioning titles of things 00:25:31.586 --> 00:25:34.206 or places in a different language. 00:25:34.563 --> 00:25:39.470 So, there's a lot of issues surrounding the automatic handling of this 00:25:39.471 --> 00:25:44.478 but you are dealing with 92 networks 00:25:44.479 --> 00:25:50.864 and they have between 30 and 5,000 nodes in them. 00:25:52.708 --> 00:25:55.841 So, I don't remember the numbers exactly, 00:25:55.867 --> 00:25:59.148 but I'm talking about around 80,000 people. 00:26:01.132 --> 00:26:04.527 So detecting the language of 80,000 people and this is small-scale. 00:26:04.913 --> 00:26:08.286 If you go to millions, you need an automatic system. 00:26:08.287 --> 00:26:11.291 And one of the things I'm having to write up in my dissertation 00:26:11.292 --> 00:26:13.832 is what are the challenges. 00:26:13.833 --> 00:26:17.984 You have to be prepared for them, to solve those problems. 00:26:18.551 --> 00:26:21.851 And one of them is what do you do with bilingual posts 00:26:21.852 --> 00:26:23.920 which language do you assign to that post? 00:26:23.921 --> 00:26:28.287 Automatic posts, spam... there's a lot of problems. 00:26:29.862 --> 00:26:31.219 Challenges, I mean. 
00:26:31.220 --> 00:26:34.766 That's what makes it interesting because you cannot do manual combing 00:26:34.766 --> 00:26:36.046 on these scales. 00:26:39.073 --> 00:26:41.013 Do you have another question? 00:26:44.501 --> 00:26:48.025 So, now, what am I doing with this? 00:26:50.562 --> 00:26:56.178 I'm going to classify my social networks, looking at the patterns 00:26:56.179 --> 00:26:59.094 of overlaps between the language groups. 00:26:59.720 --> 00:27:01.953 Overlaps, or intersections. 00:27:02.547 --> 00:27:07.878 I'm looking specifically at the networks that have only two language groups. 00:27:08.219 --> 00:27:11.860 I had five networks that were trilingual 00:27:12.284 --> 00:27:16.020 so I put them aside to keep it simple first with just two language groups 00:27:16.021 --> 00:27:18.361 to see how they interconnect. 00:27:19.369 --> 00:27:21.272 And then I classified them 00:27:21.936 --> 00:27:24.198 first following a qualitative analysis 00:27:24.198 --> 00:27:28.822 and then I used network statistics that I developed with my adviser 00:27:28.823 --> 00:27:30.386 for this purpose. 00:27:31.338 --> 00:27:33.693 And I will talk later a little more about it. 00:27:34.341 --> 00:27:37.980 So, I tried to provide more robust measures for that. 00:27:39.428 --> 00:27:44.074 I classified them and I came up with some types. 00:27:45.922 --> 00:27:49.631 This is what I call the gatekeeper language bridge type. 00:27:50.526 --> 00:27:52.995 And there are some variants of it, obviously. 00:27:53.624 --> 00:27:55.990 What you can see here is the network of a person 00:27:55.991 --> 00:28:00.092 and I'm going to assume this person is in the United States 00:28:00.093 --> 00:28:02.350 and speaks both Spanish and English. 00:28:04.043 --> 00:28:05.684 Let's call her Maria. 00:28:05.927 --> 00:28:11.581 So she's Maria and she has two groups of friends using Spanish on Twitter 00:28:12.531 --> 00:28:15.768 and then that big group of friends using English.
00:28:17.320 --> 00:28:19.528 And, as you can see, there's just a few nodes 00:28:19.529 --> 00:28:22.003 connecting the two language groups. 00:28:22.004 --> 00:28:27.869 You can see that the social structure can be different from the language groups 00:28:29.391 --> 00:28:32.174 so you can have maybe a group of friends and a group of coworkers 00:28:32.175 --> 00:28:36.424 inside the same language group, so it can be more complex 00:28:36.425 --> 00:28:41.205 than just dividing the social network by language groups. 00:28:41.206 --> 00:28:45.522 There can be more grouping because of other social resources. 00:28:46.811 --> 00:28:50.572 But the interesting thing is that there are only a few nodes 00:28:50.573 --> 00:28:53.455 where people are connecting holding together these Twitters. 00:28:55.058 --> 00:29:00.675 I think this was friends with English here. 00:29:00.676 --> 00:29:05.461 You can see, in this case, it seems like the two groups 00:29:05.462 --> 00:29:08.089 are holding closely together 00:29:08.809 --> 00:29:13.833 because there are many more links holding the two groups together. 00:29:14.663 --> 00:29:18.246 Of course, this is going to depend on the size of the networks 00:29:18.247 --> 00:29:23.067 so I had to account for the size when coming up with measures 00:29:23.068 --> 00:29:25.943 of network connections. 00:29:25.944 --> 00:29:28.257 I had to provide ratios. 00:29:28.258 --> 00:29:32.340 Now, the ratio of cross-language linking here and here 00:29:32.341 --> 00:29:34.312 and you have these types-- 00:29:36.477 --> 00:29:40.266 These types are not just clear-cut. 00:29:40.346 --> 00:29:41.696 There's an evolution. 00:29:41.700 --> 00:29:43.337 There's people that have very few connections 00:29:43.338 --> 00:29:44.653 with the language groups 00:29:44.654 --> 00:29:46.943 and then progressively there's people with more and more. 00:29:47.704 --> 00:29:49.037 And this increases.
00:29:49.037 --> 00:29:52.048 Which points to the fact, that my cells are there. 00:29:52.735 --> 00:29:57.001 Which means I don't see the evolution over time, ok? 00:29:57.819 --> 00:29:59.724 This is a limitation of my research. 00:29:59.725 --> 00:30:04.594 I just see how the social network of this person looked 00:30:04.594 --> 00:30:07.491 at a particular point in time. 00:30:07.925 --> 00:30:10.057 I don't know how it evolves over time. 00:30:10.058 --> 00:30:13.130 So, for myself, it's just there. 00:30:13.508 --> 00:30:18.702 It would be interesting to see these different patterns 00:30:18.702 --> 00:30:20.771 that I have been observing. 00:30:20.771 --> 00:30:26.632 Maybe over time these connections between languages are increasing. 00:30:28.862 --> 00:30:32.131 Now we have the integration and union type 00:30:32.693 --> 00:30:37.128 where in this case you have a person from an Arab country 00:30:37.129 --> 00:30:40.778 and green represents the friends that are using Arabic 00:30:40.779 --> 00:30:45.155 and the friends using English are in pink, but there's also violet: 00:30:45.156 --> 00:30:46.837 those are bilinguals. 00:30:47.196 --> 00:30:51.534 That means there's a group of English users 00:30:51.535 --> 00:30:57.187 and bilingual English - Arabic users inserted in the group of Arabic, inside. 00:30:59.530 --> 00:31:01.289 That's the integration, so they're integrated. 00:31:02.419 --> 00:31:07.726 And then I have a Greek guy, who uses Greek and English 00:31:07.726 --> 00:31:09.446 and his Arabic friends. 00:31:09.446 --> 00:31:11.935 And in this case, you can see it's sort of light blue 00:31:11.936 --> 00:31:16.788 representing Greek, so the friends that tweet in Greek. 00:31:16.789 --> 00:31:20.729 Pink again represents people tweeting in English 00:31:21.353 --> 00:31:23.426 and there's a lot of bilinguals. 00:31:23.449 --> 00:31:26.994 So these kind of dark blues represent the bilinguals.
00:31:26.995 --> 00:31:28.604 And these are two groups 00:31:28.605 --> 00:31:32.741 that if you've seen before, the gatekeeper and the language bridge 00:31:32.742 --> 00:31:35.281 progressively getting closer and closer 00:31:35.282 --> 00:31:40.990 with more and more links across languages. 00:31:41.184 --> 00:31:42.815 In this case, this is like the extreme. 00:31:42.816 --> 00:31:46.016 The links between the two languages are so dense 00:31:46.017 --> 00:31:51.021 that you cannot almost distinguish where the border is 00:31:51.021 --> 00:31:53.128 between the two language groups. 00:31:53.164 --> 00:31:58.534 And, interestingly, the border might be even only noticeable 00:31:58.534 --> 00:32:01.406 because there's a lot of bilinguals around it. 00:32:02.091 --> 00:32:04.924 And this is the union type where they unite. 00:32:07.201 --> 00:32:09.806 And finally, the peripheral language type. 00:32:09.807 --> 00:32:13.690 This is a Brazilian guy, the network of a Brazilian guy 00:32:15.324 --> 00:32:16.892 where you have-- 00:32:16.893 --> 00:32:18.885 probably he lives in the United States or something like that-- 00:32:18.886 --> 00:32:23.192 because this guy has mostly all this big group of friends 00:32:23.226 --> 00:32:24.850 tweeting in English. 00:32:26.532 --> 00:32:31.978 And then there's the side tentacle running outside, using Portuguese. 00:32:34.702 --> 00:32:36.399 And this is like a periphery landscape. 00:32:36.400 --> 00:32:39.137 So, in the periphery there's a small group of Portuguese language. 00:32:39.893 --> 00:32:45.233 Now, I forgot to mention that there's dots that are light yellow or white. 00:32:45.286 --> 00:32:48.100 Those are the ones that have no data. 00:32:49.074 --> 00:32:51.270 So, I don't know the language they're using 00:32:51.271 --> 00:32:53.382 because either their accounts are closed 00:32:53.383 --> 00:32:57.803 or for some reason, in between the collection of data they closed the account. 
00:32:59.307 --> 00:33:03.059 Mostly, the reason is that they're private accounts 00:33:03.570 --> 00:33:05.640 where you cannot get the data from. 00:33:06.442 --> 00:33:08.755 I think somewhere I read it was about 5 percent. 00:33:08.756 --> 00:33:10.216 I'm not sure. 00:33:10.216 --> 00:33:14.010 But for one reason or another, I don't have that information. 00:33:16.563 --> 00:33:20.976 Now, why am I classifying them? These networks? 00:33:22.785 --> 00:33:26.088 Well, the reason is that-- 00:33:26.089 --> 00:33:28.793 well, there are some studies that demonstrate that the social structure 00:33:28.794 --> 00:33:33.539 the structure of the social networks influences the spread of information. 00:33:34.096 --> 00:33:36.457 How information disseminates in the network. 00:33:38.553 --> 00:33:42.909 So, I'm just assuming that these different structures 00:33:42.910 --> 00:33:46.382 are going to influence the spread of information. 00:33:47.292 --> 00:33:49.750 But this is a study that has to be done. 00:33:49.929 --> 00:33:52.944 I cannot demonstrate that one of these types 00:33:52.945 --> 00:33:55.681 facilitates the spread of information. 00:33:55.682 --> 00:34:02.330 I can only say that I am assuming, so that potential study 00:34:04.200 --> 00:34:09.400 could just look at, for example, if gatekeeper and language bridges 00:34:10.551 --> 00:34:16.231 are not as good for spreading information as union and integration types. 00:34:20.178 --> 00:34:25.022 Right, we can just assume because of the cross-language links 00:34:28.295 --> 00:34:33.380 so, how many links there are or the ratio of cross-language links 00:34:33.380 --> 00:34:38.331 may potentially facilitate information diffusion in these cases. 00:34:39.944 --> 00:34:42.557 So, that study needs to be done. 00:34:42.607 --> 00:34:44.732 I cannot say what's going to happen! 00:34:44.732 --> 00:34:47.123 I just assume it's going to be like that.
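The ratio of links that cross language groups, mentioned earlier as a way to account for network size, could be computed along the lines of the sketch below. This is only an illustration under stated assumptions: the function name `cross_language_ratio`, the edge-list input and the choice to skip nodes with no language data are hypothetical, not the statistics developed for the study.

```python
def cross_language_ratio(edges, language_of):
    """Share of links whose two endpoints use different languages.

    edges: iterable of (node_a, node_b) pairs (hypothetical input format).
    language_of: dict mapping a node to its detected language.
    Links touching a node with no language data are skipped (an assumption).
    """
    known = [(a, b) for a, b in edges if a in language_of and b in language_of]
    if not known:
        return 0.0
    crossing = sum(1 for a, b in known if language_of[a] != language_of[b])
    # Dividing by the number of links makes small and large networks comparable.
    return crossing / len(known)


# Toy network: two Spanish users, two English users, one node without data.
example_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "x")]
example_languages = {"a": "es", "b": "es", "c": "en", "d": "en"}
example_ratio = cross_language_ratio(example_edges, example_languages)
```

Under this sketch, a gatekeeper-style network would give a low ratio, and a union-style network a higher one.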
00:34:49.178 --> 00:34:52.009 So that is the reason why I classify them. 00:34:52.498 --> 00:34:54.599 I have some network statistics. 00:34:55.969 --> 00:35:00.753 We've made about an 80 percent accuracy guess, which is quite good, 00:35:00.753 --> 00:35:02.453 but the sample is small. 00:35:08.014 --> 00:35:10.961 So now, do you have any more questions before I move on to the next study? 00:35:13.726 --> 00:35:15.444 (man) I was curious as to how many-- 00:35:15.444 --> 00:35:19.144 what was the selection process like to find the 92 users? 00:35:20.324 --> 00:35:22.891 Well, this is what I've been spending the beginning 00:35:22.892 --> 00:35:26.690 about just using two stopwords from two different languages 00:35:26.691 --> 00:35:31.482 typing that in the search box in Google and searching Twitter 00:35:31.482 --> 00:35:32.875 and then once-- 00:35:32.876 --> 00:35:36.192 Basically you just go through the list of results 00:35:36.193 --> 00:35:41.540 and start opening the profile, counting the tweets. 00:35:42.327 --> 00:35:44.536 How many in this language, how many in the other. 00:35:44.601 --> 00:35:46.640 And we put a threshold of 10 percent: 00:35:46.640 --> 00:35:53.026 they had to have written 10 percent of the tweets in a second language 00:35:53.228 --> 00:35:56.742 and you couldn't count retweets or automatic posting. 00:35:57.937 --> 00:36:00.296 We also had to manually discard these spammers. 00:36:01.535 --> 00:36:03.733 So, that was the process. 00:36:06.151 --> 00:36:09.536 (woman) And that's a paid search through Google?
00:36:10.131 --> 00:36:12.601 No, that we did manually 00:36:12.717 --> 00:36:14.087 and then once-- 00:36:14.088 --> 00:36:20.392 So the other thing you can say is you can use these core multilingual users 00:36:20.938 --> 00:36:23.929 and then do what I did for behavior in these social networks 00:36:23.929 --> 00:36:29.363 which is once you extract the friends and extract the messages of the friends 00:36:30.669 --> 00:36:33.559 and automatically find the language 00:36:34.035 --> 00:36:36.522 then you can say "Oh, this person is multilingual" automatically. 00:36:36.522 --> 00:36:41.099 You just process it and you can detect a lot more multilingual people 00:36:41.183 --> 00:36:42.756 through that process. 00:36:42.757 --> 00:36:46.101 The paid process was sending these posts 00:36:46.101 --> 00:36:49.075 to the Google language identification tool. 00:36:49.885 --> 00:36:55.010 So, what I did was clean each message automatically. 00:36:55.544 --> 00:37:00.387 Basically, eliminating the hashtags 00:37:01.437 --> 00:37:05.230 and the mentions that had an @ in front, 00:37:05.230 --> 00:37:10.074 symbols, URLs, all those things I would automatically eliminate them 00:37:10.392 --> 00:37:13.777 and then with the rest of the message, I'd send that to the Google API 00:37:14.125 --> 00:37:15.849 for language identification 00:37:16.009 --> 00:37:21.726 and the Google API would give me a language label and a confidence value. 00:37:21.726 --> 00:37:23.476 And that for each message. 00:37:23.485 --> 00:37:26.371 And then I built the algorithm with the help of Jen Golbeck 00:37:26.372 --> 00:37:30.688 to decide, well I have 30 messages, 500 English 00:37:30.714 --> 00:37:35.420 10 million Spanish and then one in Swahili which is unlikely 00:37:36.728 --> 00:37:39.954 and you had to decide the confidence value-- 00:37:39.955 --> 00:37:42.935 So I used rules, defined rules 00:37:42.936 --> 00:37:45.559 but it could be done statistically I think.
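The cleaning step and the rule-based decision described above might look roughly like the following sketch. It is not the algorithm built for the study: the regular expressions, the function names `clean_tweet` and `user_languages`, and the `min_confidence` cut-off are assumptions made for illustration. The 10 percent share echoes the threshold mentioned for selecting bilingual users. The language identification service itself is not called here; its per-message output is assumed to arrive as (label, confidence) pairs.

```python
import re
from collections import Counter

URL_RE = re.compile(r"https?://\S+")   # links
TAG_RE = re.compile(r"[@#]\w+")        # @mentions and #hashtags
SYMBOL_RE = re.compile(r"[^\w\s]")     # remaining punctuation and symbols


def clean_tweet(text):
    """Strip URLs, mentions, hashtags and symbols before language detection."""
    text = URL_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    text = SYMBOL_RE.sub(" ", text)
    return " ".join(text.split())


def user_languages(detections, min_confidence=0.5, min_share=0.1):
    """Decide which languages a user really writes in.

    detections: one (language_label, confidence) pair per message.
    min_confidence: hypothetical cut-off for trusting a single detection.
    min_share: a language must cover at least this share of trusted messages,
               so a single stray detection (e.g. one "sw" message) is ignored.
    """
    trusted = [lang for lang, conf in detections if conf >= min_confidence]
    if not trusted:
        return []
    counts = Counter(trusted)
    total = len(trusted)
    return sorted(lang for lang, n in counts.items() if n / total >= min_share)


cleaned = clean_tweet("@maria Hola mundo! #saludos http://t.co/abc")
languages = user_languages([("en", 0.9)] * 19 + [("es", 0.8)] * 5 + [("sw", 0.9)])
```

A user would then count as multilingual when `user_languages` returns more than one label. A statistical method could replace the fixed thresholds.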
00:37:46.097 --> 00:37:48.388 And write some statistical method to decide 00:37:48.389 --> 00:37:51.869 "well this person actually is bilingual" or whatever. 00:37:52.779 --> 00:37:54.429 That's the process. 00:37:54.477 --> 00:37:55.597 It's long! 00:37:55.788 --> 00:37:56.788 Yes. 00:37:58.026 --> 00:38:00.487 (woman) Hi, I understand that you did it manually 00:38:00.488 --> 00:38:05.265 but currently in existing research field is there any software 00:38:05.265 --> 00:38:08.489 that we can use to capture, 00:38:08.489 --> 00:38:11.935 to have access to all these different tweets? 00:38:11.983 --> 00:38:15.400 And to capture the different categories? [inaudible] 00:38:15.400 --> 00:38:18.472 Ok, so you mean the extraction? 00:38:18.912 --> 00:38:19.983 (woman) Yeah. 00:38:19.983 --> 00:38:21.226 No, I didn't do it manually. 00:38:21.227 --> 00:38:22.705 (woman) And the other, I think the other part 00:38:22.706 --> 00:38:25.570 of your data presentation is visualizations coming out 00:38:25.571 --> 00:38:27.132 like this graph. 00:38:27.132 --> 00:38:32.610 Can you show us what kind of research do we have for social scientists 00:38:33.250 --> 00:38:35.478 to present the data in a visual form? 00:38:35.479 --> 00:38:37.461 This is a tool I would recommend. 00:38:37.461 --> 00:38:39.123 [inaudible] 00:38:39.123 --> 00:38:41.427 So, the first question. 00:38:42.572 --> 00:38:45.748 All the extraction from Twitter, it was automatic. 00:38:46.265 --> 00:38:48.638 I didn't copy the tweets, it was automatic. 00:38:48.855 --> 00:38:50.707 I used the Twitter API. 00:38:51.286 --> 00:38:54.849 They have a process for registered developers 00:38:54.850 --> 00:38:57.205 and I extracted it automatically. 
00:39:01.925 --> 00:39:05.777 Now, the tools, and I forgot to put that in this slide 00:39:05.847 --> 00:39:09.444 but in the beginning, when I showed you the first visualization 00:39:09.445 --> 00:39:11.605 I put the name of the tool in-- 00:39:12.703 --> 00:39:17.644 I don't know if I translate well, but I think it's G-E-- 00:39:17.644 --> 00:39:23.785 You can see here, G-E-P-H-I, I don't know how to pronounce it! 00:39:23.785 --> 00:39:26.997 ["Jefy" I think...] 00:39:28.201 --> 00:39:32.216 So, this is the one I've used for the visualizations 00:39:33.709 --> 00:39:36.871 and it's good because you can use it on any platform. 00:39:36.872 --> 00:39:41.911 So both on a Mac or a PC or Linux. 00:39:44.829 --> 00:39:46.696 Now, it has limitations for... 00:39:47.209 --> 00:39:50.778 mostly for network statistics in my opinion. 00:39:54.237 --> 00:39:57.061 The other one, that is very popular is Node XL. 00:39:57.062 --> 00:40:00.548 And in fact it was developed here in the ATI lab. 00:40:01.773 --> 00:40:04.092 In the lab where I work. 00:40:05.190 --> 00:40:06.937 So, they collaborated with Microsoft. 00:40:06.938 --> 00:40:09.867 It's a template for Excel 00:40:11.076 --> 00:40:12.552 and it allows-- 00:40:12.553 --> 00:40:17.849 In fact they are still adding new features and there's two people working on it 00:40:18.235 --> 00:40:19.665 in the lab. 00:40:19.739 --> 00:40:23.984 But the reason I haven't used it here, is because I have a Mac 00:40:24.264 --> 00:40:29.166 and also there's another reason I like this positioning algorithm 00:40:31.302 --> 00:40:32.807 and this is... 00:40:32.808 --> 00:40:37.014 this is another issue I haven't talked about 00:40:37.124 --> 00:40:40.476 is how you actually place the dots. 00:40:40.476 --> 00:40:47.182 And actually these algorithms for layout use force-directed schemes 00:40:48.820 --> 00:40:50.507 like in physics science. 
00:40:50.584 --> 00:40:53.598 So if a node has a lot of links with another node 00:40:53.599 --> 00:40:56.980 they put it closer, so it's like there's forces 00:40:56.981 --> 00:41:00.276 or strings attaching the nodes. 00:41:00.858 --> 00:41:04.293 And depending on how many strings there are, they're closer or farther. 00:41:04.605 --> 00:41:07.933 There's physics science rules for placing them. 00:41:07.959 --> 00:41:09.508 But there's different algorithms 00:41:09.509 --> 00:41:14.981 but the other reason I chose Gephi is that it has an algorithm 00:41:15.336 --> 00:41:20.899 specifically in this tool that places my language groups separately 00:41:20.943 --> 00:41:24.338 more than any other algorithm that I could use in Node XL. 00:41:24.339 --> 00:41:29.142 And it was more useful to see the groups separated. 00:41:30.407 --> 00:41:33.186 But you can use both depending on what you want to do. 00:41:33.187 --> 00:41:35.905 They both have weaknesses and strengths, 00:41:35.931 --> 00:41:38.847 different depending on what you have to do. 00:41:40.592 --> 00:41:46.628 Node XL has more features for processing many networks 00:41:48.068 --> 00:41:51.147 and extracting network statistics for many networks at the same time. 00:41:52.217 --> 00:41:57.372 And it has a lot of interesting features, maybe this is more manual. 00:41:58.528 --> 00:41:59.998 I don't know. 00:42:00.215 --> 00:42:04.670 Somebody called it "the Photoshop of visualization". 00:42:09.125 --> 00:42:13.580 So I'm going to briefly comment on the factor analysis. 00:42:13.892 --> 00:42:18.627 The point here, what I want to see is multilingual users of Twitter 00:42:20.784 --> 00:42:23.663 are aware of their audience in a way. 00:42:24.848 --> 00:42:29.480 And they somehow perceive how many followers 00:42:29.480 --> 00:42:32.205 of this language or the other they have. 00:42:32.761 --> 00:42:35.501 Maybe not very consciously, 00:42:37.641 --> 00:42:39.763 but they perceive something. 
00:42:39.932 --> 00:42:42.468 So, I went to see how this social network 00:42:42.469 --> 00:42:46.691 the fact that there's many languages or just one in the social network 00:42:47.628 --> 00:42:52.814 can affect the choice of language in this person, the ego person. 00:42:54.638 --> 00:42:57.734 So, I actually did a lot of testing, different variables, 00:42:57.735 --> 00:43:01.434 but I'm just going to focus on the essence, 00:43:01.434 --> 00:43:05.729 which is I have my dependent variable which is the proportion of English 00:43:05.730 --> 00:43:11.064 used by the ego has 50 posts, maybe 60 percent of them are in English 00:43:11.883 --> 00:43:14.409 and 40 percent in Spanish, I don't know. 00:43:14.693 --> 00:43:18.630 And then they have the factor of how many users in the network 00:43:18.631 --> 00:43:21.381 are in English and how many are using other languages. 00:43:21.597 --> 00:43:24.274 And then the multilingual index of the network 00:43:24.275 --> 00:43:26.153 - and this is my favorite part - 00:43:26.153 --> 00:43:29.674 because it's basically saying 00:43:29.774 --> 00:43:35.900 "is multilingualism encouraging English as a lingua franca?" 00:43:37.026 --> 00:43:41.693 especially on Twitter, where we have these public posts that anybody can read. 00:43:43.339 --> 00:43:47.418 So anyway... I'm not going to go into the technical details 00:43:47.940 --> 00:43:50.516 of bi-nodal statistical interpretation. 00:43:50.517 --> 00:43:55.415 What I wanted to do is that in these combined effects 00:43:56.046 --> 00:44:00.500 of the factors, which one was more important? 00:44:00.998 --> 00:44:03.208 Was heavier than the others? 00:44:03.289 --> 00:44:07.340 Had more weight in defining these proportional [inaudible] used by the ego. 00:44:08.750 --> 00:44:11.242 I tried other factors, 00:44:11.243 --> 00:44:14.237 I also looked at the use of non-English language 00:44:15.370 --> 00:44:18.137 In the end... 
there are certain, 00:44:19.620 --> 00:44:21.423 I mean, they're obvious somehow. 00:44:21.424 --> 00:44:23.602 I think it's more interesting the process of what I've learned 00:44:23.603 --> 00:44:25.908 than the results themselves. 00:44:27.166 --> 00:44:30.031 Because basically what I've learned is that, yeah, 00:44:31.040 --> 00:44:32.931 the English use of the network 00:44:32.931 --> 00:44:36.338 is encouraged by the use of English by the ego 00:44:36.338 --> 00:44:40.756 and in a certain way it's so important that any other factor 00:44:40.757 --> 00:44:44.029 is really not that important. 00:44:45.231 --> 00:44:48.980 And even the second most important, the multilingual index 00:44:49.770 --> 00:44:54.830 was so light compared with the heavy impact of English 00:44:55.575 --> 00:44:57.107 used in the network. 00:44:57.608 --> 00:45:00.294 But what I thought was really interesting 00:45:00.295 --> 00:45:03.329 was how do you define the multilinguality of a network? 00:45:03.968 --> 00:45:07.295 And with this I got help from Jordan Boyd-Graber 00:45:07.296 --> 00:45:09.336 who is also in the iSchool 00:45:09.337 --> 00:45:14.331 and in the lab for computational lab, the information processing lab 00:45:14.332 --> 00:45:15.332 here in Maryland. 00:45:15.333 --> 00:45:17.556 He helped me with all these technical aspects. 00:45:18.183 --> 00:45:20.590 And he was the one suggesting "Well, why don't you look--" 00:45:20.590 --> 00:45:24.620 "instead of just looking at the number of languages in the network... 00:45:24.620 --> 00:45:28.694 "because sometimes you get wrongly detected languages... 00:45:28.695 --> 00:45:30.231 like Swahili. Well, no one was really speaking Swahili in this network. 00:45:33.201 --> 00:45:37.029 There were technical challenges, like I explained to you. 00:45:38.122 --> 00:45:42.248 So maybe there's a high number of languages in the network 00:45:42.249 --> 00:45:44.189 but the network is mostly monolingual.
00:45:44.190 --> 00:45:49.064 Mostly everybody uses English and just a few people maybe use others 00:45:49.633 --> 00:45:52.337 or maybe just it got wrongly detected. 00:45:52.338 --> 00:45:54.810 And maybe you're just saying 00:45:54.811 --> 00:45:57.047 "Oh yeah, there's ten languages in the network!" 00:45:57.048 --> 00:45:59.548 and actually it's not a very multilingual network at all. 00:45:59.549 --> 00:46:02.650 So, we came up with this, the entropy. 00:46:03.390 --> 00:46:06.495 And this is a physics concept that measures the disorder 00:46:06.496 --> 00:46:07.866 in a system. 00:46:07.866 --> 00:46:11.452 And in this case, the entropy would be my multilingual index 00:46:11.453 --> 00:46:17.104 and what it's doing is providing a value between 0 and 1 00:46:17.364 --> 00:46:23.105 So, with 0 it's a very homogeneous system everyone speaks the same language 00:46:23.549 --> 00:46:26.900 and if it's closer to 1, it's really a heterogeneous 00:46:26.972 --> 00:46:28.911 and it places an importance 00:46:28.912 --> 00:46:31.823 in how many people are using its language. 00:46:32.235 --> 00:46:36.480 So, this is the equation, just to show you it. 00:46:38.009 --> 00:46:40.641 And it takes into account the number of languages in the network 00:46:40.642 --> 00:46:45.427 and then one of the variables is how many nodes in that language 00:46:45.498 --> 00:46:48.337 that there are divided by the total number 00:46:48.338 --> 00:46:50.971 and this is what gives the proportion for example. 00:46:52.889 --> 00:46:56.556 So just to let you know that there's interesting lessons 00:46:56.557 --> 00:46:57.977 from this study. 00:46:57.982 --> 00:47:00.479 Despite the research not being exciting! 00:47:00.549 --> 00:47:02.881 And this is what I'm doing right now. 00:47:04.816 --> 00:47:08.002 So, the intrinsic characteristic of the message 00:47:08.484 --> 00:47:11.038 how that influences the language choice. 
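The entropy-based multilingual index described a moment ago can be sketched as a normalized Shannon entropy over the languages of the nodes. The exact equation on the slide is not in the transcript, so the normalization by the logarithm of the number of languages (which keeps the value between 0 and 1) is an assumption, and the name `multilingual_index` is hypothetical.

```python
import math
from collections import Counter


def multilingual_index(node_languages):
    """Normalized Shannon entropy of the languages used in a network.

    node_languages: one language label per node with known language.
    Returns 0.0 when everyone uses the same language and approaches 1.0
    when the nodes are spread evenly over the languages present.
    """
    counts = Counter(node_languages)
    n_langs = len(counts)
    if n_langs <= 1:
        return 0.0
    total = sum(counts.values())
    # p = nodes in that language divided by the total number of nodes
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(n_langs)


monolingual = multilingual_index(["en"] * 10)
balanced = multilingual_index(["en", "es"] * 5)
mostly_english = multilingual_index(["en"] * 9 + ["sw"])
```

Unlike a plain count of languages, this value stays low for a network where nearly everyone uses English and one stray label appears, which matches the motivation given for the index.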
00:47:11.062 --> 00:47:16.370 First, I'm wondering, because I just saw it in the content 00:47:19.070 --> 00:47:22.495 are replies encouraging people to use their native language? 00:47:22.992 --> 00:47:27.150 And public posts encouraging people to use English as a lingua franca? 00:47:27.759 --> 00:47:30.251 This is one that showed up the same. 00:47:30.252 --> 00:47:34.151 And I changed the handle, for privacy reasons... 00:47:34.549 --> 00:47:37.709 So this is the reply to somebody and it's in Arabic. 00:47:38.443 --> 00:47:41.001 And this is a public posting and it's in English. 00:47:42.414 --> 00:47:45.501 Now, the thing I'm looking at is topic analysis 00:47:45.502 --> 00:47:50.314 and I'm considering with Jordan to do some automatic topic analysis 00:47:50.706 --> 00:47:54.215 because there's many languages, so I cannot decode it all 00:47:54.782 --> 00:47:56.503 in many of them. 00:47:56.507 --> 00:47:58.459 Only in three, maybe four... 00:47:59.910 --> 00:48:01.406 So, I'm wondering, 00:48:01.407 --> 00:48:04.213 are technology topics favoring the use of English? 00:48:04.600 --> 00:48:10.072 And other topics, international news maybe? 00:48:11.308 --> 00:48:16.147 Whereas other topics like national news or songs 00:48:16.148 --> 00:48:19.407 they might be encouraging the use of native languages. 00:48:20.566 --> 00:48:22.904 And then I'm looking if there's translations 00:48:22.904 --> 00:48:26.845 or if there's cross-cultural words that you can detect. 00:48:27.324 --> 00:48:29.111 For instance, this person is writing in English 00:48:29.112 --> 00:48:33.313 but is recommending a visit to a museum in the city of Lille in France. 00:48:33.767 --> 00:48:38.830 So this person knows the city in France, knows that to visit the museum 00:48:38.987 --> 00:48:40.556 you go there. 00:48:40.559 --> 00:48:43.089 And this is what I call cross-cultural words.
00:48:44.239 --> 00:48:49.095 [What I kind of found] is that surprisingly there's not many translation behaviors 00:48:49.096 --> 00:48:52.589 going on, despite these people being multilingual. 00:48:53.001 --> 00:48:56.264 And this is what is going to trigger some reflections. 00:49:00.289 --> 00:49:02.085 How am I doing on time? 00:49:04.172 --> 00:49:05.646 (woman) 1:22. 00:49:05.646 --> 00:49:10.050 (man) Umm, it's usually an hour long... 00:49:10.450 --> 00:49:14.358 So, I will go on with my reflections. 00:49:14.358 --> 00:49:18.266 to encourage some thoughts. 00:49:18.266 --> 00:49:22.027 So the greatest connecting power is the will of users who want 00:49:22.027 --> 00:49:23.317 to be connected. 00:49:23.317 --> 00:49:28.201 This is a really nice quality, because the communities of interest 00:49:28.290 --> 00:49:32.012 in social media, in Twitter is what is bringing people 00:49:32.013 --> 00:49:33.701 from different countries, together. 00:49:34.794 --> 00:49:41.151 And also experiences, like the Voluntweeters, 00:49:42.095 --> 00:49:45.815 so after the earthquake in Haiti, there were these spontaneous 00:49:45.816 --> 00:49:48.972 self-organizations of Twitter users for translating tweets 00:49:50.213 --> 00:49:53.755 and they called themselves Voluntweeters, there's a paper about that-- 00:49:53.826 --> 00:49:59.151 So this is the triggering of social connections 00:50:00.820 --> 00:50:04.486 across countries, across borders and across languages. 00:50:06.759 --> 00:50:10.300 But even when the social structure could potentially facilitate 00:50:10.301 --> 00:50:13.375 information diffusion and cross-language linking 00:50:14.558 --> 00:50:16.731 this condition is not sufficient. 00:50:16.732 --> 00:50:19.720 There are other factors like the design of the interfaces 00:50:19.721 --> 00:50:22.479 and the design of systems that can influence... 00:50:23.145 --> 00:50:27.438 can promote, or not translation behaviors and cross-cultural awareness. 
00:50:28.293 --> 00:50:31.503 And the Wikipedia of cross-language linking 00:50:31.504 --> 00:50:35.113 you have links for many languages for every article. 00:50:37.257 --> 00:50:41.061 We also still acknowledge the dynamic language preferences of multilingual users 00:50:41.790 --> 00:50:44.145 so they could address their messages to the appropriate audience. 00:50:44.146 --> 00:50:47.187 I like the solution of Google+ with their circles 00:50:47.880 --> 00:50:51.890 where I can put my friends and family in Spain in a circle 00:50:51.891 --> 00:50:54.559 and write them in Spanish. 00:50:54.739 --> 00:51:00.633 And then the recommendation of people based on language profile 00:51:01.437 --> 00:51:04.134 would be useful for this spontaneous self-organization. 00:51:05.708 --> 00:51:08.057 So, these are some of the things. 00:51:08.143 --> 00:51:10.455 The impact of mediation. 00:51:10.782 --> 00:51:13.206 Global Voices is an international community of bloggers 00:51:13.207 --> 00:51:18.303 that connect bloggers and citizens from around the world 00:51:18.814 --> 00:51:20.504 in different languages. 00:51:21.171 --> 00:51:22.580 And Scott Hale 00:51:22.581 --> 00:51:27.353 a student from Oxford University led a very interesting study 00:51:27.354 --> 00:51:33.960 after the earthquake in Haiti about blogs in Spanish, Japanese and English 00:51:35.561 --> 00:51:38.542 and he looked at the cross-language linking 00:51:38.543 --> 00:51:41.388 and focusing on this topic over time. 00:51:41.488 --> 00:51:45.495 And he discovered that 50 percent of the cross-language linking 00:51:45.496 --> 00:51:48.304 was happening through this platform, Global Voices. 00:51:49.062 --> 00:51:51.941 So, it had a very big impact in the language links. 00:51:54.170 --> 00:51:57.857 And finally, social media, big media outlets, 00:51:57.858 --> 00:52:01.592 people are interconnected in these complex networks 00:52:04.693 --> 00:52:08.945 and underlying is this language ecosystem. 
00:52:09.058 --> 00:52:12.786 So we have the language ecosystem, and on top of that 00:52:12.787 --> 00:52:15.296 we have the social media ecosystem. 00:52:15.305 --> 00:52:20.200 People would share a video from YouTube on Twitter, or news on Facebook. 00:52:21.302 --> 00:52:26.011 What happens if we integrate in this ecosystem 00:52:26.517 --> 00:52:30.518 these platforms, like Global Voices, like Universal Subtitles 00:52:30.519 --> 00:52:34.327 which is a platform for crowdsourcing subtitling of videos 00:52:34.328 --> 00:52:37.108 and translation of subtitles for videos? 00:52:38.050 --> 00:52:42.222 If you integrate that and this starts connecting, starts building paths 00:52:42.223 --> 00:52:45.743 between languages that didn't exist before. 00:52:45.744 --> 00:52:50.955 So I think we should make it easy for multilingual people to translate 00:52:50.955 --> 00:52:55.187 and subtitle all the content they like, their favorite content 00:52:56.003 --> 00:53:00.326 and share it with the appropriate audience so they can start connecting 00:53:00.327 --> 00:53:03.114 the language islands of the internet. 00:53:03.145 --> 00:53:06.219 And that way stories will travel all over the world. 00:53:09.204 --> 00:53:11.950 Particularly I would like to thank Jen Golbeck, my adviser 00:53:11.951 --> 00:53:14.337 and Fulbright for supporting this research. 00:53:14.477 --> 00:53:19.206 And then I open the space for questions and your ideas 00:53:19.488 --> 00:53:21.780 if this has triggered some thoughts. 00:53:24.140 --> 00:53:25.972 (woman) I have a question about how this relates 00:53:25.973 --> 00:53:28.112 to your Yahoo award. 00:53:29.468 --> 00:53:35.076 Well, they have the Internet Experiences lab in California. 00:53:35.078 --> 00:53:36.428 And they-- 00:53:36.460 --> 00:53:40.213 So, we tend to think maybe it's a super tiny place 00:53:40.213 --> 00:53:42.630 but actually there are fields 00:53:42.631 --> 00:53:44.818 and I applied for the social systems.
00:53:45.121 --> 00:53:48.967 The social systems are a category. 00:53:49.068 --> 00:53:54.686 And I think that was embedded in the Internet Experience lab 00:53:56.739 --> 00:53:58.452 and yeah, they liked it. 00:53:58.516 --> 00:54:01.530 (man) But is it this work that they are interested in? 00:54:01.813 --> 00:54:02.883 Yes. 00:54:02.884 --> 00:54:04.022 - The languages? - Yes. 00:54:04.022 --> 00:54:07.726 Well, now I have results, because I wrote up reports 00:54:09.496 --> 00:54:11.548 about what my work was about. 00:54:16.758 --> 00:54:17.968 Great. 00:54:22.055 --> 00:54:22.879 Yes? 00:54:22.879 --> 00:54:25.682 (woman) I was thinking about if you analyzed the place... 00:54:25.682 --> 00:54:30.689 like if there's any relationship between tweeters and tweets 00:54:31.056 --> 00:54:33.624 and the place that the people are. 00:54:35.883 --> 00:54:39.760 I mean, because it's not the same being a Brazilian in Brazil 00:54:39.761 --> 00:54:43.197 and tweeting in Portuguese or being Brazilian in the US 00:54:43.198 --> 00:54:45.330 and tweeting in Portuguese-- 00:54:45.950 --> 00:54:49.249 There's many, many factors that I haven't looked at. 00:54:50.126 --> 00:54:51.971 It's not part of your study? 00:54:52.300 --> 00:54:54.447 But because I had to scope it somehow. 00:54:54.448 --> 00:54:56.108 There's so many factors. 00:54:56.710 --> 00:54:59.993 Geography was one that I was originally intending to look at 00:55:00.097 --> 00:55:04.458 but I found there were so many problems to actually get the right geography 00:55:04.459 --> 00:55:06.652 the right geolocation. 00:55:08.154 --> 00:55:12.136 The problem is that I didn't originally collect the geolocation. 00:55:12.137 --> 00:55:15.898 I think only a small percentage of messages have... 00:55:16.457 --> 00:55:18.297 geolocated information. 00:55:18.902 --> 00:55:20.795 I'm not sure about the percentage there. 00:55:20.796 --> 00:55:24.690 So there's only a small percentage of messages that have geolocation. 
00:55:25.173 --> 00:55:27.604 There's issues with the accuracy... 00:55:28.041 --> 00:55:31.147 What I have collected is the information in their profile 00:55:31.931 --> 00:55:35.462 they can put the information about the place, 00:55:35.493 --> 00:55:39.572 but sometimes it's more or less trustworthy, 00:55:39.573 --> 00:55:42.828 sometimes there's nothing, and sometimes there's just crazy stuff. 00:55:43.210 --> 00:55:44.710 (audience laughs) 00:55:46.545 --> 00:55:49.735 So, something absolutely has to be there. 00:55:50.419 --> 00:55:55.249 If I wanted to expand this, geography would be a nice place to go! 00:55:55.279 --> 00:55:56.609 (woman) Ok. 00:55:59.863 --> 00:56:00.631 Yes? 00:56:00.631 --> 00:56:01.710 (man) Could you say a little bit more 00:56:01.710 --> 00:56:04.946 I think you said about the visualization choices you made? 00:56:04.964 --> 00:56:06.224 Oh yes, well... 00:56:08.033 --> 00:56:11.117 I tried this tool, NodeXL, 00:56:11.118 --> 00:56:13.284 I used both NodeXL and Gephi. 00:56:13.522 --> 00:56:14.522 There's more... 00:56:16.109 --> 00:56:20.202 I think there's, I don't remember the name, there's one that was developed 00:56:20.202 --> 00:56:21.854 here in Maryland 00:56:21.854 --> 00:56:24.163 but it's not as user-friendly. 00:56:26.108 --> 00:56:29.563 But I've forgotten the name, I will have to look it up. 00:56:29.895 --> 00:56:33.872 And there's a lot of tools that are for really technical people 00:56:34.696 --> 00:56:37.156 that are handling millions of nodes. 00:56:37.528 --> 00:56:40.615 Because with these tools, for social scientists or humanists 00:56:40.615 --> 00:56:42.295 maybe they are not. 00:56:42.316 --> 00:56:48.685 Some tools can have maybe 300-400 nodes and still be understandable.
00:56:51.115 --> 00:56:55.622 But if you go beyond that, actually visualizations get crazy 00:56:56.058 --> 00:57:02.088 and even for more technical tools for more technical people 00:57:02.563 --> 00:57:07.061 there are hundreds or millions, they cannot do visualizations 00:57:08.349 --> 00:57:11.870 at some point they just give you statistical measures. 00:57:13.729 --> 00:57:15.156 I have to leave it out. 00:57:15.156 --> 00:57:17.051 I have a list of tools and that 00:57:17.051 --> 00:57:20.598 but if I need the names, I need to go through everything. 00:57:22.596 --> 00:57:25.479 (woman) But yours was Mac-accessible? 00:57:25.479 --> 00:57:31.585 Yes, this Gephi tool is Mac-accessible, you can use it with Microsoft 00:57:31.792 --> 00:57:34.446 with Mac and with Linux. 00:57:35.905 --> 00:57:37.979 And I forgot to say, it's open source. 00:57:43.480 --> 00:57:48.839 (woman) Did you find studying languages and internet 00:57:48.840 --> 00:57:52.681 was like a place, unexplored? 00:57:52.948 --> 00:57:55.208 Like here in the United States? 00:57:55.378 --> 00:58:00.001 Like when you began studying or analyzing this 00:58:00.002 --> 00:58:04.303 you felt that a lot of people are doing this 00:58:04.303 --> 00:58:06.200 or nobody is doing this 00:58:06.200 --> 00:58:08.352 and I'm the first one trying to-- 00:58:08.435 --> 00:58:13.114 I'm not the first one, but it's a very new area 00:58:13.114 --> 00:58:14.971 to be exploring. 00:58:15.033 --> 00:58:16.983 So, it's very exciting because of that. 
00:58:17.012 --> 00:58:18.797 Because there's so many unanswered questions 00:58:18.798 --> 00:58:23.785 and I find that surprisingly enough the United States is not paying so much attention 00:58:23.786 --> 00:58:26.053 about multilinguality issues 00:58:26.053 --> 00:58:31.002 And I think that language policies are very monolingual-oriented 00:58:31.003 --> 00:58:32.948 but it's terrible 00:58:33.043 --> 00:58:37.182 because there's a whole lot of multilinguality in this country. 00:58:37.183 --> 00:58:41.270 There's so many people speaking different languages 00:58:42.548 --> 00:58:45.290 that I'm so amazed about that contradiction. 00:58:45.780 --> 00:58:48.727 Because in Europe, it's an obvious challenge for us 00:58:49.388 --> 00:58:51.907 because we need to understand each other between all these countries 00:58:51.907 --> 00:58:53.567 of the European Union. 00:58:53.567 --> 00:58:58.499 And there's a lot of money invested in research that relates to multilinguality 00:58:58.691 --> 00:59:00.738 and communication in languages 00:59:00.738 --> 00:59:04.557 and technology in particular, cross-language systems 00:59:04.558 --> 00:59:09.030 and in libraries there's a lot of work going on. 00:59:09.400 --> 00:59:13.942 There's investment in the research. 00:59:14.565 --> 00:59:18.405 So yeah, maybe in terms of investment 00:59:18.405 --> 00:59:22.115 the European Union is not a bad place to be. 00:59:22.322 --> 00:59:24.109 Better than the United States! 00:59:24.110 --> 00:59:27.445 But at the same time, what I find interesting 00:59:27.446 --> 00:59:33.323 is that here when I talk about it people are really interested 00:59:35.313 --> 00:59:38.376 and interested in the subject and excited about it. 00:59:38.458 --> 00:59:41.294 Maybe in Europe it looks more like old news. 00:59:41.294 --> 00:59:43.796 Like "yeah, we already know that." 
00:59:44.135 --> 00:59:45.665 (audience laughs) 00:59:45.674 --> 00:59:49.580 So I find that it's exciting to be seeing the audience 00:59:49.629 --> 00:59:52.226 like "Oh yeah!" It's so new. 00:59:52.666 --> 00:59:54.026 (woman) Yes. 00:59:58.653 --> 01:00:03.146 (woman) As the emerging view of research in the United States 01:00:03.146 --> 01:00:09.892 can you show me which institutions or which area of academic institutions 01:00:11.798 --> 01:00:14.748 actually have more invested in this topic in the US? 01:00:16.262 --> 01:00:18.916 I'm not sure about the institutions. 01:00:20.572 --> 01:00:25.978 What I know, particularly, in Indiana there's work 01:00:26.510 --> 01:00:29.107 because Susan Herring is a researcher there. 01:00:30.797 --> 01:00:32.891 She has inspired my work. 01:00:32.891 --> 01:00:35.607 She published a book The Multilingual Internet 01:00:35.687 --> 01:00:40.953 and she has done research on blogs, also communities 01:00:41.891 --> 01:00:45.251 of different languages connecting blogs in the blogosphere. 01:00:45.251 --> 01:00:51.058 So she has been one of the ones, one of the first tackling these issues 01:00:51.144 --> 01:00:54.720 and she's still going and she's doing something. 01:00:54.896 --> 01:00:59.399 So, it's the University of Indiana, I think. 01:01:00.914 --> 01:01:03.348 Yeah, Susan Herring. Look for her! 01:01:06.095 --> 01:01:09.181 And also at the same university there's Paolillo. 01:01:10.156 --> 01:01:12.793 He's also doing research in this area 01:01:12.826 --> 01:01:18.869 and he actually published for UNESCO for research on language diversity 01:01:18.945 --> 01:01:20.275 on the internet. 01:01:21.785 --> 01:01:23.479 So Susan Herring and Paolillo, 01:01:23.480 --> 01:01:25.444 they are at the same university. 01:01:26.736 --> 01:01:30.058 Those are my inspiring ones.
01:01:33.682 --> 01:01:37.270 Well, at Harvard, the Berkman Center of Internet and Society also did 01:01:37.270 --> 01:01:38.639 this mapping of the blogs. 01:01:38.640 --> 01:01:40.649 But they don't focus on languages. 01:01:41.700 --> 01:01:45.279 But there's a tangential thing around there. 01:01:49.387 --> 01:01:51.428 (man) One more question? 01:01:53.560 --> 01:01:54.748 Well, thank you very much! 01:01:54.749 --> 01:01:55.749 Thanks! 01:01:55.759 --> 01:01:57.661 (audience applauds)