1 00:00:00,384 --> 00:00:01,891 Good morning, everyone. 2 00:00:02,858 --> 00:00:05,650 Thank you for coming here [unclear] of the semester. 3 00:00:08,370 --> 00:00:10,217 So, I'm going to start. 4 00:00:11,001 --> 00:00:13,901 Access to the internet is greater than ever before 5 00:00:14,183 --> 00:00:17,166 and as a consequence, it's becoming more multilingual. 6 00:00:18,706 --> 00:00:22,613 However, there's evidence of segmentation of cyberspace 7 00:00:22,614 --> 00:00:25,170 due to language and national borders. 8 00:00:28,011 --> 00:00:30,812 This image serves to illustrate that. 9 00:00:31,684 --> 00:00:35,656 This is the language communities of Twitter in Europe. 10 00:00:36,562 --> 00:00:40,656 So, what you can see are tweets geolocated over a map of Europe 11 00:00:40,657 --> 00:00:44,047 and the different colors represent the different languages. 12 00:00:45,272 --> 00:00:50,905 You can even see regional languages like Catalan in the Catalan region of Spain 13 00:00:52,286 --> 00:00:56,037 And this is going to be useful for an example I'm going to use later. 14 00:01:01,958 --> 00:01:04,209 I'm interested in Twitter in particular, 15 00:01:04,209 --> 00:01:07,312 because of the speed of information dissemination 16 00:01:07,313 --> 00:01:10,744 and that most of this information is publicly accessible. 17 00:01:13,743 --> 00:01:18,780 I'm going to illustrate this with a capture 18 00:01:18,780 --> 00:01:22,280 of a dynamic visualization you can find on the Twitter blog 19 00:01:22,281 --> 00:01:24,783 by Miguel Rios. 20 00:01:25,345 --> 00:01:28,981 And what you can see here is the global flow of tweets 21 00:01:28,982 --> 00:01:31,205 after the earthquake in Japan. 22 00:01:32,147 --> 00:01:34,793 In pink, there are the tweets coming out of Japan 23 00:01:34,794 --> 00:01:37,400 and, in green, the retweets all over the world. 24 00:01:39,258 --> 00:01:44,966 This illustrates that in Twitter information is spreading across countries. 
25 00:01:46,180 --> 00:01:47,987 But how can this happen? 26 00:01:49,380 --> 00:01:55,018 Expatriates, migrants, minorities, diaspora communities, language learners 27 00:01:55,028 --> 00:01:59,484 all play an important role in building transnational networks 28 00:01:59,484 --> 00:02:02,549 and cultural bridges between nations and communities. 29 00:02:03,847 --> 00:02:06,169 They are the multilingual users on the internet. 30 00:02:07,512 --> 00:02:11,045 The overarching research question is: 31 00:02:11,047 --> 00:02:16,833 how are multilingual users of Twitter connecting different language groups? 32 00:02:21,961 --> 00:02:27,157 In 2009, the Berkman Center for Internet & Society at Harvard University 33 00:02:27,158 --> 00:02:29,773 mapped the Arabic blogosphere 34 00:02:29,774 --> 00:02:33,134 and they described a key concept for my research. 35 00:02:35,020 --> 00:02:40,855 They discovered an English bridge and a French bridge of bloggers 36 00:02:40,856 --> 00:02:45,780 that were writing in their native Arabic language and in English or French. 37 00:02:47,430 --> 00:02:51,551 And they were connecting the different national blogospheres 38 00:02:51,552 --> 00:02:53,034 with the international one. 39 00:02:55,387 --> 00:03:00,351 This might have played a role in the Arab popular uprisings in 2011 40 00:03:00,352 --> 00:03:02,773 for reaching out to the world. 41 00:03:04,582 --> 00:03:09,223 And this is connected with a concept that first appeared in 2008 42 00:03:09,224 --> 00:03:11,697 of the bridge bloggers. 43 00:03:13,601 --> 00:03:16,010 So, bridge bloggers are bloggers 44 00:03:16,011 --> 00:03:19,991 that are trying to connect their local communities 45 00:03:19,992 --> 00:03:23,255 to a wider global audience. 46 00:03:25,190 --> 00:03:29,398 The image you can see here is actually the visualization they created 47 00:03:29,399 --> 00:03:32,617 when mapping the Arabic blogosphere. 48 00:03:35,341 --> 00:03:37,370 Each dot is a blogger, or a blog.
49 00:03:38,763 --> 00:03:42,930 The size represents their popularity, so how many incoming links they have 50 00:03:42,931 --> 00:03:45,457 and they grouped them-- 51 00:03:45,457 --> 00:03:47,666 the neighborhoods they created 52 00:03:47,667 --> 00:03:51,582 in relation to the linking between them. 53 00:03:52,587 --> 00:03:56,420 So, the ones that are grouped together are linking among each other. 54 00:03:57,235 --> 00:03:59,454 The colors are a different question. 55 00:03:59,489 --> 00:04:06,067 The colors represent "attentive clusters", that's what they call them. 56 00:04:06,067 --> 00:04:11,793 And they looked at the online resources and media outlets 57 00:04:12,065 --> 00:04:13,911 these blogs were linking to. 58 00:04:15,055 --> 00:04:19,346 So, blogs of the same color are following the same media outlets 59 00:04:19,346 --> 00:04:21,100 and online resources. 60 00:04:21,165 --> 00:04:25,057 And they did human coding to label those groups. 61 00:04:25,640 --> 00:04:30,282 And here is where we see the label English bridge 62 00:04:30,283 --> 00:04:32,035 the responses from Cuba in English 63 00:04:32,036 --> 00:04:33,950 and up there, there's [unclear] France. 64 00:04:35,774 --> 00:04:40,537 And so I think it's important to retain the concept of attentive clusters. 65 00:04:43,797 --> 00:04:49,005 Now, let's go back to 2011 during the Arab popular uprisings. 66 00:04:50,444 --> 00:04:55,207 And I'll show you a visualization of the influence network 67 00:04:55,208 --> 00:04:57,069 of Twitter users in Egypt. 68 00:04:58,210 --> 00:04:59,868 So, what you're seeing here 69 00:04:59,869 --> 00:05:04,348 just imagine people down the street at Tahrir Square 70 00:05:04,349 --> 00:05:08,426 tweeting in Arabic about what's going on on the ground. 71 00:05:08,427 --> 00:05:12,188 And those are the people in red. 72 00:05:12,652 --> 00:05:17,525 So, these red dots represent users that are tweeting in Arabic.
73 00:05:18,309 --> 00:05:20,372 Then we have the international community 74 00:05:20,373 --> 00:05:25,515 or even Americans, British and so on tweeting in English. 75 00:05:26,075 --> 00:05:28,287 And they are in blue, those blue dots. 76 00:05:28,808 --> 00:05:33,052 And then, interestingly, we have people in between them, 77 00:05:33,052 --> 00:05:38,152 which are illustrated in different degrees of violet, or violet shades. 78 00:05:39,393 --> 00:05:43,417 This represents the fact that they are tweeting in both Arabic and English. 79 00:05:45,006 --> 00:05:47,204 So, what we're seeing is the bridge tweeters, 80 00:05:47,996 --> 00:05:52,754 like what Ethan Zuckerman called "bridge bloggers". 81 00:05:55,642 --> 00:05:58,520 So, another context. 82 00:05:59,333 --> 00:06:04,649 The same year, 2011, a lot of big protests were going on in Europe. 83 00:06:05,172 --> 00:06:06,780 And in particular, in Spain. 84 00:06:06,781 --> 00:06:11,868 They started on May 15th 2011; there were massive protests. 85 00:06:13,002 --> 00:06:16,640 And because of this context, this situation 86 00:06:16,640 --> 00:06:22,974 new attentive clusters were emerging in the social media landscape of Spain. 87 00:06:28,240 --> 00:06:33,711 Now, this is a visualization you can find on the SocialFlow blog, a research blog 88 00:06:33,806 --> 00:06:35,216 on social networks. 89 00:06:35,216 --> 00:06:41,235 And what it does is track the origin and the initial spread 90 00:06:41,236 --> 00:06:45,347 of the hashtag #occupywallstreet on Twitter. 91 00:06:46,987 --> 00:06:51,803 They detected that one of the first uses of the hashtag #occupywallstreet 92 00:06:51,804 --> 00:06:57,093 was on July 13th 2011, linking to a blog post of Adbusters. 93 00:06:58,344 --> 00:07:02,250 So you have the Twitter account of Adbusters there, very big 94 00:07:02,251 --> 00:07:05,023 because it's being retweeted a lot. 95 00:07:05,966 --> 00:07:06,966 And mentioned a lot.
96 00:07:07,884 --> 00:07:13,514 And they collected these mentions and the tweets that had these mentions 97 00:07:13,515 --> 00:07:17,298 and these retweets with the hashtag during July 13th. 98 00:07:18,188 --> 00:07:20,501 From July 13th to July 23rd. 99 00:07:21,163 --> 00:07:23,523 So, from the first 10 days of the use of this hashtag 100 00:07:23,524 --> 00:07:26,810 it was from the very beginning of the use of this hashtag on Twitter. 101 00:07:31,012 --> 00:07:32,320 They just mapped the accounts 102 00:07:33,462 --> 00:07:39,459 and the series of posts with the hashtag and mentions with the hashtag 103 00:07:40,652 --> 00:07:42,997 and the users that were connecting 104 00:07:42,998 --> 00:07:45,329 because of these mentions and retweets. 105 00:07:46,136 --> 00:07:49,237 Now the interesting thing in this visualization 106 00:07:49,238 --> 00:07:50,517 is that they 107 00:07:50,518 --> 00:07:54,348 the Socialflow people particularly in [inaudible] 108 00:07:54,873 --> 00:07:59,885 detected this Spanish brand of users 109 00:07:59,886 --> 00:08:03,146 were forming an attentive cluster. 110 00:08:05,948 --> 00:08:09,282 Mentioning and retweeting about it in Spanish 111 00:08:09,283 --> 00:08:13,042 using the hashtag in their messages in Spanish. 112 00:08:14,104 --> 00:08:17,164 And they point out in the blog 113 00:08:17,865 --> 00:08:19,524 that this Spanish contingent 114 00:08:19,525 --> 00:08:24,189 helped post and spread the word about Occupy Wall Street 115 00:08:24,190 --> 00:08:29,358 even before most of the United States was aware of it. 116 00:08:32,239 --> 00:08:34,043 So, I found that very interesting. 117 00:08:34,043 --> 00:08:37,438 And it was due to the context in Spain at that moment 118 00:08:38,287 --> 00:08:45,449 with big protests and new clusters forming in the social media landscape. 
119 00:08:56,673 --> 00:09:01,497 Now I have shown you the importance of these multilingual users 120 00:09:01,498 --> 00:09:06,470 in connecting language communities and spreading information 121 00:09:06,471 --> 00:09:09,228 across countries, acting as mediators. 122 00:09:11,172 --> 00:09:15,831 But let's focus on another aspect of connecting language groups 123 00:09:15,832 --> 00:09:17,170 which is language choice. 124 00:09:17,881 --> 00:09:23,261 So I'm going to devote a moment to speak about languages 125 00:09:23,262 --> 00:09:24,262 and language choice. 126 00:09:27,743 --> 00:09:31,064 To understand languages in the world 127 00:09:31,065 --> 00:09:33,478 I'm going to use a telescope. 128 00:09:37,358 --> 00:09:39,954 So de Swaan... 129 00:09:41,244 --> 00:09:44,565 ...proposed a theory called the world language system 130 00:09:44,566 --> 00:09:45,815 back in the 1990s. 131 00:09:47,268 --> 00:09:50,867 to explain the languages in the world. 132 00:09:52,127 --> 00:09:55,428 And he used a very beautiful metaphor, the constellation. 133 00:09:56,906 --> 00:10:01,510 So, in his theory there's about a dozen languages in the world 134 00:10:01,511 --> 00:10:05,469 that are the hearts of the system, or the suns. 135 00:10:06,493 --> 00:10:07,493 The suns of the system. 136 00:10:07,604 --> 00:10:11,015 For instance, English, French, Spanish, Arabic and more. 137 00:10:12,324 --> 00:10:16,765 And then there are hundreds, maybe more than 100, 200... 138 00:10:16,766 --> 00:10:22,707 national languages that are orbiting around these suns like planets. 139 00:10:24,497 --> 00:10:28,393 And finally we have regional and minority languages 140 00:10:28,394 --> 00:10:31,664 that are orbiting these planets like satellites. 141 00:10:32,826 --> 00:10:37,606 And he used this metaphor to explain the power relationships 142 00:10:37,607 --> 00:10:39,729 between languages. 
143 00:10:40,172 --> 00:10:43,096 This is a theory of what he called 144 00:10:43,097 --> 00:10:46,659 "communication potential and language competition". 145 00:10:48,408 --> 00:10:50,539 A key point he made 146 00:10:51,979 --> 00:10:55,379 is that the system holds together 147 00:10:55,440 --> 00:10:58,793 thanks to multilingual people and interpreters. 148 00:11:00,291 --> 00:11:03,173 This is what's providing cohesion to the system. 149 00:11:03,956 --> 00:11:06,959 He also made a controversial proposal 150 00:11:06,960 --> 00:11:11,340 about the communication potential of a language. 151 00:11:12,290 --> 00:11:14,868 So, he proposed a formula, a mathematical formula 152 00:11:14,869 --> 00:11:19,910 where he could estimate the communication potential of a language 153 00:11:19,911 --> 00:11:24,889 and supposedly a person would choose which language to learn and use 154 00:11:24,889 --> 00:11:27,554 based on that communication potential. 155 00:11:28,108 --> 00:11:34,206 For example, a person might decide to learn English and use English 156 00:11:35,137 --> 00:11:41,414 because not only does it provide communication with English native speakers 157 00:11:42,395 --> 00:11:46,377 but also, adding to that, it provides the possibility to communicate 158 00:11:46,378 --> 00:11:50,073 with all the second-language learners of English 159 00:11:50,074 --> 00:11:53,101 from many different languages, many different countries. 160 00:11:53,102 --> 00:11:55,818 So, supposedly, in his theory 161 00:11:55,819 --> 00:11:59,686 English provides the greatest communication potential. 162 00:12:01,464 --> 00:12:05,106 And he received some criticism, because of the central role of English 163 00:12:05,107 --> 00:12:06,785 in his theory. 164 00:12:06,806 --> 00:12:10,139 He said it was the central hub of the whole system.
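As a rough illustration of the kind of formula described here: de Swaan's measure is usually summarized as a Q-value, the product of a language's prevalence (its share of all speakers in the constellation) and its centrality (its share of the multilingual speakers). The talk does not spell the formula out, so the definition below is an assumption, and the function name and the toy numbers are invented for the example.

```python
def q_value(speakers, multilingual_speakers, total_speakers, total_multilinguals):
    """Sketch of a Q-value: prevalence times centrality (assumed definition).

    speakers:              people in the constellation who speak the language
    multilingual_speakers: multilingual people who have it in their repertoire
    total_speakers:        all people in the constellation
    total_multilinguals:   all multilingual people in the constellation
    """
    prevalence = speakers / total_speakers
    centrality = multilingual_speakers / total_multilinguals
    return prevalence * centrality


# Toy constellation (invented numbers): 100 people, 40 of them multilingual.
hub_language = q_value(50, 20, 100, 40)    # widely spoken, in many repertoires
small_language = q_value(25, 10, 100, 40)  # fewer speakers, fewer multilinguals
print(hub_language > small_language)       # True
```

Under this reading, a language gains communication potential both from its own speakers and from the multilingual people who have it in their repertoire, which is the point of the English example in the talk.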
165 00:12:12,959 --> 00:12:20,043 There's also the language ecology paradigm first proposed by Haugen in 1972 166 00:12:21,825 --> 00:12:25,243 and there's this idea of an ecosystem of languages 167 00:12:25,244 --> 00:12:29,695 and, again, it's using another metaphor 168 00:12:29,696 --> 00:12:31,608 and because of this metaphor 169 00:12:31,609 --> 00:12:34,466 also appeared the idea of endangered languages. 170 00:12:36,353 --> 00:12:39,880 I'm going to briefly just read the definition. 171 00:12:39,881 --> 00:12:42,787 He defined the language ecology as: 172 00:12:42,788 --> 00:12:47,235 "the study of interactions between any given language and its environment" 173 00:12:47,985 --> 00:12:49,207 and what I think is very important: 174 00:12:49,208 --> 00:12:53,690 "language exists only in the minds of its users" 175 00:12:56,694 --> 00:12:59,956 which leads me to point at my research. 176 00:13:01,528 --> 00:13:05,808 In my research, I'm using a microscope to see the cells 177 00:13:05,809 --> 00:13:09,337 and my cells in my study are the Twitter users. 178 00:13:12,192 --> 00:13:13,452 Why is that? 179 00:13:15,616 --> 00:13:19,906 Because as Haugen explains, there's a psychological dimension 180 00:13:19,907 --> 00:13:21,811 to language ecology 181 00:13:21,812 --> 00:13:25,070 where language interacts with other languages 182 00:13:25,071 --> 00:13:27,697 in the minds of multilingual people. 183 00:13:28,860 --> 00:13:31,574 And there's a sociological dimension to language ecology 184 00:13:32,425 --> 00:13:38,379 where we use language to communicate and interact with other people. 185 00:13:38,380 --> 00:13:43,860 And this language ecology generates because of the people 186 00:13:43,861 --> 00:13:46,493 that decide to use that language 187 00:13:46,494 --> 00:13:50,337 learning and interacting with people using it. 188 00:13:51,139 --> 00:13:54,758 And this is the point of language choice in languages. 
189 00:13:55,643 --> 00:13:59,827 So, I focus on the connections of people and the language choice. 190 00:14:04,231 --> 00:14:07,595 So, these are the four points I'm going to be speaking about. 191 00:14:07,862 --> 00:14:13,015 But actually the main focus is going to be the first point 192 00:14:13,483 --> 00:14:19,196 Social network analysis and the taxonomy of intersections between language groups 193 00:14:20,267 --> 00:14:23,376 This is where I'm going to be spending most of the time. 194 00:14:23,377 --> 00:14:26,820 And then very briefly, just for compilation purposes 195 00:14:26,821 --> 00:14:29,551 I'm going to speak about another small study that I did 196 00:14:29,552 --> 00:14:30,861 the factor analysis 197 00:14:30,862 --> 00:14:34,485 looking at the influence of the social network 198 00:14:34,486 --> 00:14:39,007 in the language choices of the users. 199 00:14:39,236 --> 00:14:41,737 So, how the social network influences language choice 200 00:14:41,738 --> 00:14:43,400 of our multilingual users. 201 00:14:45,128 --> 00:14:50,399 And then I'm going to briefly also talk about the last study of my dissertation 202 00:14:50,400 --> 00:14:52,492 that is still ongoing. 203 00:14:52,992 --> 00:14:55,215 So, I still have new research to talk about. 204 00:14:55,216 --> 00:14:57,622 And it's content analysis 205 00:14:57,623 --> 00:15:00,221 and in this case I'm focusing on intrinsic factors 206 00:15:00,222 --> 00:15:02,666 intrinsic to the messages 207 00:15:02,667 --> 00:15:05,506 about the topic, and the type of exchange. 208 00:15:05,973 --> 00:15:07,513 If it's a reply, if it's a public post 209 00:15:07,514 --> 00:15:09,923 and how that influences the language choice as well. 210 00:15:11,331 --> 00:15:12,331 And finally I will... 211 00:15:14,194 --> 00:15:17,193 I'm going to give you my reflections 212 00:15:17,996 --> 00:15:23,204 so I can invite your thoughts and suggestions and discussions about it. 
213 00:15:27,097 --> 00:15:29,098 Briefly, I'm going to start with the sampling 214 00:15:29,099 --> 00:15:32,077 so I can talk about the rest of the research. 215 00:15:34,959 --> 00:15:36,654 So my focus is on multilingual users. 216 00:15:36,655 --> 00:15:39,782 How did I identify multilingual users on Twitter? 217 00:15:42,044 --> 00:15:43,784 It was giving me a headache. 218 00:15:44,058 --> 00:15:46,691 Finally what we decided... 219 00:15:48,542 --> 00:15:50,240 this research has been-- 220 00:15:51,458 --> 00:15:54,070 I have always had the help of Jennifer Golbeck, 221 00:15:54,071 --> 00:15:55,077 she was my adviser. 222 00:15:55,078 --> 00:15:57,296 And I did this with her help. 223 00:15:57,978 --> 00:16:01,715 So what we did was gather lists of what are called stopwords 224 00:16:02,813 --> 00:16:05,419 from different languages, and you have a list over there. 225 00:16:06,176 --> 00:16:10,451 And the stopword lists, you can find them on the internet. 226 00:16:10,452 --> 00:16:13,055 They are created for computational linguistics 227 00:16:13,056 --> 00:16:15,560 so they use them for filtering purposes. 228 00:16:16,783 --> 00:16:19,103 And they are common words in a language. 229 00:16:19,791 --> 00:16:21,356 Very common words in a language. 230 00:16:21,357 --> 00:16:25,729 So, sometimes they're used precisely for eliminating them from texts. 231 00:16:25,730 --> 00:16:28,689 For example, in searches in Google 232 00:16:28,690 --> 00:16:32,952 they eliminate the stopwords, the stopwords that you type 233 00:16:32,952 --> 00:16:34,612 in the search. 234 00:16:35,314 --> 00:16:37,797 But in this case I wanted to find the stopwords 235 00:16:37,798 --> 00:16:41,003 that are very common in the language to represent the language. 236 00:16:41,004 --> 00:16:44,103 And so we had to select words that were not written the same 237 00:16:44,104 --> 00:16:45,734 as in another language.
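The selection step just described, keeping only stopwords that are not written the same in another language, could be sketched roughly as follows. This is a minimal illustration, not the procedure actually used in the study: the function name is hypothetical and the tiny word lists are hand-picked for the example, not the lists shown on the slide.

```python
def unambiguous_stopwords(stopwords_by_language):
    """For each language, keep only stopwords that appear in no other list."""
    # Count in how many languages each spelling occurs.
    languages_per_word = {}
    for words in stopwords_by_language.values():
        for word in set(words):
            languages_per_word[word] = languages_per_word.get(word, 0) + 1
    # A word can represent a language only if it is unique to that language.
    return {
        language: sorted(w for w in set(words) if languages_per_word[w] == 1)
        for language, words in stopwords_by_language.items()
    }


# Illustrative toy lists (assumed, far shorter than real stopword lists).
toy_lists = {
    "en": ["the", "and", "with", "a", "no"],
    "es": ["el", "y", "con", "a", "no"],
    "ca": ["el", "i", "amb", "a", "no"],
}
print(unambiguous_stopwords(toy_lists))
# {'en': ['and', 'the', 'with'], 'es': ['con', 'y'], 'ca': ['amb', 'i']}
```

Words such as "a", "no" and "el" drop out because they are spelled identically in more than one list, which is the ambiguity the filter is meant to avoid.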
238 00:16:46,121 --> 00:16:48,425 Sometimes, could be confusing and ambiguous. 239 00:16:51,491 --> 00:16:54,029 Then I typed in Google... 240 00:16:55,954 --> 00:16:59,029 one word in one language and one word in another language. 241 00:16:59,030 --> 00:17:01,372 Usually I was always using one English word 242 00:17:01,373 --> 00:17:04,006 and one word in a different language. 243 00:17:04,729 --> 00:17:08,310 And I looked in the Twitter domain. 244 00:17:08,901 --> 00:17:12,185 So the search results from Google will give me the profiles 245 00:17:13,105 --> 00:17:19,866 of people on Twitter that in theory wrote messages in both languages. 246 00:17:20,123 --> 00:17:24,242 We had to do a lot of hand-combing to actually see if it was in two languages 247 00:17:24,542 --> 00:17:27,563 or it was just that they were mentioning an English song 248 00:17:28,352 --> 00:17:31,654 the title of an English song but they had no English in the rest. 249 00:17:31,655 --> 00:17:35,662 So we had to ensure that they were authoring tweets 250 00:17:35,663 --> 00:17:37,575 in two languages. 251 00:17:37,911 --> 00:17:39,996 So writing them, not just retweeting them 252 00:17:39,997 --> 00:17:43,202 they were not just automatic postings from Facebook. 253 00:17:43,203 --> 00:17:48,275 So we had a long set of criteria a lot of manual combing 254 00:17:48,276 --> 00:17:52,949 and then finally we selected 92 multilingual users 255 00:17:52,950 --> 00:17:57,694 and in total they used 19 languages, 2 or 3 languages per person. 256 00:18:00,989 --> 00:18:04,956 Now, I don't know if you want to ask some questions about the sampling 257 00:18:04,957 --> 00:18:07,607 because there's a lot of details about it. 258 00:18:13,392 --> 00:18:14,602 No doubts? 259 00:18:15,119 --> 00:18:16,585 Or maybe they'll come later! 260 00:18:19,137 --> 00:18:21,929 Now, how do I do the social networks analysis? 
261 00:18:22,743 --> 00:18:27,869 Well, now I have my 92 multilingual users. Technically each one is called the ego 262 00:18:28,277 --> 00:18:30,190 of an egocentric network. 263 00:18:31,084 --> 00:18:33,005 This is the cell of my study. 264 00:18:33,836 --> 00:18:35,984 It started with the nucleus of the cell 265 00:18:35,985 --> 00:18:37,550 which is my multilingual user 266 00:18:37,551 --> 00:18:40,478 and then I go to Twitter 267 00:18:40,478 --> 00:18:43,439 and first of all I have extracted-- 268 00:18:44,878 --> 00:18:47,416 so in this case my ego is called the Painter 269 00:18:48,111 --> 00:18:53,500 and I have extracted the last 50 messages that he posted on Twitter 270 00:18:53,501 --> 00:18:56,560 to see the languages this person used-- is using. 271 00:18:57,156 --> 00:19:01,943 And I see that he is using English, Spanish and Catalan. 272 00:19:02,945 --> 00:19:05,479 Catalan is a regional language in Spain 273 00:19:05,605 --> 00:19:07,736 and I showed you on the map before 274 00:19:07,737 --> 00:19:09,247 where the region was. 275 00:19:09,275 --> 00:19:12,474 And they speak both Catalan and Spanish. 276 00:19:13,727 --> 00:19:16,825 So, this person is tweeting in a minority language, 277 00:19:16,826 --> 00:19:18,488 a national language 278 00:19:18,489 --> 00:19:20,779 and also an international one. 279 00:19:26,808 --> 00:19:31,754 So, I already found the Painter and I know what languages this person speaks 280 00:19:31,754 --> 00:19:33,542 well, uses on Twitter, 281 00:19:33,543 --> 00:19:36,178 and then I extract all the social networks. 282 00:19:36,179 --> 00:19:37,932 So, the followers on Twitter 283 00:19:37,933 --> 00:19:39,716 you know that on Twitter you have followers 284 00:19:39,717 --> 00:19:41,160 and you follow people. 285 00:19:41,163 --> 00:19:42,513 I extracted both.
286 00:19:42,546 --> 00:19:48,323 The followers of the Painter the people that are following him on Twitter 287 00:19:48,323 --> 00:19:52,118 and also how the friends are connecting to each other. 288 00:19:52,289 --> 00:19:56,671 So, all of them, all of these dots are the followers 289 00:19:56,672 --> 00:19:58,707 the people following the Painter on Twitter 290 00:19:58,708 --> 00:20:02,788 and also I see how they connect among each other, ok? 291 00:20:04,542 --> 00:20:08,837 So the Painter follows Eduard in the center 292 00:20:10,153 --> 00:20:12,245 and it seems he's very popular. 293 00:20:13,567 --> 00:20:17,042 And then I extract the last 30 posts of Eduard-- 294 00:20:17,048 --> 00:20:18,509 there's a reason for that 295 00:20:18,510 --> 00:20:21,961 but vernacular is mostly economy questions! 296 00:20:24,717 --> 00:20:25,717 I will tell you why! 297 00:20:25,718 --> 00:20:28,857 So I extracted the last 30 posts of Eduard 298 00:20:28,858 --> 00:20:31,966 and then I do automatic language identification 299 00:20:31,967 --> 00:20:36,734 with the Google API for language identification 300 00:20:38,548 --> 00:20:39,548 which costs money. 301 00:20:40,527 --> 00:20:43,282 So you have to really think about how many posts you want to send 302 00:20:43,283 --> 00:20:45,580 to Google and how much money you have available 303 00:20:45,581 --> 00:20:48,178 and what is the accuracy you're going to have 304 00:20:48,179 --> 00:20:51,125 according to how many posts you send. 305 00:20:51,348 --> 00:20:53,268 There's a lot of testing going on there. 306 00:20:54,271 --> 00:20:58,482 I do the same with everybody in the social network. 307 00:20:58,700 --> 00:20:59,893 I extract the last 30 posts 308 00:20:59,894 --> 00:21:02,340 use the Google identification 309 00:21:02,341 --> 00:21:08,086 build that algorithm that decides based on the languages of these 30 posts 310 00:21:08,087 --> 00:21:11,929 is this person monolingual? Is this person multilingual? 
311 00:21:11,929 --> 00:21:13,221 Which languages? 312 00:21:13,222 --> 00:21:15,379 And then I labeled them, ok. 313 00:21:16,572 --> 00:21:18,746 This is just a visualization behind the-- 314 00:21:20,280 --> 00:21:27,315 Perhaps person 1 is monolingual, or bilingual in two languages. 315 00:21:31,985 --> 00:21:35,782 Now that I have all the friends of the Painter 316 00:21:35,918 --> 00:21:37,392 how they connect, 317 00:21:37,392 --> 00:21:40,854 I color code them depending on the languages they are using. 318 00:21:42,020 --> 00:21:44,669 And here, what you can see is very interesting. 319 00:21:46,076 --> 00:21:48,735 I don't know if you can distinguish the colors well 320 00:21:48,736 --> 00:21:53,949 because up here, this area, that is like a triangle 321 00:21:53,950 --> 00:21:57,896 there's a group of users writing in English. 322 00:21:58,743 --> 00:22:00,753 And it's pink. Sort of pinkish. 323 00:22:00,753 --> 00:22:04,547 And then, down here there's this Spanish group 324 00:22:04,548 --> 00:22:06,792 in light green. 325 00:22:07,544 --> 00:22:12,407 And, in the middle, the one that perhaps isn't as easy to distinguish 326 00:22:12,408 --> 00:22:15,464 from the English, is the Catalan group. 327 00:22:15,935 --> 00:22:18,962 So the users writing in Catalan in dark blue. 328 00:22:19,776 --> 00:22:21,870 And then there's a set of violets in between 329 00:22:21,871 --> 00:22:26,319 and these violets represent the bilingual users 330 00:22:26,319 --> 00:22:29,292 either English and Catalan or English and Spanish. 331 00:22:29,963 --> 00:22:33,031 And then there's darker green around here, 332 00:22:33,031 --> 00:22:36,498 they are using both Catalan and Spanish. 333 00:22:36,498 --> 00:22:38,252 So there's a lot of bilinguals going on.
334 00:22:38,252 --> 00:22:39,736 And there's an interesting dynamics 335 00:22:39,737 --> 00:22:42,710 in that you have this English group up there 336 00:22:42,711 --> 00:22:44,060 and the Spanish group up here 337 00:22:44,061 --> 00:22:46,200 and the Catalan group in the middle. 338 00:22:46,201 --> 00:22:49,147 And this Catalan group is very mixed up with the Spanish group 339 00:22:49,744 --> 00:22:52,184 which makes sense, because it's a bilingual community. 340 00:23:01,121 --> 00:23:06,529 So, this is how I built the egocentric network of my 92 multilingual users. 341 00:23:08,601 --> 00:23:10,987 The Painter is just one of them. I have 92. 342 00:23:10,988 --> 00:23:16,575 I have 92 cells or egocentric networks that I studied with my microscope. 343 00:23:17,868 --> 00:23:21,817 Do you want to ask some questions about this process 344 00:23:21,818 --> 00:23:23,419 or this visualization? 345 00:23:25,051 --> 00:23:29,982 (person 1) Of the bilingual units, are they users or tweets? 346 00:23:30,894 --> 00:23:32,056 They are users, yeah. 347 00:23:32,400 --> 00:23:35,560 So, the dots represent people. 348 00:23:35,561 --> 00:23:40,014 So, like Eduard here. They represent people. 349 00:23:42,250 --> 00:23:45,317 Now each dot to determine the language and the color 350 00:23:45,318 --> 00:23:47,931 I extracted 30 posts 351 00:23:48,434 --> 00:23:52,797 So, it's an interesting question because the 30 posts 352 00:23:52,798 --> 00:23:55,958 have different language levels assigned to them 353 00:23:56,096 --> 00:23:57,130 especially if they were bilingual 354 00:23:57,131 --> 00:24:01,643 and I had to decide which language level I was going to assign to the user. 
355 00:24:01,644 --> 00:24:05,383 So, I had to build an algorithm with a set of rules 356 00:24:10,279 --> 00:24:11,346 basically saying-- 357 00:24:11,347 --> 00:24:16,651 the Google identification system would give me a language 358 00:24:16,652 --> 00:24:17,882 and a confidence level. 359 00:24:17,882 --> 00:24:19,496 So if the confidence level was very low 360 00:24:19,497 --> 00:24:23,838 I would say "discard that" because I had a series of heuristics 361 00:24:23,858 --> 00:24:30,113 based on both the number of tweets using a particular language 362 00:24:30,113 --> 00:24:32,685 and also on the confidence level. 363 00:24:33,655 --> 00:24:38,267 And there are a lot of technical challenges there as well. 364 00:24:39,973 --> 00:24:41,948 (woman) So, it's possible that some of these posts 365 00:24:41,949 --> 00:24:45,824 many of these posts would be multilingual, I'm sorry, monolingual, in one language or the other? 366 00:24:46,498 --> 00:24:51,988 So it's also possible that some of these individual posts 367 00:24:51,989 --> 00:24:54,184 would mix languages? 368 00:24:54,623 --> 00:24:56,733 Yes, it is possible. It's very possible! 369 00:24:57,063 --> 00:25:00,360 It's very challenging for the automatic system! 370 00:25:01,915 --> 00:25:03,743 (woman) Right, ok. I just wanted to be clear-- 371 00:25:03,744 --> 00:25:05,185 Yes, exactly. 372 00:25:05,186 --> 00:25:11,303 So it's not as frequent as I expected, having bilingual posts, 373 00:25:11,304 --> 00:25:12,740 as I would call them. 374 00:25:12,741 --> 00:25:14,431 But it's happening.
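The set of rules described in this answer (discard low-confidence detections, then decide from how many of the remaining posts use each language) could look roughly like the sketch below. The thresholds, the function name and the input shape are assumptions made for illustration; the talk does not give the actual rules or values, and no call to a real language-identification service is shown.

```python
from collections import Counter


def classify_user(detections, min_confidence=0.5, min_share=0.2):
    """Assign language labels to a user from per-post detections.

    detections: list of (language_code, confidence) pairs, one per post,
                as some language-identification system might return them.
    min_confidence and min_share are hypothetical thresholds.
    Returns a sorted list of languages: one entry means monolingual,
    several mean multilingual, and an empty list means undecided.
    """
    # Rule 1: discard detections the identifier itself is unsure about.
    kept = [language for language, confidence in detections
            if confidence >= min_confidence]
    if not kept:
        return []
    # Rule 2: keep a language only if enough of the remaining posts use it,
    # so a single stray post (a song title, a place name) does not count.
    counts = Counter(kept)
    return sorted(language for language, n in counts.items()
                  if n / len(kept) >= min_share)


posts = [("ca", 0.9)] * 6 + [("es", 0.8)] * 3 + [("en", 0.3)]
print(classify_user(posts))  # ['ca', 'es']
```

A post that mixes two languages still receives a single label in this sketch, which is one of the challenges raised in the discussion that follows.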
375 00:25:15,058 --> 00:25:20,539 And so, for a series of tests, I had to do manual combing 376 00:25:20,540 --> 00:25:23,263 and I saw that sometimes it was the case 377 00:25:23,264 --> 00:25:26,718 that they were doing some sort of translation in the same tweet 378 00:25:26,719 --> 00:25:31,585 and sometimes it was just the case that they were mentioning titles of things 379 00:25:31,586 --> 00:25:34,206 or places in a different language. 380 00:25:34,563 --> 00:25:39,470 So, there's a lot of issues surrounding the automatic handling of this 381 00:25:39,471 --> 00:25:44,478 but you are dealing with 92 networks 382 00:25:44,479 --> 00:25:50,864 and they have between 30 and 5,000 nodes in them. 383 00:25:52,708 --> 00:25:55,841 So, I don't remember the numbers exactly, 384 00:25:55,867 --> 00:25:59,148 but I'm talking about around 80,000 people. 385 00:26:01,132 --> 00:26:04,527 So detecting the language of 80,000 people and this is small-scale. 386 00:26:04,913 --> 00:26:08,286 If you go to millions, you need an automatic system. 387 00:26:08,287 --> 00:26:11,291 And one of the things I'm having to write up in my dissertation 388 00:26:11,292 --> 00:26:13,832 is what are the challenges. 389 00:26:13,833 --> 00:26:17,984 You have to be prepared for them, to solve those problems. 390 00:26:18,551 --> 00:26:21,851 And one of them is what do you do with bilingual posts 391 00:26:21,852 --> 00:26:23,920 which language do you assign to that post? 392 00:26:23,921 --> 00:26:28,287 Automatic posts, spam... there's a lot of problems. 393 00:26:29,862 --> 00:26:31,219 Challenges, I mean. 394 00:26:31,220 --> 00:26:34,766 That's what makes it interesting because you cannot do manual combing 395 00:26:34,766 --> 00:26:36,046 on these scales. 396 00:26:39,073 --> 00:26:41,013 Do you have another question? 397 00:26:44,501 --> 00:26:48,025 So, now, what am I doing with this? 
398 00:26:50,562 --> 00:26:56,178 I'm going to classify my social networks, looking at the patterns 399 00:26:56,179 --> 00:26:59,094 of overlaps between the language groups. 400 00:26:59,720 --> 00:27:01,953 Overlaps, or intersections. 401 00:27:02,547 --> 00:27:07,878 I'm looking specifically at the networks that have only two language groups. 402 00:27:08,219 --> 00:27:11,860 I had five of these networks that were trilingual 403 00:27:12,284 --> 00:27:16,020 so I put them aside to keep it simple first with just two language groups 404 00:27:16,021 --> 00:27:18,361 to see how they interconnect. 405 00:27:19,369 --> 00:27:21,272 And then I classified them 406 00:27:21,936 --> 00:27:24,198 first following a qualitative analysis 407 00:27:24,198 --> 00:27:28,822 and then I used network statistics that I developed with my adviser 408 00:27:28,823 --> 00:27:30,386 for this purpose. 409 00:27:31,338 --> 00:27:33,693 And I will talk later a little more about it. 410 00:27:34,341 --> 00:27:37,980 So, I tried to provide more robust measures for that. 411 00:27:39,428 --> 00:27:44,074 I classified them and I came up with some types. 412 00:27:45,922 --> 00:27:49,631 This is what I call the gatekeeper language bridge type. 413 00:27:50,526 --> 00:27:52,995 And there are some variants of it, obviously. 414 00:27:53,624 --> 00:27:55,990 What you can see here is the network of a person 415 00:27:55,991 --> 00:28:00,092 and I'm going to assume this person is in the United States 416 00:28:00,093 --> 00:28:02,350 and speaks both Spanish and English. 417 00:28:04,043 --> 00:28:05,684 Let's call her Maria. 418 00:28:05,927 --> 00:28:11,581 So she's Maria and she has two groups of friends using Spanish on Twitter 419 00:28:12,531 --> 00:28:15,768 and then that big group of friends using English. 420 00:28:17,320 --> 00:28:19,528 And, as you can see, there are just a few nodes 421 00:28:19,529 --> 00:28:22,003 connecting the two language groups.
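One simple way to put a number on "just a few nodes connecting the two language groups" is the share of links whose two endpoints belong to different language groups. This is only an illustrative measure, not the network statistics developed for the study; the function name, the user names and the toy network are invented, and bilingual users are not treated specially here.

```python
def cross_language_link_ratio(edges, language_of):
    """Share of links that join users from different language groups.

    edges:       iterable of (user_a, user_b) pairs
    language_of: dict mapping each user to a language group label
    Links with an unlabeled endpoint are ignored.
    Dividing by the number of links makes networks of different sizes comparable.
    """
    total = 0
    crossing = 0
    for a, b in edges:
        if a not in language_of or b not in language_of:
            continue
        total += 1
        if language_of[a] != language_of[b]:
            crossing += 1
    return crossing / total if total else 0.0


# Toy network: two Spanish-using friends, three English-using friends.
groups = {"ana": "es", "luis": "es", "bob": "en", "sue": "en", "tom": "en"}
links = [("ana", "luis"), ("bob", "sue"), ("sue", "tom"),
         ("bob", "tom"), ("luis", "bob")]
print(cross_language_link_ratio(links, groups))  # 0.2
```

A low ratio would correspond to the gatekeeper picture, where very few links hold the two groups together, and a higher ratio to networks where the groups are more tightly interwoven.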
422 00:28:22,004 --> 00:28:27,869 You can see that the social structure can be different from the language groups 423 00:28:29,391 --> 00:28:32,174 so you can have maybe a group of friends and a group of coworkers 424 00:28:32,175 --> 00:28:36,424 inside the same language group, so it can be more complex 425 00:28:36,425 --> 00:28:41,205 than just dividing the social network by language groups. 426 00:28:41,206 --> 00:28:45,522 There can be more grouping because of other social resources. 427 00:28:46,811 --> 00:28:50,572 But the interesting thing is that there are only a few nodes 428 00:28:50,573 --> 00:28:53,455 where people are connecting holding together these Twitters. 429 00:28:55,058 --> 00:29:00,675 I think this was friends with English here. 430 00:29:00,676 --> 00:29:05,461 You can see, in this case, it seems like the two groups 431 00:29:05,462 --> 00:29:08,089 are holding closely together 432 00:29:08,809 --> 00:29:13,833 because there are many more links holding the two groups together. 433 00:29:14,663 --> 00:29:18,246 Of course, this is going to depend on the size of the networks 434 00:29:18,247 --> 00:29:23,067 so I had to account for the size when coming up with measures 435 00:29:23,068 --> 00:29:25,943 with network connections 436 00:29:25,944 --> 00:29:28,257 I had to provide ratios. 437 00:29:28,258 --> 00:29:32,340 Now, the ratio of cross-language linking here and here 438 00:29:32,341 --> 00:29:34,312 and you have these types-- 439 00:29:36,477 --> 00:29:40,266 These types are not just clear-cut. 440 00:29:40,346 --> 00:29:41,696 There's an evolution. 441 00:29:41,700 --> 00:29:43,337 There's people that have very few connections 442 00:29:43,338 --> 00:29:44,653 with the language groups 443 00:29:44,654 --> 00:29:46,943 and then progressively there's people with more and more. 444 00:29:47,704 --> 00:29:49,037 And this increases. 445 00:29:49,037 --> 00:29:52,048 Which points to the fact, that my cells are there. 
446 00:29:52,735 --> 00:29:57,001 Which means I don't see the evolution over time, ok? 447 00:29:57,819 --> 00:29:59,724 This is a limitation of my research. 448 00:29:59,725 --> 00:30:04,594 I just see the social network of this person looked 449 00:30:04,594 --> 00:30:07,491 at a particular point in time. 450 00:30:07,925 --> 00:30:10,057 I don't know how it evolves over time. 451 00:30:10,058 --> 00:30:13,130 So, for myself, it's just there. 452 00:30:13,508 --> 00:30:18,702 It would be interesting to see these different patterns 453 00:30:18,702 --> 00:30:20,771 that I have been observing. 454 00:30:20,771 --> 00:30:26,632 Maybe over time these connections between languages may be increasing. 455 00:30:28,862 --> 00:30:32,131 Now we have the integration and union type 456 00:30:32,693 --> 00:30:37,128 where in this case you have a person from an Arab country 457 00:30:37,129 --> 00:30:40,778 and green represents the friends that are using Arabic 458 00:30:40,779 --> 00:30:45,155 and the friends using English are in pink, but there's also violet 459 00:30:45,156 --> 00:30:46,837 there are bilinguals. 460 00:30:47,196 --> 00:30:51,534 That means there's a group of English users 461 00:30:51,535 --> 00:30:57,187 and bilingual English - Arabic users inserted in the group of Arabic, inside. 462 00:30:59,530 --> 00:31:01,289 That's the integration, so they're integrated. 463 00:31:02,419 --> 00:31:07,726 And then I have a Greek guy, who uses Greek and English 464 00:31:07,726 --> 00:31:09,446 and his Arabic friends. 465 00:31:09,446 --> 00:31:11,935 And in this case, you can see it's sort of light blue 466 00:31:11,936 --> 00:31:16,788 representing Greek, so the friends that tweet in Greek 467 00:31:16,789 --> 00:31:20,729 Pink again represents people tweeting in English 468 00:31:21,353 --> 00:31:23,426 and there's a lot of bilinguals. 469 00:31:23,449 --> 00:31:26,994 So these kind of dark blues represent the bilinguals. 
470 00:31:26,995 --> 00:31:28,604 And these are two groups 471 00:31:28,605 --> 00:31:32,741 that if you've seen before, the gatekeeper and the language bridge 472 00:31:32,742 --> 00:31:35,281 progressively getting closer and closer 473 00:31:35,282 --> 00:31:40,990 with more and more links across languages. 474 00:31:41,184 --> 00:31:42,815 In this case, this is like the extreme. 475 00:31:42,816 --> 00:31:46,016 The links between the two languages are so dense 476 00:31:46,017 --> 00:31:51,021 that you almost cannot distinguish where the border is 477 00:31:51,021 --> 00:31:53,128 between the two language groups. 478 00:31:53,164 --> 00:31:58,534 And, interestingly, the border might be even only noticeable 479 00:31:58,534 --> 00:32:01,406 because there's a lot of bilinguals around it. 480 00:32:02,091 --> 00:32:04,924 And this is the union type where they unite. 481 00:32:07,201 --> 00:32:09,806 And finally, the peripheral language type. 482 00:32:09,807 --> 00:32:13,690 This is a Brazilian guy, the network of a Brazilian guy 483 00:32:15,324 --> 00:32:16,892 where you have-- 484 00:32:16,893 --> 00:32:18,885 probably he lives in the United States or something like that-- 485 00:32:18,886 --> 00:32:23,192 because this guy has mostly all this big group of friends 486 00:32:23,226 --> 00:32:24,850 tweeting in English. 487 00:32:26,532 --> 00:32:31,978 And then there's the side tentacle running outside, using Portuguese. 488 00:32:34,702 --> 00:32:36,399 And this is like a peripheral language. 489 00:32:36,400 --> 00:32:39,137 So, in the periphery there's a small group of Portuguese language. 490 00:32:39,893 --> 00:32:45,233 Now, I forgot to mention that there's dots that are light yellow or white. 491 00:32:45,286 --> 00:32:48,100 Those are the ones that have no data. 
492 00:32:49,074 --> 00:32:51,270 So, I don't know the language they're using 493 00:32:51,271 --> 00:32:53,382 because either their accounts are closed 494 00:32:53,383 --> 00:32:57,803 or for some reason, in between the collection of data they closed the account. 495 00:32:59,307 --> 00:33:03,059 Mostly, the reason is that they're private accounts 496 00:33:03,570 --> 00:33:05,640 where you cannot get the data from. 497 00:33:06,442 --> 00:33:08,755 I think somewhere I read it was about 5 percent. 498 00:33:08,756 --> 00:33:10,216 I'm not sure. 499 00:33:10,216 --> 00:33:14,010 But for one reason or another, I don't have that information. 500 00:33:16,563 --> 00:33:20,976 Now, why am I classifying them? These networks? 501 00:33:22,785 --> 00:33:26,088 Well, the reason is that-- 502 00:33:26,089 --> 00:33:28,793 well, there are some studies that demonstrate that the social structure 503 00:33:28,794 --> 00:33:33,539 the structure of the social networks influences the spread of information. 504 00:33:34,096 --> 00:33:36,457 How information disseminates in the network. 505 00:33:38,553 --> 00:33:42,909 So, I'm just assuming that these different structures 506 00:33:42,910 --> 00:33:46,382 are going to influence the spread of information. 507 00:33:47,292 --> 00:33:49,750 But this is a study that has to be done. 508 00:33:49,929 --> 00:33:52,944 I cannot demonstrate that one of these types 509 00:33:52,945 --> 00:33:55,681 facilitates the spread of information. 510 00:33:55,682 --> 00:34:02,330 I can only say that I am assuming, so that potential study 511 00:34:04,200 --> 00:34:09,400 could just look at, for example, if gatekeeper and language bridges 512 00:34:10,551 --> 00:34:16,231 are not as good for spreading information as union and integration types. 
513 00:34:20,178 --> 00:34:25,022 Right, we can just assume because of the cross-language links 514 00:34:28,295 --> 00:34:33,380 so, how many links there are or the ratio of cross-language links 515 00:34:33,380 --> 00:34:38,331 may potentially facilitate information diffusion in these cases. 516 00:34:39,944 --> 00:34:42,557 So, that study needs to be done. 517 00:34:42,607 --> 00:34:44,732 I cannot say what's going to happen! 518 00:34:44,732 --> 00:34:47,123 I just assume it's going to be like that. 519 00:34:49,178 --> 00:34:52,009 So that is the reason why I classify them. 520 00:34:52,498 --> 00:34:54,599 I have some network statistics. 521 00:34:55,969 --> 00:35:00,753 We've made about an 80 percent accuracy guess, which is quite good, 522 00:35:00,753 --> 00:35:02,453 but the sample is small. 523 00:35:08,014 --> 00:35:10,961 So now, do you have any more questions before I move past to the next study? 524 00:35:13,726 --> 00:35:15,444 (man) I was curious as to how many-- 525 00:35:15,444 --> 00:35:19,144 what was the selection process like to find the 92 users? 526 00:35:20,324 --> 00:35:22,891 Well, this is what I've been spending the beginning 527 00:35:22,892 --> 00:35:26,690 about just using two stopwords from two different languages 528 00:35:26,691 --> 00:35:31,482 typing that in the search box in Google and searching Twitter 529 00:35:31,482 --> 00:35:32,875 and then once-- 530 00:35:32,876 --> 00:35:36,192 Basically you just go through the list of results 531 00:35:36,193 --> 00:35:41,540 and start opening the profile, counting the tweets. 532 00:35:42,327 --> 00:35:44,536 How many in this language, how many in the other. 533 00:35:44,601 --> 00:35:46,640 And we put a threshold of 10 percent 534 00:35:46,640 --> 00:35:53,026 they had to have written 10 percent of the tweets in a second language 535 00:35:53,228 --> 00:35:56,742 and you couldn't count retweets or automatic posting. 
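The 10 percent selection rule just described can be sketched in a few lines of Python. This is only an illustration under assumptions, not the procedure actually used in the study (which was manual counting): the function name `is_multilingual`, the input format (one language code per original tweet, with retweets and automatic posts already removed), and the choice to compare the second most used language against the threshold are all hypothetical.

```python
from collections import Counter

def is_multilingual(tweet_langs, threshold=0.10):
    """Return True if the second most used language reaches the threshold.

    tweet_langs: list of language codes, one per original tweet
    (assumed to exclude retweets and automatic posts).
    """
    if not tweet_langs:
        return False
    counts = Counter(tweet_langs)
    if len(counts) < 2:
        # Only one language observed: not multilingual by this rule.
        return False
    # most_common(2) returns the two most frequent (language, count) pairs.
    second_count = counts.most_common(2)[1][1]
    return second_count / len(tweet_langs) >= threshold

# Hypothetical profile: 9 English tweets and 1 Spanish tweet.
print(is_multilingual(["en"] * 9 + ["es"]))
```

A profile with 19 English tweets and 1 Spanish tweet would fall below the threshold (5 percent) and be rejected by this sketch.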
536 00:35:57,937 --> 00:36:00,296 We also had to manually discard these spammers. 537 00:36:01,535 --> 00:36:03,733 So, that was the process. 538 00:36:06,151 --> 00:36:09,536 (woman) And that's a paid search through Google? 539 00:36:10,131 --> 00:36:12,601 No, that we did manually 540 00:36:12,717 --> 00:36:14,087 and then once-- 541 00:36:14,088 --> 00:36:20,392 So the other thing you can say is you can use these core multilingual users 542 00:36:20,938 --> 00:36:23,929 and then do what I did for behavior in these social networks 543 00:36:23,929 --> 00:36:29,363 which is once you extract the friends and extract the messages of the friends 544 00:36:30,669 --> 00:36:33,559 and automatically find the language 545 00:36:34,035 --> 00:36:36,522 then you can say "Oh, this person is multilingual" automatically. 546 00:36:36,522 --> 00:36:41,099 You just process it and you can detect a lot more multilingual people 547 00:36:41,183 --> 00:36:42,756 through that process. 548 00:36:42,757 --> 00:36:46,101 The paid process was sending these posts 549 00:36:46,101 --> 00:36:49,075 to the Google language identification tool. 550 00:36:49,885 --> 00:36:55,010 So, what I did was clean each message automatically. 551 00:36:55,544 --> 00:37:00,387 Basically, eliminating the hashtags 552 00:37:01,437 --> 00:37:05,230 and the mentions that had an @ in front, 553 00:37:05,230 --> 00:37:10,074 symbols, URLs, all those things I would automatically eliminate them 554 00:37:10,392 --> 00:37:13,777 and then with the rest of the message, I'd send that to the Google API 555 00:37:14,125 --> 00:37:15,849 for language identification 556 00:37:16,009 --> 00:37:21,726 and the Google API would give me a language label and a confidence value. 557 00:37:21,726 --> 00:37:23,476 And that for each message. 
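The cleaning step described here (dropping hashtags, @-mentions, symbols and URLs before language identification) might look roughly like the sketch below. It is a minimal illustration, not the code used in the study: the regular expressions and the name `clean_tweet` are assumptions, and the call to the language identification service is deliberately left out because its interface is not described in the talk.

```python
import re

# Hypothetical patterns; the exact rules used in the study are not given.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
# Anything that is not a word character, whitespace or apostrophe.
# Crude: this may also strip combining marks in some scripts.
SYMBOL_RE = re.compile(r"[^\w\s']")

def clean_tweet(text):
    """Remove URLs, mentions, hashtags and symbols, then collapse spaces."""
    text = URL_RE.sub(" ", text)      # URLs first, so their symbols go too
    text = MENTION_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    text = SYMBOL_RE.sub(" ", text)
    return " ".join(text.split())

print(clean_tweet("@maria Hola! Mira esto http://t.co/abc #noticias"))
# prints: Hola Mira esto
```

Only the remaining text would then be sent for language identification, which returns a label and a confidence value per message.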
558 00:37:23,485 --> 00:37:26,371 And then I built the algorithm with the help of Jen Golbeck 559 00:37:26,372 --> 00:37:30,688 to decide, well I have 30 messages, 500 English 560 00:37:30,714 --> 00:37:35,420 10 million Spanish and then one in Swahili which is unlikely 561 00:37:36,728 --> 00:37:39,954 and you had to decide the confidence value-- 562 00:37:39,955 --> 00:37:42,935 So I used rules, defined rules 563 00:37:42,936 --> 00:37:45,559 but it could be done statistically I think. 564 00:37:46,097 --> 00:37:48,388 And write some statistical method to decide 565 00:37:48,389 --> 00:37:51,869 "well this person actually is bilingual" or whatever. 566 00:37:52,779 --> 00:37:54,429 That's the process. 567 00:37:54,477 --> 00:37:55,597 It's long! 568 00:37:55,788 --> 00:37:56,788 Yes. 569 00:37:58,026 --> 00:38:00,487 (woman) Hi, I understand that you did it manually 570 00:38:00,488 --> 00:38:05,265 but currently in existing research field is there any software 571 00:38:05,265 --> 00:38:08,489 that we can use to capture, 572 00:38:08,489 --> 00:38:11,935 to have access to all these different tweets? 573 00:38:11,983 --> 00:38:15,400 And to capture the different categories? [inaudible] 574 00:38:15,400 --> 00:38:18,472 Ok, so you mean the extraction? 575 00:38:18,912 --> 00:38:19,983 (woman) Yeah. 576 00:38:19,983 --> 00:38:21,226 No, I didn't do it manually. 577 00:38:21,227 --> 00:38:22,705 (woman) And the other, I think the other part 578 00:38:22,706 --> 00:38:25,570 of your data presentation is visualizations coming out 579 00:38:25,571 --> 00:38:27,132 like this graph. 580 00:38:27,132 --> 00:38:32,610 Can you show us what kind of research do we have for social scientists 581 00:38:33,250 --> 00:38:35,478 to present the data in a visual form? 582 00:38:35,479 --> 00:38:37,461 This is a tool I would recommend. 583 00:38:37,461 --> 00:38:39,123 [inaudible] 584 00:38:39,123 --> 00:38:41,427 So, the first question. 
585 00:38:42,572 --> 00:38:45,748 All the extraction from Twitter, it was automatic. 586 00:38:46,265 --> 00:38:48,638 I didn't copy the tweets, it was automatic. 587 00:38:48,855 --> 00:38:50,707 I used the Twitter API. 588 00:38:51,286 --> 00:38:54,849 They have a process for registered developers 589 00:38:54,850 --> 00:38:57,205 and I extracted it automatically. 590 00:39:01,925 --> 00:39:05,777 Now, the tools, and I forgot to put that in this slide 591 00:39:05,847 --> 00:39:09,444 but in the beginning, when I showed you the first visualization 592 00:39:09,445 --> 00:39:11,605 I put the name of the tool in-- 593 00:39:12,703 --> 00:39:17,644 I don't know if I translate well, but I think it's G-E-- 594 00:39:17,644 --> 00:39:23,785 You can see here, G-E-P-H-I, I don't know how to pronounce it! 595 00:39:23,785 --> 00:39:26,997 ["Jefy" I think...] 596 00:39:28,201 --> 00:39:32,216 So, this is the one I've used for the visualizations 597 00:39:33,709 --> 00:39:36,871 and it's good because you can use it on any platform. 598 00:39:36,872 --> 00:39:41,911 So both on a Mac or a PC or Linux. 599 00:39:44,829 --> 00:39:46,696 Now, it has limitations for... 600 00:39:47,209 --> 00:39:50,778 mostly for network statistics in my opinion. 601 00:39:54,237 --> 00:39:57,061 The other one, that is very popular is Node XL. 602 00:39:57,062 --> 00:40:00,548 And in fact it was developed here in the ATI lab. 603 00:40:01,773 --> 00:40:04,092 In the lab where I work. 604 00:40:05,190 --> 00:40:06,937 So, they collaborated with Microsoft. 605 00:40:06,938 --> 00:40:09,867 It's a template for Excel 606 00:40:11,076 --> 00:40:12,552 and it allows-- 607 00:40:12,553 --> 00:40:17,849 In fact they are still adding new features and there's two people working on it 608 00:40:18,235 --> 00:40:19,665 in the lab. 
609 00:40:19,739 --> 00:40:23,984 But the reason I haven't used it here, is because I have a Mac 610 00:40:24,264 --> 00:40:29,166 and also there's another reason I like this positioning algorithm 611 00:40:31,302 --> 00:40:32,807 and this is... 612 00:40:32,808 --> 00:40:37,014 this is another issue I haven't talked about 613 00:40:37,124 --> 00:40:40,476 is how you actually place the dots. 614 00:40:40,476 --> 00:40:47,182 And actually these algorithms for layout use force-directed schemes 615 00:40:48,820 --> 00:40:50,507 like in physics science. 616 00:40:50,584 --> 00:40:53,598 So if a node has a lot of links with another node 617 00:40:53,599 --> 00:40:56,980 they put it closer, so it's like there's forces 618 00:40:56,981 --> 00:41:00,276 or strings attaching the nodes. 619 00:41:00,858 --> 00:41:04,293 And depending on how many strings there are, they're closer or farther. 620 00:41:04,605 --> 00:41:07,933 There's physics science rules for placing them. 621 00:41:07,959 --> 00:41:09,508 But there's different algorithms 622 00:41:09,509 --> 00:41:14,981 but the other reason I chose Gephi is that it has an algorithm 623 00:41:15,336 --> 00:41:20,899 specifically in this tool that places my language groups separately 624 00:41:20,943 --> 00:41:24,338 more than any other algorithm that I could use in Node XL. 625 00:41:24,339 --> 00:41:29,142 And it was more useful to see the groups separated. 626 00:41:30,407 --> 00:41:33,186 But you can use both depending on what you want to do. 627 00:41:33,187 --> 00:41:35,905 They both have weaknesses and strengths, 628 00:41:35,931 --> 00:41:38,847 different depending on what you have to do. 629 00:41:40,592 --> 00:41:46,628 Node XL has more features for processing many networks 630 00:41:48,068 --> 00:41:51,147 and extracting network statistics for many networks at the same time. 631 00:41:52,217 --> 00:41:57,372 And it has a lot of interesting features, maybe this is more manual. 
632 00:41:58,528 --> 00:41:59,998 I don't know. 633 00:42:00,215 --> 00:42:04,670 Somebody called it "the Photoshop of visualization". 634 00:42:09,125 --> 00:42:13,580 So I'm going to briefly comment on the factor analysis. 635 00:42:13,892 --> 00:42:18,627 The point here, what I want to see is multilingual users of Twitter 636 00:42:20,784 --> 00:42:23,663 are aware of their audience in a way. 637 00:42:24,848 --> 00:42:29,480 And they somehow perceive how many followers 638 00:42:29,480 --> 00:42:32,205 of this language or the other they have. 639 00:42:32,761 --> 00:42:35,501 Maybe not very consciously, 640 00:42:37,641 --> 00:42:39,763 but they perceive something. 641 00:42:39,932 --> 00:42:42,468 So, I went to see how this social network 642 00:42:42,469 --> 00:42:46,691 the fact that there's many languages or just one in the social network 643 00:42:47,628 --> 00:42:52,814 can affect the choice of language in this person, the ego person. 644 00:42:54,638 --> 00:42:57,734 So, I actually did a lot of testing, different variables, 645 00:42:57,735 --> 00:43:01,434 but I'm just going to focus on the essence, 646 00:43:01,434 --> 00:43:05,729 which is I have my dependent variable which is the proportion of English 647 00:43:05,730 --> 00:43:11,064 used by the ego has 50 posts, maybe 60 percent of them are in English 648 00:43:11,883 --> 00:43:14,409 and 40 percent in Spanish, I don't know. 649 00:43:14,693 --> 00:43:18,630 And then they have the factor of how many users in the network 650 00:43:18,631 --> 00:43:21,381 are in English and how many are using other languages. 651 00:43:21,597 --> 00:43:24,274 And then the multilingual index of the network 652 00:43:24,275 --> 00:43:26,153 - and this is my favorite part - 653 00:43:26,153 --> 00:43:29,674 because it's basically saying 654 00:43:29,774 --> 00:43:35,900 "is multilingualism encouraging English as a lingua franca?" 
655 00:43:37,026 --> 00:43:41,693 especially on Twitter, where we have these public posts that anybody can read. 656 00:43:43,339 --> 00:43:47,418 So anyway... I'm not going to go into the technical details 657 00:43:47,940 --> 00:43:50,516 of bi-nodal statistical interpretation. 658 00:43:50,517 --> 00:43:55,415 What I wanted to do is that in these combined effects 659 00:43:56,046 --> 00:44:00,500 of the factors, which one was more important? 660 00:44:00,998 --> 00:44:03,208 Was heavier than the others? 661 00:44:03,289 --> 00:44:07,340 Had more weight in defining these proportional [inaudible] used by the ego. 662 00:44:08,750 --> 00:44:11,242 I tried other factors, 663 00:44:11,243 --> 00:44:14,237 I also looked at the use of non-English language 664 00:44:15,370 --> 00:44:18,137 In the end... there are certain, 665 00:44:19,620 --> 00:44:21,423 I mean, they're obvious somehow. 666 00:44:21,424 --> 00:44:23,602 I think it's more interesting the process of what I've learned 667 00:44:23,603 --> 00:44:25,908 than the results themselves. 668 00:44:27,166 --> 00:44:30,031 Because basically what I've learned is that, yeah, 669 00:44:31,040 --> 00:44:32,931 the English use of the network 670 00:44:32,931 --> 00:44:36,338 is encouraged by the use of English by the ego 671 00:44:36,338 --> 00:44:40,756 and in a certain way it's so important that any other factor 672 00:44:40,757 --> 00:44:44,029 is really not that important. 673 00:44:45,231 --> 00:44:48,980 And even the second most important, the multilingual index 674 00:44:49,770 --> 00:44:54,830 was so light compared with the heavy impact of English 675 00:44:55,575 --> 00:44:57,107 used in the network. 676 00:44:57,608 --> 00:45:00,294 But what I thought was really interesting 677 00:45:00,295 --> 00:45:03,329 was how do you define the multilinguality of a network? 
678 00:45:03,968 --> 00:45:07,295 And with this I got help from Jordan Boyd-Graber 679 00:45:07,296 --> 00:45:09,336 who is also in the iSchool 680 00:45:09,337 --> 00:45:14,331 and in the lab for computational lab, the information processing lab 681 00:45:14,332 --> 00:45:15,332 here in Maryland. 682 00:45:15,333 --> 00:45:17,556 He helped me with all these technical aspects. 683 00:45:18,183 --> 00:45:20,590 And he was the one suggesting "Well, why don't you look--" 684 00:45:20,590 --> 00:45:24,620 "instead of just looking at the number of languages in the network..." 685 00:45:24,620 --> 00:45:28,694 "because sometimes you get wrongly detected languages..." 686 00:45:28,695 --> 00:45:30,231 like Swahili. Well, no one was really speaking Swahili in this network. 687 00:45:33,201 --> 00:45:37,029 There were technical challenges, like I explained to you. 688 00:45:38,122 --> 00:45:42,248 So maybe there's a high number of languages in the network 689 00:45:42,249 --> 00:45:44,189 but the network is mostly monolingual. 690 00:45:44,190 --> 00:45:49,064 Mostly everybody uses English and just a few people maybe use others 691 00:45:49,633 --> 00:45:52,337 or maybe just it got wrongly detected. 692 00:45:52,338 --> 00:45:54,810 And maybe you're just saying 693 00:45:54,811 --> 00:45:57,047 "Oh yeah, there's ten languages in the network!" 694 00:45:57,048 --> 00:45:59,548 and actually it's not a very multilingual network at all. 695 00:45:59,549 --> 00:46:02,650 So, we came up with this, the entropy. 696 00:46:03,390 --> 00:46:06,495 And this is a physics concept that measures the disorder 697 00:46:06,496 --> 00:46:07,866 in a system. 
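To make the entropy idea concrete, here is a small sketch of an entropy-based multilingual index. The exact equation on the slide is not reproduced in the transcript, so this is an assumption: it uses Shannon entropy over the proportion of nodes per language, divided by the logarithm of the number of languages so that the result stays between 0 and 1. The name `multilingual_index` and the input format (one language code per node) are hypothetical.

```python
import math
from collections import Counter

def multilingual_index(node_langs):
    """Normalized Shannon entropy of the language distribution in a network.

    node_langs: list with one language code per node.
    Returns 0.0 for a monolingual network and 1.0 when all observed
    languages are used by the same number of nodes.
    """
    counts = Counter(node_langs)
    total = sum(counts.values())
    k = len(counts)
    if total == 0 or k < 2:
        return 0.0
    # p = nodes using a language / total nodes, summed over languages.
    entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
    return entropy / math.log(k)  # assumed normalization to the range [0, 1]

# A mostly English network with two stray detections stays close to 0.
print(multilingual_index(["en"] * 98 + ["sw", "es"]))
```

Unlike a plain count of languages, this value barely moves when a single node is wrongly detected as, say, Swahili, which is the motivation given in the talk.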
698 00:46:07,866 --> 00:46:11,452 And in this case, the entropy would be my multilingual index 699 00:46:11,453 --> 00:46:17,104 and what it's doing is providing a value between 0 and 1 700 00:46:17,364 --> 00:46:23,105 So, with 0 it's a very homogeneous system everyone speaks the same language 701 00:46:23,549 --> 00:46:26,900 and if it's closer to 1, it's really heterogeneous 702 00:46:26,972 --> 00:46:28,911 and it places an importance 703 00:46:28,912 --> 00:46:31,823 in how many people are using each language. 704 00:46:32,235 --> 00:46:36,480 So, this is the equation, just to show you it. 705 00:46:38,009 --> 00:46:40,641 And it takes into account the number of languages in the network 706 00:46:40,642 --> 00:46:45,427 and then one of the variables is how many nodes in that language 707 00:46:45,498 --> 00:46:48,337 that there are divided by the total number 708 00:46:48,338 --> 00:46:50,971 and this is what gives the proportion for example. 709 00:46:52,889 --> 00:46:56,556 So just to let you know that there's interesting lessons 710 00:46:56,557 --> 00:46:57,977 from this study. 711 00:46:57,982 --> 00:47:00,479 Despite the research not being exciting! 712 00:47:00,549 --> 00:47:02,881 And this is what I'm doing right now. 713 00:47:04,816 --> 00:47:08,002 So, the intrinsic characteristic of the message 714 00:47:08,484 --> 00:47:11,038 how that influences the language choice. 715 00:47:11,062 --> 00:47:16,370 First, I'm wondering, because I just saw it in the content 716 00:47:19,070 --> 00:47:22,495 are replies encouraging people to use their native language? 717 00:47:22,992 --> 00:47:27,150 And public posts encouraging people to use English as a lingua franca? 718 00:47:27,759 --> 00:47:30,251 This is one that showed up the same. 719 00:47:30,252 --> 00:47:34,151 And I changed the handle, for privacy reasons... 720 00:47:34,549 --> 00:47:37,709 So this is the reply to somebody and it's in Arabic. 
721 00:47:38,443 --> 00:47:41,001 And this is a public posting and it's in English. 722 00:47:42,414 --> 00:47:45,501 Now, the thing I'm looking at is topic analysis 723 00:47:45,502 --> 00:47:50,314 and I'm considering with Jordan to do some automatic topic analysis 724 00:47:50,706 --> 00:47:54,215 because there's many languages, so I cannot decode it all 725 00:47:54,782 --> 00:47:56,503 in many of them. 726 00:47:56,507 --> 00:47:58,459 Only in three, maybe four... 727 00:47:59,910 --> 00:48:01,406 So, I'm wondering, 728 00:48:01,407 --> 00:48:04,213 are technology topics favoring the use of English? 729 00:48:04,600 --> 00:48:10,072 And other topics, international news maybe? 730 00:48:11,308 --> 00:48:16,147 Whereas other topics like national news or songs 731 00:48:16,148 --> 00:48:19,407 they might be encouraging the use of native languages. 732 00:48:20,566 --> 00:48:22,904 And then I'm looking if there's translations 733 00:48:22,904 --> 00:48:26,845 or if there's cross-cultural words that you can detect. 734 00:48:27,324 --> 00:48:29,111 For instance, this person is writing in English 735 00:48:29,112 --> 00:48:33,313 but is recommending a visit to a museum in the city of Lille in France. 736 00:48:33,767 --> 00:48:38,830 So this person knows the city in France, knows that to visit the museum 737 00:48:38,987 --> 00:48:40,556 you go there. 738 00:48:40,559 --> 00:48:43,089 And this is what I call cross-cultural words. 739 00:48:44,239 --> 00:48:49,095 [What I kind of found] is that surprisingly there's not many translation behaviors 740 00:48:49,096 --> 00:48:52,589 going on, despite these people being multilingual. 741 00:48:53,001 --> 00:48:56,264 And this is what is going to trigger some reflections. 742 00:49:00,289 --> 00:49:02,085 How am I doing on time? 743 00:49:04,172 --> 00:49:05,646 (woman) 1:22. 744 00:49:05,646 --> 00:49:10,050 (man) Umm, it's usually an hour long... 745 00:49:10,450 --> 00:49:14,358 So, I will go on with my reflections. 
746 00:49:14,358 --> 00:49:18,266 to encourage some thoughts. 747 00:49:18,266 --> 00:49:22,027 So the greatest connecting power is the will of users who want 748 00:49:22,027 --> 00:49:23,317 to be connected. 749 00:49:23,317 --> 00:49:28,201 This is a really nice quality, because the communities of interest 750 00:49:28,290 --> 00:49:32,012 in social media, in Twitter is what is bringing people 751 00:49:32,013 --> 00:49:33,701 from different countries, together. 752 00:49:34,794 --> 00:49:41,151 And also experiences, like the Voluntweeters, 753 00:49:42,095 --> 00:49:45,815 so after the earthquake in Haiti, there were these spontaneous 754 00:49:45,816 --> 00:49:48,972 self-organizations of Twitter users for translating tweets 755 00:49:50,213 --> 00:49:53,755 and they called themselves Voluntweeters, there's a paper about that-- 756 00:49:53,826 --> 00:49:59,151 So this is the triggering of social connections 757 00:50:00,820 --> 00:50:04,486 across countries, across borders and across languages. 758 00:50:06,759 --> 00:50:10,300 But even when the social structure could potentially facilitate 759 00:50:10,301 --> 00:50:13,375 information diffusion and cross-language linking 760 00:50:14,558 --> 00:50:16,731 this condition is not sufficient. 761 00:50:16,732 --> 00:50:19,720 There are other factors like the design of the interfaces 762 00:50:19,721 --> 00:50:22,479 and the design of systems that can influence... 763 00:50:23,145 --> 00:50:27,438 can promote, or not translation behaviors and cross-cultural awareness. 764 00:50:28,293 --> 00:50:31,503 And the Wikipedia of cross-language linking 765 00:50:31,504 --> 00:50:35,113 you have links for many languages for every article. 766 00:50:37,257 --> 00:50:41,061 We also still acknowledge the dynamic language preferences of multilingual users 767 00:50:41,790 --> 00:50:44,145 so they could address their messages to the appropriate audience. 
768 00:50:44,146 --> 00:50:47,187 I like the solution of Google+ with their circles 769 00:50:47,880 --> 00:50:51,890 where I can put my friends and family in Spain in a circle 770 00:50:51,891 --> 00:50:54,559 and write them in Spanish. 771 00:50:54,739 --> 00:51:00,633 And then the recommendation of people based on language profile 772 00:51:01,437 --> 00:51:04,134 would be useful for this spontaneous self-organization. 773 00:51:05,708 --> 00:51:08,057 So, these are some of the things. 774 00:51:08,143 --> 00:51:10,455 The impact of mediation. 775 00:51:10,782 --> 00:51:13,206 Global Voices is an international community of bloggers 776 00:51:13,207 --> 00:51:18,303 that connect bloggers and citizens from around the world 777 00:51:18,814 --> 00:51:20,504 in different languages. 778 00:51:21,171 --> 00:51:22,580 And Scott Hale 779 00:51:22,581 --> 00:51:27,353 a student from Oxford University led a very interesting study 780 00:51:27,354 --> 00:51:33,960 after the earthquake in Haiti about blogs in Spanish, Japanese and English 781 00:51:35,561 --> 00:51:38,542 and he looked at the cross-language linking 782 00:51:38,543 --> 00:51:41,388 and focusing on this topic over time. 783 00:51:41,488 --> 00:51:45,495 And he discovered that 50 percent of the cross-language linking 784 00:51:45,496 --> 00:51:48,304 was happening through this platform, Global Voices. 785 00:51:49,062 --> 00:51:51,941 So, it had a very big impact in the language links. 786 00:51:54,170 --> 00:51:57,857 And finally, social media, big media outlets, 787 00:51:57,858 --> 00:52:01,592 people are interconnected in these complex networks 788 00:52:04,693 --> 00:52:08,945 and underlying is this language ecosystem. 789 00:52:09,058 --> 00:52:12,786 So we have the language ecosystem, and on top of that 790 00:52:12,787 --> 00:52:15,296 we have the social media ecosystem. 791 00:52:15,305 --> 00:52:20,200 People would share a video from YouTube on Twitter, or news on Facebook. 
792 00:52:21,302 --> 00:52:26,011 What happens if we integrate in this ecosystem 793 00:52:26,517 --> 00:52:30,518 these platforms, like Global Voices, like Universal Subtitles 794 00:52:30,519 --> 00:52:34,327 which is a platform for crowdsourcing subtitling of videos 795 00:52:34,328 --> 00:52:37,108 and translation of subtitles for videos. 796 00:52:38,050 --> 00:52:42,222 If you integrate that and this starts connecting, starts building paths 797 00:52:42,223 --> 00:52:45,743 between languages, that didn't exist before. 798 00:52:45,744 --> 00:52:50,955 So I think we should make it easy for multilingual people to translate 799 00:52:50,955 --> 00:52:55,187 and subtitle all the content they like, their favorite content 800 00:52:56,003 --> 00:53:00,326 and share it with the appropriate audience so they can start connecting 801 00:53:00,327 --> 00:53:03,114 the language islands of the internet. 802 00:53:03,145 --> 00:53:06,219 And that way stories will travel all over the world. 803 00:53:09,204 --> 00:53:11,950 Particularly I would like to thank Jen Golbeck, my adviser 804 00:53:11,951 --> 00:53:14,337 and Fulbright for supporting this research. 805 00:53:14,477 --> 00:53:19,206 And then I open the space for questions and your ideas 806 00:53:19,488 --> 00:53:21,780 if this has triggered some thoughts. 807 00:53:24,140 --> 00:53:25,972 (woman) I have a question about how this relates 808 00:53:25,973 --> 00:53:28,112 to your Yahoo award. 809 00:53:29,468 --> 00:53:35,076 Well, they have the Internet Experiences lab in California. 810 00:53:35,078 --> 00:53:36,428 And they-- 811 00:53:36,460 --> 00:53:40,213 So, we tend to think maybe it's a super tiny place 812 00:53:40,213 --> 00:53:42,630 but actually there are fields 813 00:53:42,631 --> 00:53:44,818 and I applied for the social systems. 814 00:53:45,121 --> 00:53:48,967 The social systems are a category. 
815 00:53:49,068 --> 00:53:54,686 And I think that was embedded in the Internet Experience lab 816 00:53:56,739 --> 00:53:58,452 and yeah, they liked it. 817 00:53:58,516 --> 00:54:01,530 (man) But is it this work that they are interested in? 818 00:54:01,813 --> 00:54:02,883 Yes. 819 00:54:02,884 --> 00:54:04,022 - The languages? - Yes. 820 00:54:04,022 --> 00:54:07,726 Well, now I have results, because I wrote up reports 821 00:54:09,496 --> 00:54:11,548 about what my work was about. 822 00:54:16,758 --> 00:54:17,968 Great. 823 00:54:22,055 --> 00:54:22,879 Yes? 824 00:54:22,879 --> 00:54:25,682 (woman) I was thinking about if you analyzed the place... 825 00:54:25,682 --> 00:54:30,689 like if there's any relationship between tweeters and tweets 826 00:54:31,056 --> 00:54:33,624 and the place that the people are. 827 00:54:35,883 --> 00:54:39,760 I mean, because it's not the same being a Brazilian in Brazil 828 00:54:39,761 --> 00:54:43,197 and tweeting in Portuguese or being Brazilian in the US 829 00:54:43,198 --> 00:54:45,330 and tweeting in Portuguese-- 830 00:54:45,950 --> 00:54:49,249 There's many, many factors that I haven't looked at. 831 00:54:50,126 --> 00:54:51,971 It's not part of your study? 832 00:54:52,300 --> 00:54:54,447 But because I had to scope it somehow. 833 00:54:54,448 --> 00:54:56,108 There's so many factors. 834 00:54:56,710 --> 00:54:59,993 Geography was one that I was originally intending to look at 835 00:55:00,097 --> 00:55:04,458 but I found there were so many problems to actually get the right geography 836 00:55:04,459 --> 00:55:06,652 the right geolocation. 837 00:55:08,154 --> 00:55:12,136 The problem is that I didn't originally collect the geolocation. 838 00:55:12,137 --> 00:55:15,898 I think only a small percentage of messages have... 839 00:55:16,457 --> 00:55:18,297 geolocated information. 840 00:55:18,902 --> 00:55:20,795 I'm not sure about the percentage there. 
841 00:55:20,796 --> 00:55:24,690 So there's only a small percentage of messages that have geolocation. 842 00:55:25,173 --> 00:55:27,604 There's issues with the accuracy... 843 00:55:28,041 --> 00:55:31,147 What I have collected is the information in their profile: 844 00:55:31,931 --> 00:55:35,462 they can put the information about the place, 845 00:55:35,493 --> 00:55:39,572 but sometimes it's more or less trustworthy, 846 00:55:39,573 --> 00:55:42,828 sometimes there's nothing, and sometimes there's just crazy stuff. 847 00:55:43,210 --> 00:55:44,710 (audience laughs) 848 00:55:46,545 --> 00:55:49,735 So, something absolutely has to be there. 849 00:55:50,419 --> 00:55:55,249 If I wanted to expand this, geography would be a nice place to go! 850 00:55:55,279 --> 00:55:56,609 (woman) Ok. 851 00:55:59,863 --> 00:56:00,631 Yes? 852 00:56:00,631 --> 00:56:01,710 (man) Could you say a little bit more 853 00:56:01,710 --> 00:56:04,946 I think you said about the visualization choices you made? 854 00:56:04,964 --> 00:56:06,224 Oh yes, well... 855 00:56:08,033 --> 00:56:11,117 I tried this tool, the NodeXL, 856 00:56:11,118 --> 00:56:13,284 I used both NodeXL and Gephi. 857 00:56:13,522 --> 00:56:14,522 There's more... 858 00:56:16,109 --> 00:56:20,202 I think there's, I don't remember the name, there's one that was developed 859 00:56:20,202 --> 00:56:21,854 here in Maryland 860 00:56:21,854 --> 00:56:24,163 but it's not as user-friendly. 861 00:56:26,108 --> 00:56:29,563 But I've forgotten the name, I will have to look it up. 862 00:56:29,895 --> 00:56:33,872 And there's a lot of tools that are for really technical people 863 00:56:34,696 --> 00:56:37,156 that are handling millions of nodes. 864 00:56:37,528 --> 00:56:40,615 Because with these tools, for social scientists or humanists 865 00:56:40,615 --> 00:56:42,295 maybe they are not. 866 00:56:42,316 --> 00:56:48,685 Some tools can have maybe 300-400 nodes and still be understandable.
867 00:56:51,115 --> 00:56:55,622 But if you go beyond that, actually visualizations get crazy 868 00:56:56,058 --> 00:57:02,088 and even for more technical tools for more technical people 869 00:57:02,563 --> 00:57:07,061 there are hundreds or millions, they cannot do visualizations 870 00:57:08,349 --> 00:57:11,870 at some point they just give you statistical measures. 871 00:57:13,729 --> 00:57:15,156 I have to leave it out. 872 00:57:15,156 --> 00:57:17,051 I have a list of tools and that 873 00:57:17,051 --> 00:57:20,598 but if I need the names, I need to go through everything. 874 00:57:22,596 --> 00:57:25,479 (woman) But yours was Mac-accessible? 875 00:57:25,479 --> 00:57:31,585 Yes, this Gephi tool is Mac-accessible, you can use it with Microsoft 876 00:57:31,792 --> 00:57:34,446 with Mac and with Linux. 877 00:57:35,905 --> 00:57:37,979 And I forgot to say, it's open source. 878 00:57:43,480 --> 00:57:48,839 (woman) Did you find studying languages and internet 879 00:57:48,840 --> 00:57:52,681 was like a place, unexplored? 880 00:57:52,948 --> 00:57:55,208 Like here in the United States? 881 00:57:55,378 --> 00:58:00,001 Like when you began studying or analyzing this 882 00:58:00,002 --> 00:58:04,303 you felt that a lot of people are doing this 883 00:58:04,303 --> 00:58:06,200 or nobody is doing this 884 00:58:06,200 --> 00:58:08,352 and I'm the first one trying to-- 885 00:58:08,435 --> 00:58:13,114 I'm not the first one, but it's a very new area 886 00:58:13,114 --> 00:58:14,971 to be exploring. 887 00:58:15,033 --> 00:58:16,983 So, it's very exciting because of that. 
888 00:58:17,012 --> 00:58:18,797 Because there's so many unanswered questions 889 00:58:18,798 --> 00:58:23,785 and I find that surprisingly enough the United States is not paying so much attention 890 00:58:23,786 --> 00:58:26,053 to multilinguality issues. 891 00:58:26,053 --> 00:58:31,002 And I think that language policies are very monolingual-oriented 892 00:58:31,003 --> 00:58:32,948 but it's terrible 893 00:58:33,043 --> 00:58:37,182 because there's a whole lot of multilinguality in this country. 894 00:58:37,183 --> 00:58:41,270 There's so many people speaking different languages 895 00:58:42,548 --> 00:58:45,290 that I'm so amazed about that contradiction. 896 00:58:45,780 --> 00:58:48,727 Because in Europe, it's an obvious challenge for us 897 00:58:49,388 --> 00:58:51,907 because we need to understand each other between all these countries 898 00:58:51,907 --> 00:58:53,567 of the European Union. 899 00:58:53,567 --> 00:58:58,499 And there's a lot of money invested in research that relates to multilinguality 900 00:58:58,691 --> 00:59:00,738 and communication in languages 901 00:59:00,738 --> 00:59:04,557 and technology in particular, cross-language systems 902 00:59:04,558 --> 00:59:09,030 and in libraries there's a lot of work going on. 903 00:59:09,400 --> 00:59:13,942 There's investment in the research. 904 00:59:14,565 --> 00:59:18,405 So yeah, maybe in terms of investment 905 00:59:18,405 --> 00:59:22,115 the European Union is not a bad place to be. 906 00:59:22,322 --> 00:59:24,109 Better than the United States! 907 00:59:24,110 --> 00:59:27,445 But at the same time, what I find interesting 908 00:59:27,446 --> 00:59:33,323 is that here when I talk about it people are really interested 909 00:59:35,313 --> 00:59:38,376 and interested in the subject and excited about it. 910 00:59:38,458 --> 00:59:41,294 Maybe in Europe it looks more like old news. 911 00:59:41,294 --> 00:59:43,796 Like "yeah, we already know that."
912 00:59:44,135 --> 00:59:45,665 (audience laughs) 913 00:59:45,674 --> 00:59:49,580 So I find that it's exciting to be seeing the audience 914 00:59:49,629 --> 00:59:52,226 like "Oh yeah!" It's so new. 915 00:59:52,666 --> 00:59:54,026 (woman) Yes. 916 00:59:58,653 --> 01:00:03,146 (woman) As the emerging view of research in the United States 917 01:00:03,146 --> 01:00:09,892 can you show me which institutions or which area of academic institutions 918 01:00:11,798 --> 01:00:14,748 actually have more invested in this topic in the US? 919 01:00:16,262 --> 01:00:18,916 I'm not sure about the institutions. 920 01:00:20,572 --> 01:00:25,978 What I know, particularly, in Indiana there's work 921 01:00:26,510 --> 01:00:29,107 because Susan Herring is a researcher there. 922 01:00:30,797 --> 01:00:32,891 She has inspired my work. 923 01:00:32,891 --> 01:00:35,607 She published a book The Multilingual Internet 924 01:00:35,687 --> 01:00:40,953 and she has done research on blogs, also communities 925 01:00:41,891 --> 01:00:45,251 of different languages connecting blogs in the blogosphere. 926 01:00:45,251 --> 01:00:51,058 So she has been one of the ones, one of the first tackling these issues 927 01:00:51,144 --> 01:00:54,720 and she's still going and she's doing something. 928 01:00:54,896 --> 01:00:59,399 So, it's the University of Indiana, I think. 929 01:01:00,914 --> 01:01:03,348 Yeah, Susan Herring. Look for her! 930 01:01:06,095 --> 01:01:09,181 And also at the same university there's Paolillo. 931 01:01:10,156 --> 01:01:12,793 He's also doing research in this area 932 01:01:12,826 --> 01:01:18,869 and he actually published for UNESCO for research on language diversity 933 01:01:18,945 --> 01:01:20,275 on the internet. 934 01:01:21,785 --> 01:01:23,479 So Susan Herring and Paolillo, 935 01:01:23,480 --> 01:01:25,444 they are at the same university. 936 01:01:26,736 --> 01:01:30,058 Those are my inspiring ones.
937 01:01:33,682 --> 01:01:37,270 Well, at Harvard, the Berkman Center of Internet and Society also did 938 01:01:37,270 --> 01:01:38,639 this mapping of the blogs. 939 01:01:38,640 --> 01:01:40,649 But they don't focus on languages. 940 01:01:41,700 --> 01:01:45,279 But there's a tangential thing around there. 941 01:01:49,387 --> 01:01:51,428 (man) One more question? 942 01:01:53,560 --> 01:01:54,748 Well, thank you very much! 943 01:01:54,749 --> 01:01:55,749 Thanks! 944 01:01:55,759 --> 01:01:57,661 (audience applauds)