Hello everyone, and a warm welcome to Multimodal Language Processing. My name is Xaver Funk, and I recently had the chance to really immerse myself in this topic, because I am studying neurosciences and this was something I had to do anyway. And that's what I want to share with you today.

So, what I have also been doing recently is learning Arabic, and a little bit of Mongolian. And mostly what I did was this: I had a stream of auditory signals, coming for example from the Assimil audio, and I tried to match those to the symbols that represent them in the book. And I had this feeling that this is incomplete, that something is missing there. Meanwhile I was studying a lot about multimodal language processing, which is about how gestures influence language processing and things like that, and I came to the conclusion that, yes, there is something missing. In our world today we are all literate, so we mostly think of languages as these auditory signals, these mouth noises, and the symbols that represent them. But there is so much more going on in face-to-face communication, and I want to make this point clear with a little experiment.

So, I want to invite you to first listen to this audio excerpt from an "Easy Languages" video. I give you the subtitles here, with the English translation as well. Basically, these are auditory signals in Dutch, and sequences of symbols in Dutch and English. For the people learning Dutch, please just ignore the English, to make it a little harder. And people who know Dutch, please close your eyes, so that you don't see it at all. So, let's go.

When I was listening to this at first, because I know some Dutch, I understood quite a lot, but not everything. And then I watched the video that goes with it, and that was a rather different experience. That's what we are going to do now. So just watch the video, and if you can, see how these two women who are interviewed here interact with the interviewer and with each other.

So, I hope this worked, and it felt a little different this time. Even for the people who don't know Dutch, I hope you could roughly follow what was going on. And even if you didn't, the point I want to make is that messages are not only auditory, they are always also visual. We have a lot of non-auditory articulators: 43 face muscles, for example, then 2 x 34 muscles in the hands, and even more in the arms and the torso. And the people in this video really knew how to use them. For example, there was a lot of facial movement going on, like you see at the top here. See how she raises her eyebrows, and then this head tilt at the end that really puts emphasis on what she's saying. And there is a lot of gaze switching as well. That's right at the beginning, when she says (Dutch) "Oh genoeg! Heb je even?", so, "Oh, there is so much that I want to see! Do you have some time?" But she doesn't literally say "time", she says "Heb je even?", "Do you have a little?" When I was only listening, I didn't quite get what she was saying, but when I saw how she addresses the interviewer, I got it afterwards. Then there are of course manual gestures, like with "hoog op mijn lijst" ("high on my list"), that's this one here. She says "berglandschap", so that's a mountain landscape, and then "lang geleden", "a long time ago", right? So a lot of the message is supported by these manual gestures.
Then there is also stuff like this nose scratching, where we don't even know whether there is something to it or whether it's just nose scratching. Does it carry some information? We don't know. And then, lastly, arm and torso movements. If you watch at the top here, you also have nodding. You see how these two kind of nod together; it really gives the impression of how good friends they are. And if you look at this bottom part here, that's my favorite part of the video. You really have this complex orchestration of different gestures, and they are turn-taking. The one on the right says something, the one on the left answers it perfectly, then you have gestures, then the putting-their-hair-back, right, so there is so much going on between them, and it really carries more than just the auditory message.

So, note that there is something our brain has to achieve here, mainly two things. It has to segregate all of the stuff that is not important for the message from the important stuff; that's the segregation problem. And then it has to take all the important stuff, all the auditory and visual information, and put it together into a coherent message; that's the binding problem. And all of this, note, happens under a really tight time constraint when you're turn-taking, when you're having a conversation. You say something, the other person says something, and there is not much time between turns. And if you need more time, then that also has a meaning, right? If you take time, that means you're hesitating to answer, maybe there is something going on with you emotionally... So you don't want that either. So basically, this is a huge computational problem for your brain.

And well, how did your brain do? Did you feel the video was more difficult than the audio? Did you understand more, or did you understand less? Did you feel more in the scene, maybe, catching more information between the lines? For me at least, as you might guess, it was way easier to follow with the video and to interpret these gestures. And this is kind of a paradox: how come that processing more signals simultaneously is easier than processing speech alone? This has also been shown in the literature; people have run experiments on this, and it really is a surprising facilitation. There are a lot of studies; I'll just give you one example. In this study they showed people a "prime", a video of an action that somebody performed, and then they showed them further videos. These videos were either completely congruent, so what was said matched the gesture and matched the prime - in that case it would be the word "chop" together with the chopping gesture - or there were conditions where either the speech was congruent and the gesture incongruent, or the speech incongruent and the gesture congruent. And then they also had weakly incongruent items: this gesture for "chopping", for example, which is actually cutting, so only weakly incongruent, versus this twisting, which is strongly incongruent. People then had to press a button for "yes" if either the speech or the gesture was related to the prime, and "no" if neither the speech nor the gesture was related to the prime.
And what they found was that there were differences in response times, and also in the proportion of errors people made, as soon as something was incongruent. From that the authors came to the conclusion that speech and gestures really are two sides of the same coin; they mutually interact to enhance comprehension. And now the big question is, of course: how does our brain achieve this surprising facilitation?

We can look back at turn-taking to get some clues here. On average, a turn transition takes only about 0 to 200 milliseconds, so at most a fifth of a second. You can see in this video how fast she is responding, right now, after this - it's basically instant, right? And this is quite extraordinary, because producing a single word actually takes about 600 milliseconds. If I just prompted you to say a word, you would take about 600 milliseconds to say it. So something is going on: it seems we are already predicting what we are going to say before the other person's turn is finished, and we already prepare our own turn. So there is something going on that has to do with prediction; most language use in conversation has to be based on prediction somehow. And this is quite convenient, because prediction is anyway the current hype in neuroscience; it's basically a good candidate for the overarching function of the brain. Many people think that what we are doing in our daily lives is constantly computing and updating probability distributions, and this applies to action, to perception, and also to language. So we can rephrase the problem we had before as a prediction problem, and it becomes: "given the preceding context - given all the words that came before - which word is most likely to come up next?"

To make this more concrete, let me give you a quick example. Imagine I come to you and say, without any more context, "I would like to". You don't know what I'm going to say next, right? It could be any of these, for example: I would like to drink, eat, work... and so on. Now, if I shape my hand in the form of a "C" and put it to my mouth, like this, while I say "I would like to", then your probability distribution over these words changes in such a way that "drink" is much more likely to be the next word. Maybe "eat" a little bit as well, but the other words probably not, because you associate this gesture with drinking, or a little bit with eating, since it's also something you put to your mouth, but mostly it is commonly understood as "drinking". So in this way gestures add context to predictions and help this process of predicting, and that also helps comprehension.

And we can actually measure prediction, using neurophysiology. This is EEG; "EEG" stands for electroencephalography, and it basically means putting electrodes on the scalp and measuring the brain activity below. What people usually do is present sentences. These could be normal sentences, like this one: "It was his first day at work." Or they could be sentences that are artificially manipulated to elicit some response, for example by violating the semantics: "He spread the warm bread with socks." You may get a weird feeling in your head, because nobody spreads their bread with socks.
And this weird feeling, if we measured you with EEG, would show up as this reaction here, the so-called N400 - "N" because it has negative polarity, and 400 because it peaks about 400 milliseconds after the word. All of this up here is just electrical activity, and you get this really pronounced peak when there is a violation of the semantics, like with "socks". It is also taken to be a prediction error: you did not predict "socks", you predicted "Nutella", for example, or honey, but not socks, and this is reflected in the N400 prediction error. People do this a lot, showing sentences that are manipulated in some way. We have another example here, on a different topic: if you write in all caps, you get this kind of response, for example.

But what I want to do now is bring you closer to the cutting edge of what is currently done in multimodal processing research. The trend is to move away from these artificially constructed sentences and towards naturalistic language comprehension: using actual stories, actual sentences, that are not manipulated in any way. This is combined with computational linguistics - how that works, you will see in a bit. And with that you can look at multimodal processing, if you simply add a video to the audio that you have people listen to.

Here is what it might look like. This is one study that is not officially published yet; it is already on a preprint archive. I want to use it to illustrate how we might research naturalistic language comprehension. The general plan is to get some per-word measures - these would be the ones here, so for each word there is some value attached - and then use these as regressors in a big linear regression model. Using fancy statistics, we are basically asking our data: "how well are you predicted by these regressors?" For example, this one here is surprisal, which is closely related to prediction, or prediction errors. It is the negative log probability of a word given all of the words that come before it. So this here is the context, basically, and this is some word "w", and the measure tells you how unpredictable a given word is. It is based on computational language models: for example, they would take a whole corpus of a language, look at which words occur after one another, and thereby arrive at this value of how unpredictable a word is. And then they have another thing here: they use the fundamental frequency of each word as a pitch indicator, to control for prosody, which is also pretty cool.

So they let their linear regression models loose with these predictors: a surprisal value for each word, a prosody value for each word, then indicators for where meaningful gestures happen, and also for mouth movements. One finding that came out of this, and that might be interesting for us now, is that for meaningful gestures the N400 is less negative. You can see this here: for meaningful gestures, this blue line, the N400 is a lot less negative than the red line, where gestures are absent. And then there is - that's why I told you about surprisal - an interesting interaction between gestures and surprisal: the higher the surprisal, so the more unexpected a word, the stronger this facilitating effect of gestures is. Which is also really interesting.
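To make the surprisal measure a bit more tangible, here is a minimal, purely illustrative sketch in Python. Everything in it is an assumption made up for the example: the tiny corpus, the bigram counts, the smoothing, and the function names. The actual study used large computational language models over real corpora, not a toy bigram count like this; the sketch only shows what "negative log probability of a word given its context" means in practice.

```python
import math
from collections import Counter, defaultdict

# A toy corpus standing in for "the whole corpus of a language".
# (Invented for illustration; real studies use much larger corpora and models.)
corpus = (
    "he spread the warm bread with butter . "
    "he spread the warm bread with honey . "
    "she spread the cold bread with butter ."
).split()

# Count bigrams: how often does each word follow the previous word?
bigram_counts = defaultdict(Counter)
for prev, word in zip(corpus, corpus[1:]):
    bigram_counts[prev][word] += 1

def next_word_probability(prev, word, alpha=0.1):
    """P(word | prev) with add-alpha smoothing, so unseen continuations
    (like 'socks') still get a small, non-zero probability."""
    vocab = set(corpus) | {word}
    counts = bigram_counts[prev]
    return (counts[word] + alpha) / (sum(counts.values()) + alpha * len(vocab))

def surprisal(prev, word):
    """Surprisal = -log2 P(word | context): high when the word is unpredictable."""
    return -math.log2(next_word_probability(prev, word))

# The expected continuation is far less surprising than the anomalous one:
print(f"surprisal of 'butter' after 'with': {surprisal('with', 'butter'):.2f} bits")
print(f"surprisal of 'socks' after 'with':  {surprisal('with', 'socks'):.2f} bits")
```

In a setup like the one in the study, each word of the naturalistic stimulus would get such a value, and those values would then enter the regression as one predictor alongside prosody, gestures and mouth movements.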
Then there is a similar study, which I actually got the chance to work on with a colleague. What we did here: we had a measure of entropy, which basically measures the uncertainty about the next word. If you think back to the example from before, where I said "I would like to" and then something, but without any context, that would be really high entropy, really high uncertainty: you don't know what's coming next. (There is a small illustrative sketch of this idea after this section.) Then we also had surprisal, and word frequency, so how often a word comes up. IVC is an abbreviation for "instantaneous visual change", a measure of how much the actor moved while we were showing this to people. And the speech envelope is basically a measure of the sound level. What we found is - and this, by the way, was an fMRI experiment, so we can look at which regions are active during some condition - that for words where the surprisal was really high, these regions in red were active, and for words where the entropy was really high, these regions in blue. And if we look at interactions with gestures for the entropy condition, we can see that when gestures were present, there were really specific activations compared to when no gestures were present, in situations with high entropy, so high uncertainty. So with these tools we try to get at the processes that underlie prediction in language.

Now let's take a step back and look at this from a more global, evolutionary perspective. We know from primate research that gesture and gaze are crucial for communication. You can see it in this video: this ape right here makes a gesture that signals to its mother to pick it up. These are bonobos, and you can see, right now, this "pick me up" gesture. And Federico Rossano, from the Max Planck Institute for Evolutionary Anthropology, could show that this gesture gets more and more ritualized, to the point where only a small wrist bend with the arm and one gaze are enough to initiate this carrying behavior. So you see that there is also a kind of prediction involved: the mother has to predict what the child wants, going from the full gesture to only this small one.

Building on this, some authors propose that speech and gesture have a common origin. The idea is that, through ritualized gestures like the ones we just saw in those bonobos, over time a proto sign language evolves, which at some point gets accompanied by sound as well, evolving into a proto speech. And then the proto sign and the proto speech reinforce each other more and more, until language emerges. Another observation here: those of you who have tried sign language know it feels surprisingly natural, right? If speech were the one true communication medium for humans, why does sign language feel so real, so natural? Another point that fits this theory is that voluntary hand movements came before voluntary breathing, and you need voluntary breathing to articulate speech. Also, as a complement to speech, you can show spatial relations between things much more easily with your hands. And if you look at child development, you see the same pattern: gestures develop before speech, and pre-speech turn-taking is faster than turn-taking later in development. If you're a baby and you gesture, the turn-taking with your mother, the communication, is quite fast, almost at adult-level speed.
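Coming back briefly to the entropy measure from the fMRI study and to the "I would like to ..." example: here is another tiny, made-up sketch. The next-word probabilities are invented for illustration, and this is not how the study itself computed entropy (there it came from actual language-model distributions over naturalistic stimuli); the sketch just shows the idea that a gesture can shrink your uncertainty about what comes next.

```python
import math

# Hypothetical next-word distribution after hearing "I would like to ...",
# with no further context (numbers invented purely for illustration).
no_context = {"drink": 0.25, "eat": 0.25, "work": 0.25, "sleep": 0.25}

# The same distribution after also seeing the C-shaped "hand to the mouth"
# drinking gesture: probability mass shifts towards "drink".
with_gesture = {"drink": 0.80, "eat": 0.15, "work": 0.025, "sleep": 0.025}

def entropy(dist):
    """Shannon entropy in bits: the uncertainty about the next word."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

print(f"entropy without the gesture: {entropy(no_context):.2f} bits")
print(f"entropy with the gesture:    {entropy(with_gesture):.2f} bits")
```

High-entropy moments like the first case are exactly the situations in which the study looked at what gestures add. With that sketch out of the way, back to how turn-taking develops.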
Then, as you learn to speak, this turn-taking gets way slower, and only around middle school does it get back to adult-level speed.

So, what's the point? What does all of this mean for language learning? For that, let's do another time travel, back to 1768, and meet this French Jesuit: Claude-François Lizarde de Radonvilliers. He wrote a book back then, "About the Way to Learn Languages", in which he reflected on how we should teach people languages. Interestingly, this is basically the grandfather of the Assimil method, and also of the Méthode Toussaint-Langenscheidt, also called the "Interlinearversion". That would be this sheet here; this is how people learned languages at the turn of the previous century. You can see that you have the Spanish at the top, some annotation in the middle, and the German at the bottom. This is quite similar to what Assimil does, right? Really interesting, but that's not the point here. What he also did in this book was to compare L1 acquisition, so first language acquisition, with second language learning. And he noted that for the first language, parents show their children pictures, enact words or concepts, and encourage the children to do the same, like this little boy here. But for second language learning, all we do is hand people vocabulary lists and expect them to learn them just like that.

This is an interesting point, and since then it has been shown - and this is actually pretty robust; I was really surprised how robust a finding it is - that gesture-enriched material enhances learning. In this study, for example, they tried to teach English-speaking people Japanese words, with four different training conditions: one with speech only, one with repeated speech, one with speech plus incongruent gestures, so gestures that don't match, and one with speech plus congruent gestures. That last one is the interesting condition, right? They then tested people at three different times after encoding: after five minutes, after two days, and after one week. And they tested them with forced choice, which is basically multiple choice, and with free recall, where people are prompted with the word and have to come up with the answer themselves. The numbers here are basically the proportion of correct answers people gave, and you can see that, across the board, the speech-plus-congruent-gesture condition is clearly superior to the other ones. Which is interesting, and you might think the point is: "okay, so we just use videos instead of audio", right?

This is what I would call multisensory enrichment, and there is nothing wrong with it; it is really useful. You have these YouTube channels like Easy Languages - I'm not sponsored, by the way (laughs) - where you have conversations with real people who make gestures from time to time, and you get the full conversation experience. And you have these one-on-one videos, like this one from Mandarin Corner, where there are also a lot of gestures involved; the host, Eileen, really tries to integrate a lot of gestures. But this is actually not the point - I mean, this is cool, but I think you already do that. The point goes much deeper. There is another thing going on, not only when you watch gestures, but when you enact them. This is called the enactment effect, and the term was actually coined in 1980, by two Germans.
They first called it the "Tu-Effekt", which translates literally to "do-effect". You can see why people chose to call it the enactment effect instead, because that sounds way more fancy (laughs), but I really like the "Tu-Effekt", it sounds funny. Anyway, the point that Engelkamp and Krumnacker made is that action words and phrases are remembered better if they're acted out, or accompanied by gestures. So if you learn the phrase "chopping garlic" and you actually enact it while learning it, you will retain it much better. This effect is also really well replicated, which again surprised me, because it has hardly been translated into actual teaching at all. Nobody does this; nobody tells students to enact things, to enact words, to enact anything. And it has been replicated across tasks, across materials and across populations: children, adults, even clinical populations, people with Alzheimer's, people recovering from stroke... People had them learn words and act the words out, and it worked better than without enactment. And it is not only true for action words and concrete words, but also for abstract words: anything you can somehow find a gestural representation for, you can use this enactment effect on. This is much more powerful than multisensory enrichment, and we can call it "sensorimotor enrichment", because you use your senses and your motor system.

This also ties in with another really interesting development in the neurosciences, called embodied cognition. Basically, this is the idea that many features of cognition - concepts, categories, reasoning, judgement - are shaped by aspects of the body: by the motor system, so how we move; by the perceptual system, so what we see, feel, hear, and so on; and by bodily interactions with the environment. You might see where I'm going with this if you think about concepts and categories: what are words, if not concepts and categories? So we might ask: how are words represented in the brain?

There is this really funny study where they showed people words with strong olfactory associations, which means they either stink really badly or they smell really nice. In case you're looking for inspiration for your Spanish poem, you can go (laughs) to this publication and search through the list of words - and this is only a small sample; there are tons of strongly smelling words in the study. Basically, what they found is that when they showed people these words, compared to words without much smell to them, some regions in the brain associated with olfaction, so with smelling, lit up. And this has been extended to actions as well. On the left here are all the regions that light up when you move your foot, your fingers, or your tongue. And on the right are the regions that light up when you read leg-related, arm-related, or face-related words. And you can see that these activations more or less match each other. So, in some sense, leg-related words are stored where you also move your legs, arm-related words where you move your arms, and so on. So we can actually think about words as functional networks, like this.
And note that words are experience-dependent functional networks, and experience is connected to the body. For example, you surely have not only read and heard the word "garlic", you have also smelled garlic, touched garlic, tasted garlic and, really important, chopped garlic. So when you read "garlic", not only the core language areas - in yellow here - are activated, but also subcortical olfactory areas, some gustatory areas (for taste), action areas, and visual areas as well. So this is what I want you to think about when you learn languages: did you do the same for "Knoblauch", for example, the German word for "garlic"? If you learn German, do you actually build up this huge associated network?

And that's the point, basically; we are coming to the end. The point is that language is multimodal, you should use sensorimotor enrichment when learning languages, and thereby embody your languages. If you want to learn more about this - and also so that I give credit - this is where I got most of my input from: these are four big review articles that discuss all of this. So yeah, that's basically it. Thanks for listening, and I'm hoping for some cool questions.

[Q&A]

I would most certainly guess so. This thing with the phone is also something I have experienced quite a lot. I lived in Chile for some time, and I got myself a Chilean SIM card. I didn't give that number to many people, but somehow it got to people who were, I don't know, trying to sell me something. And I would get a call from somebody, pick up the phone, and not understand a single word. Chilean Spanish is already really hard, and then it's completely out of context, I don't know what this person wants from me, and then it's just [gestures], and I'm like "sorry, I don't understand you", "I don't understand you", over and over again. And if you're on the phone there is maybe also a bit of noise, and I really have the feeling that that makes conversing, especially in a foreign language, that much harder, because you don't see the mouth movements, the sound isn't that clear, and you don't see anything else. So yes, I would say so.

Cool question. I would guess so, I would guess so. Especially for autistic people there is a lot of research on language processing in general, but I don't know of any studies specifically on multimodal processing; I would think there are quite a few, though. Actually, the experiment I showed you, the one I worked on, in the middle of the presentation, the entropy stuff: we also ran it with schizophrenic patients, but we have not looked at that data yet. Once this publication is done, somebody else will deal with the clinical data, with the schizophrenic patients. In general, in schizophrenia there are a lot of language-related abnormalities, and I think in autism as well. I'm not sure about ADHD, but it would be a really interesting thing to look at for autistic people, for sure. And maybe people have done this; you can look it up. I don't have anything in my head right now, but there should be something. I think one of the studies that I glanced over actually tried this.
So they had some gestures that were nonsense - I don't know if it was with abstract words or with concrete words - but they used nonsense gestures, and those still had an effect, just a smaller one. So if you try to integrate this into your studies, I would suggest you try to find an enactment that makes as much sense as possible. I mean, it's not impossible to find enactments for abstract words, you just have to be a bit more creative, and I think the more creative you are, the more effective it will be. Similar to mnemonics: the crazier a mnemonic is, the easier it is to remember. I could see the same effect with enactments.

And as for whether you should use signs from sign languages: I think if you want to, that's a cool idea, because then you automatically learn the sign language as well. When I was preparing this presentation I actually thought about this: why, when I start learning a language, do I not learn the sign language that goes with it? I think it would make things easier, because you have the enactment ready for you, and it's just a cool thing, right? You can talk with so many more people. Also, I think people should learn sign languages regardless. This became really clear to me at the Polyglot Gathering in 2019, in Bratislava. We were on some ship where there was a party, there was loud music, and there were some people who knew sign language - maybe they are listening right now. They were on the dance floor, and instead of screaming into each other's ears, like people usually do, they just started to sign, and it was so smooth. Why should we communicate with sound when we can do it with gestures? For many situations it would be a lot easier. So yes, if you can use the signs of your target language, I think that's a cool idea.

Yeah, it does sound like that. Indeed, indeed. Yeah, I can totally see that. There is a study that I came across while researching this, but I didn't look into it. If you want the reference, you can reach out to me somehow and I'll see if I can find it and send it to you. I didn't look deeply into it, and I don't think I would find it quickly now. So again, something has been done, but I can't recall it off the top of my head.

Well, there are two things - maybe more, but let's start with two. First of all, when you learn a new word, try to get the whole picture of the word. Like the garlic example: try to imagine how it smells, how it feels, how you chop it; try to enact it. Take a moment and really try to activate the whole functional network of this word. The other thing is to use input with video, if you're learning with input; look for interesting channels on YouTube or something. And then also - this might not have been clear from my presentation - if you are conversing with people, use signs. I don't know if people do this naturally in general; I think I kind of do it. If I talk in my target language and I'm not sure about a word, I will try to make sure with my hands that something gets through to the other person. What I'm going for is that the other person recognizes what I'm trying to say and then gives me the word. Like the example with the glass of water: if I don't know "drink" in some language, I would try to [MIMES], right? "I want to [MIMES]", right?
And then the other person gives me the word, because by gesturing I'm actively reducing the uncertainty of the other person, who is trying to predict what I'm going to say. So those would be my three practical implications for now.

Yeah, this is something I don't know. Again, I think it's worth trying to do this with sign language. There is a system of really fitting gestures that people already use, and it might actually be a good idea to try it out, to use sign language as you're learning the actual language, to get this enactment working. It might be more effective than making up your own gestures. Then again, if you make up your own gestures, you have the advantage that during the process of coming up with a gesture you engage your brain in a specific way that isn't there if you just take the gesture from somebody else. So there might be an advantage there, but the other advantages are of course the time you save, and the ability to communicate with people who can't hear. So yeah, I think that's open for exploration, for sure.

Well, you see it in sign language, right? People who sign don't really speak, and they get along pretty nicely. Another question would be whether society as a whole could do without verbal language; that's another question, but I think you could restructure society in a way that everybody communicates with gestures, for sure. And according to some people, it was like that before speech developed.

So yeah, there has been some research, though not much. If you are interested in this, make sure to check out my presentation on this topic from last year's Gathering, and also from last year's conference. The conference one is not up on YouTube yet, but the Gathering one is, and at the end I show a study that was done on polyglots and hyperpolyglots - actually only hyperpolyglots, I think. They put people in an fMRI scanner and just gave them language material. What they found is that the language network was less active than in monolinguals. So if you listen to something, there are some areas on the left side of your brain that light up: typical areas like Broca's area, Wernicke's area, and some others. And they found that for polyglots this "lighting up" is weaker, and the interpretation was that the polyglots' language network, through extensive practice, has become more and more efficient at dealing with language, and therefore needs less activation. This is something you observe quite often: when there is some process you get really good at, the brain activity you see goes down, because the network gets more efficient. That's why the paper was aptly titled "The Small And Efficient Network Of Polyglots And Hyperpolyglots". And - you can look this up as well - they also had them listen to different languages. The first experiment was done in English, their mother tongue, and in the second experiment they used their other languages: the second-best language, third-best language and so on. And there, the lesser known a language, the less active the language network, and the better known, the more active. So you have kind of the opposite effect. They interpreted this as reflecting that the more you know of a target language, a foreign language, the more of the language network gets recruited, the more context you have.
So you have this effect of getting really efficient in your mother tongue, and of getting more of the whole message the better you know a foreign language. So, that was the last question. Alright, thanks for listening, thanks to the organizers for organizing this - the streaming works really well, I'm really impressed. Thanks, guys!