In this segment I'm going to show you that dependency syntax is a very natural representation for relation extraction applications.

One domain in which a lot of work has been done on relation extraction is biomedical text. Here, for example, we have the sentence "The results demonstrated that KaiC interacts rhythmically with SasA, KaiA, and KaiB." What we'd like to get out of that is a protein interaction event: the word "interacts" indicates the relation, and the proteins named around it are the participants.

The point is that if we have a dependency analysis of this sentence, it's very easy, starting from "interacts", to follow the arc to its subject and the arc through the preposition "with", and so to read off the relation we'd like to extract. And if we're just a little bit cleverer, we can also follow the conjunction relations and see that KaiC interacts with the other two proteins as well.
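To make that traversal concrete, here is a minimal sketch in Python. The parse is hand-written, with arc labels in the style of collapsed Stanford dependencies (nsubj, prep_with, conj_and); what an actual parser outputs for this sentence could differ.

```python
# Typed dependency arcs for "The results demonstrated that KaiC
# interacts rhythmically with SasA, KaiA, and KaiB", written by hand
# in collapsed Stanford dependencies style: (head, relation, dependent).
arcs = [
    ("demonstrated", "ccomp", "interacts"),
    ("interacts", "nsubj", "KaiC"),
    ("interacts", "advmod", "rhythmically"),
    ("interacts", "prep_with", "SasA"),
    ("SasA", "conj_and", "KaiA"),
    ("SasA", "conj_and", "KaiB"),
]

def dependents(head, relation):
    """All dependents of `head` reachable over arcs labeled `relation`."""
    return [d for (h, r, d) in arcs if h == head and r == relation]

def extract_interactions(trigger="interacts"):
    """Follow the nsubj and prep_with arcs out of the trigger word,
    then follow conj_and arcs to pick up coordinated proteins."""
    subjects = dependents(trigger, "nsubj")
    partners = dependents(trigger, "prep_with")
    for p in list(partners):                  # expand "SasA, KaiA, and KaiB"
        partners.extend(dependents(p, "conj_and"))
    return [(s, trigger, p) for s in subjects for p in partners]

print(extract_interactions())
# [('KaiC', 'interacts', 'SasA'), ('KaiC', 'interacts', 'KaiA'),
#  ('KaiC', 'interacts', 'KaiB')]
```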
That's something that a lot of people have worked on. In particular, one representation that has been widely used for relation extraction applications in biomedicine is the Stanford dependencies representation.

The basic form of this representation is a projective dependency tree, and it was designed that way so it could be easily generated by post-processing phrase structure trees. If you have a notion of headedness in the phrase structure tree, the Stanford dependencies software provides a set of pattern-matching rules that type the dependency relations and give you a Stanford dependency tree as output. But Stanford dependencies can also be, and now increasingly are, generated directly by dependency parsers such as the MaltParser that we looked at recently.

So this is roughly what the representation looks like: just as we saw before, the words are connected by typed dependency arcs.

Something that has been explored in the Stanford dependencies framework is to start from that basic representation and make some changes to it to facilitate relation extraction applications. The idea is to emphasize the relationships between content words, since those are what relation extraction needs. Let me give a couple of examples.

One example: commonly you'll have a content word like "based", and the place where the company is based, Los Angeles, is separated from it by the preposition "in", a function word. You can think of such function words as playing the role that case markers play in many other languages. So it seems more useful to connect "based" and "LA" directly, introducing the relation "prep_in". That's what we do, and it simplifies the structure.

There are other places, too, where we can do a better job of representing the semantics with some modifications of the graph structure, and a particular case is coordination. In the example sentence "Bell, based in Los Angeles, makes and distributes electronic, computer and building products", we very directly get that Bell makes products. But we'd also like to get out that Bell distributes products, and one way to do that is to recognize the "and" relationship and say: okay, that means "Bell" should also be the subject of "distributes", and what they distribute is "products". Similarly, further down, we can recognize that the products are computer products as well as electronic products. So we make those changes to the graph and get a reduced graph representation.

Now, once you do this, some things are no longer as simple. In particular, the resulting structure is no longer a dependency tree, because some nodes have multiple arcs pointing at them. But on the other hand, the relations we'd like to extract are represented much more directly.
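Here is a minimal sketch of those two transformations, collapsing a preposition into a typed arc and propagating arcs across conjuncts, run on a hand-written basic-dependency parse of part of the Bell sentence. The parse and its labels (partmod, pobj, conj_and, and so on) are assumptions about what a parser would produce, not the exact Stanford conversion rules.

```python
# Hand-written basic dependencies for (part of) "Bell, based in LA,
# makes and distributes electronic, computer ... products".
arcs = [
    ("makes", "nsubj", "Bell"),
    ("Bell", "partmod", "based"),
    ("based", "prep", "in"),
    ("in", "pobj", "LA"),
    ("makes", "conj_and", "distributes"),
    ("makes", "dobj", "products"),
    ("products", "amod", "electronic"),
    ("electronic", "conj_and", "computer"),
]

def collapse_prepositions(arcs):
    """Replace head -prep-> in -pobj-> obj with one head -prep_in-> obj arc."""
    out = [a for a in arcs if a[1] not in ("prep", "pobj")]
    for head, rel, prep in arcs:
        if rel == "prep":
            out += [(head, "prep_" + prep, obj)
                    for (h, r, obj) in arcs if h == prep and r == "pobj"]
    return out

def propagate_conjuncts(arcs):
    """Copy the first conjunct's arcs onto the words conjoined with it,
    so "distributes" also gets Bell as subject and products as object."""
    out = list(arcs)
    for first, rel, second in arcs:
        if rel.startswith("conj"):
            # The second conjunct inherits the first one's dependents...
            out += [(second, r, d) for (h, r, d) in arcs
                    if h == first and not r.startswith("conj")]
            # ...and is attached wherever the first one was attached.
            out += [(h, r, second) for (h, r, d) in arcs
                    if d == first and not r.startswith("conj")]
    return out

graph = propagate_conjuncts(collapse_prepositions(arcs))
# graph now contains (based, prep_in, LA), (distributes, nsubj, Bell),
# (distributes, dobj, products) and (products, amod, computer).
```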
Let me show you one graph that gives an indication of this. The graph was originally put together by Jari Björne et al., the team that won the BioNLP 2009 shared task on relation extraction using Stanford dependencies as their representational substrate. What they wanted to illustrate with it is how much more effective dependency structures are at linking up the words you want to extract in a relation than simply looking at the linear context.

The graph plots distance, measured either by counting words to the left or right or by counting the number of dependency arcs you have to follow, against the percentage of cases at each distance. What you see is that if you just look at linear distance, there are lots of arguments of relations that are four, five, six, seven, or eight words away from the word they need to be connected to. In fact, there's a pretty large residue, well over ten percent, where the linear distance is greater than ten words.

If, on the other hand, you relate the arguments of relations by looking at dependency distance, you discover that the vast majority of the arguments are very close neighbors: about 47 percent of them are direct dependencies, and another 30 percent are at distance two. Take those together and that's more than three quarters of the dependencies you want to find, and the numbers trail away quickly after that. There are virtually no arguments of relations that aren't fairly close together in dependency distance, and it's precisely for this reason that you can get a lot of mileage in relation extraction from a representation like dependency syntax.

Okay, I hope that's given you some idea of why knowing about syntax is useful when you want to do various semantic tasks in natural language processing.
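To ground that distance comparison, here is one more small sketch, again over a hand-written parse of the KaiC sentence, that computes both measures for a single argument pair. The parse is an assumption; the point is just the two ways of counting.

```python
from collections import deque

sentence = ("The results demonstrated that KaiC interacts "
            "rhythmically with SasA").split()

# Hand-written head-dependent pairs for the same sentence, treated as
# undirected edges for path finding (basic, uncollapsed dependencies).
edges = [
    ("results", "The"), ("demonstrated", "results"),
    ("demonstrated", "interacts"), ("interacts", "that"),
    ("interacts", "KaiC"), ("interacts", "rhythmically"),
    ("interacts", "with"), ("with", "SasA"),
]

def linear_distance(w1, w2):
    """Distance counted in words to the left or right."""
    return abs(sentence.index(w1) - sentence.index(w2))

def dependency_distance(w1, w2):
    """Breadth-first search for the shortest arc path between two words."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)
    queue, seen = deque([(w1, 0)]), {w1}
    while queue:
        word, dist = queue.popleft()
        if word == w2:
            return dist
        for nxt in neighbors[word]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))

print(linear_distance("interacts", "SasA"))      # 3 words apart
print(dependency_distance("interacts", "SasA"))  # 2 arcs; 1 after collapsing
```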