In this segment I'm going to show you that dependency syntax is a very natural representation for relation extraction applications.

One domain in which a lot of work has been done on relation extraction is biomedical text. Here, for example, we have the sentence "The results demonstrated that KaiC interacts rhythmically with SasA, KaiA, and KaiB." What we'd like to get out of that is a protein interaction event: the word "interacts" indicates the relation, and the proteins named around it are the participants.

The point is that if we have a dependency analysis of this sentence, it's very easy, starting from "interacts", to follow the arc to its subject and the arc through the preposition "with", and so to read off the relation we'd like to extract. And if we're just a little bit cleverer, we can also follow the conjunction relations and see that KaiC interacts with the other two proteins as well.
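To make that traversal concrete, here is a minimal sketch in Python. The parse is hand-written, with arc labels in the style of collapsed Stanford dependencies (nsubj, prep_with, conj_and); what an actual parser outputs for this sentence could differ.

```python
# Typed dependency arcs for "The results demonstrated that KaiC
# interacts rhythmically with SasA, KaiA, and KaiB", written by hand
# in collapsed Stanford dependencies style: (head, relation, dependent).
arcs = [
    ("demonstrated", "ccomp", "interacts"),
    ("interacts", "nsubj", "KaiC"),
    ("interacts", "advmod", "rhythmically"),
    ("interacts", "prep_with", "SasA"),
    ("SasA", "conj_and", "KaiA"),
    ("SasA", "conj_and", "KaiB"),
]

def dependents(head, relation):
    """All dependents of `head` reachable over arcs labeled `relation`."""
    return [d for (h, r, d) in arcs if h == head and r == relation]

def extract_interactions(trigger="interacts"):
    """Follow the nsubj and prep_with arcs out of the trigger word,
    then follow conj_and arcs to pick up coordinated proteins."""
    subjects = dependents(trigger, "nsubj")
    partners = dependents(trigger, "prep_with")
    for p in list(partners):                  # expand "SasA, KaiA, and KaiB"
        partners.extend(dependents(p, "conj_and"))
    return [(s, trigger, p) for s in subjects for p in partners]

print(extract_interactions())
# [('KaiC', 'interacts', 'SasA'), ('KaiC', 'interacts', 'KaiA'),
#  ('KaiC', 'interacts', 'KaiB')]
```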
That's something that a lot of people have worked on. In particular, one representation that has been widely used for relation extraction applications in biomedicine is the Stanford dependencies representation.

The basic form of this representation is a projective dependency tree, and it was designed that way so it could be easily generated by post-processing phrase structure trees. If you have a notion of headedness in the phrase structure tree, the Stanford dependencies software provides a set of pattern-matching rules that type the dependency relations and give you a Stanford dependency tree as output. But Stanford dependencies can also be, and now increasingly are, generated directly by dependency parsers such as the MaltParser that we looked at recently.

So this is roughly what the representation looks like: just as we saw before, the words are connected by typed dependency arcs.

Something that has been explored in the Stanford dependencies framework is to start from that basic representation and make some changes to it to facilitate relation extraction applications. The idea is to emphasize the relationships between content words, since those are what relation extraction needs. Let me give a couple of examples.

One example: commonly you'll have a content word like "based", and the place where the company is based, Los Angeles, is separated from it by the preposition "in", a function word. You can think of such function words as playing the role that case markers play in many other languages. So it seems more useful to connect "based" and "LA" directly, introducing the relation "prep_in". That's what we do, and it simplifies the structure.

There are other places, too, where we can do a better job of representing the semantics with some modifications of the graph structure, and a particular case is coordination. In the example sentence "Bell, based in Los Angeles, makes and distributes electronic, computer and building products", we very directly get that Bell makes products. But we'd also like to get out that Bell distributes products, and one way to do that is to recognize the "and" relationship and say: okay, that means "Bell" should also be the subject of "distributes", and what they distribute is "products". Similarly, further down, we can recognize that the products are computer products as well as electronic products. So we make those changes to the graph and get a reduced graph representation.

Now, once you do this, some things are no longer as simple. In particular, the resulting structure is no longer a dependency tree, because some nodes have multiple arcs pointing at them. But on the other hand, the relations we'd like to extract are represented much more directly.
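Here is a minimal sketch of those two transformations, collapsing a preposition into a typed arc and propagating arcs across conjuncts, run on a hand-written basic-dependency parse of part of the Bell sentence. The parse and its labels (partmod, pobj, conj_and, and so on) are assumptions about what a parser would produce, not the exact Stanford conversion rules.

```python
# Hand-written basic dependencies for (part of) "Bell, based in LA,
# makes and distributes electronic, computer ... products".
arcs = [
    ("makes", "nsubj", "Bell"),
    ("Bell", "partmod", "based"),
    ("based", "prep", "in"),
    ("in", "pobj", "LA"),
    ("makes", "conj_and", "distributes"),
    ("makes", "dobj", "products"),
    ("products", "amod", "electronic"),
    ("electronic", "conj_and", "computer"),
]

def collapse_prepositions(arcs):
    """Replace head -prep-> in -pobj-> obj with one head -prep_in-> obj arc."""
    out = [a for a in arcs if a[1] not in ("prep", "pobj")]
    for head, rel, prep in arcs:
        if rel == "prep":
            out += [(head, "prep_" + prep, obj)
                    for (h, r, obj) in arcs if h == prep and r == "pobj"]
    return out

def propagate_conjuncts(arcs):
    """Copy the first conjunct's arcs onto the words conjoined with it,
    so "distributes" also gets Bell as subject and products as object."""
    out = list(arcs)
    for first, rel, second in arcs:
        if rel.startswith("conj"):
            # The second conjunct inherits the first one's dependents...
            out += [(second, r, d) for (h, r, d) in arcs
                    if h == first and not r.startswith("conj")]
            # ...and is attached wherever the first one was attached.
            out += [(h, r, second) for (h, r, d) in arcs
                    if d == first and not r.startswith("conj")]
    return out

graph = propagate_conjuncts(collapse_prepositions(arcs))
# graph now contains (based, prep_in, LA), (distributes, nsubj, Bell),
# (distributes, dobj, products) and (products, amod, computer).
```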
Let me show you one graph that gives an indication of this. The graph was originally put together by Jari Björne et al., the team that won the BioNLP 2009 shared task on relation extraction using Stanford dependencies as their representational substrate. What they wanted to illustrate with it is how much more effective dependency structures are at linking up the words you want to extract in a relation than simply looking at the linear context.

The graph plots distance, measured either by counting words to the left or right or by counting the number of dependency arcs you have to follow, against the percentage of cases at each distance. What you see is that if you just look at linear distance, there are lots of arguments of relations that are four, five, six, seven, or eight words away from the word they need to be connected to. In fact, there's a pretty large residue, well over ten percent, where the linear distance is greater than ten words.

If, on the other hand, you relate the arguments of relations by looking at dependency distance, you discover that the vast majority of the arguments are very close neighbors: about 47 percent of them are direct dependencies, and another 30 percent are at distance two. Take those together and that's more than three quarters of the dependencies you want to find, and the numbers trail away quickly after that. There are virtually no arguments of relations that aren't fairly close together in dependency distance, and it's precisely for this reason that you can get a lot of mileage in relation extraction from a representation like dependency syntax.

Okay, I hope that's given you some idea of why knowing about syntax is useful when you want to do various semantic tasks in natural language processing.
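To ground that distance comparison, here is one more small sketch, again over a hand-written parse of the KaiC sentence, that computes both measures for a single argument pair. The parse is an assumption; the point is just the two ways of counting.

```python
from collections import deque

sentence = ("The results demonstrated that KaiC interacts "
            "rhythmically with SasA").split()

# Hand-written head-dependent pairs for the same sentence, treated as
# undirected edges for path finding (basic, uncollapsed dependencies).
edges = [
    ("results", "The"), ("demonstrated", "results"),
    ("demonstrated", "interacts"), ("interacts", "that"),
    ("interacts", "KaiC"), ("interacts", "rhythmically"),
    ("interacts", "with"), ("with", "SasA"),
]

def linear_distance(w1, w2):
    """Distance counted in words to the left or right."""
    return abs(sentence.index(w1) - sentence.index(w2))

def dependency_distance(w1, w2):
    """Breadth-first search for the shortest arc path between two words."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)
    queue, seen = deque([(w1, 0)]), {w1}
    while queue:
        word, dist = queue.popleft()
        if word == w2:
            return dist
        for nxt in neighbors[word]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))

print(linear_distance("interacts", "SasA"))      # 3 words apart
print(dependency_distance("interacts", "SasA"))  # 2 arcs; 1 after collapsing
```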