
Log-Linear Models (22:08)

  • 0:00 - 0:05
    Local structure that doesn’t require full table representations
  • 0:05 - 0:09
    is important in both directed and undirected models.
  • 0:09 - 0:15
    How do we incorporate local structure into undirected models?
  • 0:15 - 0:22
    The framework for that is called “log-linear models” for reasons that will be clear in just a moment.
  • 0:23 - 0:24
    So
  • 0:27 - 0:33
    Whereas, in the original representation of the unnormalized density
  • 0:33 - 0:39
    we defined P tilde as the product of factors φi(Di),
  • 0:39 - 0:42
    each [of] which is potentially a full table.
  • 0:43 - 0:46
    Now we're going to shift that representation
  • 0:46 - 0:50
    to something that uses a linear form
  • 0:50 - 0:52
    (So here's a linear form)
  • 0:52 - 0:55
    that is subsequently exponentiated,
  • 0:55 - 0:57
    and that's why it's called log-linear—
  • 0:57 - 1:00
    because the logarithm is a linear function.
  • 1:00 - 1:03
    So what is this form over here?
  • 1:03 - 1:07
    It's a linear function that has these things that are called “coefficients”
  • 1:10 - 1:13
    and these things that are called “features”.
  • 1:14 - 1:22
    Features, like factors, each have a scope which is a set of variables on which the feature depends.
  • 1:24 - 1:27
    But different features can have the same scope.
  • 1:27 - 1:31
    You can have multiple features all of which are over the same set of variables.
  • 1:31 - 1:36
    Notice that each feature has just a single parameter wj that multiplies it.
  • 1:38 - 1:40
    So, what does this give rise to?
  • 1:40 - 1:43
    I mean if we have a log-linear model,
  • 1:43 - 1:49
we can push the exponent in through the summation,
  • 1:49 - 1:59
    and that gives us something that is a product of exponential functions.
  • 1:59 - 2:04
    You can think of each of these as effectively a little factor,
  • 2:04 - 2:08
    but it’s a factor that only has a single parameter wj.
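A minimal sketch of this idea in Python (the feature functions, weights, and variable names below are made up for illustration, not from the lecture): the unnormalized measure is the exponential of the weighted sum of features, or equivalently a product of little factors exp(w_j f_j).

```python
import math

def unnormalized_measure(assignment, features, weights):
    # P_tilde(x) = exp( sum_j w_j * f_j(x) ), which is the same thing as a
    # product of little factors exp(w_j * f_j(x)), one per feature.
    total = sum(w * f(assignment) for w, f in zip(weights, features))
    return math.exp(total)

# Hypothetical features over two binary variables X1 and X2.
features = [
    lambda x: 1.0 if x["X1"] == 1 and x["X2"] == 1 else 0.0,  # an indicator feature
    lambda x: float(x["X1"]),                                  # a feature with scope {X1}
]
weights = [0.5, -1.2]  # one coefficient w_j per feature

print(unnormalized_measure({"X1": 1, "X2": 1}, features, weights))  # exp(0.5 - 1.2)
```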
  • 2:08 - 2:11
Since this is a little bit abstract, let’s look at an example.
  • 2:11 - 2:18
Specifically, let’s look at how we might represent a simple table factor as a log-linear model.
  • 2:18 - 2:24
So here’s a factor φ over two binary random variables X1 and X2.
  • 2:24 - 2:31
    And so a full table factor would have four parameters: a00, a01, a10, and a11.
  • 2:31 - 2:35
So we can capture this model using a log-linear model,
  • 2:35 - 2:38
using a set of features,
  • 2:38 - 2:41
namely these indicator functions.
  • 2:41 - 2:43
    So this is an indicator function.
  • 2:43 - 2:49
    It takes one if X1 is zero and X2 is zero,
  • 2:49 - 2:51
    and it takes zero otherwise.
  • 2:51 - 2:54
So this is the general notion of an indicator function.
  • 2:54 - 3:00
    It looks at the event—or constraint—inside the curly braces,
  • 3:00 - 3:05
    and it returns a value of 0 or 1, depending on
    whether that event is true or not.
  • 3:06 - 3:14
    And so, if we wanted to represent this factor as a log-linear model,
  • 3:14 - 3:20
We can see that we can simply sum over all four combinations of k and ℓ,
  • 3:20 - 3:23
    which are either 0 or 1, each of them.
  • 3:23 - 3:25
So we’re summing up over all four entries here.
  • 3:25 - 3:32
    And we have a parameter—or coefficient—w_kℓ which multiplies this feature.
  • 3:33 - 3:44
    And so, we would have a summation of w_kℓ:
  • 3:44 - 3:49
    of w00 only in the case where X1 is zero and X2 is zero.
  • 3:49 - 4:00
    So we would have exp of negative w00 when X1=0 and X2=0,
  • 4:01 - 4:12
    and we would have exp of negative w01 when
    X1=0 and X2=1, and so on and so forth.
  • 4:12 - 4:16
    And it’s not difficult to convince ourselves that
  • 4:16 - 4:25
    if we define w_kℓ to be the negative log of the corresponding entries in this table,
  • 4:25 - 4:29
    then that gives us right back the factor that
    we defined to begin with.
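As a small sketch of that construction (the table entries a_kℓ below are made-up numbers): setting w_kℓ = −log a_kℓ and exponentiating the negated weighted sum of indicators recovers the original table exactly.

```python
import math

# A hypothetical full table factor over two binary variables: phi(X1, X2) = a[(X1, X2)].
a = {(0, 0): 2.0, (0, 1): 0.5, (1, 0): 1.0, (1, 1): 4.0}

# Log-linear version: one indicator feature per entry, with coefficient w_kl = -log(a_kl).
w = {kl: -math.log(v) for kl, v in a.items()}

def phi_loglinear(x1, x2):
    # exp( - sum_kl w_kl * 1{X1 = k, X2 = l} ); exactly one indicator fires.
    s = sum(w_kl * (1 if (x1, x2) == kl else 0) for kl, w_kl in w.items())
    return math.exp(-s)

for (k, l), value in a.items():
    assert abs(phi_loglinear(k, l) - value) < 1e-9  # recovers the original table
```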
  • 4:29 - 4:33
    So this shows that this is a general representation,
  • 4:34 - 4:37
    in the sense that we can take any factor
  • 4:38 - 4:41
    and represent it as a log-linear model
  • 4:41 - 4:47
    simply by including all of the appropriate features.
  • 4:49 - 4:52
    But we don’t generally want to do that.
  • 4:52 - 4:56
Generally we want a much finer-grained set of features.
  • 4:56 - 5:01
    So let’s look at some of the examples of features that people use in practice.
  • 5:01 - 5:04
    So here are the features used in a language model.
  • 5:04 - 5:08
This is a language model that we discussed previously.
  • 5:08 - 5:12
    And here we have features that relate:
  • 5:12 - 5:16
    First of all, let’s just remind ourselves [that] we have two sets of variables.
  • 5:16 - 5:24
    We have the variables Y which represent the annotations for each word
  • 5:24 - 5:30
in the sequence, indicating what category each word belongs to.
  • 5:30 - 5:32
    So this is a person.
  • 5:32 - 5:35
    This is the beginning of a person name.
  • 5:35 - 5:38
    This is the continuation of a person name.
  • 5:38 - 5:39
    The beginning of a location.
  • 5:39 - 5:42
    The continuation of a location, and so on.
  • 5:42 - 5:46
    As well as a bunch of words that are not:
  • 5:46 - 5:48
    [i.e.,] none of person, location, organization.
  • 5:48 - 5:51
    And they’re all labeled “other”.
  • 5:51 - 5:55
    And so the value Y tells us for each word what
    category it belongs to,
  • 5:55 - 5:59
    so that we’re trying to identify people, locations, and
    organizations in the sentence.
  • 5:59 - 6:02
    We have another set of variables X,
  • 6:03 - 6:09
    which are the actual words in the sentence.
  • 6:09 - 6:13
    Now we can go ahead and define…
  • 6:13 - 6:16
    We can use a full table representation that
  • 6:16 - 6:23
basically relates each and every Y to its word with a full factor
  • 6:23 - 6:28
that looks at every possible word in the English language;
  • 6:28 - 6:32
but those are going to be very, very expensive,
  • 6:32 - 6:34
and require a very large number of parameters.
  • 6:34 - 6:40
    And so we're going to define a feature that looks, for example, at f of
  • 6:40 - 6:45
    say a particular Y_i, which is the label for the i’th word in the sentence,
  • 6:45 - 6:48
    and X_i, being that i’th word.
  • 6:48 - 6:55
    And that feature says, for example: Y_i equals person.
  • 6:56 - 7:04
    It’s the indicator function for “Y_i = person and X_i is capitalized”.
  • 7:06 - 7:10
    And so that feature doesn’t look at the individual words.
  • 7:10 - 7:13
    It just looks at whether that word is capitalized.
  • 7:13 - 7:17
Now we have a single parameter that looks just at capitalization,
  • 7:17 - 7:23
and parameterizes how important capitalization is for recognizing that something is a person.
  • 7:23 - 7:27
    We could also have another feature.
  • 7:27 - 7:29
    This is an alternative:
  • 7:29 - 7:33
This is a different feature that could be part of the same model
  • 7:33 - 7:38
    that says: Y_i is equal to location,
  • 7:38 - 7:41
Or, actually, I was a little bit imprecise here:
  • 7:41 - 7:45
    This might be beginning of person. This might be beginning of location.
  • 7:45 - 7:51
    And X_i appears in some atlas.
  • 7:52 - 7:55
Now, there are things other than locations that appear in an atlas,
  • 7:55 - 7:59
    but if a word appears in the atlas,
  • 7:59 - 8:02
    there is a much higher probability presumably that it’s actually a location
  • 8:02 - 8:06
    and so we might have, again, [a] weight for this feature
  • 8:06 - 8:14
that increases the probability of Y_i being labeled in this way.
  • 8:14 - 8:19
And so you can imagine constructing a very rich set of features,
  • 8:19 - 8:24
    all of which look at certain aspects of the word,
  • 8:24 - 8:32
rather than enumerating all the possible words
and giving a parameter to each one of them.
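A rough sketch of what such indicator features might look like in code; the label names B-PER and B-LOC and the tiny atlas word list are assumptions for illustration, not from the lecture.

```python
# A tiny, assumed atlas word list; a real system would use a gazetteer.
ATLAS = {"paris", "london", "springfield"}

def f_person_capitalized(y_i, x_i):
    # 1 if the label is (beginning of) person and the word is capitalized.
    return 1.0 if y_i == "B-PER" and x_i[:1].isupper() else 0.0

def f_location_in_atlas(y_i, x_i):
    # 1 if the label is (beginning of) location and the word appears in the atlas.
    return 1.0 if y_i == "B-LOC" and x_i.lower() in ATLAS else 0.0

print(f_person_capitalized("B-PER", "Daphne"))  # 1.0
print(f_location_in_atlas("B-LOC", "Paris"))    # 1.0
print(f_location_in_atlas("B-LOC", "walked"))   # 0.0
```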
  • 8:34 - 8:38
    Let’s look at some other examples of feature-based models.
  • 8:38 - 8:41
    So this is an example from statistical physics.
  • 8:41 - 8:43
    It’s called the Ising model.
  • 8:43 - 8:49
    And the Ising model is something that looks at pairs
    of variables.
  • 8:49 - 8:52
    It’s a pairwise Markov network.
  • 8:53 - 8:56
And it looks at pairs of adjacent variables,
  • 8:56 - 8:59
    and basically gives us a coefficient for their products.
  • 8:59 - 9:04
So now, this is a case where the variables are binary,
  • 9:04 - 9:08
but not in the space {0, 1}, but rather
in {-1, +1}.
  • 9:08 - 9:10
    And so now, we have a model that's parametrized
  • 9:10 - 9:14
by features that are just the products of the values of adjacent variables.
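A minimal sketch of the Ising parameterization on a single edge; the spin indices and the coupling weight below are made up for illustration.

```python
import math

def ising_unnormalized(spins, weights, edges):
    # exp( sum over edges of w_ij * x_i * x_j ), for spins x_i in {-1, +1}.
    return math.exp(sum(weights[e] * spins[e[0]] * spins[e[1]] for e in edges))

edges = [(0, 1)]          # a single pair of adjacent variables
weights = {(0, 1): 0.8}   # a positive coupling favours aligned spins
print(ising_unnormalized({0: +1, 1: +1}, weights, edges))  # exp(+0.8), aligned
print(ising_unnormalized({0: +1, 1: -1}, weights, edges))  # exp(-0.8), opposed
```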
  • 9:14 - 9:16
    Where might this come up?
  • 9:16 - 9:23
    It comes up in the context, for example, of modeling the spins of electrons in a grid.
  • 9:24 - 9:29
    So here you have a case where the electrons can rotate
  • 9:29 - 9:32
either in one direction or in the other,
  • 9:32 - 9:41
    so here is a bunch of the atoms that are marked with a blue arrow.
  • 9:41 - 9:43
    You have one rotational axis,
  • 9:43 - 9:46
    and the red arrow[s] are rotating in the opposite direction.
  • 9:46 - 9:53
And this basically says we have a term that defines
  • 9:53 - 9:57
a probability distribution over the joint set of spins.
  • 9:57 - 10:02
    (So this is the joint spins.)
  • 10:02 - 10:08
And the model depends on whether adjacent
atoms have the same spin or opposite spins.
  • 10:08 - 10:12
    So notice that one times one is the same as negative one times negative one.
  • 10:12 - 10:17
    So this really just looks at whether they have the same spin
    or different spins.
  • 10:17 - 10:23
    And there is a parameter that looks at, you know, same or
    different.
  • 10:24 - 10:27
    That's what this feature represents.
  • 10:27 - 10:33
    And, depending on the value of this parameter over here,
  • 10:33 - 10:35
    if the parameter goes one way,
  • 10:35 - 10:40
    we're going to favor systems
  • 10:40 - 10:43
    where the atoms spin in the same direction.
  • 10:43 - 10:48
And if it goes the other way, you’re going to favor atoms that spin in different directions.
  • 10:48 - 10:51
    And those are called ferromagnetic and anti-ferromagnetic.
  • 10:53 - 10:59
    Furthermore, you can define in these systems the notion of a temperature.
  • 10:59 - 11:05
So the temperature here says how strong this connection is.
  • 11:05 - 11:13
    So notice that as T grows—as the temperature grows—the w_ij’s get divided by T.
  • 11:16 - 11:19
    And they all kind of go towards zero,
  • 11:21 - 11:25
    which means that the strength of the connection between
  • 11:25 - 11:28
    adjacent atoms, effectively becomes almost moot,
  • 11:28 - 11:31
    and they become almost decoupled from each other.
  • 11:31 - 11:36
    On the other hand, as the temperature decreases,
  • 11:38 - 11:43
the effect of the interaction between the atoms becomes much more significant,
  • 11:43 - 11:47
    and they’re going to impose much stronger constraints on each other.
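A small self-contained sketch of this temperature effect, dividing an assumed coupling weight by T: at high T the two configurations get almost the same weight (nearly decoupled), while at low T the aligned configuration dominates.

```python
import math

def ising_weight(spins, w, edges, T):
    # exp( sum_ij (w_ij / T) * x_i * x_j ): every coupling is divided by T.
    return math.exp(sum(w[e] / T * spins[e[0]] * spins[e[1]] for e in edges))

edges, w = [(0, 1)], {(0, 1): 1.0}
for T in (10.0, 0.1):
    aligned = ising_weight({0: +1, 1: +1}, w, edges, T)
    opposed = ising_weight({0: +1, 1: -1}, w, edges, T)
    # At high T the ratio is close to 1; at low T it is enormous.
    print(f"T={T}: aligned/opposed = {aligned / opposed:.3f}")
```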
  • 11:47 - 11:49
    And this is actually a model of a real physical system.
  • 11:49 - 11:52
    I mean, this is real temperature, and real atoms, and so on.
  • 11:52 - 11:57
    And sure enough, if you look at what happens to these models as a function of temperature,
  • 11:57 - 12:01
    what we see over here is high temperature.
  • 12:02 - 12:04
    This is high temperature
  • 12:04 - 12:10
    and you can see that there is a lot of mixing between the two types of spin
  • 12:10 - 12:12
    and this is low temperature
  • 12:13 - 12:20
and you can see that there are much stronger
constraints in this configuration
  • 12:20 - 12:23
    about the spins of adjacent atoms.
  • 12:27 - 12:34
    Another kind of feature that's used very much in lots of practical applications
  • 12:34 - 12:38
is the notion of a metric feature, used in a metric MRF.
  • 12:38 - 12:40
    So what's a metric feature?
  • 12:40 - 12:42
    This is something that comes up, mostly in cases
  • 12:42 - 12:48
where you have a bunch of random variables X_i that all take values in some joint label space V.
  • 12:48 - 12:50
    So, for example, they might all be binary.
  • 12:50 - 12:53
    They all might take values one, two, three, four.
  • 12:53 - 12:57
    And what we'd like to do is
  • 12:57 - 13:03
    we have X_i and X_j that are connected to each other by an edge.
  • 13:03 - 13:08
We want X_i and X_j to take “similar” values.
  • 13:08 - 13:12
    So in order to enforce the fact that X_i and X_j should take similar values
  • 13:12 - 13:14
    we need a notion of similarity.
  • 13:14 - 13:24
And we’re going to encode that using a distance function µ that takes two values, one for X_i and one for X_j,
  • 13:24 - 13:26
    [that] says how close are they to each other.
  • 13:26 - 13:29
    So what does the distance function need to be?
  • 13:29 - 13:36
Well, the distance function needs to satisfy the standard conditions on a distance function, or metric.
  • 13:36 - 13:39
    So first is reflexivity,
  • 13:39 - 13:43
    which means that if the two variables take on the same value,
  • 13:43 - 13:46
    then that distance better be zero.
  • 13:49 - 13:54
Oh, I forgot to say: this needs to be a non-negative function.
  • 13:54 - 14:01
Symmetry means that the distances are symmetrical.
  • 14:01 - 14:06
So the distance between two values v1 and v2 is the same as the distance between v2 and v1.
  • 14:06 - 14:12
    And finally is the triangle inequality, which says that the distance between v1 and v2
  • 14:12 - 14:14
    (So here is v1)
  • 14:14 - 14:15
    (Here is v2)
  • 14:15 - 14:21
and the distance between v1 and v2 is at most the distance going from v1 to v3
  • 14:22 - 14:25
and then from v3 to v2. So, the standard triangle inequality.
  • 14:27 - 14:33
If a distance satisfies just the first two conditions, it’s called a semi-metric.
  • 14:34 - 14:37
    Otherwise, if it satisfies all three, it's called a metric.
  • 14:38 - 14:41
    And both are actually used in practical applications.
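As a sketch, one could check these properties numerically for a small label space; the helper names and example label values below are made up for illustration.

```python
import itertools

def is_semimetric(labels, mu):
    # Non-negativity, mu(v, v) = 0 (reflexivity), and symmetry.
    pairs = list(itertools.product(labels, repeat=2))
    return (all(mu(a, b) >= 0 and mu(a, b) == mu(b, a) for a, b in pairs)
            and all(mu(v, v) == 0 for v in labels))

def is_metric(labels, mu):
    # A semi-metric that also satisfies the triangle inequality.
    return is_semimetric(labels, mu) and all(
        mu(a, b) <= mu(a, c) + mu(c, b)
        for a, b, c in itertools.product(labels, repeat=3))

labels = [0, 1, 2, 3]
print(is_metric(labels, lambda a, b: 0 if a == b else 1))  # 0/1 distance: True
print(is_metric(labels, lambda a, b: abs(a - b)))          # absolute difference: True
```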
  • 14:44 - 14:49
    But how do we take this distance feature and put it in the context of an MRF?
  • 14:49 - 14:56
    We have a feature that looks at two variables, X_i and X_j.
  • 14:56 - 14:59
    And that feature is the distance between X_i and X_j.
  • 14:59 - 15:09
    And now, we put it together by multiplying that with a coefficient, w_ij,
  • 15:09 - 15:13
    such that w_ij has to be greater than zero.
  • 15:13 - 15:16
We want the metric MRF
  • 15:17 - 15:21
    [to have] the effect that
  • 15:21 - 15:29
    the lower the distance, the higher this is,
  • 15:30 - 15:36
because of the negative coefficient, which means the higher the probability. Okay?
  • 15:37 - 15:43
    So, the more pairs you have that are close to each other
  • 15:43 - 15:47
and the closer they are to each other, the higher
the probability of the overall configuration.
  • 15:47 - 15:50
    Which is exactly what we wanted to have happen.
  • 15:53 - 15:58
    So, conversely, if you have values that are far from
    each other in the distance metric
  • 15:58 - 16:01
    the lower the probability in the model.
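A minimal sketch of a single metric-MRF factor, exp(−w_ij µ(x_i, x_j)) with w_ij > 0, using the simple 0/1 distance; the label strings and the weight value are made up.

```python
import math

def metric_factor(x_i, x_j, w_ij, mu):
    # phi_ij(x_i, x_j) = exp(-w_ij * mu(x_i, x_j)), with w_ij > 0:
    # smaller distance -> larger factor value -> higher probability.
    assert w_ij > 0
    return math.exp(-w_ij * mu(x_i, x_j))

mu01 = lambda a, b: 0.0 if a == b else 1.0       # the simple 0/1 distance
print(metric_factor("sky", "sky", 2.0, mu01))    # 1.0: no penalty for agreement
print(metric_factor("sky", "grass", 2.0, mu01))  # exp(-2) ~ 0.135: penalized
```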
  • 16:02 - 16:05
    So, here are some examples of metric MRF’s.
  • 16:05 - 16:08
So, first, the simplest possible metric
  • 16:08 - 16:12
    is one that gives [a] distance of zero when the two classes are equal to each other
  • 16:12 - 16:14
    and [a] distance of one everywhere else.
  • 16:14 - 16:16
    So now, you know, this is just like a step function.
  • 16:16 - 16:23
    And, this gives rise to a potential that looks like this.
  • 16:23 - 16:27
    So we have 0’s on the diagonal.
  • 16:27 - 16:34
    So we get a bump in the probability when the two adjacent variables take on the same label
  • 16:34 - 16:37
    and otherwise we get a reduction in the probability.
  • 16:37 - 16:41
    But it doesn’t matter what particular value they take.
  • 16:41 - 16:44
    That’s one example of a simple metric.
  • 16:44 - 16:52
    A somewhat more expressive example might come up when the values V are actually numerical values.
  • 16:52 - 16:58
In which case you can look at, say, the difference between the numerical values.
  • 16:58 - 17:01
    So, v_k minus v_l.
  • 17:01 - 17:05
And when v_k is equal to v_l, the distance is zero,
  • 17:05 - 17:14
and then you have a linear function that increases the
distance as the difference between v_k and v_l grows.
  • 17:14 - 17:19
    So, this is the absolute value of v_k minus v_l.
  • 17:20 - 17:26
    A more interesting notion that comes up a lot in
    practice is:
  • 17:26 - 17:32
we don’t want to arbitrarily penalize things that are far away from each other in label space.
  • 17:32 - 17:37
    So this is what is called a truncated linear penalty.
  • 17:38 - 17:42
    And you can see that beyond a certain threshold,
  • 17:44 - 17:49
    the penalty just becomes constant, so it plateaus.
  • 17:49 - 17:54
So there is a penalty, but it doesn’t keep increasing as the labels get further from each other.
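The three distances just described, written out as a small sketch; the truncation threshold value here is arbitrary.

```python
def zero_one(v_k, v_l):
    # 0/1 distance: any disagreement is penalized equally.
    return 0.0 if v_k == v_l else 1.0

def linear(v_k, v_l):
    # Absolute difference, for numerical label values.
    return abs(v_k - v_l)

def truncated_linear(v_k, v_l, threshold=2.0):
    # Linear up to a threshold, after which the penalty plateaus.
    return min(abs(v_k - v_l), threshold)

print([truncated_linear(0, d) for d in range(5)])  # grows linearly, then caps at 2
```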
  • 17:54 - 18:00
    One example where metric MRF’s are used is when we’re doing image segmentation.
  • 18:00 - 18:04
    And here we tend to favor segmentations where adjacent superpixels…
  • 18:07 - 18:09
    (These are adjacent superpixels.)
  • 18:10 - 18:13
    And we want them to take the same class.
  • 18:18 - 18:22
    And so here we have no penalty when the superpixels take the same class
  • 18:22 - 18:25
    and we have some penalty when they take different classes.
  • 18:25 - 18:31
    And this is actually a very common, albeit simple, model for
    image segmentation.
  • 18:32 - 18:37
    Let’s look at a different MRF, also in the context of
    computer vision.
  • 18:37 - 18:40
    This is an MRF that’s used for image denoising.
  • 18:40 - 18:45
    So here we have a noisy version of a real image that looks like this.
  • 18:45 - 18:51
So you can see this kind of white noise overlaid on top of the image.
  • 18:51 - 18:54
    And what we’d like to do, is we’d like to get a cleaned-up version of the image.
  • 18:54 - 19:01
So here we have a set of variables, X, that correspond to the noisy pixels.
  • 19:02 - 19:08
And we have a set of variables, Y, that correspond to the cleaned pixels.
  • 19:08 - 19:15
    And we'd like to have a probabilistic model that relates X and Y.
  • 19:15 - 19:21
And intuitively,
  • 19:21 - 19:25
you’d like to have two effects on the pixels Y:
  • 19:25 - 19:33
    First, you'd like Y_i to be close to X_i.
  • 19:33 - 19:36
    But if you just do that, then you're just going to stick with
    the original image.
  • 19:36 - 19:41
So the main constraint that we can impose on the image in order to clean it up
  • 19:41 - 19:47
    is the fact that adjacent pixels tend to have the same value.
  • 19:47 - 19:53
So in this case, what we’re going to do is constrain the image:
  • 19:53 - 20:00
we’re going to constrain the Y_i’s to try and make each Y_i close to its neighbors.
  • 20:03 - 20:06
    And the further away it is, the bigger the penalty.
  • 20:06 - 20:08
    And that's a metric MRF.
  • 20:12 - 20:17
    Now we could use just a linear penalty,
  • 20:17 - 20:22
    but that’s going to be a very fragile model,
  • 20:22 - 20:28
because obviously the right answer isn’t the model
where all pixels are equal to each other
  • 20:28 - 20:34
    in their actual intensity value because that would be just a single, you know, grayish-looking image.
  • 20:34 - 20:40
So what you would like is to let one pixel depart from its adjacent pixel
  • 20:40 - 20:44
if it’s being pulled in a different direction, either by its own observation or by other adjacent pixels.
  • 20:44 - 20:49
    And so the right model to use here is actually the truncated linear model
  • 20:49 - 20:52
    and that one is [the] one that’s commonly used
  • 20:52 - 20:55
    and is very successful for doing image denoising.
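A rough sketch of such a denoising score on a tiny 1-D strip of pixels; the weights w_obs and w_smooth and the truncation level are hypothetical parameters, not values from the lecture.

```python
def denoising_log_score(Y, X, neighbors, w_obs=1.0, w_smooth=1.0, trunc=10.0):
    # Data term: pull each cleaned pixel Y[i] toward its noisy observation X[i].
    data = -w_obs * sum(abs(y - x) for y, x in zip(Y, X))
    # Smoothness term: truncated-linear penalty between neighboring cleaned pixels.
    smooth = -w_smooth * sum(min(abs(Y[i] - Y[j]), trunc) for i, j in neighbors)
    return data + smooth  # the unnormalized probability is exp(data + smooth)

Y = [100, 102, 180]           # candidate cleaned pixel intensities
X = [98, 160, 182]            # noisy observations
neighbors = [(0, 1), (1, 2)]  # adjacent pixels in a 1-D strip
print(denoising_log_score(Y, X, neighbors))
```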
  • 20:55 - 21:02
Interestingly, almost exactly the same idea is used in the context of stereo reconstruction.
  • 21:02 - 21:06
    There, the values that you’d like to infer, the Y_i’s,
  • 21:06 - 21:12
    are the depth disparity for a given pixel in the image—how deep it is.
  • 21:12 - 21:14
And here also we have spatial continuity.
  • 21:14 - 21:19
    We like the depth of one pixel to be close to the depth of an adjacent pixel.
  • 21:19 - 21:22
    But once again we don’t want to enforce this too strongly
  • 21:22 - 21:25
    because you do have depth disparity in the image
  • 21:25 - 21:28
    and so eventually you'd like things to be allowed to break away from each other.
  • 21:28 - 21:33
    And so once again, one typically uses some kind of truncated linear model
  • 21:33 - 21:36
for doing this stereo reconstruction,
  • 21:36 - 21:38
    often augmented by other little tricks.
  • 21:38 - 21:45
    So, for example, here we have the actual pixel appearance,
  • 21:45 - 21:48
such as the color and texture.
  • 21:48 - 21:50
    And if the color and texture are very similar to each other,
  • 21:50 - 21:55
you might want to have a stronger constraint on similarity.
  • 21:55 - 22:00
    Versus: if the color and texture of the adjacent pixels are
    very different from each other,
  • 22:00 - 22:03
    they may be more likely to belong to different objects
  • 22:03 - 22:08
    and you don’t want to enforce quite as strong of a similarity constraint.
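One way this appearance trick is sometimes sketched is to make the pairwise coupling depend on how similar the adjacent pixels look; the Gaussian form and the parameter values below are assumptions for illustration, not the lecture’s specific recipe.

```python
import math

def smoothness_weight(color_i, color_j, w_max=2.0, sigma=25.0):
    # Contrast-sensitive coupling: similar appearance -> strong similarity
    # constraint; very different appearance -> the constraint nearly vanishes.
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(color_i, color_j)))
    return w_max * math.exp(-diff / sigma)

print(smoothness_weight((120, 130, 140), (121, 131, 139)))  # similar colors: near w_max
print(smoothness_weight((120, 130, 140), (20, 200, 40)))    # very different: near 0
```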