Strip Snipping - Programming Languages

Showing Revision 2 created 05/25/2016 by Udacity Robot.

  1. So you may have noticed a bit of redundancy
  2. in our handling of "quoted strings".
  3. We return the entire matched text,
  4. which includes these double quotes at the end.
  5. But, in some sense, they're not so much part of the meaning
  6. as they are beginning and ending markers that tell us where the string starts and stops.
  7. This is our default token value,
  8. but we might want to take a small pair of scissors to this string,
  9. and snip off the quotes at the beginning--and at the end.
  10. Here we have an example of a token definition
  11. that does just that.
  12. After matching the right kind of string,
  13. we take the token value--
  14. the entire thing--
  15. and we're going to use substring selection,
  16. starting at character 1--
  17. this is going to be character 1--
  18. and going up to, but not including,
  19. character negative 1.
  20. Now if you haven't seen this trick before in Python, this might surprise you a bit,
  21. but you can count back from the end of the string,
  22. using negative numbers.
  23. So this is actually the negative first character.
  24. And remember that substring selection
  25. starts at 1 and goes up to, but not including, the negative 1.
  26. So this is going to get everything from the "q"
  27. over to the "s" in strings--
  28. or in other words, have exactly the effect that we wanted.
  29. Cute little trick, huh?
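The quote-snipping trick above can be sketched in a few lines of plain Python. The string value here is just an illustration, not taken from the course's code:

```python
# A token value for a quoted string includes the surrounding quotes.
s = '"quoted strings"'

# Slice from index 1 up to, but not including, index -1
# (the last character), dropping both double quotes.
snipped = s[1:-1]

print(snipped)  # quoted strings
```

Negative indices count back from the end of the string, so this works no matter how long the quoted text is.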
  30. So now I'm going to show you how to make
  31. a lexical analyzer--which, recall--
  32. is just a bunch of token definitions put together.
  33. I'm going to write it out in Python
  34. and we'll follow along.
  35. This top line--the import statement--is a lot like import re.
  36. It's telling Python where to find our lexical analyzer software
  37. or libraries that we're going to build upon.
  38. Just like the regular expression library is called re to save space,
  39. the lexical analyzer library is just called lex--to save space.
  40. And now I'm going to give a list of all of the tokens that I care about.
  41. Here, I'm just going to be concerned with the 6 that we've previously spoken about:
  42. the Left Angle bracket; the Left Angle bracket, followed by a slash; the Right Angle bracket--
  43. these 3 make tags--
  44. an Equal sign, Strings that are surrounded by quotes,
  45. and every other word.
  46. I'm also going to use a little shortcut.
  47. Before, we used a Whitespace token,
  48. but if you like, you can write the word t_ignore instead
  49. and, implicitly, we'll ignore everything matching this regular expression.
  50. Here's my first token definition rule.
  51. It's for LANGLESLASH.
  52. Here's the regular expression that it matches.
  53. We return the text, unchanged.
  54. Here's another rule for the Left Angle bracket,
  55. the regular expression that it matches--and we return the text, unchanged.
  56. And you'll note that I have the LANGLESLASH rule ahead--
  57. before it--in the file.
  58. And that's because I want this one to win, on ties.
  59. If I see a Left Angle, followed by a slash,
  60. I want it to be the LANGLESLASH token--
  61. and not the Left Angle, followed by--say--a WORD token.
  62. More on that in just a bit; I'll test that out and show it to you.
  63. Here's our rule for the Right Angle bracket.
  64. Here's our rule for the Equal sign token.
  65. Note that while these are long--
  66. they take up a bit of space--they're not actually particularly complicated.
  67. This has mostly been listing 5 regular expressions.
  68. Here's one now.
  69. This one is a little bit more complicated--here's our rule for STRING tokens.
  70. Here's our regular expression that matches it.
  71. And there I am, dropping off--shaving off--
  72. the surrounding double quotes,
  73. just as you've seen before.
  74. Finally, there's our definition for the WORD token.
  75. And now what we want to do is use
  76. these regular expressions, together--these token definitions--
  77. to break up a Web page.
  78. So here, I'll make a variable that holds the text of a hypothetical Web page.
  79. "This is my webpage!"
  80. Let's make it more exciting; Ho, ho--this is at least 10 percent more exciting!
  81. This function call tells our lexical analysis library
  82. that we want to use all of the token definitions above
  83. to make a lexical analyzer, and break up strings.
  84. This function call tells it which string to break up.
  85. I want to break up this Web page:
  86. "This is my webpage!"
  87. Now, recall that the output of a lexical analyzer
  88. is a list of tokens.
  89. I want to print out every element of that list.
  90. This call, .token, returns the next token that's available.
  91. If there are no more tokens,
  92. then we're going to break out of this loop.
  93. Otherwise, we print out the token.
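The lexer described above uses the PLY library (ply.lex). As a rough sketch of what it does, here is a hand-rolled approximation in plain Python using only the re module; the rule names mirror the course's tokens, but the helper function `lex` and the exact WORD pattern are assumptions, not the course's code:

```python
import re

# Token rules, tried in order at each position: first match wins.
# LANGLESLASH is listed before LANGLE so that it wins on ties.
token_rules = [
    ('LANGLESLASH', r'</'),
    ('LANGLE',      r'<'),
    ('RANGLE',      r'>'),
    ('EQUAL',       r'='),
    ('STRING',      r'"[^"]*"'),
    ('WORD',        r'[^ <>\n]+'),
]

t_ignore = ' '  # characters to skip silently, like PLY's t_ignore

def lex(text):
    """Break up text into a list of (token_name, value) pairs."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos] in t_ignore:
            pos += 1
            continue
        for name, pattern in token_rules:
            m = re.match(pattern, text[pos:])
            if m:
                value = m.group(0)
                if name == 'STRING':
                    value = value[1:-1]  # snip the surrounding quotes
                tokens.append((name, value))
                pos += len(m.group(0))
                break
        else:
            pos += 1  # no rule matched; skip the character
    return tokens

webpage = 'This is <b>my</b> webpage!'
for tok in lex(webpage):
    print(tok)
```

Real PLY builds one master regular expression from all the `t_...` definitions and handles line numbers and error reporting; this loop only captures the ordering and quote-snipping behavior discussed in the lecture.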
  94. Well, let's go see what sort of output we get.
  95. The odds of me having written this, making no mistakes
  96. the first time, from scratch, are about zero.
  97. Let's go see what happens.
  98. Oh! I actually don't really believe it!
  99. We can see the output here at the bottom:
  100. LexToken(WORD,'T') LexToken(WORD,'h') LexToken(WORD,'i') LexToken(WORD,'s')--
  101. but it's not quite the output I was expecting.
  102. Oh, here's the mistake that I made--
  103. right now, I only have one character in t_WORD
  104. and if you look down here, instead of seeing
  105. the word, "This"--for "This is my webpage!"--
  106. I have each letter spelled out separately.
  107. Let me fix that.
  108. And now we get more of the output that we were expecting.
  109. Our first token is ' This ';
  110. our next token is a word, ' is '.
  111. Then we saw the Left Angle bracket,
  112. a word, ' b '--for bold,
  113. the Right Angle bracket; a word, ' my ';
  114. the LANGLESLASH,
  115. and then the word, ' webpage '.