English subtitles

21-40 Software Engineering

Showing Revision 1 created 11/28/2012 by Amara Bot.

  1. Now, let me back up just for a minute and talk about software engineering in general
  2. rather than talking about specific AI techniques.
  3. What I'm showing here is a small excerpt from the spelling correction code
  4. from a project called ht://Dig, which is an open-source search engine. It's a great search engine.
  5. If you ever have need of one, you might want to check it out.
  6. All the code is very straightforward and easy to deal with.
  7. It has several thousand lines of code dealing with spelling correction.
  8. Here we see a little bit of code.
  9. It has the good idea of saying one word might be misspelled for another if they sound alike,
  10. and so let's go through each word and figure out what each letter is sounding like
  11. and see if there are other words that sound similar.
  12. So for example, here it's saying what does a "c" sound like.
  13. Well, "c" is ambiguous in English.
  14. It has this "x" sound, the "ch" sound, this "s" or "k" sound,
  15. and there's all these possibilities about how it can have one sound or another.
  16. Now imagine you're in charge of maintaining this program.
  17. In order for you to make sure that it's right you have to do several things.
  18. First, you could look at this comment and say, well, does this comment
  19. accurately reflect the rules for English pronunciation?
  20. Here, it's talking about pronouncing a "c" as an "s" in the context of an "i," "e," or "y."
  21. What about the other vowels--"a" and "o"?
  22. Were they left out by accident or is this correct?
  23. So you'd have to do some work to check that out.
  24. Then you'd have to do more work to say, if this comment is correct,
  25. is the comment correctly implemented in this code here?
  26. In fact, just this one page of code, dealing with only a couple of letters,
  27. is about the same length as all the code that we use to implement the probabilistic model.
  28. But I think the most important difficulty in maintaining code like this
  29. is that it's so specific to the English language.
  30. Imagine you're in charge of maintaining it, and your boss or professor comes to you and says,
  31. "Great job. Now I'd like you to make this work for
  32. German and French and Azerbaijani and 50 other languages."
  33. You'd have to go through and understand the pronunciation rules in each of those languages
  34. and edit a version of this code for each particular language.
  35. That would be quite tedious.
  36. But if you were dealing with a probabilistic model
  37. and you were asked to work in another language,
  38. all you would have to do is go out and collect a large corpus of words in that language.
  39. Then you'd have the probability of the individual words.
  40. And then find a corpus of spelling errors.
  41. Then you'd have the probability of the spelling edits.
  42. And so gathering that data is a much faster, much easier software engineering process
  43. than writing this code by hand.
  44. In a sense, you could say that machine learning over probabilistic models
  45. is the ultimate in agile programming.
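
To make the contrast with hand-written pronunciation rules concrete, here is a minimal sketch of the data-driven approach described above. It is not the code shown in the lecture. The function names (`train`, `edits1`, `correct`) are hypothetical, and it makes one loud simplifying assumption: every single-character edit is treated as equally likely, whereas the approach described in the lecture would estimate edit probabilities from a corpus of spelling errors. The alphabet is read off the training corpus, so nothing in the code is specific to English.

```python
import re
from collections import Counter


def words(text):
    """Lowercase the text and split it into runs of word characters (no digits or underscores)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def train(corpus_text):
    """Count word frequencies; relative counts stand in for P(word)."""
    return Counter(words(corpus_text))


def edits1(word, alphabet):
    """Return every string one delete, transpose, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in alphabet]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)


def correct(word, counts):
    """Pick the most frequent known word within one edit of word.

    Assumption: all single edits are equally likely, so only the word
    frequency decides between candidates. A fuller model would weight
    each edit by a probability learned from observed spelling errors.
    """
    if word in counts:
        return word
    # The alphabet comes from the corpus, not from any language-specific table.
    alphabet = sorted(set("".join(counts)))
    candidates = sorted(w for w in edits1(word, alphabet) if w in counts)
    if not candidates:
        return word  # nothing known nearby; leave the word unchanged
    return max(candidates, key=counts.get)
```

Under these assumptions, supporting another language means calling `train` on text in that language; the correction code itself stays the same.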