  1. That didn't seem too hard.
  2. This looks like English. This looks like German.
  3. I may not be familiar with Azerbaijani,
  4. but it doesn't look like English, German, French, or Spanish,
  5. so I'll probably choose that, and that would be the right answer.
  6. Now, how could I do that? Well, I could do it by recognizing some of the words.
  7. But it turns out I can also do it just by looking at letter sequences,
  8. the frequency of single letters or pairs of letters or triplets of letters.
  9. In fact, you can get about 99% accuracy for language identification just looking at tables of letters.
  10. And a great thing about dealing with letter models is that
  11. the probability tables you need are much more compact.
  12. If you think about triples of words, there may be a million words in the vocabulary,
  13. so a table of triples is a million to the 3rd power.
  14. That's quite a number of entries.
  15. Whereas for letters in the alphabet, most alphabets have about 30 letters or so.
  16. So it's very easy and compact to store triples of those.
  17. Now, in doing actual language identification,
  18. it's also common to add other features, to not look only at the letter combinations.
  19. So you might add words as well.
  20. You might add a small number of words--the most common words in a language,
  21. or it may be even better to add the most discriminative words--
  22. words that show up in one language but not in another language
  23. and count the occurrences of those words.
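
To make the idea concrete, here is a minimal sketch of letter-triple language identification, with a separate helper for counting marker words. It is an illustration under stated assumptions, not the method used in the lecture: the function names, the tiny training sentences, the add-one smoothing, and the marker-word lists are all made up for this example. The alphabet size of 30 comes from the estimate above, which gives at most 30^3 = 27,000 triples per language, compared with (10^6)^3 = 10^18 entries for word triples.

```python
import math
from collections import Counter

# Toy training text (assumption: a real system would use far more data).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and this is the house "
          "where the people live with their children",
    "de": "der schnelle braune fuchs springt über den faulen hund und das ist "
          "das haus in dem die leute mit ihren kindern wohnen",
}

# Hypothetical marker words, chosen by hand for illustration only.
MARKERS = {
    "en": {"the", "and", "is"},
    "de": {"der", "und", "ist", "das"},
}


def trigrams(text):
    """Return the overlapping letter triples of lowercased, space-padded text."""
    padded = f"  {text.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def train(samples):
    """Build one table of triple counts per language."""
    return {lang: Counter(trigrams(text)) for lang, text in samples.items()}


def log_score(counts, text, alphabet_size=30):
    """Sum of log frequencies of the text's triples, with add-one smoothing."""
    total = sum(counts.values())
    table_size = alphabet_size ** 3  # upper bound on distinct triples
    return sum(
        math.log((counts[g] + 1) / (total + table_size))
        for g in trigrams(text)
    )


def identify(models, text):
    """Pick the language whose triple table gives the text the highest score."""
    return max(models, key=lambda lang: log_score(models[lang], text))


def marker_word_counts(text, markers):
    """Count how many words of the text appear in each language's marker list."""
    words = text.lower().split()
    return {lang: sum(w in vocab for w in words) for lang, vocab in markers.items()}


if __name__ == "__main__":
    models = train(SAMPLES)
    for sentence in ("this is the house", "das ist das haus"):
        print(sentence, identify(models, sentence),
              marker_word_counts(sentence, MARKERS))
```

The triple score and the marker-word counts are kept separate here so that each can be inspected on its own. How to combine them into a single decision is a design choice this sketch leaves open.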