International Characters - Intro to Computer Science

Showing Revision 6 created 05/24/2016 by Udacity Robot.

  1. Raymond has a question to do with the use of non-US alphabet characters.
  2. When he first wrote programs with text in Spanish and ran them using Python,
  3. he got an error: a syntax error, "Non-ASCII character."
  4. He did not get that error when running the same code in IDLE. It worked as it should.
  5. He solved this by changing the encoding.
  6. So as we're building a search engine that will scan pages in different languages
  7. and perhaps use different language interfaces, how should we change our code
  8. so that it does not run into problems with encoding? What should our default character set be?
  9. Thanks for the question, Raymond. This is a really good point to bring up.
  10. As you take inputs that include more languages than just English,
  11. the character sets are different.
  12. If you look at a webpage, there's a header sent along with the page
  13. that identifies which character set it's using.
  14. It's part of HTTP to have a header that describes the content type,
  15. which includes the character set.
  16. The character set can be selected from a few different ones.
  17. The most common one is called UTF-8.
  18. It's a way of encoding characters that uses a single byte for the simple characters,
  19. the ones that can be encoded in just 7 bits.
  20. Those characters make up the character set known as ASCII.
  21. These are characters that are common in English.
  22. It doesn't cover all the characters that are used in all other languages,
  23. but those 7-bit characters can be encoded using a small amount of space.
  24. The standard strings we've been using in Python so far have all been ASCII.
  25. Each character is 1 byte.
  26. There are only 256 possible values that can be encoded in 1 byte, though.
  27. If you want to deal with more languages, you need more characters than that.
  28. The way to do that is called Unicode.
  29. Unicode is a character set that can support large numbers of characters.
  30. It gives every character its own number, called a code point, with room for
  31. over a million of them.
  32. A way of encoding those that is efficient for the simple ASCII characters
  33. but still allows you to encode a large number of characters is called UTF-8,
  34. which is what most web pages use.
  35. To deal with this in Python, you'd work with Unicode strings instead of standard strings.
  36. There is a built-in type for Unicode strings.
  37. You can convert a string to Unicode by calling unicode.
  38. Then there are ways of encoding Unicode strings in different character encodings.
  39. If you wanted to build a web search engine that can deal with text that's not using
  40. the standard English character set, you definitely need to worry about handling Unicode
  41. and all these different character encodings.
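
As a rough illustration of the answer above, here is a minimal sketch of how a crawler might turn the bytes of a fetched page into Unicode text. It is not code from the course. It assumes Python 3, where every str is already a Unicode string and bytes holds the raw encoded data (the unicode built-in mentioned in the answer belongs to Python 2). The helper names charset_from_content_type and decode_page are made up for this example, and falling back to UTF-8 when no character set is declared is an assumption, not a rule stated in the video.

```python
# Sketch only: assumes Python 3 (str is Unicode text, bytes is raw encoded data).
# Both helper names below are hypothetical, chosen for this example.

def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type header value, or a default."""
    # A header value looks like: "text/html; charset=ISO-8859-1"
    for part in content_type.split(";")[1:]:
        name, _, value = part.strip().partition("=")
        if name.strip().lower() == "charset" and value:
            return value.strip().strip("\"'").lower()
    return default  # assumption: treat pages with no declared charset as UTF-8


def decode_page(raw_bytes, content_type):
    """Decode a fetched page body into Unicode text using the declared charset."""
    charset = charset_from_content_type(content_type)
    try:
        return raw_bytes.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown charset name, or bytes that do not match it:
        # fall back to UTF-8 and replace anything undecodable.
        return raw_bytes.decode("utf-8", errors="replace")


# The same Spanish text takes a different number of bytes in different encodings.
text = "año"
utf8_bytes = text.encode("utf-8")      # "ñ" takes 2 bytes; "a" and "o" take 1 each
latin1_bytes = text.encode("latin-1")  # every character takes 1 byte

# Decoding with the charset the page declares recovers the original text.
page_text = decode_page(latin1_bytes, "text/html; charset=ISO-8859-1")
```

Decoding once at the point where a page is fetched, and working with Unicode strings everywhere after that, keeps the rest of the search engine (the index, the lookups) free of encoding concerns.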