Unicode & Utf-8

Myths and facts

Jose Vidal

on 15 March 2013

Transcript of Unicode & Utf-8

In old times...

Not everything was better...

A computer cannot store "letters", "numbers", "pictures" or anything else like that. The only thing it can store and work with is bits. A bit can only have two values: yes or no, true or false, 1 or 0, or whatever else you want to call them.

In order to allow a set of bits to represent a complex object, we need some rules: an encoding scheme.

ASCII

It's a character-encoding scheme, with nearly 50 years of history already. It has 128 symbols in the box (95 of them printable), all within 7 bits.

It works fine in English-speaking environments. It's a disaster if you try to use it to communicate with people who use a different language. After a while, it turned out to be really limited. So, why not add more symbols in the free space?

ISO 8859-1 and Windows-1252

They use 8 bits (1 byte) to encode each character. ISO 8859-1 added 96 printable symbols to ASCII, and Windows-1252 added 27 more on top of those.

But yes, these also turned out to be really limited, as more and more people wanted to say things! So each country created its own extended version of ASCII. Imagine how much fun it was to send emails to Russia!

A bit of common sense

Plain text doesn't exist!

Character set: refers to the set of characters and their numbers (code points).

Character encoding: refers to the representation of these code points as bytes.

E.g. Unicode is a character set, and UTF-8 and UTF-16 are different character encodings of Unicode.

Unicode

It has room for 1.1 million code points, and only about 110,000 are assigned so far. We won't run out of space! Unless we meet aliens, of course...

Big difference with ASCII: it doesn't say how we should save text to disk. WAT? It only defines "code points".

Wait a second...

We have to map Unicode "code points" to bytes somehow.

UTF-8

UTF-32, UTF-16, UTF-8...

UTF-8 is the "standard" in the industry, because it has some nice features. E.g. ASCII is a subset of UTF-8, so if you had a program using ASCII, the upgrade will be transparent to the end user.

In Python 2.x

A love story...

There has been a bit of controversy about how Unicode was implemented in Python 2.x. Python 2.x attempts implicit conversions between byte strings and Unicode strings under the hood when it faces a conversion problem, while Python 3.x is absolutely strict about this.

Python 3.x changes it completely! That's the main reason why it is so hard to move on.

Python 2.x

Decode

Assuming that you received a UTF-8 text and you need to convert it to Unicode:

>>> 'á'.decode('utf-8')
u'\xe1'
>>> 'ñ'.decode('utf-8')
u'\xf1'

See what happens if we try to decode it from ASCII:

>>> 'ñ'.decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

Encode

You can encode it back again to UTF-8 by typing:

>>> u'ñ'.encode('utf-8')
'\xc3\xb1'
>>> print '%s' % '\xc3\xb1'
ñ

Let's try now to encode it into ASCII...

>>> u'ñ'.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf1' in position 0: ordinal not in range(128)

Bibliography

http://www.joelonsoftware.com/articles/Unicode.html
http://nedbatchelder.com/text/unipain.html
http://www.grauw.nl/blog/entry/254
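
For comparison, here is a minimal sketch of the same decode and encode round trip in Python 3, where str holds code points and bytes holds encoded data. It is not part of the original slides: the variable names and the try_codec helper are made up for illustration, and the snippet only assumes the standard codecs that ship with Python 3.

```python
# Python 3 sketch (assumption: run on Python 3, not the 2.x used in the slides).
# A str is a sequence of code points; bytes is what actually goes to disk.

text = "\u00f1"                 # the letter "ñ", a single code point
code_point = ord(text)          # its number in the Unicode character set

# One character set, several character encodings of the same code point.
utf8 = text.encode("utf-8")           # two bytes
utf16 = text.encode("utf-16-be")      # two bytes, big-endian, no BOM
utf32 = text.encode("utf-32-be")      # four bytes, big-endian, no BOM
latin1 = text.encode("iso-8859-1")    # one byte

# Decoding with the codec that produced the bytes gives the text back.
round_trip = utf8.decode("utf-8")

# ASCII is a subset of UTF-8: pure ASCII text yields identical bytes.
ascii_is_subset = "plain".encode("ascii") == "plain".encode("utf-8")


def try_codec(action):
    """Hypothetical helper: run a conversion, return the error name if it fails."""
    try:
        return action()
    except UnicodeError as exc:
        return type(exc).__name__


# The ASCII codec cannot represent "ñ" in either direction.
decode_error = try_codec(lambda: utf8.decode("ascii"))
encode_error = try_codec(lambda: text.encode("ascii"))

print(hex(code_point), utf8, utf16, utf32, latin1)
print(round_trip == text, ascii_is_subset, decode_error, encode_error)
```

Python 3 never mixes the two types implicitly, so a wrong codec always surfaces as an explicit UnicodeDecodeError or UnicodeEncodeError at the point of conversion.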