2006/8/30

Unicode in python2.4 under IDLE and XP SP2

First, check which encoding stdin and stdout use:

>>> import sys
>>> sys.stdin.encoding
'cp950'
>>> sys.stdout.encoding
'cp950'
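
As a side note, here is a minimal sketch of the same check written as a script instead of at the IDLE prompt. It assumes Python 3 (the session above is Python 2.4), and stream_encodings is just a name made up for this example.

```python
import sys

def stream_encodings():
    """Return the encodings reported by the standard streams.

    Hypothetical helper for illustration. A stream may be missing or may
    not expose an encoding, in which case None is reported.
    """
    return {
        "stdin": getattr(sys.stdin, "encoding", None),
        "stdout": getattr(sys.stdout, "encoding", None),
    }

# Print one line per stream, e.g. the name followed by its encoding.
for name, enc in sorted(stream_encodings().items()):
    print(name, enc)
```

What gets printed depends on the machine and on how the script is launched, so the values will not necessarily be cp950.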

Now we know that input is encoded with cp950. Next, convert the cp950 string into a unicode string:
>>> cp= '三九あいう'
>>> cp
'\xa4T\xa4E\xc6\xa6\xc6\xa8\xc6\xaa'
>>> cp.decode('cp950')
u'\u4e09\u4e5d\u3042\u3044\u3046'

The conversion worked. Now print the unicode string:
>>> uni = cp.decode('cp950')
>>> uni
u'\u4e09\u4e5d\u3042\u3044\u3046'
>>> print uni
三九あいう

OK, the unicode string works fine even in a cp950 environment. Now try converting it back to cp950:
>>> uni.encode('cp950')
'\xa4T\xa4E\xc6\xa6\xc6\xa8\xc6\xaa'
>>> print uni.encode('cp950')
三九あいう

Everything looks normal, so that's that.
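
The round trip above can also be sketched in Python 3 terms, where the byte string is bytes and the unicode string is str. This is only an illustration under that assumption, not the Python 2.4 session itself, and it uses just the first two characters of the example to keep it short.

```python
# Python 3 sketch of the cp950 round trip (assumption: Python 3, where
# bytes and str are separate types).
raw = b'\xa4T\xa4E'          # cp950 bytes for the first two characters above

text = raw.decode('cp950')   # bytes -> str (the "unicode string")
back = text.encode('cp950')  # str -> bytes again

print(ascii(text))           # escaped form, safe on any console
print(back == raw)           # True when the round trip is lossless
```

Using ascii() for printing avoids depending on what the console itself can display.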


Now try putting u in front of the string literal and see what happens:
>>> cp = u'三九あいう'
>>> cp
u'\xa4T\xa4E\xc6\xa6\xc6\xa8\xc6\xaa'
>>> print cp
¤T¤EƦƨƪ

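What gets printed is not what I expected. Each cp950 byte has become its own code point (u'\xa4', u'T', ...) instead of being decoded as cp950. Here are the results of a few more tests: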

>>> cp.decode('cp950')

Traceback (most recent call last):
  File "", line 1, in -toplevel-
    cp.decode('cp950')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa4' in position 0: ordinal not in range(128)

>>> cp.encode('cp950')

Traceback (most recent call last):
  File "", line 1, in -toplevel-
    cp.encode('cp950')
UnicodeEncodeError: 'cp950' codec can't encode character u'\xa4' in position 0: illegal multibyte sequence

>>> cp.decode('utf8')

Traceback (most recent call last):
  File "", line 1, in -toplevel-
    cp.decode('utf8')
  File "C:\Dev\python\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa4' in position 0: ordinal not in range(128)
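
The 'ascii' codec in the two decode tracebacks comes from an implicit step: in Python 2, calling decode() on a unicode object first encodes it to a byte string with the default encoding, which is ASCII, and that is the part that fails on u'\xa4'. Python 3 removed str.decode, so the sketch below (Python 3 is my assumption for illustration) writes that hidden step out by hand. py2_style_decode is a made-up name.

```python
def py2_style_decode(text, encoding):
    """Mimic what Python 2 does for unicode.decode(encoding).

    Hypothetical helper: the implicit ASCII encode is written out
    explicitly, followed by the decode the caller asked for.
    """
    as_bytes = text.encode('ascii')   # the hidden step that raises
    return as_bytes.decode(encoding)

try:
    py2_style_decode('\xa4T\xa4E', 'cp950')
except UnicodeEncodeError as exc:
    # The error names the 'ascii' codec, not the codec that was requested.
    print(exc)
```

For pure ASCII text the hidden step succeeds, which is why this kind of mistake often goes unnoticed until non-ASCII data shows up.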

>>> cp.encode('utf8')

'\xc2\xa4T\xc2\xa4E\xc3\x86\xc2\xa6\xc3\x86\xc2\xa8\xc3\x86\xc2\xaa'

The result is a strange string:

>>> print cp.encode('utf8')
¤T¤EƦƨƪ

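Conclusion: putting u directly in front of a non-Unicode string in this environment gives strange results. The linked articles also cover file handling and give a simple explanation of Unicode.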
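
To make the strange results a bit more concrete: the u'\xa4T\xa4E...' value looks like the cp950 bytes with each byte turned into one code point, which is the same thing a Latin-1 decode does. The Python 3 sketch below (Python 3 and the variable names are my assumptions, not part of the session above) reproduces the strange string and its UTF-8 form, and shows that the damage can be undone by going back through Latin-1.

```python
raw = b'\xa4T\xa4E\xc6\xa6\xc6\xa8\xc6\xaa'  # cp950 bytes from the session above

# Each byte becomes one code point, like the broken u'...' literal.
garbled = raw.decode('latin-1')
print(ascii(garbled))            # escaped form of the mis-decoded text
print(garbled.encode('utf-8'))   # UTF-8 bytes of the mis-decoded text

# Undo it: code points back to the original bytes, then decode properly.
restored_bytes = garbled.encode('latin-1')
print(restored_bytes == raw)
print(ascii(restored_bytes[:4].decode('cp950')))  # first two characters only
```

The last line decodes only the first four bytes, which are the two Chinese characters, to keep the check small.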

Reference:
http://www.jorendorff.com/articles/unicode/index.html
http://evanjones.ca/python-utf8.html
