>>> sys.stdin.encoding
'cp950'
>>> sys.stdout.encoding
'cp950'
Now we know that input is encoded with cp950, so let's convert a cp950 string into a unicode string.
>>> cp= '三九あいう'
>>> cp
'\xa4T\xa4E\xc6\xa6\xc6\xa8\xc6\xaa'
>>> cp.decode('cp950')
u'\u4e09\u4e5d\u3042\u3044\u3046'
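The decode step above can be wrapped in a small helper. This is only a sketch: `to_text` is a hypothetical name, not something from the session, and the block is written so that it should run on both Python 2.7 and Python 3 (the `b'...'` literal is a plain `str` on Python 2). It uses only the first two characters of the test string and writes them as escapes so the source file stays ASCII.

```python
def to_text(raw, encoding='cp950'):
    """Decode a byte string in the given encoding into a unicode string.

    to_text is a hypothetical helper name used only for illustration.
    """
    return raw.decode(encoding)


# cp950 bytes for the first two characters of the test string above
raw = b'\xa4T\xa4E'
text = to_text(raw)

# Compare against the code points printed in the session (u'\u4e09\u4e5d')
print(text == u'\u4e09\u4e5d')
```

Keeping the encoding as a parameter makes it easy to pass in whatever `sys.stdin.encoding` reports instead of hard-coding cp950.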
As you can see, the conversion succeeded. Next, try printing the unicode string.
>>> uni = cp.decode('cp950')
>>> uni
u'\u4e09\u4e5d\u3042\u3044\u3046'
>>> print uni
三九あいう
OK, the unicode string works fine even in a cp950 environment. Now try converting it back to cp950.
>>> uni.encode('cp950')
'\xa4T\xa4E\xc6\xa6\xc6\xa8\xc6\xaa'
>>> print uni.encode('cp950')
三九あいう
That looks fine, so that's it for this part.
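The decode-then-encode round trip shown above can also be checked programmatically. The following is a minimal sketch with two hypothetical helpers, `console_encoding` and `round_trips`, neither of which appears in the session. It assumes that falling back to UTF-8 is acceptable when `sys.stdout.encoding` is not set, which can happen on Python 2 when output is piped.

```python
import sys


def console_encoding(default='utf-8'):
    """Return the encoding reported for stdout, or a fallback.

    The fallback value is an assumption for this sketch; sys.stdout.encoding
    may be None, for example when output is redirected on Python 2.
    """
    return getattr(sys.stdout, 'encoding', None) or default


def round_trips(raw, encoding):
    """True if decoding raw and encoding it again gives back the same bytes."""
    return raw.decode(encoding).encode(encoding) == raw


# The first two characters of the test string, as cp950 bytes
print(round_trips(b'\xa4T\xa4E', 'cp950'))
```

Note that `round_trips` raises `UnicodeDecodeError` if the bytes are not valid in the given encoding, rather than returning `False`.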
Now let's see what happens when the string literal is prefixed with u.
>>> cp = u'三九あいう'
>>> cp
u'\xa4T\xa4E\xc6\xa6\xc6\xa8\xc6\xaa'
>>> print cp
¤T¤EƦƨƪ
The printed output is not what was expected. Here are the results of some further tests.
>>> cp.decode('cp950')
Traceback (most recent call last):
File "
cp.decode('cp950')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa4' in position 0: ordinal not in range(128)
>>> cp.encode('cp950')
Traceback (most recent call last):
File "
cp.encode('cp950')
UnicodeEncodeError: 'cp950' codec can't encode character u'\xa4' in position 0: illegal multibyte sequence
>>> cp.decode('utf8')
Traceback (most recent call last):
File "
cp.decode('utf8')
File "C:\Dev\python\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa4' in position 0: ordinal not in range(128)
>>> cp.encode('utf8')
'\xc2\xa4T\xc2\xa4E\xc3\x86\xc2\xa6\xc3\x86\xc2\xa8\xc3\x86\xc2\xaa'
>>> print cp.encode('utf8')
¤T¤EƦƨƪ
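The output above suggests that this shell stored each cp950 byte of the `u'...'` literal as a separate code point with the same numeric value, which is exactly the Latin-1 mapping. That would explain both the garbled print and the UTF-8 bytes. The `UnicodeEncodeError` from `cp.decode(...)` is expected on Python 2, because calling `decode` on a unicode object first encodes it implicitly with the ASCII codec. Under that assumption, the original text can be recovered by encoding with Latin-1 to get the raw bytes back and then decoding them as cp950. `repair_mojibake` is a hypothetical helper name, and the sketch uses only the first two characters of the test string.

```python
def repair_mojibake(garbled, real_encoding='cp950'):
    """Recover text whose bytes were mistaken for Latin-1 code points.

    Latin-1 maps code points 0-255 straight back to the same byte values,
    so encoding with it restores the original byte string.
    """
    return garbled.encode('latin-1').decode(real_encoding)


# The garbled value of the first two characters, as seen in the session
garbled = u'\xa4T\xa4E'
fixed = repair_mojibake(garbled)

# Compare against the code points of the intended characters
print(fixed == u'\u4e09\u4e5d')
```

This only works when every code point in the garbled string is below 256. Otherwise the Latin-1 encode step raises `UnicodeEncodeError`.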
Reference:
http://www.jorendorff.com/articles/unicode/index.html
http://evanjones.ca/python-utf8.html