Python 處理各種編碼的字符串

ew3y 11年前發布 | 1K 次閱讀 Python

# file: Unicode2.py

-- coding: utf-8 --

import chilkat

The CkString object can handle any character encoding.

s1 = chilkat.CkString()

The appendEnc method allows us to append a string in any encoding.

s1.appendEnc('èéêëabc','utf-8')

If you're working with different encodings, you may wish

to name your string variables to reflect the encoding.

strAnsi = s1.getAnsi() strUtf8 = s1.getUtf8()

Prints "7"

print len(strAnsi)

Prints "11"

print len(strUtf8)

getNumChars returns the number of characters

print 'Num Chars: ' + str(s1.getNumChars())

utf-8 chars do not have a constant number of bytes/char.

A single utf-8 char is represented in 1 to 6 bytes.

print 'utf-8: ' + str(s1.getSizeUtf8())

ANSI is typically 1 byte per/char, but for some languages

such as Japanese, ANSI equates to a character encoding that may

not be 1 byte/char. (Shift_JIS is the ANSI encoding for Japanese)

print 'ANSI: ' + str(s1.getSizeAnsi())

Let's create an English/Japanese string.

s2 = chilkat.CkString() s2.appendEnc('abc愛知県新城市の','utf-8')

We can get the string in any multibyte encoding.

print 's2 num chars = ' + str(s2.getNumChars())

strShiftJIS = s2.getEnc('shift_JIS') print 'Shift-JIS num bytes = ' + str(len(strShiftJIS))

strIso2022JP = s2.getEnc('iso-2022-jp') print 'iso-2022-jp num bytes = ' + str(len(strIso2022JP))

strEucJp = s2.getEnc('euc-jp') print 'euc-jp num bytes = ' + str(len(strEucJp))

We can save the string in any encoding

s2.saveToFile('out_shift_jis.txt','shift_JIS') s2.saveToFile('out_iso_2022_jp.txt','iso-2022-jp') s2.saveToFile('out_utf8.txt','utf-8') s2.saveToFile('out_euc_jp.txt','euc-jp')

You may mix any number of languages in a utf-8 string

because utf-8 can encode characters in all languages.

(utf-8 is the multi-byte encoding of Unicode)

#

An ANSI string can generally hold us-ascii + the native language.

For example, Shift_JIS can represent us-ascii characters

in addition to Japanese characters.

For example, this is OK

strShiftJis = 'abc123' + s2.getEnc('shift_JIS')

This is not OK:

strShiftJis2 = '?στυφ' + s2.getEnc('shift_JIS')

print "Done!" </pre>


 本文由用戶 ew3y 自行上傳分享,僅供網友學習交流。所有權歸原作者,若您的權利被侵害,請聯系管理員。
 轉載本站原創文章,請注明出處,并保留原始鏈接、圖片水印。
 本站是一個以用戶分享為主的開源技術平臺,歡迎各類分享!