Mafan: A Toolkit for Working with Chinese Text in Python


Mafan is a collection of Python tools that make working with Chinese text more convenient. It can detect whether text is simplified or traditional, convert between the two, check for Chinese punctuation, check whether Chinese and English are mixed, and even provides word segmentation.

encodings

encodings contains functions for converting files from any number of 麻煩 character encodings to something more sane (utf-8, by default). For example:

from mafan import encoding

filename = 'ugly_big5.txt' # name or path of file as string
encoding.convert(filename) # creates a file with name 'ugly_big5_utf-8.txt' in glorious utf-8 encoding
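
As a hedged sketch of batch use (the file list and loop are illustrative; only encoding.convert itself is mafan's API), converting several legacy files in one pass might look like this:

from mafan import encoding

# hypothetical legacy files in Big5 and GBK; replace with your own paths
legacy_files = ['ugly_big5.txt', 'old_gbk.txt']

for filename in legacy_files:
    # each call writes a new '<name>_utf-8.txt' file alongside the original
    encoding.convert(filename)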

text

text contains some functions for working with strings. Things like detecting english in a string, whether a string has Chinese punctuation, etc. Check out text.py for all the latest goodness. It also contains a handy wrapper for the jianfan package for converting between simplified and traditional:

>>> from mafan import simplify, tradify
>>> string = u'这是麻烦啦'
>>> print tradify(string) # convert string to traditional
這是麻煩啦
>>> print simplify(tradify(string)) # convert back to simplified
这是麻烦啦
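
For instance, one plausible use (a sketch of my own, not part of mafan) is normalizing a mixed corpus to simplified characters so that the two spellings of the same phrase compare equal:

from mafan import simplify

# hypothetical mixed corpus: one simplified entry, one traditional entry
corpus = [u'这是麻烦啦', u'這是麻煩啦']

# normalizing to simplified collapses the two spellings into one
normalized = set(simplify(s) for s in corpus)
print len(normalized)  # 1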

The has_punctuation and contains_latin functions are useful for knowing whether you are really dealing with Chinese text, or just text that happens to contain Chinese characters:

>>> from mafan import text
>>> text.has_punctuation(u'這是麻煩啦') # check for any Chinese punctuation (full-stops, commas, quotation marks, etc)
False
>>> text.has_punctuation(u'這是麻煩啦.')
False
>>> text.has_punctuation(u'這是麻煩啦。')
True
>>> text.contains_latin(u'這是麻煩啦。')
False
>>> text.contains_latin(u'You are麻煩啦。')
True
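
Combining the two, a rough filter for "pure" Chinese sentences might look like the sketch below (the helper name and the criteria are my own assumptions; only the two mafan functions come from the library):

from mafan import text

def looks_like_chinese_sentence(s):
    # hypothetical heuristic: expect Chinese punctuation and no Latin letters
    return text.has_punctuation(s) and not text.contains_latin(s)

print looks_like_chinese_sentence(u'這是麻煩啦。')     # True
print looks_like_chinese_sentence(u'You are麻煩啦。')  # False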

You can also test whether sentences or documents use simplified characters, traditional characters, both or neither:

>>> import mafan
>>> from mafan import text
>>> text.is_simplified(u'这是麻烦啦')
True
>>> text.is_traditional(u'Hello,這是麻煩啦') # ignores non-chinese characters
True

# Or done another way:
>>> text.identify(u'这是麻烦啦') is mafan.SIMPLIFIED
True
>>> text.identify(u'這是麻煩啦') is mafan.TRADITIONAL
True
>>> text.identify(u'这是麻烦啦! 這是麻煩啦') is mafan.BOTH
True
>>> text.identify(u'This is so mafan.') is mafan.NEITHER # or None
True

The identification functionality is provided as a very thin wrapper around Thomas Roten's hanzidentifier, which is included as part of mafan.
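
As a sketch of how the constants might be used in practice (the dispatch rule is my own illustration, not part of mafan), you could normalize text only when text.identify reports traditional characters:

import mafan
from mafan import text, simplify

def to_simplified(s):
    # convert only when traditional characters are present (my own rule)
    kind = text.identify(s)
    if kind is mafan.TRADITIONAL or kind is mafan.BOTH:
        return simplify(s)
    return s

print to_simplified(u'這是麻煩啦')         # 这是麻烦啦
print to_simplified(u'This is so mafan.')  # unchanged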

Another function that comes pre-built into Mafan is split_text, which tokenizes Chinese sentences into words:

>>> from mafan import split_text
>>> split_text(u"這是麻煩啦")
[u'\u9019', u'\u662f', u'\u9ebb\u7169', u'\u5566']
>>> print ' '.join(split_text(u"這是麻煩啦"))
這 是 麻煩 啦

You can also optionally pass the boolean include_part_of_speech parameter to get tagged words back:

>>> split_text(u"這是麻煩啦", include_part_of_speech=True)
[(u'\u9019', 'r'), (u'\u662f', 'v'), (u'\u9ebb\u7169', 'x'), (u'\u5566', 'y')]
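
Building on split_text, a small word-frequency sketch might look like this (the counting logic and mini-corpus are illustrative; only split_text is mafan's API):

from collections import Counter
from mafan import split_text

# hypothetical mini-corpus
sentences = [u'這是麻煩啦', u'這是麻煩啦']

counts = Counter()
for sentence in sentences:
    counts.update(split_text(sentence))

for word, count in counts.most_common():
    print word, count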

pinyin

pinyin contains functions for working with and converting pinyin. At the moment, the only function there converts numbered pinyin to pinyin with the correct tone marks. For example:

>>> from mafan import pinyin
>>> print pinyin.decode("ni3hao3")
nǐhǎo
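
As a hedged sketch, the same function can be applied to a list of numbered syllables (the extra inputs follow the same numbered format as the documented example, but are my own):

from mafan import pinyin

# a few illustrative numbered-pinyin strings
for numbered in ["ni3hao3", "zhong1wen2", "pin1yin1"]:
    print pinyin.decode(numbered)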

traditional characters

If you want to be able to use split_text on traditional characters, you can make use of one of two options:

  • Either set an environment variable, MAFAN_DICTIONARY_PATH, to the absolute path of a local copy of this dictionary file (a sketch of this appears after the list),
  • or install the mafan_traditional convenience package: pip install mafan_traditional. If this package is installed and available, mafan will default to using this extended dictionary file.
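
A minimal sketch of the first option follows; the dictionary path is hypothetical, and setting the variable before importing mafan is my assumption about when it is read:

import os

# hypothetical path; point this at wherever you saved the dictionary file
os.environ['MAFAN_DICTIONARY_PATH'] = '/path/to/dict.txt.big'

from mafan import split_text  # imported after setting the variable
print ' '.join(split_text(u'這是麻煩啦'))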

