7 Python Libraries Every Programmer Should Know

Posted by jopen 12 years ago | Tags: Python, Python development

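Over my years of programming in Python and roaming around GitHub, I have come across some excellent Python packages that greatly simplify development. This post recommends them.

Note that I deliberately left out libraries such as SQLAlchemy and Flask: they are so good that they need no further mention.

Here we go: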

1. PyQuery (with lxml)

Install: pip install pyquery

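Beautiful Soup is the library most often recommended for parsing HTML in Python, and it does a good job. It offers a nicely Pythonic API, and documentation is easy to find online. But when you need to parse a large number of documents in a short time you run into performance problems: it is simple, but really very slow.

The chart below is a performance comparison from 2008:

parsing-results.png

The chart shows how well lxml performs, but its documentation is sparse and it is rather clunky to use. So do you choose a library that is easy to use but painfully slow, or one that is blazingly fast but very complicated to use?

Who says you have to pick one? What we want is an XML/HTML parsing library that is convenient to use and just as fast.

PyQuery meets both demands: ease of use and parsing speed.

Take a look at these few lines of code: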

from pyquery import PyQuery
page = PyQuery(some_html)

last_red_anchor = page('#container > a.red:last')

Easy, isn't it? It looks a lot like jQuery, but it is Python.

There are some drawbacks, though: when iterating, you have to re-wrap each element:

for paragraph in page('#container > p'):
    paragraph = PyQuery(paragraph)
    text = paragraph.text()


2. dateutil

Install: pip install python-dateutil

Handling dates is a pain. Thankfully there is dateutil:

from dateutil.parser import parse

>>> parse('Mon, 11 Jul 2011 10:01:56 +0200 (CEST)')
datetime.datetime(2011, 7, 11, 10, 1, 56, tzinfo=tzlocal())

# fuzzy ignores unknown tokens

>>> s = """Today is 25 of September of 2003, exactly
...        at 10:49:41 with timezone -03:00."""
>>> parse(s, fuzzy=True)
datetime.datetime(2003, 9, 25, 10, 49, 41,
                  tzinfo=tzoffset(None, -10800))
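
For contrast, here is a rough sketch of the same parse using only the standard library. It is not part of the original post. The format string and the variable names are assumptions chosen for illustration, and %a and %b depend on the locale (the default C locale is assumed).

```python
from datetime import datetime, timedelta

# Assumed format for RFC 2822 style dates; it must match the input exactly.
RFC_2822_FORMAT = "%a, %d %b %Y %H:%M:%S %z"

# Works only because the format is spelled out token by token.
parsed = datetime.strptime("Mon, 11 Jul 2011 10:01:56 +0200", RFC_2822_FORMAT)
print(parsed.utcoffset() == timedelta(hours=2))

# The trailing "(CEST)" from the dateutil example is enough to break it.
try:
    datetime.strptime("Mon, 11 Jul 2011 10:01:56 +0200 (CEST)", RFC_2822_FORMAT)
    strict_parse_failed = False
except ValueError:
    strict_parse_failed = True
print(strict_parse_failed)
```

This is the kind of manual fiddling that dateutil.parser.parse spares you: it needs no format string and tolerates the extra token.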


3. fuzzywuzzy

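Install: pip install fuzzywuzzy

fuzzywuzzy lets you do fuzzy comparison of two strings, which is very useful when you have to deal with human-generated data. The code below uses the Levenshtein distance to match user input against an array of possible choices.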

from Levenshtein import distance

countries = ['Canada', 'Antarctica', 'Togo', ...]

def choose_least_distant(element, choices):
    'Return the one element of choices that is most similar to element'
    return min(choices, key=lambda s: distance(element, s))

user_input = 'canaderp'
choose_least_distant(user_input, countries)
# returns 'Canada'

That is already nice, but we can do better:

from fuzzywuzzy import process

process.extractOne("canaderp", countries)
# returns ("Canada", 97)
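
To make the distance-based version above concrete, here is a minimal pure-Python sketch of the Levenshtein distance. It is an illustration only, not the implementation used by the Levenshtein package or by fuzzywuzzy. The three-country list is an assumed subset, and lowercasing both strings is an added assumption.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def choose_least_distant(element, choices):
    """Return the choice with the smallest case-insensitive edit distance."""
    return min(choices, key=lambda s: levenshtein(element.lower(), s.lower()))


countries = ["Canada", "Antarctica", "Togo"]  # illustrative subset only
print(choose_least_distant("canaderp", countries))  # prints: Canada
```

The sketch keeps only two rows of the dynamic-programming table, so memory grows with the length of one string rather than with the product of both.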


4. watchdog

Install: pip install watchdog

watchdog is a Python API and a set of shell utilities for monitoring file system events.
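
To show what "file system events" means in practice, here is a polling sketch that uses only the standard library. It is deliberately not watchdog's API and not how watchdog works internally. The function names and the notes.txt file are hypothetical.

```python
import os
import tempfile


def snapshot(directory):
    """Map each file name in `directory` to its modification time."""
    return {
        entry.name: entry.stat().st_mtime
        for entry in os.scandir(directory)
        if entry.is_file()
    }


def diff_snapshots(before, after):
    """Return (created, deleted, modified) file names between two snapshots."""
    created = sorted(after.keys() - before.keys())
    deleted = sorted(before.keys() - after.keys())
    modified = sorted(
        name for name in before.keys() & after.keys()
        if before[name] != after[name]
    )
    return created, deleted, modified


with tempfile.TemporaryDirectory() as watched:
    before = snapshot(watched)
    with open(os.path.join(watched, "notes.txt"), "w") as handle:
        handle.write("hello")
    created, deleted, modified = diff_snapshots(before, snapshot(watched))
    print(created, deleted, modified)
```

A real monitor would have to repeat this comparison in a loop. A library like watchdog removes that loop by delivering events to your handler instead.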

5. sh

Install: pip install sh

sh lets you call any program as if it were a function:

from sh import git, ls, wc

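# check out the master branch (the subcommand is passed as an argument;
# a keyword argument would be turned into a --checkout=master option)
git("checkout", "master")

# print the contents of this directory
print(ls("-l"))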

# get the longest line of this file
longest_line = wc(__file__, "-L")
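
The idea of a program as a function can be sketched with the standard library's subprocess module. This is a simplified illustration, not sh's implementation. The make_command helper is hypothetical, and its rule that keyword arguments become long options is an assumption made for the sketch.

```python
import subprocess
import sys


def make_command(program):
    """Return a function that runs `program` and returns its stdout as text."""
    def run(*args, **options):
        argv = [program]
        # In this sketch, keyword arguments become long options:
        # color="never" turns into --color=never.
        argv += [f"--{name.replace('_', '-')}={value}"
                 for name, value in options.items()]
        argv += [str(arg) for arg in args]
        result = subprocess.run(argv, capture_output=True, text=True, check=True)
        return result.stdout
    return run


# Use the running interpreter so the example does not depend on git, ls or wc.
python = make_command(sys.executable)
answer = python("-c", "print(6 * 7)")
print(answer.strip())  # prints: 42
```

Because check=True is set, a non-zero exit status raises subprocess.CalledProcessError instead of failing silently.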


6. pattern

Install: pip install pattern

Pattern is a web mining module for Python. It can be used for data mining, natural language processing, machine learning and network analysis.

7. path.py

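Install: pip install path.py

When I first learned Python, os.path was my least favorite part of the stdlib. Even something as simple as building a list of the files in a directory is tedious: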

import os

some_dir = '/some_dir'
files = []

for f in os.listdir(some_dir):
    files.append(os.path.join(some_dir, f))


On top of that, listdir lives in os rather than in os.path.

With path.py, handling file paths becomes simple:

from path import path

some_dir = path('/some_dir')

files = some_dir.files()

Other uses:

>>> path('/').owner
'root'

>>> path('a/b/c').splitall()
[path(''), 'a', 'b', 'c']

# overriding __div__
>>> path('a') / 'b' / 'c'
path('a/b/c')

>>> path('ab/c').relpathto('ab/d/f')
path('../d/f')

Much better, isn't it?
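
The standard library's pathlib module (Python 3.4 and later) offers a comparable object-oriented style, so here is a short sketch of the same tasks with it. It is a stand-in for illustration, not path.py itself, and the file and directory names are made up.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    some_dir = Path(tmp)
    (some_dir / "a.txt").write_text("first")
    (some_dir / "b.txt").write_text("second")
    (some_dir / "sub").mkdir()

    # Full paths of the regular files only; the subdirectory is skipped.
    files = sorted(p for p in some_dir.iterdir() if p.is_file())
    file_names = [p.name for p in files]
    print(file_names)

# The / operator joins path components, as in the path.py example.
joined = Path("a") / "b" / "c"
print(joined.parts)
```

One difference is worth knowing: pathlib.Path does not subclass str, so you convert it with str() or os.fspath() when an API insists on a plain string.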


Original English text:

In my years of programming in Python and roaming around GitHub's Explore section, I've come across a few libraries that stood out to me as being particularly enjoyable to use. This blog post is an effort to further spread that knowledge.

I specifically excluded awesome libs like requests, SQLAlchemy, Flask, fabric etc. because I think they're already pretty "main-stream". If you know what you're trying to do, it's almost guaranteed that you'll stumble over the aforementioned. This is a list of libraries that in my opinion should be better known, but aren't.

1. pyquery (with lxml)

pip install pyquery

For parsing HTML in Python, Beautiful Soup is oft recommended and it does a great job. It sports a good pythonic API and it's easy to find introductory guides on the web. All is good in parsing-land .. until you want to parse more than a dozen documents at a time and immediately run head-first into performance problems. It's - simply put - very, very slow.

Just how slow? Check out this chart from the excellent Python HTML Parser comparison Ian Bicking compiled in 2008:

parsing-results.png

What immediately stands out is how fast lxml is. Compared to Beautiful Soup, the lxml docs are pretty sparse and that's what originally kept me from adopting this mustang of a parsing library. lxml is pretty clunky to use. Yeah you can learn and use Xpath or cssselect to select specific elements out of the tree and it becomes kind of tolerable. But once you've selected the elements that you actually want to get, you have to navigate the labyrinth of attributes lxml exposes, some containing the bits you want to get at, but the vast majority just returning None. This becomes easier after a couple dozen uses but it remains unintuitive.

So either slow and easy to use or fast and hard to use, right?

Wrong!

Enter PyQuery

Oh PyQuery you beautiful seductress:

from pyquery import PyQuery
page = PyQuery(some_html)

last_red_anchor = page('#container > a.red:last')

Easy as pie. It's ever-beloved jQuery but in Python!

There are some gotchas, like for example that PyQuery, like jQuery, exposes its internals upon iteration, forcing you to re-wrap:

for paragraph in page('#container > p'):
    paragraph = PyQuery(paragraph)
    text = paragraph.text()

That's a wart the PyQuery creators ported over from jQuery (where they'd fix it if it didn't break compatibility). Understandable but still unfortunate for such a great library.

2. dateutil

pip install python-dateutil

Handling dates is a pain. Thank god dateutil exists. I won't even go near parsing dates without trying dateutil.parser first:

from dateutil.parser import parse

>>> parse('Mon, 11 Jul 2011 10:01:56 +0200 (CEST)')
datetime.datetime(2011, 7, 11, 10, 1, 56, tzinfo=tzlocal())

# fuzzy ignores unknown tokens

>>> s = """Today is 25 of September of 2003, exactly
...        at 10:49:41 with timezone -03:00."""
>>> parse(s, fuzzy=True)
datetime.datetime(2003, 9, 25, 10, 49, 41,
                  tzinfo=tzoffset(None, -10800))

Another thing that dateutil does for you, that would be a total pain to do manually, is recurrence:

>>> from datetime import datetime
>>> from dateutil.rrule import rrule, DAILY, TU, TH
>>> list(rrule(DAILY, count=3, byweekday=(TU,TH),
...            dtstart=datetime(2007,1,1)))
[datetime.datetime(2007, 1, 2, 0, 0),
 datetime.datetime(2007, 1, 4, 0, 0),
 datetime.datetime(2007, 1, 9, 0, 0)]

3. fuzzywuzzy

pip install fuzzywuzzy

fuzzywuzzy allows you to do fuzzy comparison on strings. This has a whole host of use cases and is especially nice when you have to deal with human-generated data.

Consider the following code that uses the Levenshtein distance comparing some user input to an array of possible choices.

from Levenshtein import distance

countries = ['Canada', 'Antarctica', 'Togo', ...]

def choose_least_distant(element, choices):
    'Return the one element of choices that is most similar to element'
    return min(choices, key=lambda s: distance(element, s))

user_input = 'canaderp'
choose_least_distant(user_input, countries)
# returns 'Canada'

This is all nice and dandy but we can do better. The ocean of 3rd party libs in Python is so vast, that in most cases we can just import something and be on our way:

from fuzzywuzzy import process

process.extractOne("canaderp", countries)
# returns ("Canada", 97)

More has been written about fuzzywuzzy here.

4. watchdog

pip install watchdog

watchdog is a Python API and shell utilities to monitor file system events. This means you can watch some directory and define a "push-based" system. Watchdog supports all kinds of problems. A solid piece of engineering that does it much better than the 5 or so libraries I tried before finding out about it.

5. sh

pip install sh

sh allows you to call any program as if it were a function:

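from sh import git, ls, wc

# check out the master branch (the subcommand is passed as an argument;
# a keyword argument would be turned into a --checkout=master option)
git("checkout", "master")

# print the contents of this directory
print(ls("-l"))

# get the longest line of this file
longest_line = wc(__file__, "-L")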

6. pattern

pip install pattern

This behemoth of a library advertises itself quite modestly:

Pattern is a web mining module for the Python programming language.

... that does Data Mining, Natural Language Processing, Machine Learning and Network Analysis all in one. I myself yet have to play with it but a friend's verdict was very positive.

7. path.py

pip install path.py

When I first learned Python os.path was my least favorite part of the stdlib.

Even something as simple as creating a list of files in a directory turned out to be grating:

import os

some_dir = '/some_dir'
files = []

for f in os.listdir(some_dir):
    files.append(os.path.join(some_dir, f))

That listdir is in os and not os.path is unfortunate and unexpected and one would really hope for more from such a prominent module. And then all this manual fiddling for what really should be as simple as possible.

But with the power of path, handling file paths becomes fun again:

from path import path

some_dir = path('/some_dir')

files = some_dir.files()

Done!

Other goodies include:

>>> path('/').owner
'root'

>>> path('a/b/c').splitall()
[path(''), 'a', 'b', 'c']

# overriding __div__
>>> path('a') / 'b' / 'c'
path('a/b/c')

>>> path('ab/c').relpathto('ab/d/f')
path('../d/f')

Best part of it all? path subclasses Python's str so you can use it completely guilt-free without constantly being forced to cast it to str and worrying about libraries that check isinstance(s, basestring) (or even worse isinstance(s, str)).
