7 Python Libraries You Should Know

Posted by ngww 10 years ago | 14K views | Python

    <h2>1. <a class="reference external" href="/misc/goto?guid=4958871900305728680">pyquery</a> (with lxml)
</h2>
<div class="section" id="pip-install-pyquery">
    <h3>
        <tt class="docutils literal">pip install pyquery</tt> 
    </h3>
    <p>
        For parsing HTML in Python, <a class="reference external" href="/misc/goto?guid=4958187577038044362">Beautiful Soup</a> is oft recommended and it does a great job. It sports a good Pythonic API and it's easy to find introductory guides on the web. All is good in parsing-land... until you want to parse more than a dozen documents at a time and immediately run head-first into performance problems. It is, simply put, very, very slow. </p>

Just how slow? Check out the chart in the excellent Python HTML parser comparison Ian Bicking compiled in 2008.

What immediately stands out is how fast lxml is. Compared to Beautiful Soup, the lxml docs are pretty sparse, and that's what originally kept me from adopting this mustang of a parsing library. lxml is pretty clunky to use. Yeah, you can learn and use XPath or cssselect to select specific elements out of the tree, and it becomes kind of tolerable. But once you've selected the elements you actually want, you have to navigate the labyrinth of attributes lxml exposes: a few contain the bits you want to get at, but the vast majority just return None. This becomes easier after a couple dozen uses, but it remains unintuitive.

So either slow and easy to use or fast and hard to use, right?

Wrong!

Enter PyQuery

Oh PyQuery you beautiful seductress:

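from pyquery import PyQuery

# some_html is a string holding the HTML document
page = PyQuery(some_html)
# jQuery-style selector: the last red anchor directly inside #container
last_red_anchor = page('#container > a.red:last')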

Easy as pie. It's the ever-beloved jQuery, but in Python!
        
        
There are some gotchas. For example, PyQuery, like jQuery, exposes its internals upon iteration, forcing you to re-wrap:
        

for paragraph in page('#container > p'):
    # iteration yields raw lxml elements, so wrap each one again
    paragraph = PyQuery(paragraph)
    text = paragraph.text()
       
That's a wart the PyQuery creators ported over from jQuery (where they'd fix it if it didn't break compatibility). Understandable, but still unfortunate for such a great library.
        
    </div>
</div>

    

    <h2>2. <a class="reference external" href="/misc/goto?guid=4958871900582367766">dateutil</a> 
</h2>
<div class="section" id="pip-install-python-dateutil">
    <h3>
        <tt class="docutils literal">pip install <span class="pre">python-dateutil</span></tt> 
    </h3>
    <p>
        Handling dates is a pain. Thank god <tt class="docutils literal">dateutil</tt> exists. I won't even go near parsing dates without trying dateutil.parser first:
        </p>
>>> from dateutil.parser import parse
>>> parse('Mon, 11 Jul 2011 10:01:56 +0200 (CEST)')
datetime.datetime(2011, 7, 11, 10, 1, 56, tzinfo=tzlocal())
>>> # fuzzy=True ignores unknown tokens
>>> s = """Today is 25 of September of 2003, exactly
...        at 10:49:41 with timezone -03:00."""
>>> parse(s, fuzzy=True)
datetime.datetime(2003, 9, 25, 10, 49, 41,
                  tzinfo=tzoffset(None, -10800))

            Another thing that dateutil does for you, that would be a total pain
to do manually, is recurrence:
        

>>> from datetime import datetime
>>> from dateutil.rrule import rrule, DAILY, TU, TH
>>> list(rrule(DAILY, count=3, byweekday=(TU,TH),
...            dtstart=datetime(2007,1,1)))
[datetime.datetime(2007, 1, 2, 0, 0),
 datetime.datetime(2007, 1, 4, 0, 0),
 datetime.datetime(2007, 1, 9, 0, 0)]
</div>
</div>

    

    <h2>3. <a class="reference external" href="/misc/goto?guid=4958871900689305452">fuzzywuzzy</a> 
</h2>
<div class="section" id="pip-install-fuzzywuzzy">
    <h3>
        <tt class="docutils literal">pip install fuzzywuzzy</tt> 
    </h3>
    <p>
        <tt class="docutils literal">fuzzywuzzy</tt> allows you to do fuzzy comparison on wuzzes strings. This has a whole host of use cases and is especially nice when you have to deal with human-generated data.
        </p>
        

Consider the following code, which uses the Levenshtein distance to compare some user input against a list of possible choices.
        

from Levenshtein import distance

countries = ['Canada', 'Antarctica', 'Togo']  # ...and so on

def choose_least_distant(element, choices):
    'Return the one element of choices that is most similar to element'
    return min(choices, key=lambda s: distance(element, s))

user_input = 'canaderp'
choose_least_distant(user_input, countries)
# 'Canada'
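In case the `distance` call feels like a black box: the Levenshtein distance counts the fewest single-character insertions, deletions and substitutions needed to turn one string into another. Below is a rough pure-Python sketch of the textbook dynamic-programming version. It is not the implementation the `Levenshtein` package uses (that one is written in C), and the function name `levenshtein` is just a name picked for this example:

```python
def levenshtein(a, b):
    """Return the minimum number of single-character edits turning a into b."""
    # previous[j] = distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


countries = ['Canada', 'Antarctica', 'Togo']
closest = min(countries, key=lambda name: levenshtein('canaderp', name))
print(closest)  # Canada
```

Note that the comparison is case-sensitive, so the lowercase `c` in the input already costs one substitution against `Canada`.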

This is all nice and dandy, but we can do better. The ocean of third-party libs in Python is so vast that in most cases we can just import something and be on our way:
        

from fuzzywuzzy import process

process.extractOne("canaderp", countries)
# ("Canada", 97)

            More has been written about fuzzywuzzy here.
        
    </div>
</div>

    

    <h2>4. <a class="reference external" href="/misc/goto?guid=4958871900971515832">watchdog</a> 
</h2>
<div class="section" id="pip-install-watchdog">
    <h3>
        <tt class="docutils literal">pip install watchdog</tt> 
    </h3>
    <p>
        <tt class="docutils literal">watchdog</tt> is a Python API and a set of shell utilities for monitoring file system events. This means you can watch a directory and build a "push-based" system on top of it, reacting to changes as they happen instead of polling. Watchdog handles all kinds of problems. It is a solid piece of engineering that does the job much better than the five or so libraries I tried before finding out about it.
        </p>
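To give a feel for what "push-based" means here, the following is a minimal sketch built on watchdog's `Observer` and `FileSystemEventHandler`. The class name `RecordingHandler` and the helper `watch` are made up for this example and are not part of watchdog:

```python
import time

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer


class RecordingHandler(FileSystemEventHandler):
    """Hypothetical handler that remembers every event it is handed."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def on_any_event(self, event):
        # The observer thread calls this for creations, modifications,
        # moves and deletions alike; nothing here polls the disk.
        self.seen.append((event.event_type, event.src_path))


def watch(directory, seconds):
    """Watch `directory` recursively for `seconds`, then return the events seen."""
    handler = RecordingHandler()
    observer = Observer()
    observer.schedule(handler, directory, recursive=True)
    observer.start()
    try:
        time.sleep(seconds)
    finally:
        observer.stop()
        observer.join()
    return handler.seen
```

Calling something like `watch('.', 10)` and touching a few files in the meantime should return a list of `(event_type, path)` pairs. A real application would more likely override `on_created` or `on_modified` and act on each event right away.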
    </div>
</div>

    

    <h2>5. <a class="reference external" href="/misc/goto?guid=4958871901106242014">sh</a> 
</h2>
<div class="section" id="pip-install-sh">
    <h3>
        <tt class="docutils literal">pip install sh</tt> 
    </h3>
    <p>
        <tt class="docutils literal">sh</tt> allows you to call any program as if it were a function:
    </p>

from sh import git, ls, wc

# check out the master branch
git("checkout", "master")
# print the contents of this directory
print(ls("-l"))
# ask wc for the length of the longest line in this file (-L is a GNU wc option)
longest_line = wc(__file__, "-L")
</div>
</div>

    

    <h2>6. <a class="reference external" href="/misc/goto?guid=4958837507204637626">pattern</a> 
</h2>
<div class="section" id="pip-install-pattern">
    <h3>
        <tt class="docutils literal">pip install pattern</tt> 
    </h3>
    <p>
        This behemoth of a library advertises itself quite modestly:
    </p>
    <blockquote>
        Pattern is a web mining module for the Python programming language.
    </blockquote>
    <p>
        ... that does <strong>Data Mining</strong>, <strong>Natural Language Processing</strong>, <strong>Machine Learning</strong> and <strong>Network Analysis</strong> all in one. I have yet to play with it myself, but a friend's verdict was very positive.
        </p>
    </div>
</div>


<h2>7. <a class="reference external" href="/misc/goto?guid=4958871901274673060">path.py</a> </h2>


    pip install path.py 


    When I first learned Python, os.path was my least favorite part of the stdlib.

    Even something as simple as creating a list of files in a directory
turned out to be grating:
import os

some_dir = '/some_dir'
files = []

for f in os.listdir(some_dir):
    files.append(os.path.join(some_dir, f))
 
    That listdir is in os and not os.path is unfortunate and
unexpected and one would really hope for more from such a prominent
module. And then all this manual fiddling for what really should be as
simple as possible.

    But with the power of path, handling file paths becomes fun again:
from path import path

some_dir = path('/some_dir')

files = some_dir.files()
 
    Done!

    Other goodies include:
>>> path('/').owner
'root'

>>> path('a/b/c').splitall()
[path(''), 'a', 'b', 'c']

# overriding __div__
>>> path('a') / 'b' / 'c'
path('a/b/c')

>>> path('ab/c').relpathto('ab/d/f')
path('../d/f')
 
    Best part of it all? path subclasses Python's str so you can use
it completely guilt-free without constantly being forced to cast it to str and worrying about libraries that check isinstance(s, basestring) (or even worse isinstance(s, str)).
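For comparison, and to be clear this is the standard library rather than path.py: since Python 3.4 the stdlib ships `pathlib`, which covers similar ground. Here is a rough sketch of the file-listing example in that style; `list_files` is a hypothetical helper name chosen for this example:

```python
from pathlib import Path


def list_files(directory):
    """Return the regular files directly inside `directory`, sorted by name."""
    return sorted(entry for entry in Path(directory).iterdir() if entry.is_file())


# The / operator joins path segments here as well.
nested = Path('a') / 'b' / 'c'
print(nested.parts)  # ('a', 'b', 'c')
```

One difference worth knowing: `pathlib.Path` does not subclass `str`, so code that insists on a plain string needs `str(p)` or `os.fspath(p)`.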
                    
Article URL: http://www.baiduhome.net/news/view/117bec0