Python Web Crawler Roundup

Posted by jopen 10 years ago | 61K views | Python, web crawler

Network

  • General

    • urllib - network library (stdlib)
    • requests - network library
    • grab - network library (pycurl based)
    • pycurl - network library (binding to libcurl)
    • urllib3 - Python HTTP library with thread-safe connection pooling, file post support, sanity friendly, and more.
    • httplib2 - network library
    • RoboBrowser - A simple, Pythonic library for browsing the web without a standalone web browser.
    • MechanicalSoup - A Python library for automating interaction with websites.
    • mechanize - Stateful programmatic web browsing.
    • socket - low-level networking interface (stdlib)
    • Unirest for Python - Unirest is a set of lightweight HTTP libraries available in multiple languages
    • hyper - HTTP/2 Client for Python
    • PySocks - Updated and actively maintained version of SocksiPy, with bug fixes and extra features. Acts as a drop-in replacement to the socket module.

    • Asynchronous

      • treq - requests like API (twisted based)
      • aiohttp - http client/server for asyncio (PEP-3156)
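For the stdlib `urllib` entry listed above, a minimal sketch of preparing a request might look like the following. The `build_request` helper, the user-agent string and the example.com URL are illustrative assumptions, not part of the original list. The request is only constructed here, not sent.

```python
from urllib.request import Request


def build_request(url, user_agent="example-crawler/0.1"):
    """Build a GET request with a custom User-Agent header (hypothetical helper).

    Nothing is sent over the network here; passing the result to
    urllib.request.urlopen() would perform the actual fetch.
    """
    return Request(url, headers={"User-Agent": user_agent})


req = build_request("http://example.com/index.html")
print(req.get_method())              # HTTP method that would be used
print(req.full_url)                  # URL the request targets
print(req.get_header("User-agent"))  # urllib stores header names capitalized this way
```

Setting an explicit User-Agent is a common courtesy when crawling, since many sites reject the default one.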

        Web-Scraping Frameworks

        • Full Featured Crawlers

          • grab - web-scraping framework (pycurl/multicurl based)
          • scrapy - web-scraping framework (twisted based). Supports Python 3 since version 1.1.
          • pyspider - A powerful spider system.
          • cola - A distributed crawling framework.

          • Other

            • portia - Visual scraping for Scrapy.
            • restkit - HTTP resource kit for Python. It lets you easily access HTTP resources and build objects around them.
            • demiurge - PyQuery-based scraping micro-framework.

              HTML/XML Parsing

              • General

                • lxml - effective HTML/XML processing library. Supports XPATH. Written in C.
                • cssselect - working with DOM tree with CSS selectors
                • pyquery - working with DOM tree with jQuery-like selectors
                • BeautifulSoup - slow HTML/XML processing library, written in pure Python
                • html5lib - builds DOM of HTML/XML document according to WHATWG spec. That spec is used in all modern browsers.
                • feedparser - parsing of RSS/ATOM feeds.
                • MarkupSafe - Implements a XML/HTML/XHTML Markup safe string for Python.
                • xmltodict - Makes working with XML feel like working with JSON.
                • xhtml2pdf - HTML/CSS to PDF converter.
                • untangle - Converts XML documents to Python objects for easy access.

                • Sanitizing

                  • Bleach - cleaning of HTML (requires html5lib)
                  • sanitize - Bringing sanity to world of messed-up data.

                    Text Processing

                    Libraries for parsing and manipulating plain texts.

                    • General

                      • difflib - (Python standard library) Helpers for computing deltas.
                      • Levenshtein - Fast computation of Levenshtein distance and string similarity.
                      • fuzzywuzzy - Fuzzy String Matching.
                      • esmre - Regular expression accelerator.
                      • ftfy - Makes Unicode text less broken and more consistent automagically.

                      • Transliteration

                        • unidecode - ASCII transliterations of Unicode text.

                        • Character encoding

                          • uniout - Print readable chars instead of the escaped string.
                          • chardet - Python 2/3 compatible character encoding detector.
                          • xpinyin - A library to translate Chinese hanzi (漢字) to pinyin (拼音).
                          • pangu.py - Spacing texts for CJK and alphanumerics.

                          • Slugify

                            • awesome-slugify - A Python slugify library that can preserve unicode.
                            • python-slugify - A Python slugify library that translates unicode to ASCII.
                            • unicode-slugify - A slugifier that generates unicode slugs.
                            • pytils - Simple tools for processing strings in Russian (including pytils.translit.slugify)

                            • General Parser

                              • PLY - Implementation of lex and yacc parsing tools for Python
                              • pyparsing - A general purpose framework for generating parsers.

                              • Human names

                                • python-nameparser - Parsing human names into their individual components.

                                • Phone Number

                                  • phonenumbers - Parsing, formatting, storing and validating international phone numbers.

                                  • User-agent string

                                    • python-user-agents - Browser user agent parser.
                                    • HTTP Agent Parser - Python HTTP Agent Parser
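As an illustration of the `difflib` entry under General above, here is a small sketch of fuzzy matching scraped strings using only the standard library. The `similarity` helper and the sample titles are made up for the example.

```python
import difflib


def similarity(a, b):
    """Return a similarity ratio between 0.0 and 1.0 (hypothetical helper)."""
    return difflib.SequenceMatcher(None, a, b).ratio()


# Sample data, invented for illustration.
titles = ["Python Web Scraping", "Python Web Crawling", "Java Networking"]

# Find the known title closest to a misspelled one.
matches = difflib.get_close_matches("Python Web Scrapping", titles, n=1)
print(matches)

# Compare two strings directly.
print(similarity("crawler", "crawler"))
print(similarity("abc", "xyz"))
```

This kind of comparison can help deduplicate near-identical titles collected from different pages. Libraries such as Levenshtein or fuzzywuzzy from the same list cover similar ground with different trade-offs.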

                                      Specific Formats Processing

                                      Libraries for parsing and manipulating specific text formats.

                                      • General

                                        • tablib - A module for Tabular Datasets in XLS, CSV, JSON, YAML.
                                        • textract - Extract text from any document, Word, PowerPoint, PDFs, etc.
                                        • messytables - Tools for parsing messy tabular data
                                        • rows - A common, beautiful interface to tabular data, no matter the format (currently CSV, HTML, XLS, TXT -- more coming!)

                                        • Office

                                          • python-docx - Reads, queries and modifies Microsoft Word 2007/2008 docx files.
                                          • xlwt / xlrd - Writing and reading data and formatting information from Excel files.
                                          • XlsxWriter - A Python module for creating Excel .xlsx files.
                                          • xlwings - A BSD-licensed library that makes it easy to call Python from Excel and vice versa.
                                          • openpyxl - A library for reading and writing Excel 2010 xlsx/xlsm/xltx/xltm files.
                                          • Marmir - Takes Python data structures and turns them into spreadsheets.

                                          • PDF

                                            • PDFMiner - A tool for extracting information from PDF documents.
                                            • PyPDF2 - A library capable of splitting, merging and transforming PDF pages.
                                            • ReportLab - Allows rapid creation of rich PDF documents.
                                            • pdftables - Extract tables from PDF files directly

                                            • Markdown

                                              • Python-Markdown - A Python implementation of John Gruber’s Markdown.
                                              • Mistune - A fast, full-featured pure-Python Markdown parser.
                                              • markdown2 - A fast and complete Python implementation of Markdown

                                              • YAML

                                                • PyYAML - YAML implementations for Python.

                                                • CSS

                                                  • cssutils - A CSS library for Python.

                                                  • ATOM/RSS

                                                    • feedparser - Universal feed parser.

                                                    • SQL

                                                      • sqlparse - A non-validating SQL parser.

                                                      • HTTP

                                                        • http-parser - HTTP request/response parser for Python, written in C

                                                        • Microformats

                                                          • opengraph - A Python module to parse the Open Graph Protocol tags

                                                          • Portable Executable

                                                            • pefile - A multi-platform module to parse and work with Portable Executable (aka PE) files.

                                                            • PSD

                                                              • psd-tools - reads Adobe Photoshop PSD files (as described in the specification) into Python data structures.

                                                                Natural Language Processing

                                                                Libraries for working with human languages.

                                                                • NLTK - A leading platform for building Python programs to work with human language data.
                                                                • Pattern - A web mining module for Python. It has tools for natural language processing and machine learning, among others.
                                                                • TextBlob - Providing a consistent API for diving into common NLP tasks. Stands on the giant shoulders of NLTK and Pattern.
                                                                • jieba - Chinese Words Segmentation Utilities.
                                                                • SnowNLP - A library for processing Chinese text.
                                                                • loso - Another Chinese segmentation library.
                                                                • genius - A Chinese segmenter based on conditional random fields.
                                                                • langid.py - Stand-alone language identification system.
                                                                • Korean - A library for Korean morphology.
                                                                • pymorphy2 - Morphological analyzer (POS tagger + inflection engine) for Russian language.
                                                                • PyPLN - A distributed pipeline for natural language processing, made in Python. The goal of the project is to create an easy way to use NLTK for processing big corpora, with a Web interface.

                                                                  Browser automation and emulation

                                                                  • Browsers

                                                                    • selenium - automating real browsers (Chrome, Firefox, Opera, IE)
                                                                    • Ghost.py - wrapper of QtWebKit (requires PyQT)
                                                                    • Spynner - wrapper of QtWebKit (requires PyQT)
                                                                    • Splinter - universal API to browser emulators (selenium webdrivers, django client, zope)

                                                                    • Headless tools

                                                                      • xvfbwrapper - Python wrapper for running a display inside X virtual framebuffer (Xvfb)

                                                                        Multiprocessing

                                                                        • threading - standard Python library for running threads. Effective for I/O-bound tasks, but of little use for CPU-bound tasks because of the Python GIL.
                                                                        • multiprocessing - standard python library to run processes.
                                                                        • celery - An asynchronous task queue/job queue based on distributed message passing.
                                                                        • concurrent-futures - The concurrent.futures module provides a high-level interface for asynchronously executing callables.
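The point above about threads suiting I/O-bound work can be sketched with `concurrent.futures`. In this example `fake_fetch` is a hypothetical stand-in that sleeps instead of making a real network call, and the URLs are placeholders. A real crawler would swap in an actual HTTP request.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url):
    """Stand-in for a network call: wait briefly, then return (url, length)."""
    time.sleep(0.05)  # simulates waiting on I/O, during which the GIL is released
    return url, len(url)


# Placeholder URLs, invented for illustration.
urls = ["http://example.com/page/%d" % i for i in range(8)]

# Four worker threads overlap the waiting time of the eight "downloads".
with ThreadPoolExecutor(max_workers=4) as pool:
    # map() yields results in the same order as the input URLs.
    results = list(pool.map(fake_fetch, urls))

for url, size in results:
    print(url, size)
```

For CPU-bound work such as heavy parsing, `ProcessPoolExecutor` from the same module offers the same interface backed by processes.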

                                                                          Asynchronous

                                                                          Libraries for asynchronous networking programming.

                                                                          • asyncio - (Python standard library in Python 3.4+) Asynchronous I/O, event loop, coroutines and tasks.
                                                                          • Twisted - An event-driven networking engine.
                                                                          • Tornado - A Web framework and asynchronous networking library.
                                                                          • pulsar - Event-driven concurrent framework for Python.
                                                                          • diesel - Greenlet-based event I/O Framework for Python.
                                                                          • gevent - A coroutine-based Python networking library that uses greenlet.
                                                                          • eventlet - Asynchronous framework with WSGI support.
                                                                          • Tomorrow - Magic decorator syntax for asynchronous code.
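To give a feel for the `asyncio` entry above, here is a minimal sketch of running several simulated downloads concurrently on one event loop. `fake_fetch`, `crawl` and the URLs are hypothetical names for this example. It assumes Python 3.7 or newer for `asyncio.run`, and uses `asyncio.sleep` in place of real network I/O (which a client such as aiohttp would provide).

```python
import asyncio


async def fake_fetch(url, delay):
    """Stand-in coroutine: pretend to download `url`, taking `delay` seconds."""
    await asyncio.sleep(delay)  # yields control to the event loop while "waiting"
    return url


async def crawl(urls):
    """Run all simulated fetches concurrently and collect results in input order."""
    tasks = [fake_fetch(u, 0.01) for u in urls]
    return await asyncio.gather(*tasks)


# Placeholder URLs, invented for illustration.
urls = ["http://example.com/a", "http://example.com/b", "http://example.com/c"]
pages = asyncio.run(crawl(urls))
print(pages)
```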

                                                                            Queue

                                                                            • celery - An asynchronous task queue/job queue based on distributed message passing.
                                                                            • huey - Little multi-threaded task queue.
                                                                            • mrq - Mr. Queue - A distributed worker task queue in Python using Redis & gevent.
                                                                            • RQ - lightweight task queue manager based on Redis
                                                                            • simpleq - A simple, infinitely scalable, Amazon SQS based queue.
                                                                            • python-gearman - Python API for Gearman

                                                                              Cloud Computing

                                                                              • picloud - executing Python code in the cloud
                                                                              • dominoup.com - executing R, Python and MATLAB code in the cloud

                                                                                Email

                                                                                Libraries for parsing email.

                                                                                • flanker - An email address and MIME parsing library.
                                                                                • Talon - Mailgun library to extract message quotations and signatures.

                                                                                  URL and Network Address Manipulation

                                                                                  Libraries for parsing/modifying URLs and network addresses.

                                                                                  • URL

                                                                                    • furl - A small Python library that makes manipulating URLs simple.
                                                                                    • purl - A simple, immutable URL class with a clean API for interrogation and manipulation.
                                                                                    • urllib.parse - interface to break Uniform Resource Locator (URL) strings up in components (addressing scheme, network location, path etc.), to combine the components back into a URL string, and to convert a “relative URL” to an absolute URL given a “base URL.” (stdlib)
                                                                                    • tldextract - Accurately separate the TLD from the registered domain and subdomains of a URL, using the Public Suffix List.

                                                                                    • Network Address

                                                                                      • netaddr - A Python library for representing and manipulating network addresses.
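The three operations described in the `urllib.parse` entry above (splitting a URL, recombining it, and resolving a relative URL against a base) can be sketched as follows. The example.com URL and the variable names are illustrative assumptions.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

# Illustrative URL, not taken from the original list.
base = "http://example.com/docs/index.html?lang=en#top"

# 1. Break the URL into its components.
parts = urlsplit(base)
print(parts.scheme, parts.netloc, parts.path, parts.query, parts.fragment)

# 2. Combine the components back into a URL string.
rebuilt = urlunsplit(parts)
print(rebuilt)

# 3. Convert a relative URL found on the page into an absolute one.
absolute = urljoin(base, "../images/logo.png")
print(absolute)
```

Step 3 is the one crawlers rely on most, since links extracted from a page are often relative to that page's address.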

                                                                                        Web Content Extracting

                                                                                        Libraries for extracting web contents.

                                                                                        • Text and Meta Data from HTML pages

                                                                                          • newspaper - News extraction, article extraction and content curation in Python.
                                                                                          • html2text - Convert HTML to Markdown-formatted text.
                                                                                          • python-goose - HTML Content/Article Extractor.
                                                                                          • lassie - Web Content Retrieval for Humans.
                                                                                          • micawber - A small library for extracting rich content from URLs.
                                                                                          • sumy - A module for automatic summarization of text documents and HTML pages.
                                                                                          • Haul - An Extensible Image Crawler.
                                                                                          • python-readability - Fast Python port of arc90's readability tool.
                                                                                          • scrapely - Library for extracting structured data from HTML pages. Given some example web pages and the data to be extracted, scrapely constructs a parser for all similar pages.

                                                                                          • Video

                                                                                            • youtube-dl - A small command-line program to download videos from YouTube.
                                                                                            • you-get - A YouTube/Youku/Niconico video downloader written in Python 3.

                                                                                            • Wiki

                                                                                              • WikiTeam - Tools for downloading and preserving wikis.

                                                                                                WebSocket

                                                                                                Libraries for working with WebSocket.

                                                                                                • Crossbar - Open-source Unified Application Router (Websocket & WAMP for Python on Autobahn).
                                                                                                • AutobahnPython - WebSocket & WAMP for Python on Twisted and asyncio.
                                                                                                • WebSocket-for-Python - WebSocket client and server library for Python 2 and 3 as well as PyPy.

                                                                                                  DNS Resolving

                                                                                                  • dnsyo - Check your DNS against over 1500 global DNS servers.
                                                                                                  • pycares - interface to c-ares. c-ares is a C library that performs DNS requests and name resolutions asynchronously

                                                                                                    Computer Vision
