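Spidr: a web spider written in Ruby
Spidr is a versatile Ruby web spidering library. It can spider a single site, multiple domains, or selected links. Spidr is designed to be fast and easy to use.
Features: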
- Follows:
  - a tags.
  - iframe tags.
  - frame tags.
  - Cookie protected links.
  - HTTP 300, 301, 302, 303 and 307 Redirects.
  - HTTP Basic Auth protected links.
- Black-list or white-list URLs based upon:
  - URL scheme
  - Host name
  - Port number
  - Full link
  - URL extension
- Provides call-backs for:
  - Every visited Page.
  - Every visited URL.
  - Every visited URL that matches a specified pattern.
  - Every URL that failed to be visited.
- Provides action methods to:
  - Pause spidering.
  - Skip processing of pages.
  - Skip processing of links.
- Restore the spidering queue and history from a previous session.
- Custom User-Agent strings.
- Custom proxy settings.
- HTTPS support.
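
The filter and call-back features listed above might be used along these lines. This is a minimal sketch and is not taken from the original post. It assumes the option names `ignore_links`, `ignore_exts` and `ignore_ports` and the agent methods `every_page`, `every_url_like` and `every_failed_url` from the spidr gem. `CRAWL_OPTIONS`, `CrawlLog`, `attach_callbacks`, the filter values and the `example.com` URL are hypothetical and chosen only for illustration.

```ruby
# Black-list options, using option names assumed from the spidr gem.
# The values are placeholders, not recommendations.
CRAWL_OPTIONS = {
  ignore_links: [%r{/blog/}],  # skip any link containing /blog/
  ignore_exts:  %w[pdf zip],   # skip these URL extensions
  ignore_ports: [8080]         # skip links to this port
}.freeze

# Hypothetical container for whatever the call-backs observe.
CrawlLog = Struct.new(:pages, :matches, :failures) do
  # Builds a log with three empty lists.
  def self.empty
    new([], [], [])
  end
end

# Registers call-backs on a spider object (assumed to be the agent that
# Spidr.site yields to its block). Only every_page, every_url_like and
# every_failed_url are called on it.
def attach_callbacks(spider, log, pattern: /\.xml\z/)
  # Called for every visited page: record its URL and HTTP status code.
  spider.every_page do |page|
    log.pages << [page.url.to_s, page.code]
  end

  # Called for every URL that matches the given pattern.
  spider.every_url_like(pattern) do |url|
    log.matches << url.to_s
  end

  # Called for every URL that failed to be visited.
  spider.every_failed_url do |url|
    log.failures << url.to_s
  end

  log
end
```

With the spidr gem installed, the two pieces would be combined roughly as `Spidr.site('http://example.com/', **CRAWL_OPTIONS) { |spider| attach_callbacks(spider, CrawlLog.empty) }`. Keeping the call-backs in a separate method means they can be exercised with a stand-in object before any real crawl is started.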