Web Scraping with Scrapy
Scrapy makes web scraping very convenient for Python developers; the full documentation is here: http://doc.scrapy.org/en/0.14/index.html
To scrape a site with Scrapy, first create a new project: scrapy startproject myproject
Once the project is created, there is a myproject/myproject subdirectory containing items.py (where you define the things you want to scrape) and pipelines.py (for processing scraped data, e.g. saving it to a database or elsewhere), plus a spiders folder where you write the crawler scripts.
As an example, let's scrape book information from a website.
items.py looks like this:
from scrapy.item import Item, Field

class BookItem(Item):
    # define the fields for your item here:
    name = Field()
    author = Field()        # parse_item below also fills in an author field
    publisher = Field()
    publish_date = Field()
    price = Field()
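To get a feel for what declaring Fields buys you, here is a rough pure-Python sketch of the dict-like behaviour an Item has: you can only assign to keys that were declared as fields. This is an illustration only, not Scrapy's actual implementation (which does this via a metaclass).

```python
# Rough sketch of Item's dict-like behaviour, WITHOUT importing Scrapy:
# a container that accepts only the declared field names as keys.
class SketchItem(dict):
    fields = ('name', 'author', 'publisher', 'publish_date', 'price')

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError('%r is not a declared field' % key)
        dict.__setitem__(self, key, value)

item = SketchItem()
item['name'] = 'Example Book'   # declared field: accepted
try:
    item['isbn'] = '123'        # not declared: rejected with KeyError
except KeyError:
    print('isbn rejected')
print(item['name'])  # Example Book
```

Assigning an undeclared key in a real Scrapy Item fails the same way, which catches typos in field names early.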
Everything we want to scrape is defined above: name, author, publisher, publish date, and price.
Next, we write the spider that crawls the site for this information.
spiders/book.py looks like this:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

from myproject.items import BookItem

class BookSpider(CrawlSpider):
    name = 'bookspider'
    allowed_domains = ['test.com']
    start_urls = [
        "http://test_url.com",  # page to start crawling from (fictional URL; replace it in real use)
    ]
    rules = (
        # URLs matching this pattern are followed for links only, not parsed (fictional URL; replace it)
        Rule(SgmlLinkExtractor(allow=(r'http://test_url/test\?page_index=\d+',))),
        # URLs matching this pattern have their content parsed (fictional URL; replace it)
        Rule(SgmlLinkExtractor(allow=(r'http://test_url/test\?product_id=\d+',)), callback="parse_item"),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        item = BookItem()
        item['name'] = hxs.select('//div[@class="h1_title book_head"]/h1/text()').extract()[0]
        item['author'] = hxs.select('//div[@class="book_detailed"]/p[1]/a/text()').extract()
        publisher = hxs.select('//div[@class="book_detailed"]/p[2]/a/text()').extract()
        item['publisher'] = publisher and publisher[0] or ''
        # match a run of CJK characters plus a full-width colon, capture the date after it
        publish_date = hxs.select('//div[@class="book_detailed"]/p[3]/text()').re(u"[\u2e80-\u9fff]+\uff1a([\d-]+)")
        item['publish_date'] = publish_date and publish_date[0] or ''
        prices = hxs.select('//p[@class="price_m"]/text()').re(r"(\d+\.?\d*)")
        item['price'] = prices and prices[0] or ''
        return item
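The trickiest line in parse_item is the publish-date regex: it matches a run of CJK characters followed by a full-width colon and captures the date that comes after. A standalone sketch, using a made-up sample string (the real page text is an assumption here):

```python
import re

# Made-up sample of the text node the XPath would extract, i.e.
# "出版日期：2012-06-01" (CJK label, full-width colon, date).
text = u"\u51fa\u7248\u65e5\u671f\uff1a2012-06-01"

# CJK run + full-width colon (U+FF1A), capture the digits and dashes after it
dates = re.findall(u"[\u2e80-\u9fff]+\uff1a([\d-]+)", text)

# the first-match-or-empty-string idiom used throughout parse_item
publish_date = dates and dates[0] or ''
print(publish_date)  # 2012-06-01
```

The same `matches and matches[0] or ''` idiom guards every field in parse_item, so a page missing one element yields an empty string instead of an IndexError.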
Once the information is scraped it needs to be saved, which is where pipelines.py comes in. Because Scrapy is built on Twisted, database access goes through Twisted's adbapi (see the Twisted documentation for the details); the following only shows a simple way to save items to a MySQL database:
from scrapy import log
from twisted.enterprise import adbapi

import MySQLdb
import MySQLdb.cursors

class MySQLStorePipeline(object):
    def __init__(self):
        self.dbpool = adbapi.ConnectionPool('MySQLdb',
            db='test',
            user='user',
            passwd='******',
            cursorclass=MySQLdb.cursors.DictCursor,
            charset='utf8',
            use_unicode=False,
        )

    def process_item(self, item, spider):
        # run the insert on a thread from Twisted's database connection pool
        query = self.dbpool.runInteraction(self._conditional_insert, item)
        query.addErrback(self.handle_error)
        return item

    def _conditional_insert(self, tx, item):
        # only store items that actually have a name
        if item.get('name'):
            tx.execute(
                "insert into book (name, publisher, publish_date, price) "
                "values (%s, %s, %s, %s)",
                (item['name'], item['publisher'], item['publish_date'], item['price']))

    def handle_error(self, failure):
        # the original code registered this errback without defining it;
        # log the failure so database errors are not silently swallowed
        log.err(failure)
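To see what _conditional_insert actually does without standing up MySQL and Twisted, here is a minimal synchronous sketch using the stdlib sqlite3 module; the table and column names follow the pipeline above, and the sample items are made up.

```python
import sqlite3

# Stand-in for the MySQL table the pipeline writes to
conn = sqlite3.connect(':memory:')
conn.execute("create table book (name text, publisher text, publish_date text, price text)")

def conditional_insert(tx, item):
    # skip items with no name, exactly like the pipeline's guard clause
    if item.get('name'):
        tx.execute("insert into book (name, publisher, publish_date, price) values (?, ?, ?, ?)",
                   (item['name'], item['publisher'], item['publish_date'], item['price']))

conditional_insert(conn, {'name': 'Example Book', 'publisher': 'Example Press',
                          'publish_date': '2012-06-01', 'price': '29.80'})
conditional_insert(conn, {'name': '', 'publisher': '', 'publish_date': '', 'price': ''})  # skipped

rows = conn.execute("select name, price from book").fetchall()
print(rows)  # [('Example Book', '29.80')]
```

The real pipeline differs only in plumbing: runInteraction hands _conditional_insert a transaction cursor on a pool thread, so the insert never blocks the crawler's event loop.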
When that is done, register the pipeline in settings.py:
ITEM_PIPELINES = ['myproject.pipelines.MySQLStorePipeline']
Finally, run scrapy crawl bookspider and the crawl begins.
References:
- www.chengxuyuans.com
- http://www.chengxuyuans.com/Python/39302.html
Source: http://blog.csdn.net/huaweitman/article/details/9613999