Crawling Hundreds of Thousands of Novels from the Entire Yisou (easou.com) Site

bigzhangxy · posted 9 years ago | 14K reads | database, web crawler, beautifulsoup

Ever since I watched my mentor crawl the entire Dingdian site, my hands have been itching to crawl a fairly big novel site myself, so I picked Yisou (easou.com). Time to get to work. This time I used MongoDB, since MySQL felt too cumbersome. The screenshot below shows the Yisou pages I chose to traverse.

(screenshot: the easou category listing to be traversed)
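Since everything below goes through MongoDB, here is a minimal sketch of what that looks like with pymongo (my own addition, assuming a local MongoDB on the default port; the collection name is made up). Documents are just dicts, which is what makes it feel lighter than setting up a MySQL schema:

from pymongo import MongoClient

client = MongoClient()                 # connects to localhost:27017 by default
demo = client['novel_list']['demo']    # database and collection are created lazily

# No schema to declare up front: insert a dict-shaped document directly.
demo.insert_one({'_id': 'http://book.easou.com/w/cat_yanqing.html', 'status': 1})
print(demo.find_one({'_id': 'http://book.easou.com/w/cat_yanqing.html'}))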

First, the overall code structure:

(screenshot: code structure diagram)

Step one, obviously, is to extract the link of each category from the ranking page, then follow each link and crawl it. Let's look at the all_theme file first:

from bs4 import BeautifulSoup
import requests
from MogoQueue import MogoQueue
spider_queue = MogoQueue('novel_list','crawl_queue')  # wraps the database operations; this collection stores the link of every page of books
theme_queue = MogoQueue('novel_list','theme_queue')   # this collection stores the link of every category (theme) page
html = requests.get('http://book.easou.com/w/cat_yanqing.html')

soup = BeautifulSoup(html.text,'lxml')

all_list = soup.find('div',{'class':'classlist'}).findAll('div',{'class':'tit'})
for item in all_list:  # 'item' rather than 'list', to avoid shadowing the built-in
    title = item.find('span',{'class':'name'}).get_text()
    book_number = item.find('span',{'class':'count'}).get_text()
    theme_link = item.find('a')['href']
    theme_links = 'http://book.easou.com/'+theme_link  # full URL of this category
    # print(title,book_number,theme_links)  # title, book count and link of each category; the links below come from these
    theme_queue.push_theme(theme_links,title,book_number)
links=['http://book.easou.com//w/cat_yanqing.html',
'http://book.easou.com//w/cat_xuanhuan.html',
'http://book.easou.com//w/cat_dushi.html',
'http://book.easou.com//w/cat_qingxiaoshuo.html',
'http://book.easou.com//w/cat_xiaoyuan.html',
'http://book.easou.com//w/cat_lishi.html',
'http://book.easou.com//w/cat_wuxia.html',
'http://book.easou.com//w/cat_junshi.html',
'http://book.easou.com//w/cat_juqing.html',
'http://book.easou.com//w/cat_wangyou.html',
'http://book.easou.com//w/cat_kehuan.html',
'http://book.easou.com//w/cat_lingyi.html',
'http://book.easou.com//w/cat_zhentan.html',
'http://book.easou.com//w/cat_jishi.html',
'http://book.easou.com//w/cat_mingzhu.html',
'http://book.easou.com//w/cat_qita.html',
]
def make_links(number,url):  # a note here: each category has a different number of pages, and the last page number is loaded dynamically, so it is not in the page source
    # so the last page of each category was typed in by hand; capturing the AJAX request felt like it would take longer
    # (see the sketch after this block for one way to detect the last page automatically)
    for i in range(int(number)+1):
        link = url+'?attb=&s=&tpg=500&tp={}'.format(str(i))
        spider_queue.push_queue(link)  # insert the link of every page of books into the database
        # print(link)

make_links(500,'http://book.easou.com//w/cat_yanqing.html')
make_links(500,'http://book.easou.com//w/cat_xuanhuan.html')
make_links(500,'http://book.easou.com//w/cat_dushi.html')
make_links(5,'http://book.easou.com//w/cat_qingxiaoshuo.html')
make_links(500,'http://book.easou.com//w/cat_xiaoyuan.html')
make_links(500,'http://book.easou.com//w/cat_lishi.html')
make_links(500,'http://book.easou.com//w/cat_wuxia.html')
make_links(162,'http://book.easou.com//w/cat_junshi.html')
make_links(17,'http://book.easou.com//w/cat_juqing.html')
make_links(500,'http://book.easou.com//w/cat_wangyou.html')
make_links(474,'http://book.easou.com//w/cat_kehuan.html')
make_links(427,'http://book.easou.com//w/cat_lingyi.html')
make_links(84,'http://book.easou.com//w/cat_zhentan.html')
make_links(9,'http://book.easou.com//w/cat_jishi.html')
make_links(93,'http://book.easou.com//w/cat_mingzhu.html')
make_links(500,'http://book.easou.com//w/cat_qita.html')
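The hand-typed page counts above could, in principle, be discovered automatically. Below is a rough sketch of one way to do it (my own addition, not part of the original code): keep requesting successive tp= pages until one comes back with no book entries. The kindContent/li selector is borrowed from the crawler further down, and the stopping condition assumes an out-of-range page simply renders an empty list, which may not hold for every category.

def find_last_page(url, upper=500):
    # Probe page indexes 0..upper; return the last index that still contains books.
    for i in range(upper + 1):
        page = requests.get(url + '?attb=&s=&tpg=500&tp={}'.format(i))
        soup = BeautifulSoup(page.text, 'lxml')
        content = soup.find('div', {'class': 'kindContent'})
        if content is None or not content.findAll('li'):
            return i - 1   # the previous page was the last non-empty one
    return upper           # never hit an empty page within the probe range

# e.g. make_links(find_last_page('http://book.easou.com//w/cat_junshi.html'),
#                 'http://book.easou.com//w/cat_junshi.html')

A linear probe costs up to one request per page, so a binary search over the page index would be cheaper; this is just the simplest illustration.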

Let's look at the run results. This is the book-category collection:

(screenshot: the crawled book categories)

And these are the page links constructed for every category. They are the entry points of our crawler, more than 5,000 pages in total:

(screenshot: the constructed page links)

Next comes the wrapper around the database operations. Because the crawler uses multiple processes, each running multiple threads, every worker needs to know which URLs have already been crawled and which still need crawling. So we give each URL one of three states:

outstanding: waiting to be crawled

complete: finished crawling

processing: currently being crawled

Every URL starts out as outstanding; when crawling begins its state is changed to processing; when crawling finishes it becomes complete; and a URL that fails is reset to outstanding. To cope with a crawling process being killed partway through, we also set a timeout: any URL that has been processing longer than that value gets reset to outstanding.

from pymongo import MongoClient,errors
from datetime import datetime,timedelta
class MogoQueue():
    # note: insert / find_and_modify / update below are the pre-pymongo-4.0 collection API
    OUTSTANDING = 1   # waiting to be crawled
    PROCESSING = 2    # currently being crawled
    COMPLETE = 3      # finished
    def __init__(self,db,collection,timeout=300):
        self.client = MongoClient()
        self.database = self.client[db]
        self.db = self.database[collection]
        self.timeout = timeout
    def __bool__(self):
        record = self.db.find_one(
            {'status': {'$ne': self.COMPLETE}}
        )
        return True if record else False
    def push_theme(self,url,title,number):  # add a new URL plus its theme name into the queue
        try:
            self.db.insert({'_id':url,'status':self.OUTSTANDING,'主題':title,'書籍數量':number})
            print(title,url,'inserted into the queue')
        except errors.DuplicateKeyError as e:  # an insert failure means it is already in the queue
            print(title,url,'is already in the queue')
            pass
    def push_queue(self,url):
        try:
            self.db.insert({'_id':url,'status':self.OUTSTANDING})
            print(url,'inserted into the queue')
        except errors.DuplicateKeyError as e:  # an insert failure means it is already in the queue
            print(url,'is already in the queue')
            pass
    def push_book(self,title,author,book_style,book_introduction,book_url):
        try:
            self.db.insert({'_id':book_url,'書籍名稱':title,'書籍作者':author,'書籍類型':book_style,'簡介':book_introduction})
            print(title, 'book inserted into the queue')
        except errors.DuplicateKeyError as e:
            print(title, 'book is already in the queue')
            pass

    def select(self):
        record = self.db.find_and_modify(
            query={'status':self.OUTSTANDING},
            update={'$set':{'status': self.PROCESSING, 'timestamp':datetime.now() }}
        )
        if record:
            return record['_id']
        else:
            self.repair()
            raise KeyError
    def repair(self):
        record = self.db.find_and_modify(
            query={
                'timestamp':{'$lt':datetime.now()-timedelta(seconds=self.timeout)},
                'status':{'$ne':self.COMPLETE}
            },
            update={'$set':{'status':self.OUTSTANDING}}  # reset the state of anything that has timed out
        )
        if record:
            print('reset URL',record['_id'])
    def complete(self,url):
        self.db.update({'_id':url},{'$set':{'status':self.COMPLETE}})
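To make the state machine concrete, here is a small usage sketch (my own addition, not from the original post) of how a worker drives MogoQueue: claim a URL with select(), do the work, then mark it with complete(). A worker that dies after select() just leaves its record in PROCESSING; the next time some worker hits an empty select(), repair() resets any record older than the 300-second timeout back to OUTSTANDING.

queue = MogoQueue('novel_list', 'crawl_queue')

while queue:                   # __bool__ is True while any URL is not yet COMPLETE
    try:
        url = queue.select()   # atomically flips one OUTSTANDING URL to PROCESSING
    except KeyError:
        break                  # nothing claimable right now; stale URLs get repaired on later calls
    # ... fetch and parse `url` here ...
    queue.complete(url)        # reached only if the work above succeeded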

Next is the main crawler program:

from ip_pool_request import html_request   # ip_pool_request / ip_pool_request2 are the author's own proxy-pool request helpers (not shown in this post)
from bs4 import BeautifulSoup
import random
import multiprocessing
import time
import threading
from ip_pool_request2 import download_request
from MogoQueue import MogoQueue
def novel_crawl(max_thread=8):
    crawl_queue = MogoQueue('novel_list','crawl_queue')  # connect to the database; this is the collection of page links the crawler works through
    book_list = MogoQueue('novel_list','book_list')      # the crawled book data goes in here
    def pageurl_crawler():
        while True:
            try:
                url = crawl_queue.select()  # take a link from the database and start crawling it
                print(url)
            except KeyError:  # this exception means every link has been crawled
                print('no data left in the queue')
            else:

                data = html_request.get(url,3)
                soup = BeautifulSoup(data,'lxml')

                all_novel = soup.find('div',{'class':'kindContent'}).findAll('li')

                for novel in all_novel:  # extract all the fields we need
                    text_tag = novel.find('div',{'class':'textShow'})
                    title = text_tag.find('div',{'class':'name'}).find('a').get_text()
                    author = text_tag.find('span',{'class':'author'}).find('a').get_text()
                    book_style = text_tag.find('span',{'class':'kind'}).find('a').get_text()
                    book_introduction = text_tag.find('div',{'class':'desc'}).get_text().strip().replace('\n','')
                    img_tag = novel.find('div',{'class':'imgShow'}).find('a',{'class':'common'})

                    book_url = 'http://book.easou.com/' + img_tag.attrs['href']
                    book_list.push_book(title,author,book_style,book_introduction,book_url)
                    # print(title,author,book_style,book_introduction,book_url)
                crawl_queue.complete(url)  # once the whole page is done, mark the link as complete
    threads = []
    while threads or crawl_queue:
        for thread in threads:
            if not thread.is_alive():
                threads.remove(thread)
        while len(threads) < max_thread:
            thread = threading.Thread(target=pageurl_crawler)  # create a thread (pass the function itself, don't call it here)
            thread.setDaemon(True)  # daemon thread: dies when the main process exits
            thread.start()
            threads.append(thread)
        time.sleep(5)
def process_crawler():
    process = []
    num_cpus = multiprocessing.cpu_count()
    print('number of processes to start:', int(num_cpus)-2)
    for i in range(int(num_cpus)-2):
        p = multiprocessing.Process(target=novel_crawl)  # create a process
        p.start()
        process.append(p)
    for p in process:   # join outside the creation loop, so all processes run in parallel
        p.join()
if __name__ == '__main__':
    process_crawler()

Let's look at the results:

(screenshot: the crawled book records)

Because a lot of the entries were duplicates, only a hundred thousand-odd books were left after deduplication. A bit disappointing...
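For what it's worth, the deduplication is implicit: push_book uses the book URL as the Mongo _id, so the same book URL only gets stored once and later inserts just raise DuplicateKeyError. If you also wanted to count unique works by title and author after the crawl, a small aggregation like the sketch below would do it (my own addition; it assumes title plus author is a good-enough identity for a book, and uses the field names push_book writes):

from pymongo import MongoClient

client = MongoClient()
books = client['novel_list']['book_list']

print('documents stored:', books.count())

# Group by (title, author), then count the groups.
pipeline = [
    {'$group': {'_id': {'title': '$書籍名稱', 'author': '$書籍作者'}}},
    {'$group': {'_id': None, 'unique_books': {'$sum': 1}}},
]
print(list(books.aggregate(pipeline)))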

 

Source: http://www.jianshu.com/p/a1c5183f3f4d

 
