Downloading resources with Python multithreading and a queue

Posted by jopen 12 years ago

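There are open-course video tutorials and lecture slides online, and downloading them by hand is too slow, so I wrote a Python script to do it. I tried to keep it as general as possible so that I can reuse it directly later. The code is below; treat it as a starting point, and suggestions and comments are welcome: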

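    import queue
    import re
    import subprocess
    import threading
    import urllib.request


    class download(threading.Thread):
        """Worker thread: take links off the shared queue and fetch each with wget."""

        def __init__(self, que):
            threading.Thread.__init__(self)
            self.que = que

        def run(self):
            while True:
                try:
                    # get_nowait() avoids a race between empty() and get():
                    # another thread could take the last item in between.
                    link = self.que.get_nowait()
                except queue.Empty:
                    break  # nothing left to download, so the thread exits
                print('-----%s------' % self.name)
                # Pass the arguments as a list so the link never goes through a shell.
                subprocess.run(['wget', link])


    def startDown(url, rule, num, start, end, decoding=None):
        if not decoding:
            decoding = 'utf8'
        # Fetch the page that lists the resources and decode it to text.
        with urllib.request.urlopen(url) as req:
            body = req.read().decode(decoding)
        # Find every match of the rule, then slice the real link out of each match.
        links = re.compile(rule).findall(body)
        que = queue.Queue()
        for l in links:
            que.put(l[start:end])
        # Start num worker threads that share the same queue.
        for i in range(num):
            d = download(que)
            d.start()


    if __name__ == '__main__':
        url = 'https://class.coursera.org/algo-004/lecture/index'
        # [^"]* stops at the closing quote of href; a greedy .* would run on
        # to the last quote on the line.
        rule = r'<a target="_new" href="[^"]*"'
        # 23 is the length of '<a target="_new" href="'; -1 drops the closing quote.
        startDown(url, rule, 10, 23, -1)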

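A brief explanation: the download class inherits from threading.Thread and overrides run. As long as the queue is not empty, it keeps taking the real link address of a resource off the queue and calls wget to download it; once the queue is empty, the thread exits. The startDown function is the entry point for the multithreaded download. Its parameters are: url, the page that lists the resources; rule, the regular expression used for matching; num, the number of threads to start; start, the index within each regex match where the real link begins; end, the index within each regex match where the real link ends; decoding, the encoding of the resource page, utf8 by default.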
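To make the roles of rule, start and end more concrete, here is a minimal offline sketch of the same pipeline. It is not the script itself: the sample HTML, the example.com links, and the helper names `extract_links` and `drain` are all made up for illustration, and `list.append` stands in for the wget call so that nothing is actually downloaded.

```python
import queue
import re
import threading

# Hypothetical sample page, standing in for the HTML that urlopen() would return.
SAMPLE_BODY = '''
<a target="_new" href="https://example.com/lecture1.mp4" class="x">Lecture 1</a>
<a target="_new" href="https://example.com/lecture2.mp4">Lecture 2</a>
'''

RULE = r'<a target="_new" href="[^"]*"'
START = len('<a target="_new" href="')  # length of the fixed prefix before the link
END = -1                                # drop the closing quote


def extract_links(body, rule, start, end):
    """Return the sliced part of every regex match (hypothetical helper)."""
    return [m[start:end] for m in re.findall(rule, body)]


def drain(links, num, handler):
    """Run handler on every link using num worker threads (hypothetical helper)."""
    que = queue.Queue()
    for link in links:
        que.put(link)

    def worker():
        while True:
            try:
                link = que.get_nowait()
            except queue.Empty:
                return  # queue is empty, so this worker exits
            handler(link)

    threads = [threading.Thread(target=worker) for _ in range(num)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait until every link has been handled


links = extract_links(SAMPLE_BODY, RULE, START, END)
handled = []
# list.append stands in for the wget call, so nothing is downloaded here.
drain(links, 3, handled.append)
print(links)
print(sorted(handled))
```

Computing the start index from the length of the fixed prefix avoids counting characters by hand, and passing the handler in as an argument makes the queue-draining part easy to try without network access.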

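Here is what it looks like when I run it (screenshot not included). That's it: the next time I need to download something, I can just import this file.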