Extracting the list of URLs on a page with Python

Posted by pykde 10 years ago | 3K views | Python

from bs4 import BeautifulSoup
import time
import urllib.request

# URLs to exclude from the scan (left empty here).
websiteurls = {}


def scanpage(url):
    """Collect the links on `url` that contain `url`, then print each one's HTTP status."""
    websiteurl = url
    t = time.time()
    n = 0
    html = urllib.request.urlopen(websiteurl).read()
    soup = BeautifulSoup(html, "html.parser")
    Upageurls = {}  # unique link -> HTTP status code
    pageurls = soup.find_all("a", href=True)

    # Keep each link once, and only if it contains the starting URL.
    for links in pageurls:
        href = links.get("href")
        if websiteurl in href and href not in Upageurls and href not in websiteurls:
            Upageurls[href] = 0

    # Request every collected link once and report its status and response time.
    for links in Upageurls:
        t2 = time.time()
        try:
            Upageurls[links] = urllib.request.urlopen(links).getcode()
        except Exception:
            print("connect failed")
        else:
            print(n, links, Upageurls[links])
            print(time.time() - t2)  # seconds spent on this link
        n += 1
    print("total is " + repr(n) + " links")
    print(time.time() - t)  # total elapsed seconds


scanpage("http://news.163.com/")
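The script above keeps only links whose href literally contains the starting URL, so relative links such as /sports/ are skipped. Below is a minimal sketch of one way to pick those up as well, written for Python 3 and using only the standard library (html.parser and urllib.parse) instead of BeautifulSoup. The names LinkCollector and extract_urls are hypothetical helpers made up for this illustration, and the sample HTML is invented; none of them come from the original script.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Hypothetical helper: gathers the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_urls(html, base_url):
    """Return the unique same-host URLs in `html`, in order of first appearance."""
    collector = LinkCollector()
    collector.feed(html)
    base_host = urlparse(base_url).netloc
    seen = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)  # turns relative links into absolute ones
        if urlparse(absolute).netloc == base_host and absolute not in seen:
            seen.append(absolute)
    return seen


# Invented sample page: one relative link, one absolute duplicate, one external link.
sample = (
    '<a href="/sports/">A</a>'
    '<a href="http://news.163.com/sports/">B</a>'
    '<a href="http://example.com/">C</a>'
)
print(extract_urls(sample, "http://news.163.com/"))
# prints ['http://news.163.com/sports/']
```

Comparing host names rather than testing for a substring is a design choice of this sketch: it also drops hrefs such as javascript:void(0), which have no host. The returned list could then be passed to a status-checking loop like the one in scanpage.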

Article URL: http://www.baiduhome.net/code/view/1434378966567
