Python urllib2 Notes (Web Scraping)
Source: http://my.oschina.net/v5871314/blog/612742
0. A simple example
With Python's urllib2 library, fetching a web page takes only a few lines. The code below fetches the Baidu homepage and prints it.
# -*- coding: utf-8 -*-
import urllib2

response = urllib2.urlopen("http://www.baidu.com")
print response.read()
Code analysis
First, look at the prototype of urllib2.urlopen():
urllib2.urlopen(url[, data[, timeout[, cafile[, capath[, cadefault[, context]]]]]])
Open the URL url, which can be either a string or a Request object.
i) timeout: the timeout in seconds
ii) data: the data to submit, which must be encoded with urllib.urlencode() first
iii) url: the URL to request, either a string or a Request object
1. Submitting data
A) POST request
# -*- coding: utf-8 -*-
import urllib
import urllib2

url = 'http://httpbin.org/post'
post_data = {'key1': 'value1', 'key2': 'value2'}
formal_post_data = urllib.urlencode(post_data)
response = urllib2.urlopen(url, formal_post_data)
print response.read()
Output:
B) GET request
# -*- coding: utf-8 -*-
import urllib
import urllib2

get_data = {'key1': 'value1', 'key2': 'value2'}
formal_get_data = urllib.urlencode(get_data)
url = 'http://httpbin.org/get' + '?' + formal_get_data
response = urllib2.urlopen(url)
print response.read()
Output:
2. Request objects
Note that the first argument to urllib2.urlopen() can also be a Request object; using a Request object makes it easier to bundle the URL, data, and headers together.
Prototype: urllib2.Request(url[, data][, headers][, origin_req_host][, unverifiable])
# -*- coding: utf-8 -*-
import urllib
import urllib2

url = 'http://httpbin.org/post'
post_data = {'key1': 'value1', 'key2': 'value2'}
formal_post_data = urllib.urlencode(post_data)
# set headers
headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
request = urllib2.Request(url, formal_post_data, headers)
response = urllib2.urlopen(request)
# assume the response body is encoded in utf-8
content = response.read().decode('utf-8')
print content
Output:
Methods of the Request object
# -*- coding: utf-8 -*-
import urllib
import urllib2

url = 'http://httpbin.org/post'
post_data = {'key1': 'value1', 'key2': 'value2'}
formal_post_data = urllib.urlencode(post_data)
# set headers
headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
request = urllib2.Request(url, formal_post_data, headers)

print u'Return the request method (POST/GET)'
method = request.get_method()
print 'get_method===>' + method

print u'Return the submitted data'
data = request.get_data()
print 'request.get_data()===>', data

print u'Return the URL that was passed in'
full_url = request.get_full_url()
print 'request.get_full_url()===>', full_url

print u'Return the scheme of the request'
request_type = request.get_type()
print 'request.get_type()===>', request_type

print u'Return the host of the request'
host = request.get_host()
print 'request.get_host()===>', host

print u'Return the selector: the part of the URL that is sent to the server'
selector = request.get_selector()
print 'request.get_selector()===>', selector

print u'Return the request headers'
header_items = request.header_items()
print 'request.header_items()===>', header_items

## get_header(header_name, default=None) - get the named header
## Request.add_header(key, val) - add a header
## Request.has_header(header) - check whether the instance has the named header
## Request.has_data() - check whether the instance carries POST data
Output:
3. Response objects
The response object returned by urllib2.urlopen() provides the following methods:
geturl(): return the URL of the resource actually retrieved, commonly used to check whether a redirect was followed
info(): return the page's meta-information, such as the headers, as a mimetools.Message instance
getcode(): return the HTTP status code of the response
# -*- coding: utf-8 -*-
import urllib
import urllib2

url = 'http://httpbin.org/post'
post_data = {'key1': 'value1', 'key2': 'value2'}
formal_post_data = urllib.urlencode(post_data)
# set headers
headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
request = urllib2.Request(url, formal_post_data, headers)
response = urllib2.urlopen(request)

print u'Get the real URL (the URL after any redirects)'
print response.geturl()

print u'Get the response status code'
print response.getcode()

print u'Meta-information of the page'
print response.info()
Output:
4. Commonly used code
# -*- coding: utf-8 -*-
import urllib
import urllib2

url = 'http://httpbin.org/post'
post_data = {'key1': 'value1', 'key2': 'value2'}
formal_post_data = urllib.urlencode(post_data)
# set headers
headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
request = urllib2.Request(url, formal_post_data, headers)
response = urllib2.urlopen(request)
# assume the response body is encoded in utf-8
content = response.read().decode('utf-8')
print content