Python urllib2 Notes (Web Scraping)
Source: http://my.oschina.net/v5871314/blog/612742
0. A simple example
Python's urllib2 library makes it easy to fetch a web page. The code below fetches the Baidu home page and prints it:
# -*- coding: utf-8 -*-
import urllib2

response = urllib2.urlopen("http://www.baidu.com")
print response.read()

Code analysis
First, look at the prototype of urllib2.urlopen():
urllib2.urlopen(url[, data[, timeout[, cafile[, capath[, cadefault[, context]]]]]])
Open the URL url, which can be either a string or a Request object.
i) url - the URL to request, either a string or a Request object
ii) data - the data to submit in the request body; it must first be encoded with urllib.urlencode()
iii) timeout - timeout for blocking operations, in seconds
1. Submitting data
A) POST request
# -*- coding: utf-8 -*-
import urllib
import urllib2

url = 'http://httpbin.org/post'
post_data = {'key1': 'value1', 'key2': 'value2'}
formal_post_data = urllib.urlencode(post_data)
response = urllib2.urlopen(url, formal_post_data)
print response.read()

Result: httpbin.org echoes the submitted fields back in the "form" field of its JSON response.
B) GET request
# -*- coding: utf-8 -*-
import urllib
import urllib2

get_data = {'key1': 'value1', 'key2': 'value2'}
formal_get_data = urllib.urlencode(get_data)
url = 'http://httpbin.org/get' + '?' + formal_get_data
response = urllib2.urlopen(url)
print response.read()

Result: the parameters appear in the "args" field of httpbin's JSON response.
2. The Request object
Note that the first argument of urllib2.urlopen() can also be a Request object; using a Request makes it more convenient to package up the URL, the data and the headers together.
Prototype: urllib2.Request(url[, data][, headers][, origin_req_host][, unverifiable])
# -*- coding: utf-8 -*-
import urllib
import urllib2

url = 'http://httpbin.org/post'
post_data = {'key1': 'value1', 'key2': 'value2'}
formal_post_data = urllib.urlencode(post_data)
# set headers
headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
request = urllib2.Request(url, formal_post_data, headers)
response = urllib2.urlopen(request)
# assume the response body is encoded in UTF-8
content = response.read().decode('utf-8')
print content

Result: httpbin echoes both the form data and the request headers back as JSON.
Request helper methods
# -*- coding: utf-8 -*-
import urllib
import urllib2

url = 'http://httpbin.org/post'
post_data = {'key1': 'value1', 'key2': 'value2'}
formal_post_data = urllib.urlencode(post_data)
# set headers
headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
request = urllib2.Request(url, formal_post_data, headers)
print 'Return the request method (GET or POST)'
method = request.get_method()
print 'get_method===>' + method
print 'Return the submitted data'
data = request.get_data()
print 'request.get_data()===>', data
print 'Return the URL passed in'
full_url = request.get_full_url()
print 'request.get_full_url()===>', full_url
print 'Return the scheme of the request'
request_type = request.get_type()
print 'request.get_type()===>', request_type
print 'Return the host of the request'
host = request.get_host()
print 'request.get_host()===>', host
print 'Return the selector - the part of the URL that is sent to the server'
selector = request.get_selector()
print 'request.get_selector()===>', selector
print 'Return the request headers'
header_items = request.header_items()
print 'request.header_items()===>', header_items
## get_header(header_name, default=None) - return the named header
## Request.add_header(key, val) - add a header
## Request.has_header(header) - check whether the instance has the named header
## Request.has_data() - check whether POST data is present

3. The Response object
The response object returned by urllib2.urlopen() has the following methods:
geturl() - return the URL of the resource actually retrieved; commonly used to determine whether a redirect was followed
info() - return the page's meta-information, such as the headers, as a mimetools.Message instance
getcode() - return the HTTP status code of the response
# -*- coding: utf-8 -*-
import urllib
import urllib2

url = 'http://httpbin.org/post'
post_data = {'key1': 'value1', 'key2': 'value2'}
formal_post_data = urllib.urlencode(post_data)
# set headers
headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
request = urllib2.Request(url, formal_post_data, headers)
response = urllib2.urlopen(request)
print 'The real URL (after any redirects)'
print response.geturl()
print 'The response status code'
print response.getcode()
print 'The page meta-information'
print response.info()

4. Common boilerplate
# -*- coding: utf-8 -*-
import urllib
import urllib2

url = 'http://httpbin.org/post'
post_data = {'key1': 'value1', 'key2': 'value2'}
formal_post_data = urllib.urlencode(post_data)
# set headers
headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
request = urllib2.Request(url, formal_post_data, headers)
response = urllib2.urlopen(request)
# assume the response body is encoded in UTF-8
content = response.read().decode('utf-8')
print content