Python urllib2 Notes (Web Scraping)
Source: http://my.oschina.net/v5871314/blog/612742
0. A simple example
Python's urllib2 library makes it easy to fetch a web page. The code below fetches the Baidu home page and prints it:
# -*- coding: utf-8 -*-
import urllib2

response = urllib2.urlopen("http://www.baidu.com")
print response.read()

Code analysis
First, look at the prototype of urllib2.urlopen():
urllib2.urlopen(url[, data[, timeout[, cafile[, capath[, cadefault[, context]]]]]])
Open the URL url, which can be either a string or a Request object.
i) url - the URL to request, either a string or a Request object
ii) data - the data to submit in the request body; it must first be encoded with urllib.urlencode()
iii) timeout - timeout for blocking operations, in seconds
1. Submitting data
A) POST request
# -*- coding: utf-8 -*-
import urllib
import urllib2

url = 'http://httpbin.org/post'
post_data = {'key1': 'value1', 'key2': 'value2'}
formal_post_data = urllib.urlencode(post_data)
response = urllib2.urlopen(url, formal_post_data)
print response.read()

Result: httpbin.org echoes the submitted fields back in the "form" field of its JSON response.
B) GET request
# -*- coding: utf-8 -*-
import urllib
import urllib2

get_data = {'key1': 'value1', 'key2': 'value2'}
formal_get_data = urllib.urlencode(get_data)
url = 'http://httpbin.org/get' + '?' + formal_get_data
response = urllib2.urlopen(url)
print response.read()

Result: the parameters appear in the "args" field of httpbin's JSON response.
2. The Request object
Note that the first argument of urllib2.urlopen() can also be a Request object; using a Request makes it more convenient to package up the URL, the data and the headers together.
Prototype: urllib2.Request(url[, data][, headers][, origin_req_host][, unverifiable])
# -*- coding: utf-8 -*-
import urllib
import urllib2

url = 'http://httpbin.org/post'
post_data = {'key1': 'value1', 'key2': 'value2'}
formal_post_data = urllib.urlencode(post_data)
# set headers
headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
request = urllib2.Request(url, formal_post_data, headers)
response = urllib2.urlopen(request)
# assume the response body is encoded in UTF-8
content = response.read().decode('utf-8')
print content

Result: httpbin echoes both the form data and the request headers back as JSON.
Request helper methods
# -*- coding: utf-8 -*-
import urllib
import urllib2

url = 'http://httpbin.org/post'
post_data = {'key1': 'value1', 'key2': 'value2'}
formal_post_data = urllib.urlencode(post_data)
# set headers
headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
request = urllib2.Request(url, formal_post_data, headers)
print 'Return the request method (GET or POST)'
method = request.get_method()
print 'get_method===>' + method
print 'Return the submitted data'
data = request.get_data()
print 'request.get_data()===>', data
print 'Return the URL passed in'
full_url = request.get_full_url()
print 'request.get_full_url()===>', full_url
print 'Return the scheme of the request'
request_type = request.get_type()
print 'request.get_type()===>', request_type
print 'Return the host of the request'
host = request.get_host()
print 'request.get_host()===>', host
print 'Return the selector - the part of the URL that is sent to the server'
selector = request.get_selector()
print 'request.get_selector()===>', selector
print 'Return the request headers'
header_items = request.header_items()
print 'request.header_items()===>', header_items
## get_header(header_name, default=None) - return the named header
## Request.add_header(key, val) - add a header
## Request.has_header(header) - check whether the instance has the named header
## Request.has_data() - check whether POST data is present

3. The Response object
The response object returned by urllib2.urlopen() has the following methods:
geturl() - return the URL of the resource actually retrieved; commonly used to determine whether a redirect was followed
info() - return the page's meta-information, such as the headers, as a mimetools.Message instance
getcode() - return the HTTP status code of the response
# -*- coding: utf-8 -*-
import urllib
import urllib2

url = 'http://httpbin.org/post'
post_data = {'key1': 'value1', 'key2': 'value2'}
formal_post_data = urllib.urlencode(post_data)
# set headers
headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
request = urllib2.Request(url, formal_post_data, headers)
response = urllib2.urlopen(request)
print 'The real URL (after any redirects)'
print response.geturl()
print 'The response status code'
print response.getcode()
print 'The page meta-information'
print response.info()

4. Common boilerplate
# -*- coding: utf-8 -*-
import urllib
import urllib2

url = 'http://httpbin.org/post'
post_data = {'key1': 'value1', 'key2': 'value2'}
formal_post_data = urllib.urlencode(post_data)
# set headers
headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
request = urllib2.Request(url, formal_post_data, headers)
response = urllib2.urlopen(request)
# assume the response body is encoded in UTF-8
content = response.read().decode('utf-8')
print content