Using wget to Crawl Web Pages and Images
A Strange Requirement
The company wants the pages served by our servers cached on a router, so that users requesting one of those pages are served the cached copy straight from the router. I don't know what the point of this requirement is, but I'll do my best to implement it.
wget Overview
wget is a web retrieval tool for Unix and Unix-like systems; once I got familiar with it, I discovered it can do far more than that. This post, though, covers only one task: crawling a given URL together with the related content under it (HTML, JS, CSS, and images) and converting the absolute paths inside that content to relative paths. A web search turns up a pile of articles about wget and how it fetches pages with their image resources, but I never found a single one that actually worked; every recipe I tried ended in failure.
Here is the output of wget -h > ./help_wget.txt:
GNU Wget 1.16, a non-interactive network retriever.
Usage: wget [OPTION]... [URL]...

Mandatory arguments to long options are mandatory for short options too.

Startup:
  -V,  --version                   display the version of Wget and exit.
  -h,  --help                      print this help.
  -b,  --background                go to background after startup.
  -e,  --execute=COMMAND           execute a `.wgetrc'-style command.

Logging and input file:
  -o,  --output-file=FILE          log messages to FILE.
  -a,  --append-output=FILE        append messages to FILE.
  -q,  --quiet                     quiet (no output).
  -v,  --verbose                   be verbose (this is the default).
  -nv, --no-verbose                turn off verboseness, without being quiet.
       --report-speed=TYPE         Output bandwidth as TYPE. TYPE can be bits.
  -i,  --input-file=FILE           download URLs found in local or external FILE.
  -F,  --force-html                treat input file as HTML.
  -B,  --base=URL                  resolves HTML input-file links (-i -F)
                                   relative to URL.
       --config=FILE               Specify config file to use.
       --no-config                 Do not read any config file.

Download:
  -t,  --tries=NUMBER              set number of retries to NUMBER (0 unlimits).
       --retry-connrefused         retry even if connection is refused.
  -O,  --output-document=FILE      write documents to FILE.
  -nc, --no-clobber                skip downloads that would download to
                                   existing files (overwriting them).
  -c,  --continue                  resume getting a partially-downloaded file.
       --start-pos=OFFSET          start downloading from zero-based position OFFSET.
       --progress=TYPE             select progress gauge type.
       --show-progress             display the progress bar in any verbosity mode.
  -N,  --timestamping              don't re-retrieve files unless newer than
                                   local.
       --no-use-server-timestamps  don't set the local file's timestamp by
                                   the one on the server.
  -S,  --server-response           print server response.
       --spider                    don't download anything.
  -T,  --timeout=SECONDS           set all timeout values to SECONDS.
       --dns-timeout=SECS          set the DNS lookup timeout to SECS.
       --connect-timeout=SECS      set the connect timeout to SECS.
       --read-timeout=SECS         set the read timeout to SECS.
  -w,  --wait=SECONDS              wait SECONDS between retrievals.
       --waitretry=SECONDS         wait 1..SECONDS between retries of a retrieval.
       --random-wait               wait from 0.5*WAIT...1.5*WAIT secs between retrievals.
       --no-proxy                  explicitly turn off proxy.
  -Q,  --quota=NUMBER              set retrieval quota to NUMBER.
       --bind-address=ADDRESS      bind to ADDRESS (hostname or IP) on local host.
       --limit-rate=RATE           limit download rate to RATE.
       --no-dns-cache              disable caching DNS lookups.
       --restrict-file-names=OS    restrict chars in file names to ones OS allows.
       --ignore-case               ignore case when matching files/directories.
  -4,  --inet4-only                connect only to IPv4 addresses.
  -6,  --inet6-only                connect only to IPv6 addresses.
       --prefer-family=FAMILY      connect first to addresses of specified family,
                                   one of IPv6, IPv4, or none.
       --user=USER                 set both ftp and http user to USER.
       --password=PASS             set both ftp and http password to PASS.
       --ask-password              prompt for passwords.
       --no-iri                    turn off IRI support.
       --local-encoding=ENC        use ENC as the local encoding for IRIs.
       --remote-encoding=ENC       use ENC as the default remote encoding.
       --unlink                    remove file before clobber.

Directories:
  -nd, --no-directories            don't create directories.
  -x,  --force-directories         force creation of directories.
  -nH, --no-host-directories       don't create host directories.
       --protocol-directories      use protocol name in directories.
  -P,  --directory-prefix=PREFIX   save files to PREFIX/...
       --cut-dirs=NUMBER           ignore NUMBER remote directory components.

HTTP options:
       --http-user=USER            set http user to USER.
       --http-password=PASS        set http password to PASS.
       --no-cache                  disallow server-cached data.
       --default-page=NAME         Change the default page name (normally
                                   this is `index.html'.).
  -E,  --adjust-extension          save HTML/CSS documents with proper extensions.
       --ignore-length             ignore `Content-Length' header field.
       --header=STRING             insert STRING among the headers.
       --max-redirect              maximum redirections allowed per page.
       --proxy-user=USER           set USER as proxy username.
       --proxy-password=PASS       set PASS as proxy password.
       --referer=URL               include `Referer: URL' header in HTTP request.
       --save-headers              save the HTTP headers to file.
  -U,  --user-agent=AGENT          identify as AGENT instead of Wget/VERSION.
       --no-http-keep-alive        disable HTTP keep-alive (persistent connections).
       --no-cookies                don't use cookies.
       --load-cookies=FILE         load cookies from FILE before session.
       --save-cookies=FILE         save cookies to FILE after session.
       --keep-session-cookies      load and save session (non-permanent) cookies.
       --post-data=STRING          use the POST method; send STRING as the data.
       --post-file=FILE            use the POST method; send contents of FILE.
       --method=HTTPMethod         use method "HTTPMethod" in the request.
       --body-data=STRING          Send STRING as data. --method MUST be set.
       --body-file=FILE            Send contents of FILE. --method MUST be set.
       --content-disposition       honor the Content-Disposition header when
                                   choosing local file names (EXPERIMENTAL).
       --content-on-error          output the received content on server errors.
       --auth-no-challenge         send Basic HTTP authentication information
                                   without first waiting for the server's
                                   challenge.

HTTPS (SSL/TLS) options:
       --secure-protocol=PR        choose secure protocol, one of auto, SSLv2,
                                   SSLv3, TLSv1 and PFS.
       --https-only                only follow secure HTTPS links
       --no-check-certificate      don't validate the server's certificate.
       --certificate=FILE          client certificate file.
       --certificate-type=TYPE     client certificate type, PEM or DER.
       --private-key=FILE          private key file.
       --private-key-type=TYPE     private key type, PEM or DER.
       --ca-certificate=FILE       file with the bundle of CA's.
       --ca-directory=DIR          directory where hash list of CA's is stored.
       --random-file=FILE          file with random data for seeding the SSL PRNG.
       --egd-file=FILE             file naming the EGD socket with random data.

FTP options:
       --ftp-user=USER             set ftp user to USER.
       --ftp-password=PASS         set ftp password to PASS.
       --no-remove-listing         don't remove `.listing' files.
       --no-glob                   turn off FTP file name globbing.
       --no-passive-ftp            disable the "passive" transfer mode.
       --preserve-permissions      preserve remote file permissions.
       --retr-symlinks             when recursing, get linked-to files (not dir).

WARC options:
       --warc-file=FILENAME        save request/response data to a .warc.gz file.
       --warc-header=STRING        insert STRING into the warcinfo record.
       --warc-max-size=NUMBER      set maximum size of WARC files to NUMBER.
       --warc-cdx                  write CDX index files.
       --warc-dedup=FILENAME       do not store records listed in this CDX file.
       --no-warc-compression       do not compress WARC files with GZIP.
       --no-warc-digests           do not calculate SHA1 digests.
       --no-warc-keep-log          do not store the log file in a WARC record.
       --warc-tempdir=DIRECTORY    location for temporary files created by the
                                   WARC writer.

Recursive download:
  -r,  --recursive                 specify recursive download.
  -l,  --level=NUMBER              maximum recursion depth (inf or 0 for infinite).
       --delete-after              delete files locally after downloading them.
  -k,  --convert-links             make links in downloaded HTML or CSS point to
                                   local files.
       --backups=N                 before writing file X, rotate up to N backup files.
  -K,  --backup-converted          before converting file X, back up as X.orig.
  -m,  --mirror                    shortcut for -N -r -l inf --no-remove-listing.
  -p,  --page-requisites           get all images, etc. needed to display HTML page.
       --strict-comments           turn on strict (SGML) handling of HTML comments.

Recursive accept/reject:
  -A,  --accept=LIST               comma-separated list of accepted extensions.
  -R,  --reject=LIST               comma-separated list of rejected extensions.
       --accept-regex=REGEX        regex matching accepted URLs.
       --reject-regex=REGEX        regex matching rejected URLs.
       --regex-type=TYPE           regex type (posix).
  -D,  --domains=LIST              comma-separated list of accepted domains.
       --exclude-domains=LIST      comma-separated list of rejected domains.
       --follow-ftp                follow FTP links from HTML documents.
       --follow-tags=LIST          comma-separated list of followed HTML tags.
       --ignore-tags=LIST          comma-separated list of ignored HTML tags.
  -H,  --span-hosts                go to foreign hosts when recursive.
  -L,  --relative                  follow relative links only.
  -I,  --include-directories=LIST  list of allowed directories.
       --trust-server-names        use the name specified by the redirection
                                   url last component.
  -X,  --exclude-directories=LIST  list of excluded directories.
  -np, --no-parent                 don't ascend to the parent directory.
Mail bug reports and suggestions to <bug-wget@gnu.org>.

A First Attempt
Following wget's help documentation, I tried the command below:
wget -r -np -pk -nH -P ./download http://www.baidu.com
-r recursively download everything
-np download only what lives under the given URL; never ascend to its parent
-p download all the resources the page needs, including images and CSS styles
-k convert absolute paths into relative paths (this one matters: when a user opens the page, the resources it references should be looked up locally)
-nH keep wget from creating a directory named after the URL's host (without this, the command stores everything under ./download/www.baidu.com/)
-P where to download to, here the download folder under the current directory; if it doesn't exist, wget creates it for you (an equivalent long-option form is shown below)
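For reference, here is the same command spelled with long options; the mapping comes straight from the help text above:

wget --recursive --no-parent --page-requisites --convert-links --no-host-directories --directory-prefix=./download http://www.baidu.com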
All of these options fit the requirement, but the result came as a surprise: things were not as simple as we imagined, and wget did not deliver what we wanted.
If you run this command, you will find the download folder contains nothing but an index.html and a robots.txt, and the images index.html needs were never downloaded.
Nor were the paths in the <img> tags rewritten to relative paths; at most the string "http:" seems to have been stripped off.
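That is, an attribute such as src="http://www.baidu.com/img/bd_logo1.png" ends up as something like the following (illustrative; the exact URLs depend on the page):

src="//www.baidu.com/img/bd_logo1.png"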
As for why this happens, read on.
The Right Way with wget
Since the command above went nowhere, it was time to get creative. Let's write a shell script, named wget_cc, with the following contents:
#!/bin/sh
# Usage: ./wget_cc <download-dir> <url>
URL="$2"
DIR="$1"    # a neutral name; assigning to PATH would clobber the shell's search path

echo "download url: $URL"
echo "download dir: $DIR"

/usr/bin/wget -e robots=off -w 1 -xq -np -nH -pk -m -t 1 -P "$DIR" "$URL"

echo "download finished"
A few extra options appear here; let me explain them:
-e used as '-e command'.
It executes an extra .wgetrc-style command. Just as vim keeps its configuration in a .vimrc file, wget keeps its configuration in a .wgetrc file, which means that before wget runs, it first executes the configuration commands in .wgetrc. For a typical .wgetrc, see:
http://www.gnu.org/software/wget/manual/html_node/Sample-Wgetrc.html
http://www.gnu.org/software/wget/manual/html_node/Wgetrc-Commands.html
With -e you can supply extra configuration commands without rewriting .wgetrc. To supply several, just write -e command1 -e command2 ... -e commandN. These commands run after everything in .wgetrc and therefore override any identical settings in it.
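For example, a .wgetrc carrying the same two settings that the script above passes on the command line would look roughly like this (a sketch; robots and wait are both documented .wgetrc commands):

robots = off
wait = 1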
robots=off is there because wget by default honors the site's robots.txt, and if robots.txt contains User-agent: * Disallow: /, wget cannot mirror the site or download its directories.
This is exactly why the images and other resources could not be downloaded earlier: the HOST you want to crawl forbids spiders from crawling it, and -e robots=off lets wget bypass that restriction (you can verify this yourself; see the check sketched after the option list below).
-x create the directory structure that mirrors the site
-q download quietly, printing no progress information; drop this option if you want to see which resource wget is currently fetching
-m turn on mirroring-related options, such as recursing into subdirectories to unlimited depth
-t times how many times to retry a resource whose download failed
-w seconds how long to wait between successive requests (to ease the load on the server)
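If you want to confirm that a site's robots.txt is what blocked the first attempt, you can print it to stdout (the actual contents vary by site; www.baidu.com is just this post's running example):

wget -qO- http://www.baidu.com/robots.txt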
Anything else you don't understand, go dig through the documentation.
Save the file, exit your editor, and run:
chmod 744 wget_cc
Now let the script loose!
./wget_cc ./download http://www.baidu.com
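If all goes well, the script's own messages are the only output (the -q option keeps wget itself silent), roughly:

download url: http://www.baidu.com
download dir: ./download
download finished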
OK. Now look at the src attribute of the <img> tags again:
src="img/bd_logo1.png"
Sure enough, it has been rewritten to a relative path. Job done. If you found this helpful, please give it a like!
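To check all the rewritten references at once rather than one tag at a time, something like this works (assuming the ./download prefix used above):

grep -o 'src="[^"]*"' ./download/index.html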
This is Freestyletime@foxmail.com; feedback and discussion are welcome.
Source: http://my.oschina.net/freestyletime/blog/356985