Using wget to Crawl Web Pages and Images
A Strange Requirement
The company wants the pages served by our servers cached on a router, so that users requesting one of those pages are served the cached copy straight from the router. I don't know what the point of this requirement is, but I'll do my best to implement it.
wget Overview
wget is a web retrieval tool for Unix and Unix-like systems; once I got familiar with it, I discovered it can do far more than that. This post, though, covers only one task: crawling a given URL together with the related content under it (HTML, JS, CSS, and images) and converting the absolute paths inside that content to relative paths. A web search turns up a pile of articles about wget and how it fetches pages with their image resources, but I never found a single one that actually worked; every recipe I tried ended in failure.
Here is the output of wget -h > ./help_wget.txt:
GNU Wget 1.16, a non-interactive network retriever.
Usage: wget [OPTION]... [URL]...

Mandatory arguments to long options are mandatory for short options too.

Startup:
  -V,  --version                   display the version of Wget and exit.
  -h,  --help                      print this help.
  -b,  --background                go to background after startup.
  -e,  --execute=COMMAND           execute a `.wgetrc'-style command.

Logging and input file:
  -o,  --output-file=FILE          log messages to FILE.
  -a,  --append-output=FILE        append messages to FILE.
  -q,  --quiet                     quiet (no output).
  -v,  --verbose                   be verbose (this is the default).
  -nv, --no-verbose                turn off verboseness, without being quiet.
       --report-speed=TYPE         Output bandwidth as TYPE. TYPE can be bits.
  -i,  --input-file=FILE           download URLs found in local or external FILE.
  -F,  --force-html                treat input file as HTML.
  -B,  --base=URL                  resolves HTML input-file links (-i -F)
                                   relative to URL.
       --config=FILE               Specify config file to use.
       --no-config                 Do not read any config file.

Download:
  -t,  --tries=NUMBER              set number of retries to NUMBER (0 unlimits).
       --retry-connrefused         retry even if connection is refused.
  -O,  --output-document=FILE      write documents to FILE.
  -nc, --no-clobber                skip downloads that would download to
                                   existing files (overwriting them).
  -c,  --continue                  resume getting a partially-downloaded file.
       --start-pos=OFFSET          start downloading from zero-based position OFFSET.
       --progress=TYPE             select progress gauge type.
       --show-progress             display the progress bar in any verbosity mode.
  -N,  --timestamping              don't re-retrieve files unless newer than
                                   local.
       --no-use-server-timestamps  don't set the local file's timestamp by
                                   the one on the server.
  -S,  --server-response           print server response.
       --spider                    don't download anything.
  -T,  --timeout=SECONDS           set all timeout values to SECONDS.
       --dns-timeout=SECS          set the DNS lookup timeout to SECS.
       --connect-timeout=SECS      set the connect timeout to SECS.
       --read-timeout=SECS         set the read timeout to SECS.
  -w,  --wait=SECONDS              wait SECONDS between retrievals.
       --waitretry=SECONDS         wait 1..SECONDS between retries of a retrieval.
       --random-wait               wait from 0.5*WAIT...1.5*WAIT secs between retrievals.
       --no-proxy                  explicitly turn off proxy.
  -Q,  --quota=NUMBER              set retrieval quota to NUMBER.
       --bind-address=ADDRESS      bind to ADDRESS (hostname or IP) on local host.
       --limit-rate=RATE           limit download rate to RATE.
       --no-dns-cache              disable caching DNS lookups.
       --restrict-file-names=OS    restrict chars in file names to ones OS allows.
       --ignore-case               ignore case when matching files/directories.
  -4,  --inet4-only                connect only to IPv4 addresses.
  -6,  --inet6-only                connect only to IPv6 addresses.
       --prefer-family=FAMILY      connect first to addresses of specified family,
                                   one of IPv6, IPv4, or none.
       --user=USER                 set both ftp and http user to USER.
       --password=PASS             set both ftp and http password to PASS.
       --ask-password              prompt for passwords.
       --no-iri                    turn off IRI support.
       --local-encoding=ENC        use ENC as the local encoding for IRIs.
       --remote-encoding=ENC       use ENC as the default remote encoding.
       --unlink                    remove file before clobber.

Directories:
  -nd, --no-directories            don't create directories.
  -x,  --force-directories         force creation of directories.
  -nH, --no-host-directories       don't create host directories.
       --protocol-directories      use protocol name in directories.
  -P,  --directory-prefix=PREFIX   save files to PREFIX/...
       --cut-dirs=NUMBER           ignore NUMBER remote directory components.

HTTP options:
       --http-user=USER            set http user to USER.
       --http-password=PASS        set http password to PASS.
       --no-cache                  disallow server-cached data.
       --default-page=NAME         Change the default page name (normally
                                   this is `index.html'.).
  -E,  --adjust-extension          save HTML/CSS documents with proper extensions.
       --ignore-length             ignore `Content-Length' header field.
       --header=STRING             insert STRING among the headers.
       --max-redirect              maximum redirections allowed per page.
       --proxy-user=USER           set USER as proxy username.
       --proxy-password=PASS       set PASS as proxy password.
       --referer=URL               include `Referer: URL' header in HTTP request.
       --save-headers              save the HTTP headers to file.
  -U,  --user-agent=AGENT          identify as AGENT instead of Wget/VERSION.
       --no-http-keep-alive        disable HTTP keep-alive (persistent connections).
       --no-cookies                don't use cookies.
       --load-cookies=FILE         load cookies from FILE before session.
       --save-cookies=FILE         save cookies to FILE after session.
       --keep-session-cookies      load and save session (non-permanent) cookies.
       --post-data=STRING          use the POST method; send STRING as the data.
       --post-file=FILE            use the POST method; send contents of FILE.
       --method=HTTPMethod         use method "HTTPMethod" in the request.
       --body-data=STRING          Send STRING as data. --method MUST be set.
       --body-file=FILE            Send contents of FILE. --method MUST be set.
       --content-disposition       honor the Content-Disposition header when
                                   choosing local file names (EXPERIMENTAL).
       --content-on-error          output the received content on server errors.
       --auth-no-challenge         send Basic HTTP authentication information
                                   without first waiting for the server's
                                   challenge.

HTTPS (SSL/TLS) options:
       --secure-protocol=PR        choose secure protocol, one of auto, SSLv2,
                                   SSLv3, TLSv1 and PFS.
       --https-only                only follow secure HTTPS links
       --no-check-certificate      don't validate the server's certificate.
       --certificate=FILE          client certificate file.
       --certificate-type=TYPE     client certificate type, PEM or DER.
       --private-key=FILE          private key file.
       --private-key-type=TYPE     private key type, PEM or DER.
       --ca-certificate=FILE       file with the bundle of CA's.
       --ca-directory=DIR          directory where hash list of CA's is stored.
       --random-file=FILE          file with random data for seeding the SSL PRNG.
       --egd-file=FILE             file naming the EGD socket with random data.

FTP options:
       --ftp-user=USER             set ftp user to USER.
       --ftp-password=PASS         set ftp password to PASS.
       --no-remove-listing         don't remove `.listing' files.
       --no-glob                   turn off FTP file name globbing.
       --no-passive-ftp            disable the "passive" transfer mode.
       --preserve-permissions      preserve remote file permissions.
       --retr-symlinks             when recursing, get linked-to files (not dir).

WARC options:
       --warc-file=FILENAME        save request/response data to a .warc.gz file.
       --warc-header=STRING        insert STRING into the warcinfo record.
       --warc-max-size=NUMBER      set maximum size of WARC files to NUMBER.
       --warc-cdx                  write CDX index files.
       --warc-dedup=FILENAME       do not store records listed in this CDX file.
       --no-warc-compression       do not compress WARC files with GZIP.
       --no-warc-digests           do not calculate SHA1 digests.
       --no-warc-keep-log          do not store the log file in a WARC record.
       --warc-tempdir=DIRECTORY    location for temporary files created by the
                                   WARC writer.

Recursive download:
  -r,  --recursive                 specify recursive download.
  -l,  --level=NUMBER              maximum recursion depth (inf or 0 for infinite).
       --delete-after              delete files locally after downloading them.
  -k,  --convert-links             make links in downloaded HTML or CSS point to
                                   local files.
       --backups=N                 before writing file X, rotate up to N backup files.
  -K,  --backup-converted          before converting file X, back up as X.orig.
  -m,  --mirror                    shortcut for -N -r -l inf --no-remove-listing.
  -p,  --page-requisites           get all images, etc. needed to display HTML page.
       --strict-comments           turn on strict (SGML) handling of HTML comments.

Recursive accept/reject:
  -A,  --accept=LIST               comma-separated list of accepted extensions.
  -R,  --reject=LIST               comma-separated list of rejected extensions.
       --accept-regex=REGEX        regex matching accepted URLs.
       --reject-regex=REGEX        regex matching rejected URLs.
       --regex-type=TYPE           regex type (posix).
  -D,  --domains=LIST              comma-separated list of accepted domains.
       --exclude-domains=LIST      comma-separated list of rejected domains.
       --follow-ftp                follow FTP links from HTML documents.
       --follow-tags=LIST          comma-separated list of followed HTML tags.
       --ignore-tags=LIST          comma-separated list of ignored HTML tags.
  -H,  --span-hosts                go to foreign hosts when recursive.
  -L,  --relative                  follow relative links only.
  -I,  --include-directories=LIST  list of allowed directories.
       --trust-server-names        use the name specified by the redirection
                                   url last component.
  -X,  --exclude-directories=LIST  list of excluded directories.
  -np, --no-parent                 don't ascend to the parent directory.
Mail bug reports and suggestions to <bug-wget@gnu.org>.

A First Attempt
Following wget's help documentation, I tried the command below:
wget -r -np -pk -nH -P ./download http://www.baidu.com
-r recursively download everything
-np download only what lives under the given URL; never ascend to its parent
-p download all the resources the page needs, including images and CSS styles
-k convert absolute paths into relative paths (this one matters: when a user opens the page, the resources it references should be looked up locally)
-nH keep wget from creating a directory named after the URL's host (without this, the command stores everything under ./download/www.baidu.com/)
-P where to download to, here the download folder under the current directory; if it doesn't exist, wget creates it for you (an equivalent long-option form is shown below)
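For reference, here is the same command spelled with long options; the mapping comes straight from the help text above:

wget --recursive --no-parent --page-requisites --convert-links --no-host-directories --directory-prefix=./download http://www.baidu.com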
All of these options fit the requirement, but the result came as a surprise: things were not as simple as we imagined, and wget did not deliver what we wanted.
If you run this command, you will find the download folder contains nothing but an index.html and a robots.txt, and the images index.html needs were never downloaded.
Nor were the paths in the <img> tags rewritten to relative paths; at most the string "http:" seems to have been stripped off.
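That is, an attribute such as src="http://www.baidu.com/img/bd_logo1.png" ends up as something like the following (illustrative; the exact URLs depend on the page):

src="//www.baidu.com/img/bd_logo1.png"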
As for why this happens, read on.
The Right Way with wget
Since the command above went nowhere, it was time to get creative. Let's write a shell script, named wget_cc, with the following contents:
#!/bin/sh
# Usage: ./wget_cc <download-dir> <url>
URL="$2"
DIR="$1"    # a neutral name; assigning to PATH would clobber the shell's search path

echo "download url: $URL"
echo "download dir: $DIR"

/usr/bin/wget -e robots=off -w 1 -xq -np -nH -pk -m -t 1 -P "$DIR" "$URL"

echo "download finished"
A few extra options appear here; let me explain them:
-e used as '-e command'.
It executes an extra .wgetrc-style command. Just as vim keeps its configuration in a .vimrc file, wget keeps its configuration in a .wgetrc file, which means that before wget runs, it first executes the configuration commands in .wgetrc. For a typical .wgetrc, see:
http://www.gnu.org/software/wget/manual/html_node/Sample-Wgetrc.html
http://www.gnu.org/software/wget/manual/html_node/Wgetrc-Commands.html
With -e you can supply extra configuration commands without rewriting .wgetrc. To supply several, just write -e command1 -e command2 ... -e commandN. These commands run after everything in .wgetrc and therefore override any identical settings in it.
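For example, a .wgetrc carrying the same two settings that the script above passes on the command line would look roughly like this (a sketch; robots and wait are both documented .wgetrc commands):

robots = off
wait = 1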
robots=off is there because wget by default honors the site's robots.txt, and if robots.txt contains User-agent: * Disallow: /, wget cannot mirror the site or download its directories.
This is exactly why the images and other resources could not be downloaded earlier: the HOST you want to crawl forbids spiders from crawling it, and -e robots=off lets wget bypass that restriction (you can verify this yourself; see the check sketched after the option list below).
-x create the directory structure that mirrors the site
-q download quietly, printing no progress information; drop this option if you want to see which resource wget is currently fetching
-m turn on mirroring-related options, such as recursing into subdirectories to unlimited depth
-t times how many times to retry a resource whose download failed
-w seconds how long to wait between successive requests (to ease the load on the server)
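If you want to confirm that a site's robots.txt is what blocked the first attempt, you can print it to stdout (the actual contents vary by site; www.baidu.com is just this post's running example):

wget -qO- http://www.baidu.com/robots.txt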
Anything else you don't understand, go dig through the documentation.
Save the file, exit your editor, and run:
chmod 744 wget_cc
Now let the script loose!
./wget_cc ./download http://www.baidu.com
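If all goes well, the script's own messages are the only output (the -q option keeps wget itself silent), roughly:

download url: http://www.baidu.com
download dir: ./download
download finished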
OK. Now look at the src attribute of the <img> tags again:
src="img/bd_logo1.png"
Sure enough, it has been rewritten to a relative path. Job done. If you found this helpful, please give it a like!
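To check all the rewritten references at once rather than one tag at a time, something like this works (assuming the ./download prefix used above):

grep -o 'src="[^"]*"' ./download/index.html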
This is Freestyletime@foxmail.com; feedback and discussion are welcome.
Source: http://my.oschina.net/freestyletime/blog/356985