A Simple PHP Web Crawler: Goutte

Posted by jopen 11 years ago | 70K reads | Tags: Goutte, web crawler

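Goutte is a screen scraping and web crawling library for PHP.

Goutte provides a nice API to crawl websites and extract data from the HTML/XML responses returned by the server.

Requirements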

Goutte depends on PHP 5.4+ and Guzzle 4+.

Tip

If you need support for PHP 5.3 or Guzzle 3, use Goutte 1.0.6.

Installation

Add fabpot/goutte as a required dependency to your composer.json file by running:

php composer.phar require fabpot/goutte:~2.0

Tip

You can also download the Goutte.phar file:

require_once '/path/to/goutte.phar';

Usage

Create a Goutte Client instance (which extends Symfony\Component\BrowserKit\Client):

use Goutte\Client;

$client = new Client();

Make requests with the request() method:

// Go to the symfony.com website
$crawler = $client->request('GET', 'http://www.symfony.com/blog/');

The method returns a Crawler object (Symfony\Component\DomCrawler\Crawler).

Fine-tune cURL options:

$client->getClient()->setDefaultOption('config/curl/'.CURLOPT_TIMEOUT, 60);

Click on links:

// Click on the "Security Advisories" link
$link = $crawler->selectLink('Security Advisories')->link();
$crawler = $client->click($link);
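
Before clicking, it can help to see which URL a link resolves to. The following is a minimal sketch, not part of the original instructions: it assumes Composer's autoloader lives at vendor/autoload.php, builds a Crawler directly from an inline HTML fragment instead of going through the Goutte client (so no network request is made), and uses a made-up href for the "Security Advisories" link.

```php
<?php
// Assumption: dependencies were installed with Composer into ./vendor.
require_once __DIR__.'/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical markup standing in for the blog page; the href is invented.
$html = '<html><body><a href="/blog/category/security-advisories">Security Advisories</a></body></html>';

// The second argument is the URI the page is treated as coming from;
// the crawler needs it to resolve relative links.
$crawler = new Crawler($html, 'http://www.symfony.com/blog/');

// Same selectLink()/link() calls as above, then inspect the target.
$link = $crawler->selectLink('Security Advisories')->link();
$uri = $link->getUri();

print $uri."\n";
```

The Link object returned by link() is the same kind of object that is passed to $client->click(), so printing its URI is a cheap way to check a selector before following it.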

Extract data:

// Get the latest post in this category and display the titles
$crawler->filter('h2.post > a')->each(function ($node) {
    print $node->text()."\n";
});
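
Instead of printing inside the closure, the values can be collected, since each() returns an array of whatever the closure returns. The sketch below assumes Composer's autoloader at vendor/autoload.php and runs the same h2.post > a selector against an inline HTML fragment; the markup, titles, and hrefs are invented for illustration.

```php
<?php
// Assumption: dependencies were installed with Composer into ./vendor.
require_once __DIR__.'/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical markup standing in for a fetched blog page.
$html = <<<'HTML'
<html><body>
  <h2 class="post"><a href="/blog/first">First post</a></h2>
  <h2 class="post"><a href="/blog/second">Second post</a></h2>
  <h2 class="other"><a href="/about">About</a></h2>
</body></html>
HTML;

$crawler = new Crawler($html);

// Collect the title and href of every link directly inside an h2.post.
$posts = $crawler->filter('h2.post > a')->each(function (Crawler $node) {
    return array(
        'title' => $node->text(),
        'href'  => $node->attr('href'),
    );
});

foreach ($posts as $post) {
    print $post['title'].' -> '.$post['href']."\n";
}
```

Only the two h2.post links are collected; the link under h2.other does not match the selector. With a live page, the $crawler returned by $client->request() can be used in place of the one built here.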

Submit forms:

$crawler = $client->request('GET', 'http://github.com/');
$crawler = $client->click($crawler->selectLink('Sign in')->link());
$form = $crawler->selectButton('Sign in')->form();
$crawler = $client->submit($form, array('login' => 'fabpot', 'password' => 'xxxxxx'));
$crawler->filter('.flash-error')->each(function ($node) {
    print $node->text()."\n";
});

Project homepage: http://www.baiduhome.net/lib/view/home/1413877792059


Article URL: http://www.baiduhome.net/lib/view/open1413877792059.html
