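A Simple PHP Web Crawler: Goutte
Goutte is a screen scraping and web crawling library for PHP.
Goutte provides a nice API to crawl websites and extract data from the HTML/XML responses returned by the server.
Requirements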
Goutte depends on PHP 5.4+ and Guzzle 4+.
Tip
If you need support for PHP 5.3 or Guzzle 3, use Goutte 1.0.6.
Installation
Add fabpot/goutte as a require dependency in your composer.json file:
php composer.phar require fabpot/goutte:~2.0
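For reference, the command above should leave an entry like the following in composer.json. This is only a minimal fragment; a real project file will normally contain other keys as well.

```json
{
    "require": {
        "fabpot/goutte": "~2.0"
    }
}
```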
Usage
Create a Goutte Client instance (which extends Symfony\Component\BrowserKit\Client):
use Goutte\Client;

$client = new Client();
Make requests with the request() method:
// Go to the symfony.com website
$crawler = $client->request('GET', 'http://www.symfony.com/blog/');
The method returns a Crawler object (Symfony\Component\DomCrawler\Crawler).
Fine-tune cURL options:
$client->getClient()->setDefaultOption('config/curl/'.CURLOPT_TIMEOUT, 60);
Click on links:
// Click on the "Security Advisories" link
$link = $crawler->selectLink('Security Advisories')->link();
$crawler = $client->click($link);
Extract data:
// Get the latest post in this category and display the titles
$crawler->filter('h2.post > a')->each(function ($node) {
    print $node->text()."\n";
});
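To make it clearer what the selector h2.post > a matches, here is a minimal sketch that does not use Goutte at all. It relies on PHP's built-in DOM extension and an XPath expression equivalent to that CSS selector. The HTML fragment, the $titles variable, and the URLs in it are hypothetical and stand in for a fetched page.

```php
<?php
// Hypothetical HTML standing in for a fetched blog page (not real symfony.com markup).
$html = <<<'HTML'
<html><body>
  <h2 class="post"><a href="/blog/first">First post</a></h2>
  <h2 class="post featured"><a href="/blog/second">Second post</a></h2>
  <h2 class="post"><span><a href="/blog/nested">Nested link</a></span></h2>
  <h2 class="sidebar"><a href="/about">About</a></h2>
</body></html>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// XPath equivalent of the CSS selector "h2.post > a":
// <a> elements that are direct children of an <h2> whose class list contains "post".
$query = "//h2[contains(concat(' ', normalize-space(@class), ' '), ' post ')]/a";

$titles = [];
foreach ($xpath->query($query) as $node) {
    $titles[] = trim($node->textContent);
}

print implode("\n", $titles) . "\n";
```

Only the first two links are collected: the third is nested inside a span, so it is not a direct child of the h2, and the fourth sits in an h2 without the post class.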
Submit forms:
$crawler = $client->request('GET', 'http://github.com/');
$crawler = $client->click($crawler->selectLink('Sign in')->link());
$form = $crawler->selectButton('Sign in')->form();
$crawler = $client->submit($form, array('login' => 'fabpot', 'password' => 'xxxxxx'));
$crawler->filter('.flash-error')->each(function ($node) {
    print $node->text()."\n";
});
This article was uploaded and shared by user jopen; all rights belong to the original author.