Building a Web Crawler with TypeScript
Install TypeScript globally:
npm install -g typescript
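The current version is 2.0.3, which no longer requires the typings command. However, the TypeScript version bundled with VS Code is 1.8, so some configuration is needed; the steps are described in this article.
Test the tsc command: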
tsc
Create a folder for the project:
mkdir test-typescript-spider
Enter the folder:
cd test-typescript-spider
Initialize the project:
npm init
Install the superagent and cheerio modules:
npm i --save superagent cheerio
Install the corresponding type declaration packages:
npm i --save @types/superagent
npm i --save @types/cheerio
Install TypeScript inside the project as well (this step is required):
npm i --save typescript
Open the project folder in VS Code. Create a tsconfig.json file in that folder and copy the following configuration into it:
{
    "compilerOptions": {
        "target": "ES6",
        "module": "commonjs",
        "noEmitOnError": true,
        "noImplicitAny": true,
        "experimentalDecorators": true,
        "sourceMap": false,
        // "sourceRoot": "./",
        "outDir": "./out"
    },
    "exclude": [
        "node_modules"
    ]
}
In VS Code, open File > Preferences > Workspace Settings.
Add the following to settings.json (without this setting, VS Code asks which TypeScript version to use when the project is opened):
{
    "typescript.tsdk": "node_modules/typescript/lib"
}
Create an api.ts file and copy the following code into it:
import superagent = require('superagent');
import cheerio = require('cheerio');

// Wrap superagent's callback-style request in a Promise so it can be awaited.
export const remote_get = function (url: string) {
    const promise = new Promise<superagent.Response>(function (resolve, reject) {
        superagent.get(url)
            .end(function (err, res) {
                if (!err) {
                    resolve(res);
                } else {
                    console.log(err);
                    reject(err);
                }
            });
    });
    return promise;
}
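The Promise wrapper used by remote_get can be tried in isolation, without any network access. The following is a minimal sketch and not part of the project: fakeGet is a hypothetical stand-in for a callback-style client such as superagent, and getAsync and NodeCallback are names chosen only for illustration.

```typescript
// Node-style callback: the first argument is an error (or null), the second the result.
type NodeCallback<T> = (err: Error | null, result?: T) => void;

// Hypothetical stand-in for a callback-based HTTP client.
// It "succeeds" for URLs that start with "http" and fails otherwise.
function fakeGet(url: string, done: NodeCallback<string>): void {
    setTimeout(() => {
        if (url.indexOf('http') === 0) {
            done(null, 'body of ' + url);
        } else {
            done(new Error('invalid url: ' + url));
        }
    }, 10);
}

// Wrap the callback API in a Promise so that callers can use async/await.
function getAsync(url: string): Promise<string> {
    return new Promise<string>((resolve, reject) => {
        fakeGet(url, (err, result) => {
            if (err) {
                reject(err);
            } else {
                resolve(result as string);
            }
        });
    });
}
```

A caller can then write `const body = await getAsync(url);` inside an async function and handle failures with an ordinary try/catch.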
Create an app.ts file with some test code:
import api = require('./api');

const go = async () => {
    let res = await api.remote_get('http://www.baidu.com/');
    console.log(res.text);
}
go();
Compile the project:
tsc
Then run it:
node out/app
Check that the output is correct.
Now try to collect the article links on the first page of http://cnodejs.org/.
Change app.ts to the following:
import api = require('./api');
import cheerio = require('cheerio');

const go = async () => {
    const res = await api.remote_get('http://cnodejs.org/');
    const $ = cheerio.load(res.text);
    let urls: string[] = [];
    let titles: string[] = [];
    $('.topic_title_wrapper').each((index, element) => {
        titles.push($(element).find('.topic_title').first().text().trim());
        // No trailing slash on the host here, consistent with the later listings.
        urls.push('http://cnodejs.org' + $(element).find('.topic_title').first().attr('href'));
    })
    console.log(titles, urls);
}
go();
Check the output: the article titles and links have all been retrieved.
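Joining a base URL and an href by string concatenation is sensitive to leading and trailing slashes. A minimal sketch of one alternative, assuming the WHATWG URL class is available as a global (as it is in recent Node.js versions); resolveLink and the '/topic/abc' path are hypothetical and used only for illustration:

```typescript
// Resolve an href found in a page against the URL of the page itself.
// Relative paths, root-relative paths and absolute URLs are all handled by URL.
function resolveLink(base: string, href: string): string {
    return new URL(href, base).toString();
}

console.log(resolveLink('http://cnodejs.org/', '/topic/abc'));
// prints "http://cnodejs.org/topic/abc"
console.log(resolveLink('http://cnodejs.org', 'topic/abc'));
// prints "http://cnodejs.org/topic/abc"
```

Whether this is worth the extra function depends on how uniform the links on the target site are; for a single site with root-relative links, plain concatenation also works.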
Now try going one level deeper and fetching the content of each article:
import api = require('./api');
import cheerio = require('cheerio');

const go = async () => {
    const res = await api.remote_get('http://cnodejs.org/');
    const $ = cheerio.load(res.text);
    $('.topic_title_wrapper').each(async (index, element) => {
        let url = ('http://cnodejs.org' + $(element).find('.topic_title').first().attr('href'));
        const res_content = await api.remote_get(url);
        const $_content = cheerio.load(res_content.text);
        console.log($_content('.topic_content').first().text());
    })
}
go();
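The output shows many 503 errors. The callbacks passed to each() are not awaited, so all the requests are fired almost at once and hit the server too hard.
Solution:
Add a helper.ts file:
export const wait_seconds = function (seconds: number) {
    return new Promise(resolve => setTimeout(resolve, seconds * 1000));
}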
Change api.ts to:
import superagent = require('superagent');
import cheerio = require('cheerio');

// Collect the article URLs listed on the index page.
// The function must be async because it uses await.
export const get_index_urls = async function () {
    const res = await remote_get('http://cnodejs.org/');
    const $ = cheerio.load(res.text);
    let urls: string[] = [];
    $('.topic_title_wrapper').each((index, element) => {
        urls.push('http://cnodejs.org' + $(element).find('.topic_title').first().attr('href'));
    });
    return urls;
}

// Fetch one article page and return its text content.
export const get_content = async function (url: string) {
    const res = await remote_get(url);
    const $ = cheerio.load(res.text);
    return $('.topic_content').first().text();
}

export const remote_get = function (url: string) {
    const promise = new Promise<superagent.Response>(function (resolve, reject) {
        superagent.get(url)
            .end(function (err, res) {
                if (!err) {
                    resolve(res);
                } else {
                    console.log(err);
                    reject(err);
                }
            });
    });
    return promise;
}
Change app.ts to:
import api = require('./api');
import helper = require('./helper');

const go = async () => {
    let urls = await api.get_index_urls();
    // Fetch the pages one at a time, waiting one second before each request.
    for (let i = 0; i < urls.length; i++) {
        await helper.wait_seconds(1);
        let text = await api.get_content(urls[i]);
        console.log(text);
    }
}
go();
The output shows that the program now waits one second before requesting the next content page.
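The difference from the earlier version is that a plain for loop with await runs the requests strictly one after another, whereas async callbacks handed to each() all start at once. The sequential pattern can be sketched as a small generic helper; this is an illustration under assumed names (wait, forEachThrottled), not code from the project:

```typescript
// Resolve after the given number of milliseconds.
const wait = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

// Run `task` for each item one at a time, pausing `delayMs` before each call.
// The next task does not start until the previous one has settled.
async function forEachThrottled<T>(
    items: T[],
    delayMs: number,
    task: (item: T) => Promise<void>
): Promise<void> {
    for (const item of items) {
        await wait(delayMs);
        await task(item);
    }
}
```

With this helper the loop in app.ts could be expressed as `await forEachThrottled(urls, 1000, async url => { console.log(await api.get_content(url)); });`, assuming the api module shown above.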
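Now try saving the crawled data to a database.
Install the mongoose module and its type declarations:
npm i mongoose --save
npm i --save @types/mongoose
Then define the schema. First create a models folder:
mkdir models
Create index.ts in the models folder: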
import * as mongoose from 'mongoose';
// The file name must match Article.ts exactly on case-sensitive file systems.
import Article = require('./Article');

mongoose.connect('mongodb://127.0.0.1/cnodejs_data', {
    server: { poolSize: 20 }
}, function (err) {
    if (err) {
        console.log(err);
        process.exit(1);
    }
});

// models
export { Article };
Create IArticle.ts in the models folder:
interface IArticle {
    // Use the primitive type string here, not the String wrapper object type.
    title: string;
    url: string;
    text: string;
}
export = IArticle;
Create Article.ts in the models folder:
import mongoose = require('mongoose');
import IArticle = require('./IArticle');

interface IArticleModel extends IArticle, mongoose.Document { }

const ArticleSchema = new mongoose.Schema({
    title: { type: String },
    url: { type: String },
    text: { type: String },
});

const Article = mongoose.model<IArticleModel>("Article", ArticleSchema);
export = Article;
Change api.ts to:
import superagent = require('superagent');
import cheerio = require('cheerio');
import models = require('./models');
const Article = models.Article;

export const get_index_urls = async function () {
    const res = await remote_get('http://cnodejs.org/');
    const $ = cheerio.load(res.text);
    let urls: string[] = [];
    $('.topic_title_wrapper').each((index, element) => {
        urls.push('http://cnodejs.org' + $(element).find('.topic_title').first().attr('href'));
    });
    return urls;
}

// Fetch one article page and store its title, URL and text in the database.
export const fetch_content = async function (url: string) {
    const res = await remote_get(url);
    const $ = cheerio.load(res.text);
    let article = new Article();
    article.text = $('.topic_content').first().text();
    // Strip the "pinned" and "featured" labels from the title.
    article.title = $('.topic_full_title').first().text().replace('置頂', '').replace('精華', '').trim();
    article.url = url;
    console.log('Fetched: ' + article.title);
    // Wait for the save to finish so that errors reach the caller.
    await article.save();
}

export const remote_get = function (url: string) {
    return new Promise<superagent.Response>((resolve, reject) => {
        superagent.get(url)
            .end(function (err, res) {
                if (!err) {
                    resolve(res);
                } else {
                    reject(err);
                }
            });
    });
}
Change app.ts to:
import api = require('./api');
import helper = require('./helper');

(async () => {
    try {
        let urls = await api.get_index_urls();
        for (let i = 0; i < urls.length; i++) {
            await helper.wait_seconds(1);
            await api.fetch_content(urls[i]);
        }
    } catch (err) {
        console.log(err);
    }
    console.log('Done!');
})();
Run:
tsc
node out/app
Check the output, then check the database.
The articles have been saved successfully!
Addendum: an improved version of remote_get that retries on error and supports a proxy server.
It drops superagent in favor of the request library and is for reference only:
import request = require('request');
import { wait_seconds } from './helper';

// config is assumed to be defined elsewhere, for example: config.retries = 3;
let current_retry = config.retries || 0;

export const remote_get = async function (url: string, proxy?: string) {
    // Always wait a little before each request.
    await wait_seconds(2);
    if (!proxy) {
        proxy = '';
    }
    const promise = new Promise<string>(function (resolve, reject) {
        console.log('get: ' + url + ', using proxy: ' + proxy);
        let options: request.CoreOptions = {
            headers: {
                'Cookie': '',
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36',
                'Referer': 'https://www.baidu.com/'
            },
            encoding: 'utf-8',
            method: 'GET',
            proxy: proxy,
            timeout: 3000,
        }
        request(url, options, async function (err, response, body) {
            console.log('got:' + url);
            if (!err) {
                body = body.toString();
                // Success: reset the retry counter for the next request.
                current_retry = config.retries || 0;
                console.log('bytes:' + body.length);
                resolve(body);
            } else {
                console.log(err);
                if (current_retry <= 0) {
                    // Out of retries: reset the counter and give up.
                    current_retry = config.retries || 0;
                    reject(err);
                } else {
                    console.log('retry...(' + current_retry + ')');
                    current_retry--;
                    try {
                        let body = await remote_get(url, proxy);
                        resolve(body);
                    } catch (e) {
                        reject(e);
                    }
                }
            }
        });
    });
    return promise;
}
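In the listing above the retry counter lives at module level, so every call to remote_get shares it. A minimal sketch of an alternative that keeps the count local to each call; withRetry and its parameters are hypothetical names for illustration and are not part of the request library:

```typescript
// Call `task` until it succeeds, allowing up to `retries` extra attempts.
// If every attempt fails, the last error is rethrown to the caller.
async function withRetry<T>(
    task: () => Promise<T>,
    retries: number,
    delayMs: number = 0
): Promise<T> {
    let lastError: unknown;
    for (let attempt = 0; attempt <= retries; attempt++) {
        try {
            return await task();
        } catch (err) {
            lastError = err;
            // Pause before the next attempt, but not after the final failure.
            if (attempt < retries && delayMs > 0) {
                await new Promise<void>(resolve => setTimeout(resolve, delayMs));
            }
        }
    }
    throw lastError;
}
```

Any Promise-returning fetch function can be passed in, for example `await withRetry(() => remote_get(url), 3, 2000);`, assuming a remote_get without its own retry logic.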
Also, it may be better to merge IArticle.ts and Article.ts into a single file. Another model of mine is written that way and can serve as a reference:
import mongoose = require('mongoose');

interface IProxyModel {
    uri: string;
    ip: string;
    port: string;
    info: string;
}

export interface IProxy extends IProxyModel, mongoose.Document { }

const ProxySchema = new mongoose.Schema({
    uri: { type: String },
    ip: { type: String },
    port: { type: String },
    info: { type: String },
});

export const Proxy = mongoose.model<IProxy>("Proxy", ProxySchema);
Import it like this:
import { IProxy, Proxy } from './models';
Proxy can be used for operations such as new, find and where:
let x = new Proxy();
let xx = await Proxy.find({});
let xxx = await Proxy.where('aaa', 123).exec();
IProxy is used to pass entity objects around, for example:
function xxx(p: IProxy) {
}
Source: https://segmentfault.com/a/1190000007326795