A New Way to Scrape Web Pages with jQuery in a Java Program

Posted by openkk 12 years ago | 81K views | Tags: Java, web crawler

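Almost any information you might want already exists somewhere on the Internet; the problem is how to organize it into the form you need. For example, you might scrape the names, phone numbers, and email addresses of all the relevant companies from an industry website and save them to Excel for analysis. Web scraping is becoming more and more useful.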
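As a rough sketch of the "save to Excel" step, one option is to write comma-separated values, which Excel can open. The class name `CompanyCsv`, the field order, and the sample record below are hypothetical and do not come from the original article.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: turns scraped fields into CSV lines that Excel can open.
class CompanyCsv {

    // Quote a field if it contains a comma, a quote, or a line break;
    // embedded quotes are doubled, following the usual CSV convention.
    static String escape(String field) {
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        if (!needsQuotes) {
            return field;
        }
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Join the escaped fields with commas to form one CSV record.
    static String toCsvLine(List<String> fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(escape(fields.get(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Sample data invented for illustration only.
        System.out.println(toCsvLine(Arrays.asList("Name", "Phone", "Email")));
        System.out.println(toCsvLine(Arrays.asList("Acme, Inc.", "555-0100", "info@example.com")));
    }
}
```

Redirecting the output to a file with a `.csv` extension is one simple way to get the data into a spreadsheet.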

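With a traditional web page, the web server returns the HTML directly. Such pages are easy to scrape: whatever method you use, once you have the HTML you only need to parse it into a DOM. Pages whose content is generated by JavaScript are not so easy. Zhang Yu (the author) has not yet found a good solution to this problem; readers with experience scraping JavaScript-generated pages are welcome to offer advice.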
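For the traditional case, the "get the HTML, then parse it" step can be sketched in plain Java. This is only an illustration under assumptions that are not in the original article: the class name `StaticLinkExtractor`, the sample HTML, and the use of a naive regular expression in place of a real DOM parser are all hypothetical choices made to keep the example free of dependencies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example: pull href values out of HTML that has already been fetched.
class StaticLinkExtractor {

    // Naive pattern for <a ... href="..."> or <a ... href='...'>.
    // Group 1 is the quote character, group 2 is the link target.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Return the href values in document order.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        // Sample markup invented for illustration only.
        String html = "<div id=\"u\"><a href=\"http://example.com/a\">A</a> "
                + "<a class='x' href='/b'>B</a></div>";
        for (String link : extractLinks(html)) {
            System.out.println(link);
        }
    }
}
```

A regular expression like this breaks easily on unusual markup and cannot express a query such as "links inside the element with id u", which is part of why a selector-based approach such as jQuery's is attractive.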

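So today's topic is still scraping traditional HTML pages. As noted above, there is no real technical difficulty, but is there an easier way? Anyone who has used a JavaScript framework such as jQuery may feel that JavaScript is a natural helper for extracting information from web pages, since it was created to work with web pages in the first place. Of course it has many more uses today, such as server-side JavaScript with Node.js.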

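Being able to use jQuery to scrape web pages from within our own applications, such as a Java program, would be exciting. A ready-made solution does exist: all it takes is a JavaScript engine plus an environment in which jQuery can run.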

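Tools: Java, Rhino, and env.js. Rhino is the open-source JavaScript engine from Mozilla, and env.js simulates a browser environment, providing objects such as window. The code is as follows: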

package stony.zhang.scrape;


import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.lang.reflect.InvocationTargetException;

import org.mozilla.javascript.Context;
import org.mozilla.javascript.ContextFactory;
import org.mozilla.javascript.Scriptable;
import org.mozilla.javascript.ScriptableObject;

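/**
 * Runs a JavaScript file inside Rhino with env.js and jQuery preloaded,
 * so that the script can fetch and query web pages with jQuery.
 *
 * @author MyBeautiful
 * @Email: zhangyu0182@sina.com
 * @date Mar 7, 2012
 */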
public class RhinoScaper {
    private String url;
    private String jsFile;

    private Context cx;
    private Scriptable scope;

    public String getUrl() {
        return url;
    }

    public String getJsFile() {
        return jsFile;
    }

    public void setUrl(String url) {
        this.url = url;
        // init() must be called first: putObject needs the scope to exist
        putObject("url", url);
    }

    public void setJsFile(String jsFile) {
        this.jsFile = jsFile;
    }

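    public void init() {
        cx = ContextFactory.getGlobal().enterContext();
        scope = cx.initStandardObjects(null);
        // Interpreted mode (no bytecode compilation), as env.js is normally run
        cx.setOptimizationLevel(-1);
        cx.setLanguageVersion(Context.VERSION_1_5);

        // Load the simulated browser environment first, then jQuery on top of it
        String[] file = { "./lib/env.rhino.1.2.js", "./lib/jquery.js" };
        for (String f : file) {
            evaluateJs(f);
        }

        // ExtendUtil is not listed in this article. The script calls util.log(...),
        // so it is presumably a Scriptable class that exposes a log function.
        try {
            ScriptableObject.defineClass(scope, ExtendUtil.class);
        } catch (IllegalAccessException e1) {
            e1.printStackTrace();
        } catch (InstantiationException e1) {
            e1.printStackTrace();
        } catch (InvocationTargetException e1) {
            e1.printStackTrace();
        }
        // Create an instance and expose it to scripts as the global "util"
        ExtendUtil util = (ExtendUtil) cx.newObject(scope, "util");
        scope.put("util", scope, util);
    }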

    protected void evaluateJs(String f) {
        // try-with-resources closes the reader even if evaluation fails
        try (FileReader in = new FileReader(f)) {
            cx.evaluateReader(scope, in, f, 1, null);
        } catch (FileNotFoundException e1) {
            e1.printStackTrace();
        } catch (IOException e1) {
            e1.printStackTrace();
        }
    }

    public void putObject(String name, Object o) {
        scope.put(name, scope, o);
    }

    public void run() {
        evaluateJs(this.jsFile);
    }
}

Test code:

package stony.zhang.scrape;

import java.util.HashMap;
import java.util.Map;

import junit.framework.TestCase;

public class RhinoScaperTest extends TestCase {

    public RhinoScaperTest(String name) {
        super(name);
    }

    public void testRun() {
        RhinoScaper rs = new RhinoScaper();
        rs.init();
        rs.setUrl("http://www.baidu.com");
        rs.setJsFile("test.js");
//      Map<String, String> o = new HashMap<String, String>();
//      rs.putObject("result", o);
        rs.run();
//      System.out.println(o.get("imgurl"));
    }

}

The test.js file is as follows:

$.ajax({
  url: "http://www.baidu.com",
  context: document.body,
  success: function(data) {
    // util.log(data);

    var result = parseHtml(data);

    var $v = jQuery(result);
    // util.log(result);
    $v.find('#u a').each(function(index) {
      util.log(index + ': ' + $(this).attr("href"));
      // arr.add($(this).attr("href"));
    });
  }
});


function parseHtml(html) {
  // Create an iframe that will be used to render the HTML in order to get
  // the DOM objects created. This is a far quicker way of achieving the
  // HTML-to-DOM conversion than trying to transform the HTML objects one by one.
  var oIframe = document.createElement('iframe');

  // Hide the iframe from view
  oIframe.style.display = 'none';
  if (document.body)
    document.body.appendChild(oIframe);
  else
    document.documentElement.appendChild(oIframe);

  // Open the iframe's document and write in our HTML
  oIframe.contentDocument.open();
  oIframe.contentDocument.write(html);
  oIframe.contentDocument.close();

  // Return the document body object containing the HTML that was just
  // added to the iframe as DOM objects
  var oBody = oIframe.contentDocument.body;

  // TODO: Remove the iframe object created, to clean up the DOM

  return oBody;
}

Running the unit test prints three Baidu links scraped from the page to the console:

0: http://www.baidu.com/gaoji/preferences.html
1: http://passport.baidu.com/?login&tpl=mn
2: https://passport.baidu.com/?reg&tpl=mn

The test succeeds, which shows that scraping web pages with jQuery from a Java program is feasible.

----------------------------------------------------------------------

Zhang Yu (Mybeautiful), zhangyu0182@sina.com


Article URL: http://www.baiduhome.net/lib/view/open1331187174202.html
