wring - Extract content from websites using CSS Selectors and XPath
Installation
You can install wring using npm:
$ npm install --global wring
Wring uses PhantomJS for some of its commands. To use them, install PhantomJS with your system package manager, for example brew install phantomjs on OS X or apt-get install phantomjs on Ubuntu. You can make sure it's on your PATH by running phantomjs -v.
Alternatively, you can install a version which automatically downloads PhantomJS binaries for your system:
$ npm install --global wring-with-phantomjs
Usage
wring text
Here is a simple example which prints contents of the matching element (uses Cheerio under the hood):
$ wring text '
# you can use the first letter of a command as a shortcut
$ wring t http://randomfunfacts.com i
No president of the United States was an only child.
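The shortcut rule can be pictured as a small lookup. The sketch below is only an illustration and not wring's actual implementation: the function name resolveCommand is hypothetical, and the command list is simply the set of commands this README documents.

```javascript
// Hypothetical illustration only; wring's real argument parsing may differ.
// The command names are the ones documented in this README.
const COMMANDS = ["text", "html", "eval", "shot"];

// Return the full command for a full name or a first-letter shortcut,
// or null when nothing matches.
function resolveCommand(name) {
  const match = COMMANDS.find((cmd) => cmd === name || cmd[0] === name);
  return match === undefined ? null : match;
}

console.log(resolveCommand("t"));    // shortcut form
console.log(resolveCommand("text")); // full form
console.log(resolveCommand("x"));    // unknown command
```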
You can also use jQuery-specific selectors such as :contains():
$ wring t 'https://en.wikipedia.org/wiki/List_of_songs_recorded_by_Taylor_Swift' 'tr:contains("The Hunger Games") th:first-child'
"Eyes Open"
"Safe & Sound"
wring html
Prints the outerHTML of matching elements. Here is an example, this time using an XPath expression:
$ wring html "http://news.ycombinator.com" "//td[@class='title']/a[starts-with(@href,'http')]"
<a >PostgreSQL Indexes: First principles</a>
<a >Doing Mathematics Differently</a>
<a >The rise of the API-based SaaS</a>
<a >Rich Hickey Fanclub</a>
...
Accepted inputs
The first argument of a command specifies its input, which can be a URL, a path to a file, an HTML string, or - to read the page source from stdin:
# read from file
$ curl '
# read from string
$ wring text '<div class="foo">Hello</div>' '.foo'
Hello
# read from stdin
$ curl -s '
Using with PhantomJS
Prefixing a command with phantomjs or p will run it using jQuery inside a real web browser context. You can use this if you are having compatibility problems with the commands above, but the real utility comes from being able to scrape dynamically generated content:
$ wring p t '<title>Foo</title> <script>document.title = "Bar";</script>' 'title'
Bar
# compare it to the non-phantomjs invocation below
$ wring t '<title>Foo</title> <script>document.title = "Bar";</script>' 'title'
Foo
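The difference between the two results comes down to whether the page's inline script gets to run. The following is a toy sketch of that idea using only Node's built-in vm module. It is not how wring or PhantomJS work internally: fakeDocument is a plain object standing in for a real DOM, and the regular expressions are only adequate for this one-line page.

```javascript
const vm = require("vm");

// Same page source as in the example above.
const page = '<title>Foo</title> <script>document.title = "Bar";</script>';

// Static view: read the title straight from the markup, without running scripts.
const staticTitle = /<title>(.*?)<\/title>/.exec(page)[1];

// Dynamic view: run the inline script against a stand-in `document`.
// fakeDocument is a plain object (an assumption for illustration), not a DOM.
const fakeDocument = { title: staticTitle };
const inlineScript = /<script>([\s\S]*?)<\/script>/.exec(page)[1];
vm.runInNewContext(inlineScript, { document: fakeDocument });

console.log(staticTitle);        // title as written in the markup
console.log(fakeDocument.title); // title after the script has run
```

A static parser only ever sees the first value, which is why content generated by scripts needs the phantomjs prefix.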
wring eval
Lets you evaluate JavaScript inside any page. Calling wring('str') will write to the terminal. You can pass any number of .js file paths, URLs, and JS expressions as script arguments, and they will be executed in the given order:
$ wring eval '
# you can load and use third party libraries:
$ wring e '
Self contained scripts
You can also use a trick to make self-contained scripts.
Here is a contrived example which loads the Hacker News homepage, loads lodash, sorts posts by their score, and prints the top 5:
#!/bin/sh
":" //; exec wring eval "
// Title and link of every post on the page.
var posts = _.map(
  document.querySelectorAll(".votelinks + .title > a"),
  function (el) {
    return el.textContent + "\n" + el.href;
  });
// Score of every post, parsed as an integer.
var scores = _.map(
  document.querySelectorAll(".score"),
  function (el) {
    return parseInt(el.textContent, 10);
  });
// Pair posts with scores, sort by score (highest first), print the top 5.
_(posts)
  .zipWith(scores, function (text, score) {
    return { text: text, score: score };
  })
  .orderBy("score", "desc")
  .take(5)
  .forEach(function (item) {
    wring(item.text + "\n");
  });
# after saving the source above to wring_hn.js
# you can run it like this
$ chmod +x wring_hn.js
$ ./wring_hn.js
Raspberry Pi 3 Model B confirmed, with onboard BT LE and WiFi
https://apps.fcc.gov/oetcf/eas/reports/...
After fifteen years of downtime, the MetaFilter gopher server is back
http://metatalk.metafilter.com/24019/...
...
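The ranking step of the script can be tried outside a browser. The sketch below swaps lodash for plain Array methods and uses made-up sample data in place of the scraped values. The function name topPosts and the example.com entries are hypothetical.

```javascript
// Hypothetical sample data standing in for the values scraped from the page.
const posts = [
  "Post A\nhttp://example.com/a",
  "Post B\nhttp://example.com/b",
  "Post C\nhttp://example.com/c",
];
const scores = [12, 250, 87];

// Pair each post with its score, sort by score (highest first), keep the top n.
function topPosts(posts, scores, n) {
  return posts
    .map((text, i) => ({ text: text, score: scores[i] }))
    .sort((a, b) => b.score - a.score)
    .slice(0, n);
}

const top = topPosts(posts, scores, 2);
top.forEach((item) => console.log(item.text + "\n"));
```

Pairing by index assumes the two lists line up one-to-one, which is the same assumption the original script makes about the page's posts and scores.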
wring shot
The last command to cover is wring shot, which renders a screenshot of the first matching element and saves it to a file:
$ wring shot 'https://www.google.com/finance?q=GOOG' '#price-panel' goog.png
wring: Saved to goog.png
The resulting goog.png will contain an image of the matched #price-panel element.
Development
# Install Node.js dependencies:
$ npm install
# Install PureScript dependencies:
$ bower install
# Build wring.js and phantom-main.js:
$ npm run build
# Run tests:
$ npm test
# Compile & run using Pulp (https://github.com/bodil/pulp):
$ pulp run text '<b>foo</b>' 'b'
License