Automatic Text Summarization

Posted by JerHma 9 years ago | 28K reads | Tags: Python, Java, algorithms

Source: https://github.com/miso-belica/sumy

Automatic text summarizer

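A simple library and command-line utility for extracting summaries from HTML pages or plain text. The package also contains a simple evaluation framework for text summaries. The following summarization methods are implemented: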

Luhn - heuristic method, reference
Edmundson - heuristic method with previous statistical research, reference
Latent Semantic Analysis, LSA - one of the algorithms from http://scholar.google.com/citations?user=0fTuW_YAAAAJ&hl=en (I think the author is using more advanced algorithms now). Steinberger, J. and Ježek, K. Using latent semantic analysis in text summarization and summary evaluation. In Proceedings ISIM '04. 2004. pp. 93-100.
LexRank - Unsupervised approach inspired by the PageRank and HITS algorithms, reference
TextRank - some sort of combination of a few resources that I found on the internet. I really don't remember the sources. Probably Wikipedia and some papers on the first page of Google :)
SumBasic - Method that is often used as a baseline in the literature. Source: Read about SumBasic
KL-Sum - Method that greedily adds sentences to a summary as long as doing so decreases the KL divergence. Source: Read about KL-Sum
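
To make the greedy idea behind KL-Sum more concrete, here is a minimal sketch in plain Python. It is not sumy's implementation: the whitespace tokenizer, the smoothing constant `epsilon`, and the names `word_distribution`, `kl_divergence` and `kl_sum` are all assumptions made for illustration.

```python
import math
from collections import Counter


def word_distribution(words):
    """Return a dict mapping each word to its relative frequency."""
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def kl_divergence(doc_dist, summary_dist, epsilon=1e-6):
    """KL(document || summary); words missing from the summary get epsilon."""
    return sum(
        p * math.log(p / summary_dist.get(word, epsilon))
        for word, p in doc_dist.items()
    )


def kl_sum(sentences, max_sentences):
    """Greedily pick sentences while each pick lowers the KL divergence."""
    # Assumption: lowercase + whitespace split stands in for a real tokenizer.
    tokenized = [sentence.lower().split() for sentence in sentences]
    doc_dist = word_distribution([w for words in tokenized for w in words])

    chosen = []          # indices of the selected sentences
    summary_words = []   # words of the summary built so far
    best_so_far = float("inf")

    while len(chosen) < max_sentences:
        best_index, best_score = None, best_so_far
        for index, words in enumerate(tokenized):
            if index in chosen or not words:
                continue
            candidate = word_distribution(summary_words + words)
            score = kl_divergence(doc_dist, candidate)
            if score < best_score:
                best_index, best_score = index, score
        if best_index is None:  # no remaining sentence lowers the divergence
            break
        chosen.append(best_index)
        summary_words += tokenized[best_index]
        best_so_far = best_score

    # Return the selected sentences in their original document order.
    return [sentences[i] for i in sorted(chosen)]
```

Calling `kl_sum(sentences, 2)` on a list of sentence strings returns at most two of them. The loop stops early when no remaining sentence brings the summary's word distribution closer to the document's.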

Here are some other summarizers:

https://github.com/thavelick/summarize/ - Python, TF (very simple)
Reduction - Python, TextRank (simple)
Open Text Summarizer - C, TF without normalization
Simple program that summarize text - Python, TF without normalization
Intro to Computational Linguistics - Java, LexRank
Sumtract: Second project for UW LING 572 - Python
TextTeaser - Scala
PyTeaser - TextTeaser port in Python
Automatic Document Summarizer - Java, Bipartite HITS (no sources)
Pythia - Python, LexRank & Centroid
SWING - Ruby
Topic Networks - R, topic models & bipartite graphs
Almus: Automatic Text Summarizer - Java, LSA (without source code)
Musutelsa - Java, LSA (always freezes)
http://mff.bajecni.cz/index.php - C++
MEAD - Perl, various methods + evaluation framework

Installation

Make sure you have Python 2.7/3.3+ and pip (Windows, Linux) installed. The preferred way to install is:

$ [sudo] pip install sumy

Or, for the latest development version (GitHub no longer serves the unauthenticated git:// protocol, so use HTTPS):

$ [sudo] pip install git+https://github.com/miso-belica/sumy.git

Usage

Sumy includes a command-line utility for quick summarization of documents.

$ sumy lex-rank --length=10 --url=http://en.wikipedia.org/wiki/Automatic_summarization # what's summarization?
$ sumy luhn --language=czech --url=http://www.zdrojak.cz/clanky/automaticke-zabezpeceni/
$ sumy edmundson --language=czech --length=3% --url=http://cs.wikipedia.org/wiki/Bitva_u_Lipan
$ sumy --help # for more info
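
The `--length` option in the commands above is given once as an absolute number (`10`) and once as a percentage (`3%`). The sketch below shows one way such a value could be turned into a sentence count. `resolve_length` is a hypothetical helper written for illustration, not part of sumy's documented API, and the rounding rule (round down, but keep at least one sentence) is an assumption.

```python
def resolve_length(spec, total_sentences):
    """Convert a length spec such as "10" or "3%" into a sentence count."""
    spec = str(spec).strip()
    if spec.endswith("%"):
        percentage = int(spec[:-1])
        # Assumption: round down, but always keep at least one sentence.
        count = max(1, total_sentences * percentage // 100)
    else:
        count = int(spec)
    # Never ask for more sentences than the document contains.
    return min(count, total_sentences)


if __name__ == "__main__":
    print(resolve_length("10", 200))
    print(resolve_length("3%", 200))
```

Under these assumptions, a percentage always yields a non-empty summary for a non-empty document, while an absolute count is capped at the document length.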

To run the available evaluation methods against a given summarization method, use the commands below:

$ sumy_eval lex-rank reference_summary.txt --url=http://en.wikipedia.org/wiki/Automatic_summarization
$ sumy_eval lsa reference_summary.txt --language=czech --url=http://www.zdrojak.cz/clanky/automaticke-zabezpeceni/
$ sumy_eval edmundson reference_summary.txt --language=czech --url=http://cs.wikipedia.org/wiki/Bitva_u_Lipan
$ sumy_eval --help # for more info
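
To illustrate what comparing a generated summary with `reference_summary.txt` can involve, the sketch below computes sentence-level precision, recall and F-score. It is a simplified stand-in, not the code behind `sumy_eval`: the function name and the exact-match comparison of sentence strings are assumptions.

```python
def precision_recall_f1(evaluated_sentences, reference_sentences):
    """Score a summary by how many of its sentences appear in the reference."""
    evaluated = set(evaluated_sentences)
    reference = set(reference_sentences)
    if not evaluated or not reference:
        raise ValueError("Both summaries must contain at least one sentence.")

    common = len(evaluated & reference)
    precision = common / len(evaluated)   # share of picked sentences that match
    recall = common / len(reference)      # share of reference sentences found
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    generated = ["Sentence A.", "Sentence B."]
    reference = ["Sentence B.", "Sentence C."]
    print(precision_recall_f1(generated, reference))
```

Exact sentence matching is the simplest possible criterion. Word-overlap measures are more forgiving when the two summaries phrase the same content differently.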

Python API

You can also use sumy as a library in your project.

# -*- coding: utf8 -*-

from __future__ import absolute_import
from __future__ import division, print_function, unicode_literals

from sumy.parsers.html import HtmlParser
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer as Summarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words


LANGUAGE = "czech"
SENTENCES_COUNT = 10


if __name__ == "__main__":
    url = "http://www.zsstritezuct.estranky.cz/clanky/predmety/cteni/jak-naucit-dite-spravne-cist.html"
    parser = HtmlParser.from_url(url, Tokenizer(LANGUAGE))
    # or for plain text files
    # parser = PlaintextParser.from_file("document.txt", Tokenizer(LANGUAGE))
    stemmer = Stemmer(LANGUAGE)

    summarizer = Summarizer(stemmer)
    summarizer.stop_words = get_stop_words(LANGUAGE)

    for sentence in summarizer(parser.document, SENTENCES_COUNT):
        print(sentence)

Tests

Setup:

$ pip install pytest pytest-cov

Run tests via

$ py.test-2.7 && py.test-3.3 && py.test-3.4 && py.test-3.5
