A Python Implementation of the Rapid Automatic Keyword Extraction (RAKE) Algorithm: rake


A Python implementation of the Rapid Automatic Keyword Extraction (RAKE) algorithm. It automatically extracts keywords from a single document.

import rake 
import operator

# EXAMPLE ONE - SIMPLE  
stoppath = "SmartStoplist.txt"  
''' 
# 1. initialize RAKE by providing a path to a stopwords file 
rake_object = rake.Rake(stoppath, 5, 3, 4)  # parameters: (1) each word has at least 5 characters, (2) each phrase has at most 3 words, (3) each keyword appears in the text at least 4 times 


# 2. run RAKE on a given text 
sample_file = open("data/docs/fao_test/w2167e.txt", 'r') 
text = sample_file.read() 

keywords = rake_object.run(text) # this command can output all the keywords and their scores 

# 3. print results 
print "Keywords:", keywords 

print "----------"           '''  
# EXAMPLE TWO - BEHIND THE SCENES (from https://github.com/aneesha/RAKE/rake.py)  

# initialize RAKE by providing a path to a stopwords file  
rake_object = rake.Rake(stoppath)  

text = "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility " \  
       "of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. " \  
       "Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating"\  
       " sets of solutions for all types of systems are given. These criteria and the corresponding algorithms " \  
       "for constructing a minimal supporting set of solutions can be used in solving all the considered types of " \  
       "systems and systems of mixed types."  



# Split text into sentences  
sentenceList = rake.split_sentences(text)  # the text is split into sentences on punctuation marks such as commas and periods (a simplified sketch of the idea follows below)  

for sentence in sentenceList:  
    print "Sentence:", sentence  

# generate candidate keywords  
stopwordpattern = rake.build_stop_word_regex(stoppath)  
phraseList = rake.generate_candidate_keywords(sentenceList, stopwordpattern)  # phraseList holds the candidate keyword phrases  
# note: this method does not handle phrases in which such boundary characters are part  
# of the actual phrase (e.g. ".Net" or "Dr. Who"); improvements can be made here  
# (a possible workaround is sketched below).  
# Read more at https://www.airpair.com/nlp/keyword-extraction-tutorial#4Lc4GeP5t5cYe7OR.99  
print "Phrases:", phraseList  

# calculate individual word scores  
wordscores = rake.calculate_word_scores(phraseList)  
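# Background on RAKE scoring: each word w gets the score degree(w) / freq(w), where
# freq(w) counts how often w appears across candidate phrases and degree(w) adds the
# number of words it co-occurs with inside those phrases; a phrase's score is then the
# sum of its member words' scores. A rough re-computation for illustration only
# (real implementations differ in word splitting and filtering):
freq = {}
cooc = {}
for phrase in phraseList:
    words = phrase.split()
    for word in words:
        freq[word] = freq.get(word, 0) + 1
        cooc[word] = cooc.get(word, 0) + len(words) - 1
# degree(w) = freq(w) + cooc(w); score(w) = degree(w) / freq(w)
roughScores = dict((w, (freq[w] + cooc[w]) / float(freq[w])) for w in freq)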

# generate candidate keyword scores  
keywordcandidates = rake.generate_candidate_keyword_scores(phraseList, wordscores)  
# One issue here is that the candidates are not normalized in any way, so we may get  
# keywords that look nearly identical: "small scale production" vs. "small scale producers",  
# or "skim milk powder" vs. "skimmed milk powder".  
# Ideally, a keyword extraction algorithm would first apply stemming or other normalization;  
# that would be another possible improvement (a sketch follows below).  
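# A sketch of that stemming idea: collapse near-duplicate candidates by stemming
# each word before scoring. Assumes NLTK is installed (an external dependency,
# not part of this repo); any stemmer could be substituted.
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
normalizedPhrases = [" ".join(stemmer.stem(w) for w in phrase.split()) for phrase in phraseList]
# normalizedPhrases could then replace phraseList in calculate_word_scores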



for candidate, score in keywordcandidates.items():  
    print "Candidate: ", candidate, ", score: ", score  



# sort candidates by score to determine top-scoring keywords  
sortedKeywords = sorted(keywordcandidates.iteritems(), key=operator.itemgetter(1), reverse=True)  
totalKeywords = len(sortedKeywords)  

# for example, you could just take the top third as the final keywords  
for keyword in sortedKeywords[0:(totalKeywords / 3)]:  # note that the final keywords are the top-scoring third  
    print "Keyword: ", keyword[0], ", score: ", keyword[1]  

print rake_object.run(text)  # this command outputs all the keywords and their scores.  

Project homepage: http://www.baiduhome.net/lib/view/home/1421808449078
