Apriori算法的Python實現

jopen 10年前發布 | 35K 次閱讀 Apriori 算法

Apriori算法是數據挖掘中頻發模式挖掘的鼻祖,從60年代就開始流行,其算法思想也十分簡單樸素,首先挖掘出長度為1的頻繁模式,然后k=2

將這些頻繁模式合并組成長度為k的頻繁模式,算出它們的頻繁次數,而且要保證其所有k-1長度的子集也是頻繁的,值得注意的是,為了避免重復,合并的時候,只合并那些前k-2個字符都相同,而k-1的字符一邊是少于另一邊的。

以下是算法的Python實現:

    __author__ = 'linfuyuan'  
    min_frequency = int(raw_input('please input min_frequency:'))  
    file_name = raw_input('please input the transaction file:')  
    transactions = []  


    def has_infrequent_subset(candidate, Lk):  
        for i in range(len(candidate)):  
            subset = candidate[:-1]  
            subset.sort()  
            if not ''.join(subset) in Lk:  
                return False  
            lastitem = candidate.pop()  
            candidate.insert(0, lastitem)  
        return True  


    def countFrequency(candidate, transactions):  
        count = 0  
        for transaction in transactions:  
            if transaction.issuperset(candidate):  
                count += 1  
        return count  


    with open(file_name) as f:  
        for line in f.readlines():  
            line = line.strip()  
            tokens = line.split(',')  
            if len(tokens) > 0:  
                transaction = set(tokens)  
                transactions.append(transaction)  
    currentFrequencySet = {}  
    for transaction in transactions:  
        for item in transaction:  
            time = currentFrequencySet.get(item, 0)  
            currentFrequencySet[item] = time + 1  
    Lk = set()  
    for (itemset, count) in currentFrequencySet.items():  
        if count >= min_frequency:  
            Lk.add(itemset)  
    print ', '.join(Lk)  

    while len(Lk) > 0:  
        newLk = set()  
        for itemset1 in Lk:  
            for itemset2 in Lk:  
                cancombine = True  
                for i in range(len(itemset1)):  
                    if i < len(itemset1) - 1:  
                        cancombine = itemset1[i] == itemset2[i]  
                        if not cancombine:  
                            break  
                    else:  
                        cancombine = itemset1[i] < itemset2[i]  
                        if not cancombine:  
                            break  
                if cancombine:  
                    newitemset = []  
                    for char in itemset1:  
                        newitemset.append(char)  
                    newitemset.append(itemset2[-1])  
                    if has_infrequent_subset(newitemset, Lk) and countFrequency(newitemset, transactions) >= min_frequency:  
                        newLk.add(''.join(newitemset))  
        print ', '.join(newLk)  
        Lk = newLk  
來自:http://blog.csdn.net/xanxus46/article/details/40920275

 本文由用戶 jopen 自行上傳分享,僅供網友學習交流。所有權歸原作者,若您的權利被侵害,請聯系管理員。
 轉載本站原創文章,請注明出處,并保留原始鏈接、圖片水印。
 本站是一個以用戶分享為主的開源技術平臺,歡迎各類分享!