機器學習 — 發現群組

TobyTraugot 8年前發布 | 16K 次閱讀聚類分析機器學習

聚類

屬于無監督學習

目的：找到數據集中的不同群組

分級聚類

主要思想是：

在數據集中找出兩個最相似的節點
根據這兩個節點生成一個新的聚類節點，這個節點的數據為兩個子節點的數據的平均值,
將兩個子節點從數據集中去除，將新的聚類節點加入數據
回到1，直至數據集中只剩一個節點

K-means聚類

使用分級聚類的時候，因為得計算所有數據的兩兩之間的距離，形成新的聚類之后還得重新計算，所以在數據集較大的時候計算量會很大。

除了分級聚類之外還有一種K-均值聚類方法，主要思想為：

隨機創建（給定）k個點作為中心點
遍歷數據集中每個點，找到距離最近的中心點，將該點劃分在該中心點下
遍歷并劃分完成后，將各個中心點移到自己組下所有點的中心位置
回到2，直到移動之后的結果（不變）和上次一樣

結果展示：使用樹狀圖來展現聚類之后的結果

import feedparser
import re
test
error_list = []
返回一個RSS訂閱源的標題和包含單詞計數情況的字典
def get_word_counts(url):

# 解析訂閱源
doc = feedparser.parse(url)

# 單詞計數
wc = {}

# 遍歷所有文章條目，統計所有單詞出現次數
for entry in doc.entries:
    if 'summary' in entry:
        summary = entry.summary
    else:
        summary = entry.description

    # 提取出所有單詞
    words = get_words(entry.title + ' ' + summary)
    # 統計所有單詞出現的次數
    for word in words:
        wc.setdefault(word, 0)
        wc[word] += 1
print url
if hasattr(doc.feed, 'title'):
    return doc.feed.title, wc
error_list.append(url)
return '', wc


分割出html中的所有單詞
def get_words(html):

# 取出所有html標記
txt = re.compile(r'<[^.]>').sub('', html)

# 利用所有非字母字符拆分出單詞
words = re.compile(r'[^A-Z^a-z]').split(txt)
# 轉換為小寫返回
return [word.lower() for word in words]


apcount = {}
word_counts = {}
feed_list = [line for line in file('feedlist.txt')]
讀取每一個url并統計單詞在每篇博客中出現的次數
for feed_url in feed_list:
    title, wc = get_word_counts(feed_url)
    if title == '':
        continue
    if title in word_counts:
        title += '1'
    print title
    word_counts[title] = wc

# 統計單詞詞頻
for word, count in wc.items():
    apcount.setdefault(word, 0)
    if count > 1:
        apcount[word] += 1


設定詞頻邊界，去除常見無用詞
word_list = []
for w, bc in apcount.items():
    frac = float(bc) / len(feed_list)
    if frac > 0.1 and frac < 0.5:
        word_list.append(w)
out = file('blogdata.txt', 'w')
輸出表頭
out.write('Blog')
for word in word_list:
    out.write('\t%s' % word)
out.write('\n')
輸出表格內容
for blog, wc in word_counts.items():
    out.write(blog)

for word in word_list:
    if word in wc:
        out.write('\t%d' % wc[word])
    else:
        out.write('\t0')
out.write('\n')


print error_list</code></pre> 
  
http://feeds.feedburner.com/37signals/beMH
http://feeds.feedburner.com/blogspot/bRuz
http://blog.outer-court.com/rss.xml
Google Blogoscoped
http://gizmodo.com/index.xml
http://googleblog.blogspot.com/rss.xml
http://feeds.feedburner.com/GoogleOperatingSystem
http://feeds.feedburner.com/Mashable</code></pre> 
  
上面的代碼遇到的問題 
   
   有些博客rss失效，導致doc.feed.title為空，需要判斷title是否為空，如果是則跳過該url 
   列表和元組的區別，元組是不可變的相當于Java中的數組，放在一個圓括號中()；列表是可變的，相當于Java中的List，方法一個方括號中[]，有python內置的方法和列表自己的方法，比如添加append 
   字典是使用大括號{}，遍歷key和value使用items()方法，直接對字典遍歷得到的是keySet 
   
  分級聚類：通過連續不斷的將最為相似的群組兩兩合并。其中每個群組都是從一個單一元素開始的。在這里這個單一元素就是blog。 
  下面對blog進行聚類，先加載數據文件，按主題進行分組 
  def readfile(filename):
    lines = [line for line in file(filename)]

colnames = line[0].strip().split('\t')[1:]
rownames = []
data = []
for row in lines[1:]:
    p = row.strip().split('\t')
    rownames.append(p[0])
    data.append(p[1:])

return rownames, colnames, data</code></pre> 

from math import sqrt
def pearson(v1, v2):
    """
    pearson算法計算兩個向量之間的相似度
    """

# 簡單求和
sum1 = sum(v1)
sum2 = sum(v2)

# 求平方和
sum1_sqrt = sum([pow(v, 2) for v in v1])
sum2_sqrt = sum([pow(v, 2) for v in v2])

# 求乘積之和
multi_sum = sum([v1[i] * v2[i] for i in range(len(v1))])

# 計算pearson score
num = multi_sum - (sum1 * sum2 / len(v1))
den = sqrt((sum1_sqrt - pow(sum1, 2) / len(v1)) * (sum2_sqrt - pow(sum2, 2) / len(v1)))
if den == 0:
    return 0
# 相關度越大den越大，這里為了表示相似度越大，兩個元素之間的距離的更小
return 1.0 - num/den</code></pre> 

# 定義一個聚類的類型
class bicluster:
    def init(self, vec, left=None, right=None, distance=0.0, id=None):

    # 該blog的每個單詞的頻率
    self.vec = vec
    # 左子節點
    self.left = left
    # 右子節點
    self.right = right
    # 當前聚合類的相似度
    self.distance = distance
    # 標識該聚類
    self.id = id</code></pre> 

def hcluster(rows, distance=pearson):
    """
    進行聚類運算:遍歷所有數據集，每次都找出距離最近的兩個聚類，
    然后生成一個新的聚類，將新生成的聚類添加到列表末尾，并刪除找到的兩個聚類，形成新的數據集，重復以上步驟
    """

# 緩存兩個cluster之間的距離
distances = {}
current_cluster_id = -1

# 剛開始的聚類
cluster = [bicluster(rows[i], id = i) for i in range(len(rows))]

# 開始聚類，直到數據集中只剩一個數據，表明聚類完成
while len(cluster) > 1:
    lowestpair = (0, 1)
    closest = distance(cluster[0].vec, cluster[1].vec)

    # 遍歷每一個配置，尋找最相似的兩個blog
    for i in range(len(cluster)):
        for j in range(i+1, len(cluster)):
            # 緩存距離
            if (cluster[i].id, cluster[j].id) not in distances:
                distances[cluster[i].id, cluster[j].id] = distance(cluster[i].vec, cluster[j].vec)
            d = distances[cluster[i].id, cluster[j].id]

            if d < closest:
                closest = d
                lowestpair = (i, j)

    # 計算兩個聚類的平均值
    merfevec = [(cluster[lowestpair[0]].vec[i] + cluster[lowestpair[1]].vec[i]) / 2.0 
                for i in range(len(cluster[0].vec))]

    # 建立新的聚類
    newcluster = bicluster(merfevec, left=cluster[lowestpair[0]], 
                           right=cluster[lowestpair[1]], distance=closest, id=current_cluster_id)
    # 不在原始集合中的聚類，使用id為負數表示
    current_cluster_id -= 1
    # 注意刪除順序，每次刪除之后index都會發生變化，先刪除index大的，即從后往前刪除
    del cluster[lowestpair[1]]
    del cluster[lowestpair[0]]
    cluster.append(newcluster)

return cluster[0]</code></pre> 

進行聚類運算，找出最終的聚合類 
  blognames, words, data = readfile('blogdata.txt')
# 字符串轉換為int
int_data = []
for row in data:
    temp_row = []
    for num in row:
        temp_row.append(int(num))
    int_data.append(temp_row)
cluster = hcluster(int_data)
print cluster 
  <__main__.bicluster instance at 0x7f978c07ca70> 
  # 將聚類后的cluster以樹的形式打印出來
def print_cluster(cluster, labels=None, n=0):
    for i in range(n):
        print '  ',
    # 如果是分支，負數表示一個分支
    if cluster.id < 0:
        print '-'
    else:
        if labels == None:
            print cluster.id
        else:
            print labels[cluster.id]
    # 打印左右分支
    if cluster.left != None:
        print_cluster(cluster.left, labels, n=n+1)
    if cluster.right != None:
        print_cluster(cluster.right, labels, n=n+1) 
  print_cluster(cluster, labels=blognames) 
  在python2中print是默認換行的，如果想print不換行： 
  在python2中：print 'hello', 
  在python3中：print ('hello', end='') 
  上面是使用縮進對整個聚類結構進行排版，可以繪制樹狀圖更加直觀 
  from PIL import Image, ImageDraw
# 計算每個節點的高度
def getheight(clust):
    # 如果是葉子節點，高度為1
    if clust.left == None and clust.right == None:
        return 1
    # 如果是樹枝節點，高度為所有分支高度之和
    return getheight(clust.right) + getheight(clust.left)

# 計算兩個節點之間的線段長度
def getdepth(clust):
    # 葉子節點的距離為0
    if clust.left == None and clust.right == None:
        return 0
    # 枝節點的距離為兩個子節點的距離較大值加上枝節點自身的距離
    return max(getdepth(clust.left), getdepth(clust.right)) + clust.distance

# 繪制樹狀圖，每個節點高度為20像素，寬度固定的jpg圖片
def drawdendrogram(clust, labels, jpeg='clusters.jpg'):
    # 圖片高度
    h = getheight(clust) * 20
    w = 1200
    depth = getdepth(clust)

    # 由于寬度固定，需要對距離進行縮放，150為左右空白
    scaling = float(w-150) / depth

    # 新建一張白色背景圖片
    img = Image.new('RGB', (w, h), (255, 255, 255))
    draw = ImageDraw.Draw(img)

    # 繪制中間根節點的線
    draw.line((0, h/2, 10, h/2), fill=(255, 0, 0))

    drawnode(draw, clust, 10, h/2, scaling, labels)
    img.save(jpeg, 'JPEG')

# 遞歸繪制節點自身以及兩個子節點
def drawnode(draw, clust, x, y, scaling, labels):
    # 如果是枝節點，則只負責繪制兩個子節點
    if clust.id < 0:
        h1 = getheight(clust.left) * 20
        h2 = getheight(clust.right) * 20
        top = y - (h1 + h2) / 2
        bottom = y + (h1 +h2) / 2

        # 當前節點線的長度
        line_len = clust.distance * scaling
        #print scaling, line_len

        # 聚類到子節點的垂直線
        draw.line((x, top + h1/2, x, bottom - h2/2), fill=(255, 0, 0))

        # 聚類到左子節點的垂直線
        draw.line((x, top + h1/2, x + line_len, top + h1/2), fill=(255, 0, 0))

        # 聚類到右子節點的垂直線
        draw.line((x + line_len, bottom - h2/2, x, bottom - h2/2), fill=(255, 0, 0))

        # 繪制左右子節點
        drawnode(draw, clust.left, x + line_len, top + h1/2, scaling, labels)
        drawnode(draw, clust.right, x + line_len, bottom - h2/2, scaling, labels)
    else:
        # 如果是葉子節點，繪制blogname
        draw.text((x + 5, y - 5), labels[clust.id], (0, 0, 0)) 
  drawdendrogram(cluster, blognames) 
  K-均值法 
  使用分級聚類的時候，因為得計算所有數據的兩兩之間的距離，形成新的聚類之后還得重新計算，所以在數據集較大的時候計算量會很大。 
  除了分級聚類之外還有一種K-均值聚類方法，主要思想為： 
   
   隨機創建（給定）k個點作為中心點 
   遍歷數據集中每個點，找到距離最近的中心點，將該點劃分在該中心點下 
   遍歷并劃分完成后，將各個中心點移到自己組下所有點的中心位置 
   回到2，直到移動之后的結果（不變）和上次一樣 
   
  import random

# k均值法進行聚類
def kcluster(rows, distance=pearson, k=4):
    # 確定每個點的最大值和最小值，便于控制隨機生成值的范圍
    ranges = [(min(row[i] for row in rows), max(row[i] for row in rows)) for i in range(len(rows[0]))]

    # 隨機生成k個點
    clusters = [[random.random() * (ranges[i][1] - ranges[i][0]) + ranges[i][0] for i in range(len(rows[0]))] for j in range(k)]

    lastmatches = None
    for t in range(100):
        print 'Iterator %d' % t
        bestmatches = [[] for i in range(k)]
        for j in range(len(rows)):
            bestmatch = 0
            for i in range(k):
                d = distance(rows[j], clusters[i])
                if d < distance(rows[j], clusters[bestmatch]):
                    bestmatch = i
            bestmatches[bestmatch].append(j)

        # 和上一次結果比較
        if bestmatches == lastmatches:
            break
        lastmatches = bestmatches

        # 把中心點移到其所有成員的平均位置處
        # 每個中心點都要移動
        for i in range(k):
            avgs = [0.0] * len(rows[0])
            if(len(bestmatches[i]) > 0):
                # 對其所有成員求和
                for rowid in bestmatches[i]:
                    for m in range(len(rows[rowid])):
                        avgs[m] += rows[rowid][m]
                # 求平均值
                for j in range(len(avgs)):
                    avgs[j] /= len(bestmatches[i])
                # 作為新的中心點
                clusters[i] = avgs

    return bestmatches 
  kclust = kcluster(int_data, k=5)
print kclust 
  Iterator 0
Iterator 1
Iterator 2
Iterator 3
Iterator 4
Iterator 5
Iterator 6
Iterator 7
[[5, 8, 10, 11, 17, 21, 29, 34, 48, 55, 62, 63, 66, 67, 68, 70, 73, 75, 84, 97], [2, 7, 13, 16, 23, 24, 25, 30, 36, 40, 44, 49, 56, 65, 69, 79, 80, 85, 91, 94], [4, 41, 42, 45, 46, 81], [6, 26, 27, 32, 33, 35, 37, 51, 53, 54, 57, 59, 64, 71, 76, 78, 82, 87, 88, 89, 90, 92, 95, 96], [0, 1, 3, 9, 12, 14, 15, 18, 19, 20, 22, 28, 31, 38, 39, 43, 47, 50, 52, 58, 60, 61, 72, 74, 77, 83, 86, 93, 98]] 
  使用分級聚類和k-均值聚類的方法，分析的數據是連續的，而對于偏好的數據，更好的是使用Tanimoto系數計算兩者之間的距離，分析兩個人在擁有物品重疊度的大小 
  tanimoto系數計算： 
  使用A和B的交集除以A和B的并集 
   
  # 計算Tanimoto系數
def tanimoto(v1, v2):
    c1, c2, shr = 0, 0, 0

    for i in range(len(v1)):
        # 出現在v1中
        if v1[i] != 0:
            c1 += 1
        # 出現在v2中
        if v2[i] != 0:
            c2 += 1
        # 在v1、v2中均出現
        if v1[i] != 0 and v2[i] != 0:
            shr += 1
    return 1.0 - (float(shr) / (c1 + c2 - shr)) 
    
  來自：http://www.cnblogs.com/sunshine-2015/p/6545641.html


                    
                         本文由用戶 TobyTraugot 自行上傳分享，僅供網友學習交流。所有權歸原作者，若您的權利被侵害，請聯系管理員。
                         轉載本站原創文章，請注明出處，并保留原始鏈接、圖片水印。
                         本站是一個以用戶分享為主的開源技術平臺，歡迎各類分享！
                         本文地址：http://www.baiduhome.net/lib/view/open1489480255759.html
                         聚類分析 機器學習

機器學習 — 發現群組

聚類

分級聚類

K-means聚類

test

返回一個RSS訂閱源的標題和包含單詞計數情況的字典

分割出html中的所有單詞

讀取每一個url并統計單詞在每篇博客中出現的次數

設定詞頻邊界，去除常見無用詞

輸出表頭

輸出表格內容

上面的代碼遇到的問題

K-均值法

相關經驗

相關資訊

相關文檔

目錄