Spring 2016 Massive Data Analysis Lecture Notes
Ch7. Clustering
Instructor: Jia-Shung Wang



Clustering

  • Overview: group points into clusters. Given a set of points, with a notion of distance between points, group the points into some number of clusters, so that
    • Members of a cluster are close/similar to each other (the more similar the points within a cluster, the better)
    • Members of different clusters are dissimilar (the less similar different clusters are, the better)
    • Usually: points are in a high-dimensional space, and similarity is defined using a distance measure

Curse of dimensionality

  • Overview: clustering in high-dimensional spaces is difficult; when the data has many dimensions, clustering becomes hard.
  • Why it matters (from last semester's Data Mining notes): when dimensionality increases, data becomes increasingly sparse in the space that it occupies. High-dimensional data is difficult to work with, since adding more features can increase noise. It also increases running time, reduces accuracy, and increases complexity.
  • Properties (see the simulation sketch after this list):
    • Almost all pairs of points are equally far away from one another. This is the sparsity mentioned above.
      • If the data is one-dimensional, the points lie on a line; if the points are uniformly distributed in [0, 1], the average distance between two points is 1/3. If the dimensionality is high, the points are spread over a very high-dimensional space, so any two points are naturally far apart.
      • If no two points are close to each other, it is hard to do clustering at all.
      • In a $d$-dimensional unit cube, the maximum possible distance between two points is $\sqrt{d}$. In fact, almost all points will have a distance close to the average distance (almost all pairs of points are at about the same distance). But if every pair of points is at roughly the same distance (all close to the average), effective clustering is also impossible.
    • Almost any two vectors are almost orthogonal: the angle formed by any two vectors (i.e., by any three points) is close to a right angle.
      • This does not hold when the dimensionality is small.
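A quick numerical illustration of both properties (a sketch added here, not part of the original notes; it assumes NumPy and SciPy are available): as $d$ grows, the pairwise distances of uniform random points concentrate around their mean (std/mean shrinks), and the cosine of the angle between two random vectors approaches 0, i.e., nearly 90 degrees.

```python
# Sketch: distance concentration and near-orthogonality in high dimensions.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

for d in (1, 2, 10, 100, 1000):
    X = rng.random((300, d))          # 300 points uniform in the unit cube [0, 1]^d
    dist = pdist(X)                   # Euclidean distances over all distinct pairs

    # Cosine of the angle between two random vectors: close to 0 means ~90 degrees.
    u, v = rng.standard_normal(d), rng.standard_normal(d)
    cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

    print(f"d={d:5d}  mean dist={dist.mean():6.2f}  "
          f"std/mean={dist.std() / dist.mean():.3f}  cos(angle)={cos:+.3f}")
```

For $d = 1$ the mean distance comes out near 1/3, matching the one-dimensional case above.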

Cosine, Jaccard, and Euclidean Distance

  • Sets as vectors: Measure similarity by the cosine distance
  • Sets as sets: Measure similarity by the Jaccard distance
  • Sets as points: Measure similarity by Euclidean distance
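As a rough illustration (a sketch added here, not from the notes), the following computes all three measures for two small word sets, treated first as sets (Jaccard) and then as 0/1 vectors over the combined vocabulary (cosine and Euclidean). The documents and variable names are made up for the example.

```python
import math

doc1 = {"data", "mining", "cluster", "distance"}
doc2 = {"data", "cluster", "kmeans"}

# Jaccard distance: 1 - |intersection| / |union|  (sets as sets)
jaccard = 1 - len(doc1 & doc2) / len(doc1 | doc2)

# Represent the same sets as 0/1 vectors over the combined vocabulary.
vocab = sorted(doc1 | doc2)
v1 = [1 if w in doc1 else 0 for w in vocab]
v2 = [1 if w in doc2 else 0 for w in vocab]

# Cosine distance: the angle between the two vectors (sets as vectors)
dot = sum(a * b for a, b in zip(v1, v2))
norm1 = math.sqrt(sum(a * a for a in v1))
norm2 = math.sqrt(sum(b * b for b in v2))
cosine = math.acos(dot / (norm1 * norm2))

# Euclidean distance: straight-line distance between the two points (sets as points)
euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

print(f"Jaccard={jaccard:.3f}  cosine(angle)={cosine:.3f} rad  Euclidean={euclidean:.3f}")
```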

Methods of Clustering

  • Hierarchical (see V.)
    • Agglomerative (bottom up)
      • Initially, each point is a cluster
      • Repeatedly combine the two “nearest” clusters into one
    • Divisive (top down): start with one cluster and recursively split it
  • Point assignment
    • Maintain a set of clusters
    • Points belong to “nearest” cluster

Hierarchical Clustering

  • Agglomerative: repeatedly combine the two nearest clusters

    • Three key questions

      1. How to represent a cluster of many points?

        • Euclidean case: each cluster has a centroid = average of its (data) points. The centroid is the average of all data points in the cluster; it is not necessarily an actual data point but an “artificial” location.
        • Non-Euclidean case: clustroid = the (data) point “closest” to the other points. Among the existing data points, the one “closest” to every other point in the cluster is the clustroid (see the clustroid/cohesion sketch after this list), where “closest” can mean:
          • Smallest maximum distance to other points
          • Smallest average distance to other points
          • Smallest sum of squares of distances to other points
      2. How to determine “nearness” of clusters?

        • Euclidean case: Measure cluster distances by distances of centroids
        • Non-Euclidean case:
          • Approach 1: Treat the clustroid as if it were a centroid when computing inter-cluster distances
          • Approach 2: Intercluster distance = minimum of the distances between any two points, one from each cluster
          • Approach 3: Pick a notion of “cohesion” of clusters, and merge the clusters whose union is most cohesive. Ways to measure cohesion:
            • Use the diameter of the merged cluster = the maximum distance between points in the cluster (the distance between the two farthest points)
            • Use the average distance between points in the cluster
            • Use a density-based approach: take the diameter or average distance, e.g., and divide by the number of points in the cluster
      3. When to stop combining clusters?
    • Implementation and Efficiency (see the sketch after this list)
      • At each step, compute the pairwise distances between all pairs of clusters, then merge the closest pair
        • $O(n^{3})$
        • Every step computes the distance between every pair of clusters in order to find the best merger (the two closest clusters)
        • The initial step takes $O(n^{2})$ time, but subsequent steps take time proportional to $(n-1)^{2}$, $(n-2)^{2}$, …; the sum of squares up to $n$ is $O(n^{3})$
      • A more efficient method: using a priority queue reduces the time
        • $O(n^{2}\log n)$, which is still too expensive for very large $n$
        • Compute all pairwise distances up front
        • Store each pair of points together with its distance in the priority queue, so one pass finds the minimum distance —> $O(n^{2})$
        • When we decide to merge two clusters C and D, remove every entry involving either of them. At most $2n$ removals are needed, and each priority-queue deletion takes $O(\log n)$, so this takes —> $O(n\log n)$
        • Compute the distances between the new cluster and the remaining (not deleted) clusters. At most $n$ entries must be inserted into the queue, and each priority-queue insertion takes $O(\log n)$ —> $O(n\log n)$
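The following is a minimal sketch (not from the notes) of the non-Euclidean notions above: choosing a clustroid by the smallest sum of squared distances, and measuring cohesion by diameter or average pairwise distance. The helper names (`clustroid`, `diameter`, `avg_distance`) and the Euclidean `dist` are illustrative; any distance measure could be plugged in.

```python
from itertools import combinations
import math

def dist(p, q):
    # Any distance measure works here; Euclidean is used only for the example.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def clustroid(cluster):
    """Data point with the smallest sum of squared distances to the other points."""
    return min(cluster, key=lambda p: sum(dist(p, q) ** 2 for q in cluster))

def diameter(cluster):
    """Maximum distance between any two points in the cluster."""
    return max(dist(p, q) for p, q in combinations(cluster, 2))

def avg_distance(cluster):
    """Average distance over all distinct pairs of points in the cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)

cluster = [(0, 0), (1, 0), (0, 1), (5, 5)]
print(clustroid(cluster), diameter(cluster), avg_distance(cluster))
```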
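And a sketch of the naive Euclidean agglomerative procedure itself (again an illustration, not the course's reference implementation): every point starts as its own cluster, and the two clusters with the closest centroids are merged until $k$ clusters remain. This is the $O(n^{3})$ variant; the priority-queue version would keep the candidate pairs in a heap instead of rescanning them at every step.

```python
import math

def centroid(points):
    d = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(d))

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def agglomerative(points, k):
    """Merge clusters until only k remain; returns a list of clusters (lists of points)."""
    clusters = [[p] for p in points]             # initially, each point is a cluster
    while len(clusters) > k:
        # Find the pair of clusters whose centroids are closest (O(n^2) per step).
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: euclidean(centroid(clusters[ij[0]]), centroid(clusters[ij[1]])),
        )
        clusters[i].extend(clusters[j])           # merge cluster j into cluster i
        del clusters[j]
    return clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (5, 5)]
print(agglomerative(points, k=2))
```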

K-means Clustering

  • Overview: a clustering algorithm in which the number of clusters $k$ must be fixed in advance; the algorithm then iteratively optimizes its objective (minimizing the distance from each point to the centroid of its assigned cluster) to obtain a good clustering.
  • Details (see the sketch after this list):
    • Assumption: assumes a Euclidean space/distance
    • Step 1: Start by picking $k$, the number of clusters. Initialize the clusters by picking one point per cluster
    • Step 2: For each point, place it in the cluster whose current centroid is nearest
    • Step 3: After all points are assigned, update the locations of the centroids of the $k$ clusters
    • Step 4: Reassign all points to their closest centroid (this sometimes moves points between clusters)
    • Repeat Steps 3 and 4 until convergence (convergence: points no longer move between clusters, and the centroids no longer change)
  • Question: how many clusters should we use? (i.e., how do we choose $k$?)
    • Try different values of $k$, looking at the change in the average distance to the centroid as $k$ increases.
    • The average falls rapidly until the right $k$, then changes little
    • As $k$ increases, the average distance from the points to their nearest of the $k$ centroids keeps dropping, but by varying amounts; the value of $k$ where the improvement starts to level off is a good choice
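A minimal k-means sketch under the assumptions above (an illustration, not the lecture's reference code), followed by the try-several-$k$ heuristic: for each $k$ it reports the average distance to the nearest centroid, and the elbow of that curve suggests a good $k$. The toy data (three Gaussian blobs) and the helper names are made up for the example.

```python
import random
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kmeans(points, k, iters=100, seed=0):
    random.seed(seed)
    centroids = random.sample(points, k)                # Step 1: one point per cluster
    for _ in range(iters):
        # Steps 2/4: assign every point to its nearest current centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: euclidean(p, centroids[i]))
            clusters[i].append(p)
        # Step 3: recompute each centroid as the mean of its assigned points.
        new_centroids = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:                  # convergence: centroids stable
            break
        centroids = new_centroids
    return centroids

random.seed(1)
points = [(random.gauss(cx, 1), random.gauss(cy, 1))
          for cx, cy in [(0, 0), (8, 8), (0, 8)] for _ in range(50)]

for k in range(1, 7):
    cs = kmeans(points, k)
    avg = sum(min(euclidean(p, c) for c in cs) for p in points) / len(points)
    print(f"k={k}  average distance to nearest centroid = {avg:.2f}")
```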

BFR Algorithm

  • Overview: an extension of k-means, suited to handling very large amounts of data
  • Assumption: the points within a cluster are normally distributed around the centroid (in a Euclidean space). A cluster's mean and standard deviation may differ across dimensions, but the dimensions must be mutually independent
  • Efficiency: an efficient way to summarize clusters (requires only $O(clusters)$ and not $O(data)$ memory)
  • Prerequisites:
    • Points are read from disk one main-memory-full at a time
    • BFR is designed for very large data sets, so the data is read in chunks. Each chunk must be guaranteed to fit in main memory, because main memory also has to hold other necessary information: the DS, CS, and RS
    • Most points from previous memory loads are summarized by simple statistics
  • Details:

    • Step 1: From the initial load, select the initial $k$ centroids by some sensible approach:
      • Take $k$ random points
      • Take a small random sample and cluster it optimally
      • Take a sample; pick a random point, and then pick $k-1$ more points, each as far from the previously selected points as possible
    • After choosing the $k$ centroids and clustering the load, there are three types of points:

      • Discard set (DS): points close enough to a centroid to be summarized (once summarized, the points themselves are no longer stored). Each DS cluster is summarized by:
        • The number of points, $N$
        • The vector $SUM$, whose $i$th component is the sum of the coordinates of the points in the $i$th dimension; that is, $SUM[i]$ is the coordinate sum over dimension $i$
        • The vector $SUMSQ$, whose $i$th component is the sum of the squares of the coordinates in the $i$th dimension
        • Therefore, if the data is $d$-dimensional, this representation describes a DS or CS cluster with $2d+1$ values (see the sketch after this section)
        • The centroid can be expressed as $SUM_i / N$, where $SUM_i$ is the $i$th component of $SUM$
        • The variance of a DS cluster in dimension $i$ is $(SUMSQ_i / N) - (SUM_i / N)^{2}$
      • Compression set (CS): groups of points that are close together but not close to any existing centroid. These points are summarized, but not assigned to a cluster (once summarized, the points themselves are no longer stored)
      • Retained set (RS): isolated points waiting to be assigned to a compression set
    • Step 2: Find all points that are sufficiently close to some cluster's centroid and add them to that cluster. These points are so close to the centroid that they can be summarized and then discarded

    • Step 3: Cluster the remaining points together with the old RS using any main-memory clustering algorithm; add the resulting clusters to the CS, and put the outlying points into the RS
    • Step 4: When new points join a DS cluster, the N-SUM-SUMSQ representation is updated by simply adding the new points' $N_{S}$, $SUM_{S}$, $SUMSQ_{S}$ values
    • Step 5: Step 3 produces new CS clusters; these may lie close to the CS clusters and RS points left over from the previous chunk and thus be candidates for merging. This step therefore merges the new and old CS clusters, together with the old RS points, where appropriate
    • Step 6: If this is the last chunk of data, assign the RS points and the CS clusters to the nearest cluster; otherwise keep the RS and CS and process them together with the next chunk
  • A few questions about BFR
    • How do we decide whether a point is “close enough” to a cluster to add it to that cluster?
      • We need a way to decide whether to put a new point into a cluster (and then discard it)
      • Two methods:
        1. The Mahalanobis distance is less than a threshold: compute the M.D. between each new point and the centroid (see the sketch after this section)
          • Normalized Euclidean distance from the centroid
          • For point $(x_{1}, \dots, x_{d})$, centroid $(c_{1}, \dots, c_{d})$, and per-dimension standard deviations $\sigma_{1}, \dots, \sigma_{d}$ of the cluster: $d(x, c) = \sqrt{\sum_{i=1}^{d} \left(\frac{x_{i} - c_{i}}{\sigma_{i}}\right)^{2}}$
          • If clusters are normally distributed in $d$ dimensions, then after this transformation, one standard deviation $= \sqrt{d}$
          • Accept a point into a cluster if its M.D. is less than some threshold (i.e., it counts as close enough)
        2. High likelihood of the point belonging to the currently nearest centroid
    • How do we decide whether two compression sets (CS) deserve to be combined into one?
      • Compute the variance of the combined subcluster for any two candidate subclusters (this is quick to compute from $N$, $SUM$, and $SUMSQ$)
      • Combine them when the variance is below some threshold
      • Other approaches: treat dimensions differently, or consider density
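A small sketch (not from the notes) of the $N$/$SUM$/$SUMSQ$ summary and the tests discussed above: it derives the centroid and per-dimension variance, computes the Mahalanobis distance of a new point, and combines two summaries as in Steps 4 and 5. The class and method names are illustrative, and the `close_enough` threshold of three standard deviations (about $3\sqrt{d}$) is just one plausible choice of threshold, not one prescribed by the notes.

```python
import math

class ClusterSummary:
    def __init__(self, d):
        self.N = 0
        self.SUM = [0.0] * d       # per-dimension coordinate sums
        self.SUMSQ = [0.0] * d     # per-dimension sums of squared coordinates

    def add(self, point):
        """Fold one point into the summary; the point itself can then be discarded."""
        self.N += 1
        for i, x in enumerate(point):
            self.SUM[i] += x
            self.SUMSQ[i] += x * x

    def centroid(self):
        return [s / self.N for s in self.SUM]

    def variance(self):
        return [sq / self.N - (s / self.N) ** 2 for s, sq in zip(self.SUM, self.SUMSQ)]

    def mahalanobis(self, point):
        """Normalized Euclidean distance from the centroid (zero-variance dims skipped)."""
        c, var = self.centroid(), self.variance()
        return math.sqrt(sum((x - ci) ** 2 / v
                             for x, ci, v in zip(point, c, var) if v > 0))

    def close_enough(self, point, k_sigma=3.0):
        d = len(self.SUM)
        return self.mahalanobis(point) < k_sigma * math.sqrt(d)

    def merged_with(self, other):
        """Summary of the union of two clusters (used when combining CS subclusters)."""
        m = ClusterSummary(len(self.SUM))
        m.N = self.N + other.N
        m.SUM = [a + b for a, b in zip(self.SUM, other.SUM)]
        m.SUMSQ = [a + b for a, b in zip(self.SUMSQ, other.SUMSQ)]
        return m

summary = ClusterSummary(d=2)
for p in [(0.0, 0.1), (0.2, -0.1), (-0.1, 0.0), (0.1, 0.2)]:
    summary.add(p)
print(summary.centroid(), summary.variance())
print(summary.close_enough((0.3, 0.1)), summary.close_enough((50.0, 50.0)))

summary2 = ClusterSummary(d=2)
for p in [(0.1, 0.0), (0.0, 0.2)]:
    summary2.add(p)
combined = summary.merged_with(summary2)
print(combined.N, combined.variance())   # decide whether to merge based on this variance
```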

Slides