[Massive Data Analysis] Frequent itemsets

Spring 2016 Massive Data Analysis Lecture Notes
Ch6. Frequent itemsets
Instructor: Jia-Shung Wang

Frequent itemsets

定義：If $I$ is a set of items, the support (threshold) for $I$ is the number of baskets for which $I$ is a subset. We say $I$ is frequent if its support is $s$ or more.
解釋：假設 $I$ = { milk, coke, pepsi, beer, juice }，Support threshold = 3，而現在有 8 個 transaction（想像成八個籃子），分別是 B1 - B8，B1 - B8 分別有以下 items：
```
B1 = {m, c, b} 
B2 = {m, p, j}
B3 = {m, b}
B4 = {c, j}
B5 = {m, p, b}  
B6 = {m, c, b, j}
B7 = {c, b, j}  
B8 = {b, c}
```
則 Frequent itemsets: {m}, {c}, {b}, {j}, {m, b}, {b, c}, {c, j}，因為這些組合至少出現在 3 個籃子裡。
support (threshold) (for $I$)：the number of baskets for which $I$ is a subset

Association Rules

定義：$I$ → $j$, where $I$ is a set of items and $j$ is an item. The implication of this association rule is that if all of the items in $I$ appear in some basket, then $j$ is “likely” to appear in that basket as well. “likely”: the confidence of the rule $I$ → $j$ to be the ratio of the support for $I$ ∪ {$j$} to the support for $I$.
解釋：就是說當某個 itemset $I$ 常常一起出現在某些籃子裡，則 $j$ 也“可能”常常一起出現在同樣的籃子裡。有個很有名的例子，{牛奶, 尿布} → {啤酒}。而這個”可能”則是取決於 support 和 confidence
專有名詞：
- confidence (of a rule)：the fraction of the baskets with all of $I$ that also contain $j$.
- Interest (of an association rule)：difference between its confidence and the fraction of baskets that contain $j$. Interesting rules are those with high positive or negative interest values (usually above 0.5)

How to find Frequent itemsets?

A-priori Algorithm (see 4.)
All these extensions to A-Priori have the goal of minimizing the number of pairs that actually have to be counted on the second pass.
A-priori Algorithm 的改進：基於使用 Hash Table
- PCY Algorithm (see 5.)
- Multistage Algorithm: 在 PCY 的第一步完成之後，candidates 裡還是會有許多非 frequent items，所以就有人想要在對第一步的結果做一次 hash (rehash，用另一種 hash function)，進而刪掉那些非頻繁的 items。所以 multistage 就是說要比 PCY 的兩步驟還要更多
- Multihash Algorithm: 有別於 multistage 做兩次 hash，multihash 在第一步就用了兩個 hash table
A-priori Algorithm的改進（基於Random）：A-Priori, PCY, etc., take $k$ passes to find frequent itemsets of size $k$ 而sample的目的在於減少passes
- Random Sampling: random 採樣部分的籃子，然後對他們進行 A-priori 或是 PCY 等演算法 (So we don’t pay for disk I/O each time)，再調整相應的 support (Reduce support threshold proportionally to match the sample size e.g. s/100 for 1% samples)
- SON Algorithm (see 6.)

A-priori Algorithm

簡介：用來尋找 association rules 的演算法，相較於暴力法，它省了很多記憶體的使用
概念：If a set of items $I$appears at least $s$ times, so does every subset $J$ of $I$也就是說如果有個 itemset 我們說它是 frequent itemset，那麼它的子集合應該也是 frequent —> monotonicity
細節
- 第一步： Read baskets and count in main memory the occurrences of each individual item. Items that appear $\geq s$ times are the frequent items 把資料庫掃一遍，找出每個 item 出現的次數（也就是說，把每個籃子裡的東西都看一遍，看每個東西被買的次數）而那些被購買次數大於 support 的就被稱作 frequent items
- 第二步： Read baskets again and count in main memory only those pairs where both elements are frequent (from Pass 1) 再掃一次籃子，然後去算 $k+1$ 個 item 組成的 itemset 出現的次數（$k$ 就是前一步產生的 frequent itemset 裡頭的 item 有幾個，如初始就是一個，所以第二輪會是 1+1 個）
參考：https://goo.gl/Gw0xE0
A-Priori for All Frequent Itemsets

PCY Algorithm

簡介：A-priori 的進階版，有鑒於 A-priori 在第一步的時候，大部分的記憶體都是閒置的，所以想要利用這些空間，來減少第二步使用到的空間
細節
- 第一步：In addition to item counts, maintain a hash table with as many buckets as fit in memory. Keep a count for each bucket into which pairs** of items are hashed. For each bucket just keep the count, not the actual pairs that hash to the bucket! (just for filtering)除了做 A-priori 的第一步，另外再建一個 hash table 來放 counts，演算法如下：
```
FOR (each basket) :
        FOR (each item in the basket) :
            add 1 to item's count;
        FOR (each pair of items) :
            hash the pair to a bucket;
                add 1 to the countfor that bucket;
```
  - 而buckets會有幾種類型：
    1. If a bucket contains a frequent pair, then the bucket is surely frequent (frequent buckets) 桶子裡有 frequent pair 的一定是 frequent bucket
    2. However, even without any frequent pair, a bucket can still be frequent 但是沒有 frequent pair 的並不代表它不是 frequent bucket
    3. But, for a bucket with total count less than $s$, none of its pairs can be frequent 而桶子的總數小於 support 的，一定不是 frequent bucket
  - 來個例子（中國用術語解釋XDD 散列=hash，事務=transaction=籃子）
- Between Passes: Replace the buckets by a bit-vector
- 第二步：Only count pairs that hash to frequent buckets
  - Count all pairs {$i, j$} that meet the conditions for being a candidate pair:
    1. Both $i$ and $j$ are frequent items
    2. The pair {$i, j$} hashes to a bucket whose bit in the bit vector is 1 (i.e., a frequent bucket)
  - 上列兩點都是必要的，才能確認 pair {$i, j$} 是 frequent
PCY 使用 Hash table 的好處：
- We can expect most of the buckets to be infrequent. In that case, PCY reduces the memory requirements of the second pass. (If most buckets are infrequent, we expect that the number of pairs being counted on the second pass will be much smaller.)
參考：http://goo.gl/Bq7jLr

SON Algorithm (p.60-64)

簡介：同樣是找 frequent itemsets 的演算法，但他不 hash 而是基於劃分
優點：同時避免了 false negatives (frequent 被當作 infrequent) 和 false positives (infrequent 被當作 frequent)
概念：將輸入文件劃分成 $\frac{1}{p}$ 個 chunks，再把每個 chunk 視為一個 sample（但不是 sampling 而是每塊都做），然後對他們進行 A-priori。
細節：
- 第一步：Repeatedly read small subsets of the baskets into main memory and run an in-memory algorithm to find all frequent itemsets. An itemset becomes a candidate if it is found to be frequent in any one or more subsets of the baskets. 切塊，然後對每塊去找 frequent itemset，只要這個 itemset 在任一塊 chunk 裡被視為 frequent，它就會成為 candidate
- 第二步：Count all the candidate itemsets and determine which are frequent in the entire set. 對所有的 candidate itemsets 去找大於 support 值的作為 frequent itemsets (an itemset cannot be frequent in the entire set of baskets unless it is frequent in at least one subset.)
SON Algorithm and MapReduce: Each of the chunks can be processed in parallel, and the frequent itemsets from each chunk combined to form the candidates.
參考：http://goo.gl/6qQQB9

Slides

from CS246 Mining Massive Data Sets, Stanford University
http://web.stanford.edu/class/cs246/slides/frequent-itemsets.pdf