[Massive Data Analysis] Finding Similar items

Spring 2016 Massive Data Analysis Lecture Notes
Ch3. Finding Similar items
Instructor: Jia-Shung Wang
Credit: Jane To

Similarity

fundamental data-mining problem is to examine data for similar items e.g., matching fingerprint, plagiarism(剽竊), 推薦購物…
Jaccard Similary: $SIM(S, T) = | S 交集 T | / | S 聯集 T |$

finding textually similar documents can be turned into such a set problem** by the technique known as “shingling.” Convert documents to sets (建出 matrix)
$k$-shingle(or $k$-gram): a sequence of k tokens that appears in the doc**
- $k = 2$; Document $D_{1} = abcab$
- 2-shingles: $S(D_{1}) = {ab, bc, ca}$
- shingles as a bag, ‘ab’要算兩次：$S(D_{1}) = {ab, bc, ca, ab}$
Characteristic matrix

which compresses large sets in such a way that we can still deduce the similarity of the underlying sets from their compressed versions. (縮減縱軸) Convert large sets to short signatures 重點是要能保留 similarity
Min-hash: a set represented by a column of the characteristic matrix, pick a permutation of the rows. The minhash value of any column is the number of the first row, in the permuted order, in which the column has a 1.

Min-hashing and Jaccard Similarity
- 關係：兩個 minHash 相等的概率等於這兩個結合的相似度
- The Min-Hash Property：
  - $Pr[\min(h(C_{1})) = min(h(C_{2}))] = | C_{1} 交集 C_{2} |/| C_{1} 聯集 C_{2} | = sim(C_{1}, C_{2}) $

for focusing our search on pairs that are most likely to be similar. (把橫軸拿去篩選) Candidate pairs
- General LSH approach: “hash” items several times, similar items are more likely to be hashed to the same bucket** than dissimilar items are.
- Same bucket —> Candidate pair
- Check only the candidate pairs for similarity. (most of the dissimilar pairs will never hash to the same bucket)
- False positives: Those dissimilar pairs that do hash to the same bucket (希望佔少數)
- We also hope that most of the truly similar pairs will hash to the same bucket under at least one of the hash functions.
- Those that do not are false negatives (希望佔少數)
General idea: Use a function $f(x,y)$ that tells whether $x$ and $y$ is a candidate pair: a pair of elements whose similarity must be evaluated
Pick a similarity threshold $s (0 < s < 1)$

LSH tradeoff: to balance false positives/negatives
- Example: If we had only 15 bands of 5 rows, the number of false positives would go down, but the number of false negatives would go up
Summary:
- Tune $M$, $b$, $r$ to get almost all pairs with similar signatures, but eliminate most pairs that do not have similar signatures
- Check in main memory that candidate pairs really do have similar signatures
- Optional: In another pass through data, check that the remaining candidate pairs really represent similar documents