Spring 2016 Massive Data Analysis Lecture Notes
Ch3. Finding Similar items
Instructor: Jia-Shung Wang
Credit: Jane To



Similarity

  • fundamental data-mining problem is to examine data for similar items e.g., matching fingerprint, plagiarism(剽竊), 推薦購物…
  • Jaccard Similary: $SIM(S, T) = | S 交集 T | / | S 聯集 T |$

Shingling

  • finding textually similar documents can be turned into such a set problem** by the technique known as “shingling.” Convert documents to sets (建出 matrix)
  • $k$-shingle(or $k$-gram): a sequence of k tokens that appears in the doc**
    • $k = 2$; Document $D_{1} = abcab$
    • 2-shingles: $S(D_{1}) = {ab, bc, ca}$
    • shingles as a bag, ‘ab’要算兩次:$S(D_{1}) = {ab, bc, ca, ab}$
  • Characteristic matrix
    User-uploaded image: hackpad.com_QA099fPu6Tx_p.543864_1463636005411_undefined

Min-hashing

  • which compresses large sets in such a way that we can still deduce the similarity of the underlying sets from their compressed versions. (縮減縱軸) Convert large sets to short signatures 重點是要能保留 similarity
  • Min-hash: a set represented by a column of the characteristic matrix, pick a permutation of the rows. The minhash value of any column is the number of the first row, in the permuted order, in which the column has a 1.

  • Min-hashing and Jaccard Similarity
    • 關係:兩個 minHash 相等的概率等於這兩個結合的相似度
    • The Min-Hash Property:
      • $Pr[\min(h(C_{1})) = min(h(C_{2}))] = | C_{1} 交集 C_{2} |/| C_{1} 聯集 C_{2} | = sim(C_{1}, C_{2}) $

Locality-sensitive hashing (LSH)

  • for focusing our search on pairs that are most likely to be similar. (把橫軸拿去篩選) Candidate pairs
    • General LSH approach: “hash” items several times, similar items are more likely to be hashed to the same bucket** than dissimilar items are.
    • Same bucket —> Candidate pair
    • Check only the candidate pairs for similarity. (most of the dissimilar pairs will never hash to the same bucket)
    • False positives: Those dissimilar pairs that do hash to the same bucket (希望佔少數)
    • We also hope that most of the truly similar pairs will hash to the same bucket under at least one of the hash functions.
    • Those that do not are false negatives (希望佔少數)
  • General idea: Use a function $f(x,y)$ that tells whether $x$ and $y$ is a candidate pair: a pair of elements whose similarity must be evaluated
  • Pick a similarity threshold $s (0 < s < 1)$

  • LSH tradeoff: to balance false positives/negatives
    • Example: If we had only 15 bands of 5 rows, the number of false positives would go down, but the number of false negatives would go up
  • Summary:
    • Tune $M$, $b$, $r$ to get almost all pairs with similar signatures, but eliminate most pairs that do not have similar signatures
    • Check in main memory that candidate pairs really do have similar signatures
    • Optional: In another pass through data, check that the remaining candidate pairs really represent similar documents

Slides