Spring 2016 Massive Data Analysis Lecture Notes
Ch5. Link Analysis
Instructor: Jia-Shung Wang
Credit: Jane To




前情提要:Google 因為用了 PageRank 所以變成搜尋引擎的老大,眼紅的傢伙就用所謂的 link spam 來設法讓 Google 網頁排序受影響,所以後來又發展了所謂的 TrustRank,讓你更信任 Google。而 HITS 是一個幾乎可以說跟 Google 的 PageRank 一樣厲害的排序法哦~

PageRank





  • Topic-Specific PageRank
    • Goal: Evaluate Web pages not just according to their popularity, but by how close they are to a particular topic, e.g. “sports” or “history”

TrustRank

前情提要,最一開始的 spammer 是 term spamming (p.100),就是在你的網頁中偷偷放一堆term只讓搜尋引擎看得到,但是這個 PageRank 就可以打趴他了。所以後來又出現了新的 spammer 叫作 link spamming,就是養 link farm 來借他人之力長自己威風的概念。因此,為了打擊犯罪,就需要使用 TrustRank

  • TrustRank: topic-specific PageRank with a teleport set of trusted pages, for example, .edu domains, similar domains for non-US schools
  • Basic principle: Approximate isolation: It is rare for a “good” page to point to a “bad” (spam) page
  • 作法:
    • Sample a set of seed pages from the web
    • Have an oracle (human) to identify the good pages and the spam pages in the seed set. (這是一項貴森森的任務, so we must make seed set as small as possible)
  • 選擇 seed set
    • 遇到的問題
      • Human has to inspect each seed page, so seed set must be as small as possible
      • Must ensure every good page gets adequate trust rank, so need make all good pages reachable from seed set by short paths
    • 選擇方法:
      • Pick the top $k$ pages by PageRank (Theory is that you can’t get a bad page’s rank really high)
      • Use trusted domains whose membership is controlled, like .edu, .mil, .gov
  • 兩種解法:
  1. Use a threshold value and mark all pages below the trust threshold as spam
  2. Spam mass estimation (想法來自What fraction of a page’s PageRank comes from spam pages?)

HITS (Hypertext-Induced Topic Selection)

  • 說明:Is a measure of importance of pages or documents, similar to PageRank
  • 目的:“Say we want to find good newspapers”
    • Don’t just find newspapers. Find “experts” – people who link in a coordinated way to good newspapers
  • 想法:Links as votes - Page is more important if it has more links
  • 組成:Hubs and Authorities

    • Authorities are pages containing useful information (候選人) e.g. Newspaper home pages
    • Hubs are pages that link to authorities (有投票權的公民) e.g. List of newspapers

  • 演算法

  • PageRank and HITS are two solutions to the same problem:

    • What is the value of an in-link from $u$ to $v$?
      • In the PageRank model, the value of the link depends on the links into $u$
      • In the HITS model, it depends on the value of the other links out of $u$

Slides