[Massive Data Analysis] Link Analysis
Spring 2016 Massive Data Analysis Lecture Notes
Ch5. Link Analysis
Instructor: Jia-Shung Wang
Credit: Jane To
前情提要:Google 因為用了 PageRank 所以變成搜尋引擎的老大,眼紅的傢伙就用所謂的 link spam 來設法讓 Google 網頁排序受影響,所以後來又發展了所謂的 TrustRank,讓你更信任 Google。而 HITS 是一個幾乎可以說跟 Google 的 PageRank 一樣厲害的排序法哦~
PageRank
- Topic-Specific PageRank
- Goal: Evaluate Web pages not just according to their popularity, but by how close they are to a particular topic, e.g. “sports” or “history”
- Goal: Evaluate Web pages not just according to their popularity, but by how close they are to a particular topic, e.g. “sports” or “history”
TrustRank
前情提要,最一開始的 spammer 是 term spamming (p.100),就是在你的網頁中偷偷放一堆term只讓搜尋引擎看得到,但是這個 PageRank 就可以打趴他了。所以後來又出現了新的 spammer 叫作 link spamming,就是養 link farm 來借他人之力長自己威風的概念。因此,為了打擊犯罪,就需要使用 TrustRank
- TrustRank: topic-specific PageRank with a teleport set of trusted pages, for example, .edu domains, similar domains for non-US schools
- Basic principle: Approximate isolation: It is rare for a “good” page to point to a “bad” (spam) page
- 作法:
- Sample a set of seed pages from the web
- Have an oracle (human) to identify the good pages and the spam pages in the seed set. (這是一項貴森森的任務, so we must make seed set as small as possible)
- 選擇 seed set
- 遇到的問題
- Human has to inspect each seed page, so seed set must be as small as possible
- Must ensure every good page gets adequate trust rank, so need make all good pages reachable from seed set by short paths
- 選擇方法:
- Pick the top $k$ pages by PageRank (Theory is that you can’t get a bad page’s rank really high)
- Use trusted domains whose membership is controlled, like .edu, .mil, .gov
- 遇到的問題
- 兩種解法:
- Use a threshold value and mark all pages below the trust threshold as spam
- Spam mass estimation (想法來自What fraction of a page’s PageRank comes from spam pages?)
HITS (Hypertext-Induced Topic Selection)
- 說明:Is a measure of importance of pages or documents, similar to PageRank
- 目的:“Say we want to find good newspapers”
- Don’t just find newspapers. Find “experts” – people who link in a coordinated way to good newspapers
- 想法:Links as votes - Page is more important if it has more links
組成:Hubs and Authorities
- Authorities are pages containing useful information (候選人) e.g. Newspaper home pages
- Hubs are pages that link to authorities (有投票權的公民) e.g. List of newspapers
演算法
PageRank and HITS are two solutions to the same problem:
- What is the value of an in-link from $u$ to $v$?
- In the PageRank model, the value of the link depends on the links into $u$
- In the HITS model, it depends on the value of the other links out of $u$
- What is the value of an in-link from $u$ to $v$?
Slides
- from CS246 Mining Massive Data Sets, Stanford University
- Part 1: http://web.stanford.edu/class/cs246/slides/PageRank-1.pdf
- Part 2: http://web.stanford.edu/class/cs246/slides/PageRank-2.pdf