[Massive Data Analysis] Advertising on the Web

Spring 2016 Massive Data Analysis Lecture Notes
Ch8. Advertising on the Web
Instructor: Jia-Shung Wang

Advertising Opportunities and Online Algorithms

post ads directly, displayed ads, search ads
Each of these methods, and many others like these, raise enormous privacy issues 各類型的線上廣告衍伸出隱私問題
Online Algorithms
- Classic model of algorithms
  - You get to see the entire input, then compute some function of it
  - In this context, “offline algorithm”
- Online Algorithms
  - You get to see the input one piece at a time, and need to make irrevocable（經決定後不可改變的） decisions along the way
  - Similar to the data stream model

Online Bipartite Matching

Bipartite Matching：一張二分圖上的匹配稱作二分匹配，理所當然所有的匹配邊都是這橫跨這兩群點的邊，就像是連連看一樣。
Max. Matching 最大匹配：二分圖G的所有匹配中，edges數量最多的匹配
p.7-9動畫：
- 男生一群、女生一群，p.7 的連線指的是他們彼此間的 Preferences，而目標是要找到一個 match 方式使得男生女生都可以剛剛好連到有 preference 的對象 (Match boys to girls so that maximum number of preferences is satisfied)
- 先連 2-b 和 4-a，因為 2 和 4 除了 b 和 a 以外沒有其他的 preference 了；接著連 1-c 和 3-d（不能重複連到女生，一個配舞伴的概念，一個女生不會有兩個舞伴）
Matching Algorithm
- Problem：Find a maximum matching for a given bipartite graph
- 第一步：初始化，只有男生組
- 第二步開始：在每個round，決定其中一個女生的選擇 (girl’s edges are revealed)
  - At that time, we have to decide to either 兩種選擇
    - Pair the girl with a boy 配對
    - Do not pair the girl with any boy 不配對
Greedy algorithm for the online graph matching problem: 把女生配給任何有資格的男生 (If there is none, do not pair girl) 一開始就把一些明顯可以配對的點給配對起來（能連的先連一連），這樣就不必替每一個未匹配點建立交錯樹了。這個手段在圖很龐大的時候可以發揮作用。
最佳配對方法與 greedy algorithm 的比較
- Competitive Ratio: For input $I$, suppose greedy produces matching $M_{greedy}$ while an optimal matching is $M_{opt}$, Competitive ratio = $\textrm{min_all_possible_inputs } I (\left |M_{greedy} \right | / \left |M_{opt} \right |)$ —> greedy’s worst performance over all possible inputs $I$
- Consider a case: $M_{greedy} ≠ M_{opt}$ 當 greedy 所算出來的配對不是最佳時
  - Consider the set $G$ of girls matched in $M_{opt}$ but not in $M_{greedy}$ 女生配到的男生不是最佳的
  - Then every boy $B$ adjacent to girls in $G$ is already matched in $M_{greedy}$
  - If there would exist such non-matched (by $M_{greedy}$) boy adjacent to a non-matched girl then greedy would have matched them
  - Since boys $B$ are already matched in $M_{greedy}$ then $\left |M_{greedy} \right | \geq \left | B \right |$
  - Summary so far: $G$ 表示在 $M$ 中存在 $M_{opt}$，但在 $M_{greedy}$ 中不存在的右節點的集合，$B$ 表示 $G$ 中有節點的鄰接邊的左節點的集合 —> $\left |M_{greedy} \right | \geq \left | B \right |$
  - There are at least $\left | G \right |$ such boys ($\left | G \right | \leq \left | B \right |$) otherwise the optimal algorithm couldn’t have matched all girls in $G$. So $\left | G \right | \leq \left | B \right | \leq \left |M_{greedy} \right |$
  - By definition of $G$ also: $\left |M_{opt} \right | \leq \left |M_{greedy} \right | + \left | G \right |$
  - Worst case is when $\left | G \right |$ = $\left | B \right |$ = |$M_{greedy}$|
  - $\left |M_{opt} \right | \leq 2 \left |M_{greedy} \right |$ then $\left |M_{greedy} \right | / \left |M_{opt} \right | \geq \frac{1}{2}$

Web Advertising

Google’s Adwords System
- Show only a limited number of ads with each query 針對每次搜尋，Google 只會顯示一部分廣告
- Decide which ads to show, as well as the order in which to show them 所以Google必須決定哪些廣告要show以及show的順序
- Users of the Adwords system specified a budget 而每個廣告商都有一定的預算，這就是 Adwords 要解決的問題
- The revenue of a selection of ads is the total value of the ads selected 廣告收入來自於廣告的價值
  - 而廣告的價值來自於投標的商品 (the product of the bid) 以及廣告點擊率
  - on-line algorithm 的價值則是用一個月內（因為廣告商的預算通常以一個月為週期）獲取的總收入來衡量
  - 所以對所有演算法要有統一的衡量標準：competitive ratio = the minimum total revenue for that algorithm, on any sequence of search queries / the revenue of the optimum off-line algorithm 該演算法的最小收入 / off-line algorithm 的最大收入
- 又回到 Google Adwords 要解決的問題
  - A stream of queries arrives at the search engine: $q_{1}, q_{2}, …$ 今天搜尋引擎收到一個 query stream
  - Several advertisers bid on each query 而 stream 裡每個 query 都有多個廣告商競標
  - When query $q_{i}$ arrives, search engine must pick a subset of advertisers whose ads are shown 當 query 被送進來，搜尋引擎必須決定要 show 哪些廣告
  - Goal: Maximize search engine’s revenues 目標當然是為搜尋引擎帶來最大收益
  - Simple solution: Instead of raw bids, use the “expected revenue per click” (i.e., Bid*CTR) 去計算每次的廣告點擊能有多少的預期收入
- Expected revenue per click
  - Budget: 每個廣告商都有預算的限制，搜尋引擎必須保證廣告的投放不會超過他們的預算
  - CTR 是未知的：每個廣告都有不同的可能性會被點到
    - Clickthrough rate (CTR) is measured historically
    - Exploration vs. exploitation 是一個很困難的問題
      - Exploit: Should we keep showing an ad for which we have good estimates of click-through rate 要持續曝光廣告使得它能擁有較好的 CTR
      - Explore: Shall we show a brand new ad to get a better sense of its click-through rate 還是要放全新的廣告，讓這個廣告的CTR比較好預測
- Greedy Algorithm
  - 假設在一個很簡單的環境下：每個 query 都只有一個廣告、每個廣告商的預算相同、每個廣告被點擊的機會一樣、每個廣告的價值也一樣
  - For a query pick any advertiser who has bid 1 for that query
  - Worst case: Two advertisers $A$ and $B$, $A$ bids on query $x$, $B$ bids on $x$ and $y$, same budgets of $4
    - Query stream: $x x x x y y y y$
    - Worst case greedy choice: $B B B B$ _ _ _ _—> $\mathrm{Competitive ratio} = \frac{4}{8} = \frac{1}{2}$
    - Optimal: $A A A A B B B B$
  - Greedy algorithm is deterministic – it always resolves draws in the same way
- 所以有人提出了 BALANCE Algoritm 來處理這些問題

BALANCE Algoritm

概念：For each query, pick the advertiser with the largest unspent budget 對於每次 query，選擇出價最高且剩餘預算最多的廣告商。如果多個廣告商剩餘預相同，則任意挑選一個，並盡可能的平衡所有廣告商的消費。
與 optimal 比較：Two advertisers $A$ and $B$, $A$ bids on query $x$, $B$ bids on $x$ and $y$, same budgets of $4
- Query stream: $x x x x y y y y$
- BALANCE choice: $A B A B B B$ _ _
- Optimal: $A A A A B B B B$
- In general, for BALANCE on 2 advertisers —> $\mathrm{Competitive ratio} = \frac{6}{8} = \frac{3}{4}$
另一個case (w.l.o.g.): 2 advertisers, $A_{1}$ and $A_{2}$, each with budget $B (\geq 1)$
- 最佳解會把兩方的預算都用完，而 BALANCE 則是必須用完至少一方的預算
- 如果用不完，可以分配更多的 queries. Whenever BALANCE makes a mistake (both advertisers bid on the query), advertiser’s unspent budget only decreases. Since optimal exhausts both budgets, one will for sure get exhausted
- Assume BALANCE exhausts $A_{2}$’s budget, but allocates $x$ queries fewer than the optimal
- Revenue: $BAL = 2B - x$
In the general case, worst competitive ratio of BALANCE is $1–1/e = approx. 0.63$
Balance Algorithm with Many Bidders
- There are $N$ advertisers, $A_{1}, A_{2}, … , A_{N}$. 有 $N$ 個廣告商
- Each advertiser has a budget $B = N$ 預算都是 $N$
- There are $N$ queries $q_{1}, q_{2}, … , q_{N}$. 有 $N$ 個queries
- Advertiser $A_{i}$ bids on queries $q_{1}, q_{2}, … , q_{i}$ and no other queries. 廣告商 $A_{i}$ 投錢在 $q_{1}, q_{2}, … , q_{i}$ 上
- The query sequence consists of $N$ rounds. The $i$th round consists of $B$ occurrences of query $q_{i}$ and nothing else. 查詢序列有 $N$ 個 round，第 $i$ 個 round 包含了 $B$ 個 $q_{i}$
- The optimum off-line algorithm assigns the $B$ queries _q__$i$ in the $i$th round to $A_{i}$ for all $i$. Thus, all queries are assigned to a bidder, and the total revenue of the optimum algorithm is $N$_$B$. 最佳的演算法會把第 $i$ 個 round 的 $B$ 個 $q_{i}$ 分配給 $A_{i}$ (for all $i$). 所以所有的 queries 都會被分配到一個廣告商，總收入會是 $N_{B}$
Worst case for BALANCE
- $N$ advertisers: $A_{1}, A_{2}, … , A_{N}$, each with budget $B > N$
- Queries: $N \cdot B$ queries appear in $N$ rounds of ($B$ queries each)
- Bidding
  - Round 1 queries: bidders $A_{1}, A_{2}, … , A_{N}$
  - Round 2 queries: bidders $A_{2}, A_{3}, … , A_{N}$
  - Round $i$ queries: bidders $A_{i}, A_{i+1}, … , A_{N}$
- Optimum allocation: Allocate round $i$ queries to $A_{i}$ ($B$ queries each)
- Optimum revenue $N \cdot B$
BALANCE 分析
- So after the first $k = N(1-1/e)$ rounds, we cannot allocate a query to any advertiser
- $\mathrm{Revenue} = B \cdot N(1-1/e)$ (queries of the first $j$ rounds.)
- $\mathrm{Competitive ratio} = 1-1/e$
Generalized Version
- BALANCE 在 bids 是 0 或 1 的時候表現很好，但是實務上 bids 不可能只是 0 或 1，可能是任意值（預算也是任意的），而此時 BALANCE 就無法適當地表現 (fails to weight the sizes of the bids properly)
- 舉例：
  - 有十個 $q$: 最佳演算法會把十個 $q$ 都給 $A_{2}$，然後得到 100 塊收入
  - 但是 BALANCE 會把十個都給 $A_{1}$ 因為 $A_{1}$ 的預算比較高，但結果是只有 10 塊收入
  - There is no competitive ratio higher than 0 that holds for the Balance Algorithm.

Slides

from CS246 Mining Massive Data Sets, Stanford University
http://web.stanford.edu/class/cs246/slides/advertising.pdf