Spring 2016 Massive Data Analysis Lecture Notes
Ch4. Mining Data Streams
Instructor: Jia-Shung Wang
Credit: Jane To



  • Definition of a Data Stream: data arrives in a stream or streams, and if it is not processed immediately or stored, it is lost forever.
  • We can think of the data as infinite and non-stationary (the distribution changes over time)
    • Google queries
    • Twitter or Facebook status updates
  • Problem: the system cannot store the entire stream accessibly
  • Stochastic Gradient Descent (SGD) is an example of a stream algorithm
  • Types of queries one wants to answer on a data stream:

    1. Sampling data from a stream: Construct a random sample
    2. Queries over sliding windows: Number of items of type $x$ in the last $k$ elements of the stream
    3. Filtering a data stream: Select elements with property $x$ from the stream
    4. Counting distinct elements: Number of distinct elements in the last $k$ elements of the stream
    5. Estimating moments: Estimate average and standard deviation of last $k$ elements
    6. Finding frequent elements
  • If we want the facility to ask a wide variety of ad-hoc queries, a common approach is to store a sliding window of each stream in the working store.

  • Streams often deliver elements very rapidly; we must process elements in real time, or we lose the opportunity to process them at all.
  • Sampling from a Data Stream:
    • Problem: we cannot store the entire stream, so we store a sample instead. There are two approaches:
      1. Sampling a fixed proportion - As the stream grows the sample also gets bigger
        • Naïve solution: Generate a random integer in $[0..9]$ for each query, and store the query if the integer is 0, otherwise discard
      2. Sampling a fixed-size sample - As the stream grows, the sample is of fixed size
        • Problem: we cannot know the length of the stream in advance, and because of memory constraints we must maintain a random sample $S$ of exactly $s$ tuples
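The first (fixed-proportion) approach can be sketched in a few lines. This is a minimal illustration, not from the notes (the function name is mine): each element is kept independently with probability 1/10 by drawing a random digit, so the sample grows linearly with the stream.

```python
import random

def sample_fixed_proportion(stream):
    """Naive fixed-proportion sampling: keep each element
    independently with probability 1/10, so the sample grows
    in proportion to the stream length."""
    sample = []
    for element in stream:
        if random.randrange(10) == 0:  # random digit in [0..9]; keep on 0
            sample.append(element)
    return sample
```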

Reservoir Sampling

  • Explanation: when the population is very large, or when we sample from a data stream, we often cannot know in advance the total number of instances $N$. Reservoir Sampling is a family of algorithms that guarantees every instance is sampled with equal probability even when $N$ is unknown.
  • 參考:http://www.sigmainfy.com/blog/reservoir-sampling-learning-notes.html

    • The probability $\frac{s}{n}$ means: after you have seen $n$ elements, it is the probability that any given element is in $S$
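A minimal sketch of the standard reservoir algorithm behind this $s/n$ guarantee (the function name is illustrative): keep the first $s$ elements; thereafter, accept the $n$-th element with probability $s/n$ and let it evict a uniformly chosen resident.

```python
import random

def reservoir_sample(stream, s):
    """Maintain a uniform random sample of exactly s elements.

    After n elements have been seen, each element of the stream
    is in the reservoir with probability s/n, even though the
    total length N is unknown in advance."""
    reservoir = []
    for n, element in enumerate(stream, start=1):
        if n <= s:
            reservoir.append(element)      # fill the reservoir first
        else:
            j = random.randrange(n)        # uniform over [0, n)
            if j < s:                      # accept with probability s/n
                reservoir[j] = element     # evict a uniform resident
    return reservoir
```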

  • Sliding Window

DGIM

Bloom Filtering

  • Purpose: to eliminate most of the tuples that do not meet the selection criterion
  • Components: a bit array of $n$ bits, $k$ hash functions, and $m$ elements to be filtered
  • For the algorithm, see http://blog.csdn.net/dannyPolyu/article/details/9319811
  • Advantages:
    • Space efficiency and query time far exceed those of ordinary algorithms
    • No false negatives: if an element really is in the set, the Bloom filter always reports it as present, so it never misses an element
  • Disadvantages:
    • There is some probability of false positives: the Bloom filter may report that an element is in the set when in fact it is not
      • The probability is $(1 - e^{-km/n})^{k}$
    • Deletion is difficult
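A minimal Bloom filter along these lines ($n$ bits, $k$ hash functions). The class name and the use of SHA-256 to derive the $k$ hashes are my choices for illustration, not part of the notes; real implementations use much faster hash functions.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: an n-bit array and k hash
    functions derived by salting SHA-256 with the hash index."""

    def __init__(self, n_bits, k_hashes):
        self.n = n_bits
        self.k = k_hashes
        self.bits = [False] * n_bits

    def _positions(self, item):
        # k bit positions for this item, one per salted hash
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.n

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = True

    def __contains__(self, item):
        # No false negatives: every added item answers True.
        # False positives occur with probability ~ (1 - e^{-km/n})^k.
        return all(self.bits[p] for p in self._positions(item))
```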

Flajolet-Martin

  • Purpose: counting distinct elements - the number of distinct elements in the last $k$ elements of the stream
  • The next few slides should make this clear; the point is that the number of distinct elements is estimated as $2^{R}$, where $R$ is the longest tail of trailing 0 bits seen in any hashed element
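A single-hash sketch of the idea (an assumed minimal form; practical Flajolet-Martin combines many hash functions and averages/medians their estimates): hash each element, track the longest tail of trailing zero bits $R$, and report $2^{R}$.

```python
import hashlib

def trailing_zeros(x):
    """Tail length: number of trailing 0 bits of x (0 maps to 0
    here for simplicity; a hash of exactly 0 is vanishingly rare)."""
    if x == 0:
        return 0
    return (x & -x).bit_length() - 1

def flajolet_martin(stream):
    """Estimate the number of distinct elements as 2^R, where R is
    the longest tail of zeros among the 32-bit hashed elements."""
    R = 0
    for element in stream:
        h = int.from_bytes(
            hashlib.sha256(str(element).encode()).digest()[:4], "big")
        R = max(R, trailing_zeros(h))
    return 2 ** R
```

Note that duplicates cannot change the estimate: repeating an element repeats its hash, so $R$ depends only on the set of distinct elements.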


  • Computing Moments

    • 定義:Assume the universal set is ordered so we can speak of the $i$th element for any $i$. Let $m_{i}$ be the number of occurrences of the $i$th element for any $i$.

    • 0th moment

      • number of distinct elements
      • The problem just considered
    • 1st moment
      • sum of the counts of all elements = length of the stream
      • Easy to compute
    • 2nd moment
      • = surprise number $S = \sum_{i} m_{i}^{2}$
      • = a measure of how uneven the distribution is (I think of it as analogous to variance)
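When the whole stream fits in memory, all three moments are easy to compute exactly; the point of AMS below is to approximate the 2nd moment without storing the counts. A small exact reference implementation (function name is mine):

```python
from collections import Counter

def moments(stream):
    """Exact 0th, 1st and 2nd moments of a stream:
    (number of distinct elements, stream length,
     surprise number sum of m_i^2)."""
    counts = Counter(stream)
    m0 = len(counts)                            # distinct elements
    m1 = sum(counts.values())                   # stream length
    m2 = sum(m * m for m in counts.values())    # surprise number
    return m0, m1, m2
```

For a stream with counts a: 5, b: 4, c: 3, d: 3 this returns (4, 15, 59), since $5^2 + 4^2 + 3^2 + 3^2 = 59$.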

AMS

  • Purpose: estimating moments - AMS estimates the 2nd moment (surprise number) of the stream, from which the standard deviation follows

    • a: 5 times, b: 4 times, c: 3 times, d: 3 times
    • $X_{1}$ (3rd position): c, $X_{2}$ (8th): d, $X_{3}$ (13th): a
    • $X_{1}.value = 3$ (because c occurs 3 times from this position to the end of the stream, counting this occurrence)
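Each variable contributes the estimate $n(2 \cdot X.value - 1)$, and the contributions are averaged. A sketch using one 15-element stream consistent with the counts above (the exact stream is my reconstruction; positions 3, 8, 13 hold c, d, a as stated):

```python
def ams_second_moment(stream, positions):
    """AMS estimate of the 2nd moment: each variable, started at a
    1-based position, counts its element from there to the end and
    contributes n * (2 * value - 1); the estimates are averaged."""
    stream = list(stream)
    n = len(stream)
    estimates = []
    for pos in positions:
        element = stream[pos - 1]
        value = stream[pos - 1:].count(element)  # occurrences from pos on
        estimates.append(n * (2 * value - 1))
    return sum(estimates) / len(estimates)

# a b c b d a c d a b d c a a b  ->  a: 5, b: 4, c: 3, d: 3
estimate = ams_second_moment("abcbdacdabdcaab", [3, 8, 13])
```

Here the estimate is 55.0, close to the true surprise number $5^2 + 4^2 + 3^2 + 3^2 = 59$.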

Slides