Spring 2016 Massive Data Analysis Lecture Notes
Ch4. Mining Data Streams
Instructor: Jia-Shung Wang
Credit: Jane To



  • Definition of a Data Stream: data arrives in a stream or streams, and if it is not processed immediately or stored, it is lost forever.
  • We can think of the data as infinite and non-stationary (the distribution changes over time)
    • Google queries
    • Twitter or Facebook status updates
  • Problem: the system cannot store the entire stream accessibly
  • Stochastic Gradient Descent (SGD) is an example of a stream algorithm
  • Types of queries one wants to answer on a data stream:

    1. Sampling data from a stream: Construct a random sample
    2. Queries over sliding windows: Number of items of type $x$ in the last $k$ elements of the stream
    3. Filtering a data stream: Select elements with property $x$ from the stream
    4. Counting distinct elements: Number of distinct elements in the last $k$ elements of the stream
    5. Estimating moments: Estimate average and standard deviation of last $k$ elements
    6. Finding frequent elements
  • If we want the facility to ask a wide variety of ad-hoc queries, a common approach is to store a sliding window of each stream in the working store.

  • Streams often deliver elements very rapidly; we must process elements in real time, or we lose the opportunity to process them at all.
  • Sampling from a Data Stream:
    • Problem: we cannot store the entire stream, so we store a sample instead. There are two approaches:
      1. Sampling a fixed proportion - As the stream grows the sample also gets bigger
        • Naïve solution: Generate a random integer in $[0..9]$ for each query, and store the query if the integer is 0, otherwise discard
      2. Sampling a fixed-size sample - As the stream grows, the sample is of fixed size
        • Problem: we cannot know the length of the stream in advance, and because of memory constraints we must maintain a random sample $S$ of exactly $s$ tuples
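The first (fixed-proportion) approach can be sketched in a few lines. This is a minimal illustration, not from the notes (the function name is mine): each element is kept independently with probability 1/10 by drawing a random digit, so the sample grows linearly with the stream.

```python
import random

def sample_fixed_proportion(stream):
    """Naive fixed-proportion sampling: keep each element
    independently with probability 1/10, so the sample grows
    in proportion to the stream length."""
    sample = []
    for element in stream:
        if random.randrange(10) == 0:  # random digit in [0..9]; keep on 0
            sample.append(element)
    return sample
```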

Reservoir Sampling

  • Explanation: when the population is very large, or when we sample from a data stream, we often cannot know in advance the total number of instances $N$. Reservoir Sampling is a family of algorithms that guarantees every instance is sampled with equal probability even when $N$ is unknown.
  • 參考:http://www.sigmainfy.com/blog/reservoir-sampling-learning-notes.html

    • The probability $\frac{s}{n}$ means: after you have seen $n$ elements, it is the probability that any given element is in $S$
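A minimal sketch of the standard reservoir algorithm behind this $s/n$ guarantee (the function name is illustrative): keep the first $s$ elements; thereafter, accept the $n$-th element with probability $s/n$ and let it evict a uniformly chosen resident.

```python
import random

def reservoir_sample(stream, s):
    """Maintain a uniform random sample of exactly s elements.

    After n elements have been seen, each element of the stream
    is in the reservoir with probability s/n, even though the
    total length N is unknown in advance."""
    reservoir = []
    for n, element in enumerate(stream, start=1):
        if n <= s:
            reservoir.append(element)      # fill the reservoir first
        else:
            j = random.randrange(n)        # uniform over [0, n)
            if j < s:                      # accept with probability s/n
                reservoir[j] = element     # evict a uniform resident
    return reservoir
```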

  • Sliding Window

DGIM

Bloom Filtering

  • Purpose: to eliminate most of the tuples that do not meet the selection criterion
  • Components: a bit array of $n$ bits, $k$ hash functions, and $m$ elements to be filtered
  • For the algorithm, see http://blog.csdn.net/dannyPolyu/article/details/9319811
  • Advantages:
    • Space efficiency and query time far exceed those of ordinary algorithms
    • No false negatives: if an element really is in the set, the Bloom filter always reports it as present, so it never misses an element
  • Disadvantages:
    • There is some probability of false positives: the Bloom filter may report that an element is in the set when in fact it is not
      • The probability is $(1 - e^{-km/n})^{k}$
    • Deletion is difficult
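A minimal Bloom filter along these lines ($n$ bits, $k$ hash functions). The class name and the use of SHA-256 to derive the $k$ hashes are my choices for illustration, not part of the notes; real implementations use much faster hash functions.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: an n-bit array and k hash
    functions derived by salting SHA-256 with the hash index."""

    def __init__(self, n_bits, k_hashes):
        self.n = n_bits
        self.k = k_hashes
        self.bits = [False] * n_bits

    def _positions(self, item):
        # k bit positions for this item, one per salted hash
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.n

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = True

    def __contains__(self, item):
        # No false negatives: every added item answers True.
        # False positives occur with probability ~ (1 - e^{-km/n})^k.
        return all(self.bits[p] for p in self._positions(item))
```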

Flajolet-Martin

  • Purpose: counting distinct elements - the number of distinct elements in the last $k$ elements of the stream
  • The next few slides should make this clear; the point is that the number of distinct elements is estimated as $2^{R}$, where $R$ is the longest tail of trailing 0 bits seen in any hashed element
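A single-hash sketch of the idea (an assumed minimal form; practical Flajolet-Martin combines many hash functions and averages/medians their estimates): hash each element, track the longest tail of trailing zero bits $R$, and report $2^{R}$.

```python
import hashlib

def trailing_zeros(x):
    """Tail length: number of trailing 0 bits of x (0 maps to 0
    here for simplicity; a hash of exactly 0 is vanishingly rare)."""
    if x == 0:
        return 0
    return (x & -x).bit_length() - 1

def flajolet_martin(stream):
    """Estimate the number of distinct elements as 2^R, where R is
    the longest tail of zeros among the 32-bit hashed elements."""
    R = 0
    for element in stream:
        h = int.from_bytes(
            hashlib.sha256(str(element).encode()).digest()[:4], "big")
        R = max(R, trailing_zeros(h))
    return 2 ** R
```

Note that duplicates cannot change the estimate: repeating an element repeats its hash, so $R$ depends only on the set of distinct elements.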


  • Computing Moments

    • 定義:Assume the universal set is ordered so we can speak of the $i$th element for any $i$. Let $m_{i}$ be the number of occurrences of the $i$th element for any $i$.

    • 0th moment

      • number of distinct elements
      • The problem just considered
    • 1st moment
      • sum of the counts of all elements = length of the stream
      • Easy to compute
    • 2nd moment
      • = surprise number $S = \sum_{i} m_{i}^{2}$
      • = a measure of how uneven the distribution is (I think of it as analogous to variance)
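When the whole stream fits in memory, all three moments are easy to compute exactly; the point of AMS below is to approximate the 2nd moment without storing the counts. A small exact reference implementation (function name is mine):

```python
from collections import Counter

def moments(stream):
    """Exact 0th, 1st and 2nd moments of a stream:
    (number of distinct elements, stream length,
     surprise number sum of m_i^2)."""
    counts = Counter(stream)
    m0 = len(counts)                            # distinct elements
    m1 = sum(counts.values())                   # stream length
    m2 = sum(m * m for m in counts.values())    # surprise number
    return m0, m1, m2
```

For a stream with counts a: 5, b: 4, c: 3, d: 3 this returns (4, 15, 59), since $5^2 + 4^2 + 3^2 + 3^2 = 59$.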

AMS

  • Purpose: estimating moments - AMS estimates the 2nd moment (surprise number) of the stream, from which the standard deviation follows

    • a: 5 times, b: 4 times, c: 3 times, d: 3 times
    • $X_{1}$ (3rd position): c, $X_{2}$ (8th): d, $X_{3}$ (13th): a
    • $X_{1}.value = 3$ (because c occurs 3 times from this position to the end of the stream, counting this occurrence)
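Each variable contributes the estimate $n(2 \cdot X.value - 1)$, and the contributions are averaged. A sketch using one 15-element stream consistent with the counts above (the exact stream is my reconstruction; positions 3, 8, 13 hold c, d, a as stated):

```python
def ams_second_moment(stream, positions):
    """AMS estimate of the 2nd moment: each variable, started at a
    1-based position, counts its element from there to the end and
    contributes n * (2 * value - 1); the estimates are averaged."""
    stream = list(stream)
    n = len(stream)
    estimates = []
    for pos in positions:
        element = stream[pos - 1]
        value = stream[pos - 1:].count(element)  # occurrences from pos on
        estimates.append(n * (2 * value - 1))
    return sum(estimates) / len(estimates)

# a b c b d a c d a b d c a a b  ->  a: 5, b: 4, c: 3, d: 3
estimate = ams_second_moment("abcbdacdabdcaab", [3, 8, 13])
```

Here the estimate is 55.0, close to the true surprise number $5^2 + 4^2 + 3^2 + 3^2 = 59$.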

Slides