[Massive Data Analysis] Mining Data Streams
Spring 2016 Massive Data Analysis Lecture Notes
Ch4. Mining Data Streams
Instructor: Jia-Shung Wang
Credit: Jane To
- Definition. Data Stream: data arrives in a stream or streams, and if it is not processed immediately or stored, then it is lost forever.
- We can think of the data as infinite and non-stationary (the distribution changes over time)
- Google queries
- Twitter or Facebook status updates
- Problem: the system cannot store the entire stream accessibly
- Stochastic Gradient Descent (SGD) is an example of a stream algorithm
Types of queries one wants to answer on a data stream:
- Sampling data from a stream: Construct a random sample
- Queries over sliding windows: Number of items of type $x$ in the last $k$ elements of the stream
- Filtering a data stream: Select elements with property $x$ from the stream
- Counting distinct elements: Number of distinct elements in the last $k$ elements of the stream
- Estimating moments: Estimate average and standard deviation of last $k$ elements
- Finding frequent elements
If we want the facility to ask a wide variety of ad-hoc queries, a common approach is to store a sliding window of each stream in the working store.
- Streams often deliver elements very rapidly; we must process elements in real time, or we lose the opportunity to process them at all.
- Sampling from a Data Stream:
- Problem: we cannot store the entire stream, so we store a sample instead. There are two approaches:
- Sampling a fixed proportion - As the stream grows the sample also gets bigger
- Naïve solution: generate a random integer in $[0..9]$ for each query, and store the query if the integer is 0, otherwise discard it (a sketch follows this list)
- Sampling a fixed-size sample - As the stream grows, the sample is of fixed size
- Problem: we cannot know the length of the stream in advance, and memory is limited, so we need to maintain a random sample $S$ of exactly $s$ tuples
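A minimal sketch of the naïve fixed-proportion sampler described above; modeling the query stream as a plain iterable of strings is an illustrative assumption:

```python
import random

def sample_fixed_proportion(stream, out_of=10):
    """Keep each element independently with probability 1/out_of."""
    sample = []
    for query in stream:
        # Generate a random integer in [0..9]; keep the query only
        # when it is 0, so roughly 1/10 of the stream is stored.
        if random.randint(0, out_of - 1) == 0:
            sample.append(query)
    return sample
```

Note that the sample still grows linearly with the stream, which is exactly why the fixed-size variant below is needed.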
Reservoir Sampling
- Explanation: when the population is huge, or when we sample over a data stream, we usually cannot know the total number of instances $N$ in advance. Reservoir Sampling is a family of algorithms that guarantees every instance is sampled with equal probability even when $N$ is unknown
Reference: http://www.sigmainfy.com/blog/reservoir-sampling-learning-notes.html
- The probability $\frac{s}{n}$ means: after you have seen $n$ elements, it is the probability that any given element is in $S$ (see the sketch below)
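A minimal sketch of Reservoir Sampling under the definitions above ($s$ is the fixed sample size, the stream length is unknown in advance):

```python
import random

def reservoir_sample(stream, s):
    """Maintain a uniform random sample S of exactly s elements
    from a stream of unknown length (Reservoir Sampling)."""
    S = []
    for n, element in enumerate(stream, start=1):
        if n <= s:
            S.append(element)        # fill the reservoir first
        else:
            # Keep the new element with probability s/n; if kept,
            # it replaces a uniformly chosen member of the reservoir.
            j = random.randrange(n)  # uniform over [0, n)
            if j < s:
                S[j] = element
    return S
```

A quick induction shows that after $n$ elements, every element seen so far is in $S$ with probability exactly $\frac{s}{n}$, matching the claim above.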
- Sliding Window
DGIM
Reference: http://nathanlvzs.github.io/blog/DGIM-Algorithm-and-Python-Implementation.html
DGIM also extends to a stream of positive integers (a different application from the basic bit-stream case); a sketch of the bit-stream algorithm follows.
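A simplified sketch of DGIM for the basic bit-stream case, following the description in the reference above; each bucket is stored as (timestamp of its rightmost 1, size), and the class interface is an illustrative assumption:

```python
class DGIM:
    """Approximate count of 1s among the last k bits of a bit stream.
    Bucket sizes are powers of two and at most two buckets of each
    size are kept, giving O(log^2 k) space."""

    def __init__(self, k):
        self.k = k
        self.t = 0          # number of bits seen so far
        self.buckets = []   # newest first; sizes are non-decreasing

    def add(self, bit):
        self.t += 1
        # Drop the oldest bucket once its timestamp leaves the window.
        if self.buckets and self.buckets[-1][0] <= self.t - self.k:
            self.buckets.pop()
        if bit != 1:
            return
        self.buckets.insert(0, (self.t, 1))
        # Restore the invariant: when three buckets share a size,
        # merge the two oldest into one of twice the size, keeping
        # the more recent timestamp. Merges can cascade upward.
        i = 0
        while i + 2 < len(self.buckets):
            if self.buckets[i][1] == self.buckets[i + 2][1]:
                ts, size = self.buckets[i + 1]
                self.buckets[i + 1:i + 3] = [(ts, 2 * size)]
            else:
                i += 1

    def count(self):
        """Estimate the 1s in the window: the full size of every
        bucket except the oldest, plus half the size of the oldest."""
        if not self.buckets:
            return 0
        return sum(size for _, size in self.buckets[:-1]) + self.buckets[-1][1] // 2
```

Only the oldest bucket can partially overlap the window, so the estimate is off by at most half that bucket's size, i.e., at most 50% of the true count.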
Bloom Filtering
- Purpose: a way to eliminate most of the tuples that do not meet the criterion
- Components: a bit array of $n$ bits, $k$ hash functions, and $m$ elements to be filtered
- For the algorithm, see http://blog.csdn.net/dannyPolyu/article/details/9319811
- Advantages:
- Space efficiency and query time are far better than those of ordinary algorithms
- No false negatives: if an element is truly not in the set, a Bloom Filter never reports that it is, so nothing is missed
- Disadvantages:
- Some probability of false positives: a Bloom Filter may report that an element is in the set when in fact it is not
- The probability is $(1 - e^{-km/n})^{k}$
- Deletion is difficult
- Summary: a Bloom filter trades a small, tunable false-positive rate for large savings in space and time; a minimal sketch follows
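A minimal Bloom filter sketch matching the components listed above ($n$ bits, $k$ hash functions); deriving the $k$ functions by double hashing two standard digests is an illustrative choice, not part of the notes:

```python
import hashlib

class BloomFilter:
    def __init__(self, n, k):
        self.n = n                 # number of bits in the array
        self.k = k                 # number of hash functions
        self.bits = [False] * n

    def _positions(self, item):
        # Derive k bit positions from two base hashes (an assumed
        # double-hashing scheme, used here for illustration).
        data = item.encode("utf-8")
        h1 = int(hashlib.md5(data).hexdigest(), 16)
        h2 = int(hashlib.sha1(data).hexdigest(), 16)
        return [(h1 + i * h2) % self.n for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = True

    def __contains__(self, item):
        # A True answer may be a false positive; a False answer is
        # always correct (no false negatives).
        return all(self.bits[p] for p in self._positions(item))
```

After inserting $m$ elements, a lookup of an absent element is a false positive with probability roughly $(1 - e^{-km/n})^{k}$, as stated above.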
Flajolet-Martin
- Purpose: counting distinct elements, i.e., the number of distinct elements in the last $k$ elements of the stream
- I think the next few slides make this clear; the point is that the number of distinct elements is estimated as $2^{R}$, where $R$ is the largest number of trailing 0s among the hash values of the elements (see the sketch below)
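A minimal Flajolet-Martin sketch; the single MD5-based hash and the use of one estimator are illustrative assumptions (in practice many hash functions are used and their estimates combined to reduce variance):

```python
import hashlib

def trailing_zeros(x):
    """Number of trailing 0 bits in x (0 if x is 0)."""
    if x == 0:
        return 0
    return (x & -x).bit_length() - 1

def flajolet_martin(stream):
    """Estimate the number of distinct elements as 2^R, where R is
    the maximum number of trailing 0s over all hashed elements."""
    R = 0
    for element in stream:
        h = int(hashlib.md5(str(element).encode()).hexdigest(), 16)
        R = max(R, trailing_zeros(h))
    return 2 ** R
```

The intuition: about $m / 2^{r}$ of $m$ distinct hash values end in at least $r$ zeros, so observing a longest run of $R$ trailing zeros suggests roughly $2^{R}$ distinct elements.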
Computing Moments
Definition: assume the universal set is ordered, so we can speak of the $i$th element for any $i$. Let $m_{i}$ be the number of occurrences of the $i$th element. The $k$th moment of the stream is then $\sum_{i} m_{i}^{k}$.
- 0th moment
- number of distinct elements
- The problem just considered
- 1st moment
- count of all elements = length of the stream
- Easy to compute
- 2nd moment
- = surprise number $S$
- = a measure of how uneven the distribution is (I think of it as something like variance); a sketch computing these moments follows
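Under the definition above (the $k$th moment is $\sum_{i} m_{i}^{k}$), a small sketch that computes the first three moments; the example stream is chosen to be consistent with the counts used in the AMS example below:

```python
from collections import Counter

def moment(stream, k):
    """k-th moment: sum of m_i**k over the distinct elements i,
    where m_i is the number of occurrences of element i."""
    counts = Counter(stream)
    return sum(m ** k for m in counts.values())

stream = list("abcbdacdabdcaab")  # a:5, b:4, c:3, d:3
print(moment(stream, 0))  # 4  -> number of distinct elements
print(moment(stream, 1))  # 15 -> length of the stream
print(moment(stream, 2))  # 59 -> surprise number S
```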
AMS
- Purpose: estimating moments (e.g., the standard deviation of the last $k$ elements); the example below estimates the 2nd moment, the surprise number
- Example stream: a appears 5 times, b 4 times, c 3 times, d 3 times, so $n = 15$
- $X_{1}$ (3rd): c, $X_{2}$ (8th): d, $X_{3}$ (13th): a
- $X_{1}.value = 3$ (because c, counting this occurrence, appears three times from here to the end of the stream); each variable $X$ yields the estimate $n(2 \cdot X.value - 1)$, as shown in the sketch below
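A sketch of the AMS estimate using the example above: each variable contributes $n(2 \cdot X.value - 1)$ and the contributions are averaged. The stream literal is reconstructed to match the stated counts and sampled positions:

```python
def ams_second_moment(stream, positions):
    """AMS estimate of the 2nd moment. For a sampled position p,
    X.element = stream[p] and X.value = occurrences of X.element
    at or after p; each X contributes n * (2 * X.value - 1)."""
    n = len(stream)
    estimates = []
    for p in positions:
        element = stream[p]
        value = stream[p:].count(element)  # includes position p itself
        estimates.append(n * (2 * value - 1))
    return sum(estimates) / len(estimates)

stream = list("abcbdacdabdcaab")  # a:5, b:4, c:3, d:3, n = 15
# X1 at the 3rd element (c), X2 at the 8th (d), X3 at the 13th (a)
print(ams_second_moment(stream, [2, 7, 12]))
# X.values are 3, 2, 2 -> estimates 75, 45, 45 -> average 55,
# close to the true surprise number S = 59.
```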
Slides
- from CS246 Mining Massive Data Sets, Stanford University
- Part 1: http://web.stanford.edu/class/cs246/slides/streams-1.pdf
- Part 2: http://web.stanford.edu/class/cs246/slides/streams-2.pdf