[NLP Lab] Extracting Collocations from ngram

Spring 2015 Natural Language Processing Lab.
Week 3. and 4. Extracting Collocations from ngram
Instructor: Jason S. Chang

本週要根據 n-gram 來擷取 collocations。

那麼，什麼是 collocations 呢？

Collocations and Collocation Extraction

collocation 是搭配詞。有些字天生就是和另外一些字擺在一起的，比如在中文裡我們會說「一支筆」，不會說「一條筆」；我們會說「講電話」，不會說「談電話」。

而英文當然也是這樣。比如說「開燈」的英文，我們會用 turn on / switch on the light，而不是 open the light。這對於非母語使用者來說，並不是一件天生就會的事，所以才需要大量閱讀，養成習慣。而 collocation extraction 就是電腦自動將這些搭配詞給擷取出來。這是講義裡的定義：

• Collocation = a pair / sequence of words co-occurring more often than chance, and conveying more meaning than the surface words, e.g., nuclear family, plastic surgery, riding boots, motor cyclist
• Collocation Extraction = Identifying collocations automatically from a corpus using a computer

為什麼 collocation 這麼重要呢？因為用對搭配詞才能將事情表得順暢、流利，並且傳達正確的意思。

Properties of Collocations

Syntactic Relations
語法上的搭配詞，比如說形容詞後面通常會接名詞，又分為 Lexical collocations 與 Grammatical collocations。

Statistical Associativity
根據 mutual information 來統計的相關性。比如某個形容詞 $A$ 平常少用，但他只要一出現，都是跟著某個特定的名詞 $N$ 一起出現，這樣 $A$ 和 $N$ 統計上的相關性就會很高，也可以代表他們是很適合彼此的搭配詞。
Syntactic Relation by Distance Analysis
假設我們今天想要知道 play 和 role 是不是合適的搭配詞，我們就去計算 play 和 role 在資料集中一起出現的次數，而我們允許這兩個字中間有其他字（但不超過特定距離）。也就是如果今天，距離 $5$ 以內的搭配詞都算數，那不管 play a role, play an important role 或是 play role 我們都算。
```
distance and frequencies of play_role (in Google Web 1T 5gram)
• -4(81230) -3(161358) -2(920270) -1(255149)
• 4(325548) 3(3452577) 2(1428845) 1(27584)
```
從這個統計中我們可以發現，最大值落在 play 和 role 距離 $3$ 的地方。這也就暗示了，play 和 role 中間可能有著 V. + Det. + Adj. + N. 的關係。

Smadja’s Algorithm

從前面的介紹可以發現：

• Collocations = word + collocate
• Collocations are relatively high frequency
• Collocations has skew distance distribution
• Peaks at some distance imply grammatical construction, e.g., AN, VN

於是有位學者就提出了一套擷取搭配詞的演算法。

他先去計算字 (base word) 和搭配詞 (collocate) 之間的距離 (distance)，接著他列出了三個條件，只要符合這三個條件的搭配詞組，就是這個 base word 合適的搭配詞。這三個條件分別是：

(C1) Count(base_word, collocate) > f(average) + 1σ (standard deviation)
(C2) Count(base_word, collocate, distance) spread out non-uniformly (has 1-2 peaks)
(C3) Some distances where Count(base_word, collocate, distance) peaks

更數學一點的表示方法如下：

舉個例子：

也就是說不論是平均出現頻率、平均平方差等統計上的數據，都必須大於特定數字，才能成為合適的搭配詞組。

Implementation

Generate ngrams for a given corpus $(n = 2-5)$

首先，把資料集的所有句子都讀入，並產生 n-gram，同時計算每個 n-gram 出現過幾次。

ngram_counts = defaultdict(Counter)
with open('sents.txt.50000') as text_file:
    for index,line in enumerate(text_file):
        words = wordpunct_tokenize(line)
        for n in range(2, max_distance + 2):
            ngram_counts[n].update(filter(ngram_is_valid, to_ngrams(words, n)))

wordpunct_tokenize() 是用來切字的。而這裡的 $max_distance = 5$，代表搭配詞與 base word 的最大距離。to_grams() 則是用 izip 將每個單字合起來成 n-gram，詳細程式碼如下：

1 2	def to_ngrams(unigrams, length): return izip(*[unigrams[i:] for i in range(length)])

至於 ngram_is_valid() 是去檢查這個 n-gram 裡面有沒有含數字或是符號，或者頭尾是 stopwords 的，都把他們過濾掉。

eng_stopwords = set(stopwords.words('english'))
eng_symbols = '{}&amp;quot;\'()[].,:;+!?-*/&amp;amp;|&amp;lt;&amp;gt;=~$'
def ngram_is_valid(ngram):
    first, last = ngram[0], ngram[-1]
    if first in eng_stopwords or last in eng_stopwords: return False
    if any( num in first or num in last for num in '0123456789'): return False
    if any( eng_symbol in word for word in ngram for eng_symbol in eng_symbols): return False
    return True

整個建構完之後，ngram_counts 會長這樣（範例只有三個句子）：

key 是 ngram 長度(2-6)，value 則是一個 Counter，儲存 ngram 及他們出現的次數。比如長度為 $3$ 的 ‘plat the role’ 就出現 2 次。

Generate skip bigrams from ngrams $(-5 \leq d \leq 5)$
skip bigrams 的概念是，任兩個字中間有其他字，只要那兩個字的距離不大於 $5$，都算是 bigram。
這部分針對每個在 ngram_counts[n]（這是一個 Counter）key 裡的 ngram，不管 ngram 的長度為何，都去計算它的 skip bigram。

skip_bigram_info = defaultdict(partial(defaultdict, Counter))
for n in range(2, max_distance + 2):
    for ngram in ngram_counts[n].keys():
        skip_bigram_info[ngram[0]][ngram[-1]][n-1] += ngram_counts[n][ngram]
        skip_bigram_info[ngram[-1]][ngram[0]][1-n] += ngram_counts[n][ngram]
print len(skip_bigram_info['play'])

至於 skip_bigram_info[ngram[0]][ngram[-1]][n-1] += ngram_counts[n][ngram] 是什麼？（雖然我覺得就算解釋了我下次回來看還是會想不起來…）

skip_bigram_info 是一個 defaultdict，它的 key 是 base word，而 value 就有點複雜，是一個長這樣的東西：(defaultdict, Counter)。這是什麼呢？簡單來講就是一個 dict，而這個 dict 的 value 是一個 Counter 結構。而這個 Counter 又有自己的 key 和 value，分別是距離和出現頻率。來個圖解：

play 是 skip_bigram_info 的 key，後面接的那一串則是 value。而這個 value 本身是一個 defaultdict，key 是 collocate，value 是一個 Counter。這個 Counter 又有 key 和 value 分別是 word 和 collocate 的距離，還有出現的頻率。好，繞口令結束。（我真的不期待一個禮拜後的我能看得懂囧）

好，現在我們回到 skip_bigram_info[ngram[0]][ngram[-1]][n-1]，假設 ngram 叫做 “play the role”，那麼 ngram[0] = ‘play’，ngram[-1] = ‘role’，$n-1$ 則是 play 和 role 之間的距離。好，到這裡會發現東西都差不多有了，只剩下頻率了。也就是 Counter 裡面的那個 value。這東西怎麼算呢？當然就要用到剛剛算好的 ngram_counts 了，所以 ngram_counts[n][ngram] 出現了！

他來做什麼？記不記得剛剛 ngram_counts 長什麼樣子？ ngram_counts[n][ngram] 是一個數值，也就是頻率，是 ngram 出現的頻率，而 n 在這邊是 n-gram 的長度。所以假設 $n=3$, $ngram$ = ‘play the role’，ngram_counts[3]['play the role'] 就是 $2$，代表 ‘play the role’ 出現過 2 次。

那這個數值又跟 skip_bigram_info['play']['role'][n-1] 有什麼關係呢？假設 play 和 role 的距離為 $2$，也就是 $n=3$，ngram_counts[3]['play the role'] 就會傳回 $2$。意思就是 play 和 role 距離為 $2$ 的 n-gram 出現 2 次。

那 play 和 role 有沒有可能距離為 $3$ 或 $4$？當然有。所以 skip_bigram_info['play']['role'] 這個 Counter 裡就會有各種距離(key)，從 $-5$ ~ $5$，以及他們出現的頻率。（註：距離為負值，代表 collocate 出現在 base word 的前面。）

沒想到短短六行程式碼，我需要花這的多篇幅解釋，真心覺得我能寫完這段真的很了不起。（呼～）

Generate average frequency and standard deviation of a word (e.g., play) with all
its collocate

接下來，要計算平均頻率與其他各種統計資訊了。

skip_bigram_abc = defaultdict(lambda:defaultdict(lambda:0.0))
for baseWord in skip_bigram_info.keys():
    avg_f = 0
    dev = 0
    for colWord in skip_bigram_info[baseWord].keys():
        skip_bigram_abc[(baseWord, colWord, 'freq')] = sum(skip_bigram_info[baseWord][colWord].values())
        mean = skip_bigram_abc[(baseWord, colWord, 'freq')]/10
        spread = 0
        for i in range(-5,0) + range(1,6):
            spread += ((skip_bigram_info[baseWord][colWord][i] - mean)**2)
        skip_bigram_abc[(baseWord, colWord, 'spread')] = spread/10
        avg_f += skip_bigram_abc[(baseWord, colWord, 'freq')]
    skip_bigram_abc[(baseWord, 'avg_freq')] = avg_f/len(skip_bigram_info[baseWord])
    for colWord in skip_bigram_info[baseWord].keys():
        dev += (skip_bigram_abc[(baseWord, colWord, 'freq')] - skip_bigram_abc[(baseWord,'avg_freq')])**2
    skip_bigram_abc[(baseWord, 'dev')] = math.sqrt(float(dev) / float(len(skip_bigram_info[baseWord])))

這裡用 skip_bigram_abc 來儲存相關的統計資訊。

(baseWord, colWord, 'freq') 是出現的總次數，計算的方法是把 skip_bigram_info[baseWord][colWord] 的所有 value 加總
mean 是除以 $U_{0}$ 之後的結果 ($U_{0} = 10$)
(baseWord, colWord, 'spread') 是 base 和 col 的距離從 $-5$ ~ $5$ 的均方差的總和再除以 $U_{0}$
(baseWord, 'avg_freq') 是平均頻率，也就是頻率總和除以 baseword 總數
(baseWord, 'dev') 則是標準差
都統計完之後，就來看看 base word 和 collocate 之間的距離分布吧。這邊用 pandas 來畫出 play 和 role 的距離分布圖。

import pandas
%matplotlib inline
play_role_distances_count = pandas.Series(skip_bigram_info['play']['role'].values(), index= skip_bigram_info['play']['role'].keys()).sort_index()
play_role_distances_count.plot(kind='bar')

畫出來的結果長這樣：

Check $C_{1}$, $C_{2}$ and $C_{3}$

以上都統計完之後，就要進入最後階段啦！！

首先，用條件 $C_{1}$ 來過濾掉 weak collocates。

if skip_bigram_abc[(baseWord, 'dev')] == 0:
    strength = 0
else:
    strength = (skip_bigram_abc[(baseWord, colWord, 'freq')] - skip_bigram_abc[(baseWord,'avg_freq')]) / float(skip_bigram_abc[(baseWord, 'dev')])

接著，對每個 skip bigram，利用條件 $C_{2}$ 來看看分佈情形、條件 $C_{3}$ 來找峰值。($k_{0} = 1, U_{0} = 10$)

# C2
spread = skip_bigram_abc[(baseWord, colWord, 'spread')]
# C3
for i in range(-5,0) + range(1,6):
    peak = skip_bigram_abc[(baseWord, colWord, 'freq')]/10 + (k1 * math.sqrt(skip_bigram_abc[(baseWord, colWord, 'spread')]))
    p = skip_bigram_info[baseWord][colWord][i]
    if strength &amp;amp;amp;gt;= k0 and spread &amp;amp;amp;gt;= U0 and p &amp;amp;amp;gt;= peak:
        collocations.append([baseWord, colWord, i, strength, spread, peak, p]) # i = distance

最後，將這些沒有被淘汰掉的 collocations 輸出，就大功告成了！！！
不過在這之前，pandas 也想再出來秀一下，這次登場的是 Dataframe。這邊是要列出所有 collocation Dataframe。

# cc = [('base word', 'collocate', 'distance', 'strength', 'spread', 'peak', 'p').......]
import pandas
collocations = smadjas_skip_bigram()
print collocations[0]
collocations_df = pandas.DataFrame(collocations,
                                   columns = ['base word', 'collocate', 'distance', 'strength', 'spread', 'peak', 'p'])
collocations_df = collocations_df.set_index(['base word', 'collocate', 'distance']).sort_index()

如此一來，就可以很清楚的看到各個 base word 跟 collocate 的距離、strength、spread、peak 等等的關係。以下是以 work 為 base word，與各個 collocate 的資料呈現。

怎麼比前幾次完成都還要激動…應該的吧，看我這次這麼認真寫了這麼多，下次這樣寫不知道是什麼時候了哈哈哈。

對了，這次有用到一個玩具，叫做 ipython notebook，可以用來呈現一些關於這個 lab 的視覺化資料。先放個參考連結。

Slides and Code

NLP Lab. 2017 Week 04 github repo