[NLP Lab] Selecting Good Dictionary Examples, GDEX

Spring 2015 Natural Language Processing Lab.
Week 5. Selecting Good Dictionary Examples, GDEX
Instructor: Jason S. Chang

來到了第五個 Lab 了。這次要來找例句！

Introduction

這次找例句用的是 GDEX 演算法，它根據句子的長度、易讀性(readability)和 collocations 來找出好的例句。以下是 GDEX 找出來有關 “difficulty” 的例句：

• difficulty of STH
The level of difficulty of the **puzzles** can be selected to suit the audience.

• v.  difficulty
Major efforts have been made to **overcome** difficulties with the numerous issues of the JTMP software.

• adj. difficulty
His last years were also clouded by **financial** difficulties, to which the last building phase at Florence Court must have materially contributed.

• difficulty DOING
The large ground floor bedroom is easily accessible to guests in wheelchairs or those that have some difficulty **walking**.

可以發現，difficulty 的主要用法，都呈現在這些例句裡面了，而不是給你一堆例句，但是用法頻頻重複，那有跟沒有一樣。那一個好的例句又需要哪些條件呢？根據 GDEX 演算法，有以下的特徵：

簡單來說，就是句子要有頭有尾、不能太長也不能太短、用字要簡單易懂，當然包含 collocation 也是很重要的。

這次 Lab 的輸入有三個部分：句子、collocations 和代名詞；輸出則是分數最高的例句和它所包含的 collocations。而這次程式的架構則是使用 MapReduce，雖然資料沒有那麼大，不過還是可以練習一下這個架構來過過癮。

Mapper

這部分的演算法是這樣的：


Read a sentence Sent in SENTS
For each Sent[i], Sent[i+d], where |d| in (1,5)
    If isColl(Sent[i], Sent[i+d], d) and |Sent| in (10, 25)
        Output Sent[i] Sent[i+d]  Sent

那就來照做吧！先把句子讀進來，這裡直接用 fileinput 讀句子，然後斷詞。

def tokens(str1): return re.findall('[a-z]+', str1.lower())
for line in fileinput.input():
    words = tokens(line)

接著我先做了長度的檢查，然後產生 n-gram。

1
2
3

if len(words)-1 &gt;= 10 and len(words)-1 <= 25:
    for n in xrange(2, max_distance + 2):
        ngrams = filter(ngram_is_valid, to_ngrams(words, n))

to_ngrams(words, n) 是用來把一堆字轉成 n-grams 的。而 ngram_is_valid() 則是用來檢查這個 n-gram 是否合格，開頭和結尾不能是 stopwords、也不能是數字或符號，不然就沒有意義了。

def ngram_is_valid(ngram):
    first, last = ngram[0], ngram[-1]
    if first in eng_stopwords or last in eng_stopwords: return False
    if any( num in first or num in last for num in '0123456789'): return False
    if any( eng_symbol in word for word in ngram for eng_symbol in eng_symbols): return False
    return True
def to_ngrams(unigrams, length):
    return izip(*[unigrams[i:] for i in range(length)])

最後檢查的是 collocations，利用一個給定的 collocation list 來查詢句子裡是否有 collocations。如果有才印出，沒有就不理它。這裡 n 表示 n-gram 的長度，一個個 n-gram 去檢查有沒有 collocation。

for ngram in ngrams:
    if (ngram[0], ngram[-1], str(n-1)) in colls:
        try:
            print '{}\t{}\n'.format(ngram[0] + '_' + ngram[-1], line.strip().split('\t')[-1].encode('ascii'))
        except: pass

mapper 到這裡就完成了，而它產生的是一對對合法的 collocation 與句子的 pair。

Reducer

因為在 Mapper 完成，傳到 Reducer 的過程中可以透過指令對 Mapper 的結果做排序，所以到了 Reducer 這邊是一個依照 key（也就是剛剛輸出的 collocations）排序好的 list。既然排序好了，就可以使用 python 內建的 itertools 當中的 groupby 來做，他會將 key 相同的 pairs 都 group 起來，這樣就不需要再建字典來存這些 pairs 了。

groups = groupby(allSents, key = lambda x: x[0])
for key, group in groups:
    scores = []
    numofCols = 0
    for g in group:
        sent = g[1]
    targetWord = key.split('_')[0]
    words = tokens(sent)
    loc = words.index(targetWord) # location of Word in Sent
    for w in words:
        if (targetWord, w, words.index(w)-words.index(targetWord)) in colls:
        numofCols += 1      
    pronWords = [wp for wp in words if wp in prons]
    numofProns = len(pronWords)
    s = loc + 10*numofCols - 10*numofProns
    scores.append((sent, s))
    print str(max(scores, key=lambda x: x[1]))

group 完之後針對每一個句子去算分數。這裡的分數根據 target word（比如例子中的 difficulty）在句子裡的位子(index)、句子含有幾組 collocations 以及句子中有多少的代名詞來做加減。最後輸出分數最高的就完成了。

其實這邊算分數還可以有其他方法，比如把長度也加進去，讓他不再只是一個門檻或限制；或者可以加入高頻字比例等等，之後有機會再研究！