[NLP Lab] WriteAhead and Induction of Grammar Patterns

Spring 2015 Natural Language Processing Lab.
Week 8. and 9. WriteAhead and Induction of Grammar Patterns
Instructor: Jason S. Chang

In short, this assignment is to build a WriteAhead.

Introduction

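To implement this assignment, we first need to understand what Pattern Grammar is.
It is a concept from linguistics (rendered in Chinese as 樣式文法), and it is composed as follows: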

Pattern = Template + Headword
Headword = verbs, nouns, and adjectives (headwords are organized into groups)

Put simply, every main word (Headword) has fixed templates (Template) in the grammar, and together they form a pattern (Pattern).

Templates are defined in advance and are built around verbs, nouns, and adjectives. For example:

Verbs
V prep n, V pl-n, V pron-re, V amount, V adj, V -ing, V to-inf
Nouns
N that, N to-inf, the N, N -ing, N ‘s ADJST
Adjectives
a ADJ amount, ADJ adj, ADJ and adj, ADJ that, ADJ to-inf

Pattern Grammar is built by combining these predefined Templates with Headwords.

For example:
Pattern: V about n
This Pattern combines the Headword V with the Template V prep n, and the V slot can be filled with the following words:

H1 TALK argue, ask, bellyache, …
H2 THINK agonize, agree, bother, …
H3 LEARN hear, find out, learn, read, …

Here are some example sentences:

We’d like to ask (V) about (prep) the show (n).
I read (V) about (prep) the museum (n), and it is really interesting.
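
As a rough illustration of Pattern = Template + Headword, the sketch below fills the V slot of V about n with the headwords listed above. The names HEADWORD_GROUPS, instantiate, and headwords_for are made up for this example and are not part of the lab code; only the headwords shown in the text are included, so the lists are incomplete.

```python
# Headword groups for the pattern "V about n", copied from the list above.
# The original lists are longer; only the words shown in the text appear here.
HEADWORD_GROUPS = {
    "TALK": ["argue", "ask", "bellyache"],
    "THINK": ["agonize", "agree", "bother"],
    "LEARN": ["hear", "find out", "learn", "read"],
}


def instantiate(pattern, headword):
    """Replace the upper-case head slot (e.g. 'V') of a pattern with a headword."""
    slots = pattern.split()
    return " ".join(headword if slot.isupper() else slot for slot in slots)


def headwords_for(groups):
    """Flatten the grouped headwords into one list, keeping group order."""
    return [word for group in groups.values() for word in group]


if __name__ == "__main__":
    for word in headwords_for(HEADWORD_GROUPS):
        print(instantiate("V about n", word))
```

Calling instantiate("V about n", "ask") gives "ask about n", which is the shape of the first example sentence above.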

The goal of this lab is to find these grammar patterns in sentences and use them to assist English writing.

Implementation

Mapper

(1) From Chunks to Elements
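Extract the building blocks (Elements) of Pattern Grammar from part-of-speech-tagged data.
Here is a tagged sentence. Each input line holds four tab-separated fields (words, lemmas, POS tags, phrase labels), shown one per line here:

I have great difficulty in understanding him.
I have great difficulty in understand him.
PRP VBP JJ NN IN VBG PRP .
H-NP H-VP I-NP H-NP H-PP H-VP H-NP O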

Using a set of rules, we can derive Pattern Grammar elements from each word's POS tag and phrase label.
Here are some of the rules:

H-NP and NN –> ['N', 'n']
H-VP and VBG –> ['V', 'v', '-ing']
H-ADJP –> ['ADJ', 'adj']
H-PP –> ['prep']
WDT, WP, WRB –> ['wh']
to, so, not, which, that, if, though, and, together, way –> to, …, way

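On the code side, we first read the raw data and split each sentence's words, lemmas, tags, and phrases on tabs and spaces, storing them in four lists. We then use zip to pass each aligned group of four to genElement(), which produces the elements.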

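import fileinput

for line in fileinput.input():
    sentResult = []
    # Lines from fileinput keep their trailing newline, so strip before testing for blanks.
    if not line.strip(): continue
    # Fields are tab-separated; tokens within a field are space-separated.
    words, lemmas, tags, phrases = [ x.split(' ') for x in line.strip().split('\t') ]
    # One list of candidate elements per token.
    elements = [ genElement(*x) for x in zip(words, lemmas, tags, phrases) ]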

genElement() checks whether a token's tag and phrase label meet the conditions of each rule.

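# Reserved words, taken from the rule list above; each one stands for itself as an element.
reservedWords = set(['to', 'so', 'not', 'which', 'that', 'if', 'though', 'and', 'together', 'way'])

def genElement(word, lemma, tag, phrase):
    res = []
    if lemma in reservedWords: res += [lemma, ]
    if tag in ['WDT', 'WP', 'WP$', 'WRB']: res += ['wh']
    # Head of a noun phrase: may be the headword (N) or a plain noun slot (n).
    if phrase == 'H-NP' and tag in ['NN', 'NNS', 'NNPS', 'NNP']: res += ['N', 'n']
    # Gerund heading a verb phrase: headword (V), verb slot (v), or an -ing slot.
    if phrase == 'H-VP' and tag == 'VBG': res += ['V', 'v', '-ing']
    if phrase == 'H-ADJP': res += ['ADJ', 'adj']
    if phrase == 'H-PP': res += ['prep']
    return res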
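
To make the rules concrete, here is a small self-contained sketch that applies them to the sample sentence. It restates the rules under hypothetical names (gen_element, parse_line, RESERVED_WORDS) and assumes the final period is its own token so that all four fields have eight items; that tokenization is an assumption, not something the data above confirms.

```python
# Reserved words and tag sets, restated from the rules shown above.
RESERVED_WORDS = {"to", "so", "not", "which", "that", "if", "though",
                  "and", "together", "way"}
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
WH_TAGS = {"WDT", "WP", "WP$", "WRB"}


def gen_element(word, lemma, tag, phrase):
    """Return the candidate pattern elements for one token (partial rule set)."""
    res = []
    if lemma in RESERVED_WORDS:
        res.append(lemma)
    if tag in WH_TAGS:
        res.append("wh")
    if phrase == "H-NP" and tag in NOUN_TAGS:
        res += ["N", "n"]
    if phrase == "H-VP" and tag == "VBG":
        res += ["V", "v", "-ing"]
    if phrase == "H-ADJP":
        res += ["ADJ", "adj"]
    if phrase == "H-PP":
        res.append("prep")
    return res


def parse_line(line):
    """Split one tab-separated input line into four parallel token lists."""
    words, lemmas, tags, phrases = [f.split(" ") for f in line.strip().split("\t")]
    return words, lemmas, tags, phrases


# Assumption: the sentence-final period is a separate token in every field.
LINE = ("I have great difficulty in understanding him .\t"
        "I have great difficulty in understand him .\t"
        "PRP VBP JJ NN IN VBG PRP .\t"
        "H-NP H-VP I-NP H-NP H-PP H-VP H-NP O\n")

words, lemmas, tags, phrases = parse_line(LINE)
elements = [gen_element(*tok) for tok in zip(words, lemmas, tags, phrases)]

if __name__ == "__main__":
    for word, elem in zip(words, elements):
        print(word, elem)
```

With only the rules listed here, difficulty, in, and understanding receive elements, while have (VBP) receives none. The full rule set presumably covers other verb forms as well, since only some of the rules are shown.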

(2) From Elements to Patterns

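Next, we generate patterns from the elements. The many break statements are there because we want the longest pattern in each sentence. For example, have difficulty and have difficulty in are both valid patterns, but because have difficulty in is longer, we keep only that one.

Also note region.reverse(): it reverses the order of j, so the search starts from the last word and works backwards. That is why we break out as soon as the first pattern is found: i counts up from the start and j counts down from the end, so the first match is the longest pattern that begins at the earliest matching position.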

import re
from itertools import product

flag = False   # set once a pattern has been found for this sentence
for i in range(leng-1):
    region = list(range(i+2, leng+1))   # candidate end positions j
    region.reverse()                    # try the longest span first
    for j in region:
        for p in product(*elements[i:j]):
            # a pattern spans at most 5 tokens and must match a known template
            if (j-i <= 5) and re.sub(r'\s_|_\s|_', "", ' '.join(p)) in allTemplates: # match
                result += [genPattern(p, words[i:j], lemmas[i:j])]
                flag = True
                break
        if flag: break
    if flag: break
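
The same search can be sketched as a function that returns on the first match, which avoids the chain of flag checks. This is only an illustration: find_first_longest, ALL_TEMPLATES, and ELEMENTS are hypothetical names, the template set is a tiny made-up subset, and ELEMENTS is what the rules shown earlier would produce for the sample sentence if the final period is treated as its own token.

```python
import re
from itertools import product

# Hypothetical template subset for illustration; the real allTemplates is larger.
ALL_TEMPLATES = {"N prep", "N prep -ing", "V prep n"}
MAX_LEN = 5  # same span limit as the loop above

# Elements for "I have great difficulty in understanding him ." (assumed).
ELEMENTS = [[], [], [], ["N", "n"], ["prep"], ["V", "v", "-ing"], [], []]


def find_first_longest(elements, templates=ALL_TEMPLATES, max_len=MAX_LEN):
    """Return (i, j, combo) for the earliest start i that has a match,
    preferring the longest span [i, j) at that start; None if nothing matches."""
    n = len(elements)
    for i in range(n - 1):
        for j in range(n, i + 1, -1):          # j = n, n-1, ..., i+2
            if j - i > max_len:
                continue
            # product() yields nothing if any token in the span has no element
            for combo in product(*elements[i:j]):
                candidate = re.sub(r"\s_|_\s|_", "", " ".join(combo))
                if candidate in templates:
                    return i, j, combo
    return None


if __name__ == "__main__":
    print(find_first_longest(ELEMENTS))
```

For the sample elements the function returns the span 3 to 6 with the combination ('N', 'prep', '-ing'), even though the shorter N prep is also in the template set, because longer spans are tried first.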

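genPattern() looks like this. If a tag is upper-case or is a preposition (prep), it is replaced by the token's lemma with the tag appended (lemma_tag); otherwise the tag itself is used in the returned pattern.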

def genPattern(template, words, lemmas):
    pat = []
    for tag, word, lemma in zip(template, words, lemmas):
        # head slots (upper-case tags) and prepositions become lemma_tag; other elements stay as tags
        pat += [lemma + '_' + tag if (tag.isupper() or tag=='prep') else tag]
    return ' '.join(pat)

Finally, after some further processing, the mapper emits lines in the following format, which is convenient for the reducer:

ensue \t ensue between
effect \t effect that
difficulty \t difficulty in -ing
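
The further processing is not shown in this write-up. One possible way to get from a genPattern() string such as difficulty_N in_prep -ing to the lines above is sketched below; to_mapper_line is a hypothetical helper, and treating the first upper-case slot as the headword is an assumption about the intended behaviour.

```python
def to_mapper_line(tagged_pattern):
    """Convert a 'lemma_TAG ...' pattern into 'headword<TAB>pattern'.

    Returns None when the pattern has no upper-case head slot.
    """
    headword = None
    parts = []
    for token in tagged_pattern.split(" "):
        lemma, sep, tag = token.rpartition("_")
        if not sep:                 # plain element such as '-ing' or 'that'
            parts.append(token)
            continue
        parts.append(lemma)
        if tag.isupper() and headword is None:
            headword = lemma        # assumed: first upper-case slot is the headword
    if headword is None:
        return None
    return headword + "\t" + " ".join(parts)


if __name__ == "__main__":
    print(to_mapper_line("difficulty_N in_prep -ing"))
```

Keying each line by its headword lets the reducer group all patterns of the same word together.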

Reducer

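(1) For each verb, noun, and adjective, count how many times each of its patterns occurs.
(2) Compute the total count, the mean count, and the standard deviation over all patterns.
(3) If a pattern's count is greater than mean + 1 * standard deviation –> output (word, pattern, count).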

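import collections
import fileinput
from itertools import groupby

import numpy

# groupby only merges adjacent lines, so the input must already be sorted by headword.
patterns = [line.strip().split('\t') for line in fileinput.input()]
for key, group in groupby(patterns, key = lambda x: x[0]):
    count = collections.defaultdict(int)
    if key.isalpha():
        print('\n{}: '.format(key))
        for g in group:
            count[g[1]] += 1 # patCount
        mean = numpy.mean(list(count.values()))
        std = numpy.std(list(count.values()))
        for p,v in count.items():
            # keep patterns whose count exceeds mean + 1 * standard deviation
            if float(v) > float(mean) + float(std):
                print('\n{}: {},{}\t'.format(key,p,v))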
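
As a worked example of the threshold in step (3), the sketch below runs the same filter on a handful of made-up counts using only the standard library. salient_patterns and PAIRS are hypothetical, the counts are toy values rather than corpus results, and statistics.pstdev is used because numpy.std computes the population standard deviation by default.

```python
from collections import Counter
from statistics import mean, pstdev


def salient_patterns(pairs, k=1.0):
    """Group (headword, pattern) pairs and keep the patterns whose count
    exceeds mean + k * population standard deviation for that headword."""
    by_head = {}
    for head, pattern in pairs:
        by_head.setdefault(head, Counter())[pattern] += 1
    kept = []
    for head, counts in by_head.items():
        values = list(counts.values())
        threshold = mean(values) + k * pstdev(values)
        for pattern, count in counts.items():
            if count > threshold:
                kept.append((head, pattern, count))
    return kept


# Made-up toy counts, not corpus results.
PAIRS = ([("difficulty", "difficulty in -ing")] * 8
         + [("difficulty", "difficulty of n")] * 2
         + [("difficulty", "difficulty with n")] * 2)

if __name__ == "__main__":
    print(salient_patterns(PAIRS))
```

Here the counts are 8, 2, and 2, so the mean is 4 and the standard deviation is about 2.83, giving a threshold of about 6.83; only difficulty in -ing passes. Note that a headword with a single distinct pattern never passes a strict greater-than test, because its count equals the mean and the deviation is zero.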

That covers the implementation for this lab.