[NLP Lab] Dealing with Spelling Errors

Spring 2015 Natural Language Processing Lab.
Week 1. Dealing with Spelling Errors
Instructor: Jason S. Chang

第一次的 NLP 實作課要來處理拼字錯誤，並且實作一個 spelling corrector。

Dealing with Spelling Errors

以參考資料中的小程式為基礎，增加一些功能，擴展成為能處理以下錯誤的corrector：

• Fusion errors (e.g. “taketo” → “take to”)
• Multi-token errors (e.g. “mor efun” → “more fun”)
• Fusion errors (e.g. “with out” → “without”)

先來解釋一下參考資料中的小程式長什麼樣子。

首先是前置作業，先讀入一份文件，將這份文件的所以單字抓出來，並存入字典，字典的key是單字，value則是那個字在這份文件中出現的次數，這個字典用來作為改錯的標準答案。

建好字典之後，開始進入改錯的流程。

1
2
3

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)

每輸入一個字，程式會先去查剛剛建好的字典，若這個字有出現在字典裡，便把它直接加入候選字中 (candidates)。

若沒有，程式會對 word 做一次的編輯 (1 edit)，也就是將這個字中任意位置的字母刪除 (delete)、任兩個字母交換位置 (transpose)、用任意字母來取代某一位置的字母 (replace)、在任意位置插入一個字母 (insert)等等。使得編輯後的字與 word 的 edit distance 為 $1$。程式片段如下：

def edits1(word):
   splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
   deletes = [a + b[1:] for a, b in splits if b]
   transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)&amp;amp;amp;amp;gt;1]
   replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
   inserts = [a + c + b for a, b in splits for c in alphabet]
   return set(deletes + transposes + replaces + inserts)

如果 edits1 回傳的結果也在字典裡，則也會被加入到候選字中。

如果還是沒有，直覺上當然是繼續做 edit distance 為 2 的處理了。既然已經有製造 edit distance 為 $1$ 的 function，那將它做兩次，edit distance就是2了。

1 2	def known_edits2(word): return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

最後呢，從這些候選字當中，找一個最常出現（也就是 value 最大）的字回傳，就是拼字錯誤的修正結果了。很直覺、土法煉鋼，卻也是最簡單有效的方法。但是這個方法只能找到 edit distance 為 $1$ 或 $2$ 的單字錯誤，沒辦法找到這次作業要求的片語錯誤。所以接下來，就要來講一下怎麼處理 fusion errors 以及 multi-tokens errors。

Dealing with Fusion Errors

處理 fusion errors 以及 multi-tokens errors 也不難，根據原本的程式片段稍微修改一下就可以了。

將空白加入定義好的alphabet中，這樣空白就可以跟著其他字母一起刪除、插入、置換了。
1
alphabet = 'abcdefghijklmnopqrstuvwxyz '

將空白加入還不夠，還必須擴增字典，從單字到 2-gram。

def train():
    f = open('big.txt', 'r')
    lines = f.read().strip().split('\n')
    for l in lines:
        words = l.strip().lower().split()
        for i in range(0, len(words)):
            uniGram[words[i]] += 1
            if i+2 <= len(words):
                biGram[' '.join(words[i:i+2])] += 1

接下來，必須去檢查片與中的每個字是否也是一個拼字正確的單字。所以加入了一個 function 叫做 known_edits1()。先將每個傳入的片語（也就是空白加入 alphabet 後去做一次編輯的結果）用空白分開成兩個字，再把這兩個字丟進字典去查，如果有其中一個不存在字典裡，就捨棄整個片語。如果有其中一個字存在字典裡，就將那個字的分數 +1，增加它的權重。

def known_edits1(words): # known_edits1(edits1(word))
    words_not = []
    for w in words:
        for ws in w.strip().split(' '):
            if ws not in NWORDS:
                words_not.append(w)
                break
            else:
                NWORDS[w] += 1
        if ' ' not in w and w not in NWORDS:
            words_not.append(w)
    for wn in words_not:
        if wn in words:
            words.remove(wn)
    return set(w for w in words)

最後，要修改一下決定 candidates 的方法，讓 know_edits1() 在前面，去抓出片語錯誤。

1
2
3

def correct(word):
    candidates = known([word]) or known_edits1(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key = lambda k: NWORDS[k])

到這裡大致上就能處理範例中的幾個錯誤了。

• taketo → take to
• tak eto → take to
• mor efun → more fun
• with out → without
• som ething → something

Slides and Code

NLP Lab. 2017 Week 01 github repo