[NLP Lab] Spell Checker Using Web Corpus

Spring 2015 Natural Language Processing Lab.
Week 2. Spell Checker Using Web Corpus
Instructor: Jason S. Chang

這週的Lab與上週差不多，只是使用的字典來源不一樣了，由原本local的檔案擴增到網路資料，使用的是NetSpeak API。

首先，來介紹一下 real word errors 和 non-word errors。

Real Word Errors

所謂 real word errors 就是寫作的人拼錯字了，但是這個拼錯的字，本身也是一個字典裡查得到的字。聽起來非常複雜，舉些例子：

• at **brake** time -> at **break** time
• when the **brack** was finished -> when the **break** was finished

第一個例子中，brake 這個字本身有煞車的意思，放在這個句子裡雖然不符合句意，但是並沒有拼字錯誤，對於一般的文字編輯器（比如說 word ）或是 spelling checker 來說，是一個正確的字，所以不會被抓出來糾正。這類型的錯誤通常比較難偵測到，一般都會使用上下文來揪出這種實字錯誤。

Non-word Errors

相較之下，non-word errors 就容易偵測多了。只要字典裡沒有的字通通抓出來就對了！

• I felt very **strang** -> I felt very **strange**
• in the **weanter** when it was snowing -> in the **winter** when it was snowing

而這次的實作主要就是要修正這兩個類型的拼字錯誤。non-word error 就是去撿查字典，所以就不贅述了，以下要來看的是 real word errors 如何偵測與修正。另外這次的 input 是句子，而非上的單字或是片語。

其實概念差不多，只是多了一些檢查機制。這是講義裡給的做法：

Dealing with Real Word Errors

confusable words

這個步驟是要透過 confusable words set 去建立一個字典，而這個 set 就是一堆常常會被搞混的字的組合。這是前五個例子：

an    and
accede    exceed
accept    except
adept    adopt
adverse    averse

def buildDict():
    wordList = dict()
    for l in open('lab2.confusables.txt', 'r').read().strip().split('\n'):
        wordList[l.split('\t')[0]] = l.split('\t')[1]
    return wordList

generating several candidates and compute overlapping ngram
利用 confus 字典來產生candidates。

1
2
3

for test in testPairs.keys():
    freq = Counter()
    freq[test] = freqCount(SE, test)

testPairs.keys() 是要改錯的句子。這段程式碼首先將每個 test 句子依照原字丟進 freqCount() 去算 3-gram 的頻率，而 count 則是每個 3-gram 的相乘：

def freqCount(SE, test):
    count = 1
    words = test.split(' ')
    for i in range(0, len(words)-2):
        res = SE.search(' '.join(words[i:i+3])) # ('when the brake', 7307.0)
        if res:
            for x in res:
                count *= x[1]
    return count

接下來呢，針對 test 裡的每個字，去 confus 裡查是否在易混淆字典裡，若有，就將那個字換成 confus[w] 裡的字，再丟到 freqCount() 裡算 3-gram 的頻率。若沒有，就丟到 non-word spell checker 去檢查，然後用 spell.correct() 的字來算頻率。解釋起來好難，直接看程式碼。

words = test.split(' ')
for w in words:
    if w in confus.keys():
        freq[replace(words, w, confus[w])] = freqCount(SE, replace(words, w, confus[w]))
    freq[replace(words, w, spell.correct(w))] = freqCount(SE, replace(words, w, spell.correct(w)))

output the max. freq value
把這些頻率全部存起來，最後輸出一個最大值的結果。
1
print test + ' -> ' + str(max(freq.items(), key=lambda a: a[1]))
結果輸出長這樣：

稍微算了一下修正率，大概是50-60%，有待加強啦。

Slides and Code

NLP Lab. 2017 Week 02 github repo