[Python] NLTK Tool Notes
My memory for tools keeps getting worse: every time I finish using one I forget it and have to look everything up from scratch, so I am recording the more commonly used ones here.
Sentence Tokenization
A tool for splitting text into sentences: sent_tokenize().
Return a sentence-tokenized copy of text, using NLTK’s recommended sentence tokenizer (currently PunktSentenceTokenizer for the specified language).
It takes two parameters:
- text – text to split into sentences
- language – the model name in the Punkt corpus (default: English)
The result is the text split into sentences, returned as a list:
['Mayday evolved from So Band in 1995 while the members were studying in The Affiliated Senior High School of National Taiwan Normal University.', 'They were later joined by Masa and Stone, who were attending the same school.']
Word Tokenization and POS Tagging
word_tokenize()
Besides splitting text into words, NLTK can also tag each token's part of speech along the way (with pos_tag()).
NLTK's default tagger, pos_tag(), uses the Penn Treebank POS tag set. This is the result after tagging:
[('Mayday', 'NNP'), ('won', 'VBD'), ('the', 'DT'), ('Golden', 'NNP'), ('Melody', 'NNP'), ('Award', 'NNP'), ('for', 'IN'), ('Best', 'NNP'), ('Band', 'NNP'), ('in', 'IN'), ('2001', 'CD'), (',', ','), ('2004', 'CD'), (',', ','), ('2009', 'CD'), ('and', 'CC'), ('2012', 'CD'), ('.', '.')]
To quickly look up what a tag means, for example what "JJ" stands for, you can write:
It prints a short description and some simple examples:
JJ: adjective or numeral, ordinal
    third ill-mannered pre-war regrettable oiled calamitous first separable
    ectoplasmic battery-powered participatory fourth still-to-be-named
    multilingual multi-disciplinary
Lemmatization
wordnet_lemmatizer.lemmatize()
You can pass just the first argument; the optional second argument specifies the part of speech, which improves accuracy.
The result of the code snippet above is:
dog
Stanford Parser
And of course there is the most interesting part, the Stanford Parser family. I hit a wall with it before, so I will write that up in a separate post.
Reference: http://textminingonline.com/dive-into-nltk-part-i-getting-started-with-nltk