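I've noticed my memory is getting more and more goldfish-like: I forget every tool as soon as I'm done with it and end up looking everything up from scratch each time. So here are notes on the ones I use most often.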

Sentence Tokenization

A tool for splitting text into sentences.

Return a sentence-tokenized copy of text, using NLTK’s recommended sentence tokenizer (currently PunktSentenceTokenizer for the specified language).

It takes two parameters:

  • text – text to split into sentences
  • language – the model name in the Punkt corpus (default: English)
from nltk.tokenize import sent_tokenize
text = "Mayday evolved from So Band in 1995 while the members were studying in The Affiliated Senior High School of National Taiwan Normal University. They were later joined by Masa and Stone, who were attending the same school."
sent_tokenize_list = sent_tokenize(text)
sent_tokenize_list

The result is the text split into sentences, returned as a list:

['Mayday evolved from So Band in 1995 while the members were studying in The Affiliated Senior High School of National Taiwan Normal University.', 'They were later joined by Masa and Stone, who were attending the same school.']
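To see why a trained tokenizer is worth using, here is a minimal sketch of a naive splitter for comparison. It is not how Punkt works: `naive_sent_split` is a hypothetical helper of my own that simply breaks after `.`, `!` or `?` followed by whitespace. It happens to give the same two sentences for the Mayday text, but it trips over abbreviations.

```python
import re

def naive_sent_split(text):
    """Hypothetical helper: split after ., ! or ? followed by whitespace.

    This is only an illustration, not the Punkt algorithm.
    """
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

text = ("Mayday evolved from So Band in 1995 while the members were studying in "
        "The Affiliated Senior High School of National Taiwan Normal University. "
        "They were later joined by Masa and Stone, who were attending the same school.")

sentences = naive_sent_split(text)
print(len(sentences))  # 2

# An abbreviation is enough to break the naive rule.
tricky = naive_sent_split("Dr. Smith arrived. He sat down.")
print(tricky)  # ['Dr.', 'Smith arrived.', 'He sat down.']
```

Punkt is a trained model designed to cope with abbreviations like "Dr.", which is the main reason to prefer sent_tokenize() over a hand-written regex.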

Word Tokenization and POS Tagging

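word_tokenize() splits text into word tokens; passing those tokens to nltk.pos_tag() then labels each one with its part of speech.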

import nltk
text = nltk.word_tokenize("Mayday won the Golden Melody Award for Best Band in 2001, 2004, 2009 and 2012.")
nltk.pos_tag(text)

NLTK's pos_tag() uses the Penn Treebank POS tags. Here is the tagged result:

[('Mayday', 'NNP'), ('won', 'VBD'), ('the', 'DT'), ('Golden', 'NNP'), ('Melody', 'NNP'), ('Award', 'NNP'), ('for', 'IN'), ('Best', 'NNP'), ('Band', 'NNP'), ('in', 'IN'), ('2001', 'CD'), (',', ','), ('2004', 'CD'), (',', ','), ('2009', 'CD'), ('and', 'CC'), ('2012', 'CD'), ('.', '.')]
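Since the tagger returns a plain list of (token, tag) tuples, ordinary Python is enough to work with it. Here is a small sketch that picks out the proper nouns and counts the tags. The tagged output above is pasted in as a literal so the snippet runs without NLTK; the variable names are my own.

```python
from collections import Counter

# The tagged output from above, pasted in as a literal.
tagged = [('Mayday', 'NNP'), ('won', 'VBD'), ('the', 'DT'), ('Golden', 'NNP'),
          ('Melody', 'NNP'), ('Award', 'NNP'), ('for', 'IN'), ('Best', 'NNP'),
          ('Band', 'NNP'), ('in', 'IN'), ('2001', 'CD'), (',', ','),
          ('2004', 'CD'), (',', ','), ('2009', 'CD'), ('and', 'CC'),
          ('2012', 'CD'), ('.', '.')]

# Keep only the tokens tagged as singular proper nouns (NNP).
proper_nouns = [word for word, tag in tagged if tag == 'NNP']
print(proper_nouns)  # ['Mayday', 'Golden', 'Melody', 'Award', 'Best', 'Band']

# Count how often each tag appears.
tag_counts = Counter(tag for _, tag in tagged)
print(tag_counts['CD'])  # 4
```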

To quickly look up what a tag means, say I want to know what "JJ" stands for, you can write:

nltk.help.upenn_tagset('JJ')

It prints a short description and a few example words:

JJ: adjective or numeral, ordinal
    third ill-mannered pre-war regrettable oiled calamitous first separable ectoplasmic battery-powered participatory fourth still-to-be-named multilingual multi-disciplinary ...

Lemmatization

from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
wordnet_lemmatizer.lemmatize('dogs', pos='n')

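wordnet_lemmatizer.lemmatize() can be called with just the first argument. The second argument specifies the part of speech (it defaults to noun), and setting it can improve accuracy.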

The snippet above returns:

dog
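One practical wrinkle: nltk.pos_tag() returns Penn Treebank tags, while the pos argument of lemmatize() expects WordNet's single-letter codes ('n', 'v', 'a', 'r'). A common way to bridge the two is a small mapping on the tag prefix. The sketch below is only that: `penn_to_wordnet` is a hypothetical helper, not part of NLTK, and falling back to noun for everything else is my assumption (it matches lemmatize()'s default).

```python
def penn_to_wordnet(tag):
    """Hypothetical helper: map a Penn Treebank tag to a WordNet POS letter."""
    if tag.startswith('JJ'):
        return 'a'  # adjective
    if tag.startswith('VB'):
        return 'v'  # verb
    if tag.startswith('RB'):
        return 'r'  # adverb
    return 'n'      # assumption: treat everything else as a noun

# A few (token, tag) pairs in the shape that nltk.pos_tag() returns.
tagged = [('Mayday', 'NNP'), ('won', 'VBD'), ('the', 'DT')]
pairs = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(pairs)  # [('Mayday', 'n'), ('won', 'v'), ('the', 'n')]
```

Each pair can then be passed along as wordnet_lemmatizer.lemmatize(word, pos=pos).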

Stanford Parser

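And of course there is the most interesting one of all, the Stanford Parser family. I've already banged my head against that wall, so I'll write it up in a separate post.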

Reference: http://textminingonline.com/dive-into-nltk-part-i-getting-started-with-nltk