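I was handed a task: clean up a very dirty dataset. And it really is dirty, both in how it looks and in what the people behind it did.

There was an American energy company, Enron, that did so many bad things that it went bankrupt and had its assets seized, including all the emails from its employees' business dealings. That email archive then got "opened" (?), and so it became a dataset used for academic research.

Before doing any further research or building anything on top of it, the data needs a cleanup first.

The raw data looks roughly like this: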

This note is to inform you that PIRA has commenced its study of the third=
=20
region: California & the Southwest. As in all regions, this study begins=
=20
with a fundamental view of gas flows in the U.S. and Canada. Pipelines in=
=20
this region (covering CA, NV, AZ, NM) will be discussed in greater detail=
=20
within the North American context. Then we turn to the value of =20
transportation at the following three major pricing points with an assessme=
nt=20
of the primary market (firm), secondary market (basis) and asset market:
??????1) Southern California border (Topock)
??????2) San Juan Basin
??????3) Permian Basin (Waha)?
The California/SW region=01,s workshop =01* a key element of the service =
=01* will=20
take place on March 20, 2000,at 8:30 AM, at the Arizona Biltmore Hotel in =
=20
Phoenix.?For those of you joining us, a?discounted block of rooms is bei=
ng=20
held through February 25.
The attached prospectus explains the various options for subscribing.?Plea=
se=20
note two key issues in regards to your subscribing options:?One, there is =
a=20
10% savings for PIRA retainer clients who order before February 25, 2000;=
=20
and?two, there are discounts for purchasing?more than one region.

In short, I need to strip the noise out of this stuff and turn it into a series of normal sentences.
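A side note on where some of that noise seems to come from: the `=20` tokens and the `=` at the end of wrapped lines look like leftover quoted-printable encoding, where `=XX` is a byte written in hex and a trailing `=` marks a soft line break. Assuming that is what happened here, Python's standard `quopri` module can undo it before any regex work. This is only a sketch: the helper name `decode_qp`, the Latin-1 round trip, and the decision to drop control characters such as `\x01` are assumptions made for illustration.

```python
import quopri
import re

def decode_qp(text):
    """Undo quoted-printable encoding and drop stray control characters."""
    # quopri works on bytes; latin-1 maps each byte to exactly one character
    # (assumption: the input contains no characters outside Latin-1)
    raw = quopri.decodestring(text.encode('latin-1'))
    decoded = raw.decode('latin-1')
    # Remove control characters (e.g. the \x01 produced by '=01'),
    # but keep tab, newline and carriage return
    return re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f]', '', decoded)

sample = "has commenced its study of the third=\n=20\nregion: California"
print(decode_qp(sample))
```

Blank lines survive this step, so the paragraph splitting that follows still works on the decoded text.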

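I want to find the paragraphs first. Assuming a blank line separates paragraphs, the text between one blank line and the next forms one paragraph (block).

Reading paragraphs (blocks)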

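import fileinput

block = []
multilines = ''
for line in fileinput.input(files):  # files: a path or a list of paths to read
    if line.strip():  # non-blank line: merge it with the preceding lines
        multilines += ' ' + line.strip()
    elif multilines:  # blank line: store the merged string in the list
        block.append(multilines.strip())
        multilines = ''
if multilines:  # keep the last block if the input does not end with a blank line
    block.append(multilines.strip())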

With the paragraphs found, the next step is of course splitting them into sentences.

Sentence splitting with NLTK

from nltk.tokenize import sent_tokenize
sentences = sent_tokenize(paragraph) # this is a list

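Next comes the noise inside the sentences: piles of odd symbols, too many underscores or spaces, and so on. To make them go away I thought of two operations, match and replace. With a simple regular expression you can both match and substitute: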

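import re
# match: keep strings that start with a capital letter and end with . ! or ?
re.match(r'^[A-Z].*?[.!?]$', s.strip())
# replace: first argument is the regex to be replaced, second is the replacement
s = re.sub(r'pattern to be replaced', 'replacement', s)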
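To make match and replace a bit more concrete, here is a minimal sketch applied to one of the lines from the sample above. The three patterns (runs of `?`, runs of underscores, repeated whitespace) are guesses based on what the sample looks like, and the helper names `clean_sentence` and `looks_like_sentence` are made up for this example.

```python
import re

def clean_sentence(s):
    """Replace a few kinds of noise with a single space."""
    s = re.sub(r'\?{2,}', ' ', s)   # runs of '?' (likely encoding debris)
    s = re.sub(r'_{2,}', ' ', s)    # long underscore rules
    s = re.sub(r'\s+', ' ', s)      # repeated whitespace
    return s.strip()

def looks_like_sentence(s):
    """True if s starts with a capital letter and ends with . ! or ?"""
    return re.match(r'^[A-Z].*?[.!?]$', s.strip()) is not None

print(clean_sentence("??????1)  Southern   California____border (Topock)"))
print(looks_like_sentence("The workshop will take place in Phoenix."))
print(looks_like_sentence("1) San Juan Basin"))
```

Single question marks are left alone on purpose, since some of them are real punctuation and a blanket replacement would damage genuine questions.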

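Finding all file names under a folder

Finally, there is the matter of processing the data in many files under one folder in a single run.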

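import os
for path, dirnames, filenames in os.walk('/path/to/enronsent/'):
    # path: the directory currently being visited, e.g.
    #   '/Users/shanny/Documents/enron.email/enronsent/'
    # filenames: a list of the file names in that directory, e.g.
    #   ['enronsent00', 'enronsent01', 'enronsent02']
    for name in filenames:
        print(os.path.join(path, name))  # do something with each file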
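One way to connect this to the block-reading loop is to collect full paths first. The sketch below assumes you want every file under the folder, subfolders included, in sorted order; `list_files` is a hypothetical helper name, not something from `os`.

```python
import os

def list_files(root):
    """Return the full paths of all files under root, sorted by path."""
    paths = []
    # os.walk yields one (dirpath, dirnames, filenames) tuple per directory
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    return sorted(paths)
```

The returned list can be passed as the `files` argument of `fileinput.input(...)`. Be careful with an empty list, though: `fileinput` falls back to reading standard input when it is given no files.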

List Comprehension

While I'm at it, a note on my first attempt at a two-level list comprehension, that is, one with two for loops.

allSents = []
for bl in block:
    for s in sent_tokenize(bl):  # split each block into sentences
        if re.match(r'^[A-Z].*?[.!?]$', s.strip()):
            allSents.append(s)
# rewritten as
allSents = [s for bl in block for s in sent_tokenize(bl) if re.match(r'^[A-Z].*?[.!?]$', s.strip())]
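The thing that tripped me up is the clause order: in a nested comprehension the `for` clauses are written in the same order as the nested loops, outer first, with the `if` filter at the end. Here is a small self-contained check that the two forms agree. It uses toy data, and `split_sentences` is a naive regex stand-in so the example runs without NLTK; it is not how `sent_tokenize` works.

```python
import re

SENTENCE = re.compile(r'^[A-Z].*?[.!?]$')

def split_sentences(paragraph):
    # Naive stand-in for a real sentence tokenizer: split after . ! or ?
    return re.split(r'(?<=[.!?])\s+', paragraph)

block = ["Prices rose today. see the attached file", "Call me! Thanks."]

# Explicit nested loops
looped = []
for bl in block:
    for s in split_sentences(bl):
        if SENTENCE.match(s.strip()):
            looped.append(s)

# Same logic as one comprehension: for-clauses in loop order, filter last
flat = [s for bl in block for s in split_sentences(bl) if SENTENCE.match(s.strip())]

print(flat)  # ['Prices rose today.', 'Call me!', 'Thanks.']
```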

Well, that's about it. With preprocessing done, the next step is parsing.
