[Data] Enron Data Preprocessing

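I was handed a very dirty dataset to work on. It really is dirty: the data looks dirty, and what the people behind it did was dirty too.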

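Enron was an American energy company that went bankrupt after a long run of wrongdoing. Its assets were seized, including the email that employees exchanged in the course of business. That email ended up being released to the public (?), and so it became a dataset used in academic research.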

Before any further research or application, the data needs to be cleaned up.

The raw data looks roughly like this:

This note  is to inform you that PIRA has commenced its study of the third=
=20
region:  California &amp; the Southwest. As in all regions, this  study begins=
=20
with a fundamental view of gas flows in the U.S. and Canada.  Pipelines in=
=20
this region (covering CA, NV, AZ, NM) will be discussed in greater  detail=
=20
within the North American context. Then we turn to the value of =20
transportation at the following three major pricing points with an assessme=
nt=20
of  the primary market (firm), secondary market (basis) and asset market:
??????1) Southern  California border (Topock)
??????2) San Juan  Basin
??????3) Permian  Basin (Waha)?
The California/SW region=01,s  workshop =01* a key element of the service =
=01* will=20
take place on March 20, 2000,at 8:30 AM, at the Arizona Biltmore Hotel in =
=20
Phoenix.?For those of you joining  us, a?discounted  block  of rooms is bei=
ng=20
held through February 25.
The attached prospectus explains the various  options for subscribing.?Plea=
se=20
note  two key issues in regards to your subscribing options:?One, there is =
a=20
10% savings for  PIRA retainer clients who order before February 25, 2000;=
=20
and?two, there are discounts for  purchasing?more than one  region.

In short, I need to strip out the noise and turn this text into ordinary sentences, one by one.
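The `=20` tokens and the trailing `=` at the ends of lines in the sample look like quoted-printable encoding, where `=20` is an encoded space and a trailing `=` is a soft line break. That is my reading of the sample, not something the dataset states. Under that assumption, a minimal sketch with Python's standard `quopri` module could undo that layer before any other cleaning. `decode_qp` is a name made up for this illustration.

```python
import quopri
import re


def decode_qp(raw: str) -> str:
    """Undo quoted-printable encoding, assuming the text really is QP-encoded."""
    decoded = quopri.decodestring(raw.encode("utf-8")).decode("utf-8", errors="replace")
    # Drop leftover control characters such as \x01 (decoded from "=01"),
    # keeping tab and newline.
    return re.sub(r"[\x00-\x08\x0b-\x1f]", "", decoded)


sample = "commenced its study of the third=\n=20\nregion:  California"
print(decode_qp(sample))
```

The soft line break joins `third` with the following line, and `=20` becomes a plain space, so the split lines read as running text again.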

I want to find the paragraphs first. Assuming a blank line separates paragraphs, the text between two blank lines forms one paragraph (block).

Reading blocks (Read paragraph/block)

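import fileinput

block = []
multilines = ''
for line in fileinput.input(files):  # files: a path or a list of paths
    if line.strip():  # non-blank line: merge it with the previous lines
        multilines += ' ' + line.strip()
    elif multilines:  # blank line: store the merged string, skipping empty blocks
        block.append(multilines.strip())
        multilines = ''
if multilines:  # keep the last block when the input does not end with a blank line
    block.append(multilines.strip())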
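The same blank-line grouping can also be sketched with `itertools.groupby` on an in-memory list of lines, which makes it easy to try out without any files. `split_blocks` is a hypothetical helper name, not part of the original script.

```python
from itertools import groupby


def split_blocks(lines):
    """Group consecutive non-blank lines into one space-joined block each."""
    blocks = []
    # groupby yields runs of lines sharing the same key: True for
    # non-blank lines, False for blank ones.
    for has_text, run in groupby(lines, key=lambda line: bool(line.strip())):
        if has_text:
            blocks.append(" ".join(line.strip() for line in run))
    return blocks


lines = ["first line\n", "second line\n", "\n", "\n", "next block\n"]
print(split_blocks(lines))
# ['first line second line', 'next block']
```

Because `groupby` closes the last run on its own, the final block is kept even when the input does not end with a blank line.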

With the paragraphs found, the next step is of course splitting them into sentences.

Sentence splitting with NLTK

from nltk.tokenize import sent_tokenize
# paragraph: one block string from the previous step
sentences = sent_tokenize(paragraph)  # returns a list of sentence strings

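Next comes the noise inside the sentences: strange symbols, long runs of underscores, extra whitespace, and so on. To get rid of them I thought of two operations, match and replace. A simple regular expression is enough for both matching and substitution: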

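import re
# match: true only if s starts with a capital letter and ends with . ! or ?
re.match('^[A-Z].*?[.!?]$', s.strip())
# replace: the first argument is the pattern (a regex), the second is the replacement
s = re.sub('to be replaced (re)', 'to replace', s)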
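To make the two calls concrete, here is a small sketch applied to noise like that in the sample above. The patterns (runs of `?`, repeated underscores, repeated whitespace) are my guesses at reasonable rules, and `clean_sentence` and `looks_like_sentence` are hypothetical names.

```python
import re


def clean_sentence(s: str) -> str:
    """Strip a few kinds of noise; the rules here are illustrative guesses."""
    s = re.sub(r"\?{2,}", " ", s)   # runs of "?" left by a bad encoding
    s = re.sub(r"_{2,}", " ", s)    # long runs of underscores
    s = re.sub(r"\s+", " ", s)      # collapse repeated whitespace
    return s.strip()


def looks_like_sentence(s: str) -> bool:
    """True if s starts with a capital letter and ends with . ! or ?"""
    return re.match(r"^[A-Z].*?[.!?]$", s.strip()) is not None


raw = "??????The  study begins____ with a   view of gas flows."
cleaned = clean_sentence(raw)
print(cleaned)                       # The study begins with a view of gas flows.
print(looks_like_sentence(cleaned))  # True
```

Running the substitutions before the match matters here: the raw string starts with `?`, so it would fail the capital-letter check until the noise is removed.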

Finding all file names under a folder

Finally, I need to process many files under one folder in a single run.

import os
# os.walk yields one (path, subfolders, filenames) tuple per directory
for files in os.walk('/path/to/enronsent/'):
    print(files[0], files[2])  # replace with the real processing
# files[0] = path
#   '/Users/shanny/Documents/enron.email/enronsent/'
# files[2] = a list of filenames
#   ['enronsent00', 'enronsent01', 'enronsent02']
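To see what `os.walk` hands back without touching real data, this sketch builds a throwaway directory with `tempfile` and collects full file paths. The file names mirror the ones in the comment above, and `list_files` is a hypothetical helper.

```python
import os
import tempfile


def list_files(root):
    """Return sorted full paths of every file under root, recursively."""
    paths = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            # filenames are bare names, so join them with their directory
            paths.append(os.path.join(dirpath, name))
    return sorted(paths)


with tempfile.TemporaryDirectory() as root:
    for name in ("enronsent00", "enronsent01"):
        with open(os.path.join(root, name), "w") as f:
            f.write("placeholder\n")
    for path in list_files(root):
        print(os.path.basename(path))
```

Joining `dirpath` with each name gives paths that can be passed straight to `fileinput.input`, which the bare file names alone would not allow unless the script runs inside that folder.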

List Comprehension

While I'm at it, here is a note on my first attempt at a two-level list comprehension, that is, one with two for loops.

allSents = []
for bl in block:
    for s in sent_tokenize(bl):  # split each block into sentences
        if re.match('^[A-Z].*?[.!?]$', s.strip()):
            allSents.append(s)
# rewritten as
allSents = [s for bl in block for s in sent_tokenize(bl) if re.match('^[A-Z].*?[.!?]$', s.strip())]
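A runnable version of the comparison, with a naive regex splitter standing in for `sent_tokenize` so that it runs without NLTK. `naive_split` is a made-up stand-in and far cruder than NLTK's tokenizer; the point is only that the loop and the comprehension give the same list.

```python
import re

PATTERN = re.compile(r"^[A-Z].*?[.!?]$")


def naive_split(paragraph):
    """Crude stand-in for sent_tokenize: split after . ! or ? plus whitespace."""
    return re.split(r"(?<=[.!?])\s+", paragraph.strip())


blocks = [
    "First sentence. second one is lowercase.",
    "No final punctuation",
    "Is this kept? Yes!",
]

# Nested loops
loop_result = []
for bl in blocks:
    for s in naive_split(bl):
        if PATTERN.match(s.strip()):
            loop_result.append(s)

# Same logic as one comprehension: the for clauses keep the loop order
comp_result = [s for bl in blocks for s in naive_split(bl) if PATTERN.match(s.strip())]

print(comp_result)  # ['First sentence.', 'Is this kept?', 'Yes!']
```

The `for` clauses in the comprehension are read left to right in the same order as the nested loops, with the `if` filter last.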

Well, that's about it. With preprocessing done, the next step is parsing.
