In this post, I explore a hotel review dataset from Kaggle using pandas and visualize it with matplotlib.

The Data

The dataset, “515K Hotel Reviews Data in Europe”, can be downloaded from Kaggle. It contains about 515,000 customer reviews and scores for 1,493 luxury hotels across Europe.

The CSV file contains 17 fields, including hotel information, positive and negative review text, the reviewer's score, and so on. A description of each field can be found on Kaggle.

import pandas as pd
all_reviews = pd.read_csv('Hotel_Reviews.csv')
all_reviews.head(5)

Part of the data

About the Hotels

How many hotels were rated?

Use value_counts() to get the counts of unique values of a Series.

all_reviews['Hotel_Name'].value_counts()

1,492 distinct hotel names appear in this dataset.

Britannia International Hotel Canary Wharf           4789
Strand Palace Hotel                                  4256
Park Plaza Westminster Bridge London                 4169
Copthorne Tara Hotel London Kensington               3578
DoubleTree by Hilton Hotel London Tower of London    3212
                                                     ... 
Le Lavoisier                                           12
Hotel Eitlj rg                                         12
Hotel Wagner                                           10
Mercure Paris Porte d Orleans                          10
Hotel Gallitzinberg                                     8
Name: Hotel_Name, Length: 1492, dtype: int64
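
If only the number of distinct hotels is needed, `Series.nunique()` returns it directly, without reading the `Length` field off the frequency table. Here is a minimal sketch on a made-up sample; the hotel names below are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for all_reviews['Hotel_Name']; names are invented for illustration.
hotel_names = pd.Series(
    ["Hotel A", "Hotel B", "Hotel A", "Hotel C", "Hotel A", "Hotel B"],
    name="Hotel_Name",
)

# value_counts() returns one row per distinct value, sorted by frequency.
counts = hotel_names.value_counts()

# nunique() gives the number of distinct (non-null) values directly.
n_hotels = hotel_names.nunique()

print(counts)
print(n_hotels, len(counts))
```

On the real data, the same call would be `all_reviews['Hotel_Name'].nunique()`.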

About the Reviewers

Where are the reviewers from?

all_reviews['Reviewer_Nationality'].value_counts(normalize=True)

The reviewers report 227 different nationalities. Nearly half of the reviews (47.6%) come from the United Kingdom, followed by the United States, Australia, Ireland, and the United Arab Emirates.

United Kingdom               0.475524
United States of America     0.068711
Australia                    0.042048
Ireland                      0.028749
United Arab Emirates         0.019845
                            ...   
Tuvalu                       0.000002
Anguilla                     0.000002
Vatican City                 0.000002
Svalbard Jan Mayen           0.000002
Comoros                      0.000002
Name: Reviewer_Nationality, Length: 227, dtype: float64
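
The fractions above come from `normalize=True`, which divides each count by the total number of non-null entries. The sketch below checks that relationship on a small invented sample; the values are illustrative only.

```python
import pandas as pd

# Toy stand-in for all_reviews['Reviewer_Nationality']; values are illustrative.
nationality = pd.Series(
    ["UK", "UK", "USA", "UK", "Ireland"], name="Reviewer_Nationality"
)

counts = nationality.value_counts()
shares = nationality.value_counts(normalize=True)

# normalize=True is equivalent to dividing each count by the total count.
manual_shares = counts / counts.sum()

print(shares)
```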
import matplotlib.pyplot as plt
all_reviews['Reviewer_Nationality'].value_counts().head(10).plot(kind='barh', figsize=(10, 4)).invert_yaxis()
plt.xlabel('Number of reviews')
plt.ylabel('Nationality')
plt.title('The Nationality of Reviewers')

Reviewer Nationality

About the Review

When were the reviews published?

review_date = pd.to_datetime(all_reviews['Review_Date'])
counted_review_date = review_date.value_counts().sort_index()
counted_review_date

The reviews span two years, from 2015-08-04 to 2017-08-03.
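
The date range can also be read off with `min()` and `max()`, and the daily counts can be rolled up by month for an easier overview. This is a rough sketch on invented dates; the `M/D/YYYY` string format is an assumption about how `Review_Date` is stored.

```python
import pandas as pd

# Toy stand-in for all_reviews['Review_Date']; the M/D/YYYY format is an assumption.
raw_dates = pd.Series(["8/4/2015", "8/4/2015", "9/1/2015", "8/3/2017"])
review_date = pd.to_datetime(raw_dates, format="%m/%d/%Y")

# Earliest and latest review dates.
first_day, last_day = review_date.min(), review_date.max()

# Convert each date to its calendar month, then count reviews per month.
per_month = review_date.dt.to_period("M").value_counts().sort_index()

print(first_day.date(), last_day.date())
print(per_month)
```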

Review Content

On average, the negative part of a review (18.5 words) is slightly longer than the positive part (17.8 words).

avg_pos_review_word_cnt = all_reviews['Review_Total_Positive_Word_Counts'].mean() # 17.8
avg_neg_review_word_cnt = all_reviews['Review_Total_Negative_Word_Counts'].mean() # 18.5

Let’s look at how the review scores are distributed.

all_reviews[['Reviewer_Score']].plot(kind='hist',bins=list(range(11)), rwidth=0.8)
plt.xlabel('Score')
plt.ylabel('Count')
plt.title('Number of reviews vs Review Score')

Review Score
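
To see the numbers behind the histogram, the same bin edges can be passed to `numpy.histogram`. The sketch below uses a handful of invented scores, not values from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for all_reviews['Reviewer_Score']; the scores are illustrative.
scores = pd.Series([2.5, 7.1, 7.9, 8.3, 9.6, 10.0, 10.0])

# Same bin edges as the plot: 0, 1, ..., 10. Each bin is [left, right),
# except the last one, which also includes its right edge (10.0).
counts, edges = np.histogram(scores, bins=list(range(11)))

# Label each bin by its edges for a readable table.
hist_table = pd.Series(counts, index=[f"{i}-{i + 1}" for i in range(10)])
print(hist_table)
```

Note that a score of exactly 10.0 falls into the last bin (9-10) rather than being dropped.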

Reviews with Score >= 5.0

We call a review with Reviewer_Score >= 5.0 an overall positive review; any other review is an overall negative review.

overall_pos_review = all_reviews[all_reviews['Reviewer_Score'] >= 5.0]
overall_pos_review_cnt = overall_pos_review['Reviewer_Score'].count() # 493457
all_review_with_score_cnt = all_reviews['Reviewer_Score'].count()
avg_overall_pos_score = overall_pos_review['Reviewer_Score'].mean() # 8.6

There are 493,457 reviews with Reviewer_Score >= 5.0, which means that more than 95% of reviews are overall positive. Their average score is 8.6.
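
The share of overall positive reviews can be computed in one step, because the mean of a boolean Series is the fraction of `True` values. Here is a small sketch on invented scores, followed by the same arithmetic using the two counts reported in this post (this assumes every scored review falls into exactly one of the two groups).

```python
import pandas as pd

# Toy stand-in for all_reviews['Reviewer_Score']; the scores are illustrative.
scores = pd.Series([3.8, 5.0, 6.7, 8.8, 9.2, 10.0, 4.6, 7.5])

is_positive = scores >= 5.0
pos_count = int(is_positive.sum())          # number of True values
pos_share = is_positive.mean()              # fraction of True values
avg_pos_score = scores[is_positive].mean()  # mean score of positive reviews

# Same arithmetic with the counts reported in the post.
reported_share = 493457 / (493457 + 22281)

print(pos_count, pos_share, round(avg_pos_score, 2))
print(round(reported_share, 4))
```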

avg_overall_pos_review_pos_word_cnt = overall_pos_review['Review_Total_Positive_Word_Counts'].mean() # 18.23
avg_overall_pos_review_neg_word_cnt = overall_pos_review['Review_Total_Negative_Word_Counts'].mean() # 17.22

For overall positive reviews (Reviewer_Score >= 5.0), the positive part of the review averages 18.23 words and the negative part averages 17.22 words.

Reviews with Score < 5.0

overall_neg_review = all_reviews[all_reviews['Reviewer_Score'] < 5.0]
overall_neg_review_cnt = overall_neg_review['Reviewer_Score'].count() # 22281
avg_overall_neg_score = overall_neg_review['Reviewer_Score'].mean() # 3.86

There are 22,281 reviews with Reviewer_Score < 5.0, which means that only about 4% of reviews are overall negative. Their average score is 3.86.

avg_overall_neg_review_pos_word_cnt = overall_neg_review['Review_Total_Positive_Word_Counts'].mean()
avg_overall_neg_review_neg_word_cnt = overall_neg_review['Review_Total_Negative_Word_Counts'].mean()

For overall negative reviews (Reviewer_Score < 5.0), the positive part of the review averages 7.68 words, while the negative part averages 47.81 words. Reviewers tend to go into more detail about the negative side of a hotel.
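
The four averages for the two groups can also be produced in one pass with `groupby`, which makes the comparison easier to read side by side. This is a sketch on an invented four-row frame; the column names match the dataset, but the numbers and the `label` helper are illustrative.

```python
import numpy as np
import pandas as pd

# Toy stand-in for all_reviews; the values are invented for illustration.
reviews = pd.DataFrame({
    "Reviewer_Score": [9.2, 8.0, 3.3, 4.6],
    "Review_Total_Positive_Word_Counts": [20, 10, 4, 6],
    "Review_Total_Negative_Word_Counts": [5, 15, 40, 50],
})

# Hypothetical helper label: split reviews at the 5.0 threshold used in this post.
label = np.where(
    reviews["Reviewer_Score"] >= 5.0, "overall positive", "overall negative"
)

word_cols = [
    "Review_Total_Positive_Word_Counts",
    "Review_Total_Negative_Word_Counts",
]

# One row per group, one column per word-count field.
summary = reviews.groupby(label)[word_cols].mean()
print(summary)
```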

Tags

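from collections import Counter

# Each Tags value is a stringified list: strip the surrounding brackets,
# remove the quotes, then split on commas to get the individual tags.
tags = []
for tag_list in all_reviews.Tags.apply(lambda x: x[1:-1].replace('\'', '').split(',')):
    for tag in tag_list:
        tags.append(tag.strip())
Counter(tags).most_common(10)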

The most popular tag is “Leisure trip”.

[('Leisure trip', 417778),
 ('Submitted from a mobile device', 307640),
 ('Couple', 252294),
 ('Stayed 1 night', 193645),
 ('Stayed 2 nights', 133937),
 ('Solo traveler', 108545),
 ('Stayed 3 nights', 95821),
 ('Business trip', 82939),
 ('Group', 65392),
 ('Family with young children', 61015)]
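
An alternative to manual string slicing is to parse each value with `ast.literal_eval` and flatten the result with `Series.explode()`. This is a sketch that assumes each `Tags` value is a valid Python list literal; the sample strings below are invented to mimic that format.

```python
import ast
import pandas as pd

# Toy stand-in for all_reviews['Tags']; assumes each value is a Python-style list literal.
tags_raw = pd.Series([
    "[' Leisure trip ', ' Couple ', ' Stayed 1 night ']",
    "[' Business trip ', ' Solo traveler ', ' Stayed 1 night ']",
    "[' Leisure trip ', ' Group ']",
])

tag_counts = (
    tags_raw.apply(ast.literal_eval)  # string -> list of strings
    .explode()                        # one row per tag
    .str.strip()                      # drop padding spaces around each tag
    .value_counts()
)
print(tag_counts.head(10))
```

Unlike splitting on commas, this approach keeps a tag intact even if the tag text itself contains a comma.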