In Part 1 of this study project I explored the motivation behind, and the data in, the deceptive hotel review corpus provided by researchers at Cornell University.
In Part 2, I will attempt to replicate their approach using statistical and machine learning packages available from The R Project for Statistical Computing.
rm(list=ls())
library(tidyverse)
library(gridExtra) #viewing multiple plots together
# Text Mining Packages
library(tidytext)
library(tokenizers)
library("e1071")
library(tm)
library(RTextTools)
# Graphics Packages
library(ggthemes)
library(moments)
library(ggplot2)
library(scales)
library(knitr) # for dynamic reporting
library(kableExtra) # create a nicely formatted HTML table
library(formattable) # for the color_tile function
publication_theme <- function() {
theme_economist() +
theme(text=element_text(family="Rockwell"),
plot.title = element_text(family="Rockwell", size=12)
)
}
publication.color.background <- '#d6e4ea'
publication.color.orange <- '#f0716f'
publication.color.cyan <- '#3cbfc2'
Project_Dir <- "/Users/amkhosla/Desktop/Statistics/Projects/Hotel_Reviews/code"
setwd(Project_Dir)
First, let’s restore our recently tagged dataset. Here’s a quick refresher:
reviews.df <- as.tibble(read.csv('./files/reviews_pos.csv', stringsAsFactors = FALSE))
names(reviews.df)[names(reviews.df)=="X1"] <- "word_id"
names(reviews.df)[names(reviews.df)=="Truthfulness"] <- "truthfulness"
names(reviews.df)[names(reviews.df)=="Polarity"] <- "polarity"
reviews.df <- reviews.df %>%
filter(word != '#NAME?') %>%
select(-entity)
# Re-tag each word as "lemma/POS" (e.g. "stay/VERB") so part-of-speech information survives as a feature
tag_pos <- function(anIndex) {
return(paste(reviews.df[anIndex,]$lemma, '/', reviews.df[anIndex,]$pos, sep =''))}
reviews.df$word <- sapply(seq_along(reviews.df$word), tag_pos)
kable(reviews.df[1:15,], format = "markdown")
X | doc_id | truthfulness | polarity | word | lemma | pos |
---|---|---|---|---|---|---|
1 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | stay/VERB | stay | VERB |
2 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | schicago/PROPN | schicago | PROPN |
3 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | hilton/PROPN | hilton | PROPN |
4 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | day/NOUN | day | NOUN |
5 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | night/NOUN | night | NOUN |
6 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | conference/NOUN | conference | NOUN |
7 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | easy/ADJ | easy | ADJ |
8 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | amenity/NOUN | amenity | NOUN |
9 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | cleanliness/NOUN | cleanliness | NOUN |
10 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | experience/NOUN | experience | NOUN |
11 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | hilton/PROPN | hilton | PROPN |
12 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | awful/ADJ | awful | ADJ |
13 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | take/VERB | take | VERB |
14 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | time/NOUN | time | NOUN |
15 | negative/MTurk/f1/d_hilton_1 | deceptive | negative | write/VERB | write | VERB |
As part of training our machine learning model, we want to give it information about how words are distributed across our dataset. This is commonly done with Term Frequency/Inverse Document Frequency (Tf°Idf) weighting.
The first step, then, is to calculate the term frequency (Tf), the inverse document frequency (Idf), and their product (Tf°Idf).
Apologies: there’s an occasional Google Chrome bug with LaTeX/equation rendering - view the equations in Safari if they don’t display properly.
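(Tf): TERM FREQUENCY - How often a word occurs within a single document, normalized by the length of that document; this is the tf column that bind_tf_idf computes below. \[
Tf(\omega, doc) = \frac{\#occurrences(\omega, doc)}{\#words(doc)}
\]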
(Idf): INVERSE DOCUMENT FREQUENCY - How common or rare a word is across ALL documents. If a word is common across all documents, it is regarded as non-differentiating, and we want to weight it lower.
Alternatively, Idf can be thought of as the specificity of a term, quantified as an inverse function of the number of documents in which it occurs. The most common form is the natural log of the total number of documents divided by the number of documents containing the term. \[
Idf(\omega) = \ln \left( \frac{\#docs}{\#docs.containing(\omega)} \right)
\]
(Tf°Idf): TERM SPECIFICITY - Tf°Idf, the product of Tf and Idf, weights words according to how “important” they are to the machine learning algorithm ☺. The higher the Tf°Idf score (weight), the rarer and more document-specific the term, and vice versa. The intuition for this measure is this:
If a word appears frequently in a document, then it should be important and we should give that word a high score. But if a word appears in too many other documents, it’s probably not a unique identifier, therefore we should assign a lower score to that word.
Tf°Idf is one of the most popular term-weighting schemes today. Over 80% of text-based recommender systems use Tf°Idf, and variants of it underpin ranking in search engines such as Google’s. Introduced as “term specificity” by Karen Spärck Jones in a 1972 paper, it has worked well as a heuristic, but its theoretical foundations remained troublesome for at least three decades afterward, with many researchers trying to find information-theoretic justifications for it.
Here’s a peek at our dataset’s Tf°Idf values:
reviews.tfidf <- reviews.df %>%
count(doc_id, word, sort = TRUE)
total_words.tfidf <- reviews.tfidf %>%
group_by(doc_id) %>%
summarize(total = sum(n))
reviews.tfidf <- left_join(reviews.tfidf, total_words.tfidf)
reviews.tfidf <- reviews.tfidf %>%
bind_tf_idf(word, doc_id, n)
kable(reviews.tfidf[1:15,], format = "markdown")
doc_id | word | n | total | tf | idf | tf_idf |
---|---|---|---|---|---|---|
negative/MTurk/f4/d_homewood_15 | hotel/NOUN | 11 | 49 | 0.2244898 | 0.2076394 | 0.0466129 |
negative/MTurk/f5/d_intercontinental_17 | hotel/NOUN | 10 | 75 | 0.1333333 | 0.2076394 | 0.0276852 |
negative/Web/f4/t_homewood_19 | credit/NOUN | 10 | 233 | 0.0429185 | 3.8513984 | 0.1652961 |
negative/Web/f4/t_knickerbocker_17 | hotel/NOUN | 10 | 62 | 0.1612903 | 0.2076394 | 0.0334902 |
positive/MTurk/f3/d_omni_5 | omni/PROPN | 9 | 157 | 0.0573248 | 3.2033716 | 0.1836328 |
negative/MTurk/f2/d_affinia_6 | pillow/NOUN | 8 | 112 | 0.0714286 | 2.9234116 | 0.2088151 |
negative/MTurk/f4/d_swissotel_12 | hotel/NOUN | 8 | 73 | 0.1095890 | 0.2076394 | 0.0227550 |
negative/Web/f1/t_james_5 | spoon/NOUN | 8 | 124 | 0.0645161 | 6.6846117 | 0.4312653 |
negative/Web/f1/t_monaco_10 | garage/NOUN | 8 | 130 | 0.0615385 | 5.0751738 | 0.3123184 |
negative/Web/f2/t_affinia_7 | elevator/NOUN | 8 | 179 | 0.0446927 | 2.8559703 | 0.1276411 |
negative/Web/f2/t_ambassador_14 | hotel/NOUN | 8 | 36 | 0.2222222 | 0.2076394 | 0.0461421 |
negative/Web/f2/t_talbott_11 | hotel/NOUN | 8 | 95 | 0.0842105 | 0.2076394 | 0.0174854 |
negative/Web/f2/t_talbott_17 | service/NOUN | 8 | 252 | 0.0317460 | 0.9976364 | 0.0316710 |
negative/Web/f3/t_hyatt_20 | night/NOUN | 8 | 158 | 0.0506329 | 1.1067705 | 0.0560390 |
negative/Web/f4/t_homewood_19 | hotel/NOUN | 8 | 233 | 0.0343348 | 0.2076394 | 0.0071292 |
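To make the weighting concrete, take the first row of the table above and multiply its two pieces out: \[
Tf \cdot Idf = \frac{11}{49} \times 0.2076 \approx 0.047
\]
An Idf that low means “hotel/NOUN” appears in roughly 1,300 of the 1,600 reviews, so even eleven occurrences in one short review earn it only a tiny weight.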
As you can see, words like hotel that are common, and common to a lot of documents, will not be weighted highly for the machine learning algorithm. The tf_idf range for “hotel” is roughly [0.00 - 0.05], whereas words like “bedbug”, “concierge”, or “stinky” fall in roughly [0.40 - 0.50]. Let’s look at the top tf°idf terms for deceptive and truthful reviews.
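If you want to check a range like that yourself, a quick filter against the reviews.tfidf tibble built above does it (a minimal sketch; output not shown):
# Spread of tf_idf weights for a single POS-tagged term across all documents
reviews.tfidf %>%
filter(word == "hotel/NOUN") %>%
summarize(min_tf_idf = min(tf_idf), max_tf_idf = max(tf_idf))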
reviews.tfidf.df <- left_join(reviews.df, reviews.tfidf)
deceptive.tfidf <- reviews.tfidf.df %>%
filter(truthfulness=="deceptive") %>%
filter(pos!='PROPN')
deceptive.tfidf <- distinct(deceptive.tfidf, word, .keep_all = TRUE)
deceptive.tfidf <- deceptive.tfidf[order(-deceptive.tfidf$tf_idf),]
kable(deceptive.tfidf[1:19,c(2,3,4,5,6,9,10,11)], format = "markdown")
doc_id | truthfulness | polarity | word | lemma | total | tf | idf |
---|---|---|---|---|---|---|---|
positive/MTurk/f4/d_swissotel_9 | deceptive | positive | beat/NOUN | beat | 11 | 0.0909091 | 7.377759 |
positive/MTurk/f2/d_talbott_4 | deceptive | positive | ritzy/ADJ | ritzy | 24 | 0.0833333 | 7.377759 |
positive/MTurk/f3/d_conrad_8 | deceptive | positive | brightly/ADV | brightly | 12 | 0.0833333 | 7.377759 |
positive/MTurk/f3/d_conrad_8 | deceptive | positive | garishness/NOUN | garishness | 12 | 0.0833333 | 7.377759 |
positive/MTurk/f2/d_talbott_7 | deceptive | positive | bedbug/ADV | bedbug | 13 | 0.0769231 | 7.377759 |
positive/MTurk/f1/d_monaco_12 | deceptive | positive | tidy/ADJ | tidy | 10 | 0.1000000 | 5.431849 |
positive/MTurk/f4/d_sheraton_14 | deceptive | positive | -bobby/INTJ | -bobby | 14 | 0.0714286 | 7.377759 |
positive/MTurk/f2/d_talbott_10 | deceptive | positive | talbott/NOUN | talbott | 59 | 0.0677966 | 7.377759 |
positive/MTurk/f5/d_amalfi_18 | deceptive | positive | complimentartry/NOUN | complimentartry | 15 | 0.0666667 | 7.377759 |
positive/MTurk/f2/d_talbott_7 | deceptive | positive | above/ADP | above | 13 | 0.0769231 | 6.279147 |
positive/MTurk/f1/d_james_9 | deceptive | positive | go/VERB | go | 11 | 0.0909091 | 5.298317 |
negative/MTurk/f1/d_james_19 | deceptive | negative | deceive/VERB | deceive | 12 | 0.0833333 | 5.768321 |
negative/MTurk/f1/d_sofitel_11 | deceptive | negative | garish/ADJ | garish | 14 | 0.0714286 | 6.684612 |
negative/MTurk/f1/d_monaco_7 | deceptive | negative | fin/NOUN | fin | 16 | 0.0625000 | 7.377759 |
positive/MTurk/f2/d_affinia_8 | deceptive | positive | nearby/ADP | nearby | 16 | 0.0625000 | 7.377759 |
positive/MTurk/f4/d_sheraton_6 | deceptive | positive | coffeepot/NOUN | coffeepot | 16 | 0.0625000 | 7.377759 |
positive/MTurk/f4/d_swissotel_20 | deceptive | positive | clientel/NOUN | clientel | 16 | 0.0625000 | 7.377759 |
negative/MTurk/f5/d_amalfi_6 | deceptive | negative | planner/NOUN | planner | 33 | 0.0606061 | 7.377759 |
negative/MTurk/f5/d_allegro_17 | deceptive | negative | prolem/NOUN | prolem | 17 | 0.0588235 | 7.377759 |
Deceptive writers use the word deceive in deceptive reviews.
truthful.tfidf <- reviews.tfidf.df %>%
filter(truthfulness=="truthful") %>%
filter(pos!='PROPN')
truthful.tfidf <- distinct(truthful.tfidf, word, .keep_all = TRUE)
truthful.tfidf <- truthful.tfidf[order(-truthful.tfidf$tf_idf),]
kable(truthful.tfidf[2:19,c(2,3,4,5,6,9,10,11)], format = "markdown")
doc_id | truthfulness | polarity | word | lemma | total | tf | idf |
---|---|---|---|---|---|---|---|
negative/Web/f3/t_fairmont_9 | truthful | negative | sully/ADV | sully | 12 | 0.0833333 | 7.377759 |
negative/Web/f1/t_james_14 | truthful | negative | yeah/INTJ | yeah | 29 | 0.1034483 | 5.180534 |
negative/Web/f2/t_ambassador_13 | truthful | negative | carpeting/NOUN | carpeting | 24 | 0.0833333 | 5.991465 |
negative/Web/f3/t_fairmont_9 | truthful | negative | lousy/ADJ | lousy | 12 | 0.0833333 | 5.991465 |
negative/Web/f1/t_sofitel_9 | truthful | negative | princess/NOUN | princess | 15 | 0.0666667 | 7.377759 |
negative/Web/f3/t_omni_4 | truthful | negative | stinky/ADJ | stinky | 15 | 0.0666667 | 7.377759 |
negative/Web/f3/t_hyatt_9 | truthful | negative | strike/NOUN | strike | 26 | 0.0769231 | 6.279147 |
negative/Web/f3/t_fairmont_9 | truthful | negative | boy/NOUN | boy | 12 | 0.0833333 | 5.768321 |
positive/TripAdvisor/f2/t_hardrock_5 | truthful | positive | lime/NOUN | lime | 16 | 0.0625000 | 7.377759 |
positive/TripAdvisor/f2/t_hardrock_5 | truthful | positive | punch/NOUN | punch | 16 | 0.0625000 | 7.377759 |
positive/TripAdvisor/f2/t_hardrock_5 | truthful | positive | extradionary/ADJ | extradionary | 16 | 0.0625000 | 7.377759 |
negative/Web/f1/t_sofitel_13 | truthful | negative | blanket/NOUN | blanket | 31 | 0.0967742 | 4.605170 |
negative/Web/f3/t_omni_4 | truthful | negative | hurt/VERB | hurt | 15 | 0.0666667 | 6.684612 |
negative/Web/f1/t_sofitel_10 | truthful | negative | lexus/NOUN | lexus | 17 | 0.0588235 | 7.377759 |
negative/Web/f1/t_sofitel_10 | truthful | negative | smash/VERB | smash | 17 | 0.0588235 | 7.377759 |
positive/TripAdvisor/f5/t_intercontinental_2 | truthful | positive | doller/NOUN | doller | 17 | 0.0588235 | 7.377759 |
negative/Web/f1/t_james_5 | truthful | negative | spoon/NOUN | spoon | 124 | 0.0645161 | 6.684612 |
positive/TripAdvisor/f1/t_hilton_5 | truthful | positive | place.amazing/NOUN | place.amazing | 18 | 0.0555556 | 7.377759 |
Truthful people don’t say “smelly”; they say “stinky”.
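That claim is easy to spot-check against the tagged dataset; counting the two lemmas by review type is one way (a quick sketch; output not shown):
# How often do "stinky" and "smelly" appear in truthful vs. deceptive reviews?
reviews.df %>%
filter(lemma %in% c("stinky", "smelly")) %>%
count(truthfulness, lemma)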
Alright! Most of the heavy database lifting is done. Let’s start teaching the machine.
In a classic machine learning process we randomly divide the data into a training set and a testing set - here 80% of the 1600 files are randomly selected for training, and the other 20% are held out to test the trained machine’s accuracy. 80% of a small 1600-file corpus leaves an even smaller training set of 1280 files. You don’t go to war with the army you wished for; you go to war with the army you have. Poor justification, I know, LOL - it’s stinky.
This is my first pass at training. Hopefully we will beat average human performance, which is no better than chance - about 50% accuracy. Let’s start with a naive approach, using the filtered, lemmatized words we’ve processed so far, each tagged with its part of speech.
First let’s build a document/text database consisting of the existing filtered text. Then we build a document-term matrix, where each row is a document in the training set and each column is a term, with each cell holding that term’s weight in that document (a simple count here - as the code comment below notes, weighting by Tf°Idf actually hurt accuracy).
# input_data_X is the input training data
# output_data_Y is the output prediction (i.e. the classification result)
gather_review_string <- function(aDocID) { paste(reviews.tfidf.df$word[reviews.tfidf.df$doc_id == aDocID], collapse = " ") }
reviews.docs <- distinct(reviews.tfidf.df, doc_id, .keep_all = TRUE)
reviews.docs$review <- sapply(reviews.docs$doc_id, gather_review_string)
reviews.docs <- select(reviews.docs, doc_id, truthfulness, review)
write_csv(reviews.docs, './files/filtered_reviews.csv')
full_range <- 1:length(reviews.docs$doc_id)
training_range <- sample(full_range, round(.8*length(full_range)))
test_range <- setdiff(full_range,training_range)
# Weighting by Tf°Idf drops %accuracy by 10!
doc_term_matrix <- create_matrix(reviews.docs[training_range,]$review) #, weighting=weightTfIdf)
input_data_X <- doc_term_matrix
output_data_Y <- reviews.docs[training_range,]$truthfulness
We then create the SVM model and ask it to train itself (time for a cup of coffee).
# We can run prediction on the model to determine execution time
# Set up an SVM model
# Configure the training data
container <- create_container(input_data_X, output_data_Y, trainSize=1:length(training_range), virgin=FALSE)
# train a SVM Model
model <- train_model(container, "SVM", kernel="linear", cost=1)
Once the model is trained, we pull out the test set and use the model to predict truthfulness - after all, we set out to use the machine as King Solomon’s sword. Typically, the first step in analyzing and improving performance is a “confusion matrix”, which tells us how well (or poorly) we are doing on a 2x2 binary classification problem.
## If create_matrix errors out:
# trace("create_matrix", edit=T)
# line 42 of the function has a misspelling of the word "acronym".
# Change the "A" to an "a" and hit "Save" - it should work fine after that.
predictionData <- reviews.docs[test_range,]$review
# create a prediction document term matrix
predMatrix <- create_matrix(predictionData, originalMatrix=doc_term_matrix)
# create the corresponding container
predSize <- length(predictionData)
predictionContainer <- create_container(predMatrix, labels=rep(0,predSize), testSize=1:predSize, virgin=FALSE)
# predict
results <- classify_model(predictionContainer, model)
predicted_results <- as.character(results$SVM_LABEL)
actual_results <- reviews.docs[test_range,]$truthfulness
confusion.df <- as.tibble(data.frame(predicted_results, actual_results, stringsAsFactors = FALSE))
Prediction.Incorrect <- 100* sum(confusion.df$predicted_results != confusion.df$actual_results)/length(test_range)
Prediction.Correct <- 100* sum(confusion.df$predicted_results == confusion.df$actual_results)/length(test_range)
classification.truth_truth <- sum((confusion.df$predicted_results == 'truthful') &
(confusion.df$actual_results == 'truthful'))
classification.truth_false <- sum((confusion.df$predicted_results == 'truthful') &
(confusion.df$actual_results == 'deceptive'))
classification.false_truth <- sum((confusion.df$predicted_results == 'deceptive') &
(confusion.df$actual_results == 'truthful'))
classification.false_false <- sum((confusion.df$predicted_results == 'deceptive') &
(confusion.df$actual_results == 'deceptive'))
Actual.type <- c('Deceptive', 'Truthful')
Prediction.Deceptive <- c(classification.false_false, classification.false_truth)
Prediction.Truthful <- c(classification.truth_false, classification.truth_truth)
conf.mat <- data.frame(Actual.type, Prediction.Deceptive, Prediction.Truthful)
#print(paste("Accuracy is", Prediction.Correct, "%"))
kable(conf.mat, format = "markdown")
Actual.type | Prediction.Deceptive | Prediction.Truthful |
---|---|---|
Deceptive | 128 | 21 |
Truthful | 28 | 143 |
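Prediction.Correct above already holds the overall accuracy; for a slightly fuller picture, precision and recall for the “deceptive” class can be read straight off the same counts (a minimal sketch using the classification.* variables defined above):
# Precision: of the reviews we flagged as deceptive, how many really were deceptive?
precision.deceptive <- classification.false_false / (classification.false_false + classification.false_truth)
# Recall: of the truly deceptive reviews, how many did we catch?
recall.deceptive <- classification.false_false / (classification.false_false + classification.truth_false)
print(paste("Deceptive precision:", round(100 * precision.deceptive, 1), "%"))
print(paste("Deceptive recall:", round(100 * recall.deceptive, 1), "%"))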
Ok. Well, 85% accuracy won’t get me a job on Wall St. But it’s a naive bag-of-words/parts-of-speech approach, and with some additional feature engineering I know I can get to 95%.
Oh, by the way - even the naive machine did MUCH BETTER than human intelligence.
Let’s improve and tune the model. King Solomon’s sword needs to be sharper. The authors of the original paper noted that bigrams worked better than unigrams (a sketch of that tweak is below). And what about a different type of learning model? Although our training set is small, as is our dictionary, I’d like to try a neural net to see what happens with overtrained models…
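As a teaser for the bigram idea, the only change needed to the pipeline above is the ngramLength argument of create_matrix() - an RTextTools option I haven’t exercised on this corpus, so treat the chunk below as a sketch rather than a verified result:
# Sketch: same SVM pipeline as before, but with bigram features.
# NOTE: ngramLength is an RTextTools::create_matrix() option; how it interacts
# with the lemma/POS tokens used here is untested.
bigram_dtm <- create_matrix(reviews.docs[training_range,]$review, ngramLength = 2)
bigram_container <- create_container(bigram_dtm, output_data_Y, trainSize = 1:length(training_range), virgin = FALSE)
bigram_model <- train_model(bigram_container, "SVM", kernel = "linear", cost = 1)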
What if we use a neural net instead?
Let’s try that: