by Ashok Khosla, Mendocino County, California

Project background:

In the Internet age, unethical competitors can hire humans from various parts of the globe to write product reviews that can decimate a local business. As an example of the kind of behavior that goes on, see this Wall Street Journal article.

This study project, evaluating deceptive reviews in the hotel industry, is a repeat of work performed at Cornell University by Myle Ott, Yejin Choi, Claire Cardie and Jeffrey T. Hancock. Their paper Finding Deceptive Opinion Spam by Any Stretch of the Imagination details their methodology.

As part of their work, they collected a corpus of 1600 reviews, of which 800 were deceptive and 800 were truthful (i.e. written by an actual hotel guest). The deceptive reviews were written under contract by human workers, who were given a minute to write each review, had to live in the US, and so on, and were told to write either a positive or a negative (harmful) review. Crossing negative/positive polarity with truthful/deceptive authorship produces 4 sets of 400 reviews each:
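| | Positive polarity | Negative polarity |
|---|---|---|
| Truthful | 400 reviews | 400 reviews |
| Deceptive | 400 reviews | 400 reviews |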

This project will use a Support Vector Machine (SVM), which is a fancy way of saying this project will calculate a place in the data where, like King Solomon’s sword, it can cleave truth from deception. In mathematical terms, the SVM will determine a separating hyperplane through a hyperspace defined by dimensions such as word counts and parts-of-speech counts (nouns, adjectives, etc.).
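To make that concrete, here is a minimal sketch of a linear SVM cleaving two classes in a two-dimensional feature space. The numbers are made up, and the e1071 package isn’t used elsewhere in this project; it’s just an illustration of the idea.

# A toy linear SVM: two noisy clusters, one separating hyperplane (a line, in 2-D)
library(e1071)
set.seed(42)
toy <- data.frame(
    noun_count = c(rnorm(50, mean = 20, sd = 4), rnorm(50, mean = 14, sd = 4)),
    adj_count  = c(rnorm(50, mean =  6, sd = 2), rnorm(50, mean = 10, sd = 2)),
    label      = factor(rep(c("truthful", "deceptive"), each = 50))
)
toy.svm <- svm(label ~ ., data = toy, kernel = "linear")
table(predicted = predict(toy.svm, toy), actual = toy$label)  # training confusion matrix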

The project is written in one of my favorite programming languages, the statistical software language known as R. If you want to follow along, the R code is included throughout.

Exploratory Data Analysis:

The first part of every data science project is to read in the data and “explore” it, preferably using graphics to gain insight.

Initialize and load libraries

################################################################################
### Initialize environment
################################################################################
rm(list=ls())
library(tidyverse)
library(gridExtra) #viewing multiple plots together

# Text Mining Packages
library(tidytext)
library(tokenizers)
library(wordcloud2) #creative visualizations
library(spacyr)
spacy_initialize()

# Graphics Packages
library(ggthemes)
library(moments)
library(ggplot2)
library(scales)
library(knitr) # for dynamic reporting
library(kableExtra) # create a nicely formatted HTML table
library(formattable) # for the color_tile function

publication_theme <- function() {
    theme_economist() +
    theme(text=element_text(family="Rockwell"),
          plot.title = element_text(family="Rockwell", size=12)
    )
}
publication.color.background <- '#d6e4ea'
publication.color.orange <- '#f0716f'
publication.color.cyan <- '#3cbfc2'

Project_Dir <- "/Users/amkhosla/Desktop/Statistics/Projects/Hotel_Reviews/code"
setwd(Project_Dir)

Read and label the hotel reviews

Once all the programming libraries have been loaded and the environment initialized, we need to read in the reviews (one per file) and label them as truthful or deceptive. Here’s a look at some randomly sampled reviews from the dataset of 1600 reviews. I’ve hidden whether they are truthful or deceptive. You guess.

training_path = '../input/op_spam_train'
doc_id <- list.files(path=training_path, pattern="*.txt", full.names=TRUE, recursive=TRUE)
if (length(doc_id) != 1600) { 
   stop(paste("Couldn't locate input training files at:", getwd(), training_path))
}
training.df <- tibble(doc_id = doc_id)
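The op_spam directory layout encodes each review’s labels in its file path, so the Truthfulness and Polarity columns can be derived from the path, and the review text read from the file. Here’s a sketch, assuming the standard layout (folder names containing “truthful”/“deceptive” and “positive”/“negative”); random_records, used by the kable calls below, samples the six reviews displayed.

# Derive labels from the file paths (assumes the standard op_spam layout)
training.df <- training.df %>%
    mutate(Truthfulness  = ifelse(grepl("truthful", doc_id), "truthful", "deceptive"),
           Polarity      = ifelse(grepl("positive", doc_id), "positive", "negative"),
           Hotel_Reviews = sapply(doc_id,
               function(f) paste(readLines(f, warn = FALSE), collapse = " ")))

set.seed(1234)                                  # arbitrary seed, for reproducibility
random_records <- sample(nrow(training.df), 6)  # the six sample reviews shown
# Show polarity and review text only; truthfulness stays hidden for now
kable(training.df[random_records, c(3, 4)], format = "markdown")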
| Polarity | Hotel_Reviews |
|----------|---------------|
| positive | We’ve just returned from a two night stay at the Affinia Chicago and our visit was delightful. Thanks to tripadvisor reviews we were prepared for the major construction work on the facade and entrance..it was a mess. And the hotel restaurant and bar remain incomplete and unavailable. Even parts of the lobby had “wet paint” stickers. All these things could have made for a disappointing stay. Not so!! Although we arrived very early in the day our suite was ready along with a most warm and gracious greeting from the front desk staff. After just a few minutes we felt like we were “at home.” The suite on the 25th floor, with a fab view along a wall of windows, was sparkling fresh and clean. The bathroom lighting was outstanding and led to my bride of nearly 50 years saying that it made her look younger…she was right! Everything in and about the room was “like-new” and very stylish. As a visitor we appreciated the very convenient location just a few steps from the Magnificient Mile and all of its ambiance. Thanks to a dedicated and gracious staff, a fantastic location, and chic decorating elements, this hotel can and will be our future hotel of choice in Chicago. The stay was short..but sweet. |
| negative | My family has dubbed the Omni in Chicago “Fawlty Towers” because there always seems to be something wrong. This time our room was right above the kitchen and the odor of cooking grease wafted through our bedroom all night long. It smelled like a greasy spoon restaurant and it permeated all of our clothing. The bedroom window was bolted shut, so we couldn’t get any fresh air into the room. With the amount that the Omni is charging for a room, they should have a much better ventilation system. |
| negative | This hotel was very overpriced for what you get from staying here. The Amalfi Hotel Chicago advertises itself as a “luxury hotel” located in downtown Chicago. Since it is in downtown Chicago, that means that I had street noises keeping me up most of the night. My view also suffered. They had nicer rooms available with better views, but the markups to stay in them was outrageous! The room itself was decent, with clean linens and nice air conditioning, but I can get that in just about any hotel these days. There wasn’t anything particularly luxurious about the hotel to make it a stand-out for me, but for the price I paid for the room, I certainly had that expectation. It was a let down. |
| negative | The heating in our room didn’t work and wasn’t repaired depite numerous requests, the rooms were very small, the litter in the bin remained in the bin for the whole stay, the cleaner made the bed but nothing else, when we returned home my credit card statement showed that the hotel had charged me for the 5 night stay despite my having pre-paid through a travel agent, it took the hotel 2 months to re-credit my card without an apology or offer to pay for my numerous trans-Atlantic phone calls to them. |
| negative | The Amalfi Hotel in Chicago advertises itself as family friendly, yet when I arrived with my children it was anything BUT family friendly. The staff seemed miffed and stuck up, and I had to wait for the personal check in. I was too worried that my kids would ruin the obviously really expensive furniture while I waited, and other guests (all single or couples without children) seemed annoyed every time I passed with my kids while I was trying to keep them entertained before we checked in. This might not be the best place to stay if you have a kid or two, even though it is advertised as such. I give you a great big Thumbs Down for false advertising, Amalfi. |
| negative | My wife and I stayed at the Hyatt Regency in Chicago a couple of weeks ago and will most defiantly not go back. The customer service is horrible and it is well apparent that most of the workers did not want to be there either. When we arrived at the hotel we were told us that the room was still being cleaned and had to wait almost a hour before they gave us the key. When we got into the room the bed was not even properly made. The cafe is always crowded with people and it take two hours to get food. We ended up going home two days before we planned. |



Check your guesses below. How did you do? Better than 50% accuracy?

kable(training.df[random_records,c(2,3,4)], format = "markdown")
| Truthfulness | Polarity | Hotel_Reviews |
|--------------|----------|---------------|
| truthful | positive | We’ve just returned from a two night stay at the Affinia Chicago and our visit was delightful. Thanks to tripadvisor reviews we were prepared for the major construction work on the facade and entrance..it was a mess. And the hotel restaurant and bar remain incomplete and unavailable. Even parts of the lobby had “wet paint” stickers. All these things could have made for a disappointing stay. Not so!! Although we arrived very early in the day our suite was ready along with a most warm and gracious greeting from the front desk staff. After just a few minutes we felt like we were “at home.” The suite on the 25th floor, with a fab view along a wall of windows, was sparkling fresh and clean. The bathroom lighting was outstanding and led to my bride of nearly 50 years saying that it made her look younger…she was right! Everything in and about the room was “like-new” and very stylish. As a visitor we appreciated the very convenient location just a few steps from the Magnificient Mile and all of its ambiance. Thanks to a dedicated and gracious staff, a fantastic location, and chic decorating elements, this hotel can and will be our future hotel of choice in Chicago. The stay was short..but sweet. |
| truthful | negative | My family has dubbed the Omni in Chicago “Fawlty Towers” because there always seems to be something wrong. This time our room was right above the kitchen and the odor of cooking grease wafted through our bedroom all night long. It smelled like a greasy spoon restaurant and it permeated all of our clothing. The bedroom window was bolted shut, so we couldn’t get any fresh air into the room. With the amount that the Omni is charging for a room, they should have a much better ventilation system. |
| deceptive | negative | This hotel was very overpriced for what you get from staying here. The Amalfi Hotel Chicago advertises itself as a “luxury hotel” located in downtown Chicago. Since it is in downtown Chicago, that means that I had street noises keeping me up most of the night. My view also suffered. They had nicer rooms available with better views, but the markups to stay in them was outrageous! The room itself was decent, with clean linens and nice air conditioning, but I can get that in just about any hotel these days. There wasn’t anything particularly luxurious about the hotel to make it a stand-out for me, but for the price I paid for the room, I certainly had that expectation. It was a let down. |
| truthful | negative | The heating in our room didn’t work and wasn’t repaired depite numerous requests, the rooms were very small, the litter in the bin remained in the bin for the whole stay, the cleaner made the bed but nothing else, when we returned home my credit card statement showed that the hotel had charged me for the 5 night stay despite my having pre-paid through a travel agent, it took the hotel 2 months to re-credit my card without an apology or offer to pay for my numerous trans-Atlantic phone calls to them. |
| deceptive | negative | The Amalfi Hotel in Chicago advertises itself as family friendly, yet when I arrived with my children it was anything BUT family friendly. The staff seemed miffed and stuck up, and I had to wait for the personal check in. I was too worried that my kids would ruin the obviously really expensive furniture while I waited, and other guests (all single or couples without children) seemed annoyed every time I passed with my kids while I was trying to keep them entertained before we checked in. This might not be the best place to stay if you have a kid or two, even though it is advertised as such. I give you a great big Thumbs Down for false advertising, Amalfi. |
| deceptive | negative | My wife and I stayed at the Hyatt Regency in Chicago a couple of weeks ago and will most defiantly not go back. The customer service is horrible and it is well apparent that most of the workers did not want to be there either. When we arrived at the hotel we were told us that the room was still being cleaned and had to wait almost a hour before they gave us the key. When we got into the room the bed was not even properly made. The cafe is always crowded with people and it take two hours to get food. We ended up going home two days before we planned. |



Now comes the first analysis of the text. The first thing to examine in any textual analysis is the so-called bag-of-words model: we break each review into a set of individual words, and then we analyze those single words, ignoring their order.

cleanup_review <- function(aReviewStr) {
    # Replace everything except letters, digits, and spaces
    the_cleansed_string <- gsub("[^a-zA-Z0-9 ]", " ", aReviewStr)
    # Lowercase, tokenize into words, and drop English stop words
    theTokens <- tokenize_words(the_cleansed_string, stopwords = stopwords::stopwords("en"))[[1]]
    # Keep only tokens longer than three characters
    theLongerTokens <- theTokens[nchar(theTokens) > 3]
    return(paste(theLongerTokens, collapse = " "))
}
training.df$Filtered_Reviews <- sapply(training.df$Hotel_Reviews, cleanup_review)

tokenized_unigram.df <- training.df %>% 
    unnest_tokens(word, Filtered_Reviews) %>%
    distinct() 

tokenized_bigram.df <- training.df %>% 
    unnest_tokens(ngram, Filtered_Reviews, token = "ngrams", n = 2, collapse = FALSE)
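
Before moving on, a quick check on a made-up sentence shows what cleanup_review keeps:

# Made-up input: stop words, punctuation, and tokens of three characters
# or fewer should all be dropped.
cleanup_review("The staff were friendly and the pool was clean.")
# expected: "staff friendly pool clean"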

Preliminary distributions

Zipf’s law says that a word’s frequency is inversely proportional to its frequency rank: the most common words in a language (typically the top 2000-3000) occur very often, and frequency drops off rapidly as you move down the ranks. (A related observation, Zipf’s law of abbreviation, says that the more common a word is, the shorter it tends to be.) As a quick sanity check of the data, let’s see if the reviews have a “Zipfian” distribution.

word_frequency <- tokenized_unigram.df %>%
    dplyr::count(word, sort = TRUE) 
freq_range = 1:1000
barplot(word_frequency$n[freq_range])

Looks Zipfian!
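
For a slightly more quantitative check (my addition, not part of the original methodology): on a log-log scale, a Zipfian distribution is roughly a straight line with slope near -1.

# Regress log frequency on log rank over the plotted range
rank_freq.df <- data.frame(rank = seq_along(word_frequency$n),
                           freq = word_frequency$n)
zipf.fit <- lm(log10(freq) ~ log10(rank), data = rank_freq.df[freq_range, ])
coef(zipf.fit)  # a slope in the neighborhood of -1 supports Zipf's law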

Sentiment analysis

To keep training balanced, half of the reviews are “negative polarity,” meaning they dislike (or want to harm) the hotel, and half are “positive polarity,” meaning they like (or want to promote) the hotel. We can perform a “sentiment” analysis (determining whether a writer’s attitude is positive, negative, or neutral) on the reviews to see if the data seems reasonably labeled for thumbs-up/down polarity. Here we use the “afinn” sentiment dictionary, which scores words from -5 (harshest) to +5 (kindest).
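For a sense of what the lexicon contains, here’s a quick peek at a few arbitrarily chosen words. (Note: the sentiment column is named score in older tidytext releases and value in newer ones; this post assumes the older name.)

# Look up a few words in the AFINN lexicon
get_sentiments("afinn") %>%
    filter(word %in% c("amazing", "awful", "horrible"))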

sentiments.df <- tokenized_unigram.df %>%
    inner_join(get_sentiments("afinn"), by = "word") %>%
    group_by(doc_id) %>%
    mutate(sentiment=sum(score)) %>%     # total sentiment score per review
    select(-Hotel_Reviews, -word, -score) %>%
    filter(sentiment > -20)              # keep reviews scoring above -20 (drops extreme negatives)
sentiments.df <- sentiments.df[!duplicated(sentiments.df),]  # one row per review

deceptive.sentiments <- filter(sentiments.df, Truthfulness=="deceptive")
truthful.sentiments <- filter(sentiments.df, Truthfulness=="truthful")
deceptive.negative.df <- deceptive.sentiments %>% filter(Polarity=="negative")
deceptive.positive.df <- deceptive.sentiments %>% filter(Polarity=="positive")
truthful.negative.df <- truthful.sentiments %>% filter(Polarity=="negative")
truthful.positive.df <- truthful.sentiments %>% filter(Polarity=="positive")

deceptive.negative.mean <- mean(deceptive.negative.df$sentiment)
deceptive.negative.sd <- sd(deceptive.negative.df$sentiment)
truthful.negative.mean <- mean(truthful.negative.df$sentiment)
truthful.negative.sd <- sd(truthful.negative.df$sentiment)

deceptive.positive.mean <- mean(deceptive.positive.df$sentiment)
deceptive.positive.sd <- sd(deceptive.positive.df$sentiment)
truthful.positive.mean <- mean(truthful.positive.df$sentiment)
truthful.positive.sd <- sd(truthful.positive.df$sentiment)

truthfulness <- c("deceptive", "truthful")
negative.mean <- c(deceptive.negative.mean, truthful.negative.mean)
negative.stdev <- c(deceptive.negative.sd, truthful.negative.sd)
positive.mean <- c(deceptive.positive.mean, truthful.positive.mean)
positive.stdev <- c(deceptive.positive.sd, truthful.positive.sd)
summary.table = data.frame(truthfulness, negative.mean, negative.stdev, positive.mean, positive.stdev)
kable(summary.table, format = "markdown")
| truthfulness | negative.mean | negative.stdev | positive.mean | positive.stdev |
|--------------|---------------|----------------|---------------|----------------|
| deceptive    | 1.637500      | 6.476183       | 14.755        | 7.270635       |
| truthful     | 2.598485      | 6.651384       | 13.550        | 7.318908       |
ggplot(data=deceptive.sentiments, aes(sentiment, fill=Polarity)) + 
       geom_histogram(binwidth = 1,position="dodge") +
       xlab("Sentiment Level") +
       labs(title = "Sentiment Levels for Deceitful Negative and Positive Reviews") +
       publication_theme()

ggplot(data=truthful.sentiments, aes(sentiment, fill=Polarity)) + 
       geom_histogram(binwidth = 1,position="dodge") +
       xlab("Sentiment Level") +
       labs(title = "Sentiment Levels for Truthful Negative and Positive Reviews") +
       publication_theme()

Deceptive writers are harsher than truthful writers when writing negative reviews and more positive than truthful writers when writing positive reviews. No surprise there LOL 😂 From the table and graphics you can also see that the gap between mean negative and mean positive sentiment is wider for deceptive reviews than for truthful ones. It might be useful to add sentiment analysis as a feature in our machine learning hyperspace.
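As a quick follow-up (my addition, not part of the original analysis), Welch t-tests can check whether those mean differences between deceptive and truthful reviews are statistically reliable:

# Compare mean sentiment of deceptive vs. truthful reviews, per polarity
t.test(deceptive.negative.df$sentiment, truthful.negative.df$sentiment)
t.test(deceptive.positive.df$sentiment, truthful.positive.df$sentiment)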

Deceptive words vs truthful words

Let’s plot the probability that words show up in reviews, split by whether they appear in truthful reviews, deceptive reviews, or both.

Words that show up ONLY in deceptive reviews or ONLY in truthful reviews

word_distribution <- as_tibble(count(tokenized_unigram.df, word, Truthfulness, sort = TRUE))
tidy_word_distribution <- spread(word_distribution, Truthfulness, n) 
tidy_word_distribution <- tidy_word_distribution %>% 
    replace_na(list(deceptive = 0, truthful = 0)) %>% 
    mutate(deceptive_proportion = deceptive / sum(deceptive))  %>%
    mutate(truth_proportion = truthful / sum(truthful))

deceptive_words_only <- tidy_word_distribution %>%
    filter(truth_proportion == 0) %>%
    filter(deceptive > 2) %>%
    mutate(word = reorder(word, deceptive))   #Reorders word into a factor, based on n....
deceptive.barplot <- ggplot(deceptive_words_only, aes(word, deceptive)) +
        geom_col() +
        xlab(NULL) +
        coord_flip() +
        labs(title = "Words that appear ONLY in deceptive reviews") +
        publication_theme()
deceptive.barplot

true_words_only <- tidy_word_distribution %>%
    filter(deceptive_proportion == 0) %>%
    filter(truthful > 2) %>%
    mutate(word = reorder(word, truthful)) #Reorders word into a factor, based on n....
true.barplot <- ggplot(true_words_only, aes(word, truthful)) +
        geom_col() +
        xlab(NULL) +
        coord_flip() +
        labs(title = "Words that appear ONLY in truthful reviews") +
        publication_theme()
true.barplot

There are words that show up only in deceptive reviews, and similarly there are words that show up only in truthful reviews. As you can observe, the deceptive reviews are slightly richer in superlatives, and the truthful reviews seem richer in nouns. This suggests that parts-of-speech tagging might be a useful part of our approach.

Words that show up in BOTH truthful and deceptive reviews

How about words that show up in both truthful and deceptive reviews? It looks like the same pattern: nouns show up in the top-left truthful section, adjectives in the bottom-right deceptive section. It looks like people who write deceptive reviews like the words luxury, accomodations, amazing and smell.

Deceptive reviews smell bad! And they are “amazing”. LOL 😂

# Remove words that occur only on axes from the plot
plot_word_distribution <- tidy_word_distribution %>%
    filter((truth_proportion > 0) & (deceptive_proportion > 0))

word_distribution.plot <- ggplot(plot_word_distribution, 
       aes(x = deceptive_proportion, y = truth_proportion )) +
    geom_abline(color = "gray40", lty = 2) +
    geom_jitter(color = publication.color.orange, alpha = 0.3, size = 2., width = 0.2, height = 0.2) +
    geom_text(aes(label = word), check_overlap = TRUE, size = 3, fontface = "bold", vjust = 1.5) +
    scale_x_log10(labels = percent_format()) +
    scale_y_log10(labels = percent_format()) +
    theme(legend.position="none") +
    labs(x = "% Deceptive (log scale)", y = "% Truthful (log scale)") +
    publication_theme()
word_distribution.plot

Now would be a good time to quote the authors of the original study:

However, that deceptive opinions contain more superlatives is not unexpected, since deceptive writing (but not necessarily imaginative writing in general) often contains exaggerated language.

With some insight into the distribution and type of words used in spam reviews, let’s build a machine learning algorithm. The original authors achieved nearly 90% accuracy with their SVM. Again, a quote is in order:

Notably, a combined classifier with both n-gram and psychological deception features achieves nearly 90% cross-validated accuracy on this task. In contrast, we find deceptive opinion spam detection to be well beyond the capabilities of most human judges, who perform roughly at-chance—a finding that is consistent with decades of traditional deception detection research (Bond and DePaulo, 2006).

So… astonishingly, human judges detect deception with roughly 50% accuracy (no better than chance). This insight explains much of our current world to me. 😉

In the case of detecting deceptive hotel reviews, I suspect that artificial intelligence can exceed human intelligence.

Feature Engineering

It will be helpful to extract more “2nd order” information from the words we have at our disposal. We’ve already discussed word frequencies by truthful/deceptive and sentiment analysis. The authors of the study used two other sets of features for their machine-learning “hyperspace” - parts-of-speech tagging (adjectives, nouns, etc.) and bigrams (pairs of adjacent words).

You might be surprised to learn that identifying a word’s part of speech is a fairly classic AI technology. Even very simple approaches routinely get 90% accuracy. The other thing the authors discovered was helpful was using bigrams/n-grams instead of single words. This means they looked at paired words. For example, the previous sentence, instead of being broken down into {this, means, they, looked, at, paired, words}, breaks down into two-word combinations: {this means, means they, they looked, looked at, at paired, paired words}.
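
A minimal tidytext sketch of that breakdown (the tibble and column names are mine):

# Bigram tokenization of the example sentence
example.df <- tibble(text = "this means they looked at paired words")
example.df %>%
    unnest_tokens(bigram, text, token = "ngrams", n = 2)
# -> "this means", "means they", "they looked", "looked at",
#    "at paired", "paired words"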

Let’s perform these bits of “feature engineering” - creating additional features from the text to improve our machine learning success. First, the bigger challenge: tagging the words with their parts of speech and seeing what that tells us.

Testing spacyr - an open-source Parts-of-Speech (POS) tagger

Several open-source parts-of-speech (POS) taggers are available. I’m using spacyr. Trained using a neural net, this POS tagger has a 92% accuracy, and is a (relatively) fast classifier - important considering the amount of data being mined. As an example of its use, let’s parse a joke:

A Texan, a Russian and a New Yorker go to a restaurant in London.
The waiter tells them, “Excuse me – if you were going to order the steak, I’m afraid there’s a shortage due to the mad cow disease.”

The Texan says, “What’s a shortage?”
The Russian says, “What’s a steak?”
The New Yorker says, “What’s ‘excuse me’?”

txt <- c(line1 = "A Texan, a Russian and a New Yorker go to a restaurant in London.",
         line2 = "The waiter tells them, Excuse me -- if you were going to order the steak, I'm afraid there's a shortage due to the mad cow disease.",
         line3 = "The Texan says, What's a shortage?",
         line4 = "The Russian says, What's a steak?",
         line5 = "The New Yorker says, What's 'excuse me?")

# process documents and obtain a data.table
parsedtxt <- spacy_parse(txt)
parsedtxt$sentence_id <- NULL
# Show label and review
kable(parsedtxt[1:20,], format = "markdown")
| doc_id | token_id | token | lemma | pos | entity |
|--------|----------|-------|-------|-----|--------|
| line1 | 1 | A | a | DET | |
| line1 | 2 | Texan | texan | PROPN | NORP_B |
| line1 | 3 | , | , | PUNCT | |
| line1 | 4 | a | a | DET | |
| line1 | 5 | Russian | russian | PROPN | NORP_B |
| line1 | 6 | and | and | CCONJ | |
| line1 | 7 | a | a | DET | ORG_B |
| line1 | 8 | New | new | PROPN | ORG_I |
| line1 | 9 | Yorker | yorker | PROPN | ORG_I |
| line1 | 10 | go | go | VERB | |
| line1 | 11 | to | to | ADP | |
| line1 | 12 | a | a | DET | |
| line1 | 13 | restaurant | restaurant | NOUN | |
| line1 | 14 | in | in | ADP | |
| line1 | 15 | London | london | PROPN | GPE_B |
| line1 | 16 | . | . | PUNCT | |
| line2 | 1 | The | the | DET | |
| line2 | 2 | waiter | waiter | NOUN | |
| line2 | 3 | tells | tell | VERB | |
| line2 | 4 | them | -PRON- | PRON | |

Tagging the hotel reviews

We’ll first tag the reviews, and then clean them up by removing information-free words like I, me, the, etc.

# Create a list of docId, review (spacyr's input format for text)
text_data <- c()
text_data[training.df$doc_id] <- training.df$Hotel_Reviews
reviews.pos.raw <- spacy_parse(text_data)

# Standardize column names for joining with training.df
names(reviews.pos.raw)[names(reviews.pos.raw)=="token"] <- "word"
reviews.pos.raw$token_id <- NULL

# Remove tokens of three characters or fewer, drop particles, and remove stop words
reviews.pos <- reviews.pos.raw %>%
    filter(nchar(word) > 3) %>%
    filter(pos!="PART") %>%
    anti_join(stop_words)

reviews.df <- inner_join(training.df, reviews.pos, by="doc_id") %>%
    select(-Filtered_Reviews, -Hotel_Reviews, -sentence_id)

#Cache files for later...
write.csv(reviews.df, file = "./files/reviews_pos.csv")

Here’s a bit of the table after all the tagging is done:

kable(head(reviews.df[,2:7]), format = "markdown")
| Truthfulness | Polarity | word | lemma | pos | entity |
|--------------|----------|------|-------|-----|--------|
| deceptive | negative | stayed | stay | VERB | |
| deceptive | negative | Schicago | schicago | PROPN | ORG_I |
| deceptive | negative | Hilton | hilton | PROPN | ORG_I |
| deceptive | negative | days | day | NOUN | DATE_I |
| deceptive | negative | nights | night | NOUN | TIME_I |
| deceptive | negative | conference | conference | NOUN | |

Confirming our parts of speech hypothesis

Let’s see if some of my suspicions about parts of speech are true (hint: some aren’t… but others are).

Start by examining the parts of speech distribution for both truthful and deceptive reviews

reviews.pos.counted <- reviews.df %>%
                        group_by(Truthfulness) %>%
                        dplyr::count(pos,sort=TRUE)
# Cleanup table for presentation                        
names(reviews.pos.counted)[names(reviews.pos.counted)=="pos"] <- "Parts_of_Speech"
names(reviews.pos.counted)[names(reviews.pos.counted)=="n"] <- "Number_of_Occurences"
# I'd like to reorder POS but I can't get it to work :-(
# reorder_pos <- function(aTag) {
#     newOrder <- list(pos=c('NOUN', 'VERB', 'PROPN', 'ADJ', 'ADV', 'ADP', 'DET', 
#                          'NUM', 'PRON', 'PUNCT', 'INTJ', 'CCONJ', 'X'), value=1:13)
#     return(newOrder[aTag])
# }
reviews.pos.counted <- reviews.pos.counted[,c(2,1,3)]
kable(reviews.pos.counted[order(reviews.pos.counted$Parts_of_Speech, reviews.pos.counted$Truthfulness),], 
      format = "markdown")
| Parts_of_Speech | Truthfulness | Number_of_Occurences |
|-----------------|--------------|----------------------|
| ADJ   | deceptive | 6507 |
| ADJ   | truthful  | 6704 |
| ADP   | deceptive | 426 |
| ADP   | truthful  | 267 |
| ADV   | deceptive | 2278 |
| ADV   | truthful  | 2155 |
| CCONJ | deceptive | 7 |
| CCONJ | truthful  | 12 |
| DET   | deceptive | 265 |
| DET   | truthful  | 303 |
| INTJ  | deceptive | 48 |
| INTJ  | truthful  | 48 |
| NOUN  | deceptive | 17616 |
| NOUN  | truthful  | 19959 |
| NUM   | deceptive | 83 |
| NUM   | truthful  | 266 |
| PRON  | deceptive | 164 |
| PRON  | truthful  | 181 |
| PROPN | deceptive | 3996 |
| PROPN | truthful  | 3775 |
| PUNCT | deceptive | 8 |
| PUNCT | truthful  | 88 |
| VERB  | deceptive | 8478 |
| VERB  | truthful  | 8398 |
| X     | deceptive | 2 |
| X     | truthful  | 14 |

Hmmn… Compared to deceptive reviews, truthful reviews have:

  • Significantly more nouns (NOUN)
  • Significantly fewer adpositions (ADP), i.e. prepositions and postpositions such as in, of, after, before
  • Significantly more numbers (NUM)
  • Somewhat fewer proper nouns (PROPN)

pos.barplot <- ggplot(reviews.pos.counted, aes(Parts_of_Speech, Number_of_Occurences, fill=Truthfulness)) +
        geom_col(position="dodge") +
        coord_flip() +
        ylab("Number of times POS appears") +
        labs(title = "Truthful and Deceptive Reviews - Parts-of-speech Distribution") +
        publication_theme()
pos.barplot
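
As a hedged follow-up (not part of the original study), a chi-squared test of independence on the counts above can tell us whether the POS distribution differs reliably between truthful and deceptive reviews:

# Cross-tabulate POS counts by truthfulness and test for independence
pos.table <- xtabs(Number_of_Occurences ~ Truthfulness + Parts_of_Speech,
                   data = reviews.pos.counted)
chisq.test(pos.table)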

Next Steps:

This ends the Exploratory Analysis part of this project. Now that there is some insight into the data, we’re going to see if we can create the sword of truth… Next up is Part 2: Using a Support Vector Machine Learning Model.