In the Internet age, unethical competitors can hire people from various parts of the globe to write product reviews that can decimate a local business. For an example of the kind of behaviour that goes on, see this Wall St. Journal article.
This study project, evaluating deceptive reviews in the hotel industry, replicates work performed at Cornell University by Myle Ott, Yejin Choi, Claire Cardie and Jeffrey T. Hancock. Their paper, “Finding Deceptive Opinion Spam by Any Stretch of the Imagination”, details their methodology.
As part of their work, they collected a database of 1600 reviews, of which 800 were deceptive and 800 were truthful (i.e. written by an actual hotel guest). The deceptive reviews were written under contract by human workers, who were given a minute to write the review, had to live in the US, etc., and were told to write either a positive or a negative (harmful) review. The combination of negative/positive reviews with truthful/deceptive writers produces 4 sets of 400 reviews each: truthful positive, truthful negative, deceptive positive, and deceptive negative.
This project will use a Support Vector Machine (SVM), which is a fancy way of saying this project will calculate a place in the data where, like King Solomon’s sword, it can cleave truth from deception. In mathematical terms, the SVM will determine a separating hyperplane through a hyperspace defined by dimensions such as word counts and parts-of-speech counts (nouns, adjectives, etc.).
The project is written in one of my favorite programming languages, the statistical software language known as R. If you want to follow along, click on the code button to see the software in action.
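To make the hyperplane idea concrete, here is a minimal base-R sketch of how a linear decision rule assigns a label once a hyperplane is known. The two features, the weights `w` and the offset `b` are made-up values chosen for illustration; they are not fitted to the review data, and a real SVM would learn them from the training set.

```r
# Hypothetical two-feature space: (count of superlatives, count of nouns).
# The weights and offset are invented for illustration, not fitted values.
w <- c(superlatives = 1.0, nouns = -0.5)
b <- 0.5

# Classify each row of x by the side of the hyperplane w . x + b = 0 it falls on.
classify <- function(x, w, b) {
  score <- as.vector(x %*% w + b)
  ifelse(score > 0, "deceptive", "truthful")
}

reviews <- rbind(c(4, 3),   # many superlatives, few nouns
                 c(1, 9))   # few superlatives, many nouns
labels <- classify(reviews, w, b)
print(labels)
```

Training an SVM amounts to choosing `w` and `b` so that the two classes sit as far as possible from the boundary; the classification step itself stays this simple.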
The first part of every data science project is to read in the data and “explore” it, preferably using graphics to gain insight.
################################################################################
### Initialize environment
################################################################################
rm(list=ls())
library(tidyverse)
library(gridExtra) #viewing multiple plots together
# Text Mining Packages
library(tidytext)
library(tokenizers)
library(wordcloud2) #creative visualizations
library(spacyr)
spacy_initialize()
# Graphics Packages
library(ggthemes)
library(moments)
library(ggplot2)
library(scales)
library(knitr) # for dynamic reporting
library(kableExtra) # create a nicely formated HTML table
library(formattable) # for the color_tile function
publication_theme <- function() {
theme_economist() +
theme(text=element_text(family="Rockwell"),
plot.title = element_text(family="Rockwell", size=12)
)
}
publication.color.background <- '#d6e4ea'
publication.color.orange <- '#f0716f'
publication.color.cyan <- '#3cbfc2'
Project_Dir <- "/Users/amkhosla/Desktop/Statistics/Projects/Hotel_Reviews/code"
setwd(Project_Dir)
Once all the programming libraries and environment have been loaded, we need to read in the reviews (one per file) and label them as truthful or deceptive. Here’s a look at some randomly sampled reviews from the dataset of 1600 reviews. I’ve hidden whether each one is truthful or deceptive. You guess.
training_path = '../input/op_spam_train'
doc_id <- list.files(path=training_path, pattern="*.txt", full.names=TRUE, recursive=TRUE)
if (length(doc_id) != 1600) {
  # Report both the working directory and the relative path that was searched
  stop(paste("Couldn't locate input training files at:",
             getwd(), training_path))
}
# One row per review file; the file path doubles as the document id
training.df <- as_tibble(data.frame(doc_id = doc_id, stringsAsFactors = FALSE))
Polarity | Hotel_Reviews |
---|---|
positive | We’ve just returned from a two night stay at the Affinia Chicago and our visit was delightful. Thanks to tripadvisor reviews we were prepared for the major construction work on the facade and entrance..it was a mess. And the hotel restaurant and bar remain incomplete and unavailable. Even parts of the lobby had “wet paint” stickers. All these things could have made for a disappointing stay. Not so!! Although we arrived very early in the day our suite was ready along with a most warm and gracious greeting from the front desk staff. After just a few minutes we felt like we were “at home.” The suite on the 25th floor, with a fab view along a wall of windows, was sparkling fresh and clean. The bathroom lighting was outstanding and led to my bride of nearly 50 years saying that it made her look younger…she was right! Everything in and about the room was “like-new” and very stylish. As a visitor we appreciated the very convenient location just a few steps from the Magnificient Mile and all of its ambiance. Thanks to a dedicated and gracious staff, a fantastic location, and chic decorating elements, this hotel can and will be our future hotel of choice in Chicago. The stay was short..but sweet. |
negative | My family has dubbed the Omni in Chicago “Fawlty Towers” because there always seems to be something wrong. This time our room was right above the kitchen and the odor of cooking grease wafted through our bedroom all night long. It smelled like a greasy spoon restaurant and it permeated all of our clothing. The bedroom window was bolted shut, so we couldn’t get any fresh air into the room. With the amount that the Omni is charging for a room, they should have a much better ventilation system. |
negative | This hotel was very overpriced for what you get from staying here. The Amalfi Hotel Chicago advertises itself as a “luxury hotel” located in downtown Chicago. Since it is in downtown Chicago, that means that I had street noises keeping me up most of the night. My view also suffered. They had nicer rooms available with better views, but the markups to stay in them was outrageous! The room itself was decent, with clean linens and nice air conditioning, but I can get that in just about any hotel these days. There wasn’t anything particularly luxurious about the hotel to make it a stand-out for me, but for the price I paid for the room, I certainly had that expectation. It was a let down. |
negative | The heating in our room didn’t work and wasn’t repaired depite numerous requests, the rooms were very small, the litter in the bin remained in the bin for the whole stay, the cleaner made the bed but nothing else, when we returned home my credit card statement showed that the hotel had charged me for the 5 night stay despite my having pre-paid through a travel agent, it took the hotel 2 months to re-credit my card without an apology or offer to pay for my numerous trans-Atlantic phone calls to them. |
negative | The Amalfi Hotel in Chicago advertises itself as family friendly, yet when I arrived with my children it was anything BUT family friendly. The staff seemed miffed and stuck up, and I had to wait for the personal check in. I was too worried that my kids would ruin the obviously really expensive furniture while I waited, and other guests (all single or couples without children) seemed annoyed every time I passed with my kids while I was trying to keep them entertained before we checked in. This might not be the best place to stay if you have a kid or two, even though it is advertised as such. I give you a great big Thumbs Down for false advertising, Amalfi. |
negative | My wife and I stayed at the Hyatt Regency in Chicago a couple of weeks ago and will most defiantly not go back. The customer service is horrible and it is well apparent that most of the workers did not want to be there either. When we arrived at the hotel we were told us that the room was still being cleaned and had to wait almost a hour before they gave us the key. When we got into the room the bed was not even properly made. The cafe is always crowded with people and it take two hours to get food. We ended up going home two days before we planned. |
Check your guesses below. How did you do? Better than 50% accuracy?
kable(training.df[random_records,c(2,3,4)], format = "markdown")
Truthfulness | Polarity | Hotel_Reviews |
---|---|---|
truthful | positive | We’ve just returned from a two night stay at the Affinia Chicago and our visit was delightful. Thanks to tripadvisor reviews we were prepared for the major construction work on the facade and entrance..it was a mess. And the hotel restaurant and bar remain incomplete and unavailable. Even parts of the lobby had “wet paint” stickers. All these things could have made for a disappointing stay. Not so!! Although we arrived very early in the day our suite was ready along with a most warm and gracious greeting from the front desk staff. After just a few minutes we felt like we were “at home.” The suite on the 25th floor, with a fab view along a wall of windows, was sparkling fresh and clean. The bathroom lighting was outstanding and led to my bride of nearly 50 years saying that it made her look younger…she was right! Everything in and about the room was “like-new” and very stylish. As a visitor we appreciated the very convenient location just a few steps from the Magnificient Mile and all of its ambiance. Thanks to a dedicated and gracious staff, a fantastic location, and chic decorating elements, this hotel can and will be our future hotel of choice in Chicago. The stay was short..but sweet. |
truthful | negative | My family has dubbed the Omni in Chicago “Fawlty Towers” because there always seems to be something wrong. This time our room was right above the kitchen and the odor of cooking grease wafted through our bedroom all night long. It smelled like a greasy spoon restaurant and it permeated all of our clothing. The bedroom window was bolted shut, so we couldn’t get any fresh air into the room. With the amount that the Omni is charging for a room, they should have a much better ventilation system. |
deceptive | negative | This hotel was very overpriced for what you get from staying here. The Amalfi Hotel Chicago advertises itself as a “luxury hotel” located in downtown Chicago. Since it is in downtown Chicago, that means that I had street noises keeping me up most of the night. My view also suffered. They had nicer rooms available with better views, but the markups to stay in them was outrageous! The room itself was decent, with clean linens and nice air conditioning, but I can get that in just about any hotel these days. There wasn’t anything particularly luxurious about the hotel to make it a stand-out for me, but for the price I paid for the room, I certainly had that expectation. It was a let down. |
truthful | negative | The heating in our room didn’t work and wasn’t repaired depite numerous requests, the rooms were very small, the litter in the bin remained in the bin for the whole stay, the cleaner made the bed but nothing else, when we returned home my credit card statement showed that the hotel had charged me for the 5 night stay despite my having pre-paid through a travel agent, it took the hotel 2 months to re-credit my card without an apology or offer to pay for my numerous trans-Atlantic phone calls to them. |
deceptive | negative | The Amalfi Hotel in Chicago advertises itself as family friendly, yet when I arrived with my children it was anything BUT family friendly. The staff seemed miffed and stuck up, and I had to wait for the personal check in. I was too worried that my kids would ruin the obviously really expensive furniture while I waited, and other guests (all single or couples without children) seemed annoyed every time I passed with my kids while I was trying to keep them entertained before we checked in. This might not be the best place to stay if you have a kid or two, even though it is advertised as such. I give you a great big Thumbs Down for false advertising, Amalfi. |
deceptive | negative | My wife and I stayed at the Hyatt Regency in Chicago a couple of weeks ago and will most defiantly not go back. The customer service is horrible and it is well apparent that most of the workers did not want to be there either. When we arrived at the hotel we were told us that the room was still being cleaned and had to wait almost a hour before they gave us the key. When we got into the room the bed was not even properly made. The cafe is always crowded with people and it take two hours to get food. We ended up going home two days before we planned. |
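The listing above does not show how the Truthfulness and Polarity columns or `random_records` were built. One plausible way, sketched below, is to derive the labels from the file paths and then sample row indices. The directory names in these paths are an assumption made for illustration (the real layout of the input folder is not shown), and `labels.df` is a hypothetical name.

```r
# Hypothetical file paths; the real directory layout is an assumption here.
doc_id <- c("../input/op_spam_train/negative_polarity/deceptive/d_hilton_1.txt",
            "../input/op_spam_train/negative_polarity/truthful/t_hilton_1.txt",
            "../input/op_spam_train/positive_polarity/deceptive/d_omni_2.txt",
            "../input/op_spam_train/positive_polarity/truthful/t_omni_2.txt")

# Derive both labels from substrings of the path.
labels.df <- data.frame(
  doc_id       = doc_id,
  Truthfulness = ifelse(grepl("deceptive", doc_id), "deceptive", "truthful"),
  Polarity     = ifelse(grepl("negative_polarity", doc_id), "negative", "positive"),
  stringsAsFactors = FALSE
)

# Draw a reproducible random sample of row indices.
set.seed(42)
random_records <- sample(nrow(labels.df), 2)
print(labels.df[random_records, c("Truthfulness", "Polarity")])
```

Under the same assumption, the review text itself could be read with something like `sapply(doc_id, function(f) paste(readLines(f, warn = FALSE), collapse = " "))`.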
Now comes the first analysis of the text. The first thing to examine in any textual analysis is the so-called bag-of-words model. We break up the reviews into a set of words, and then we analyze those single words.
cleanup_review <- function(aReviewStr) {
the_cleansed_string <- aReviewStr
the_cleansed_string <- gsub("[^a-zA-Z0-9 ]", " ", the_cleansed_string)
theTokens <- tokenize_words(the_cleansed_string, stopwords = stopwords::stopwords("en"))[[1]]
theLongerTokens <- theTokens[nchar(theTokens) > 3] # keep only tokens longer than 3 characters
the_cleansed_string <- paste(theLongerTokens, collapse = " ")
return(the_cleansed_string)
}
training.df$Filtered_Reviews <- sapply(training.df$Hotel_Reviews, cleanup_review)
tokenized_unigram.df <- training.df %>%
unnest_tokens(word, Filtered_Reviews) %>%
distinct()
tokenized_bigram.df <- training.df %>%
unnest_tokens(ngram, Filtered_Reviews, token = "ngrams", n = 2, collapse=FALSE)
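For readers without the tidytext stack installed, the bag-of-words step can be approximated in base R. This is a simplified sketch on two made-up sentences: it applies the same "longer than 3 characters" filter as `cleanup_review` but, unlike the code above, does not remove stop words. The function name `bag_of_words` is hypothetical.

```r
# Two made-up review snippets (not from the dataset).
reviews <- c(r1 = "The room was clean and the staff was friendly.",
             r2 = "The room was dirty!")

# Lowercase, replace non-alphanumerics with spaces, split on runs of spaces,
# then count how often each remaining word occurs.
bag_of_words <- function(text) {
  cleaned <- gsub("[^a-z0-9 ]", " ", tolower(text))
  tokens  <- strsplit(cleaned, " +")[[1]]
  tokens  <- tokens[nchar(tokens) > 3]   # same length filter as cleanup_review
  table(tokens)
}

bags <- lapply(reviews, bag_of_words)
print(bags$r1)
```

Word order is thrown away on purpose: each review becomes nothing more than a table of word counts, which is what makes the representation a "bag".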
Zipf’s law of abbreviation says that the more common a word is, the shorter it tends to be. Zipf’s law proper says that word frequencies have a “Zipfian” distribution: a word’s frequency is roughly inversely proportional to its frequency rank, so common words (typically the top 2000-3000 words in any language) have a very high occurrence/probability, and then the rest rapidly drop off. As a quick sanity check of the data, let’s see if the reviews have a “Zipfian” distribution.
word_frequency <- tokenized_unigram.df %>%
dplyr::count(word, sort = TRUE)
freq_range = 1:1000
barplot(word_frequency$n[freq_range])
Looks Zipfian!
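A barplot is an eyeball test. A slightly more quantitative check, sketched here on synthetic data, is to regress log frequency on log rank: under an idealized Zipf distribution the slope is close to -1. The counts below are generated from an exact 1/rank law purely for illustration; they are not the review data.

```r
# Synthetic counts that follow an exact 1/rank law (made-up data).
ranks  <- 1:1000
counts <- 5000 / ranks

# Under Zipf's law, log(count) is roughly linear in log(rank) with slope near -1.
fit   <- lm(log(counts) ~ log(ranks))
slope <- unname(coef(fit)[2])
print(round(slope, 3))
```

To try this on the real data, `word_frequency$n` (already sorted in descending order) could stand in for `counts`, with `seq_along(word_frequency$n)` as the ranks. Real text usually gives a slope near, but not exactly, -1.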
To keep the training data balanced, half of the reviews are “negative polarity”, meaning they dislike (or want to harm) the hotel, and half are “positive polarity”, meaning they like (or want to promote) the hotel. We can perform a “sentiment” analysis (determining whether a writer’s attitude is positive, negative, or neutral) on the reviews to see if the data seems reasonably labeled for thumbs-up/down polarity. Here we use the AFINN sentiment dictionary to categorize words as harsh or kind.
sentiments.df <- tokenized_unigram.df %>%
inner_join(get_sentiments("afinn"), by = "word") %>%
group_by(doc_id) %>%
mutate(sentiment=sum(score)) %>%
select(-Hotel_Reviews, -word, -score) %>%
filter(sentiment > -20)
sentiments.df <- sentiments.df[!duplicated(sentiments.df),]
deceptive.sentiments <- filter(sentiments.df, Truthfulness=="deceptive")
truthful.sentiments <- filter(sentiments.df, Truthfulness=="truthful")
deceptive.negative.df <- deceptive.sentiments %>% filter(Polarity=="negative")
deceptive.positive.df <- deceptive.sentiments %>% filter(Polarity=="positive")
truthful.negative.df <- truthful.sentiments %>% filter(Polarity=="negative")
truthful.positive.df <- truthful.sentiments %>% filter(Polarity=="positive")
deceptive.negative.mean <- mean(deceptive.negative.df$sentiment)
deceptive.negative.sd <- sd(deceptive.negative.df$sentiment)
truthful.negative.mean <- mean(truthful.negative.df$sentiment)
truthful.negative.sd <- sd(truthful.negative.df$sentiment)
deceptive.positive.mean <- mean(deceptive.positive.df$sentiment)
deceptive.positive.sd <- sd(deceptive.positive.df$sentiment)
truthful.positive.mean <- mean(truthful.positive.df$sentiment)
truthful.positive.sd <- sd(truthful.positive.df$sentiment)
truthfulness <- c("deceptive", "truthful")
negative.mean <- c(deceptive.negative.mean, truthful.negative.mean)
negative.stdev <- c(deceptive.negative.sd, truthful.negative.sd)
positive.mean <- c(deceptive.positive.mean, truthful.positive.mean)
positive.stdev <- c(deceptive.positive.sd, truthful.positive.sd)
summary.table = data.frame(truthfulness, negative.mean, negative.stdev, positive.mean, positive.stdev)
kable(summary.table, format = "markdown")
truthfulness | negative.mean | negative.stdev | positive.mean | positive.stdev |
---|---|---|---|---|
deceptive | 1.637500 | 6.476183 | 14.755 | 7.270635 |
truthful | 2.598485 | 6.651384 | 13.550 | 7.318908 |
ggplot(data=deceptive.sentiments, aes(sentiment, fill=Polarity)) +
geom_histogram(binwidth = 1,position="dodge") +
xlab("Sentiment Level") +
labs(title = "Sentiment Levels for Deceitful Negative and Positive Reviews") +
publication_theme()
ggplot(data=truthful.sentiments, aes(sentiment, fill=Polarity)) +
geom_histogram(binwidth = 1,position="dodge") +
xlab("Sentiment Level") +
labs(title = "Sentiment Levels for Truthful Negative and Positive Reviews") +
publication_theme()
Deceptive writers are harsher than truthful writers when writing negative reviews and more positive than truthful writers when writing positive reviews. No surprise there LOL 😂 From the table you can also see that deceptive reviews have a slightly narrower standard deviation than truthful reviews (6.48 vs. 6.65 for negative reviews, 7.27 vs. 7.32 for positive ones). It might be useful to add sentiment analysis as a feature in our machine learning hyperspace.
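The per-review sentiment used above is nothing more than the sum of the dictionary scores of the words in the review. A minimal base-R sketch of that idea follows. The tiny lexicon is made up in the AFINN style (integer scores, negative for harsh words) and should not be read as the real AFINN values; `score_review` is a hypothetical helper.

```r
# Toy lexicon with invented AFINN-style scores; NOT the real AFINN dictionary.
lexicon <- c(great = 3, clean = 2, rude = -2, horrible = -3)

# Sum the scores of every lexicon word found in the review.
# Words missing from the lexicon index to NA and are dropped by na.rm.
score_review <- function(text, lexicon) {
  tokens <- strsplit(gsub("[^a-z ]", " ", tolower(text)), " +")[[1]]
  sum(lexicon[tokens], na.rm = TRUE)
}

positive_score <- score_review("Great location and a clean room", lexicon)
negative_score <- score_review("Horrible service, rude staff", lexicon)
print(c(positive = positive_score, negative = negative_score))
```

Because the score is a plain sum, longer reviews can drift further from zero than short ones, which is worth keeping in mind when comparing means and standard deviations across groups.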
Let’s plot the probability that words show up in reviews, by whether they show up in truthful, or deceptive reviews, or both.
word_distribution <- as_tibble(count(tokenized_unigram.df, word, Truthfulness, sort = TRUE))
tidy_word_distribution <- spread(word_distribution, Truthfulness, n)
tidy_word_distribution <- tidy_word_distribution %>%
replace_na(list(deceptive = 0, truthful = 0)) %>%
mutate(deceptive_proportion = deceptive / sum(deceptive)) %>%
mutate(truth_proportion = truthful / sum(truthful))
deceptive_words_only <- tidy_word_distribution %>%
filter(truth_proportion == 0) %>%
filter(deceptive > 2) %>%
mutate(word = reorder(word, deceptive)) #Reorders word into a factor, based on n....
deceptive.barplot <- ggplot(deceptive_words_only, aes(word, deceptive)) +
geom_col() +
xlab(NULL) +
coord_flip() +
labs(title = "Words that appear ONLY in deceptive reviews") +
publication_theme()
deceptive.barplot
true_words_only <- tidy_word_distribution %>%
filter(deceptive_proportion == 0) %>%
filter(truthful > 2) %>%
mutate(word = reorder(word, truthful)) #Reorders word into a factor, based on n....
true.barplot <- ggplot(true_words_only, aes(word, truthful)) +
geom_col() +
xlab(NULL) +
coord_flip() +
labs(title = "Words that appear ONLY in truthful reviews") +
publication_theme()
true.barplot
There are words that only show up in deceptive reviews, and similarly there are words that only show up in truthful reviews. As you can observe, the deceptive reviews are slightly richer in superlative words, and the truthful reviews seem richer in nouns. This implies that parts-of-speech tagging might be a useful part of our approach.
How about words that show up in both truthful and deceptive reviews? Looks like the same thing: nouns show up in the top left truthful section, adjectives in the bottom right deceptive section. Looks like people who write deceptive reviews like the words luxury, accomodations, amazing and smell.
Deceptive reviews smell bad! And they are “amazing”. LOL 😂
# Remove words that occur only on axes from the plot
plot_word_distribution <- tidy_word_distribution %>%
filter((truth_proportion > 0) & (deceptive_proportion > 0))
word_distribution.plot <- ggplot(plot_word_distribution,
aes(x = deceptive_proportion, y = truth_proportion )) +
geom_abline(color = "gray40", lty = 2) +
geom_jitter(color = publication.color.orange, alpha = 0.3, size = 2., width = 0.2, height = 0.2) +
geom_text(aes(label = word), check_overlap = TRUE, size = 3, fontface = "bold", vjust = 1.5) +
scale_x_log10(labels = percent_format()) +
scale_y_log10(labels = percent_format()) +
theme(legend.position="none") +
labs(x = "% Deceptive (log scale)", y = "% Truthful (log scale)") +
publication_theme()
word_distribution.plot
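The proportions behind these plots are simple to reproduce by hand. The sketch below uses a made-up set of (word, label) pairs standing in for `tokenized_unigram.df`; it cross-tabulates words by class, converts each class column to proportions, and picks out the words that never occur in truthful reviews.

```r
# Made-up (word, label) pairs standing in for tokenized_unigram.df.
tokens <- data.frame(
  word  = c("luxury", "luxury", "amazing", "room", "room", "room", "floor"),
  label = c("deceptive", "deceptive", "deceptive", "deceptive",
            "truthful", "truthful", "truthful"),
  stringsAsFactors = FALSE
)

# Cross-tabulate, then convert each column to a share of that class's tokens.
counts <- table(tokens$word, tokens$label)
props  <- prop.table(counts, margin = 2)

# Words with zero share in the truthful column appear only in deceptive reviews.
deceptive_only <- rownames(props)[props[, "truthful"] == 0]
print(round(props, 2))
print(deceptive_only)
```

Normalizing within each class matters because the two classes need not contain the same number of tokens, so raw counts alone would not be comparable.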
Now would be a good time to quote the authors of the original study:
However, that deceptive opinions contain more superlatives is not unexpected, since deceptive writing (but not necessarily imaginative writing in general) often contains exaggerated language.
With some insight into the distribution and type of words used in spam reviews, let’s build a machine learning algorithm. The original authors achieved nearly 90% accuracy with their SVM. Again a quote is in order:
Notably, a combined classifier with both n-gram and psychological deception features achieves nearly 90% cross-validated accuracy on this task. In contrast, we find deceptive opinion spam detection to be well beyond the capabilities of most human judges, who perform roughly at-chance—a finding that is consistent with decades of traditional deception detection research (Bond and DePaulo, 2006).
So… astonishingly, human intelligence has a roughly 50% accuracy at detecting deception (no better than random). This insight explains much of our current world to me. 😉
In the case of detecting deceptive hotel reviews, I suspect that artificial intelligence can exceed human intelligence.
It will be helpful to extract more “2nd order” information from the words we have at our disposal. We’ve already discussed word frequencies by truthful/deceptive and sentiment analysis. The authors of the study used two other sets of parameters for their machine-learning “hyperspace”: parts-of-speech tagging (adjectives, nouns, etc.) and bigrams (two words located next to each other).
You might be surprised to learn that identifying a word’s part of speech is a fairly classic AI technology. Even very simple approaches routinely get 90% accuracy. The other thing the authors discovered was helpful was using bigrams/n-grams instead of single words. This means they looked at paired words. For example, the previous sentence, instead of being broken down into {this, means, they, looked, at, paired, words}, breaks down into two-word combinations: {this means, means they, they looked, looked at, at paired, paired words}.
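The bigram breakdown of that example sentence can be reproduced with a few lines of base R. This is a minimal sketch with a hypothetical helper, `make_bigrams`; the project itself uses `unnest_tokens` with `token = "ngrams"` for the same job.

```r
# Build all adjacent word pairs (bigrams) from one sentence.
make_bigrams <- function(sentence) {
  # Drop punctuation, lowercase, and split on runs of spaces.
  words <- strsplit(tolower(gsub("[^A-Za-z ]", "", sentence)), " +")[[1]]
  if (length(words) < 2) return(character(0))
  # Pair every word with its right-hand neighbour.
  paste(head(words, -1), tail(words, -1))
}

bigrams <- make_bigrams("This means they looked at paired words.")
print(bigrams)
```

A sentence of n words yields n - 1 bigrams, so the seven-word example gives six pairs.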
Let’s perform these bits of “feature engineering” - creating additional features from the text to improve our machine learning success. First, the bigger challenge, tagging the words with their parts-of-speech and seeing what that tells us.
Several open-source parts-of-speech (POS) taggers are available. I’m using spacyr. Trained using a neural net, this POS tagger has 92% accuracy and is a (relatively) fast classifier, which is important considering the amount of data being mined. As an example of its use, let’s parse a joke:
A Texan, a Russian and a New Yorker go to a restaurant in London.
The waiter tells them, “Excuse me – if you were going to order the steak, I’m afraid there’s a shortage due to the mad cow disease.”
The Texan says, “What’s a shortage?”
The Russian says, “What’s a steak?”
The New Yorker says, “What’s ‘excuse me’?”
txt <- c(line1 = "A Texan, a Russian and a New Yorker go to a restaurant in London.",
line2 = "The waiter tells them, Excuse me -- if you were going to order the steak, I'm afraid there's a shortage due to the mad cow disease.",
line3 = "The Texan says, What's a shortage?",
line4 = "The Russian says, What's a steak?",
line5 = "The New Yorker says, What's 'excuse me'?")
# process documents and obtain a data.table
parsedtxt <- spacy_parse(txt)
parsedtxt$sentence_id <- NULL
# Show label and review
kable(parsedtxt[1:20,], format = "markdown")
doc_id | token_id | token | lemma | pos | entity |
---|---|---|---|---|---|
line1 | 1 | A | a | DET | |
line1 | 2 | Texan | texan | PROPN | NORP_B |
line1 | 3 | , | , | PUNCT | |
line1 | 4 | a | a | DET | |
line1 | 5 | Russian | russian | PROPN | NORP_B |
line1 | 6 | and | and | CCONJ | |
line1 | 7 | a | a | DET | ORG_B |
line1 | 8 | New | new | PROPN | ORG_I |
line1 | 9 | Yorker | yorker | PROPN | ORG_I |
line1 | 10 | go | go | VERB | |
line1 | 11 | to | to | ADP | |
line1 | 12 | a | a | DET | |
line1 | 13 | restaurant | restaurant | NOUN | |
line1 | 14 | in | in | ADP | |
line1 | 15 | London | london | PROPN | GPE_B |
line1 | 16 | . | . | PUNCT | |
line2 | 1 | The | the | DET | |
line2 | 2 | waiter | waiter | NOUN | |
line2 | 3 | tells | tell | VERB | |
line2 | 4 | them | -PRON- | PRON | |
We’ll first tag the reviews and then clean them up by removing information-free words like I, me, the, etc.
# Create a list of docId, review (spacyr's input format for text)
text_data <- c()
text_data[training.df$doc_id] <- training.df$Hotel_Reviews
reviews.pos.raw <- spacy_parse(text_data)
# Standardize format: spacyr names the column "token"; the rest of this project uses "word"
names(reviews.pos.raw)[names(reviews.pos.raw)=="token"] <- "word"
reviews.pos.raw$token_id <- NULL
# Remove tokens of 3 or fewer characters, particles (PART), and stop words
reviews.pos <- reviews.pos.raw %>%
  filter(nchar(word) > 3) %>%
  filter(pos!="PART") %>%
  anti_join(stop_words, by = "word")
reviews.df <- inner_join(training.df, reviews.pos, by="doc_id") %>%
select(-Filtered_Reviews, -Hotel_Reviews, -sentence_id)
#Cache files for later...
write.csv(reviews.df, file = "./files/reviews_pos.csv")
Here’s a bit of the table after all the tagging is done
kable(head(reviews.df[,2:7]), format = "markdown")
Truthfulness | Polarity | word | lemma | pos | entity |
---|---|---|---|---|---|
deceptive | negative | stayed | stay | VERB | |
deceptive | negative | Schicago | schicago | PROPN | ORG_I |
deceptive | negative | Hilton | hilton | PROPN | ORG_I |
deceptive | negative | days | day | NOUN | DATE_I |
deceptive | negative | nights | night | NOUN | TIME_I |
deceptive | negative | conference | conference | NOUN | |
Let’s see if some of my suspicions about parts of speech are true (hint: they’re not…, but others are)
Start by examining the parts-of-speech distribution for both truthful and deceptive reviews.
reviews.pos.counted <- reviews.df %>%
group_by(Truthfulness) %>%
dplyr::count(pos,sort=TRUE)
# Cleanup table for presentation
names(reviews.pos.counted)[names(reviews.pos.counted)=="pos"] <- "Parts_of_Speech"
names(reviews.pos.counted)[names(reviews.pos.counted)=="n"] <- "Number_of_Occurences"
# I'd like to reorder POS but I can't get it to work :-(
# reorder_pos <- function(aTag) {
# newOrder <- list(pos=c('NOUN', 'VERB', 'PROPN', 'ADJ', 'ADV', 'ADP', 'DET',
# 'NUM', 'PRON', 'PUNCT', 'INTJ', 'CCONJ', 'X'), value=1:13)
# return(newOrder[aTag])
# }
reviews.pos.counted <- reviews.pos.counted[,c(2,1,3)]
kable(reviews.pos.counted[order(reviews.pos.counted$Parts_of_Speech, reviews.pos.counted$Truthfulness),],
format = "markdown")
Parts_of_Speech | Truthfulness | Number_of_Occurences |
---|---|---|
ADJ | deceptive | 6507 |
ADJ | truthful | 6704 |
ADP | deceptive | 426 |
ADP | truthful | 267 |
ADV | deceptive | 2278 |
ADV | truthful | 2155 |
CCONJ | deceptive | 7 |
CCONJ | truthful | 12 |
DET | deceptive | 265 |
DET | truthful | 303 |
INTJ | deceptive | 48 |
INTJ | truthful | 48 |
NOUN | deceptive | 17616 |
NOUN | truthful | 19959 |
NUM | deceptive | 83 |
NUM | truthful | 266 |
PRON | deceptive | 164 |
PRON | truthful | 181 |
PROPN | deceptive | 3996 |
PROPN | truthful | 3775 |
PUNCT | deceptive | 8 |
PUNCT | truthful | 88 |
VERB | deceptive | 8478 |
VERB | truthful | 8398 |
X | deceptive | 2 |
X | truthful | 14 |
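Hmmn… Compared to deceptive reviews, truthful reviews have more nouns (19,959 vs. 17,616), more numbers (266 vs. 83) and slightly more adjectives (6,704 vs. 6,507), but fewer adpositions (267 vs. 426) and slightly fewer verbs (8,398 vs. 8,478).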
pos.barplot <- ggplot(reviews.pos.counted, aes(Parts_of_Speech,Number_of_Occurences, fill=Truthfulness)) +
geom_col(position="dodge") +
coord_flip() +
ylab("Number of times POS appears") +
labs(title = "Truthful and Deceptive Reviews - Parts-of-speech Distribution") +
publication_theme()
pos.barplot
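Raw counts like those in the table above are easier to compare as ratios. The sketch below copies the counts for four tags from that table and divides the truthful count by the deceptive count; a value above 1 means the tag is more common in truthful reviews. Restricting the comparison to four tags is a simplification for illustration.

```r
# Counts copied from the parts-of-speech table above (four tags only).
pos_counts <- rbind(
  deceptive = c(ADJ = 6507, NOUN = 17616, NUM = 83,  VERB = 8478),
  truthful  = c(ADJ = 6704, NOUN = 19959, NUM = 266, VERB = 8398)
)

# Ratio of truthful to deceptive counts per tag; values above 1 mean the tag
# is more common in truthful reviews.
ratio <- pos_counts["truthful", ] / pos_counts["deceptive", ]
print(round(ratio, 2))
```

Numbers stand out most: truthful reviews use them roughly three times as often, while adjectives and verbs are close to parity.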
This ends the Exploratory Analysis part of this project. Now that there is some insight into the data, we’re going to see if we can create the sword of truth… Next up is Part 2: Using a Support Vector Machine Learning Model.