In Part 1 of this study project I explored the motivation and data of deceptive hotel reviews provided by researchers at Cornell University.
In Part 2, I attempted to replicate their approach using statistical and machine learning packages available from The R Project for Statistical Computing. The machine learning algorithm I used was the Support Vector Machine (SVM). Running the training algorithm several times generated accuracies ranging from 82% to 87%; because the sample size was small, the random selection of training data influenced the training accuracy.
Will a change in technology help? Unlikely, but just for fun (mostly mine LOL) let’s see how a neural net behaves.
I’ll use the very popular neural net package Keras, written by Francois Chollet of Google. The following is adapted from A tutorial on text classification using Keras.
First we’ll reload our tagged database of docs, words, and parts of speech.
library(tidyverse)  # dplyr, tibble, purrr (map), etc.
library(keras)      # also re-exports the %<-% multi-assignment operator used below
reviews.df <- as_tibble(read.csv('./files/reviews_pos.csv', stringsAsFactors=FALSE))
names(reviews.df)[names(reviews.df)=="X1"] <- "word_id"
names(reviews.df)[names(reviews.df)=="Truthfulness"] <- "truthfulness"
names(reviews.df)[names(reviews.df)=="Polarity"] <- "polarity"
reviews.df <- reviews.df %>%
filter(word != '#NAME?') %>%
select(-entity)
# Tag each word as 'lemma/pos' (e.g., lemma 'hotel' with POS 'NOUN' becomes 'hotel/NOUN');
# paste() is vectorized, so no per-row apply is needed
reviews.df$word <- paste(reviews.df$lemma, '/', reviews.df$pos, sep = '')
# Make a dictionary
wordIndex.dict <- count(reviews.df, word, sort = TRUE)
wordIndex.dict$word_index <- 1:length(wordIndex.dict$word)
write.csv(wordIndex.dict, file="./files/word_index_dict.csv")
# And update reviews.df to have a word_index :-) with one line!
reviews.df <- left_join(reviews.df,wordIndex.dict,by="word")
# Collect each document's term indices into a list-column, one row per doc_id
document.term.df <- reviews.df %>%
group_by(doc_id) %>%
summarise(terms_list = list(word_index))
document_ids <- as.vector(distinct(reviews.df, doc_id)$doc_id)
# All term indices for one document, in document order
getTermList <- function(aDocId) {
return(reviews.df[reviews.df$doc_id==aDocId,]$word_index) }
# Binary label for one document: 0 = deceptive, 1 = truthful
getOutputCode <- function(aDocID) {
if (reviews.df[reviews.df$doc_id==aDocID,]$truthfulness[1] == 'deceptive')
{ return(0) }
else { return(1) }
}
# Package a set of documents as x (list of term-index vectors) and y (0/1 labels)
get_document_set <- function(someDocs) {
document_x_data <- unname(sapply(someDocs, getTermList))
document_y_data <- unname(sapply(someDocs, getOutputCode))
return(list(x = document_x_data, y = document_y_data))
}
# 80/20 train/test split over documents
full_range <- 1:length(document_ids)
training_range <- sample(full_range, round(.8*length(full_range)))
test_range <- setdiff(full_range,training_range)
nn_container <- list(train = get_document_set(document_ids[training_range]),
test = get_document_set(document_ids[test_range]))
Like any data science project, much of the ingenuity is in munging the data into the right form in the fewest lines.
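Before handing this to Keras, a quick sanity check that the container has the shape we expect; str() and table() are just base R inspection helpers, and the exact output depends on the random split:
# Peek at the first training document and the label balance
str(nn_container$train$x[[1]])   # an integer vector of word indices
table(nn_container$train$y)      # counts of deceptive (0) vs truthful (1)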
c(train_data, train_labels) %<-% nn_container$train
c(test_data, test_labels) %<-% nn_container$test
### Dictionary of index to words
# wordIndex.dict already maps word -> word_index, so build the lookup frame directly
word_index_df <- data.frame(
word = wordIndex.dict$word,
idx = wordIndex.dict$word_index,
stringsAsFactors = FALSE
)
# The first indices are reserved (following the Keras IMDB tutorial convention)
word_index_df <- word_index_df %>% mutate(idx = idx + 3)
word_index_df <- word_index_df %>%
add_row(word = "<PAD>", idx = 0) %>%
add_row(word = "<START>", idx = 1) %>%
add_row(word = "<UNK>", idx = 2) %>%
add_row(word = "<UNUSED>", idx = 3)
word_index_df <- word_index_df %>% arrange(idx)
# Note: the sequences in train_data/test_data still use the unshifted wordIndex.dict
# indices, so only <PAD> = 0 matters for padding below; to decode a raw sequence
# against this table, add 3 to each index first.
gNumWords_in_Dict <- nrow(word_index_df)  # vocabulary size, including the 4 reserved tokens
## Convert from numbers to words
# Decodes a sequence expressed in the shifted idx space (see note above);
# for our raw sequences, add 3 to each index first
decode_review <- function(text){
paste(map(text, function(number) word_index_df %>%
filter(idx == number) %>%
select(word) %>%
pull()),
collapse = " ")
}
# decode_review(train_data[[1]] + 3)
### Prepare the data
# The reviews (arrays of integers) must be converted to tensors before being fed into the neural network. This conversion can be done a couple of ways:
# One-hot-encode the arrays to convert them into vectors of 0s and 1s. For example, the sequence [3, 5] would become a vector with one dimension per vocabulary word, all zeros except for indices 3 and 5, which are ones. Then make this the first layer in our network: a dense layer that can handle floating point vector data. This approach is memory intensive, though, requiring a num_words * num_reviews size matrix. (A sketch of this approach appears after the padding code below.)
# Alternatively, we can pad the arrays so they all have the same length, then create an integer tensor of shape num_examples * max_length. We can use an embedding layer capable of handling this shape as the first layer in our network.
# Here we'll use the second approach.
# Since the reviews must all be the same length, we'll use the pad_sequences function to standardize the lengths:
train_data <- pad_sequences(
train_data,
value = word_index_df %>% filter(word == "<PAD>") %>% select(idx) %>% pull(),
padding = "post",
maxlen = 256
)
test_data <- pad_sequences(
test_data,
value = word_index_df %>% filter(word == "<PAD>") %>% select(idx) %>% pull(),
padding = "post",
maxlen = 256
)
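For contrast, here is a minimal sketch of the first (multi-hot) approach described above. It isn't used anywhere below, and multi_hot is a hypothetical helper, not part of keras:
# Hypothetical multi-hot encoder: one row per review, one column per vocabulary index
multi_hot <- function(sequences, num_words) {
m <- matrix(0, nrow = length(sequences), ncol = num_words)
for (i in seq_along(sequences)) {
m[i, sequences[[i]]] <- 1   # mark every word index present in review i
}
m
}
# e.g. x_train_hot <- multi_hot(nn_container$train$x, gNumWords_in_Dict)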
We build a neural net whose input is each review as a sequence of integer word indices over a vocabulary of roughly 8,700 tagged words, and whose single sigmoid output unit scores the review between 0 (deceptive) and 1 (truthful). In between we have a few hidden layers. Let's run and test it.
### Build the model
# input shape is the vocabulary count used for the hotel reviews (about 8,700 words, plus the 4 reserved tokens)
model <- keras_model_sequential()
model %>%
layer_embedding(input_dim = gNumWords_in_Dict, output_dim = 96) %>%
layer_global_average_pooling_1d() %>%
layer_dense(units = 96, activation = "relu") %>%
layer_dense(units = 96, activation = "relu") %>%
layer_dense(units = 96, activation = "relu") %>%
layer_dense(units = 1, activation = "sigmoid")
model %>% summary()
## ___________________________________________________________________________
## Layer (type) Output Shape Param #
## ===========================================================================
## embedding_1 (Embedding) (None, None, 96) 839424
## ___________________________________________________________________________
## global_average_pooling1d_1 (Glob (None, 96) 0
## ___________________________________________________________________________
## dense_1 (Dense) (None, 96) 9312
## ___________________________________________________________________________
## dense_2 (Dense) (None, 96) 9312
## ___________________________________________________________________________
## dense_3 (Dense) (None, 96) 9312
## ___________________________________________________________________________
## dense_4 (Dense) (None, 1) 97
## ===========================================================================
## Total params: 867,457
## Trainable params: 867,457
## Non-trainable params: 0
## ___________________________________________________________________________
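As a sanity check, the parameter counts reconcile: the embedding layer holds input_dim * output_dim = 8,744 * 96 = 839,424 weights; each 96-unit dense layer holds 96 * 96 weights plus 96 biases, or 9,312; and the output layer holds 96 weights plus 1 bias, or 97. That gives 839,424 + 3 * 9,312 + 97 = 867,457 in total, matching the summary.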
# Loss function and optimizer
model %>% compile(
optimizer = 'adam',
loss = 'binary_crossentropy',
metrics = list('accuracy')
)
# Create a validation set, held out from the training data
x_val <- train_data[1:150, ]
partial_x_train <- train_data[151:nrow(train_data), ]
y_val <- train_labels[1:150]
partial_y_train <- train_labels[151:length(train_labels)]
# Train the model
history <- model %>% fit(
partial_x_train,
partial_y_train,
epochs = 20,
batch_size = 50,
validation_data = list(x_val, y_val),
verbose=1
)
# EVALUATE the model
results <- model %>% evaluate(test_data, test_labels)
Percent_Accuracy <- format(100*results$acc,digits=2)
results
## $loss
## [1] 0.5000462
##
## $acc
## [1] 0.8875
plot(history)
Sigh. A different suit of clothes didn't help much. The neural net has an accuracy of 89%: on average a few percentage points better than the Support Vector Machine, but with a much narrower standard deviation. This suggests the neural net may be less sensitive to the sampling error introduced by the small sample size.
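That claim about the spread deserves numbers. Here is a minimal sketch of how one might estimate it, reusing the helpers defined above; run_once is a hypothetical wrapper, the run count of 10 is arbitrary, and retraining a fresh model each time is slow:
# Repeat the split / pad / train / evaluate cycle to estimate the accuracy spread
run_once <- function() {
training_range <- sample(full_range, round(.8 * length(full_range)))
test_range <- setdiff(full_range, training_range)
ctr <- list(train = get_document_set(document_ids[training_range]),
test = get_document_set(document_ids[test_range]))
pad_val <- word_index_df %>% filter(word == "<PAD>") %>% pull(idx)
x_tr <- pad_sequences(ctr$train$x, value = pad_val, padding = "post", maxlen = 256)
x_te <- pad_sequences(ctr$test$x, value = pad_val, padding = "post", maxlen = 256)
m <- keras_model_sequential() %>%
layer_embedding(input_dim = gNumWords_in_Dict, output_dim = 96) %>%
layer_global_average_pooling_1d() %>%
layer_dense(units = 96, activation = "relu") %>%
layer_dense(units = 96, activation = "relu") %>%
layer_dense(units = 96, activation = "relu") %>%
layer_dense(units = 1, activation = "sigmoid")
m %>% compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = list('accuracy'))
m %>% fit(x_tr, ctr$train$y, epochs = 20, batch_size = 50, verbose = 0)
(m %>% evaluate(x_te, ctr$test$y, verbose = 0))$acc
}
accuracies <- replicate(10, run_once())
mean(accuracies); sd(accuracies)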