In Part 1 of this study project I explored the motivation and data of deceptive hotel reviews provided by researchers at Cornell University.
In Part 2, I attempted to replicate their approach using statistical and machine learning packages available from The R Project for Statistical Computing. The machine learning algorithm I used was the Support Vector Machine (SVM). Running the training algorithm several times generated accuracies ranging from 82% to 87%; because the sample size was small, the random selection of training data influenced the training accuracy.
Will a change in technology help? Unlikely, but just for fun (mostly mine LOL) let’s see how a neural net behaves.
I’ll use the very popular neural net package Keras, written by Francois Chollet of Google. The following is adapted from A tutorial on text classification using Keras.
First we’ll reload our tagged database of docs, words, and parts of speech.
library(tidyverse)  # dplyr, tibble, purrr (map), etc.
library(keras)      # also re-exports the %<-% multi-assignment operator used below
reviews.df <- as_tibble(read.csv('./files/reviews_pos.csv', stringsAsFactors=FALSE))
names(reviews.df)[names(reviews.df)=="X1"] <- "word_id"
names(reviews.df)[names(reviews.df)=="Truthfulness"] <- "truthfulness"
names(reviews.df)[names(reviews.df)=="Polarity"] <- "polarity"
reviews.df <- reviews.df %>%
filter(word != '#NAME?') %>%
select(-entity)
# Tag each word as 'lemma/pos' (e.g., lemma 'hotel' with POS 'NOUN' becomes 'hotel/NOUN');
# paste() is vectorized, so no per-row apply is needed
reviews.df$word <- paste(reviews.df$lemma, '/', reviews.df$pos, sep = '')
# Make a dictionary
wordIndex.dict <- count(reviews.df, word, sort = TRUE)
wordIndex.dict$word_index <- 1:length(wordIndex.dict$word)
write.csv(wordIndex.dict, file="./files/word_index_dict.csv")
# And update reviews.df to have a word_index :-) with one line!
reviews.df <- left_join(reviews.df,wordIndex.dict,by="word")
# Collect each document's term indices into a list-column, one row per doc_id
document.term.df <- reviews.df %>%
group_by(doc_id) %>%
summarise(terms_list = list(word_index))
document_ids <- as.vector(distinct(reviews.df, doc_id)$doc_id)
# All term indices for one document, in document order
getTermList <- function(aDocId) {
return(reviews.df[reviews.df$doc_id==aDocId,]$word_index) }
# Binary label for one document: 0 = deceptive, 1 = truthful
getOutputCode <- function(aDocID) {
if (reviews.df[reviews.df$doc_id==aDocID,]$truthfulness[1] == 'deceptive')
{ return(0) }
else { return(1) }
}
# Package a set of documents as x (list of term-index vectors) and y (0/1 labels)
get_document_set <- function(someDocs) {
document_x_data <- unname(sapply(someDocs, getTermList))
document_y_data <- unname(sapply(someDocs, getOutputCode))
return(list(x = document_x_data, y = document_y_data))
}
# 80/20 train/test split over documents
full_range <- 1:length(document_ids)
training_range <- sample(full_range, round(.8*length(full_range)))
test_range <- setdiff(full_range,training_range)
nn_container <- list(train = get_document_set(document_ids[training_range]),
test = get_document_set(document_ids[test_range]))
Like any data science project, much of the ingenuity is in munging the data into the right form in the fewest lines.
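Before handing this to Keras, a quick sanity check that the container has the shape we expect; str() and table() are just base R inspection helpers, and the exact output depends on the random split:
# Peek at the first training document and the label balance
str(nn_container$train$x[[1]])   # an integer vector of word indices
table(nn_container$train$y)      # counts of deceptive (0) vs truthful (1)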
c(train_data, train_labels) %<-% nn_container$train
c(test_data, test_labels) %<-% nn_container$test
### Dictionary of index to words
# wordIndex.dict already maps word -> word_index, so build the lookup frame directly
word_index_df <- data.frame(
word = wordIndex.dict$word,
idx = wordIndex.dict$word_index,
stringsAsFactors = FALSE
)
# The first indices are reserved (following the Keras IMDB tutorial convention)
word_index_df <- word_index_df %>% mutate(idx = idx + 3)
word_index_df <- word_index_df %>%
add_row(word = "<PAD>", idx = 0) %>%
add_row(word = "<START>", idx = 1) %>%
add_row(word = "<UNK>", idx = 2) %>%
add_row(word = "<UNUSED>", idx = 3)
word_index_df <- word_index_df %>% arrange(idx)
# Note: the sequences in train_data/test_data still use the unshifted wordIndex.dict
# indices, so only <PAD> = 0 matters for padding below; to decode a raw sequence
# against this table, add 3 to each index first.
gNumWords_in_Dict <- nrow(word_index_df)  # vocabulary size, including the 4 reserved tokens
## Convert from numbers to words
# Decodes a sequence expressed in the shifted idx space (see note above);
# for our raw sequences, add 3 to each index first
decode_review <- function(text){
paste(map(text, function(number) word_index_df %>%
filter(idx == number) %>%
select(word) %>%
pull()),
collapse = " ")
}
# decode_review(train_data[[1]] + 3)
### Prepare the data
# The reviews (arrays of integers) must be converted to tensors before being fed into the neural network. This conversion can be done a couple of ways:
# One-hot-encode the arrays to convert them into vectors of 0s and 1s. For example, the sequence [3, 5] would become a vector with one dimension per vocabulary word, all zeros except for indices 3 and 5, which are ones. Then make this the first layer in our network: a dense layer that can handle floating point vector data. This approach is memory intensive, though, requiring a num_words * num_reviews size matrix. (A sketch of this approach appears after the padding code below.)
# Alternatively, we can pad the arrays so they all have the same length, then create an integer tensor of shape num_examples * max_length. We can use an embedding layer capable of handling this shape as the first layer in our network.
# Here we'll use the second approach.
# Since the reviews must all be the same length, we'll use the pad_sequences function to standardize the lengths:
train_data <- pad_sequences(
train_data,
value = word_index_df %>% filter(word == "<PAD>") %>% select(idx) %>% pull(),
padding = "post",
maxlen = 256
)
test_data <- pad_sequences(
test_data,
value = word_index_df %>% filter(word == "<PAD>") %>% select(idx) %>% pull(),
padding = "post",
maxlen = 256
)
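For contrast, here is a minimal sketch of the first (multi-hot) approach described above. It isn't used anywhere below, and multi_hot is a hypothetical helper, not part of keras:
# Hypothetical multi-hot encoder: one row per review, one column per vocabulary index
multi_hot <- function(sequences, num_words) {
m <- matrix(0, nrow = length(sequences), ncol = num_words)
for (i in seq_along(sequences)) {
m[i, sequences[[i]]] <- 1   # mark every word index present in review i
}
m
}
# e.g. x_train_hot <- multi_hot(nn_container$train$x, gNumWords_in_Dict)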
We build a neural net whose input is each review as a sequence of integer word indices over a vocabulary of roughly 8,700 tagged words, and whose single sigmoid output unit scores the review between 0 (deceptive) and 1 (truthful). In between we have a few hidden layers. Let's run and test it.
### Build the model
# input shape is the vocabulary count used for the hotel reviews (about 8,700 words, plus the 4 reserved tokens)
model <- keras_model_sequential()
model %>%
layer_embedding(input_dim = gNumWords_in_Dict, output_dim = 96) %>%
layer_global_average_pooling_1d() %>%
layer_dense(units = 96, activation = "relu") %>%
layer_dense(units = 96, activation = "relu") %>%
layer_dense(units = 96, activation = "relu") %>%
layer_dense(units = 1, activation = "sigmoid")
model %>% summary()
## ___________________________________________________________________________
## Layer (type) Output Shape Param #
## ===========================================================================
## embedding_1 (Embedding) (None, None, 96) 839424
## ___________________________________________________________________________
## global_average_pooling1d_1 (Glob (None, 96) 0
## ___________________________________________________________________________
## dense_1 (Dense) (None, 96) 9312
## ___________________________________________________________________________
## dense_2 (Dense) (None, 96) 9312
## ___________________________________________________________________________
## dense_3 (Dense) (None, 96) 9312
## ___________________________________________________________________________
## dense_4 (Dense) (None, 1) 97
## ===========================================================================
## Total params: 867,457
## Trainable params: 867,457
## Non-trainable params: 0
## ___________________________________________________________________________
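As a sanity check, the parameter counts reconcile: the embedding layer holds input_dim * output_dim = 8,744 * 96 = 839,424 weights; each 96-unit dense layer holds 96 * 96 weights plus 96 biases, or 9,312; and the output layer holds 96 weights plus 1 bias, or 97. That gives 839,424 + 3 * 9,312 + 97 = 867,457 in total, matching the summary.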
# Loss function and optimizer
model %>% compile(
optimizer = 'adam',
loss = 'binary_crossentropy',
metrics = list('accuracy')
)
# Create a validation set, held out from the training data
x_val <- train_data[1:150, ]
partial_x_train <- train_data[151:nrow(train_data), ]
y_val <- train_labels[1:150]
partial_y_train <- train_labels[151:length(train_labels)]
# Train the model
history <- model %>% fit(
partial_x_train,
partial_y_train,
epochs = 20,
batch_size = 50,
validation_data = list(x_val, y_val),
verbose=1
)
# EVALUATE the model
results <- model %>% evaluate(test_data, test_labels)
Percent_Accuracy <- format(100*results$acc,digits=2)
results
## $loss
## [1] 0.5000462
##
## $acc
## [1] 0.8875
plot(history)
Sigh. A different suit of clothes didn't help much. The neural net has an accuracy of 89%: on average a few percentage points better than the Support Vector Machine, but with a much narrower standard deviation. This suggests the neural net may be less sensitive to the sampling error introduced by the small sample size.
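That claim about the spread deserves numbers. Here is a minimal sketch of how one might estimate it, reusing the helpers defined above; run_once is a hypothetical wrapper, the run count of 10 is arbitrary, and retraining a fresh model each time is slow:
# Repeat the split / pad / train / evaluate cycle to estimate the accuracy spread
run_once <- function() {
training_range <- sample(full_range, round(.8 * length(full_range)))
test_range <- setdiff(full_range, training_range)
ctr <- list(train = get_document_set(document_ids[training_range]),
test = get_document_set(document_ids[test_range]))
pad_val <- word_index_df %>% filter(word == "<PAD>") %>% pull(idx)
x_tr <- pad_sequences(ctr$train$x, value = pad_val, padding = "post", maxlen = 256)
x_te <- pad_sequences(ctr$test$x, value = pad_val, padding = "post", maxlen = 256)
m <- keras_model_sequential() %>%
layer_embedding(input_dim = gNumWords_in_Dict, output_dim = 96) %>%
layer_global_average_pooling_1d() %>%
layer_dense(units = 96, activation = "relu") %>%
layer_dense(units = 96, activation = "relu") %>%
layer_dense(units = 96, activation = "relu") %>%
layer_dense(units = 1, activation = "sigmoid")
m %>% compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = list('accuracy'))
m %>% fit(x_tr, ctr$train$y, epochs = 20, batch_size = 50, verbose = 0)
(m %>% evaluate(x_te, ctr$test$y, verbose = 0))$acc
}
accuracies <- replicate(10, run_once())
mean(accuracies); sd(accuracies)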