This R Markdown file is meant to give some insight into how we processed, here in R, the data acquired with Python.
We make use of the ggplot2 package for visualisation:
library(ggplot2)
data_raw = read.csv('/Users/danielschene/Desktop/DataFinal.csv')
attach(data_raw)
Here is a summary of our dataframe:
summary(data_raw)
## Work NoOfSen NoOfWords
## 1984 : 1 Min. : 797 Min. : 22385
## AliceInWonderland : 1 1st Qu.: 3530 1st Qu.: 75587
## AnnaKarenina : 1 Median : 5568 Median :110676
## AroundTheWorldIn80Days: 1 Mean : 6489 Mean :141003
## BraveNewWorld : 1 3rd Qu.: 8107 3rd Qu.:167696
## BrothersKaramazov : 1 Max. :26431 Max. :595904
## (Other) :48
## AvgWperS UniqueW UniqRatio LongstSen
## Min. :10.83 Min. : 2724 Min. : 3.739 Min. : 76.0
## 1st Qu.:17.48 1st Qu.: 7078 1st Qu.: 6.004 1st Qu.: 135.5
## Median :21.64 Median : 9297 Median : 8.769 Median : 182.0
## Mean :22.82 Mean : 9891 Mean : 8.726 Mean : 229.2
## 3rd Qu.:26.68 3rd Qu.:12001 3rd Qu.:10.618 3rd Qu.: 272.5
## Max. :55.26 Max. :22282 Max. :16.075 Max. :1360.0
##
## AvgWLen FirstPublished NoOfRatings GRrating
## Min. :3.940 Min. :1605 Min. : 16157 Min. :3.420
## 1st Qu.:4.368 1st Qu.:1851 1st Qu.: 172961 1st Qu.:3.772
## Median :4.608 Median :1876 Median : 280888 Median :3.850
## Mean :4.584 Mean :1870 Mean : 615040 Mean :3.867
## 3rd Qu.:4.781 3rd Qu.:1906 3rd Qu.: 709025 3rd Qu.:4.000
## Max. :5.155 Max. :1960 Max. :3755275 Max. :4.320
##
## LTrating LTpopularity
## Min. :3.450 Min. : 9.0
## 1st Qu.:3.803 1st Qu.: 59.5
## Median :3.890 Median : 143.0
## Mean :3.916 Mean : 242.9
## 3rd Qu.:4.048 3rd Qu.: 306.2
## Max. :4.420 Max. :1345.0
##
str(data_raw)
## 'data.frame': 54 obs. of 13 variables:
## $ Work : Factor w/ 54 levels "1984","AliceInWonderland",..: 23 18 28 5 31 27 44 48 24 20 ...
## $ NoOfSen : int 954 2417 14801 5233 2736 797 6335 1709 8101 9317 ...
## $ NoOfWords : int 26964 40120 323894 66229 80341 22385 115180 45495 140568 112907 ...
## $ AvgWperS : num 28.3 16.6 21.9 12.7 29.4 ...
## $ UniqueW : int 4254 5974 17332 9861 6792 2724 9325 5439 12253 10196 ...
## $ UniqRatio : num 15.78 14.89 5.35 14.89 8.45 ...
## $ LongstSen : int 129 98 174 258 235 144 178 125 175 118 ...
## $ AvgWLen : num 4.49 4.61 4.84 4.97 4.74 ...
## $ FirstPublished: int 1886 1899 1871 1932 1880 1915 1960 1898 1900 1910 ...
## $ NoOfRatings : int 302590 348090 120371 1209004 259185 491629 3755275 76725 24647 66445 ...
## $ GRrating : num 3.8 3.42 3.95 3.98 3.81 3.8 4.26 3.44 3.62 3.96 ...
## $ LTrating : num 3.73 3.57 4.2 3.96 3.82 3.91 4.39 3.45 3.68 3.98 ...
## $ LTpopularity : int 314 147 283 20 185 745 12 1190 835 822 ...
Plot_1 = ggplot(data_raw,
aes(x = UniqRatio, y = LTrating, colour = GRrating)) +
geom_point(size = 2) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(colour = "Goodreads rating",
       x = "Relative amount of unique words in text",
       y = "LibraryThing rating") +
  theme_minimal()
As a first step in looking for interesting patterns in our dataframe, we plot some variables against one another. Here is the rating on LibraryThing as a function of the unique-words ratio (which, we should note, is not a very good measure of vocabulary size).
Plot_1
t.test(UniqRatio, LTrating)
##
## Welch Two Sample t-test
##
## data: UniqRatio and LTrating
## t = 11.454, df = 53.554, p-value = 5.088e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 3.968595 5.653094
## sample estimates:
## mean of x mean of y
## 8.726400 3.915556
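As an aside (not part of the original analysis): the two-sample t-test above only compares the means of `UniqRatio` and `LTrating`, which live on entirely different scales, so a significant difference is expected and says little about their relationship. A correlation test probes the association directly; a minimal sketch, assuming `data_raw` is loaded as above:

```r
# Pearson correlation between the unique-words ratio and the
# LibraryThing rating; tests association rather than a difference in means.
cor.test(data_raw$UniqRatio, data_raw$LTrating)
```

The sign of the estimated correlation should agree with the slope of the GLM fitted below.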
summary(glm(LTrating~UniqRatio))
##
## Call:
## glm(formula = LTrating ~ UniqRatio)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.44302 -0.12575 0.00217 0.10412 0.45506
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.183852 0.083892 49.872 < 2e-16 ***
## UniqRatio -0.030745 0.009075 -3.388 0.00135 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.0413681)
##
## Null deviance: 2.6259 on 53 degrees of freedom
## Residual deviance: 2.1511 on 52 degrees of freedom
## AIC: -14.796
##
## Number of Fisher Scoring iterations: 2
Let's look at the relationship between average sentence length and the score on Goodreads:
summary(glm(GRrating~AvgWperS))
##
## Call:
## glm(formula = GRrating ~ AvgWperS)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.47718 -0.09898 -0.01846 0.14621 0.43664
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.976654 0.085806 46.345 <2e-16 ***
## AvgWperS -0.004788 0.003545 -1.351 0.183
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.04430041)
##
## Null deviance: 2.3844 on 53 degrees of freedom
## Residual deviance: 2.3036 on 52 degrees of freedom
## AIC: -11.098
##
## Number of Fisher Scoring iterations: 2
Perhaps it would be better to first have a look at how the number of ratings may affect the ratings themselves, and also to see whether the length of a text is indicative of the number of ratings, i.e. of the number of people who actually finish the book:
It seems that the more ratings a book receives, the higher its rating gets. One possible explanation is that people are sensitive to the rating a book already has when they are about to rate it, and are more likely to give a rating, especially a good one, if the book is already rated highly.
Plot_3
There also seems to be some effect of length on the number of ratings: the longer a work is, the fewer ratings it receives. (The axis scales differ between the plots because the ranges of these values vary strongly between works.)
Plot_4
Yet, here we see a trend of longer works generally receiving a higher rating.
Plot_5
So long works are rated less often, but higher; short works are rated more often, but lower. However, more ratings also indicate a higher score, which would suggest that short works, because they receive more ratings, should have a higher rating than long works, yet the opposite is the case. This small dataset is perhaps not suited for making such claims, but the pattern is nevertheless apparent, small as the sample is.
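The code for Plot_3 through Plot_5 is not echoed in this document. A Plot_3-style figure could be built the same way as Plot_1; a minimal sketch, assuming `data_raw` and ggplot2 are loaded as above (the exact aesthetics of the original plots may differ):

```r
# Sketch: Goodreads rating as a function of the number of ratings,
# with a linear trend line (cf. the discussion of Plot_3 above).
ggplot(data_raw, aes(x = NoOfRatings, y = GRrating)) +
  geom_point(size = 2) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Number of ratings on Goodreads", y = "Goodreads rating") +
  theme_minimal()
```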
These visualisations may be interesting, but we wanted to take it a step further and see whether we could predict user ratings based on the text-internal values we've extracted with Python. R's neuralnet package provides a means of training a simple neural network for making predictions. This can then be compared with the predictions of a generalized linear model. Let me stress that I am not at all an expert when it comes to neural networks; I've come across them in another course, and was eager to apply the little knowledge I have of them to this experimental project.
We make use of the dplyr package, and the aforementioned neuralnet package.
A pre-analysis of the variables gave us an indication of which ones tend to have the strongest predictive power. These were:

- Average words per sentence
- Relative amount (%) of unique words in the text
- Total amount of unique words
Using dplyr, we make a new dataframe:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
neural_data = select(data_raw, AvgWperS, UniqRatio, UniqueW, GRrating)
Now we set a sample size of 0.75: 75% of the data will be used to train the network, and the rest will be used for testing:
samplesize = 0.75*nrow(neural_data)
set.seed(1)
neural_index = sample(seq_len(nrow(neural_data)), size = samplesize)
training_data = neural_data[neural_index,]
testing_data = neural_data[-neural_index,]
Next we min-max scale the data so that all variables lie on a comparable [0, 1] range, and then we select the training and test sets:
max = apply(neural_data, 2, max)
min = apply(neural_data, 2, min)
neural_df_scaled = as.data.frame(scale(neural_data, center = min, scale = max - min))
train_nn = neural_df_scaled[neural_index,]
test_nn = neural_df_scaled[-neural_index,]
A quick look at the sets:
head(train_nn)
## AvgWperS UniqRatio UniqueW GRrating
## 15 0.03302681 0.45953658 0.24573065 0.7222222
## 20 0.17526625 0.09905814 0.42294713 0.8666667
## 30 0.00000000 0.16725318 0.09142039 0.4222222
## 47 0.28463428 0.18278338 0.81005215 0.4000000
## 11 0.19489556 0.02676351 0.65093568 1.0000000
## 45 0.32488979 0.34875836 0.45152879 0.4555556
head(test_nn)
## AvgWperS UniqRatio UniqueW GRrating
## 5 0.4172195 0.3821886 0.2079967 0.4333333
## 6 0.3884595 0.6833255 0.0000000 0.4222222
## 18 0.2296493 0.5859178 0.2893956 0.5333333
## 19 0.3814434 0.2527365 0.5838532 0.4777778
## 22 0.1583433 0.3899677 0.3928827 0.4666667
## 25 0.3567060 0.2465112 0.2851007 0.4333333
Now we can make our neural network:
set.seed(2)
attach(train_nn)
## The following objects are masked from data_raw:
##
## AvgWperS, GRrating, UniqRatio, UniqueW
library(neuralnet)
##
## Attaching package: 'neuralnet'
## The following object is masked from 'package:dplyr':
##
## compute
my_NN = neuralnet(GRrating ~ AvgWperS + UniqRatio + UniqueW,
                  data = train_nn, hidden = 3, linear.output = TRUE, rep = 800)
Next we let the network predict the ratings of the test set, and we scale the predictions back to the original rating range so the numbers make sense again:
predict_test_nn = neuralnet::compute(my_NN, test_nn[,c("AvgWperS", "UniqRatio", "UniqueW")])
predict_test_nn = (predict_test_nn$net.result*(max(neural_data$GRrating) - min(neural_data$GRrating))) + min(neural_data$GRrating)
predict_test_nn
## [,1]
## 5 3.805103417
## 6 3.807429676
## 18 3.784023567
## 19 3.789162713
## 22 3.866330297
## 25 3.913455651
## 26 4.086885915
## 33 3.776953034
## 34 4.009441904
## 35 3.791983640
## 39 3.834674537
## 41 4.055562209
## 44 3.811807843
## 49 3.521556794
This is what we get. Now we can compare against the testing set we created earlier:
testing_data
## AvgWperS UniqRatio UniqueW GRrating
## 5 29.36440058 8.453964974 6792 3.81
## 6 28.08657465 12.168863078 2724 3.80
## 18 21.03053645 10.967218690 8384 3.90
## 19 27.77484514 6.857012644 14143 3.85
## 22 17.86236244 8.549929353 10408 3.84
## 25 26.67574635 6.780214843 8300 3.81
## 26 22.54564716 3.739192890 22282 4.10
## 33 24.73040650 10.087315572 7671 3.77
## 34 26.67696391 5.015181240 12603 4.07
## 35 17.44778859 10.748582668 10219 3.59
## 39 17.56517036 11.711004372 7607 3.79
## 41 22.60707733 6.033903022 5204 3.99
## 44 23.69158291 9.738314288 7346 3.83
## 49 30.43614931 11.072996201 12008 3.80
We can look at this and try to compare by eye, but it is better to get some kind of measurement of how accurate the predictions were. A good and easy way to operationalize accuracy is the mean squared error (MSE). This takes the difference between the predicted value and the actual value, squares it (to prevent positives and negatives cancelling each other out), and then takes the mean:
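In symbols, over the $n$ test items:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2,$$

where $\hat{y}_i$ is the predicted rating and $y_i$ the actual rating, which is exactly what the line below computes.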
MSE_neuralnet <- sum((predict_test_nn - testing_data$GRrating)^2)/nrow(testing_data)
MSE_neuralnet
## [1] 0.01124771644
This is relatively low, so we may say our model has done a good job. In order to assess whether it is really better than a linear model, we will compare the two:
lm_fit <- glm(GRrating ~ AvgWperS + UniqRatio + UniqueW, data=training_data)
summary(lm_fit)
##
## Call:
## glm(formula = GRrating ~ AvgWperS + UniqRatio + UniqueW, data = training_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.3948860 -0.1155299 0.0107101 0.1581974 0.3248531
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 4.610251180827 0.219695499282 20.98473
## AvgWperS -0.009692233041 0.003964092112 -2.44501
## UniqRatio -0.040305846054 0.012729482936 -3.16634
## UniqueW -0.000016966595 0.000009123007 -1.85976
## Pr(>|t|)
## (Intercept) < 0.000000000000000222 ***
## AvgWperS 0.0195088 *
## UniqRatio 0.0031387 **
## UniqueW 0.0711065 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.04475949933)
##
## Null deviance: 2.1616975 on 39 degrees of freedom
## Residual deviance: 1.6113420 on 36 degrees of freedom
## AIC: -4.9574012
##
## Number of Fisher Scoring iterations: 2
pr_lm <- predict(lm_fit, testing_data)
MSE_lm <- sum((pr_lm - testing_data$GRrating)^2)/nrow(testing_data)
MSE_lm
## [1] 0.014066016
MSE_neuralnet
## [1] 0.01124771644
It turns out that the linear model is very accurate too, to the point where the two hardly differ in predictive "power". The NN is still slightly better, but its margin over the GLM is very small. This probably has to do with the fact that our dataset is very small: to be accurate, a NN needs a very large set of data to train on, which is not the case here. Still, the MSEs of both models are low in general, which means that reader scores can actually be predicted with some accuracy, which is a cool finding for this project.
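With only 54 works, the MSE comparison also depends heavily on which books land in the single 75/25 test split. One way to make it more robust would be to repeat the split several times and average the test MSEs; a minimal sketch for the GLM side, assuming `neural_data` is built as above (we did not run this as part of the analysis):

```r
# Repeated random train/test splits for the GLM; averaging the test MSEs
# gives a more stable accuracy estimate on a small dataset.
set.seed(3)
mse_glm <- replicate(20, {
  idx  <- sample(seq_len(nrow(neural_data)),
                 size = floor(0.75 * nrow(neural_data)))
  fit  <- glm(GRrating ~ AvgWperS + UniqRatio + UniqueW,
              data = neural_data[idx, ])
  pred <- predict(fit, neural_data[-idx, ])
  mean((pred - neural_data[-idx, "GRrating"])^2)
})
mean(mse_glm)
```

The same loop could wrap the neuralnet fit, though retraining the network twenty times with `rep = 800` would be slow.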