Introduction

It is common to conflate protagonists with heroes, and based on the genre or plot, we could be more prone to making such a conflation. It is usually tales that involve some kind of conquest, adventure or transformation that might hint at the protagonist being akin to the “hero” of the story. By hero, we mean that they’re viewed as powerful people and are seen as generally on the side of “the good”, and thus might be associated with positive sentiments throughout the novel, even if they partake in behaviour that might suggest otherwise. It is this phenomenon that made us curious about how characters are perceived in the larger world, and how their perception might differ in the world of the novel they belong to. Considering this, we created a formalized research question that we could test using sentiment analysis: Is the protagonist always the quintessential hero? Further, does someone always need to be viewed as good for them to be on the side of “the good”?

Corpus Description

Before stating the research hypothesis, we found that it was important to locate the question in a particular corpus because of the specific nature of characters we’d be analyzing. We wished to analyze someone who fits the bill of a “hero” for the world at large, but perhaps not within the world of the novel. Further, to make the analysis stronger, we also wished to analyze a strong, recurring character so that we’d have a wealth of data to work with. Based on these stipulations, we narrowed our corpus down to the Sherlock Holmes series, written by Sir Arthur Conan Doyle. Sherlock Holmes is the protagonist of this series and he works as a detective, which invariably involves going on a series of adventures and experiencing several mishaps, thus fitting into the common trope we described earlier. He’s also often described as a genius, but also as someone with a cold demeanour. With this, we have a character who is arguably on the “good side” because he helps to solve crime; a character who is generally well-liked by readers; but also a character who classifies better as an anti-hero rather than a hero. Thus, the Sherlock Holmes series served as an ideal corpus to formulate a research hypothesis based on our initial question.

Research Hypothesis

Our hypothesis is that “Sherlock Holmes will not be viewed positively by the characters of the series’ universe”.

Summary

The complete corpus of the Sherlock Holmes series comprises 4 novels and 56 short stories. Of these, we have all 4 novels: A Study in Secret (1887), The Sign of Four (1890), The Hound of the Baskervilles (1901-1902), and The Valley of Fear (1914-1915). Sir Arthur Conan Doyle wrote the 56 short stories in 5 collections of Sherlock Holmes books, of which we have 4, which amount to 44 short stories. These collections are The Adventures of Sherlock Holmes (1892), The Memoirs of Sherlock Holmes (1894), The Return of Sherlock Holmes (1905), and His Last Bow (1917). The only collection of short stories that is missing is The Case-Book of Sherlock Holmes (1927), which contains 12 short stories, and isn’t available on Project Gutenberg.

The word count of the corpus amounts to a total of 600,012 words, giving us a little over half a million words to work with. The words with the highest frequencies are: Holmes (mentioned 2,317 times), time (mentioned 849 times), sir (mentioned 755 times), Watson (mentioned 754 times), house (mentioned 741 times), night (mentioned 664 times), door (mentioned 659 times), hand (mentioned 637 times), found (mentioned 559 times), and eyes (mentioned 549 times).

The most frequently mentioned words are related to the main characters (Holmes and Watson), as well as words that suggest that spatial imagery plays a big role in this series (door, house, found, eyes, etc.). This isn’t surprising because the series largely revolves around Holmes’ and Watson’s adventures in solving crime, which would definitely involve hunting for clues quite often.

freq_long <- sherlock_import %>%
  unnest_tokens(word, text) %>%
  count(word) %>%
  arrange(desc(n))

freq_short <- freq_long %>%
  anti_join(stop_words)
## Joining, by = "word"
sum(freq_long$n) # Number of total words
## [1] 600012

Data Visualisation 1: Sentiment Analysis

For the first data visualization, we used the Bing and Afinn sentiment lexicons to analyze sentiment surrounding the words ‘Holmes’, ‘Sherlock’, and ‘detective’. These were the words we used to identify Sherlock as a character in the texts. We tokenized the words as bigrams and dropped all rows in which none of the words were the 3 identifying words. For both Afinn and Bing, we did an inner_join() of the lexicons, leaving us with the sentiment words before or after the identifying word.

# Plots the bing histogram

bing_sentiments %>%
  ggplot(aes(sentiment)) + 
  geom_histogram(stat="count")
## Warning: Ignoring unknown parameters: binwidth, bins, pad

The Bing histogram shows that a greater number of negative words were used around the subject of our study (44) as compared to positive words (35), showing that the character was mostly described with negative words. This could include adverbs (showing an action being seen as negative, so it is negative but limited to a moment in time) or just adjectives (describing the actual character, not limited to a moment in time).

# Plots the graph for Afinn analysis

afinn_sentiments %>%
  ggplot(aes(total)) + 
  geom_histogram(binwidth = 1)

The Afinn histogram shows that a cluster forms around 0. Negative values seem more concentrated here, with a high frequency of words with the value -2. Words with positive values are more spread out, whereas there is only one outlier on the negative side of the x-axis.

# Displays tibble of Afinn sentiment values
arrange(afinn_sentiments, desc(total))
## # A tibble: 62 x 2
##    sentiment_word total
##    <chr>          <dbl>
##  1 smiled            36
##  2 dear              30
##  3 smiling           26
##  4 laughing          12
##  5 laughed           11
##  6 desired            4
##  7 hailed             4
##  8 save               4
##  9 exultantly         3
## 10 gracious           3
## # ... with 52 more rows

“Smiled” was the highest positive valued word in Afinn, but a character smiling isn’t an indication of agreeability because someone can smile sarcastically, sadly, cruelly, etc. For example, in His Last Bow “Holmes smiled with an expression of weary patience”, indicating a negative and sarcastic tone.

“Dear” was the second-highest positive value. But this might be a case where Afinn might fail to accurately detect sentiment, as it might just be the way people addressed others in the 20th century. In the series, most instances of “dear“ were when Watson addressed Holmes, which is indicative of Watson’s manner of speaking. It is interesting that the only mention of “dear” with Sherlock is by Watson, his closest friend, and not by other characters. Other mentions of the word, while not for Sherlock, are very common and seem to be just a common manner of speaking.

Data Visualisation 2: Correlations

For this tool, we looked at the most significant correlations for the words ‘detective’, ‘Sherlock’, and ‘Holmes’. We filtered the data to only show us results that are less than or equal to 0.05 in the correlation value, giving us only significant results.

# Shows graph of correlations surrounding the identifying words and performing an inner_join of the Bing lexicon, letting us look at correlations with sentiment words

sherlock_cors %>%
  filter(item1 == "sherlock" | item1 == "holmes" | item1 == "detective")%>%
  inner_join(bing, by = c(item2 = "word")) %>%
  group_by(item1) %>%
  slice_max(correlation, n = 6) %>%
  ungroup() %>%
  mutate(item2 = reorder(item2, correlation)) %>%
  ggplot(aes(item2, correlation)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ item1, scales = "free") +
  coord_flip()

From the correlation graphs we found that the word ‘gravely’ was associated with all 3 character identifiers we chose, with a significance value of around 0.04 for all. ‘Smart’ was mentioned for ‘detective’ and ‘Sherlock’, but not for ‘Holmes’. Most of the negative words have been correlated to the word ‘Holmes’. From this, it is possible to infer that for people who know Sherlock Holmes closely, referring to him by positive terms is not an unlikely event, as well as for people who purely know him in his capacity as a detective. However, our results might have been affected by the fact that Sherlock has a brother with the same surname as him, and so some of the words might have correlated with mentions of Mycroft as well.

Conclusion

Based on the results of our 2 data visualizations, we find that we have more evidence in support of our hypothesis than not, with the exception being Afinn’s sentiment analysis of our corpus. However, even with Afinn, we showed that many of the words it assigned positive values to were not used in a strictly positive context. Of course, this can be said about the negative words too, but we believe that the correlations we found for our hypothesis helped to corroborate our hypothesis.

However, the data we presented is still riddled with confounders, such as the presence of Mycroft Holmes and the lack of access to seeing sentiment around Sherlock’s pronouns.

As for answering our research question(s), we conclude that no, Sherlock is not the quintessential hero, nor does he need to be viewed as one in order to be on the side of ‘the good’. We also believe that this is an assessment that is transferable to other bodies of literature.

Reflection

We found that the tools we had available for conducting a sentiment analysis on R were much more powerful than what we had at our disposal with Voyant. Data could be manipulated with a lot more ease, and since we were the ones writing the code, there was no confusion about what a certain data point might mean; everything was very clear to us regarding what the data represents. In Voyant, there were several tools that were either confusing to use or confusing to interpret since it didn’t offer justifications for what each tool can do, and in which situations.

However, there were quite a few limitations as well. One of the main limitations we found was that the lexicon used for the tools might not fully capture the common lexicon of the 19th and 20th Centuries (which is when Sherlock Holmes was written). We also found that Afinn’s system of grading words with a certain value (ranging from -5 to 5) in order to imply the extent of their positive or negative meaning was rather arbitrary, and might not always align with the context of the text. For example, assigning “cried” as a negative word could very well be false in case the subject was just shouting loudly rather than crying out of sadness. Bing, on the other hand, classifies words as only negative or positive, which makes the sentiment analysis over-simplified (this is partly resolved by Afinn, but it still stands that its gradation of words can be seen as arbitrary). Finally, we did not end up using NRC because it classified words in a peculiar way that we didn’t agree with. For example, the word ‘dying’ was associated with the sentiment of ‘anger’, but dying is not an act that is necessarily angry.

Apart from all this, there is the general limitation found throughout Voyant and R, and probably in other tools too, which is that we’re limited in our analysis by the lack of information we have about how people feel about Sherlock Holmes while only referring to him by his pronouns.

Overall, R proved to be a lot more useful and nuanced than what Voyant was offering, but it was still not perfect. However, given the limitations of our own skill as well as the limitations of technology in general, we weren’t expecting it to not leave us with some loose ends.