I’ve created an online tool that applies sentiment analysis, based on reader comments, to articles published in the Guardian newspaper. You can jump in and try it here, or find a short explanation here. Below I’ll discuss what it is trying to do and how it does it.
While studying the Data Manipulation at Scale MOOC I was introduced to a simple method of sentiment analysis. The aim of sentiment analysis is to extract indicators of opinion from natural language sources. In the course this was performed on a Twitter feed, but I thought it would be interesting to apply the same methodology to the comments section in the Guardian newspaper.
I wanted to see if an algorithm could match up with my perceptions of how an article is received by the readers. It would have a couple of advantages over applying the same algorithm on Twitter:
- The comments are specifically about the same news article, whereas Twitter content generally has less context.
- Guardian readers, and commentators, are a self-selecting group, so they might form a more coherent population than Twitter’s.
I’ll go into a detailed explanation of the methodology below, but here’s a summary:
- The sentiment of a comment is derived by parsing it for words that might indicate either positive or negative feelings. For example the words “superb” or “thrilled” would suggest positive sentiment, but “catastrophic” or “fraud” would imply the reverse.
- Comments can be recommended by other readers; these are used to weight the sentiment.
- An average is taken across the comments to get an overall score.
Before getting into detail I’ll give some illustrative results. Overall I think it actually does a pretty reasonable job of assessing user reaction to an article. The key is that although individual comments are often classified inaccurately, the score is aggregated across enough comments to make the overall score fairly representative.
An overall score gives the total sentiment measure. If this is greater than +10 it is considered strongly positive, less than -10 strongly negative, with most articles falling somewhere in between.
Here’s more detail about how the scores are calculated.
First the comments are selected. Like many forums or message boards, the comments are organised into threads: after one reader comments on an article, others can reply to it. In the analysis only the head comments are used; in other words, all the replies are ignored. This is because I’m looking to assess sentiment towards the article, not towards the comments. Negative replies could be negative towards the article or towards the comment they are responding to. You could try to be clever and say a negative reply to a negative comment probably represents a positive sentiment towards the article, but by doing that you’re likely to just compound any mis-classifications.
Only the first 100 comments are taken. This is partly for efficiency, but also because these are the most read and hence attract the majority of the recommends.
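To make the selection concrete, here’s a minimal sketch. It assumes each comment is a dict with a hypothetical `reply_to` key (`None` for head comments); the real comment structure returned by the Guardian will differ.

```python
# A minimal sketch of the comment selection. It assumes each comment is a
# dict with a hypothetical "reply_to" key (None for head comments); the
# real structure returned by the Guardian will differ.
def select_head_comments(comments, limit=100):
    """Keep only top-level comments, capped at the first `limit`."""
    heads = [c for c in comments if c.get("reply_to") is None]
    return heads[:limit]
```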
With the comments selected, the following procedure is used to find a sentiment score.
- A small amount of cleaning is done on the comments, in particular any quoted text is removed. This is because the quoted text is often something the commentator is disagreeing with.
- The sentiment value is calculated by searching for sentiment indicating words from this list. This is the same list I used on the course previously, and I’m very grateful to Finn Arup Nielsen for making it available.
- This is normalised by dividing by the word count to get a sentiment per word.
- The score is then weighted by the number of recommends it received, rescaling so that the average recommend-weighting is equal to 1. An extra recommend is added to each comment to represent the poster themselves.
- The score is multiplied by an arbitrary factor of 100 to make it more presentable.
All that gives each comment a score, and the overall article score is just a straight average.
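Here’s an illustrative sketch of those steps in Python. The `SENTIMENTS` dict is a tiny stand-in for the full word list, and the function names are mine, not the actual implementation.

```python
import re

# An illustrative sketch of the scoring steps, not the exact implementation.
# SENTIMENTS is a tiny stand-in for the full sentiment word list.
SENTIMENTS = {"superb": 3, "thrilled": 3, "catastrophic": -3, "fraud": -4}

def comment_score(text, recommends):
    """Sentiment per word, weighted by recommends (+1 for the poster), x100."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    per_word = sum(SENTIMENTS.get(w, 0) for w in words) / len(words)
    return per_word * (recommends + 1) * 100

def article_score(comments):
    """Average the comment scores, rescaled so the mean recommend weight is 1.

    `comments` is a list of (text, recommends) pairs.
    """
    if not comments:
        return 0.0
    mean_weight = sum(r + 1 for _, r in comments) / len(comments)
    total = sum(comment_score(text, r) for text, r in comments)
    return total / (len(comments) * mean_weight)
```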
This is a very simple algorithm for attempting to extract some meaning from natural text, so it’s not hard to find problems with it. As mentioned above, it does appear to me to give reasonable results despite this, since errors are on the whole averaged away. Here are just some of the problems; there are plenty more.
The first issue is exactly what a positive or negative sentiment means. With simple news items this isn’t really a problem: commentators are just reacting to what has happened. However for opinion pieces it can be harder, since negative comments might be directed against the author or the subject of the piece. For instance, negative comments on a comedy item about the prime minister might be directed at the politician or at the author’s weak humour. For this reason straight news items tend to give better results. One example of positive comments directed towards the author being balanced by negative comments concerning the content is here. Even in straight news articles it may not be clear exactly what the sentiment reflects. The positive comments here are for the train, not the fact it was stopped by trespassers.
It’s not hard to find examples of comments that are mis-classified. A simple example is the phrase “oh dear”. This actually gets a positive sentiment score due to the word “dear”. Equally, “perfect lies” would score as positive, since “lies” isn’t classified and “perfect” is viewed favourably. A relatively common mis-classification for the same reasons is “Jesus wept”. It should be pointed out that the sentiment valuations are intended for Twitter, which probably has quite different vocabulary usage to Guardian commentary.
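You can see the mechanism behind this failure with a toy version of the scorer. The word values below are illustrative, not the exact entries from the real list:

```python
# A toy version of the scorer showing the "oh dear" problem. The word
# values here are illustrative, not the exact entries from the real list.
word_values = {"dear": 2, "perfect": 3}  # "oh", "lies", "wept" unlisted -> 0

def naive_sentiment(phrase):
    """Sum the per-word values; idioms are invisible to this approach."""
    return sum(word_values.get(w, 0) for w in phrase.lower().split())

naive_sentiment("oh dear")       # comes out positive despite the idiom
naive_sentiment("perfect lies")  # positive, as only "perfect" is listed
```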
Another effect, even if it isn’t always a problem, is that some comments can have a disproportionate effect on the outcome, in particular single-word insults. These will have a very high negative sentiment-per-word score, so with a significant number of recommends they will have a large effect on the overall score. The off-the-scale negative rating of this is a perfect example.
Of course one of the biggest problems is the British fondness for sarcasm. It’s enough to give this article a positive reaction when clearly that’s not the readers’ view, with the comment “another tory triumph!!!” seemingly expressing joy at the rise in homelessness. This would appear to be an active area of research, see here and here.
While it is fun to look at the ratings of individual articles I thought it would be instructive to run the algorithm across multiple articles and see how they compare. However the relatively slow performance meant I needed to restrict how many articles are valued at once. You can see and use the result here, though it’s worth pointing out that it isn’t all that robust (see below).
This allows you to browse for articles, sometimes finding ones which provoke a surprisingly strong reaction, for example.
I’d like to put the code up on GitHub but I need to check a couple of things first.
The valuation code is in Python for two reasons. First it’s my rapid development language of choice, and second this allows me to use the word list author’s own valuation implementation.
However this did present me with a problem. This website is hosted on a LAMP system typical of WordPress blogs such as this one, with no Python hosting capability. The solution was to host a Python back-end component elsewhere and call it from a PHP front-end on this server. This was probably actually a good thing because it forced a clean separation of the back-end logic in the form of an API, and a front-end presentation layer that talks to it over a JSON interface. The API was hosted on Flask, something I’ve not used before but found very straightforward for this purpose.
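For illustration, a Flask endpoint of this general shape is only a few lines. The route, parameter and field names below are invented, not the real API, and `score_article` is a hypothetical stand-in for the valuation logic:

```python
# A sketch of the kind of JSON endpoint involved; the route, parameter and
# field names here are invented, not the real API.
from flask import Flask, jsonify, request

app = Flask(__name__)

def score_article(article_id):
    # hypothetical stand-in for the real valuation logic
    return 12.3

@app.route("/sentiment")
def sentiment():
    article_id = request.args.get("article")
    return jsonify({"article": article_id, "score": score_article(article_id)})

# app.run()  # uncomment to serve locally
```

The PHP front-end then only needs to make an HTTP request and decode the JSON body, which keeps the two layers cleanly separated.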
However, this set-up has left the system a little brittle; you may occasionally see results fail to return. This is at least in part because the Python hosting is on a free plan, and of course you get what you pay for.
After implementing the list view functionality, see above, I found it to be very slow, with queries taking a few minutes. There’s still work to be done here, but a couple of things have helped.
- Results of external API calls are now cached in a MongoDB instance. I originally just used a local file cache, but using Mongo is not only more scalable but actually required fewer lines of code. This could be extended by caching sentiment valuations too.
- Some parallel processing was added as per my previous blog entry.
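The caching follows a common look-aside pattern. Here’s a sketch against a MongoDB-like interface (`find_one` / `update_one`); a real pymongo collection drops straight in, and the tiny in-memory fake just keeps the example self-contained:

```python
import functools

# A sketch of the caching pattern: before an expensive external call is
# made, the result is looked up in a store with a MongoDB-like interface
# (find_one / update_one). A real pymongo collection drops straight in;
# the fake below just keeps the example self-contained.
def cached(store, fetch):
    @functools.wraps(fetch)
    def wrapper(url):
        hit = store.find_one({"_id": url})
        if hit is not None:
            return hit["body"]
        body = fetch(url)
        store.update_one({"_id": url}, {"$set": {"body": body}}, upsert=True)
        return body
    return wrapper

class FakeCollection:
    """In-memory stand-in for a pymongo collection (illustration only)."""
    def __init__(self):
        self.docs = {}
    def find_one(self, query):
        return self.docs.get(query["_id"])
    def update_one(self, query, update, upsert=False):
        self.docs[query["_id"]] = {"_id": query["_id"], **update["$set"]}
```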
There are still a number of optimisations that could be made. The biggest performance gains would probably come from any reduction in network traffic.
I think the system does a reasonable job of aggregating reader sentiment; however, that is no more than my subjective assessment. To validate it I guess I should do something like this:
- Go through sets of comments and apply a score to each, indicating whether they should be considered positive or negative. Or better, ask someone else to do it who isn’t going to be swayed by knowledge of the algorithm.
- Use these values, and possibly the recommends, to get a sentiment value.
- Analyse the correlation between the automatically generated sentiment values and those from the manual method.
- Do the same for the reply comments too, to assess whether ignoring them is a valid approach.
As long as you separate the training and test data, you could also use the manually applied sentiment values to help train the system and improve it. All this would take a lot of time.
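The comparison step itself is straightforward to sketch: a Pearson correlation between the two sets of scores, where the inputs would be the manual and automatic valuations (the test numbers below are toy data, not real results).

```python
import math

# Sketch of the validation comparison: the Pearson correlation between
# manually assigned scores and the algorithm's scores.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```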
One simple improvement could be to use stemming. This involves taking only the root of a word: for example “hateful” might be reduced to “hate”, as would “hating” or “hated”. This would mean better coverage; “hateful” is an example of a word that isn’t in the sentiment list but whose root “hate” is.
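A deliberately crude suffix-stripper shows the idea; it mishandles plenty of words, and real work would use a proper algorithm such as the Porter stemmer:

```python
# A deliberately crude suffix-stripper to show the idea; it mishandles
# plenty of words ("jumping" -> "jumpe"), and real work would use a
# proper algorithm such as the Porter stemmer.
def crude_stem(word):
    for suffix in ("ful", "ing", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # restore a dropped final "e" for cases like "hating" -> "hate"
            if suffix in ("ing", "ed") and not stem.endswith("e"):
                stem += "e"
            return stem
    return word
```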
Another simple improvement might be to remove short words such as “a”, “or”, “the” and so on from the word count, meaning that single-word insults don’t carry quite as much relative weight.
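Sketched, that’s just a filter over the words before counting (the stop list below is illustrative, not exhaustive):

```python
# Sketch: drop short function words before taking the word count, so a
# one-word insult no longer gets the maximum per-word weight. The stop
# list here is illustrative, not exhaustive.
STOP_WORDS = {"a", "an", "or", "the", "and", "of", "to", "is", "it"}

def content_word_count(words):
    return sum(1 for w in words if w.lower() not in STOP_WORDS)
```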
There would be some interesting avenues to explore off the back of the basic system. For example:
- At the minute it’s only possible to list a small number of articles at once. It would be interesting to see, say, the top 5 positive and negative articles across a week or month.
- It might be interesting to try to link words or names that appear in an article headline with sentiment, and more interesting to see if you could then track that over time. So for example, could you detect if articles about a political figure or a project such as CrossRail were evolving?
- Another British newspaper, the Daily Mail, also attracts a significant number of comments on its website. It would be interesting to apply the same algorithm there, and then compare reactions to similar news items, especially given the two publications are close to being political polar opposites.
This was a fun project to see if sentiment analysis could be applied in this context, and by and large it looks as if it can.
However, I’m not sure it could really be said to provide any insight at the minute. That would take more work to validate and improve the approach, and then further effort to derive some meaning from the results.