While print circulation of newspapers in the UK continues to fall, they retain a physical presence which can be hard to avoid. You are exposed to their front pages every time you go to a supermarket or pay for petrol, so even if you would never consider buying a particular paper you might still be familiar with its front page. They act as mini-billboards, with the main headline being all that most people actually read.
I wanted to see if I could quantify the biases in their choice of lead headline. In particular, I wanted to check whether the Express was really as obsessed with Europe as it appeared, and the Daily Mail as preoccupied with migrants. To that end it was a success.
All the data was extracted from The Paperboy website. I didn’t have to do anything fancy like text recognition, because the front-page headlines themselves are available on the site. However, the data isn’t perfect. With broadsheets in particular, which carry multiple headlines on their front page, the headline recorded can be a minor one rather than the main story. Furthermore, some recorded headlines just seem to be inaccurate, or at the very least refer to earlier editions. In aggregate the data isn’t too bad though, particularly for the Mail and the Express.
I was constrained by the data available. That’s why the Mirror, Sun and Times don’t feature, which is a pity as they would have made the comparison rather more interesting.
The code was written in Python, and I made use of the natural language processing library nltk.
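To give a flavour of the kind of counting involved, here is a minimal sketch rather than the actual code behind this post. It tokenises a few headlines and measures how often a keyword turns up. The headlines are made-up placeholders, and the helper names `word_counts` and `share_mentioning` are hypothetical. It uses nltk's `wordpunct_tokenize` when nltk is installed and otherwise falls back to an equivalent regular expression.

```python
import re
from collections import Counter

try:
    # nltk's wordpunct_tokenize is regex-based and needs no extra data downloads.
    from nltk.tokenize import wordpunct_tokenize
except ImportError:
    # Fallback using the same pattern, so the sketch runs without nltk installed.
    def wordpunct_tokenize(text):
        return re.findall(r"\w+|[^\w\s]+", text)

# Made-up placeholder headlines, NOT real data from any newspaper.
HEADLINES = [
    "EU plan sparks fury",
    "New EU rules on the way",
    "Weather chaos hits roads",
]


def word_counts(headlines):
    """Count alphabetic tokens across all headlines, ignoring case."""
    counts = Counter()
    for headline in headlines:
        tokens = wordpunct_tokenize(headline.lower())
        counts.update(t for t in tokens if t.isalpha())
    return counts


def share_mentioning(headlines, keyword):
    """Return the fraction of headlines containing keyword as a whole token."""
    if not headlines:
        return 0.0
    hits = sum(keyword in wordpunct_tokenize(h.lower()) for h in headlines)
    return hits / len(headlines)


counts = word_counts(HEADLINES)
print(counts.most_common(3))
print(share_mentioning(HEADLINES, "eu"))
```

Matching whole tokens rather than substrings avoids counting "eu" inside unrelated words such as "reunion", though a real analysis would probably also need to group related terms ("Europe", "Brussels") under one topic.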
After extracting the headlines I did the following: