Blog Analysis Part Two: Lexicon

Published in Blogging - 4 mins to read

Having looked at the feelings behind the words in this blog yesterday, today I want to look at the words themselves. As of the time of writing (but not including this post) there are 176,219 words across 685 posts - which makes it about twice the length of the average novel. The mean length of a post is 256.32 words, the median is 165, and the mode is 207, with eight posts of that length. Assuming a reasonably generous reading speed of 250 words per minute, the average post takes a hair over 60 seconds to consume, but the entire corpus (is describing this blog as a “corpus” disgustingly narcissistic?) would take roughly eleven hours and 45 minutes to get through. The longest post I’ve written to date is “An open letter to all my friends, ever” at a hefty 4,861 words, making it longer than any essay I wrote at any point during my academic career. Twenty-one of my posts have been over 1,000 words long, and 207 posts are above the mean length. The shortest post I ever wrote is this little cutie, “Duerme”, which isn’t going to be winning any Spanish literary awards any time soon and only barely qualifies as a blog post at all under the stipulations of the original bet (namely that each post must be a minimum of two sentences in length).
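For anyone who wants to run the same sort of numbers on their own blog, here is a minimal Python sketch of the length statistics and the reading-time arithmetic. The function names are made up for illustration, whitespace splitting is an assumed definition of a "word", and this is not necessarily how the figures above were produced.

```python
from statistics import mean, median, multimode

WORDS_PER_MINUTE = 250  # the reading speed assumed in the post


def count_words(text):
    """Count words by splitting on whitespace (an assumed definition of 'word')."""
    return len(text.split())


def length_summary(word_counts):
    """Summarise a list of per-post word counts."""
    average = mean(word_counts)
    return {
        "total": sum(word_counts),
        "mean": average,
        "median": median(word_counts),
        "modes": multimode(word_counts),  # every length tied for most common
        "above_mean": sum(1 for n in word_counts if n > average),
    }


def reading_time(words, wpm=WORDS_PER_MINUTE):
    """Return (hours, minutes), rounded to the nearest minute."""
    total_minutes = round(words / wpm)
    return divmod(total_minutes, 60)


# 176,219 words at 250 wpm is about 705 minutes
print(reading_time(176_219))  # (11, 45)
```

Feeding `length_summary` a list built with `count_words` over each post's text would give the mean, median and mode in one go.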

The longest word I’ve used to date is “disproportionately”, clocking in at a girthy 18 letters long, and I’ve actually used it thrice. There were several joint runners-up at 17 characters apiece: “cognodegenerative”, “commercialisation”, “straightforwardly”, “perchloroethylene”, “counterproductive”, “catastrophisizing” and finally “indistinguishable”. Each of those has seen only a single use.
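A hunt for the longest words can be sketched in a few lines of Python. The tokenizer here is an assumption (lowercase, letters only, apostrophes allowed inside a word), and `longest_words` is a hypothetical helper rather than the script behind the numbers above.

```python
import re
from collections import Counter

# Letters only, with apostrophes allowed inside a word (e.g. "i've")
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")


def tokenize(text):
    """Lowercase the text, normalise curly apostrophes, and pull out the words."""
    return WORD_RE.findall(text.lower().replace("’", "'"))


def longest_words(text, top=3):
    """Return (word, length, uses) for the longest distinct words, ties broken A-Z."""
    counts = Counter(tokenize(text))
    ranked = sorted(counts, key=lambda w: (-len(w), w))
    return [(w, len(w), counts[w]) for w in ranked[:top]]


sample = "Disproportionately long, disproportionately so; straightforwardly indistinguishable."
print(longest_words(sample))
```

A different tokenizer (one that keeps hyphenated words together, say) would change which words count as the longest, so the rules are worth deciding up front.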

I have used 12,446 unique words and 8,695 unique stems, which means that my vocabulary stacks up pretty favourably against some very famous novels. As much as I would love to pretend that’s because I’m in any way as eloquent as Charles Dickens, I write about a huge range of topics rather than being bound by the confines of telling a coherent story, meaning I get to pull words from all corners of the English language in order to bump up my numbers. I also definitely make typos, whereas presumably over the years someone has fixed all of those that might have been found in A Tale of Two Cities. Even with those advantages, I still can’t hold a candle to David Foster Wallace.
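The gap between unique words and unique stems comes from collapsing inflected forms ("walks", "walked", "walking") into one entry. The sketch below shows the idea with a deliberately crude suffix stripper. It is a toy stand-in of my own invention, not a real stemming algorithm such as Porter's, and a proper NLP library would give different (and better) counts.

```python
import re

# Toy suffix list, checked in order; NOT a real stemming algorithm
SUFFIXES = ("ing", "ed", "ly", "es", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def vocabulary_sizes(text):
    """Return (number of unique words, number of unique crude stems)."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    stems = {crude_stem(w) for w in words}
    return len(words), len(stems)


print(vocabulary_sizes("walk walks walked walking quickly quick"))  # (6, 2)
```

Typos cut the other way: each misspelling is its own "unique word", which is one reason an unedited blog can inflate its vocabulary count.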

Wholly unsurprisingly, the most common word I have used is “I”, which has appeared 7,717 times, comfortably beating “the” into second with a mere 6,392 instances. My most used single verb form is “have”, but across all tenses “to be” pips “to have” to the post. I used the stem “love” 274 times, while “hate” only appeared 34 times in comparison - who knew I was such a ray of sunshine? “Feel” clocks 610 uses but is outdone by “think” with 692, perhaps something that’s reflective of my tendency to overthink things. “You” took the silver medal in the pronouns category with an impressive score of 1,197, ahead of “they” in bronze, with “he” not too far behind but “her” having been lapped, with roughly a third as many uses as “he”. I’ve used my own name six times, while Ted has got 18 mentions, clearly making him my favourite person. He’s not quite my most used proper noun though; that honour goes to “Guernsey”, which I have apparently bemoaned a whole 52 times.

“One” is the most common number word, with 717 appearances, “two” is second with 143 and “three” third with 52. The pattern stops there, though, as “four”, with its 17 uses, has to concede to “five” and its superior 26. “Shit”, along with its comrades “bullshit”, “shitty”, “shitshow”, “shitpost”, “shitfest” and “shittier”, has a combined total of 61, while “fuck” and the gang of “fucking”, “fucked”, “motherfucker”, “fuckerberg” (amazingly I’ve used that one twice) and “fuckers” have notched up a dominating 92 in comparison. I’ve never used the word “cunt” in the blog, until just now. Sorry Mum.
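Combined totals for a word and its extended family can be sketched as a substring match over a frequency table. `family_total` is a hypothetical helper and the letters-only tokenizer is an assumption; the example uses a politer root to show the obvious catch.

```python
import re
from collections import Counter


def family_total(text, root):
    """Sum the uses of every distinct word containing `root`, and list the members."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    members = {word: n for word, n in counts.items() if root in word}
    return sum(members.values()), members


total, members = family_total("I love my lovely, beloved glove. Love!", "love")
print(total)    # 5
print(members)  # includes 'glove', which is not really part of the family
```

Because substring matching sweeps up false friends like "glove", the member list is worth eyeballing before trusting the total.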

As established yesterday, I don’t really know the best way to extract meaningful data from this, but it’s cool to know that I now have a dataset of a decent size from which maybe I could derive some kind of deep insight into my psyche. It seems to me like a nice demonstration of compound interest as well; if I do something small every day, eventually it will grow into something big. If I ever wanted to write anything serious, I know that if I write 250 words a day, eventually I’ll be left with a novel or a memoir or what 50 Shades of Grey should’ve been, providing I stick with it.

This was a lot of fun for me - maybe I’ll do another one when I reach 300,000 words, and I’ll learn some data science and proper NLP in the meantime.