Is word count really accurate?

Recently, I’ve been asking myself this question: Is there a better way to measure essay length than word count? Just wonderin’. Word count might work fine for a single-page document, but it becomes less meaningful for hundred-page documents, like a PhD thesis.

The problem: Long vs short

As someone who’s a big fan of languages, here’s the thing: in linguistics, there are agglutinative languages. In other words:

(of a language, e.g. Hungarian, Turkish, Korean, and Swahili) tending to express concepts in complex words consisting of many elements, rather than by inflection or by using isolated elements.
—Oxford English Dictionary

What this basically means is that some languages have longer words, while others have shorter words. A good example would be Turkish. Like so:

Comparison of an agglutinative language with a non-agglutinative language

Note: This is an upscaled version generated by Waifu2x

A single word in Turkish can make for an entire sentence in English, because each word has, well, more “words” in it. Thus, I feel that it’s rather inaccurate to judge document length just like that. Yes, in technical terms, word count roughly tracks the byte size of a digital document, but from the angle of “how much is this person really saying”, word count scores poorly.
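To make this concrete, here’s a tiny sketch. The Turkish word below is a well-known illustrative example, and my English gloss of it is approximate; neither comes from a real corpus.

```python
# One Turkish word vs. a rough English gloss of the same meaning.
# Naive word count (splitting on whitespace) rates them wildly
# differently, even though the content is comparable.
turkish = "Muvaffakiyetsizleştiricileştiriveremeyebileceklerimizdenmişsinizcesine"
gloss = ("as though you were one of those whom we could not easily "
         "turn into a maker of unsuccessful ones")

print(len(turkish.split()))  # 1 "word"
print(len(gloss.split()))    # 19 words
```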

Existing solutions: It’s complicated

An interesting line of research is linguist Joseph Greenberg’s work on measuring the “agglutinativeness” of a language. In his 1960 paper (still relevant today), the degree of agglutinativeness is a ratio: the number of agglutinative junctures to the number of morph junctures. It’s a bit over my head, but the evidence is in the table below, extracted from Luschützky (2003):

Language   Agglutination   Synthesis
Swahili    0.67            2.56
Turkish    0.60            2.33
Yakut      0.51            2.17
English    0.30            1.67

Going by the agglutination index alone, there is a difference of 0.37 between agglutinative Swahili and non-agglutinative English. So, what does that mean? I suppose it works like this: a text written in Swahili can pack roughly twice as much content into the same number of words as a text written in English. So is word count really accurate? Is it really a universal marker of text length? Let’s not even get into Mandarin, which sits at the opposite extreme as one of the most analytic (isolating) languages in the world, and which, for whatever reason, was omitted from the aforementioned study.
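If we take the agglutination index as a rough proxy for how much meaning each word packs (my assumption here, not Greenberg’s claim), the “almost two times” figure works out like this:

```python
# Agglutination indices from the Luschützky (2003) table above.
swahili = 0.67
english = 0.30

# Ratio of the two indices: how much "denser" a Swahili word is
# than an English word, under my proxy assumption.
print(round(swahili / english, 2))  # 2.23, i.e. a bit over two times
```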

Personal thoughts

Based on the above, how, then, do we measure document length more accurately? I have a theory. On the assumption that clause-binding words and phrases such as “therefore”, “so”, “hence”, and “as a result” indicate a certain degree of logical advancement in the content, we could count those words and phrases instead, with a greater count yielding a higher “content length” value.
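A minimal sketch of that idea in Python. The connective list and the name `content_length` are my own, purely for illustration; a serious version would need a much better-curated list per language.

```python
import re

# Clause-binding connectives to count (illustrative, not exhaustive).
CONNECTIVES = ["therefore", "so", "hence", "as a result", "thus", "consequently"]

def content_length(text):
    """Score a text by counting connective occurrences instead of words."""
    text = text.lower()
    total = 0
    for phrase in CONNECTIVES:
        # \b word boundaries keep "so" from matching inside "also" or "sorry".
        total += len(re.findall(r"\b" + re.escape(phrase) + r"\b", text))
    return total

print(content_length("It rained, so we stayed in. Hence, no photos."))  # 2
```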

But wait—we can go even further! What if we leveraged an unsupervised machine learning model to estimate “content length”? It’s a speculative idea, but it could actually be useful. The bottom line: word count varies from language to language, and while it has its conveniences and relevance, more robust systems for measuring document length need to be developed.