Zipf's law:

The linguists at Language Log have been poking fun at a BBC story suggesting that British teens have poor vocabularies and that Britain is becoming a nation of "Vicky Pollards." The main posts on the subject are here and here; an extra post is here, and a (very partial) retraction of their original mockery is here. (The mockery was substantially fair, but in the retraction they go into greater theoretical detail.)

By the way, who is Vicky Pollard? The Language Loggers suggest looking here, here, here, and here. I've only looked at the fourth of those links, but it's pretty funny.

In any event, the basic moral is that the BBC doesn't know what it's talking about. For one thing:

The Vicky character — a broad satire of the accent, dress and manners of British lumpen-teen females — is portrayed as hyper-verbal. One of the basic Vicky bits is her jabbering rapidly on automatic pilot, saying far more than she should. Yet the BBC sees her as someone who is unable to communicate due to an inadequate word stock, not someone who over-communicates with socially inappropriate content, accent, word choice and sentence structure. This is another piece of evidence that journalists these days are incapable of elementary observation and common-sense description, at least when it comes to speech and language.

For another thing, the story generated the assertion that "the top 20 words used [by British teens] . . . account for around a third of all words." Now, you're supposed to read that and imagine "um," "like," "y'know" . . . but it turns out that everyone does the same thing. Having the top 20 words account for a third of all your words is a normal distribution. (That's "normal" in the "ordinary" sense, not the "Gaussian" sense.) Take a look at Zipf's Law, and then read this lovely article about the Oxford English Corpus, where you can find the 100 commonest English "words" (where "words" basically means "lemmas," if you find that helpful).
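To see why "20 words, one third" is unremarkable, here is a rough sketch of the arithmetic under an idealized Zipf's Law. It assumes the frequency of the word at rank r is exactly proportional to 1/r and that the vocabulary is finite; both are simplifications, and the function names and vocabulary sizes are made up for illustration.

```python
def harmonic(n):
    """Sum of 1/r for r = 1..n (the n-th harmonic number)."""
    return sum(1.0 / r for r in range(1, n + 1))


def zipf_top_k_share(k, vocab_size):
    """Share of all word tokens taken by the k most frequent words,
    assuming an idealized Zipf distribution with exponent 1."""
    return harmonic(k) / harmonic(vocab_size)


# Hypothetical vocabulary sizes, chosen only to show the trend.
for vocab_size in (1_000, 10_000, 100_000):
    share = zipf_top_k_share(20, vocab_size)
    print(f"vocabulary {vocab_size:>7}: top 20 words = {share:.1%} of tokens")
```

Under these assumptions the top 20 share stays between roughly 30% and 50% across that whole range of vocabulary sizes, so a figure of "around a third" says little about who is speaking.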

Especially funnily, the Language Log folks analyzed a text by the professor responsible for the statistic, and found that he, too, followed the same 20/one-third law! Not that the professor is really to blame; of course, his research was badly mangled by the media.
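If you want to try the same check on a text of your own, something like the following would do. This is only a sketch, not the method Language Log used: the tokenizer is a crude regular expression (lowercase letters and apostrophes), it counts word forms rather than lemmas, and the function name is invented here.

```python
import re
from collections import Counter


def observed_top_k_share(text, k=20):
    """Fraction of all word tokens in `text` accounted for by its
    k most frequent word forms (crude tokenizer, no lemmatization)."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    counts = Counter(words)
    top_total = sum(count for _, count in counts.most_common(k))
    return top_total / len(words)


# Tiny made-up sample; a real test needs a text of at least a few thousand words.
sample = "the cat and the dog and the bird"
print(observed_top_k_share(sample, k=2))
```

On a short sample the number is meaningless; the comparison only becomes interesting on a long text, where the top 20 forms are mostly function words like "the," "of," and "and."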

UPDATE: A commenter quibbles with my use of the word "commonest." In the comments, I quote the Oxford English Corpus guys using the word, and also uses of the word by Byron and Jonathan Swift.

Respondent (mail):
"The 100 commonest English "words"
Sasha, is "commonest" a valid superlative? I think you would need to say "most common".
1.4.2007 11:52am
andy (mail) (www):
I think the Brits should be concerned with the yellow teeth in their mouths, not the words coming out of them.
1.4.2007 11:55am
Glenn W. Bowen (mail):
Zipf's Law seems to follow the form of successively moving half the distance to an object, never reaching the object.

Something to do with Fibonacci Numbers, the Golden Mean, etc., lurking here...
1.4.2007 12:05pm
Sasha Volokh (mail) (www):
Respondent: Take a look at the link to the Oxford English Corpus above, where you'll find:

"What is the commonest word? Based on the evidence of the billion-word Oxford English Corpus, the 100 commonest English words found in writing around the world are as follows . . ."

This is the OED guys talking.

Jonathan Swift says, in Tale of a Tub (1704): "It was necessary that corruption should have some allegory as well as the rest; and the author invented the properest he could, without inquiring what other people had writ; and the commonest reader will find, there is not the least resemblance between the two stories."

Byron wrote, in The Irish Avatar (1821):

Is it madness or meanness which clings to thee now?

Were he God, as he is but the commonest Clay

With scarce fewer wrinkles than sins on his brow

Such servile devotion might shame him away.

"Commonest" is listed in The Free Dictionary. And a Google search yields 2,790,000 hits for the word "commonest."

And even without any of the above, I maintain that "commonest" would still be O.K., as it's unambiguous and not difficult to say. Awkwardness or clunkiness is in the eye of the beholder, but I say it's neither awkward nor clunky.

Now I realize there are "rules" floating around to the effect that you don't add "-est" after long adjectives, which includes 2-syllable adjectives not ending in "-y." Even on that web site I just linked to, they recognize exceptions like "quiet," "clever," "narrow," and "simple." ("Simple" might be a special case because an "-est" superlative still only takes two syllables.) So that rule arguably doesn't exclude "commonest." To the extent it does, I reject the rule.
1.4.2007 12:18pm
Paul Zrimsek (mail):
Andy, if we keep quiet about the teeth it's possible that they'll keep quiet about the fat. I'm not holding my breath or anything, but it's worth a try.
1.4.2007 12:33pm
Glenn W. Bowen (mail):
sasha...

given your response above, the "quibble" note is... superfluous.
1.4.2007 12:37pm
StevenK:
Here's a great website, WORDCOUNT: http://www.wordcount.org/index2.html

It ranks more than 80,000 English words by frequency of use.
1.4.2007 1:43pm
Bill Poser (mail) (www):
Regarding "commonest", there seems to be a rule even to the variation in acceptability of such forms. Unequivocal monosyllables take the suffixes -er and -est without any question or variation: big -> biggest, fair -> fairest, small -> smallest, etc. Unequivocal polysyllables do not take these suffixes and require the periphrastic forms "more X" and "most X", e.g. innate -> most innate, *innatest, casual -> most casual, *casualest. The words where the choice is not so clear and on which people seem to disagree are words that are superficially disyllabic but in which one syllable is reduced and arguably is not syllabic at a more abstract level of representation. "common" falls into this category, along with examples like "loyal" and "purple".
1.4.2007 1:52pm
Just an Observer:
Zipf's Law has serious practical applications far from theoretical linguistics. This distribution, or some variation of it, is employed by the ranking algorithms of modern text-search engines such as Google and Yahoo.
1.4.2007 1:53pm
Sarah (mail) (www):
The BBC has a serious fixation on Vicky Pollard and how the British nation is becoming one of "chavs." I've been a reader of the BBC education news since I had to do a study on British exchange programs on an internship (ages ago) and this is one of a handful of themes that they are currently focused on. I think it's a bit of a thing with the British in general at the moment, though -- there was a recent BBC story about how all the young adults shown on popular TV are early school-leavers who work in shops and don't seem to be moving on with their lives, but that was because of some kind of high-level press release from an outside group. Vicky Pollard was mentioned in that article, too (she also finds her way into the occasional ASBO story.) Anyway, if the BBC wants to say that British youth are lazy good-for-nothings and we're all doomed, chances are she'll be mentioned even when it isn't really appropriate. She's a code word more than a fictional character.

As far as word frequencies are concerned... I love them to pieces. I'm getting a new Sunday School class (turning 8 years old this year) and we're going to learn to read half the KJV in a month (the hardest words out of the 47 which are necessary are "thou," "Israel," and "against.") I also use the Russian frequency lists for vocabulary study, though I have to say that a lot of this stuff is less useful than it appears to be -- if I tell you that "к" or "в" are two of the most commonly used words in Russian, you still have a long way to go to reach actual understanding (since these are "function" words used in many different contexts, sometimes with very different meanings from an English-speaking point of view.) The top 20 words used in any given text probably get used in 30 or 40 different senses within that text: that's a big part of why they're used so much.
1.4.2007 2:01pm
Mary Katherine Day-Petrano (mail):
Well I know the "commonest" words used by the Florida State Courts system about the Americans With Disabilities Act -- 'it is al just surplus verbiage.' Senior Supervising Attorney for Judicial Education.
1.4.2007 2:07pm
Mary Katherine Day-Petrano (mail):
"al"=all
1.4.2007 2:08pm
David Chesler (mail) (www):
in which one syllable is reduced and arguably is not syllabic at a more abstract level of representation. "common" falls into this category, along with examples like "loyal" and "purple".

Aren't you referring to syllables with a schwa?

This rule allows "biggerer" or similar constructions comparing comparisons. ("In gorillas the male is biggerer than the female than in humans." Isn't that more clear [why don't I like clearer here?] than "The male gorilla is bigger than the female gorilla to a greater degree than the male human is bigger than the female human"?)
1.4.2007 3:16pm
Shut up! I ain't even dun nuffin or nuffin!:
Can I just say that Little Britain, the show that Ms. Pollard comes from, is the best sketch comedy show since Kids in the Hall?
1.4.2007 8:18pm
Lev:

The Vicky character — a broad satire


?


Here's a great website, WORDCOUNT: http://www.wordcount.org/index2.html

It ranks more than 80,000 English words by frequency of use.


Shouldn't, like, "like", be number one, you know?
1.5.2007 12:04am
Spartacus (www):
Commonest: no problem. But, although I've found it in the dictionary, I wonder about "funnily." While grammatically correct, how would you say the word? "fun-ee-lee"? "funn-ill-ee"? Or like "funnelly" (as in, "like a funnel")? It just sounds awkward, though perhaps not much better than, "in a funny manner" or some such. But I concede, it is a word.
1.5.2007 9:39am
Bill Poser (mail) (www):
Re David Chesler: One might describe the syllables of uncertain status as those with a schwa. I avoid that approach for two reasons. One is that even if that is a surface-true description, it arguably doesn't reflect what is really going on. On this view, what is really going on is that at the point at which the comparative and superlative forms are generated, those syllables are not syllables at all. The other reason is that it is disputed whether in, e.g., the second syllable of "loyal" one has a syllabic /l/ or a schwa followed by an /l/.

I don't see how "the rule" (which one?) allows double forms like "biggerer". Those are impossible as far as I know, and they would only be generated if the morphology provided the opportunity to add the suffix twice, which it doesn't.
1.5.2007 5:11pm
ys:

Here's a great website, WORDCOUNT: http://www.wordcount.org/index2.html


It's a great site, but it has a severe flaw: it lists plurals of words separately and in addition to singulars. This is simply not right for straight frequency counting.
1.5.2007 10:22pm
Lev:

syllables of uncertain status as those with a schwa



Like, that sounds, like, you know, "bush schwa", you know?
1.5.2007 11:02pm