Tuesday, February 28, 2006

(Spelled) segment distribution: lexicon vs. corpus

Recently, reading a historical paragraph about Scrabble, I was surprised to (re)realize that the letter distribution in Scrabble is based on an (informal) corpus count (the front page of the New York Times), not a dictionary-headword (i.e. lexicon) count. So, e.g., 12 per cent of the letters in the Scrabble bag (12 tiles of 100) are 'e'; that's pretty much exactly the percentage of letter occurrences that are 'e' in a corpus of written English.

The reason this surprised me is that a corpus contains many repetitions of function words, which probably inflates the percentages of certain letters, e.g. 't' (the, to, it), 'e' (the, he, she), etc., compared to their percentage occurrence in the lexicon, which contains each function word only once.
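(To make the point concrete, here's a toy illustration in Python with a made-up twelve-word 'corpus' -- not real data, just enough to show how repeating 'the' pushes up the token-based percentages of 't', 'h', and 'e' relative to the type-based ones.)

    from collections import Counter

    # Toy illustration (made-up mini-corpus, not real data): count letters over
    # the word tokens of a tiny corpus vs. over its set of unique word types.
    corpus = "the cat sat on the mat and the dog saw the cat".split()
    lexicon = set(corpus)

    def letter_percentages(words):
        counts = Counter(ch for word in words for ch in word)
        total = sum(counts.values())
        return {ch: 100 * n / total for ch, n in counts.items()}

    corpus_pct = letter_percentages(corpus)
    lexicon_pct = letter_percentages(lexicon)

    # 't', 'h', and 'e' come out noticeably higher in the token-based count,
    # simply because 'the' occurs four times in the corpus but only once in the lexicon.
    for ch in "the":
        print(ch, round(corpus_pct[ch], 1), round(lexicon_pct[ch], 1))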

But of course Scrabble is all about producing nice individual words, really, not about producing a corpus-like set of word tokens. Indeed, someone who repeatedly put down words like 'to', 'the', 'it', and so on would not get very far in a game of Scrabble. So it seemed to me that it would have been more appropriate to use a letter distribution based on the percentage of each letter occurring in a list of dictionary headwords.

I thought I would try to find out how different the lexicon-based letter distribution in English is from the corpus-based one, but I can't find any numbers online for the former. (Numbers for the latter are all over the place, of course, and match the Scrabble distribution almost exactly; the reason letter distribution interests so many people is that it's a good way to break simple substitution ciphers.)

I know it would be a supersimple programming problem to produce a list of letters and their respective percentage distributions in the headwords of any online dictionary database (e.g. the fourth column of Mike Hammond's 'newdic' file), but it'd be a biggish time investment for me to figure it out right this second. Would anyone who can see a quick and easy way to do it like to send me the numbers for comparison with the corpus numbers? It would be interesting...
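(For concreteness, the kind of thing I have in mind is roughly the following -- a rough, untested Python sketch that assumes the newdic file is plain text with whitespace-separated columns and the spelled headword in the fourth column; the filename and column index are just guesses to adjust as needed.)

    from collections import Counter
    import string

    # Rough sketch: tally letters over the spelled headwords of a dictionary file.
    # Assumes whitespace-separated columns with the headword in the fourth field;
    # adjust the filename and column index for the file you actually have.
    counts = Counter()
    with open("newdic") as f:
        for line in f:
            fields = line.split()
            if len(fields) < 4:
                continue
            headword = fields[3].lower()
            counts.update(ch for ch in headword if ch in string.ascii_lowercase)

    total = sum(counts.values())
    for letter in string.ascii_lowercase:
        print(f"'{letter}': {100 * counts[letter] / total:.1f}")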

(Seems to me that it might even be useful, theoretically. If you're into exemplar models of mental lexicon representations, for instance, the frequency/markedness value of a given English segment in your mental inventory might be expected to correlate with its corpus distribution; if instead you favor a more traditional lexicon-based model, with a single abstract phonological representation of each word, you might expect segment markedness values to correlate with the lexicon distribution.)

2 Comments:

Anonymous said...

Using that newdic file & my own half-assed Python-fu, I get:

'a': 8.9
'b': 2.1
'c': 4.7
'd': 2.9
'e': 11.0
'f': 1.4
'g': 2.2
'h': 2.3
'i': 8.8
'j': 0.2
'k': 0.8
'l': 5.5
'm': 3.2
'n': 6.8
'o': 6.9
'p': 3.2
'q': 0.2
'r': 7.5
's': 5.3
't': 7.7
'u': 3.8
'v': 1.2
'w': 0.9
'x': 0.3
'y': 1.8
'z': 0.4

6:59 PM  
Nayeli said...

It would be interesting to use the TWL or SOWPODS dictionaries themselves to compute the lexicon-based frequency.

3:08 PM  
