Mining juicy words
This weekend, I counted all the words on Project Gutenberg. This has been done before, notably here. My script crawled most of the English-language books on Project Gutenberg (about 20,000 titles) and counted how often each word appears and how many books each word appears in. The script ran for about 20 hours.
You can download the resulting list, which contains over a million words, here. Each line shows how many books each word appears in. A second list, which shows how many times each word occurs in total, can be downloaded here.
I prefer the list that shows the number of books each word appears in. It has the effect of pushing down words which appear a lot in only a small number of books, such as the names of fictional characters.
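The two kinds of counts can be sketched roughly as follows. This is not the original script, just a minimal illustration; the function name `count_words`, the lowercase-letters-only tokenizer, and the toy "books" are all assumptions made for the example.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of ASCII letters, compared case-insensitively.
WORD_RE = re.compile(r"[a-z]+")

def count_words(books):
    """Return (total_counts, book_counts) for an iterable of book texts."""
    total_counts = Counter()  # how many times each word occurs overall
    book_counts = Counter()   # how many books each word appears in
    for text in books:
        words = WORD_RE.findall(text.lower())
        total_counts.update(words)
        # set() ensures each book contributes at most 1 per word
        book_counts.update(set(words))
    return total_counts, book_counts

# Toy stand-ins for real book texts
books = ["The cat sat. The cat ran.", "A dog and the cat."]
total_counts, book_counts = count_words(books)
```

In this toy input, a word repeated many times inside one book raises its total count but adds only one to its book count, which is why character names sink in the book-count list.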
I compiled these lists because I wanted to make some word puzzles. There are a lot of free lexicons, or word lists, out there, such as the ENABLE lexicon, which is commonly used for Scrabble-like games. However, for the purposes of making crosswords, word searches, and other puzzles, it’s very helpful to restrict the words to more commonly used ones, and to know how common each word is.
The popularity number of a word correlates well with how ‘juicy’ the word is, that is, how appropriate it is for a word puzzle. For example, using my book-count list, the words at the very top of the list are quite boring structure words:
18374 by
18054 and
18023 the
17994 of
17963 a
17955 to
17946 in
17916 from
17912 with
17909 for
As we head towards 10,000, we encounter most of the common bread-and-butter words. These are also kind of boring.
15095 case
15094 none
15091 taking
15070 seem
15060 able
10776 buried
10771 report
10767 asking
10767 clean
10764 occurred
As we head from 10,000 to 200, the words get increasingly interesting.
9781 plainly
9781 flat
9779 proofreading
9777 passion
9775 approaching
.
.
.
5999 commanding
5998 channel
5997 translated
5996 metal
5996 sixth
.
.
.
1999 conflicts
1999 spider
1999 bleed
1999 discrimination
1998 lends
.
.
.
599 studs
599 niggardly
599 symbolized
599 engraven
599 palliate
There is a sweet spot in the 300s, with a lot of very juicy but still familiar words. If I were selecting words for puzzle construction, this is the area I would favor. After the 300s, the words get increasingly obscure.
359 pajamas
359 dressings
359 thievish
359 anatomist
359 ticks
.
.
.
200 darkish
200 acclimated
200 unfriendliness
200 moveth
200 undiscoverable
At the 200 mark, we’ve only covered about 38,000 words. There are 1,236,759 words in the list in total, so we are still at the top of a very long tail! Below 200, words get increasingly obscure, archaic, misspelled and foreign. We also hit a lot of proper nouns. Still, there are a few legit but rarely used words mixed in.
99 tingeing
99 marshmallows
99 somethings
99 feelest
99 petrify
.
.
.
50 anim
50 makeweight
50 godard
50 seraglios
50 vun
.
.
.
25 admiralties
25 vanni
25 senescent
25 futrelle
25 erechtheum
.
.
.
10 foretime
10 chargee
10 cabinetmaking
10 pneumonias
10 olivo
.
.
.
5 guisers
5 hairing
5 hipless
5 turms
5 arpasia
.
.
.
1 raskolink
1 baetan
1 succories
1 denudement
1 trotudas
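Picking out a band of the list, such as the 300s, might look something like the sketch below. It assumes each line of the list has the form "count word", as in the excerpts above; the helper names `parse_counts` and `words_in_band` are made up for this example.

```python
def parse_counts(lines):
    """Parse 'count word' lines into a dict of word -> book count."""
    counts = {}
    for line in lines:
        parts = line.split()
        # Skip anything that is not exactly "<number> <word>"
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        counts[parts[1]] = int(parts[0])
    return counts

def words_in_band(counts, low, high):
    """Return the words whose book count lies in [low, high], sorted."""
    return sorted(w for w, c in counts.items() if low <= c <= high)

# A few lines copied from the excerpts above, plus a "." filler line
sample = ["18023 the", "359 pajamas", "359 ticks", "200 darkish", "1 raskolink", "."]
counts = parse_counts(sample)
band = words_in_band(counts, 300, 399)
```

With the real file, `sample` would be replaced by the lines read from the downloaded list.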
UPDATE, March 23rd:
I measured the average book-count of the words in all the New York Times crossword puzzles since 1997 (their online archive goes back to about 1996). For each puzzle, I averaged the book-counts of the words that appear in my list (typically, about 80%-90% of the words in each puzzle). For most years, the average book-count falls between 2,008 and 2,180, and from year to year, the results can be surprisingly consistent.
Here are my averages:
1997 2070.93
1998 2154.20
1999 2113.24
2000 2180.20
2001 2131.94
2002 2141.65
2003 2115.29
2004 2114.60
2005 2034.01
2006 2026.20
2007 2035.31
2008 2033.90
2009 2008.76
There appears to be a marked shift towards more obscure words between 2004 and 2005.
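The per-puzzle averaging described above can be sketched like this. It is only an illustration: the function names are hypothetical, the book counts and puzzle entries are made-up toy values, and taking the yearly figure as the mean of the per-puzzle averages is an assumption about how the yearly numbers were aggregated.

```python
from collections import defaultdict
from statistics import mean

def puzzle_average(answers, book_counts):
    """Average the book counts of the answers found in the list."""
    known = [book_counts[w] for w in answers if w in book_counts]
    # Answers missing from the list are simply ignored
    return mean(known) if known else None

def yearly_averages(puzzles, book_counts):
    """puzzles: iterable of (year, answers). Returns year -> average."""
    by_year = defaultdict(list)
    for year, answers in puzzles:
        avg = puzzle_average(answers, book_counts)
        if avg is not None:
            by_year[year].append(avg)
    # Assumption: a year's figure is the mean of its per-puzzle averages
    return {year: mean(avgs) for year, avgs in by_year.items()}

# Made-up counts and puzzles, for illustration only
book_counts = {"ark": 2000, "heath": 2100, "nut": 2200}
puzzles = [(1997, ["ark", "heath", "zzyzx"]), (1997, ["nut"]), (1998, ["heath"])]
result = yearly_averages(puzzles, book_counts)
```

Here "zzyzx" stands in for a puzzle answer that does not appear in the word list and so does not affect the average.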
Interestingly, there are only about a thousand words that fall in that NYT-Crossword sweet spot. Here they are:
pondering squarely pregnant paws scold cordiality cooler venturing variance hypothesis forefinger economic untimely dubious shepherds secular minimum pallor degrading fastidious desertion foretold heath discourage wintry wrenched peas raiment pensive reproof ankle flattened moore fisherman peninsula beholding identification wheeling maine unhappiness richmond frantically enhanced gorge extremities joyously stronghold hissed nut bowels repressed lending feasts cavern unfold memoirs onto invade ark structures forbids liver correctness abashed stumble clerical orchestra terrifying enchantment incomparable collapsed paler ballad recalls slack restraining motley rippling circled ardor lambs flapping shrug prettily avarice aforesaid educate glorified acquiescence acquitted dungeon blasted objective persuading fray forts statistics gathers levelled moderately splashed mirrors infected vacancy furs mates grating precipitate confiding ton grazing dispositions partnership momentarily framework attorney regulating fathom nimble ravages surpassing quieted hitting sustaining practiced darkening walled withdrawal unawares exceptionally howard fiend queens horseman dictates quarry waged coral pleasanter badge assurances subsistence italians manned alphabet bower reposed preachers variously anticipating arabian melodious slate hourly bled dejected dreamt discordant stormed purchasing sap unreal parlour dam couples humblest postpone butterflies chaps yells paw freeze forfeit eclipse advertisements dozens quitting romances uphold drunkenness agonies guinea forge tearful twig dispatched windy tidy bitterest dogged wastes disconcerted irritable tunnel contentedly backing uniforms gunpowder mineral pigeons repel pail territories ransom stab draped redemption individually medicines azure bony scissors ma invariable supplement repulsed entreaty capitals forbearance adviser unavoidable raining enlighten holiness countenances untold coil mutilated dancers thankfulness buzzing armor 
spoiling narrower adhere ardently undergoing indomitable devoting friction thrive ravine diverse floats hazy twain aspire visage quarrelling womanly shields initiative disappointments elaborately civility disobedience splashing festivities disasters bustling vicissitudes monopoly helen raid marshes fitful consigned illustrates apprehensive conscientiously fabulous colleagues profited wharf grievances countryman laurels diversity monastery target pounded conspicuously myriads hostilities atrocious vase overturned redoubled mountainous swallowing layer adherents sparing parchment trampling imaginations laughingly fictitious jet widows picnic prospective valour absorb yankee chocolate courtiers canoe chasm biscuit stairway jars adjustment ancestral roving catholics psychological milder adapt woollen loathsome rowing barracks signing banker grunted slumbers garret midsummer ignoble savings substantially resuming fostered mane prophesied forfeited swan loosed fortifications gloriously vouchsafed oratory jovial crescent stinging stamps commissions lanterns caresses merest universities insurance draughts surmise rebuked valid barbarians revolted humbled emerald contradicted halfway marvels excel nervousness pier stall illustrating grades surly utensils chagrin colouring murders northwest widening pitiable keener kent devilish conventions carving studded mat dwarf weights youngster compels resounded dispatch fried completeness dismissal undecided aiding dimmed plied illumined extensively needing graphic embroidery glimmering sash sauntered sniffed grasses pitcher rapt unerring offences exiled sucked raced fig streaks halo religions rhetoric advising fraught canadian hampered riders profile incur excellency benediction gregory particles diminutive chemistry infants lounging knocks elated mien propped reverent antagonism wade exhaust unduly needy girlish hoarsely mortified hercules initials scar flowery reproduction absorption excelled stains facilitate modify slap grounded 
wig lavished magnified agility hugh sponge irishman cultivating stalked fumes metals arena augmented enjoined fibre flushing biscuits attends nick soaring follower boom surest rhine proclaiming snatching paramount alluring clambered loom poultry intoxication slaughtered perplexing impaired sleek patting conceited squirrel inventor notably swells ripened click ethics fairies adventurer summoning vocal jove scolded dwellers uniformity sarah prairie capacities unfriendly uttermost hens gear penance unbearable sewed legion disposing mistook prestige organic unparalleled invaders laboriously trench steeped distraction dipping groped slackened beak salutary summits intrusted inanimate flowering reiterated receding jagged adversity safeguard unacquainted stalks axes alps hip mortality perverse apathy weighs julius witnessing epithet childlike lunatic pretends convict oblivious restlessly yarn offense chests runaway dilapidated unfailing verdure cloudless ferry vista toll prettier unearthly enlist feudal penitent scarf encamped dedication mahogany relinquish residents salmon payments meditations tragedies sufferers concludes arnold smoky altars squadron pursuers sagacious abnormal bernard reeled strangled cherry planets combatants bunches feathered fearlessly therefrom canst precipitated likelihood potato conquests intensified columbus hairy slapped scrupulously immemorial buoyant graver warranted senator excesses invading complimentary turks highness factors vindictive shovel tenderest uncanny augustus propositions detection efficacy artful iniquity emancipation listless indolence lease purified grease unoccupied encounters treasurer hereby narrated revel impetus legislative wailed mexican disappoint impertinence abstraction pulls submissive surged falsely sheriff wilder underwent submitting prisons implicitly treasured sculpture spheres trailed impassioned exacted converts pepper coloring noiseless conflagration relatively maddened precincts versed quartered culprit 
tunes torments birch fairness unsteady terminate offender citadel ado compiled
March 23rd, 2010 at 2:34 pm
At lunch today, I was wondering not about words, but typed character frequency. I’m in the physically (carpal tunnel) and mentally painful adjustment period of getting used to the new Microsoft Office redesign. If they wanted to be really useful and force everyone to learn a new way of doing things, wouldn’t it be better to invent the most efficient/ergonomic letter/punctuation layout for a keyboard? That’s an update I could get behind. So I thought about looking up character frequencies (combinations would probably be important too…) in a database like the word database you just built, but then I ran across this when I hopped over to get some more Penrose slitherlinks, and thought you might know if such a database already existed.
March 23rd, 2010 at 5:35 pm
I’ve seen such tables (and computed them, in the past). I suspect, however, that there are other factors besides letter placement that have a greater effect on ergonomics… It’s not necessarily qwerty that is the cause of your troubles. I’d be happy to compute a letter frequency table from Project Gutenberg, if you’re interested.
March 30th, 2010 at 9:49 am
I find your work, puzzles, music, and thoughts fascinating. Thanks for sharing.