How many words do you need?
- A note on corpus and lexicon size
- 1. Lemmatized words
- 2. Running words of text
- 3. Wordforms and running words
- 4. Real-time value of corpus sizes
- 5. Summary: Desiderata for documentation
- 5.1. Recommended corpus sizes in running words.
- 5.2. Recommended lexicon sizes in lemmatized words.
Johanna Nichols, UC Berkeley
Date of this version: Jan. 18, 2005
Comments and other input welcome
How much material is needed for minimal, normal, or optimal documentation of a language? How many entries should be in a dictionary? How large should a text corpus be? This squib is an attempt to estimate adequate and optimal sizes of corpora, measured in lemmatized words and running words, for linguistic research and for multi-purpose lexicography.
A dictionary is constructed from a wordlist, i.e. the list of words it defines or translates, and the wordlist is then organized into headwords, subwords, examples, and other information. When a publisher lists the number of words in a dictionary (as is commonly done for European dictionaries but rarely done for ones published in the U.S.), that usually includes everything that figures as lexicalized material - headwords, subwords, idioms, examples, illustrative phrases, even principal parts, etc. Field and descriptive dictionaries rarely have the elaborate structure of subwords, etc. shown by large published dictionaries of major world languages, and in any case there is no particular consistency from dictionary to dictionary as to whether items are treated as headwords or subwords. To make dictionaries as easy as possible to compare, I will try to focus on what is known as the lemmatized word -- the head of an inflectional paradigm and the source of very regular derivational forms such as (for English) adverbs in -ly and comparative and superlative adjectives.
Cheng 2000, 2002 (and earlier works) shows that, for both English and Chinese, any given author uses a maximum of about 4000-8000 lemmatized words. A book of about 100,000 running words of text reaches this maximum in some cases. Cheng quotes Francis & Kucera 1982 to the effect that the Brown corpus (1,000,000 running words) contains over 30,000 lemmatized words, of which just under 6000 occur more than 5 times.
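The kind of count quoted here (running words, distinct lemmas, and lemmas above a frequency threshold) can be illustrated with a minimal Python sketch. It assumes the text has already been lemmatized into a list of strings; the function name `lemma_stats` and the toy data are hypothetical and are not taken from Cheng or from Francis & Kucera.

```python
from collections import Counter

def lemma_stats(lemmas, min_count=5):
    """Return (running words, distinct lemmas, lemmas occurring more than min_count times)."""
    counts = Counter(lemmas)
    frequent = sum(1 for c in counts.values() if c > min_count)
    return len(lemmas), len(counts), frequent

# Toy, hypothetical input: a pre-lemmatized token list.
sample = ["be"] * 7 + ["give"] * 6 + ["word"] * 2 + ["thousand"]
running, distinct, frequent = lemma_stats(sample)
print(running, distinct, frequent)
```

On a real corpus the hard part is the lemmatization itself, which this sketch deliberately leaves out.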
Cheng's interpretation is that 8000 words is about the maximum for an individual's actively known vocabulary. Of the authors he has surveyed only Shakespeare reaches an exceptional 10,000 words. Other sources cite vocabulary sizes ranging up to 80,000 words for the individual; but this is passive vocabulary, some of it known only in the sense of being understood in context. Cheng 2000 shows that, over time from 93 BC to the present, the size of Chinese dictionaries increases regularly but the size of the individual author's vocabulary remains at a constant 4000-8000 characters.
Cheng's results suggest a measure of adequacy for lexical documentation: it should reach the range of an individual's active vocabulary, and it should be compiled from extensive enough materials to include the entire active vocabulary for at least one good speaker and preferably several good speakers.
I note that a fair amount of Cheng's English corpus appears to be literature for young adults. The sources with the higher lemma figures are writers writing for a full-fledged audience, e.g. Mark Twain; nonfiction writers appear to mostly fall in here. So Cheng's figure of 8000 may be a minimum, in that inclusion of more varied genres would almost certainly expand it. Also, the figure probably includes distinctly fewer technical terms than the average user knows actively. Finally, what Cheng surveys in this paper is not the given author's whole oeuvre but just one large work or (e.g. for Mark Twain) a collection of short works. That said, evidently one needs close to 100,000 words per individual to have any chance of capturing that individual's entire active lexical range. That would be about 17 real-time hours: see section 4 below. (Of course, to be strictly comparable to Cheng's sources that would need to be 17 hours of high-quality text, comparable in lexical, stylistic, and thematic range to the best written and edited literature.)
Recent field and descriptive dictionaries do well by these criteria. Table 1 below shows the results of a survey of a quick convenience sample of glossaries and dictionaries that I have used and found excellent. There are two major types: the glossary or wordlist included at the end of a descriptive grammar, and the self-standing monographic dictionary. I have occasionally given counts of both headwords only and headwords plus first-order subwords, where there was an appreciable difference; in those cases the count of headwords plus first-order subwords is closest to a count of lemmatized words. (In these dictionaries, subwords beyond the first order tended to be principal parts, illustrative inflectional forms, dialect variants, etc. rather than separate lexicalized words.) All figures are rounded. Wordlist figures do not reveal additional richness such as dialect coverage, examples, drawings, multilingual glossing, etymology, etc.
Table 1. Wordlist sizes for a selection of field and descriptive lexica.
|* = author's own word count; no mark = my estimate|
|Basic wordlist||Swadesh 100-word list or similar|
|Glossary||Usually just one English gloss per word, no subwords|
|Dictionary||True dictionary with definitions, subsenses, etc.|
|Lifework||Result of author's long work on the language|
|Non-initial||Uses previous published lexicography on the language|
|hw||Count of headwords only|
|hw + sw1||Count of headwords plus first-order subwords|
|Language||Source||Type||Wordlist size|
|Vocabularies in field grammars:|
|Mayali||Evans 2003||Basic wordlist||150 (all dialects)|
|Eastern Kayah Li||Solnit 1997||Glossary||1000|
|Lezgi||Haspelmath 1993||Glossary||1450 hw; 1600 hw + sw1|
|Hunzib||van den Berg 1995||Dictionary||2000 *|
|Monographic field and similar dictionaries:|
|Tümpisa Shoshone||Dayley 1989||Single project||3500 *|
|Dongolese Nubian||Armbruster 1965||Lifework||6200|
|Chickasaw||Munro & Willmond 1994||Non-initial||12,000 *|
|Nez Perce||Aoki 1994||Lifework||4400 hw; 16,000 hw + sw1|
|Hopi||Hopi Dictionary Project 1998||Lifework||30,000 hw + sw1 *|
|Lahu||Matisoff 1988||Lifework||7300 hw; 30,000 hw + sw1|
|Tzotzil||Laughlin 1975||Lifework||30,000 *|
Another essential for lexical description is inclusion of relevant grammatical information (part of speech, declension or conjugation class, valence, gender, any irregularities, etc.). Recent field grammars (and a good many older ones) meet this criterion well. For dictionaries (but usually not for glossaries and basic wordlists), it is important to convey the meaning of the word, while glossaries and wordlists often just give a single best English equivalent as gloss. Again, recent field dictionaries usually meet well the criterion of good descriptive lexical semantics.
A desideratum for good lexical documentation might then be 3000-5000 lemmatized words. 5000 words is the size of a typical student's or learner's dictionary of commonly studied languages like English, the size of the Brown Corpus high-frequency vocabulary, and (in Cheng's findings) a bit over half of the typical writer's total active vocabulary.
On the other hand, a glossary as small as a few hundred words, promptly produced and distributed widely, can be a significant early gift from the researcher to the language community. It should always be made clear that this is a first quick test compilation and not a full dictionary: see the cautionary words of England 1992:34-35 on the negative effects of publishing only small dictionaries.
Incidentally, the introductions to some of the dictionaries listed in Table 1 provide much insight on field lexicography. Those of Matisoff and Aoki give perspective on what counts as a word in isolating and polysynthetic languages respectively (this bears on the word count). Other recent sources on dictionary making include Landau 1984/1989 and Frawley, Hill, and Munro 2002 (which includes histories of several dictionary projects; see especially the chapters by Aoki and Hill).
Kudos to these and other colleagues who have produced large and richly defined lifework dictionaries.
The Uppsala corpus of Russian (1,000,000 words) yields zero or near-zero frequencies of some morphological forms. Timberlake 2004:6 searched for the two attested instrumental singular forms of Russian tysjacha 'thousand' and got zero returns for the less common one, while searching the entire Internet returned thousands of hits for each. Many of them may be repeats, but nonetheless the body of examples on the entire Internet is several orders of magnitude larger than the Uppsala corpus.
The British National Corpus, at 100,000,000 words, is too small to test some common aspects of alternate valence patterns (e.g. dative shift with high-frequency verbs). Ruppenhofer 2004:Ch. 3 needed examples with give questioning Recipient in the two dative shift constructions (who did he give X to vs. who did he give X). He searched the BNC for examples of who or whom followed within 5 words by a form of give and found very few tokens. Searching the entire Internet via Google yielded many examples.
Thus the entire Internet (for languages like English and Russian that are well represented there) is several orders of magnitude larger than very large corpora.
My own experience (unpublished surveys of medieval Slavic texts) is that 1000 clauses exclusive of the most common verb ('be' in Slavic and probably most languages) will reflect basic inflectional and derivational categories in their main functions, basic lexical classes, basic word order, basic alignment, and the main valence types and valence-changing processes. 1500-2000 clauses is enough to exclude the most frequent verb and still have over 1000 clauses left for analysis. Though I did not make a count, 2000 clauses is probably in the range of 10,000 running words of text.
My experience from typological work is that even a few sentences of authentic text vastly enrich a description.
Therefore, a desideratum for corpora to be used for close syntactic work would be at least a million words, preferably at least ten million. Ten thousand words or even fewer will suffice to attest the basic patterns. However, anything at all - even just a few sentences - is enormously valuable.
Monson et al. 2004 measure the rate at which new wordforms show up in running text and find that, for the polysynthetic language Mapudungun, there is no fall-off in the steep rate of increase of wordforms even after 1,000,000 running words. In the Spanish translation of this corpus, by contrast, the rate is much flatter.
Rates of new wordforms to running words (approximate, calculated by eyeballing the graph of Monson et al.; these figures apply more or less anywhere after 100,000 running words):
|Mapudungun||1 : 4|
|Spanish||1 : 50|
Both curves are slightly steeper up to about 100,000 running words, suggesting that this is a useful target for lexical documentation. (It corresponds to the text size needed in Cheng's study to capture the whole active vocabulary of one author.)
Thus it appears that the number of words of running text needed for work on inflection and other morphology varies with the inflectional complexity of a language.
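A rate of new wordforms to running words of this kind can be measured on any tokenized text. The following is a minimal Python sketch, not the procedure of Monson et al.; the function name `new_wordform_rate`, the `start` parameter, and the toy token list are hypothetical. It returns the number of running words per new wordform, so a result of 4.0 corresponds to a rate of 1 : 4.

```python
def new_wordform_rate(tokens, start=0):
    """Running words per previously unseen wordform, counted from position `start`.

    Wordforms occurring before `start` count as already seen, so the rate can be
    measured on the later part of a corpus (e.g. after the first 100,000 words).
    """
    seen = set(tokens[:start])
    new = 0
    for tok in tokens[start:]:
        if tok not in seen:
            seen.add(tok)
            new += 1
    running = len(tokens) - start
    return running / new if new else float("inf")

# Toy, hypothetical data: 8 running words, 4 distinct wordforms.
toy = ["a", "b", "a", "c", "a", "a", "b", "d"]
print(new_wordform_rate(toy))           # whole text
print(new_wordform_rate(toy, start=4))  # second half only
```

Comparing the value for the whole text with the value for a late stretch shows whether the curve is flattening, which is the contrast drawn here between Mapudungun and Spanish.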
Based on the Berkeley Ingush corpus (http://ingush.berkeley.edu:7012/) and on the corpus size reported by Monson et al. 2004, I calculate that an hour of transcribed recorded speech contains about 6000 words. (The figure might be somewhat higher for languages with less inflected and therefore shorter words than the inflectionally complex Ingush or polysynthetic Mapudungun.)
Transcribed recorded hours needed at this rate for various corpus sizes:
|1 million words (Brown size):||170 hours|
|10 million words:||1,700 hours|
|100 million words (BNC size):||17,000 hours|
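The conversion behind these figures is a single division by the estimated rate of about 6000 words per transcribed hour. A minimal Python sketch follows; the names `WORDS_PER_HOUR` and `hours_needed` are hypothetical, and the rate is only the rough estimate given above. The unrounded results differ slightly from the rounded figures in the table.

```python
WORDS_PER_HOUR = 6000  # rough estimate from the text, not a measured constant

def hours_needed(running_words, words_per_hour=WORDS_PER_HOUR):
    """Transcribed recorded hours needed to reach a corpus of the given size."""
    return running_words / words_per_hour

for size in (100_000, 1_000_000, 10_000_000, 100_000_000):
    print(f"{size:>11,} words: about {hours_needed(size):,.0f} hours")
```

Passing a different `words_per_hour` gives the corresponding figures for a language with shorter or longer words.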
I have actual figures for only the Ingush corpus, whose total recorded time so far is about 150 hours. Only a small part is transcribed: 279 minutes or 4.65 hours (under 5%). Others working in documentation mention goals of transcribing and annotating some 10-15% of one's total archived corpus.
A target for good documentation might then be 150-200 recorded hours, some 15-30 hours of it transcribed and annotated.
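The arithmetic behind the Ingush figures and the suggested target can be checked with a short Python sketch. The function name `transcribed_share` is hypothetical; the inputs are the figures quoted above (279 transcribed minutes out of about 150 recorded hours, and a goal of 10-15% of 150-200 recorded hours).

```python
def transcribed_share(transcribed_minutes, recorded_hours):
    """Return (transcribed hours, transcribed fraction of the recorded total)."""
    hours = transcribed_minutes / 60
    return hours, hours / recorded_hours

hours, share = transcribed_share(279, 150)
print(f"{hours:.2f} hours transcribed, {share:.1%} of the recordings")

# Range implied by transcribing 10-15% of 150-200 recorded hours.
low, high = 0.10 * 150, 0.15 * 200
print(f"target: {low:.0f}-{high:.0f} transcribed hours")
```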
Figures recommended here are for quality recordings, transcribed, glossed, and adequately commented -- that is, provided with fluent speaker judgments on the meaning of the material and the identity of the lexical items, and additional judgments on the kind of question that is likely to arise as a linguist works on the material.
Minimal documentation: Something like 1000 clauses excluding those with the most common verb (if any verb is substantially more common than others, as 'be' is in medieval Slavic texts). To be safe, 2000 clauses (this more than provides for excluding the most common verb).
This would be several thousand to ten thousand running words. This appears to be minimally adequate for capturing major inflectional categories and major clause types, in moderately synthetic languages; for a highly synthetic or polysynthetic language more material is needed.
Basic documentation: About 100,000 running words, which appears to be the threshold figure adequate for capturing the typical good speaker's overall active vocabulary.
Good documentation: A million-word corpus. 150-200 hours of good-quality recorded text, up to about 20 hours per speaker, from a variety of speakers on a variety of topics in a variety of genres.
At 20 hours/speaker this is 10 speakers. Also, by Cheng's criteria, 100,000 words/speaker is 10 speakers for a million-word corpus. In reality, though, it is highly desirable to get more than 10 speakers (and also highly desirable to get the full 20 hours or 100,000 words from each of several speakers).
Excellent documentation: At least an order of magnitude larger than good; i.e. at least 10,000,000 words (1500-2000 recorded hours).
Full documentation: The sobering examples of the research experiences of Timberlake and Ruppenhofer (mentioned above) show that even 100,000,000 words is at least an order of magnitude too small to capture phenomena that, though of low frequency, are in the competence of ordinary native speakers. That would represent at least 20,000 recorded hours, and it is too low by an order of magnitude.
Assuming that a typical speaker hears speech for about 8 hours per day, the typical exposure is around 3000 hours per year. Assuming that full ordinary linguistic competence (i.e. not highly educated competence but ordinary adult lexical competence) is reached by one's mid-twenties, that would represent 75,000 hours. For written languages, add to that some unknown amount representing reading. Extraordinary linguistic competence -- that of a genius like Shakespeare or a highly educated modern reader -- requires wide reading, attentive listening to a wide range of selected good speakers, and a good memory.
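The exposure estimate can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the function name `exposure_hours` is hypothetical, and taking "mid-twenties" as 25 years is an assumption. The unrounded results come out slightly below the round figures of 3000 hours per year and 75,000 hours in total.

```python
def exposure_hours(hours_per_day=8, years=25, days_per_year=365):
    """Total hours of speech heard, under the stated (assumed) daily exposure."""
    return hours_per_day * days_per_year * years

per_year = exposure_hours(years=1)
total = exposure_hours()
print(f"{per_year:,} hours per year, {total:,} hours by age 25")
```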
On these various criteria it would take well over a billion (a thousand million) running words, and over 100,000 carefully chosen recorded hours, to just begin to approach the lifetime exposure of a good young adult speaker. Unfortunately, field documentation cannot hope to reach these levels. However, there is one piece of good news here: For humans, exposure requires repeats to refresh one's memory; computers, however, do not need this, so a low-frequency item, once documented, has a better chance of survival in documentation than in the speech community.
Basic lexical documentation: 1000-2000 words with basic grammatical information. Field dictionaries and glossaries of this size are useful.
Good lexical documentation: A few thousand to several thousand words including all members of closed classes, over 1000 high-frequency words, all items on standard lists such as the Swadesh lists, and some coverage of technical vocabulary and archaisms that are likely to be important for ethnography and historical linguistics. Full coverage of kinship terms and main body parts. Full grammatical information (inflection, derivation, classification, valence) for each word. Basic gazetteer including at least common personal names, important mythic, historical, and literary names, and important toponyms.
Excellent lexical documentation: 5000-6000 high-frequency words plus as full coverage as possible of archaisms, regionalisms or dialect words, and technical vocabulary. Native and traditional vocabulary as well as vocabulary for modern concepts (whether coined or borrowed or code-shifted). (Total of perhaps 10,000 words.) Large gazetteer including all known toponyms. Full grammatical and (insofar as possible) etymological information for each word.
- Aoki, Haruo. 1994. Nez Perce dictionary: UCPL 122. Berkeley-Los Angeles: University of California Press.
- Armbruster, Charles Hubert. 1965. Dongolese Nubian: A Lexicon. Cambridge: Cambridge University Press.
- Cheng, Chin-Chuan. 2000. Frequently-used characters and language cognition. Studies in the Linguistic Sciences 30:107-118.
- Cheng, Chin-Chuan. 2002. Language cognition and vocabulary learning. In Selected Papers from the Eleventh International Symposium on Language Teaching/Fourth Pan Asia Conference, 54-62. Taipei: English Teachers Association.
- Dayley, Jon P. 1989. Tümpisa (Panamint) Shoshone Dictionary: UCPL 116. Berkeley-Los Angeles: University of California Press.
- Dixon, R. M. W. 1977. A Grammar of Yidiny. Cambridge: Cambridge University Press.
- Evans, Nicholas D. 1995. A Grammar of Kayardild. Berlin: Mouton de Gruyter.
- Evans, Nicholas. 2003. Bininj Gun-Wok: A Pan-Dialectal Grammar of Mayali, Kunwinjku and Kune: Pacific Linguistics 541. Canberra: Research School of Pacific and Asian Studies, Australian National University.
- Francis, W. Nelson, and Henry Kucera. 1982. Frequency Analysis of English Usage: Lexicon and Grammar. Boston: Houghton Mifflin Company.
- Frawley, William, Hill, Kenneth C., and Munro, Pamela. 2002. Making Dictionaries: Preserving Indigenous Languages of the Americas. Berkeley: University of California Press.
- Harvey, Mark. 2001. A Grammar of Limilngan: A Language of the Mary River Region, Northern Territory, Australia: Pacific Linguistics 516. Canberra: Pacific Linguistics, Research School of Pacific and Asian Studies, Australian National University.
- Hopi Dictionary Project. 1998. Hopi Dictionary / Hopiikwa lavaytutuveni: A Hopi-English Dictionary of the Third Mesa Dialect. Tucson: University of Arizona Press.
- Karttunen, Frances. 1992. An Analytical Dictionary of Nahuatl. Norman and London: University of Oklahoma Press.
- Landau, Sidney I. 1984/1989. Dictionaries: The Art and Craft of Lexicography. Cambridge: Cambridge University Press.
- Laughlin, Robert M. 1975. The Great Tzotzil Dictionary of San Lorenzo Zinacantán: Smithsonian Contributions to Anthropology, 19. Washington, DC: Smithsonian Institution Press.
- Levin, Lori; Lavie, Alon; Vega, Rodolfo; Carbonell, Jaime; Brown, Ralf; Cañulef, Eliseo; and Huenchullan, C. 2002. Data collection and language technologies for Mapudungun. In Proceedings of the International Workshop on Resources and Tools in Field Linguistics.
- Matisoff, James A. 1988. The Dictionary of Lahu: UCPL 111. Berkeley-Los Angeles: University of California Press.
- Merlan, Francesca. 1994. A Grammar of Wardaman: A Language of the Northern Territory of Australia: Mouton Grammar Library 11. Berlin: Mouton de Gruyter.
- Monson, Christian; Levin, Lori; Vega, Rodolfo; Brown, Ralf; Font Llitjos, Ariadna; Lavie, Alon; Carbonell, Jaime; Cañulef, Eliseo; and Huisca, Rosendo. 2004. Data collection and analysis of Mapudungun morphology for spelling correction. Language Resources and Evaluation Conference (LREC) Proceedings.
- Munro, Pamela, and Willmond, Catherine. 1994. Chickasaw: An Analytical Dictionary / Chikashshanompaat Holisso Toba'chi. Norman-London: University of Oklahoma Press.
- Ruppenhofer, Josef. 2004. Unpublished Ph.D. dissertation, University of California, Berkeley.
- Solnit, David. 1997. Eastern Kayah Li: Grammar, Texts, Glossary. Honolulu: University of Hawaii Press.
- Spears, Richard A., and Linda Schinke-Llano, eds. 1984. (1996 printing.) Everyday American English Dictionary. Lincolnwood, IL: NTC Publishing Group.
- Timberlake, Alan. 2004. A Reference Grammar of Russian. Cambridge: Cambridge University Press.
- van den Berg, Helma. 1995. A Grammar of Hunzib (with Texts and Lexicon). Munich-Newcastle: Lincom Europa.