Word Corpus




Our quantitative descriptions of the orthographic and phonological characteristics of French words encountered by children in elementary school are based on the MANULEX lexical entries (non-lemmatized version) and their corresponding frequency norms (Lété et al., 2004). We chose MANULEX mainly because it provides separate lexicons and frequency norms from Grade 1 to Grade 5. Word frequency in MANULEX is computed separately for three school levels: Grade 1, Grade 2, and Grades 3 to 5. A fourth index gives word frequency for all grades considered together. More details about the MANULEX database can be found either on the new MANULEX website or at http://unpc.univ-lyon2.fr/~lete/manulex_eng/INDEX.HTM

All entries in the MANULEX wordform lexicon (Lété et al., 2004) were used for the computations, except abbreviations, euphonic strings, interjections, and compound entries (entries that contain a space, an apostrophe, or a dash). Because many proper names listed in MANULEX have ambiguous or unknown pronunciations, only those with a frequency value of at least .10 at the G1-5 level (all grades together) were included in the computations. The total number of entries in G1-5 is 45080. Among these, 10861 occurred in G1 (Grade 1), 18131 in G2 (Grade 2), and 42422 in G3-5 (Grades 3 to 5).
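The selection rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually used to build the corpus: the `Entry` fields, the category labels (`"abbreviation"`, `"euphonic"`, `"interjection"`, `"proper_name"`), and the function names are hypothetical and do not reflect the real MANULEX file format. Only the exclusion criteria and the .10 threshold come from the text.

```python
from dataclasses import dataclass

# Hypothetical category labels; the actual MANULEX coding may differ.
EXCLUDED_CATEGORIES = {"abbreviation", "euphonic", "interjection"}
# Characters that mark a compound entry (space, apostrophe, dash).
# Assumes ASCII apostrophe and hyphen; other encodings would need more markers.
COMPOUND_MARKERS = (" ", "'", "-")
# Minimum G1-5 frequency required for a proper name to be kept.
MIN_PROPER_NAME_FREQ = 0.10


@dataclass
class Entry:
    form: str         # orthographic word form
    category: str     # hypothetical category label
    freq_g1_5: float  # frequency value for all grades together


def is_compound(form: str) -> bool:
    """Return True if the form contains a space, an apostrophe, or a dash."""
    return any(marker in form for marker in COMPOUND_MARKERS)


def keep_entry(entry: Entry) -> bool:
    """Apply the three exclusion rules described in the text."""
    if entry.category in EXCLUDED_CATEGORIES:
        return False
    if is_compound(entry.form):
        return False
    if entry.category == "proper_name" and entry.freq_g1_5 < MIN_PROPER_NAME_FREQ:
        return False
    return True


def filter_lexicon(entries: list[Entry]) -> list[Entry]:
    """Return only the entries retained for the computations."""
    return [entry for entry in entries if keep_entry(entry)]
```

For example, given made-up entries such as `Entry("chat", "noun", 50.0)` and `Entry("arc-en-ciel", "noun", 2.0)`, `filter_lexicon` keeps the first and drops the second because it contains a dash. The frequency values here are invented for illustration only.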

Details of the computations are provided in the following pages, together with descriptive statistics for the different corpora. We also describe the results of recent, preliminary analyses comparing grapheme-phoneme mappings in the MANULEX corpus and in a corpus of adult-directed written material (Lexique, version 3; New et al., 2005). Note that these last analyses are not described in the Manulex_Infra manuscript (Peereman, Lété, & Sprenger-Charolles, in press).