Manulex-infra is a set of databases derived from Manulex, a web-accessible database listing word-frequency values for 48,886 lexical entries encountered in 54 French elementary-school readers. The development of Manulex-infra was motivated by the desire to overcome two important limitations of the databases currently available for the study of literacy acquisition. First, most current linguistic databases for French either are based on adult-directed written material, which makes them unsuitable for studying literacy acquisition, or provide child word-frequency norms that do not reflect current usage. Second, the databases that do cover the present-day child's written vocabulary provide lexical statistics only. Despite the undeniable relevance of recent word-frequency norms for studying literacy acquisition, word frequency is not the sole factor influencing reading and writing performance, so more specialized databases are needed. In particular, it is well known that the statistical characteristics of a language, such as the consistency of the mappings between orthography and phonology, influence literacy acquisition. Hence, linguistic databases designed for the study of literacy acquisition should provide objective statistics on infra-lexical variables, especially grapheme-to-phoneme and phoneme-to-grapheme mappings.

Linguistic databases play a central role in this research by compiling quantitative, objective estimates of the principal variables that affect reading and writing acquisition. We describe a new set of web-accessible databases of French orthography whose main characteristic is that they are based on frequency analyses of words occurring in the readers used in elementary school grades. Our aim was to develop a companion set of databases to the Manulex frequency norms that allows experimental studies to manipulate and control several critical infra-lexical and lexical variables when investigating literacy acquisition in French children. Quantitative estimates were computed for several infra-lexical variables (syllables, grapheme-to-phoneme mappings, bigrams) and lexical variables (lexical neighborhood, homophony, and homography). These analyses should permit quantitative descriptions of the written language encountered by beginning readers, the manipulation and control of variables based on objective data in empirical studies, and the development of instructional methods in keeping with the distributional characteristics of the orthography.
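
To make the notion of mapping consistency concrete, the following minimal sketch computes a grapheme-to-phoneme consistency index as the share of a grapheme's occurrences that are pronounced in a given way, counted either by types (each word once) or by tokens (each word weighted by its frequency). It illustrates the general idea only and is not the procedure used to build Manulex-infra; the toy lexicon, its frequencies, its segmentations, the ad hoc phoneme codes, and the function name `gp_consistency` are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical toy lexicon: (word, frequency, [(grapheme, phoneme), ...]).
# Frequencies, segmentations, and phoneme codes are illustrative only;
# they are not taken from Manulex or Manulex-infra. "" marks a silent grapheme.
TOY_LEXICON = [
    ("chat", 120.0, [("ch", "S"), ("a", "a"), ("t", "")]),
    ("chien", 80.0, [("ch", "S"), ("i", "j"), ("en", "e~")]),
    ("chorale", 2.0, [("ch", "k"), ("o", "o"), ("r", "R"),
                      ("a", "a"), ("l", "l"), ("e", "")]),
]


def gp_consistency(lexicon, weighted=True):
    """Return {(grapheme, phoneme): proportion of the grapheme's
    occurrences that map to that phoneme}.

    weighted=True  -> token count (each word weighted by its frequency)
    weighted=False -> type count (each word counts once)
    """
    pair_counts = defaultdict(float)
    grapheme_counts = defaultdict(float)
    for _word, freq, mapping in lexicon:
        weight = freq if weighted else 1.0
        for grapheme, phoneme in mapping:
            pair_counts[(grapheme, phoneme)] += weight
            grapheme_counts[grapheme] += weight
    return {
        pair: count / grapheme_counts[pair[0]]
        for pair, count in pair_counts.items()
    }


if __name__ == "__main__":
    by_token = gp_consistency(TOY_LEXICON, weighted=True)
    by_type = gp_consistency(TOY_LEXICON, weighted=False)
    for pair in sorted(by_token):
        print(pair, round(by_type[pair], 3), round(by_token[pair], 3))
```

In this toy data, the grapheme "ch" is less consistent by type than by token, because its rarer pronunciation occurs only in a low-frequency word. A phoneme-to-grapheme index could be sketched the same way by normalizing over phonemes instead of graphemes.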


