Statistical description

 

 

 

Statistical descriptions of the four corpora are provided in the Tables. They make it possible to follow how the lexical corpora evolve as the school grade increases. Our primary aim in providing statistical descriptions of the Manulex-infra variables was to allow users to select experimental items more finely, taking into account the statistical distributions of the various variables within the Manulex corpora. Knowing how a variable is distributed in a corpus facilitates selection: the researcher can situate the chosen experimental items relative to the whole corpus and so estimate their representativeness. An advantage of this approach is that it avoids the use of atypical items that are not representative of the word set in the database. This type of control also appears to be advantageous for cross-linguistic studies, because differences in the methodologies used to build lexical databases (e.g., the choice of written texts) inevitably lead to variations in the lexical statistics obtained, making cross-linguistic comparisons difficult. Comparisons should be less problematic when they are based on the most representative items of the corpora, which are less likely to be contaminated by initial methodological choices. For this reason, dispersion indexes were calculated for the variables considered in Manulex-infra.
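As a rough illustration of how a candidate item might be situated within a corpus distribution, the following Python sketch computes the percentile rank of an item's value and flags items that fall outside a central band. The word lengths, the function names, and the 10th to 90th percentile band are hypothetical choices made for the example; they are not Manulex-infra data, nor the dispersion indexes reported in the Tables.

```python
from bisect import bisect_left, bisect_right

def percentile_rank(values, x):
    """Percentage of corpus values below x, counting ties as half."""
    ordered = sorted(values)
    below = bisect_left(ordered, x)          # values strictly smaller than x
    ties = bisect_right(ordered, x) - below  # values equal to x
    return 100.0 * (below + 0.5 * ties) / len(ordered)

def is_typical(values, x, low=10.0, high=90.0):
    """True when x lies inside the central band of the distribution.
    The band limits are an arbitrary assumption for this sketch."""
    return low <= percentile_rank(values, x) <= high

# Hypothetical word lengths (in letters) for a toy corpus.
corpus_lengths = [2, 3, 3, 4, 4, 4, 5, 5, 6, 7, 8, 9, 10, 12]

for candidate in (4, 12):
    rank = percentile_rank(corpus_lengths, candidate)
    print(candidate, round(rank, 1), is_typical(corpus_lengths, candidate))
```

The same check could be applied to any variable in the database (frequency, neighborhood size, consistency) by substituting the corresponding column of values.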

 

  
Tables

Word Frequency and Word Length
Bigram Frequency
Biphone Frequency
Homographs and Homophones
Lexical Neighborhood
Syllable Frequency
Grapheme-Phoneme and Phoneme-Grapheme Frequency
Grapheme-Phoneme Consistency
Phoneme-Grapheme Consistency

 

 


 

In addition to satisfying these methodological considerations, a secondary aim of the statistical descriptions was to make it possible to determine whether infra-lexical variables, particularly grapheme-phoneme mappings, exhibit distributional disparities across corpora. Changes in the lexical characteristics of a word set are likely to occur as the size of the set increases. This should be the case for lexical similarity variables such as the number of lexical neighbors or the number of homophonic words. Mean word frequency and word length are also expected to change as the school grade increases, assuming that children are exposed first to short and frequent words. An interesting question, however, is whether the infra-lexical characteristics of the different word sets vary. One possibility is that the mappings between orthography and phonology are less complex for words appearing in the readers used in the lower grades. This could be true, for example, if most of the words found in the first-grade readers were purposely selected to minimize grapheme-phoneme inconsistencies. Alternatively, the mean consistency of grapheme-phoneme associations may be similar across grades.

Analyses of the by-type statistics for grapheme-phoneme associations indicate that frequency naturally increases as a function of the orthographic vocabulary size. Regarding consistency, the directionality of the orthography-phonology mapping matters. Mean consistency is higher for grapheme-phoneme correspondences than for phoneme-grapheme correspondences, an observation that parallels previous findings based on monosyllabic French words segmented into units larger than the grapheme (onset, nucleus, coda, and rhyme; Peereman & Content, 1998, 1999; Ziegler et al., 1997). Consistency also varies with the position of the grapheme in the word. In particular, phoneme-grapheme consistency is much lower for final graphemes than for initial or middle ones, mainly because of the frequent presence of silent derivational marks in word endings.

Interestingly, on average, the consistency of both grapheme-phoneme and phoneme-grapheme associations does not vary across grades. This suggests that French children are exposed to highly inconsistent phoneme-grapheme mappings from the very beginning of reading acquisition. Connectionist models suggest that exposure to the full complexity of orthography-phonology mappings from the start is beneficial to learning and processing new words later (Zevin & Seidenberg, 2002, 2004). Connectionist networks trained only on words exhibiting consistent orthography-phonology associations have trouble modifying their weights to learn new associations. Our distributional analyses of grapheme-phoneme mappings indicate that children are exposed to similar degrees of complexity across school grades, which seems to be an ideal situation according to connectionist simulations. Thus, encountering a variety of grapheme-phoneme associations when starting to read and write may help young children process new words later.
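To make the role of directionality concrete, here is a minimal Python sketch that computes by-type consistency in both directions for a toy lexicon. It assumes a common definition that is not spelled out in this section: the consistency of a mapping is the percentage of occurrences of its grapheme (or, in the reverse direction, of its phoneme) that carry that mapping, with each word type counted once regardless of its frequency. The four words, their segmentations, and the phoneme codes are hypothetical illustrations, not Manulex-infra entries.

```python
from collections import Counter

def consistency_tables(words):
    """Return (grapheme-to-phoneme, phoneme-to-grapheme) consistency,
    each as a dict mapping (grapheme, phoneme) to a percentage."""
    pair_counts = Counter()
    grapheme_counts = Counter()
    phoneme_counts = Counter()
    for segments in words:
        for grapheme, phoneme in segments:
            pair_counts[(grapheme, phoneme)] += 1
            grapheme_counts[grapheme] += 1
            phoneme_counts[phoneme] += 1
    gp = {pair: 100.0 * n / grapheme_counts[pair[0]]
          for pair, n in pair_counts.items()}
    pg = {pair: 100.0 * n / phoneme_counts[pair[1]]
          for pair, n in pair_counts.items()}
    return gp, pg

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Hypothetical segmentations: one list of (grapheme, phoneme) pairs per word type.
toy_lexicon = [
    [("b", "b"), ("a", "a"), ("t", "t"), ("eau", "o")],  # bateau
    [("au", "o"), ("t", "t"), ("o", "o")],               # auto
    [("m", "m"), ("o", "o"), ("t", "t"), ("o", "o")],    # moto
    [("eau", "o")],                                      # eau
]

gp, pg = consistency_tables(toy_lexicon)
print("mean G-P consistency:", round(mean(gp.values()), 1))
print("mean P-G consistency:", round(mean(pg.values()), 1))
```

In this toy set each grapheme has a single pronunciation, whereas the phoneme /o/ has three spellings, so the phoneme-to-grapheme direction comes out less consistent. A position-specific analysis could be sketched the same way by keying the counters on (position, grapheme, phoneme).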

Comparing the Manulex-infra statistics with an adult lexical database. Recent analyses were performed to compare grapheme-phoneme statistics obtained from the Manulex corpus and from the Lexique database (ver. 3; New et al., 2005). Because the Manulex-infra statistics are derived from a particular set of written texts (elementary-school readers), it is interesting to examine whether the computed statistics approximate those that can be computed from a word set derived from adult reading materials. The analyses reported below are based on the grapheme-phoneme associations (n = 157) that occurred at least 50 times (in different words or not) in the Lexique corpus. This criterion was adopted to avoid very rare or uncertain orthography-phonology mappings. The statistics were computed from 138,057 entries (i.e., after removing 118 entries from the Lexique corpus) using graphemic segmentation principles similar to those used for Manulex-infra.
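The selection criterion can be sketched in a few lines of Python. The sketch counts every occurrence of a mapping, including repeated occurrences within a single word, and keeps the mappings that reach the threshold of 50. The function names and the toy entries are hypothetical; only the threshold comes from the text.

```python
from collections import Counter

def count_mappings(entries):
    """Count every occurrence of each (grapheme, phoneme) mapping,
    including repeats inside a single word."""
    counts = Counter()
    for segments in entries:
        counts.update(segments)
    return counts

def frequent_mappings(counts, min_count=50):
    """Mappings that occur at least min_count times."""
    return {mapping for mapping, n in counts.items() if n >= min_count}

# Hypothetical entries: 30 words containing o -> /o/ twice, 5 words with ch -> /tS/.
toy_entries = [[("o", "o"), ("t", "t"), ("o", "o")]] * 30 + [[("ch", "tS")]] * 5

toy_counts = count_mappings(toy_entries)
selected = frequent_mappings(toy_counts)
print(selected)
```

Here the mapping between O and /o/ passes the threshold only because both of its occurrences in each word are counted, which is what "in different words or not" implies.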

Analyses were carried out to examine the correlations between grapheme-phoneme Frequency and Consistency (by-type values) for the 157 different mappings computed from the different corpora. As can be seen in the Table below, very high correlations were obtained. It thus appears that, on average, the phoneme-grapheme statistics computed from the school readers considered in the Manulex database provide an accurate picture of the grapho-phonological complexities occurring in an adult lexical corpus.

Correlations between the Frequency and Consistency values computed on the different corpora

G-P Frequency
         1      2      3-5    1-5    Ad
1        -      .99    .99    .99    .98
2        -      -      .99    .99    .98
3-5      -      -      -      .99    .98
1-5      -      -      -      -      .98
Ad       -      -      -      -      -

Graph-Phon Consistency
         1      2      3-5    1-5    Ad
1        -      .99    .99    .99    .94
2        -      -      .99    .99    .95
3-5      -      -      -      .99    .96
1-5      -      -      -      -      .96
Ad       -      -      -      -      -

Phon-Graph Consistency
         1      2      3-5    1-5    Ad
1        -      .99    .99    .99    .98
2        -      -      .99    .99    .98
3-5      -      -      -      .99    .99
1-5      -      -      -      -      .99
Ad       -      -      -      -      -

Note. G-P frequency is identical to P-G frequency. 1: Grade Level 1 (CP); 2: Grade Level 2 (CE1); 3-5: Grade Levels 3 to 5 (CE2 to CM2); 1-5: Grade Levels 1 to 5 (CP to CM2); Ad: Lexique corpus (adult).
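A triangular set of correlations like the one in the table above could be computed along the following lines. This Python sketch assumes that Pearson correlations were used, which the text does not state, and it runs on a handful of invented consistency values rather than on the 157 mappings; all names and numbers are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def pairwise_correlations(corpora):
    """Correlate every pair of corpora over the mappings they share.
    Returns a dict keyed by (corpus_a, corpus_b), upper triangle only."""
    names = list(corpora)
    shared = sorted(set.intersection(*(set(corpora[name]) for name in names)))
    result = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            result[(a, b)] = pearson([corpora[a][m] for m in shared],
                                     [corpora[b][m] for m in shared])
    return result

# Invented consistency values (0-100) for four mappings in three corpora.
toy_corpora = {
    "1":  {("ch", "S"): 100, ("o", "o"): 50, ("eau", "o"): 20, ("t", "t"): 80},
    "2":  {("ch", "S"): 98,  ("o", "o"): 52, ("eau", "o"): 22, ("t", "t"): 79},
    "Ad": {("ch", "S"): 84,  ("o", "o"): 55, ("eau", "o"): 25, ("t", "t"): 75},
}

correlations = pairwise_correlations(toy_corpora)
for pair, r in correlations.items():
    print(pair, round(r, 2))
```

Restricting each comparison to the shared mappings is a design choice of the sketch: it keeps the two value lists aligned when a mapping is absent from one of the corpora.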
     

Finally, Figures 1 and 2 below display the frequency and the phoneme-to-grapheme consistency values (by-type) computed on the different corpora. The mappings in Figure 1 are ordered as a function of their frequency in the Lexique corpus. The mappings in Figure 2 are ordered as a function of their consistency in the Lexique corpus.

Figure 1. Frequency of the 157 Phoneme-Grapheme mappings in the different corpora

It is clear from Figure 1 that what distinguishes the different corpora is the frequency with which each mapping occurs. Not surprisingly, frequencies increase as a function of corpus size: they are highest for the Lexique corpus, then for the Level 3 corpus, followed by Level 2 and Level 1. Interestingly, the frequency ratios between the different corpora appear similar across mappings.

Figure 2. Consistency for the 157 Phoneme-Grapheme mappings in the different corpora

Figure 2 shows that phonology-orthography consistency is similar across corpora for most of the mappings. The few points that diverge from the adult statistics come essentially from the Level 1 and Level 2 corpora. This follows from the fact that a few of the mappings remain consistent in the lower grades and become less consistent as new words are introduced into the reading lexicon. For example, the mapping between /tS/ and CH (as in the word “sandwich”) is perfectly consistent in the Level 1 corpus (100), but it drops to 63 in Level 3 and is 84 in the adult lexicon, due to the addition of new words (e.g., “carpaccio”, “dolce”). However, although a few similar cases exist, it appears that the consistency of the mappings does not differ greatly between the Manulex and the Lexique corpora. This observation is in agreement with our previous analyses showing that children are exposed to similar degrees of complexity across school grades.
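The effect of vocabulary growth on a single mapping can be sketched as follows in Python. The sketch recomputes the phoneme-to-grapheme consistency of /tS/ spelled CH before and after words with other spellings of /tS/ enter the lexicon. Only the segment carrying /tS/ is listed for each word, the segmentations (C in "dolce", CC in "carpaccio") are assumptions made for the example, and the resulting percentages are toy values that do not reproduce the 100, 63, and 84 reported above.

```python
def pg_consistency(words, phoneme, grapheme):
    """By-type phoneme-to-grapheme consistency (0-100): share of the
    occurrences of `phoneme` that are spelled with `grapheme`."""
    spellings = [g for segments in words.values()
                 for g, p in segments if p == phoneme]
    if not spellings:
        raise ValueError("phoneme does not occur in this word set")
    return 100.0 * spellings.count(grapheme) / len(spellings)

# Hypothetical word sets; only the /tS/ segment of each word is shown.
early_lexicon = {
    "sandwich": [("ch", "tS")],
}
later_lexicon = dict(early_lexicon)
later_lexicon.update({
    "dolce": [("c", "tS")],       # assumed segmentation
    "carpaccio": [("cc", "tS")],  # assumed segmentation
})

early = pg_consistency(early_lexicon, "tS", "ch")
later = pg_consistency(later_lexicon, "tS", "ch")
print(round(early, 1), round(later, 1))
```

With a rare phoneme such as /tS/, one or two new words are enough to move the value considerably, which is consistent with the divergent points being concentrated in the smaller Level 1 and Level 2 corpora.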