Computations

The computations fall into two categories: word-length characteristics and grade-level characteristics. The word-length characteristics are the numbers of letters, phonemes, graphemes, and syllables in the word. Unlike word-length characteristics, grade-level characteristics are a function of the word corpus analyzed. They were computed on the four Manulex-wordform lexicons corresponding to the four levels, G1, G2, G3-5, and G1-5, that is, words found in first-grade readers, second-grade readers, third-to-fifth-grade readers, and all readers, which were put into four statistical datafiles, Manu1, Manu2, Manu35, and ManuAll, respectively. There are type-based and token-based computations. Type-based computations are made on each word occurring in a lexicon, whatever its lexical frequency. Thus, a common word like "dans" (in) has the same weight as a rare word like "rang" (rank), despite their large difference in frequency. Token-based computations are made on each word occurring in a lexicon (word type), weighted by its lexical frequency (taken from the Manulex U index). The computations carried out on each of the four lexicons are detailed below.
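The difference between type-based and token-based computations can be illustrated with a minimal Python sketch. The function name, the toy lexicon, and its frequency values are assumptions made for illustration and are not taken from Manulex; the sketch also assumes that every occurrence of a unit within a word is counted.

```python
from collections import Counter

def type_and_token_counts(lexicon, units_of):
    """Return (type_counts, token_counts) for the units found in a lexicon.

    lexicon  : dict mapping each word type to its frequency (hypothetical values)
    units_of : function returning the units (letters, bigrams, ...) of a word
    """
    type_counts, token_counts = Counter(), Counter()
    for word, freq in lexicon.items():
        for unit in units_of(word):
            type_counts[unit] += 1      # every word type has the same weight
            token_counts[unit] += freq  # each word is weighted by its frequency
    return type_counts, token_counts

# Made-up frequencies: "dans" is common, "rang" is rare.
toy_lexicon = {"dans": 5000.0, "rang": 20.0}
type_counts, token_counts = type_and_token_counts(toy_lexicon, list)
```

In this toy lexicon, the letter "a" has a type-based count of 2 (one per word) but a token-based count dominated by the frequent word "dans".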

Association frequencies, GP-consistency and PG-consistency values. The ambiguity of phonological encoding from orthographic input, and the ambiguity of orthographic encoding from phonological input, are generally estimated by consistency indexes. In Manulex-Infra, the GP-consistency index is equal to the frequency at which a particular grapheme-phoneme mapping occurs divided by the total frequency of the grapheme, no matter how it is pronounced. For example, the GP-consistency index of the association "ch"->/S/ (as in the word "chat" /Sa/) is obtained by dividing the frequency of occurrence of the "ch"->/S/ association by the frequency of the grapheme "ch", irrespective of its pronunciation (including /S/, but also /k/ for example, as in "choral" /koRal/). The GP-consistency index was then multiplied by 100. Its maximal value (total consistency) is 100. Similarly, the PG-consistency index is equal to the frequency at which a particular phoneme-grapheme mapping occurs divided by the total frequency of the phoneme, no matter how the phoneme is spelled, and then multiplied by 100.
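As a sketch of the two definitions above, the following Python function computes both indexes from a table of association frequencies. The function name, the data layout, and the frequency values are assumptions for illustration only; real values would come from the segmented Manulex wordforms.

```python
from collections import defaultdict

def consistency(association_freq, unit_index):
    """Consistency (0-100) of each (grapheme, phoneme) association.

    association_freq : dict mapping (grapheme, phoneme) to a frequency
    unit_index       : 0 for GP-consistency (totals per grapheme),
                       1 for PG-consistency (totals per phoneme)
    """
    totals = defaultdict(float)
    for assoc, freq in association_freq.items():
        totals[assoc[unit_index]] += freq
    return {assoc: 100.0 * freq / totals[assoc[unit_index]]
            for assoc, freq in association_freq.items()}

# Hypothetical association frequencies (not actual Manulex-Infra values).
toy_associations = {("ch", "S"): 90.0, ("ch", "k"): 10.0, ("k", "k"): 30.0}
gp = consistency(toy_associations, 0)
pg = consistency(toy_associations, 1)
```

With these made-up frequencies, "ch"->/S/ has a GP-consistency of 90 because "ch" is pronounced /k/ in 10 of its 100 occurrences, whereas its PG-consistency is 100 because /S/ has no other spelling in the toy table.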

Consistency can differ greatly as a function of the serial position of the units in the word. In particular, due to the derivational morphology of French, word endings are often silent, so spelling is less transparent. To better characterize the orthography-phonology mappings of French, frequency and consistency were computed as a function of the relative serial position (initial, middle, final) of the units in the words. The results are included in each word entry, together with the average value and the values of the word's least frequent and least consistent associations.
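The word-level summary described above (average value and least consistent association) could be derived along the following lines. This is only a sketch: the segmentation of "choral", the consistency values, and the function name are hypothetical, and serial position is left out for brevity.

```python
def word_consistency_summary(associations, consistency_table):
    """Mean and minimum consistency over the associations of one word.

    associations      : list of (grapheme, phoneme) pairs for the word
    consistency_table : dict mapping each pair to its consistency (0-100)
    """
    values = [consistency_table[assoc] for assoc in associations]
    return {"mean": sum(values) / len(values), "min": min(values)}

# Hypothetical segmentation and consistency values for "choral" /koRal/.
choral = [("ch", "k"), ("o", "o"), ("r", "R"), ("a", "a"), ("l", "l")]
toy_table = {("ch", "k"): 10.0, ("o", "o"): 100.0, ("r", "R"): 100.0,
             ("a", "a"): 100.0, ("l", "l"): 100.0}
summary = word_consistency_summary(choral, toy_table)
```

The same function applies to association frequencies if a frequency table is passed in place of the consistency table.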

Finally, separate tables provide summary statistics on the frequency and consistency values of all associations found in the word corpora.

Infra-lexical unit frequencies. Bigram, biphone, and syllable frequencies were computed for each entry at the four levels. Bigram frequency is the frequency of occurrence of each two-letter sequence in the word list. Transposed to phonology, biphone frequency is the frequency of occurrence of each two-phoneme sequence in the word list. Finally, syllable frequency was computed from the syllabic segmentations of phonological wordforms. Computations were type-based and token-based, and were performed separately for the different units (bigram, biphone, syllable) as a function of their relative serial position in the word (initial, middle, final). Supplementary databases provide summary statistics on bigram, biphone, and syllable frequencies. Letter, phoneme, and trigram frequency tables are also available.
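A minimal sketch of position-specific bigram frequencies follows. It assumes that the first bigram of a word is "initial", the last is "final", and all others are "middle", and that a two-letter word contributes a single "initial" bigram; the source does not specify these edge cases. The lexicon and its frequencies are made up.

```python
from collections import Counter

def positional_bigram_frequencies(lexicon):
    """Type- and token-based bigram frequencies by relative serial position.

    lexicon : dict mapping each word type to its frequency (hypothetical values)
    Returns two Counters keyed by (position, bigram).
    """
    type_freq, token_freq = Counter(), Counter()
    for word, freq in lexicon.items():
        bigrams = [word[i:i + 2] for i in range(len(word) - 1)]
        last = len(bigrams) - 1
        for i, bigram in enumerate(bigrams):
            if i == 0:
                position = "initial"   # assumption: also covers two-letter words
            elif i == last:
                position = "final"
            else:
                position = "middle"
            type_freq[(position, bigram)] += 1
            token_freq[(position, bigram)] += freq
    return type_freq, token_freq

toy_lexicon = {"dans": 100.0, "rang": 10.0, "an": 50.0}  # made-up frequencies
type_freq, token_freq = positional_bigram_frequencies(toy_lexicon)
```

The same loop would apply to biphones or syllables by replacing the bigram extraction with a list of the relevant phonological units.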

Orthographic and phonographic neighborhood. Lexical neighborhood density was computed to assess lexical similarities between words. Orthographic neighbors are operationally defined as words that can be generated from the base letter string by a single letter substitution. For example, FACE, RICE, RATE, and RACK are orthographic neighbors of the word RACE. Because orthographic neighborhood density depends on the specific orthographic wordforms known by the children, values at the four levels were computed separately (first grade, second grade, third to fifth grades, all grades). Neighborhood density is type-based and token-based. The type-based measure corresponds to the raw number of orthographic neighbors. The token-based measure takes neighbor frequency into account by summing the frequency of all neighbors. Note that if two (or more) neighbors are homographs, they are counted only once, and their frequencies are summed.
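The single-letter-substitution definition, together with the merging of homographs, can be sketched as follows. The function names and frequency values are assumptions for illustration; the word list reuses the RACE example from the text.

```python
from collections import defaultdict

def differ_by_one_substitution(a, b):
    """True if a and b have the same length and differ at exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1

def orthographic_neighborhood(target, entries):
    """Return (type-based density, token-based density, neighbors).

    entries : list of (spelling, frequency); homographs may appear more than
              once and are merged, with their frequencies summed.
    """
    merged = defaultdict(float)
    for spelling, freq in entries:
        merged[spelling] += freq
    neighbors = {w: f for w, f in merged.items()
                 if differ_by_one_substitution(w, target)}
    return len(neighbors), sum(neighbors.values()), neighbors

# Hypothetical frequencies; "rice" is listed twice to mimic homographs.
toy_entries = [("face", 120.0), ("rice", 30.0), ("rice", 2.0), ("rate", 15.0),
               ("rack", 5.0), ("race", 40.0), ("road", 9.0), ("trace", 7.0)]
n_types, n_tokens, neighbors = orthographic_neighborhood("race", toy_entries)
```

Here "road" and "trace" are excluded because they differ from "race" by more than one substitution or by length, and the two "rice" entries count as a single neighbor.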

Adult reading aloud has been shown (Peereman & Content, 1997) to be facilitated by a particular subtype of orthographic neighbor, namely one that is also phonologically similar to the target word. These lexical neighbors, referred to as phonographic neighbors, are both orthographically and phonologically similar to the word to be pronounced. Phonographic neighborhood density was therefore computed in addition to orthographic neighborhood. Phonological similarity between words was estimated by applying the orthographic-neighbor operationalization to phonological forms. Hence, words were considered to be phonologically similar when they differed by a single phoneme. The computation results are incorporated in the main word databases (Manu1, Manu2, Manu35, ManuAll), and the neighbors are listed in separate files (NO1, NO2, NO35, NOall for Orthographic neighbors; NOP1, NOP2, NOP35, NOPall for Phonographic neighbors), along with their frequency.
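A phonographic neighbor must pass the single-substitution test on both its spelling and its phonological form. The sketch below assumes that each character of the phonological string stands for one phoneme; the transcriptions and function names are illustrative and are not guaranteed to match the Manulex phoneme codes.

```python
def differ_by_one_substitution(a, b):
    """True if a and b have the same length and differ at exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1

def phonographic_neighbors(target, entries):
    """Entries that are both orthographic and phonological neighbors of target.

    target, entries : (spelling, phonology) pairs, with one character per
                      phoneme in the phonology string (an assumption).
    """
    t_orth, t_phon = target
    return [(orth, phon) for orth, phon in entries
            if differ_by_one_substitution(orth, t_orth)
            and differ_by_one_substitution(phon, t_phon)]

# Illustrative toy transcriptions.
toy_entries = [("bal", "bal"), ("mal", "mal"), ("bol", "bOl"),
               ("bas", "ba"), ("sel", "sEl")]
result = phonographic_neighbors(("bal", "bal"), toy_entries)
```

In this toy list, "bas" is an orthographic neighbor of "bal" but not a phonographic one, because its final letter is silent and its phonological form is one phoneme shorter.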

Homophones and homographs. The number of homophones and the number of homographs for each entry were also computed at the four levels. Again, type-based and token-based computations were performed.

Type-based and token-based values were added to the databases. The words entering into the computations are listed in separate files, along with their frequency (HG1, HG2, HG35, and HGall for heterophonic HomoGraphs; HP1, HP2, HP35, and HPall for heterographic HomoPhones).
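The homophone and homograph counts can be sketched as below. The sketch assumes that the token-based value is the summed frequency of the other words, excluding the target itself, which the text does not state; the entries, transcriptions, and frequencies are made up.

```python
def heterographic_homophones(target, entries):
    """(count, summed frequency) of entries pronounced like target but spelled differently."""
    t_orth, t_phon, _ = target
    found = [f for orth, phon, f in entries if phon == t_phon and orth != t_orth]
    return len(found), sum(found)

def heterophonic_homographs(target, entries):
    """(count, summed frequency) of entries spelled like target but pronounced differently."""
    t_orth, t_phon, _ = target
    found = [f for orth, phon, f in entries if orth == t_orth and phon != t_phon]
    return len(found), sum(found)

# Hypothetical (spelling, phonology, frequency) entries.
toy_entries = [("ver", "vER", 10.0), ("vert", "vER", 200.0),
               ("verre", "vER", 150.0), ("fils", "fis", 300.0),
               ("fils", "fil", 20.0)]
hp_ver = heterographic_homophones(("ver", "vER", 10.0), toy_entries)
hg_fils = heterophonic_homographs(("fils", "fis", 300.0), toy_entries)
```

For each target, the first element of the returned pair is the type-based value and the second is the token-based value.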

Orthographic uniqueness point. In studies on auditory word recognition, the phonological uniqueness point is traditionally defined as the serial position of the phoneme (counting from the first phoneme in the word) at which the target word diverges from other lexical candidates. Transposed to orthographic forms, the uniqueness point refers to the serial position of the letter (counting from left to right) at which the target word diverges from all other lexical candidates. The orthographic uniqueness point is given for each word at each grade level.
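One way to compute the orthographic uniqueness point is to look for the shortest prefix of the target that no other word in the lexicon shares. The sketch below returns None when the target is itself the beginning of a longer word (for example "chat" in "chaton"); how such cases are coded in the database is not stated in the text, so this is an assumption, as are the function name and word list.

```python
def orthographic_uniqueness_point(word, lexicon):
    """1-based letter position at which word diverges from all other words.

    Returns None if the word never becomes unique, i.e. it is a prefix
    of another word in the lexicon (assumed convention).
    """
    others = [w for w in lexicon if w != word]
    for k in range(1, len(word) + 1):
        prefix = word[:k]
        if not any(other.startswith(prefix) for other in others):
            return k
    return None

toy_lexicon = ["chat", "chaton", "chien", "cheval", "rang"]
points = {w: orthographic_uniqueness_point(w, toy_lexicon) for w in toy_lexicon}
```

Because the result depends on which other words are in the lexicon, the function would be run separately on each of the four grade-level word lists.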