Graphemic Parsing




MANULEX-wordform lexical entries were also segmented into graphemic units in order to compute the frequency and consistency of grapho-phonological mappings. As far as possible, the main principle was to segment the orthographic chains so that each segmented substring corresponds to a single phoneme. The term "grapheme" is thus used here to refer to a letter or letter group that matches a phoneme. Note that French includes several multi-letter graphemes such as "ou", "an", "un", "in", "eu", "ch", and "gn".
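
As a minimal sketch of what such segmented entries make possible, the Python snippet below stores each word as a list of (grapheme, phoneme) pairs, checks that the graphemes rebuild the spelling, and counts how often each mapping occurs. The names `ENTRIES`, `check_alignment`, and `mapping_frequencies` are hypothetical, the three words are toy data taken from examples in this section, and mapping their final silent letters to "#" is an assumption made for illustration; this is not the procedure actually used to build the database.

```python
from collections import Counter

# Toy aligned entries: (spelling, [(grapheme, phoneme), ...]).
# Phoneme codes follow the notation used in this section; treating the
# final letters of these words as silent ("#") is an assumption.
ENTRIES = [
    ("piquant", [("p", "p"), ("i", "i"), ("qu", "k"), ("an", "@"), ("t", "#")]),
    ("guise", [("gu", "g"), ("i", "i"), ("s", "z"), ("e", "#")]),
    ("gise", [("g", "Z"), ("i", "i"), ("s", "z"), ("e", "#")]),
]


def check_alignment(spelling, pairs):
    """Return True if the graphemes, joined in order, rebuild the spelling."""
    return "".join(grapheme for grapheme, _ in pairs) == spelling


def mapping_frequencies(entries):
    """Count occurrences of each (grapheme, phoneme) mapping over all entries."""
    counts = Counter()
    for spelling, pairs in entries:
        if not check_alignment(spelling, pairs):
            raise ValueError(f"bad segmentation for {spelling!r}")
        counts.update(pairs)
    return counts


freqs = mapping_frequencies(ENTRIES)
print(freqs[("i", "i")])   # 3
print(freqs[("s", "z")])   # 2
print(freqs[("gu", "g")])  # 1
```

These are plain type counts over three words; weighting each mapping by word frequency would be a natural extension but is not shown here.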

Graphemic segmentation of French words is generally not problematic, although segmentation choices had to be made in some cases. Our choices were governed by a second important principle according to which segmentation must highlight inconsistencies in the pronunciation of orthographic strings. In what follows, the details of these cases and the specific solutions that were adopted are described.

Two broad categories of problems were encountered. On the one hand, the exact limits of the orthographic substring that matched a single phoneme were sometimes ambiguous. On the other hand, it was sometimes impossible to make each phoneme in a phoneme string correspond to a particular grapheme.

The letter "g" is generally pronounced in one of two ways, depending on the vowel that immediately follows it. The pronunciation /g/ occurs before the letters "a", "o", and "u" (e.g., "gare", "golf", "guide"), whereas the pronunciation /Z/ occurs before the letters "i" and "e" ("givre", "gel"). Most of the time, the letter "u" is not pronounced when it appears after the letter "g", but its presence causes the "g" to be pronounced /g/ ("guise" is pronounced /giz/ whereas "gise" is pronounced /Ziz/). Thus, the grapheme "gu" is considered to usually have the phoneme /g/ as its counterpart. A similar analysis applies to the grapheme "qu", corresponding to the phoneme /k/, in which the letter "u" is generally not pronounced (e.g., "piquant" /pik@/). Some exceptions exist. For example, the letter "u" corresponds to the glide /w/ in the words "aquarelle" /akwaREl/ and "iguane" /igwan/, or to the glide /8/ in the words "linguistique" /l5g8istik/ and "aiguille" /eg8ij/. Segmentation into the graphemes "gu" and "qu" is thus preserved in order to reveal the pronunciation inconsistencies ("gu" pronounced /g/, /gw/, or /g8/; "qu" pronounced /k/ or /kw/).

Generally, the usefulness of allowing graphemic groups to be mapped to more than one phoneme is particularly obvious in cases where the same graphemic group is sometimes associated with a single phoneme and sometimes with a larger phonemic sequence. If doubled consonants were treated as a single grapheme mapping to a single phoneme (e.g., "gramme", "rappel", "joggeurs", "accord"), then this would imply encoding them in this way even when each of the consonants represents a distinct phoneme (e.g., /g/ and /Z/ in "suggère", /k/ and /s/ in "vaccin", /t/ and /S/ in "carpaccio"). A similar case is the "er" ending of words. In many words (mainly verbs in the infinitive), "er" corresponds to the phoneme /e/ ("aimer", "parler", "viser"), but the pronunciation /ER/ is found in some words ("amer", "fer", "mer", "enfer"). Again, pronunciation inconsistencies emerge only if the parsing method maintains the graphemic group "er", whatever its pronunciation. Other similar cases are the groups "in" (/5/ in "singe", /in/ in "bingo"), "im" (/5/ in "timbre", /im/ in "interim"), "en" (/@/ in "enseigne", /@n/ in "enivre"), "enn" (/En/ in "ennemi", /@n/ in "ennui"), "emm" (/em/ in "Emmanuel", /@m/ in "emmener", /am/ in "femme" and "récemment", /Em/ in "dilemme"), "ill" (/j/ in "ailleurs", /ij/ in "Antilles"), "illi" (/ji/ in "taillis", /ij/ in "serpillière"), "ay" (/E/ in "Raymond", /aj/ in "mayonnaise", /Ej/ in "crayon"), and "ey" (/E/ in "poney", /Ej/ in "asseyait").
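
The point that inconsistencies only become visible when a graphemic group is kept intact can be illustrated with a short Python sketch. It collects, for each group, the set of pronunciations observed in a handful of the examples quoted in this section and flags the groups with more than one. The names `EXAMPLES`, `pronunciations_by_group`, and `inconsistent_groups` are hypothetical, and the "ch" entry (with /S/ for "chaise") is an assumed consistent case added only for contrast.

```python
from collections import defaultdict

# (graphemic group, example word, pronunciation) triples. All but the last
# come from examples quoted in this section; the list is illustrative only.
EXAMPLES = [
    ("gu", "guise", "g"),
    ("gu", "iguane", "gw"),
    ("gu", "linguistique", "g8"),
    ("qu", "piquant", "k"),
    ("qu", "aquarelle", "kw"),
    ("er", "aimer", "e"),
    ("er", "amer", "ER"),
    ("in", "singe", "5"),
    ("in", "bingo", "in"),
    ("ch", "chaise", "S"),  # assumed consistent case, added for contrast
]


def pronunciations_by_group(examples):
    """Map each graphemic group to the set of pronunciations seen for it."""
    table = defaultdict(set)
    for group, _word, pronunciation in examples:
        table[group].add(pronunciation)
    return table


def inconsistent_groups(examples):
    """Return, in sorted order, the groups with more than one pronunciation."""
    table = pronunciations_by_group(examples)
    return sorted(group for group, prons in table.items() if len(prons) > 1)


print(inconsistent_groups(EXAMPLES))  # ['er', 'gu', 'in', 'qu']
```

If "er" were instead split into "e" and "r", the two pronunciations would be spread over separate letter-level mappings and the group itself would never appear in such a table.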

In other cases, graphemic groups associated with more than one phoneme are required because no correspondence can be found between letters and individual phonemes. The grapheme "oi" is frequent in French words, and it is generally pronounced /wa/, as in "oiseau" and "noisette". However, unlike "ui", which can be broken down into "u" (generally /8/) and "i" (generally /i/), "oi" cannot be split into letters that each match one of its phonemes, so keeping "oi" together is an acceptable solution. A similar problem occurs for the graphemes "oï", "oî", "oin", and "oy", as well as for graphemes included in foreign words such as "ue" /ju/ in "fuel" or "le" /9l/ in "scrabble".

Finally, French is a language that stands apart when it comes to the transcription of word-final morphological marks. A large number of morphological marks that are not pronounced are used in the written language. This is true of derivational marks. For example, the "d" at the end of the French word "lourd" (heavy), from which the word "lourdeur" (heaviness) is derived, is silent, whereas the "d" in the English word "kind" (from which "kindness" is derived) is pronounced. In addition, the "s" that signals the plural at the end of a French word ("tables") is silent, as is the "s" that indicates the second person of verbs ("tu manges" (you eat)), whereas these written letters are pronounced in the English words "tables" or "he/she eats". The existence of silent letters was taken into account by mapping them to a silent phoneme (represented by a hash mark).
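
A minimal Python sketch of this convention follows. Silent letters are aligned with "#", so that every letter of the spelling still belongs to some grapheme while the pronounced form is recovered by skipping the hash marks. The names `SILENT`, `LOURD`, `TABLES`, `pronounced`, and `silent_letters` are hypothetical, and the two transcriptions are assumptions made for illustration rather than entries copied from the database.

```python
SILENT = "#"  # hash mark standing for the silent phoneme

# Assumed alignments for two words discussed above (illustrative only).
LOURD = [("l", "l"), ("ou", "u"), ("r", "R"), ("d", SILENT)]
TABLES = [("t", "t"), ("a", "a"), ("b", "b"), ("l", "l"),
          ("e", SILENT), ("s", SILENT)]


def pronounced(pairs):
    """Join the phonemes of an aligned entry, skipping silent ones."""
    return "".join(phoneme for _, phoneme in pairs if phoneme != SILENT)


def silent_letters(pairs):
    """List the graphemes of an aligned entry that map to the silent phoneme."""
    return [grapheme for grapheme, phoneme in pairs if phoneme == SILENT]


print(pronounced(LOURD))       # luR
print(silent_letters(LOURD))   # ['d']
print(pronounced(TABLES))      # tabl
print(silent_letters(TABLES))  # ['e', 's']
```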

The total number of graphemic groups obtained is 125. They are listed below together with examples:

Grapheme  Example         Grapheme  Example         Grapheme  Example
a         affiche         er        amer            on        ronde
à         çà              eu        deux            o         bol
â         âge             eû        jeûner          ô         hôpital
aen       caen            ey        poneys          ooin      shampooing
ai        aigle           ez        parlez          oo        alcool
aî        aîné            f         fond            ou        joue
aim       daim            ff        affable
ain       train           ge        bougeoir                  août
am        tambour         g         gare            ow        clown
an        ange            gg        agglomération   oy        troyes
aon       faon            gn        agneau          ph        dauphin
aô        saône           gu        aiguiser        p         pont
au        épaule          h         hiver           pp        grippe
aw        crawl           i         rire            q         cinq
ay        crayon          î         île             qu        quille
b         balcon          ï         héroïque        r         roue
bb        abbé            il        ail             rr        nourrice
cch       pinocchio       illi      serpillière     sch       schéma
cc        accord          ill       accastillage    sc        crescendo
ch        chaise          im        timbre          sç        acquiesça
c         car             in        vin             sh        washington
ck        hockey          în        devînt          ss        mousse
cqu       acquéreur       ïn        coïncidence     s         rose
ç         hameçon         j         jour            th        panthère
d         date            k         kilo            t         vite
dd        bouddha         kk        drakkar         tt        roulotte
ea        leader          le        scrabble        ue        bruegel
ean       jean            l         loup            um        humble
eau       agneau          ll        actuelle        un        aucun
e         adverbe         m         meuble          ü         capharnaüm
é         abbé            mm        pomme           u         rue
è         mystère         ng        boeing          û         bûche
ê         tête            n         nom             v         vent
ë         noël            nn        abonné          wh        whisky
ee        jeep            oa        goal            w         kiwi
ei        veine           oeu       bœuf            x         boxe
eim       reims           oe        moelle          y         yoga
ein       plein           oê        poêle           ym        cymbale
em        assembler       oin       loin            yn        larynx
emm       dilemme         oi        oiseau          z         lézard
en        dent            oî        benoît          zz        pizza
enn       solennel        om        ombre