Technical principles of the morphological analysis
The morphological analyzer
Let's create a src/java/gpl/pierrick/brihaye/aramorph/test/testdocs/ktAb.txt file in which we will type a single word, كتاب.
Then let's run the following command:
java -cp build/ArabicAnalyzer.jar;lib/commons-collections.jar gpl.pierrick.brihaye.aramorph.AraMorph src/java/gpl/pierrick/brihaye/aramorph/test/testdocs/ktAb.txt CP1256 results.txt UTF-8 -v
Let's have a look at the output file, results.txt, which is encoded in UTF-8:
Processing token : كتاب
Transliteration : ktAb
Token not yet processed.
Token has direct solutions.

SOLUTION #3
Lemma : kAtib
Vocalized as : كُتّاب
Morphology :
  prefix : Pref-0
  stem : N
  suffix : Suff-0
Grammatical category :
  stem : كُتّاب NOUN
Glossed as :
  stem : authors/writers

SOLUTION #1
Lemma : kitAb
Vocalized as : كِتاب
Morphology :
  prefix : Pref-0
  stem : Ndu
  suffix : Suff-0
Grammatical category :
  stem : كِتاب NOUN
Glossed as :
  stem : book

SOLUTION #2
Lemma : kut~Ab
Vocalized as : كُتّاب
Morphology :
  prefix : Pref-0
  stem : N
  suffix : Suff-0
Grammatical category :
  stem : كُتّاب NOUN
Glossed as :
  stem : kuttab (village school)/Quran school
The way the morphological analyzer works then becomes clearer:
Message | Meaning |
---|---|
Processing token | the word being processed |
Transliteration | the transliteration of the word in Buckwalter's transliteration system; only with the -v parameter and if no output encoding is specified |
Token not yet processed. | indicates that the word hasn't been processed yet and that it isn't in AraMorph's cache; only with the -v parameter |
Token has direct solutions. | indicates that the word can be analyzed as it is written; only with the -v parameter. When it cannot, AraMorph is able to take alternative spellings into consideration, such as a final ـه in place of a ـة or a final ـى in place of a ـي. |
SOLUTION | indicates each solution for the word. The display order is not significant. |
Lemma | indicates the lemma's ID in the stems dictionary. |
Vocalized as : | indicates the vocalization of the solution. |
Morphology : | indicates the morphological category of the prefix, the stem and the suffix of the solution. |
Grammatical category : | indicates the grammatical category of the prefix, the stem and the suffix of the solution. |
Glossed as : | indicates one or more English glosses for the prefix, the stem and the suffix of the solution. |
How does AraMorph manage to propose acceptable solutions?
First, you have to know that AraMorph, like its predecessor in Perl, works with a transliteration of the Arabic word. This transliteration uses Buckwalter's transliteration system. Thus, كتاب is transliterated as ktAb before its morphological analysis.
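The transliteration step can be illustrated with a minimal sketch. It assumes a hypothetical class `BuckwalterSketch` that is not part of AraMorph, and it covers only the four letters of the example word; the full Buckwalter table is much larger.

```java
import java.util.Map;

// Hypothetical helper for illustration only; not AraMorph's own code.
class BuckwalterSketch {
    // Partial Arabic-to-Buckwalter table: just the letters needed for the example.
    private static final Map<Character, Character> TABLE = Map.of(
        '\u0643', 'k',  // kaf
        '\u062A', 't',  // ta
        '\u0627', 'A',  // alif
        '\u0628', 'b'   // ba
    );

    // Replaces each Arabic letter with its Buckwalter counterpart.
    static String transliterate(String arabic) {
        StringBuilder out = new StringBuilder();
        for (char c : arabic.toCharArray()) {
            Character mapped = TABLE.get(c);
            if (mapped == null) {
                throw new IllegalArgumentException(
                    "Letter not covered by this partial table: U+"
                    + Integer.toHexString(c).toUpperCase());
            }
            out.append(mapped);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The example word, written with Unicode escapes: kaf, ta, alif, ba.
        System.out.println(transliterate("\u0643\u062A\u0627\u0628")); // prints "ktAb"
    }
}
```

Because each Arabic letter maps to exactly one ASCII character, the transliterated word has the same length as the original.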
Then, AraMorph uses a brute-force algorithm to decompose the word into every possible sequence of prefix, stem and suffix, where Ø stands for an empty element:
prefix | stem | suffix |
---|---|---|
ktAb | Ø | Ø |
ktA | b | Ø |
ktA | Ø | b |
kt | Ab | Ø |
kt | A | b |
kt | Ø | Ab |
k | tAb | Ø |
k | tA | b |
k | t | Ab |
k | Ø | tAb |
Ø | ktAb | Ø |
Ø | ktA | b |
Ø | kt | Ab |
Ø | k | tAb |
Ø | Ø | ktAb |
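The enumeration in the table above can be reproduced with two nested loops over the cut positions. This is a sketch under assumptions: the class `SegmenterSketch` is hypothetical, an empty string plays the role of Ø, and no upper bound is placed on prefix or suffix length, which the real analyzer may well impose.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the brute-force decomposition; not AraMorph's own code.
class SegmenterSketch {
    // Returns every {prefix, stem, suffix} split of the word, empty parts included.
    static List<String[]> segment(String word) {
        List<String[]> result = new ArrayList<>();
        int n = word.length();
        for (int i = n; i >= 0; i--) {       // prefix = word[0, i), longest first
            for (int j = n; j >= i; j--) {   // stem = word[i, j), longest first
                result.add(new String[] {
                    word.substring(0, i),    // prefix
                    word.substring(i, j),    // stem
                    word.substring(j)        // suffix
                });
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (String[] s : segment("ktAb")) {
            System.out.println("[" + s[0] + "] [" + s[1] + "] [" + s[2] + "]");
        }
    }
}
```

For a word of n letters the loops yield (n + 1)(n + 2) / 2 decompositions, that is 15 for ktAb, in the same order as the rows of the table.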
Then, AraMorph looks up each element in one of three dictionaries:
- the prefix, in gpl/pierrick/brihaye/aramorph/dictionaries/dictPrefixes
- the stem, in gpl/pierrick/brihaye/aramorph/dictionaries/dictStems
- the suffix, in gpl/pierrick/brihaye/aramorph/dictionaries/dictSuffixes
If successful, AraMorph grabs the morphological information for each element.
If all three elements are found, AraMorph then checks whether their morphologies are compatible with one another by looking up three tables containing valid combinations:
- between the prefix and the stem, in gpl/pierrick/brihaye/aramorph/dictionaries/tableAB
- between the prefix and the suffix, in gpl/pierrick/brihaye/aramorph/dictionaries/tableAC
- between the stem and the suffix, in gpl/pierrick/brihaye/aramorph/dictionaries/tableBC
A word decomposition is a solution when:
- its prefix, stem and suffix each have a dictionary entry,
- its prefix, stem and suffix are morphologically compatible with one another.
For كتاب, there are three solutions, as the output above shows.
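The lookup and compatibility checks can be sketched as follows. Everything here is an assumption made for illustration: the class `CompatibilitySketch` is hypothetical, the in-memory maps and sets stand in for the dictionary and table files (whose actual format and content are not reproduced), and only the entries needed for ktAb are listed, using the lemmas and morphological categories from the output shown earlier.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of the lookup and compatibility checks; not AraMorph's own code.
class CompatibilitySketch {
    // Illustrative stand-ins for dictPrefixes and dictSuffixes: form -> morphological categories.
    static final Map<String, List<String>> PREFIXES = Map.of("", List.of("Pref-0"));
    static final Map<String, List<String>> SUFFIXES = Map.of("", List.of("Suff-0"));
    // Illustrative stand-in for dictStems: form -> entries of {lemma, morphological category}.
    static final Map<String, List<String[]>> STEMS = Map.of("ktAb", List.of(
        new String[] {"kitAb", "Ndu"},
        new String[] {"kAtib", "N"},
        new String[] {"kut~Ab", "N"}));
    // Illustrative stand-ins for tableAB, tableAC and tableBC: valid category pairs.
    static final Set<String> TABLE_AB = Set.of("Pref-0 N", "Pref-0 Ndu");
    static final Set<String> TABLE_AC = Set.of("Pref-0 Suff-0");
    static final Set<String> TABLE_BC = Set.of("N Suff-0", "Ndu Suff-0");

    // Returns one line per solution for a single prefix/stem/suffix decomposition.
    static List<String> solve(String prefix, String stem, String suffix) {
        List<String> solutions = new ArrayList<>();
        // A missing dictionary entry yields an empty list, so the loops simply do not run.
        for (String a : PREFIXES.getOrDefault(prefix, List.of())) {
            for (String[] b : STEMS.getOrDefault(stem, List.of())) {
                for (String c : SUFFIXES.getOrDefault(suffix, List.of())) {
                    boolean compatible = TABLE_AB.contains(a + " " + b[1])
                        && TABLE_AC.contains(a + " " + c)
                        && TABLE_BC.contains(b[1] + " " + c);
                    if (compatible) {
                        solutions.add(b[0] + " [" + a + " + " + b[1] + " + " + c + "]");
                    }
                }
            }
        }
        return solutions;
    }

    public static void main(String[] args) {
        solve("", "ktAb", "").forEach(System.out::println);
    }
}
```

With this sample data, only the decomposition Ø + ktAb + Ø produces solutions, one per stem entry; a decomposition such as k + tAb + Ø is rejected at the dictionary lookup because the sample data has no entry for its prefix or stem.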
If an interpretation problem occurs, which is rare, messages are displayed on the console.