Technical principles of the morphological analysis

The morphological analyzer

Let's create a src/java/gpl/pierrick/brihaye/aramorph/test/testdocs/ktAb.txt file in which we will type a single word, كتاب.

Let's then run the following command:

java -cp build/ArabicAnalyzer.jar;lib/commons-collections.jar ¶
gpl.pierrick.brihaye.aramorph.AraMorph ¶
src/java/gpl/pierrick/brihaye/aramorph/test/testdocs/ktAb.txt ¶
CP1256 results.txt UTF-8 -v

Warning

The input encoding given on the command line (CP1256 here) must match the encoding in which your text editor saved the file.

Warning

The dictionaries have a fair memory footprint. You may have to increase the memory allocated to Java by using options such as -Xms128M -Xmx192M.

Let's have a look at the output file, results.txt, which is encoded in UTF-8:

Processing token : 	كتاب
Transliteration : 	ktAb
Token not yet processed.
Token has direct solutions.

SOLUTION #3
Lemma  : 	kAtib
Vocalized as : 	كُتّاب
Morphology : 
	prefix : Pref-0
	stem : N
	suffix : Suff-0
Grammatical category : 
	stem : كُتّاب	NOUN
Glossed as : 
	stem : authors/writers


SOLUTION #1
Lemma  : 	kitAb
Vocalized as : 	كِتاب
Morphology : 
	prefix : Pref-0
	stem : Ndu
	suffix : Suff-0
Grammatical category : 
	stem : كِتاب	NOUN
Glossed as : 
	stem : book


SOLUTION #2
Lemma  : 	kut~Ab
Vocalized as : 	كُتّاب
Morphology : 
	prefix : Pref-0
	stem : N
	suffix : Suff-0
Grammatical category : 
	stem : كُتّاب	NOUN
Glossed as : 
	stem : kuttab (village school)/Quran school

This output makes the way the morphological analyzer works clearer:

Message	Meaning
Processing token	the word being processed
Transliteration	the transliteration of the word in Buckwalter's transliteration system; only with the -v parameter and if no output encoding is specified
Token not yet processed.	indicates that the word has not been processed yet and is not in AraMorph's cache; only with the -v parameter
Token has direct solutions.	indicates that the word can be analyzed as it is written; only with the -v parameter. AraMorph can also take alternative spellings into account, such as a final ـه in place of a ـة or a final ـى in place of a ـي.
SOLUTION	indicates each solution for the word. The display order is not significant.
Lemma	indicates the lemma's ID in the stems dictionary.
Vocalized as :	indicates the vocalization of the solution.
Morphology :	indicates the morphological category of the prefix, the stem and the suffix of the solution.
Grammatical category :	indicates the grammatical category of the prefix, the stem and the suffix of the solution.
Glossed as :	indicates one or more English glosses for the prefix, the stem and the suffix of the solution.
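
To illustrate the alternative spellings mentioned for "Token has direct solutions.", here is a minimal sketch that builds the candidate spellings of a transliterated word. The class and method names are hypothetical, and the two rules are limited to the examples given in the table (in Buckwalter's system, ه is h, ة is p, ى is Y and ي is y). AraMorph's actual rules may well differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not AraMorph's own code.
class AlternativeSpellingSketch {

    // Returns the word as written, followed by at most one alternative spelling.
    static List<String> candidates(String translit) {
        List<String> result = new ArrayList<>();
        result.add(translit); // the "direct" form, analyzed as it is written
        if (translit.isEmpty()) {
            return result;
        }
        String head = translit.substring(0, translit.length() - 1);
        char last = translit.charAt(translit.length() - 1);
        if (last == 'h') {
            result.add(head + "p"); // final ha possibly written for ta marbuta
        } else if (last == 'Y') {
            result.add(head + "y"); // final alif maqsura possibly written for ya
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(candidates("mdrsh"));
    }
}
```

A word such as ktAb, which ends in neither letter, yields only its direct form.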

Note

Explanations of the morphological categories are available in this section; those of the grammatical categories are available in this section.

How does AraMorph manage to propose acceptable solutions?

First, you have to know that AraMorph, like its Perl predecessor, works on a transliteration of the Arabic word. This transliteration uses Buckwalter's transliteration system. Thus, كتاب is transliterated as ktAb before its morphological analysis.
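
The transliteration step can be pictured as a character-by-character table lookup. The sketch below is illustrative only: the class name is hypothetical, and the table covers just the handful of letters that appear on this page, not the full Buckwalter system.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch with a partial table, not AraMorph's own code.
class BuckwalterSketch {

    private static final Map<Character, Character> TABLE = new HashMap<>();
    static {
        TABLE.put('\u0627', 'A'); // alif
        TABLE.put('\u0628', 'b'); // ba
        TABLE.put('\u0629', 'p'); // ta marbuta
        TABLE.put('\u062A', 't'); // ta
        TABLE.put('\u0643', 'k'); // kaf
        TABLE.put('\u0647', 'h'); // ha
        TABLE.put('\u0649', 'Y'); // alif maqsura
        TABLE.put('\u064A', 'y'); // ya
    }

    // Replaces each known Arabic letter; unknown characters pass through unchanged.
    static String transliterate(String arabic) {
        StringBuilder sb = new StringBuilder();
        for (char c : arabic.toCharArray()) {
            Character mapped = TABLE.get(c);
            sb.append(mapped != null ? mapped.charValue() : c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The four letters kaf, ta, alif, ba, written as Unicode escapes
        System.out.println(transliterate("\u0643\u062A\u0627\u0628")); // prints ktAb
    }
}
```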

Fixme (PB)

This operation should not be necessary since Java works natively with Unicode. A code optimization that would bypass the transliteration step, and thus improve performance, remains to be done.

Then, AraMorph uses a brute-force algorithm to break the word down into every possible sequence of prefix, stem and suffix:

prefix	stem	suffix
ktAb	Ø	Ø
ktA	b	Ø
ktA	Ø	b
kt	Ab	Ø
kt	A	b
kt	Ø	Ab
k	tAb	Ø
k	tA	b
k	t	Ab
k	Ø	tAb
Ø	ktAb	Ø
Ø	ktA	b
Ø	kt	Ab
Ø	k	tAb
Ø	Ø	ktAb
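
The table above can be reproduced with two nested loops over the cut positions. This is a minimal sketch under the assumption that any of the three parts may be empty, as in the table. The class and method names are hypothetical and the code is not taken from AraMorph.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not AraMorph's own code.
class SegmentationSketch {

    // Returns every {prefix, stem, suffix} split of the word, in the order of the table:
    // longest prefix first, then longest stem first.
    static List<String[]> segment(String word) {
        List<String[]> result = new ArrayList<>();
        int n = word.length();
        for (int p = n; p >= 0; p--) {        // p = end of the prefix
            for (int s = n; s >= p; s--) {    // s = end of the stem
                result.add(new String[] {
                    word.substring(0, p), word.substring(p, s), word.substring(s)
                });
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> rows = segment("ktAb");
        for (String[] row : rows) {
            System.out.println("[" + row[0] + "] [" + row[1] + "] [" + row[2] + "]");
        }
        System.out.println(rows.size() + " decompositions"); // 15 for a four-letter word
    }
}
```

For a word of n letters, this produces (n + 1)(n + 2) / 2 decompositions, which is 15 for ktAb.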

Then, AraMorph looks up each element in its own dictionary:

the prefix, in gpl/pierrick/brihaye/aramorph/dictionaries/dictPrefixes
the stem, in gpl/pierrick/brihaye/aramorph/dictionaries/dictStems
the suffix, in gpl/pierrick/brihaye/aramorph/dictionaries/dictSuffixes

If all three lookups succeed, AraMorph retrieves the morphological information for each element.

Warning

The empty (Ø) prefix and the empty suffix are morphologically significant: they appear as Pref-0 and Suff-0 in the output above.

If applicable, AraMorph then checks whether the morphologies of the elements are compatible with one another by looking up three tables of valid combinations:

between the prefix and the stem, in gpl/pierrick/brihaye/aramorph/dictionaries/tableAB
between the prefix and the suffix, in gpl/pierrick/brihaye/aramorph/dictionaries/tableAC
between the stem and the suffix, in gpl/pierrick/brihaye/aramorph/dictionaries/tableBC

A word decomposition in which:

the prefix, the stem and the suffix each have a dictionary entry,
the prefix, the stem and the suffix are morphologically compatible with one another,

... is a solution. For كتاب, there are three, as we can see above.
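
The two conditions can be sketched end to end with toy data. Everything below is an assumption made for illustration: the class name, the in-memory maps standing in for dictPrefixes, dictStems and dictSuffixes, and the sets standing in for tableAB, tableAC and tableBC. The entries are inferred from the sample output above and are not read from the real files, whose format is not described here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch with toy data, not AraMorph's own code.
class SolutionSketch {

    // Toy dictionaries: surface form -> morphological categories of its entries
    static final Map<String, List<String>> PREFIXES = new HashMap<>();
    static final Map<String, List<String>> STEMS = new HashMap<>();
    static final Map<String, List<String>> SUFFIXES = new HashMap<>();
    static {
        PREFIXES.put("", Arrays.asList("Pref-0"));          // the empty prefix has an entry
        STEMS.put("ktAb", Arrays.asList("Ndu", "N", "N"));  // kitAb, kAtib, kut~Ab
        SUFFIXES.put("", Arrays.asList("Suff-0"));          // the empty suffix has an entry
    }

    // Toy compatibility tables: each string is one valid pair of categories
    static final Set<String> TABLE_AB = new HashSet<>(Arrays.asList("Pref-0 N", "Pref-0 Ndu"));
    static final Set<String> TABLE_AC = new HashSet<>(Arrays.asList("Pref-0 Suff-0"));
    static final Set<String> TABLE_BC = new HashSet<>(Arrays.asList("N Suff-0", "Ndu Suff-0"));

    static List<String> analyze(String word) {
        List<String> solutions = new ArrayList<>();
        int n = word.length();
        for (int p = 0; p <= n; p++) {
            for (int s = p; s <= n; s++) {
                // Condition 1: every element must have a dictionary entry
                List<String> as = PREFIXES.get(word.substring(0, p));
                List<String> bs = STEMS.get(word.substring(p, s));
                List<String> cs = SUFFIXES.get(word.substring(s));
                if (as == null || bs == null || cs == null) {
                    continue;
                }
                // Condition 2: the categories must be compatible pairwise
                for (String a : as) {
                    for (String b : bs) {
                        for (String c : cs) {
                            if (TABLE_AB.contains(a + " " + b)
                                    && TABLE_AC.contains(a + " " + c)
                                    && TABLE_BC.contains(b + " " + c)) {
                                solutions.add(a + " + " + b + " + " + c);
                            }
                        }
                    }
                }
            }
        }
        return solutions;
    }

    public static void main(String[] args) {
        for (String solution : analyze("ktAb")) {
            System.out.println(solution);
        }
    }
}
```

With this toy data, only the decomposition Ø + ktAb + Ø passes the dictionary check, and each of its three stem entries passes the compatibility check, which gives three solutions.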

Warning

Some of the information in the stems dictionary actually belongs to prefixes or suffixes. When AraMorph returns a solution, it tries to move this information to the prefixes or the suffixes. A single word can therefore have several prefixes and/or suffixes.
If an interpretation problem occurs, which is rare, messages are displayed on the console.


Copyright © 2003-2005 Pierrick Brihaye All rights reserved.