Using with Lucene
- The Arabic analyzer for Lucene
- The English glosser for Lucene
- What are the solutions retained by the analyzers and the glossers?
The Arabic analyzer for Lucene
Let's execute the following command:
java -cp build/ArabicAnalyzer.jar;lib/commons-collections.jar;lib/lucene-1.4.3.jar gpl.pierrick.brihaye.aramorph.test.TestArabicAnalyzer src/java/gpl/pierrick/brihaye/aramorph/test/testdocs/ktAb.txt CP1256 results.txt UTF-8
Let's have a look at the results.txt output file:
كِتاب NOUN [0-4] 1
كُتّاب NOUN [0-4] 0
The principle is thus the following: every matching stem returns a Lucene token. We can see each token's text (termText), its grammatical category (the token type), its position in the input stream (startOffset and endOffset) and its position relative to the previous token (positionIncrement).
It should indeed be pointed out that the same Arabic word, because it is generally reduced to its consonantal skeleton, may often return several solutions.
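The role of positionIncrement can be sketched with a small, self-contained example (this is illustrative code, not part of AraMorph): a token with an increment of 0 is "stacked" at the same position as the previous token, which is how the analyzer exposes several solutions for a single Arabic word.

```java
public class PositionDemo {

    // Turns a sequence of positionIncrement values into absolute positions,
    // the way Lucene interprets them when indexing.
    public static int[] absolutePositions(int[] increments) {
        int[] positions = new int[increments.length];
        int current = -1; // Lucene starts before the first position
        for (int i = 0; i < increments.length; i++) {
            current += increments[i];
            positions[i] = current;
        }
        return positions;
    }

    public static void main(String[] args) {
        // The two solutions for كتاب above: increments 1 then 0.
        int[] positions = absolutePositions(new int[] {1, 0});
        // Both tokens occupy position 0, so a query matching either
        // solution matches the original word.
        System.out.println(positions[0] + " " + positions[1]); // 0 0
    }
}
```

Stacking alternative analyses at one position is the standard Lucene idiom for synonyms, and it is what makes the multiple solutions transparent at query time.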
Let's try another example with a tktbn.txt file containing the word تكتبن:
java -cp build/ArabicAnalyzer.jar;lib/commons-collections.jar;lib/lucene-1.4.3.jar gpl.pierrick.brihaye.aramorph.AraMorph src/java/gpl/pierrick/brihaye/aramorph/test/testdocs/tktbn.txt CP1256 results.txt UTF-8
Let's have a look at the results.txt output file:
Processing token : تكتبن

SOLUTION #3
Lemma : >akotab
Vocalized as : tukotibna
Morphology :
  prefix : IVPref-Antn-tu
  stem : IV_yu
  suffix : IVSuff-n
Grammatical category :
  prefix : tu IV2FP
  stem : kotib VERB_IMPERFECT
  suffix : na IVSUFF_SUBJ:FP
Glossed as :
  prefix : you [fem.pl.]
  stem : dictate/make write
  suffix : [fem.pl.]

SOLUTION #4
Lemma : >akotab
Vocalized as : tukotabna
Morphology :
  prefix : IVPref-Antn-tu
  stem : IV_Pass_yu
  suffix : IVSuff-n
Grammatical category :
  prefix : tu IV2FP
  stem : kotab VERB_IMPERFECT
  suffix : na IVSUFF_SUBJ:FP
Glossed as :
  prefix : you [fem.pl.]
  stem : be dictated
  suffix : [fem.pl.]

SOLUTION #2
Lemma : katab
Vocalized as : tukotabna
Morphology :
  prefix : IVPref-Antn-tu
  stem : IV_Pass_yu
  suffix : IVSuff-n
Grammatical category :
  prefix : tu IV2FP
  stem : kotab VERB_IMPERFECT
  suffix : na IVSUFF_SUBJ:FP
Glossed as :
  prefix : you [fem.pl.]
  stem : be written/be fated/be destined
  suffix : [fem.pl.]

SOLUTION #1
Lemma : katab
Vocalized as : takotubna
Morphology :
  prefix : IVPref-Antn-ta
  stem : IV
  suffix : IVSuff-n
Grammatical category :
  prefix : ta IV2FP
  stem : kotub VERB_IMPERFECT
  suffix : na IVSUFF_SUBJ:FP
Glossed as :
  prefix : you [fem.pl.]
  stem : write
  suffix : [fem.pl.]
Here, the decomposition into prefix(es), stem and suffix(es) becomes obvious.
To check that the analysis only deals with the stem, let's execute the following command:
java -cp build/ArabicAnalyzer.jar;lib/commons-collections.jar;lib/lucene-1.4.3.jar gpl.pierrick.brihaye.aramorph.test.TestArabicAnalyzer src/java/gpl/pierrick/brihaye/aramorph/test/testdocs/tktbn.txt results.txt
Let's have a look at the results.txt output file:
kotub VERB_IMPERFECT [0-5] 1
kotab VERB_IMPERFECT [0-5] 0
kotib VERB_IMPERFECT [0-5] 0
We can indeed see that we get the stems of the different forms of the كتب root when it is used as an imperfective verb. With this system, the analysis of a second-person feminine plural imperfective verb is thus the same as for any other imperfective form, whatever the person.
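The results.txt lines above follow a simple fixed layout. As a small illustrative sketch (the class below is hypothetical, not part of AraMorph), each line can be split into its four fields:

```java
public class ResultLineParser {

    // Parses one output line of the form:
    //   termText TYPE [startOffset-endOffset] positionIncrement
    public static String[] parse(String line) {
        String[] parts = line.trim().split("\\s+");
        String term = parts[0];
        String type = parts[1];
        // Strip the surrounding brackets from "[0-5]".
        String offsets = parts[2].substring(1, parts[2].length() - 1);
        String increment = parts[3];
        return new String[] { term, type, offsets, increment };
    }

    public static void main(String[] args) {
        String[] fields = parse("kotub VERB_IMPERFECT [0-5] 1");
        System.out.println(fields[0] + " / " + fields[1] + " / "
                + fields[2] + " / " + fields[3]);
        // kotub / VERB_IMPERFECT / 0-5 / 1
    }
}
```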
The English glosser for Lucene
Let's execute the following command:
java -cp build/ArabicAnalyzer.jar;lib/commons-collections.jar;lib/lucene-1.4.3.jar gpl.pierrick.brihaye.aramorph.test.TestArabicGlosser src/java/gpl/pierrick/brihaye/aramorph/test/testdocs/ktAb.txt CP1256 results.txt UTF-8
Let's have a look at the results.txt output file:
kuttab NOUN [0-4] 1
village NOUN [0-4] 0
school NOUN [0-4] 0
quran NOUN [0-4] 0
school NOUN [0-4] 0
authors NOUN [0-4] 0
writers NOUN [0-4] 0
book NOUN [0-4] 0
Here, the principle is strictly the same, except that the glosses are themselves tokenized by the WhitespaceFilter before being sent to Lucene's standard processing queue (StandardFilter, LowerCaseFilter and StopFilter).
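The effect of that post-processing can be sketched as follows. This is a simplified, self-contained illustration, not AraMorph's actual filter chain: the class name and the tiny stop list are assumptions for the example, and the real StopFilter uses Lucene's English stop-word list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class GlossSplitter {

    // Abridged stand-in for Lucene's English stop-word list.
    private static final List<String> STOP_WORDS =
            Arrays.asList("be", "a", "the", "of");

    // Splits a multi-word gloss on whitespace, lowercases each word
    // and drops stop words, mimicking WhitespaceFilter + LowerCaseFilter
    // + StopFilter.
    public static List<String> split(String gloss) {
        List<String> tokens = new ArrayList<>();
        for (String word : gloss.split("\\s+")) {
            String lower = word.toLowerCase(Locale.ENGLISH);
            if (!lower.isEmpty() && !STOP_WORDS.contains(lower)) {
                tokens.add(lower);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The gloss "village school" from the example yields two tokens,
        // which explains why they appear as separate lines in results.txt.
        System.out.println(split("village school")); // [village, school]
    }
}
```

All tokens produced from one gloss are indexed with a positionIncrement of 0, i.e. at the same position as the original Arabic word.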
What are the solutions retained by the analyzers and the glossers?
As we have just seen, the analyzers return tokens whose type is the stem's grammatical category.
However, an analyzer considers that some types should be treated like stop words and will not return any token when it encounters them; it filters them out. The list of the grammatical categories considered as non-significant is as follows:
- DEM_PRON_F
- DEM_PRON_FS
- DEM_PRON_FD
- DEM_PRON_MD
- DEM_PRON_MP
- DEM_PRON_MS
- DET
- INTERROG
- NO_STEM
- NUMERIC_COMMA
- PART
- PRON_1P
- PRON_1S
- PRON_2D
- PRON_2FP
- PRON_2FS
- PRON_2MP
- PRON_2MS
- PRON_3D
- PRON_3FP
- PRON_3FS
- PRON_3MP
- PRON_3MS
- REL_PRON
Conversely, this is the list of the grammatical categories that should be regarded as significant:
- ABBREV
- ADJ
- ADV
- NOUN
- NOUN_PROP
- VERB_IMPERATIVE
- VERB_IMPERFECT
- VERB_PERFECT
- NO_RESULT
Warning: this result is kept because experiments tend to show that there is a significant chance that such a word is a foreign word missing from the dictionary. It is of course possible to write a specific Lucene filter to refine the analysis of this type of token.
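The filtering described above can be sketched as a simple membership test. The class below is illustrative only (it is not AraMorph's actual filter), and its stop-category set is abridged to a few entries from the non-significant list:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class CategoryFilterDemo {

    // A few entries from the non-significant list above; the real list
    // is longer (all DEM_PRON_*, PRON_*, DET, INTERROG, NO_STEM,
    // NUMERIC_COMMA, PART, REL_PRON).
    private static final Set<String> STOP_CATEGORIES = new HashSet<>(
            Arrays.asList("DET", "PART", "REL_PRON", "PRON_3MS"));

    // A token is emitted only if its grammatical category is significant.
    public static boolean isSignificant(String category) {
        return !STOP_CATEGORIES.contains(category);
    }

    public static void main(String[] args) {
        System.out.println(isSignificant("NOUN"));      // true
        System.out.println(isSignificant("DET"));       // false
        // NO_RESULT is deliberately kept: it may be a foreign word
        // missing from the dictionary.
        System.out.println(isSignificant("NO_RESULT")); // true
    }
}
```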