Installation and tests
- Building from source code
- Testing the morphological analyzer
- Testing the Arabic analyzer for Lucene
- Testing the English analyzer for Lucene
Building from source code
The source code includes the necessary Ant libraries together with a build.xml file.
Warning
For now, the code makes heavy use of regular expressions, so AraMorph must be built and run on JDK 1.4 or above. The code changes that would allow an external regular expression library to be used, and therefore an older JDK, have not been made yet.
To build, invoke build.bat followed by the name of a target.
Fixme (???)
Many thanks to anyone who can provide a functional Unix script for the build.
The available targets are:
Target | Action |
---|---|
compile | Compiles the source code. By default, the result is generated into ${dist}/src. |
jar | Builds the ArabicAnalyzer.jar file. By default, the result is generated into ${dist}. Depending on the value of the with.sources property (true by default), the source files are included in the file. |
zip | Builds an ArabicAnalyzer-src.zip file containing the source files. By default, the result is generated into ${dist}. |
dist | Builds an ArabicAnalyzer-dist.zip file containing all the distribution files. By default, the result is generated into ${dist}. |
javadoc | Builds the javadocs of the source files. By default, the result is generated into ${dist}/javadoc. |
site | Builds the HTML documentation using Apache Forrest. By default, the result is generated into ${dist}/html. You may define Forrest's path through the FORREST_HOME environment variable or through the forrest.home property in a ./forrest.properties file. Warning: this target is designed for Forrest version 0.5.1. Many thanks to anyone who can provide a build.xml file for newer versions. |
clean | Deletes the build directory (${dist} by default). |
help | Displays some help about the available targets (default target). |
Note
By default, ${dist} is set to ./build.
Testing the morphological analyzer
The morphological analyzer is invoked as follows:
java -cp build/ArabicAnalyzer.jar;lib/commons-collections.jar gpl.pierrick.brihaye.aramorph.AraMorph
Warning
The classpath must obviously point to the actual location of the ArabicAnalyzer.jar and commons-collections.jar files.
When run without arguments, the program displays the following help message:
    Arabic Morphological Analyzer for Java(tm)
    Ported to Java(tm) by Pierrick Brihaye, 2003-2004.
    Based on :
    BUCKWALTER ARABIC MORPHOLOGICAL ANALYZER
    Portions (c) 2002 QAMUS LLC (www.qamus.org),
    (c) 2002 Trustees of the University of Pennsylvania.
    This program is governed by :
    The GNU General Public License

    Usage :

    araMorph inFile [inEncoding] [outFile] [outEncoding] [-v]

    inFile : file to be analyzed
    inEncoding : encoding for inFile, default CP1256
    outFile : result file, default console
    outEncoding : encoding for outFile, if not specified use Buckwalter transliteration with system's file.encoding
    -v : verbose mode
The parameters are straightforward:
Parameter | Usage |
---|---|
inFile | The path of a text file to be analyzed (mandatory) |
inEncoding | This text file's encoding (CP1256 by default) |
outFile | The path of the file where the results of the morphological analysis should be output (the console by default) |
outEncoding | The encoding of the results file (by default, the JVM's file.encoding system property, using Buckwalter's transliteration) |
-v | A flag to be set for more verbosity |
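When no outEncoding is given, the results are written in Buckwalter's transliteration, which represents each Arabic character with one ASCII character. The following is a minimal sketch of that idea, based on the commonly published Buckwalter table for the basic letters and diacritics; it is not taken from AraMorph's source. The class name BuckwalterSketch and the method transliterate are hypothetical, and the sketch assumes a more recent JDK than 1.4.

```java
class BuckwalterSketch {
    // Buckwalter symbols for U+0621..U+063A (hamza through ghain), in code point order.
    private static final String BLOCK_1 = "'|>&<}AbptvjHxd*rzs$SDTZEg";
    // Buckwalter symbols for U+0640..U+0652 (tatweel through sukun), in code point order.
    private static final String BLOCK_2 = "_fqklmnhwYyFNKaui~o";

    /** Replaces each covered Arabic character with its Buckwalter symbol; other characters pass through. */
    static String transliterate(String arabic) {
        StringBuilder out = new StringBuilder(arabic.length());
        for (int i = 0; i < arabic.length(); i++) {
            char c = arabic.charAt(i);
            if (c >= 0x0621 && c <= 0x063A) {
                out.append(BLOCK_1.charAt(c - 0x0621));
            } else if (c >= 0x0640 && c <= 0x0652) {
                out.append(BLOCK_2.charAt(c - 0x0640));
            } else {
                out.append(c); // spaces, digits, punctuation, uncovered characters
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The escapes spell the four letters kaf, teh, alef, beh (the word "kitab", book).
        System.out.println(transliterate("\u0643\u062A\u0627\u0628")); // prints ktAb
    }
}
```

Because the output is plain ASCII, it can be displayed on a console whatever the system's file.encoding is.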
Here are a few usage examples:
java -cp build/ArabicAnalyzer.jar;lib/commons-collections.jar gpl.pierrick.brihaye.aramorph.AraMorph src/java/gpl/pierrick/brihaye/aramorph/test/testdocs/cp1256.txt

java -cp build/ArabicAnalyzer.jar;lib/commons-collections.jar gpl.pierrick.brihaye.aramorph.AraMorph src/java/gpl/pierrick/brihaye/aramorph/test/testdocs/UTF-8.txt UTF-8 -v

java -cp build/ArabicAnalyzer.jar;lib/commons-collections.jar gpl.pierrick.brihaye.aramorph.AraMorph src/java/gpl/pierrick/brihaye/aramorph/test/testdocs/iso-8859-6.txt iso-8859-6 results.txt CP1256
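The inEncoding and outEncoding parameters name the character sets used to read the input and to write the results. As a rough illustration of what such a pair of parameters implies, the sketch below decodes bytes with one charset and re-encodes them with another, falling back to CP1256 for the input as the help message states. It is not AraMorph's implementation: the class name EncodingSketch and the method transcode are hypothetical, and the sketch assumes a more recent JDK than 1.4.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingSketch {
    // Default input encoding, as stated in the help message.
    static final String DEFAULT_IN_ENCODING = "CP1256";

    /** Decodes input with inEncoding (or the default when null) and re-encodes it with outEncoding. */
    static byte[] transcode(byte[] input, String inEncoding, String outEncoding) {
        Charset in = Charset.forName(inEncoding != null ? inEncoding : DEFAULT_IN_ENCODING);
        Charset out = Charset.forName(outEncoding);
        return new String(input, in).getBytes(out);
    }

    public static void main(String[] args) {
        // Four Arabic letters: kaf, teh, alef, beh.
        byte[] utf8 = "\u0643\u062A\u0627\u0628".getBytes(StandardCharsets.UTF_8);
        // CP1256 is not among the charsets every JVM is required to support, so check first.
        if (Charset.isSupported("CP1256")) {
            byte[] cp1256 = transcode(utf8, "UTF-8", "CP1256");
            System.out.println(utf8.length + " UTF-8 bytes, " + cp1256.length + " CP1256 bytes");
        }
    }
}
```

Passing the wrong inEncoding does not necessarily raise an error; the bytes are simply decoded into the wrong characters, so the encoding given on the command line should match the one the file was actually saved in.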
Testing the Arabic analyzer for Lucene
The parameters are the same, except for -v, which is of no use here. The Lucene jar file must, however, be included in the classpath.
java -cp build/ArabicAnalyzer.jar;lib/commons-collections.jar;lib/lucene-1.4.3.jar gpl.pierrick.brihaye.aramorph.test.TestArabicAnalyzer src/java/gpl/pierrick/brihaye/aramorph/test/testdocs/cp1256.txt results.txt
Warning
The classpath must obviously point to the actual location of the ArabicAnalyzer.jar, commons-collections.jar and lucene*.jar files.
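The commands on this page separate classpath entries with ';', which is the Windows convention; Unix-like systems use ':' instead. The sketch below shows one way to assemble the same command line portably by using the JVM's own path separator. The class name ClasspathSketch and its methods are hypothetical and not part of AraMorph, the jar locations are the ones used in the examples above, and the sketch assumes a more recent JDK than 1.4.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ClasspathSketch {
    /** Joins jar locations with the platform's classpath separator (';' on Windows, ':' on Unix-like systems). */
    static String joinClasspath(String... jars) {
        return String.join(File.pathSeparator, jars);
    }

    /** Builds the argument list that launches mainClass with the given classpath and program arguments. */
    static List<String> buildCommand(String mainClass, String classpath, String... programArgs) {
        List<String> command = new ArrayList<>();
        command.add("java");
        command.add("-cp");
        command.add(classpath);
        command.add(mainClass);
        command.addAll(Arrays.asList(programArgs));
        return command;
    }

    public static void main(String[] args) {
        String classpath = joinClasspath(
                "build/ArabicAnalyzer.jar",
                "lib/commons-collections.jar",
                "lib/lucene-1.4.3.jar");
        List<String> command = buildCommand(
                "gpl.pierrick.brihaye.aramorph.test.TestArabicAnalyzer",
                classpath,
                "src/java/gpl/pierrick/brihaye/aramorph/test/testdocs/cp1256.txt",
                "results.txt");
        // Only prints the command; it could be run with new ProcessBuilder(command).inheritIO().start().
        System.out.println(String.join(" ", command));
    }
}
```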
Testing the English analyzer for Lucene
Note
Yes, it is possible to analyze an Arabic text and return English tokens!
Don't expect too much, however: AraMorph isn't designed to be an automatic translation tool.
The parameters are the same, except for -v, which is of no use here. The Lucene jar file must, however, be included in the classpath.
java -cp build/ArabicAnalyzer.jar;lib/commons-collections.jar;lib/lucene-1.4.3.jar gpl.pierrick.brihaye.aramorph.test.TestArabicGlosser src/java/gpl/pierrick/brihaye/aramorph/test/testdocs/cp1256.txt results.txt
Warning
The classpath must obviously point to the actual location of the ArabicAnalyzer.jar, commons-collections.jar and lucene*.jar files.