Open Source Java Suggester.
1. What is the Suggester software?
The Suggester is an Open Source Java program that provides suggestions for
unknown (misspelt) words based on a custom dictionary.
In its basic implementation the Suggester can serve as a spellchecker. In this
case all words have the same weight.
It includes a high-speed suggestion engine based on a fast edit-distance
calculation algorithm, enhanced with Lawrence Philips' Metaphone algorithm
and a private fuzzy-matching algorithm.
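To illustrate what "edit distance" means here, the following is a minimal sketch of the classic Levenshtein distance. It is a textbook version written for this page, not the Suggester's own optimized implementation, and the class and method names are hypothetical.

```java
/** Illustrative sketch only: textbook Levenshtein distance, not the Suggester's code. */
class EditDistanceSketch {

    /** Minimum number of single-character insertions, deletions or substitutions turning a into b. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j characters of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i characters of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows instead of allocating a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("mispelt", "misspelt")); // prints 1
        System.out.println(distance("kitten", "sitting"));   // prints 3
    }
}
```

A word at distance 1 from the input, such as "misspelt" for "mispelt", is a natural first candidate for a suggestion.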
2. How can I use the Suggester software?
- Spellchecker.
- Search engine suggestions, based on your custom word list.
- Human resources departments: create an index of your employee names to suggest the proper spelling.
- Medical field: for example, drug name suggestions.
- Misspelt word suggestions in any other field that requires custom dictionaries.
3. Advantages
High dictionary compression:
The word dictionary is compressed not only on the hard drive, but also in
virtual memory.
The Basic UK English dictionary contains about 57,000 words and has a size
of about 90 KB.
The Full English dictionary contains about 200,000 words (including names,
abbreviations, geographic places, etc.). It takes a 236 KB file on the
hard drive and about 2 MB of space in memory.
Other languages are compressed even better.
For example, the full Russian dictionary contains more than 1,300,000 words
(including variants). It takes a 315 KB file on the hard drive and again
about 2 MB of space in memory.
Compared with the original word list file in UTF-8 format, which is more than
30 MB, the compressed size is close to 1% of the original size.
High dictionary search and suggestion selection speed:
A case-dependent / case-independent dictionary look-up takes about 0.002 /
0.005 ms per word, which corresponds to a speed of about 500,000 / 200,000
words per second. The suggestion search averages about 40 ms per set of
suggestions for each unknown word on a Pentium M 1.4 GHz (with high quality
of suggestions).
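The throughput figures follow directly from the per-word timings. The short sketch below only reproduces that arithmetic; the class and method names are hypothetical and nothing here measures the Suggester itself.

```java
/** Illustrative arithmetic only: converts the quoted latencies into throughput. */
class ThroughputSketch {

    /** Converts a latency in milliseconds per item into items per second. */
    static long perSecond(double millisPerItem) {
        return Math.round(1000.0 / millisPerItem);
    }

    public static void main(String[] args) {
        System.out.println(perSecond(0.002)); // case-dependent look-up: prints 500000
        System.out.println(perSecond(0.005)); // case-independent look-up: prints 200000
        System.out.println(perSecond(40));    // suggestion sets: prints 25
    }
}
```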
Portability:
The Suggester software is written entirely in Java 1.2.
It runs on any Java platform: Windows, Mac OS, Unix, Linux. Tested on JRE 1.2,
1.3, 1.4, 1.5.
Dictionary retains original word list:
The dictionary internal structure supports UTF-8 encoding and keeps all original words in a case sensitive format.
Did we mention that the Basic Suggester is free? Yes it is.
4. Where to get it?
The home page for the Suggester project can be found on the SoftCorporation LLC
web site: http://www.softcorporation.com/products/suggester.
There you can also find out how to download the latest release, as
well as all other information you might need regarding this project.
Click here to Download Free Basic Suggester.
5. Requirements
o A Java 1.2 or later compatible virtual machine for your operating system.
o To run the Index Builder you may need up to 512 MB (or more) of virtual memory.
6. Basic, Advanced and Enterprise versions of Suggester software
There are 3 different versions of Suggester software:
o Basic Suggester - (free) uses one dictionary, where all words have the same weight.
The Suggester Spell Check uses Basic Suggester.
o Advanced Suggester - (commercial, currently not for sale) can use
multiple dictionaries with different weights assigned to each
dictionary. Also supports multiple languages.
o Enterprise Suggester - (not ready for distribution) uses all features
from Advanced Suggester plus provides content dependent suggestions.
7. What is the Index Builder?
The Index Builder creates a compressed index from your word list.
In the past the Index Builder was excluded from the Basic Suggester
package. Not any more! You can build your own index from your word or
phrase list.
Note that the Index Builder uses significantly more memory than
the classes providing suggestions.
However, this is not a significant limitation considering the amount of RAM
computers have these days.
For example, to compile a Polish dictionary containing more than 3
million words, the Index Builder uses about 300 MB of memory.
If the word list is sorted, this requirement goes down significantly.
The index compilation itself is fast. For example, on
a laptop (Pentium 1.5 GHz) compiling the Polish dictionary takes less
than 5 seconds.
However, reading the word file, converting it to UTF-8 encoding,
and sorting all words takes more than 20 seconds.
[Figure: Polish dictionary compilation]
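Since a sorted word list lowers the memory requirement, one option is to prepare the list before running the Index Builder. The sketch below reads a word file in a given encoding, trims and de-duplicates the entries, sorts them, and writes the result as UTF-8. It is a stand-alone helper written for illustration: the class name is hypothetical, it does not call the Index Builder, and the plain case-sensitive `String` ordering it uses is an assumption about what counts as "sorted".

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Illustrative sketch only: pre-sorts a word list; not part of the Suggester package. */
class WordListPrepSketch {

    /** Trims each line, drops empty lines and duplicates, and returns the words in sorted order. */
    static List<String> sortedUnique(List<String> lines) {
        TreeSet<String> words = new TreeSet<>(); // natural, case-sensitive ordering (assumption)
        for (String line : lines) {
            String word = line.trim();
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return new ArrayList<>(words);
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 3) {
            System.err.println("usage: WordListPrepSketch <input> <input-charset> <output>");
            return;
        }
        Charset inputCharset = Charset.forName(args[1]);
        List<String> words = sortedUnique(Files.readAllLines(Paths.get(args[0]), inputCharset));
        Files.write(Paths.get(args[2]), words, StandardCharsets.UTF_8); // one word per line
    }
}
```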
8. What are the Suggester Configuration files?
The Suggester can be configured to fit your requirements.
a) BasicSuggester Configuration file:
By default the file is located at the classpath: com/softcorporation/suggester/basicSuggester.config
Parameters:
LENGTH_MIN_ED_1 - minimum word length to apply edit distance = 1.
LENGTH_MIN_ED_2 - minimum word length to apply edit distance = 2.
LENGTH_MIN_ED_3 - minimum word length to apply edit distance = 3.
LENGTH_MIN_ED_4 - minimum word length to apply edit distance = 4.
WEIGHT_EDIT_DISTANCE - edit distance weight for results sorting.
WEIGHT_SOUNDEX - soundex or metaphone weight for results sorting.
WEIGHT_LENGTH - weight for results sorting when the word length is different.
WEIGHT_LAST_CHAR - weight for results sorting when the last character is different.
WEIGHT_FIRST_CHAR - weight for results sorting when the first character is different.
WEIGHT_FIRST_CHAR_UPPER - weight for results sorting when the first character is not in upper case.
WEIGHT_FIRST_CHAR_LOWER - weight for results sorting when the first character is not in lower case.
WEIGHT_ADD_REM_CHAR - weight for results sorting when characters are added or removed.
WEIGHT_FUZZY_PHON - fuzzy matching weight for results sorting.
WEIGHT_JOINED_WORD - joined word weight for results sorting.
SEARCH_JOINED - search for joined words.
REMOVE_JOINED_VARIATIONS - remove joined variations.
JOINED_WORD_LENGTH_MIN - minimum joined word length.
JOINED_WORD_LENGTH_EDT - minimum joined word length to consider edit distance = 1.
CLOSE_WORDS_CUT - remove unrelated suggestions.
DELIMITERS - word delimiters.
DELIMITERS_JOINED - joined words delimiters.
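One way to read the LENGTH_MIN_ED_n parameters above is that a word must be at least that long before suggestions at edit distance n are considered. The sketch below illustrates that reading only. The threshold values are invented for the example, the class and method names are hypothetical, and the actual Suggester logic may differ.

```java
/** Illustrative sketch only: one possible reading of the LENGTH_MIN_ED_n parameters. */
class EditDistanceLimitSketch {

    /**
     * thresholds[n - 1] plays the role of LENGTH_MIN_ED_n.
     * Returns the largest edit distance n whose minimum length the word meets, or 0 if none.
     */
    static int maxEditDistance(int wordLength, int[] thresholds) {
        int max = 0;
        for (int n = 1; n <= thresholds.length; n++) {
            if (wordLength >= thresholds[n - 1]) {
                max = n;
            }
        }
        return max;
    }

    public static void main(String[] args) {
        int[] hypothetical = {2, 5, 9, 13}; // made-up values, not the shipped defaults
        System.out.println(maxEditDistance(4, hypothetical));  // prints 1
        System.out.println(maxEditDistance(10, hypothetical)); // prints 3
    }
}
```

Under this reading, short words are only matched against very close candidates, which keeps unrelated suggestions out of the result list.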
b) Language Configuration files:
The Fuzzy matching algorithm uses these files to select the best suggestion for the language.
The file name should follow format: LANGUAGE.config.
By creating your own language files you can add more languages to the Suggester.
The files are located at the classpath: com/softcorporation/suggester/language/
Parameters:
LANGUAGE - the language identifier.
S1=S2:80[,Sn:##] - the relation (here it is 80) between strings S1 and S2, usually representing letters.
The strongest relation = 100 (default).
All language letters should be listed in the file, even if one letter has no relations to others.
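The relation syntax above can be read mechanically. The following sketch parses one line of the form S1=S2:80[,Sn:##] into a map of related strings and their relation values. It is an illustration, not the Suggester's parser: the class name, the sample letters, and the handling of a letter without relations (written here as "x=" or just "x") are assumptions, as is applying the default of 100 when a value is omitted.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative sketch only: parses one relation line; not the Suggester's own parser. */
class RelationLineSketch {

    static final int DEFAULT_RELATION = 100; // strongest relation, per the description above

    /** Parses a line such as "a=o:80,e:60" into {o=80, e=60}; returns an empty map if no relations are listed. */
    static Map<String, Integer> parseRelations(String line) {
        Map<String, Integer> relations = new LinkedHashMap<>();
        int eq = line.indexOf('=');
        if (eq < 0 || eq == line.length() - 1) {
            return relations; // letter listed without relations (assumed form)
        }
        for (String part : line.substring(eq + 1).split(",")) {
            int colon = part.lastIndexOf(':');
            if (colon < 0) {
                relations.put(part.trim(), DEFAULT_RELATION); // value omitted: assume the default
            } else {
                int value = Integer.parseInt(part.substring(colon + 1).trim());
                relations.put(part.substring(0, colon).trim(), value);
            }
        }
        return relations;
    }

    public static void main(String[] args) {
        System.out.println(parseRelations("a=o:80,e:60")); // prints {o=80, e=60}
        System.out.println(parseRelations("x="));          // prints {}
    }
}
```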
9. Open source.
The Suggester source code is published here.
10. Documentation
The documentation is available for the Advanced and Enterprise versions and is included in the "doc" directory of the download package.
Here is the Suggester Manual, compatible with the Basic Suggester version.
11. Java Code Samples
Java code samples are included in the download package. Click on a link for more information on How to use the Suggester.
If you would like to see an example of the Suggester Enterprise version in use, you can try the Virtual Keyboard for Smart TV
or the Wikipedia People Instant Fuzzy Search.
12. HTML Examples
The Suggester software includes a free spell-checker, which you can test here:
Click here to run English Spell Check test.
Click here to run Russian Spell Check test.
13. Dictionaries
Dictionaries are included with the free spell-checker, which you can download from here:
Suggester Spellcheck.
14. Release Notes
- 10 Jun, 2006. Initial release 1.0.0.
- 21 Oct, 2007. Release 1.1.2.
- 01 Feb, 2008. Release 1.1.3. Language configuration files update.
- 17 Aug, 2013. Open Source 1.1.2 Release.
15. Licensing and Legal Issues
For legal and licensing issues, please read the LICENSE.TXT
file.
Basically, there are no limitations on using or redistributing the code
besides providing a reference to the original developer: SoftCorporation LLC.
Java (TM) is a trademark of Oracle Corporation.