Class %Text.Japanese

datatype class %Text.Japanese extends %Text.Text

ODBC Type: VARCHAR

The %Text.Japanese class implements (or calls) the Japanese language-specific stemming algorithm and initializes the language-specific list of noise words.

Inventory

Parameters	Properties	Methods	Queries	Indices	ForeignKeys	Triggers
9		3

Summary

Methods
AddDocToDictionary	AddToDictionary	AddToThesaurus	BuildValueArray
ChooseSearchKey	Classify	CreateQList	DecompressOffsets
DisplayToLogical	DropDictionary	EndOfWord	ExcludeCommonTerms
IsValid	LoadThesaurus	LogicalToDisplay	LogicalToOdbc
LogicalToXSD	MakeSearchTerms	Normalize	RemoveDocFromDictionary
RemoveFromThesaurus	SeparateWords	Similarity	SimilarityIdx
Standardize	Translate	XSDToLogical

Parameters

• parameter CASEINSENSITIVE = 0;

CASEINSENSITIVE=1 causes comparisons to be performed by %CONTAINS in a case-insensitive manner when the collation of the underlying property is case insensitive. Setting CASEINSENSITIVE=1 improves matching and typically reduces both the size of the index and index update time. Note that CASEINSENSITIVE is not applicable to the %CONTAINSTERM operator, since %CONTAINSTERM always compares terms using the collation of the specified property.

• parameter DICTIONARY = 6;

The default dictionary for properties of this class. By overriding DICTIONARY, you can create separate dictionaries for different kinds of properties in the same language. For example, email documents, legal briefs, and medical records might each have a separate dictionary so that term frequency and document similarity can be appropriately estimated in each separate domain.

• parameter FILTERNOISEWORDS = 0;

FILTERNOISEWORDS controls whether common-word filtering is enabled. Specifying a list of noise words can greatly reduce the size of a text index and the associated index update time; however, to perform text search it is necessary to also remove noise words from the search pattern, and this can produce some counter-intuitive results. See example below.
Setting up noise word filtering is a two-step process. First, enable noise word filtering by setting FILTERNOISEWORDS=1. Second, populate the noise word dictionary by calling the ExcludeCommonTerms method with the desired number of noise words; this populates the corresponding DICTIONARY. ExcludeCommonTerms purges the previous set of noise words, so it may be called any number of times, but all text indexes on the corresponding properties must be rebuilt whenever the list of noise words changes.
Note: The SQL query:
SELECT myDocument FROM table t WHERE myDocument %CONTAINS ('to be or not to be')
will not find any qualifying rows if 'to', 'be', 'or', and 'not' are all noise words; however, if any of these terms is not a noise word, then only the non-noise words will participate in the matching process.
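The effect described in the note can be modeled outside the database. The following Python sketch is illustrative only: the function name filter_noise_words and the whitespace tokenization are assumptions made for this example, not part of the %Text API, and the real matching is performed by %CONTAINS.

```python
def filter_noise_words(pattern, noise_words):
    """Return the terms of `pattern` that are not noise words, in order.

    Hypothetical helper: models only the removal of noise words from a
    search pattern, using simple whitespace tokenization.
    """
    noise = {w.lower() for w in noise_words}
    return [t for t in pattern.lower().split() if t not in noise]


# Every term is a noise word: nothing is left to match against.
all_noise = filter_noise_words("to be or not to be", ["to", "be", "or", "not"])

# Only some terms are noise words: the remaining terms drive the match.
some_noise = filter_noise_words("to be or not to be", ["to", "or"])

print(all_noise)
print(some_noise)
```

When every term is filtered out, the search pattern is empty, which is why the query above returns no rows.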

• parameter MINWORDLEN = 1;

MINWORDLEN specifies the minimum length of the words that will be retained, not counting ngram words and post-stemmed words. MINWORDLEN provides a simple means of excluding terms based on their length, since short words such as 'a', 'to', and 'an' are usually connectives that carry little information. The length refers to the number of characters in the original document. Note that if stemming or thesaurus translation is enabled, a term in a text index may have fewer than MINWORDLEN characters.
Note: MINWORDLEN should typically be set to 3 or less when STEMMING=1, since otherwise a word stem could be classified as a noise word even though alternate forms of the word would not be classified as a noise word. For example, with MINWORDLEN=5 "jump" would be discarded as a noise word, whereas "jumps" would not.
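The "jump" versus "jumps" case can be reproduced with a few lines of Python. This is a sketch under the assumption that a word is retained when its original length is at least MINWORDLEN; the function name retain_by_length is hypothetical and does not exist in the class.

```python
def retain_by_length(words, min_word_len):
    """Keep only words whose length in the original text is >= min_word_len.

    Hypothetical helper: the length test is applied to the unstemmed word,
    as described for MINWORDLEN.
    """
    return [w for w in words if len(w) >= min_word_len]


# With a threshold of 5, "jump" (4 characters) is dropped but "jumps" is kept,
# even though both would share the same stem.
kept_strict = retain_by_length(["jump", "jumps", "a", "to"], 5)

# With a threshold of 3, both forms survive and only the connectives are dropped.
kept_relaxed = retain_by_length(["jump", "jumps", "a", "to"], 3)

print(kept_strict)
print(kept_relaxed)
```

The second call shows why a small MINWORDLEN is the safer choice when stemming is enabled: all forms of the word are treated alike.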

• parameter NGRAMLEN = 2;

NGRAMLEN is the maximum number of words that will be regarded as a single search term. When NGRAMLEN=2, two-word combinations are added to any index in addition to single words. Noise words are not counted when determining consecutive words.
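As a rough illustration of how NGRAMLEN widens the set of indexed terms, the Python sketch below builds single words plus word combinations up to a given length. It assumes that noise words are dropped before adjacent words are combined, which is one reading of the description above; the name make_terms and the joining of words with a space are assumptions for this example only.

```python
def make_terms(words, ngram_len=2, noise_words=()):
    """Return single words and n-word combinations up to `ngram_len`.

    Hypothetical helper: noise words are removed first, so the words on
    either side of a noise word are treated as consecutive.
    """
    noise = set(noise_words)
    kept = [w for w in words if w not in noise]
    terms = []
    for n in range(1, ngram_len + 1):
        for i in range(len(kept) - n + 1):
            terms.append(" ".join(kept[i:i + n]))
    return terms


terms = make_terms(["quick", "the", "brown", "fox"], ngram_len=2, noise_words=["the"])
print(terms)
```

Each additional unit of NGRAMLEN adds roughly one more term per word position, so index size grows with this parameter.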

• parameter NUMCHARS;

NUMCHARS specifies the characters other than digits that may appear in a number. Note that if "," is included in NUMCHARS, then "1,000" will be considered a single number, but the comma will be removed so that "1,000" will match "1000" using the %CONTAINS SQL predicate. The characters "." and "-" are also special and mark the beginning of a numeric term when the next character is numeric, regardless of how NUMCHARS is defined.
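The number handling can be approximated in Python as follows. This is a simplified model, not the class's parser: the function name normalize_number is hypothetical, only the comma is stripped (the one removal the description mentions), and a leading "." or "-" is accepted only when a digit follows.

```python
def normalize_number(token, num_chars=","):
    """Return a comparable form of `token` if it looks like a number, else None.

    Hypothetical helper modeling NUMCHARS: characters in `num_chars` may
    appear inside a number, and commas are removed from the result.
    """
    if not token or not any(c.isdigit() for c in token):
        return None
    start = 0
    # "." and "-" may begin a numeric term when the next character is a
    # digit, regardless of NUMCHARS.
    if token[0] in ".-":
        if len(token) < 2 or not token[1].isdigit():
            return None
        start = 1
    body = token[start:]
    if not all(c.isdigit() or c in num_chars for c in body):
        return None
    return token[:start] + body.replace(",", "")


print(normalize_number("1,000", ","))  # comma allowed, then stripped
print(normalize_number("1,000", ""))   # comma not allowed: not a single number
print(normalize_number("-5", ""))      # leading "-" followed by a digit
```

With "," in NUMCHARS, "1,000" and "1000" normalize to the same string, which is what lets them match each other.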

• parameter SEPARATEWORDS = 1;

Languages such as Japanese require the raw document text to be parsed and separated into words before being processed by the class methods. If SEPARATEWORDS=1, the SeparateWords() class method is called to perform this separation.

• parameter SOURCELANGUAGE = "ja";

SOURCELANGUAGE specifies the default source language from which documents or queries are translated. This enables documents written and stored in multiple languages to be queried in a single common language.

• parameter STEMMING = 0;

STEMMING replaces each word by its language-specific stem to improve the matching quality. Note that stemmed words are modified, and may or may not correspond to real words in the language. If stemming is enabled, then search patterns must also be stemmed prior to searching.
Note: Stemming of search strings is performed automatically by the %CONTAINS Cache SQL predicate if stemming is enabled on the corresponding property; however, stemming is not automatically performed by the more primitive FOR SOME %ELEMENT SQL predicate.

Methods

• classmethod ExcludeCommonTerms(nWords) as %Status

Classifies the most common nWords words in the current language as noise words. The words specified in NOISEWORDS100, NOISEWORDS200, and NOISEWORDS300 list the 300 most common words of the current language, in order of their frequency. Similarly, NOISEBIGRAMSn00 lists the 300 most common bigrams of the current language that would not typically be considered useful for searching.
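The selection behavior can be pictured with a small Python model. It is a sketch only: the class NoiseWordDictionary, its method name, and the five-word ranked list are hypothetical stand-ins for the frequency-ordered NOISEWORDS lists, and the real method returns a %Status rather than a set.

```python
class NoiseWordDictionary:
    """Hypothetical model of a noise word dictionary.

    `ranked_words` stands in for a frequency-ordered word list such as
    the one the NOISEWORDS parameters describe.
    """

    def __init__(self, ranked_words):
        self.ranked_words = list(ranked_words)
        self.noise_words = set()

    def exclude_common_terms(self, n_words):
        # Each call replaces the previous set instead of adding to it,
        # mirroring the purge behavior described for ExcludeCommonTerms.
        self.noise_words = set(self.ranked_words[:max(n_words, 0)])
        return self.noise_words


# Illustrative data only, not the actual contents of any NOISEWORDS list.
dictionary = NoiseWordDictionary(["the", "of", "and", "to", "a"])
first = set(dictionary.exclude_common_terms(3))
second = set(dictionary.exclude_common_terms(1))
print(sorted(first))
print(sorted(second))
```

Because a later call can shrink or grow the set, any index built under the earlier set no longer reflects the current noise words, which is why the indexes must be rebuilt after each change.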

• classmethod SeparateWords(rawText As %String) as %String

Separates individual terms with whitespace, for languages such as Japanese.

