Defining Word Sets

Word sets are created as part of the dimensions XML document. Word sets can be used to define stop words, stemming exclusion words and analysis exclusion words.

New in version 2.8.6.

Word sets are defined as elements within the dimensions.xml file and then referred to from dimension definitions by reference. The words in a word set definition are analyzed using the analyzer rules of the dimension to which they apply, so punctuation, extra whitespace and other extraneous characters are removed automatically, and words are split at punctuation marks.

Word sets and dimensions can appear in any order in the dimensions.xml file.

This example creates a simple word set. The definition contains extra punctuation, which is stripped from the final word set:

<wordset id="stopwords">a an, the: that; whose</wordset>

The word set from the above XML snippet would be: a an the that whose

To use a word set in a text type dimension, refer to the word set using the appropriate reference attribute.

When defining your dimension, use stopWords-ref to create a stop word list, stemmingExclusion-ref to create a stemming exclusion word set, and noAnalysis-ref to create an analysis exclusion word set.
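
As a rough sketch of how the three reference attributes might look side by side, the fragment below applies one word set of each kind to a single dimension. The ids and the words are hypothetical, and combining all three attributes on one dimension is an assumption made for illustration rather than something stated above.

```xml
<!-- Hypothetical ids and words. Using all three reference attributes
     on the same dimension is an assumption. -->
<dimension id="description"
           stopWords-ref="commonWords"
           stemmingExclusion-ref="unstemmedWords"
           noAnalysis-ref="unanalyzedWords"/>

<!-- Words to treat as stop words -->
<wordset id="commonWords">a an the that whose</wordset>

<!-- Words to exclude from stemming -->
<wordset id="unstemmedWords">news glasses</wordset>

<!-- Words to exclude from analysis -->
<wordset id="unanalyzedWords">ipad iphone</wordset>
```

Keeping each kind of word set in its own element means the same set can be referenced from several dimensions without repeating the words.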

Creating Merged Word Sets

Word sets can be created in manageable blocks and merged with other word sets before they are used on a dimension. There are several variants that you can use to refer to an individual word set or merged word sets.

New in version 2.8.7.

Variant 1: Inline Word Set with Dimension

This is the simplest variant: the word set is defined inline in the dimension declaration.

<dimension id="example" stopwords="then am are in on it"/>

Variant 2: Word Set Reference with Dimension

This variant simplifies the dimension declaration by moving the word set into its own definition. It is also the easiest way to reuse a word set.

<dimension id="example" stopWords-ref="myStopWords"/>
<wordset id="myStopWords">
  then am are in on it
</wordset>

Variant 3: Multiple Word Set References with Dimension

This variant merges multiple predefined word sets by listing their ids, separated by commas, in the reference attribute of the dimension declaration.

<dimension id="example" stopWords-ref="myStopWords,myStopWords2"/>
<wordset id="myStopWords">
  then am are in on it
</wordset>
<wordset id="myStopWords2">
  into a through between under over
</wordset>

Variant 4: Word Sets Referring to Word Sets with Dimension

This variant creates a new word set by merging the words from one or more existing word sets. The new word set can also add words of its own to the merged result.

<dimension id="example" stopWords-ref="combinedStopWords"/>
<wordset id="combinedStopWords" wordset-ref="myStopWords,myStopWords2">
  super great new words added to the merged word sets
</wordset>
<wordset id="myStopWords">
  then am are in on it
</wordset>
<wordset id="myStopWords2">
  into a through between under over
</wordset>
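
To make the result of the merge concrete, the fragment below sketches what combinedStopWords from Variant 4 would presumably amount to if the same words were written inline, as in Variant 1. The word order and the handling of repeated words such as "words" are assumptions; only the membership of the merged set follows from the definitions above.

```xml
<!-- Assumed inline equivalent of Variant 4: the words added by
     combinedStopWords, followed by the words of myStopWords and
     myStopWords2. Order and duplicate handling are assumptions. -->
<dimension id="example"
           stopwords="super great new words added to the merged word sets then am are in on it into a through between under over"/>
```

Compared with this inline form, the merged definition keeps each block of words short and lets myStopWords and myStopWords2 be reused by other word sets or dimensions.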