About Text Dimensions

When text is indexed, it passes through analyzing and tokenizing filters that transform and normalize the data to make it more searchable, for example by removing blank spaces, HTML tags and stop words, or by applying stemming or phonetic analysis. At both indexing time and query time, the engine applies the analyzers specified in the dimension definition.

For more information on creating a text dimension and its options, refer to Text Type Dimensions.

Analysis

When the engine indexes an item, its individual text fields pass through analyzing and tokenizing filters that transform and normalize the original text, for example by removing blank spaces, removing HTML code, stemming, or replacing a particular character with another. The engine automatically processes input with the same logic when indexing and when querying, which makes relevant searches easy.

The analysis process consists of tokenization and transformation steps. Tokenization splits text into tokens (words). Transformation then takes each token and normalizes it, replaces it, or adds further tokens alongside it. Examples of tokenization and transformation options follow in the sections below.
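The two-stage process described above can be sketched in a few lines of Python. This is only an illustration, not the engine's implementation: the function names (tokenize, lowercase, strip_punctuation, analyze) and the choice of filters are assumptions made for the example. The point it shows is that the same pipeline runs at index time and at query time, so both sides produce comparable tokens.

```python
import re
from typing import Callable, List

# A transformation takes a list of tokens and returns a new list of tokens.
Transform = Callable[[List[str]], List[str]]


def tokenize(text: str) -> List[str]:
    """Split text into tokens on whitespace (a stand-in for a real tokenizer)."""
    return text.split()


def lowercase(tokens: List[str]) -> List[str]:
    """Normalize every token to lower case."""
    return [t.lower() for t in tokens]


def strip_punctuation(tokens: List[str]) -> List[str]:
    """Remove non-word characters and drop tokens that become empty."""
    cleaned = [re.sub(r"[^\w]", "", t) for t in tokens]
    return [t for t in cleaned if t]


def analyze(text: str, transforms: List[Transform]) -> List[str]:
    """Tokenize first, then apply each transformation in order."""
    tokens = tokenize(text)
    for transform in transforms:
        tokens = transform(tokens)
    return tokens


# Hypothetical pipeline, used unchanged for indexing and for querying.
PIPELINE: List[Transform] = [lowercase, strip_punctuation]

indexed = analyze("Wireless  Routers, on SALE!", PIPELINE)
queried = analyze("wireless routers", PIPELINE)
print(indexed)  # ['wireless', 'routers', 'on', 'sale']
print(queried)  # ['wireless', 'routers']
```

Because both calls share PIPELINE, the query tokens line up with the indexed tokens even though the original strings differ in case, spacing and punctuation.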

Tokenization

The engine provides four tokenizers: word delimiter, standard, whitespace and smartcn. The default is the word delimiter tokenizer.

Word Delimiter Tokenizer (wordDelimiter)

  • Splits on intra-word delimiters (by default, all non-alphanumeric characters) and combines word parts to create a single word. For example, wi-fi becomes wi, fi, wifi and wi-fi.
  • Splitting on intra-word delimiters also works with numbers, converting 555-322-1212 into 555, 322, 1212 and 5553221212.
  • Splits on case transitions. For example, with ignoreCase enabled, PowerShot becomes power, shot and powershot, and McDonald becomes mc and donald.
  • Correctly stems English possessives. For example, O'Neil's becomes O'Neil.
  • Splits on letter-number transitions. For example, MD80 becomes MD, 80 and MD80. This feature is not enabled by default.
  • Maintains leading and trailing currency symbols, plus (+) and minus (-) signs. Customers can search for $10 or 10, +20, 10- or -10.
  • Maintains trailing % symbols. Customers can search for 10% or 10.
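The first three rules in the list above can be approximated as follows. This is a rough sketch under stated assumptions, not the engine's tokenizer: the function name word_delimiter_tokens is hypothetical, only ASCII letters and digits are handled, the original form is always kept, and possessives, letter-number transitions, currency symbols, signs and % handling are left out. The exact set of tokens the engine emits may differ.

```python
import re
from typing import List


def word_delimiter_tokens(word: str, ignore_case: bool = True) -> List[str]:
    """Sketch of word-delimiter splitting for a single whitespace-free word."""
    # Split on runs of non-alphanumeric characters (the intra-word delimiters).
    parts = [p for p in re.split(r"[^A-Za-z0-9]+", word) if p]

    # Split each part again at lower-to-upper case transitions.
    subparts: List[str] = []
    for part in parts:
        subparts.extend(re.sub(r"(?<=[a-z])(?=[A-Z])", " ", part).split())

    tokens = list(subparts)
    if len(subparts) > 1:
        tokens.append("".join(subparts))  # the combined form, e.g. "wifi"
    tokens.append(word)  # assumption: the original form is always kept

    if ignore_case:
        tokens = [t.lower() for t in tokens]

    # Remove duplicates while preserving order.
    return list(dict.fromkeys(tokens))


print(word_delimiter_tokens("wi-fi"))      # ['wi', 'fi', 'wifi', 'wi-fi']
print(word_delimiter_tokens("PowerShot"))  # ['power', 'shot', 'powershot']
```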

Standard Tokenizer (standard)

  • Splits words at punctuation characters, removing punctuation. A dot that’s not followed by whitespace is considered part of a token, though.
  • Splits words at hyphens, unless there’s a number in the token, in which case the whole token is interpreted as a product number and is not split.
  • Recognizes email addresses and Internet host names as one token.

Whitespace Tokenizer (whitespace)

  • Splits words at whitespace characters (spaces, new lines, tabs, etc).

For more information on how to determine which tokenizer to use, refer to Text Type.

Smart Chinese Tokenizer (smartcn)

  • Tries to split words based on a Hidden Markov Model.

This is an experimental option.

Query Parsers

A query parser is used to convert a string of text into a query and save the user from having to understand complex syntax to do simple requests.

The query parser treats each word of the query as if it is part of a phrase. It attempts to find any documents with any of the words in the query, then determines whether the query words form phrases in the items. Items that contain all of the query words rank highest, and among those, items in which the query words are closest together rank higher still.

As of version 3.8, double quotes (") around words add influence to the query. When quotes are applied around one or more words, the resulting documents must match all of the quoted words. For phrases in quotes, the following analysis steps are not applied:

  • stemming
  • phonetic analysis
  • synonym reduction and expansion

When double quotes surround more than one word, the words must appear in the same order as entered and immediately adjacent to each other.
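The quoting rule can be illustrated with a small sketch. It is not the engine's query parser: split_query and phrase_matches are hypothetical helper names, only straight double quotes are recognized, and ranking is not modeled. The sketch separates quoted phrases from the remaining words and checks that a quoted phrase appears in order, with its words adjacent.

```python
import re
from typing import List, Tuple


def split_query(query: str) -> Tuple[List[List[str]], List[str]]:
    """Return (quoted phrases as word lists, remaining unquoted words)."""
    phrases = [m.split() for m in re.findall(r'"([^"]*)"', query)]
    remainder = re.sub(r'"[^"]*"', " ", query)
    return [p for p in phrases if p], remainder.split()


def phrase_matches(phrase: List[str], doc_tokens: List[str]) -> bool:
    """True when the phrase words occur in order and next to each other."""
    n = len(phrase)
    return any(
        doc_tokens[i:i + n] == phrase
        for i in range(len(doc_tokens) - n + 1)
    )


phrases, words = split_query('"harvard business" review')
print(phrases)  # [['harvard', 'business']]
print(words)    # ['review']

print(phrase_matches(phrases[0], "the harvard business review".split()))  # True
print(phrase_matches(phrases[0], "business at harvard".split()))          # False
```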

Starts With Queries

The Discovery Search Engine supports starts-with queries for individual words and for queries consisting of more than one word. For example, searching for “harva bus” can return results for “Harvard Business”.

New in version 2.8.1.
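A starts-with match over several words could look like the sketch below. The function name starts_with_match is hypothetical, and the rule that the matched words must be consecutive and in query order is an assumption made for the example; the engine's actual matching and ranking rules may be looser.

```python
def starts_with_match(query: str, text: str) -> bool:
    """True when each query word is a prefix of consecutive words in text."""
    prefixes = query.lower().split()
    words = text.lower().split()
    if not prefixes:
        return False
    n = len(prefixes)
    # Slide a window of n words over the text and compare prefix by prefix.
    return any(
        all(word.startswith(prefix)
            for prefix, word in zip(prefixes, words[i:i + n]))
        for i in range(len(words) - n + 1)
    )


print(starts_with_match("harva bus", "Harvard Business"))    # True
print(starts_with_match("harva bus", "Harvard Law School"))  # False
```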

Accented Characters

Some languages use accented characters (diacritics), but you cannot expect users to type them consistently, either in queries or in the data for your changesets.

If you are using a stemmer, most stemmers are somewhat forgiving about accented characters, which are handled on a language-specific basis.

For Latin-script and Greek-script writing systems, you can remove all accents from characters using the dimension field accentFolding. For more information refer to accentFolding in the Text Type dimension documentation.
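One common way to fold accents, shown here only as an illustration of the idea and not as the engine's accentFolding implementation, is to decompose each character with Unicode normalization and drop the combining marks. The function name fold_accents is an assumption for the example.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Remove diacritics by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(fold_accents("además"))    # ademas
print(fold_accents("Ελληνικά"))  # Ελληνικα
```

With folding applied on both sides, a query typed as “ademas” and indexed text containing “además” produce the same token.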

Stop Words

Stop words affect an index and searches in three ways: relevance, performance, and resource utilization.

From a relevance perspective, these extremely high-frequency terms tend to throw off the scoring algorithm, and you won’t get very good results if you leave them in your text. At the same time, if you remove them, you can return bad results when the stop word is important.

From a performance perspective, if you keep stop words, some queries (especially phrase queries) can be slower.

From a resource utilization perspective, if you keep stop words, the index is much larger than if you remove them.

Default Stop Word Lists for Supported Languages

The default list of English stop words is

a and are as at be but by for if in into is it no not of on or s
such t that the their then there these they this to was will with

The default list of Spanish stop words is

New in version 2.8.3.

a acuerdo adelante ademas además adrede ahi ahí ahora al alli allí
alrededor antano antaño ante antes apenas aproximadamente aquel
aquél aquella aquélla aquellas aquéllas aquello aquellos aquéllos
aqui aquí arribaabajo asi así aun aún aunque b bajo bastante bien
breve c casi cerca claro como cómo con conmigo contigo contra cual
cuál cuales cuáles cuando cuándo cuanta cuánta cuantas cuántas
cuanto cuánto cuantos cuántos d de debajo del delante demasiado
dentro deprisa desde despacio despues después detras detrás dia
día dias días donde dónde dos durante e el él ella ellas ellos en
encima enfrente enseguida entre es esa ésa esas ésas ese ése eso
esos ésos esta está ésta estado estados estan están estar estas
éstas este éste esto estos éstos ex excepto f final fue fuera
fueron g general gran h ha habia había habla hablan hace hacia han
hasta hay horas hoy i incluso informo informó j junto k l la lado
las le lejos lo los luego m mal mas más mayor me medio mejor menos
menudo mi mí mia mía mias mías mientras mio mío mios míos mis
mismo mucho muy n nada nadie ninguna no nos nosotras nosotros
nuestra nuestras nuestro nuestros nueva nuevo nunca o os otra
otros p pais paìs para parte pasado peor pero poco por porque
pronto proximo próximo puede q qeu que qué quien quién quienes
quiénes quiza quizá quizas quizás r raras repente s salvo se sé
segun según ser sera será si sí sido siempre sin sobre solamente
solo sólo son soyos su supuesto sus suya suyas suyo t tal tambien
también tampoco tarde te temprano ti tiene todavia todavía todo
todos tras tu tú tus tuya tuyas tuyo tuyos u un una unas uno unos
usted ustedes v veces vez vosotras vosotros vuestra vuestras
vuestro vuestros w x y ya yo z

The default list of French stop words is

à ai aie aient aies ait as au aura aurai auraient aurais aurait
auras aurez auriez aurions aurons auront aux avaient avais avait
avec avez aviez avions avons ayant ayez ayons c ce ceci cela celà ces
cet cette d dans de des du elle en es est et étaient étais était
étant été étée étées êtes étés étiez étions eu eue eues eûmes
eurent eus eusse eussent eusses eussiez eussions eut eût eûtes eux
fûmes furent fus fusse fussent fusses fussiez fussions fut fût
fûtes ici il ils j je l la le les leur leurs lui m ma mais me même
mes moi mon n ne nos notre nous on ont ou par pas pour qu que quel
quelle quelles quels qui s sa sans se sera serai seraient serais
serait seras serez seriez serions serons seront ses soi soient
sois soit sommes son sont soyez soyons suis sur t ta te tes toi
ton tu un une vos votre vous y

The default list of Italian stop words is

a ai al alla allo altre altri altro anche ancora avere aveva ben
che chi con cosa cui da del della dello dentro deve devo di e ecco
fare fra giu ha hai hanno ho il io la le lei lo loro lui ma me nei
nella no noi nome nostro o oltre ora pero piu poco qua quasi
quello questo qui quindi sara sei sembra sembrava senza sia siamo
solo sono sopra sotto stati stato stesso su subito sul sulla tanto
te tra un una uno va vai voi

The default list of German stop words is

aber alle allem allen aller alles als also am an ander andere
anderem anderen anderer anderes anderm andern anderr anders auch
auf aus bei bin bis bist da damit dann das daß dasselbe dazu dein
deine deinem deinen deiner deines dem demselben den denn denselben
der derer derselbe derselben des desselben dessen dich die dies
diese dieselbe dieselben diesem diesen dieser dieses dir doch dort
du durch ein eine einem einen einer eines einig einige einigem
einigen einiger einiges einmal er es etwas euch euer eure eurem
euren eurer eures für gegen gewesen hab habe haben hat hatte
hatten hier hin hinter ich ihm ihn ihnen ihr ihre ihrem ihren
ihrer ihres im in indem ins ist jede jedem jeden jeder jedes jene
jenem jenen jener jenes jetzt kann kein keine keinem keinen keiner
keines können könnte machen man manche manchem manchen mancher
manches mein meine meinem meinen meiner meines mich mir mit muss
musste nach nicht nichts noch nun nur ob oder ohne sehr sein seine
seinem seinen seiner seines selbst sich sie sind so solche solchem
solchen solcher solches soll sollte sondern sonst über um und uns
unse unsem unsen unser unses unter viel vom von vor während war
waren warst was weg weil weiter welche welchem welchen welcher
welches wenn werde werden wie wieder will wir wird wirst wo wollen
wollte würde würden zu zum zur zwar zwischen

The default list of Dutch stop words is

aan al alles als altijd andere ben bij daar dan dat de der deze
die dit doch doen door dus een eens en er ge geen geweest haar had
heb hebben heeft hem het hier hij hoe hun iemand iets ik in is ja
je kan kon kunnen maar me meer men met mij mijn moet na naar niet
niets nog nu of om omdat onder ons ook op over reeds te tegen toch
toen tot u uit uw van veel voor want waren was wat werd wezen wie
wil worden wordt zal ze zelf zich zij zijn zo zonder zou

The default list of Brazilian Portuguese stop words is

a ainda alem ambas ambos antes ao aonde aos apos aquele aqueles as
assim com como contra contudo cuja cujas cujo cujos da das de dela
dele deles demais depois desde desta deste dispoe dispoem diversa
diversas diversos do dos durante e ela elas ele eles em entao
entre essa essas esse esses esta estas este estes ha isso isto
logo mais mas mediante menos mesma mesmas mesmo mesmos na nao nas
nem nesse neste nos o os ou outra outras outro outros pelas pelo
pelos perante pois por porque portanto propios proprio quais qual
qualquer quando quanto que quem quer se seja sem sendo seu seus
sob sobre sua suas tal tambem teu teus toda todas todo todos tua
tuas tudo um uma umas uns

Stemming

Stemming can help improve relevance, but it can hurt as well.

There is no general rule for whether or not to stem: It depends not only on the language, but also on your documents and queries.

The engine uses the Porter Stemming Algorithm, which normalizes words by removing common endings.

For example: “riding”, “rides”, “horses” ==> “ride”, “ride”, “hors”.

New in version 2.8.3.

The engine currently supports stemmers for English, Spanish, French, German, Italian, Brazilian Portuguese and Dutch.

Since stemmers are language-specific, the choice of stemmer depends on the locale in effect for the text dimension: if the locale language is one of the supported languages, the stemmer for that language is used.

Phonetic Analysis

The engine currently supports four phonetic algorithms for use in text analysis: Soundex, Refined Soundex, Metaphone and Double Metaphone. These phonetic algorithms only support English.

Using another language will cause unpredictable results.

New in version 2.8.3.

Soundex

Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling. The algorithm mainly encodes consonants; a vowel will not be encoded unless it is the first letter. Soundex is probably the most widely-known of the phonetic encoders.

The Soundex method was originally developed by Margaret Odell and Robert Russell.
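For illustration, the widely documented American Soundex rules can be written as the short function below. This is a sketch of the general algorithm and is not taken from the engine's code, so edge cases may be handled differently there. It accepts ASCII letters only.

```python
# Consonant groups and their Soundex digits; vowels, H, W and Y have no code.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name: str) -> str:
    """Return the four-character American Soundex code for a name."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = _CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters that share a code
        code = _CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code counts again
    # Pad with zeros and cut to one letter plus three digits.
    return (first + "".join(digits) + "000")[:4]


print(soundex("Robert"), soundex("Rupert"))  # R163 R163
print(soundex("Smith"), soundex("Schmidt"))  # S530 S530
```

Names that sound alike, such as Robert and Rupert, receive the same fixed-length key and can therefore be matched despite spelling differences.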

For more information on the Soundex rules, refer to the Wikipedia article:

http://en.wikipedia.org/wiki/Soundex

New in version 2.8.3.

Refined Soundex

Refined Soundex encodes a string into a Refined Soundex value, a code that is optimized for spell-checking words.

New in version 2.8.3.

Metaphone

Metaphone is a phonetic algorithm, published in 1990, for indexing words by their English pronunciation. The algorithm produces variable-length keys as its output, as opposed to Soundex’s fixed-length keys. Similar-sounding words share the same keys.

Metaphone was developed by Lawrence Philips as a response to deficiencies in the Soundex algorithm. It uses a larger set of rules for English pronunciation.

For more information on the Metaphone rules, refer to the Wikipedia article:

http://en.wikipedia.org/wiki/Metaphone

New in version 2.8.3.

Double Metaphone

Double Metaphone is the second generation of Lawrence Philips’s Metaphone algorithm. It is called “Double” because it can return both a primary and a secondary code for a string; this accounts for some ambiguous cases as well as for multiple variants of surnames with common ancestry. For example, encoding the name “Smith” yields a primary code of SM0 and a secondary code of XMT, while the name “Schmidt” yields a primary code of XMT and a secondary code of SMT; both have XMT in common.
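How the two codes are used for matching can be sketched as a set intersection. The codes below are the ones quoted in the example above; the sketch does not compute them, and the names CODES, shared_codes and sounds_alike are hypothetical.

```python
from typing import Dict, Set, Tuple

# (primary, secondary) codes quoted from the example above.
# A real encoder would compute these from the names.
CODES: Dict[str, Tuple[str, str]] = {
    "Smith": ("SM0", "XMT"),
    "Schmidt": ("XMT", "SMT"),
}


def shared_codes(a: str, b: str) -> Set[str]:
    """Return the codes that both names have in common."""
    return set(CODES[a]) & set(CODES[b])


def sounds_alike(a: str, b: str) -> bool:
    """Two names match when any primary or secondary code is shared."""
    return bool(shared_codes(a, b))


print(shared_codes("Smith", "Schmidt"))  # {'XMT'}
print(sounds_alike("Smith", "Schmidt"))  # True
```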

Double Metaphone tries to account for many irregularities in English of Slavic, Germanic, Celtic, Greek, French, Italian, Spanish, Chinese, and other origin. Thus it uses a much more complex ruleset than Metaphone; for example, it tests for approximately 100 different contexts of the use of the letter C alone.

For more information on the Double Metaphone rules, refer to the Wikipedia article:

http://en.wikipedia.org/wiki/Double_Metaphone

New in version 2.8.3.

Word Sets

As of version 2.8.6, the engine supports custom word set creation to further refine the text analysis process. Word sets can be created once and reused many times. They are used to create stop words, stemming exclusion lists and lists of words that should not be analyzed at all.

To ensure consistency and guarantee that queries always return relevant results, the word sets you define are always processed using exactly the same analysis as the dimension that uses the word set.

As of version 2.8.7, word sets can be defined separately and merged for use on more than one dimension.

For more information on word sets and how to create them, refer to Defining Word Sets.

Synonyms & Thesauruses

As of version 2.9, the engine supports custom synonym definitions by creating a thesaurus, which can be used to map a word or phrase to another set of words or phrases. Like word sets, a thesaurus can be created once and reused many times.

To ensure consistency and guarantee that queries always return relevant results, the thesauruses you define are always processed using exactly the same analysis as the dimension that uses them.

Thesauruses can be defined separately and merged for use on more than one dimension.

The engine supports synonym reduction at index time, and both synonym reduction and expansion at query time.

Whenever possible, the engine tries to reduce synonyms to their simplest, common forms. When the engine creates a query, it tries to apply synonyms wherever possible, in a “greedy” manner. US state names are a good example: Massachusetts is a synonym for “MA”, so with the engine’s defaults all occurrences of Massachusetts are replaced with “MA”. Any other synonyms, including synonym phrases, are also reduced. For example, “Massachusetts Institute of Technology” is reduced to “MA Institute of Technology”.
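Synonym reduction can be sketched as a left-to-right scan that replaces the longest matching word sequence at each position. Reading “greedy” as longest-match-first is an assumption, as are the function name reduce_synonyms and the THESAURUS structure; the engine's thesaurus format is described in Defining Synonyms/Thesauruses.

```python
from typing import Dict, List, Tuple

# Hypothetical thesaurus: lower-cased word sequence -> reduced form.
THESAURUS: Dict[Tuple[str, ...], Tuple[str, ...]] = {
    ("massachusetts",): ("MA",),
    ("new", "york"): ("NY",),
}


def reduce_synonyms(tokens: List[str],
                    thesaurus: Dict[Tuple[str, ...], Tuple[str, ...]]) -> List[str]:
    """Replace the longest matching synonym at each position, left to right."""
    max_len = max((len(key) for key in thesaurus), default=0)
    out: List[str] = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in thesaurus:
                out.extend(thesaurus[key])
                i += n
                break
        else:
            out.append(tokens[i])  # no synonym starts here; keep the token
            i += 1
    return out


print(reduce_synonyms("Massachusetts Institute of Technology".split(), THESAURUS))
# ['MA', 'Institute', 'of', 'Technology']
```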

Synonym expansion happens at query time, including expansion of phrases. If “New York” is a synonym for “NY”, then the phrase query “Brooklyn, NY” is also expanded to “Brooklyn, New York”. This feature allows the powerful word parser to be used together with synonyms.
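Query-time expansion can be sketched by generating one query variant per synonym choice. The SYNONYMS mapping and the function name expand_query are assumptions for the example, only single-word keys are handled, and punctuation is assumed to have been removed by earlier analysis.

```python
from itertools import product
from typing import Dict, List

# Hypothetical mapping from a lower-cased word to its alternative forms.
SYNONYMS: Dict[str, List[str]] = {"ny": ["NY", "New York"]}


def expand_query(query: str, synonyms: Dict[str, List[str]]) -> List[str]:
    """Return every variant of the query produced by substituting synonyms."""
    options = [synonyms.get(word.lower(), [word]) for word in query.split()]
    return [" ".join(choice) for choice in product(*options)]


print(expand_query("Brooklyn NY", SYNONYMS))
# ['Brooklyn NY', 'Brooklyn New York']
```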

For more information on thesauruses and how to create them, refer to Defining Synonyms/Thesauruses.

Did You Mean? and Spelling Correction

As of version 2.9, the engine supports a “Did You Mean” query suggestion feature that also performs spelling corrections using custom dictionaries. Did You Mean takes a user’s existing text query and suggests alternate queries that may be what the user was looking for.

The engine allows configuration of the algorithms used to detect misspelled words, lets customers create their own spelling dictionaries, and also uses the available indexed words to determine query relevance.
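The general idea of suggesting a query from known words can be sketched with Python's standard difflib module. This is not the engine's algorithm: the similarity measure, the cutoff value, the function name did_you_mean and the sample vocabulary are all assumptions made for the example.

```python
import difflib
from typing import List, Optional


def did_you_mean(query: str, vocabulary: List[str],
                 cutoff: float = 0.75) -> Optional[str]:
    """Suggest a corrected query, or None when every word is already known."""
    known = set(vocabulary)
    suggestion: List[str] = []
    changed = False
    for word in query.lower().split():
        if word in known:
            suggestion.append(word)
            continue
        # Pick the single closest known word above the similarity cutoff.
        close = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
        if close:
            suggestion.append(close[0])
            changed = True
        else:
            suggestion.append(word)  # nothing similar enough; keep as typed
    return " ".join(suggestion) if changed else None


# Hypothetical vocabulary, standing in for indexed words or a custom dictionary.
VOCABULARY = ["business", "harvard", "review"]

print(did_you_mean("harverd busness", VOCABULARY))   # harvard business
print(did_you_mean("harvard business", VOCABULARY))  # None
```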