Dimension Types

Every dimension has one of the following types, each discussed in detail below:

  • Ordered
  • Tree
  • Mutex
  • Keyword
  • Integer
  • Double
  • Geoloc
  • Time
  • Text (fieldedText)
  • GroupBy

Ordered Type

A dimension whose values are ordered in some fashion. For example, level of education could be defined as an ordered list: “None”, “High School”, “Some College”, “Associate’s Degree”, “Bachelor’s Degree”, “Master’s Degree”, “PhD”.

Ordered Element XML Element

<element id="344" name="Comedy"/>

The element element specifies a facet of the dimension and/or provides mapping information from an encoded id to a self-documenting description.

The following dimension types may not specify elements: Geoloc, Keyword and Text.

id="<Element ID>"

An id is required and must be unique within the set of all element elements of the enclosing dimension element. The id refers to a data value found in the changeset.

For searching, if the value attribute is not specified, then the value of the element is the value of id.

The value of id is case-sensitive, regardless of the ignoreCase attribute of the dimension element or search criteria.

value="<Element Value>"

A value is optional. For searching, if the value attribute is not used, then the value of the element is the value of id.

The value of value follows the case comparison rules specified by the ignoreCase attribute of the dimension for searching.

<dimension id="user.gender" type="mutex">
    <element id="0" value="M" name="Male"/>
    <element id="1" value="F" name="Female"/>
</dimension>

name="<Descriptive Name>"

A name is optional. It is used to map an encoded id into a readable label that is used in the Admin tool Indices tab and, as of version 2.9, when requested, is returned in the indexValues query response. This allows a code to be used to index and then to translate that code to a human-readable value.

Note that the name attribute of each element is optional.

<dimension id="user.education" type="ordered">
    <element id="0" name="None"/>
    <element id="1" name="High School"/>
    <element id="2" name="Some College"/>
    <element id="3" name="Associate's Degree"/>
    <element id="4" name="Bachelor's Degree"/>
    <element id="5" name="Master's Degree"/>
    <element id="6" name="PhD"/>
</dimension>

The order of the elements is significant - in this example, a Bachelor’s Degree is closer to a Master’s Degree than it is to High School.

Tree Type

A dimension whose values can be structured in a hierarchy.

<dimension id="ethnicity" type="tree">
    <element id="99" name="Asian">
        <element id="1" name="Japanese"/>
        <element id="2" name="Vietnamese"/>
        <element id="17" name="Chinese"/>
    </element>
    <element id="3" name="African American"/>
    <element id="5" name="Caribbean"/>
    <element id="6" name="Caucasian / White">
        <element id="7" name="Eastern European"/>
        <element id="8" name="Western European"/>
        <element id="9" name="North American"/>
        <element id="10" name="Other Caucasian"/>
    </element>
    <element id="11" name="East Indian"/>
    <element id="12" name="Hispanic / Latino"/>
    <element id="13" name="Middle Eastern"/>
    <element id="14" name="Native American"/>
    <element id="15" name="Pacific Islands"/>
    <element id="16" name="Other"/>
</dimension>

Note that parent elements don’t have to exist in your dataset; they can be used purely to create structure. In the above example, let’s assume that Asian exists solely as a grouping element, in other words the Asian value doesn’t exist in the dataset but we want to use it to create structure. If someone performs a search for Vietnamese, Vietnamese users are returned first, followed by Japanese and Chinese (in no particular order) and then all other ethnicities. The id of elements that exist solely as parents is irrelevant as long as it doesn’t conflict with any other element’s id.

As of version 2.9, tree dimension searches can benefit from the engine’s highlighting feature. You can request that the engine highlight all matches against your tree dimension text and present those matches in a clear fashion. For more information on how to specify highlighting in the query request, refer to Overview and Highlighting Criterion.

If you have a dimension whose elements have little or no discernible structure, you can use the tree type as well. For example:

<dimension id="user.favorite_fastfood" type="tree">
    <element id="0" name="Fried Chicken"/>
    <element id="1" name="Burgers"/>
    <element id="2" name="Hot Dogs"/>
    <element id="3" name="Pizza"/>
    <element id="4" name="Okonomiyaki"/>
</dimension>

Mutex Type

A dimension whose values are mutually exclusive of one another. For example, gender could be considered mutually exclusive (a search for “Male” would explicitly exclude items associated with “Female”).

<dimension id="user.gender" type="mutex">
    <element id="0" name="Male"/>
    <element id="1" name="Female"/>
</dimension>

Keyword Type

String values can be represented by keyword type dimensions. Keyword dimensions are best used for standardized single word values or tags. When they are indexed, the whole string is used and no further textual analysis is done.

<dimension id="us_state" type="keyword"/>

Unless faceting is explicitly disabled, each distinct value in a keyword type dimension is available for faceting. There is no limit to the number of facet values unless maxBuckets is set and that limit is hit.

For more information on faceting with keyword type dimensions refer to Faceting.

maxBuckets = "0"

This attribute limits the number of facets that can be created for the dimension. If there are more buckets than that number, all facet counts will be removed. To disable facet counts entirely, set maxBuckets to 0.
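
As an illustration of the two uses described above, the following sketch caps facets on one keyword dimension and disables facet counts on another. The dimension id session_token and the limit of 60 are hypothetical values chosen for this example.

```xml
<!-- Facet counts are kept only while there are no more than 60 distinct buckets -->
<dimension id="us_state" type="keyword" maxBuckets="60"/>

<!-- Hypothetical high-cardinality dimension: facet counts disabled entirely -->
<dimension id="session_token" type="keyword" maxBuckets="0"/>
```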

Integer Type (Scalar)

Scalar values represented as integers such as age. The integer type is similar to a keyword type except that its values are treated as whole numbers.

<dimension id="user.age" type="integer"/>
<dimension id="user.join_date" type="integer"/>

Integer type dimensions are scalar dimensions with special features that support ranges of values. For more information on these features refer to Scalar Features.

For more information on faceting with scalar type dimensions refer to Faceting.

Double Type (Scalar)

Scalar values represented as doubles or floats, such as height. The double type is similar to a keyword type except that its values are treated as numbers.

<dimension id="user.height" type="double"/>

Double type dimensions are scalar dimensions with special features that support ranges of values. For more information on these features refer to Scalar Features.

For more information on faceting with scalar type dimensions refer to Faceting.

Time Type (Scalar)

Values representing dates and times. The value is represented internally as a number, so any operations that are valid for numeric dimensions are also valid for time type dimensions.

The time type is similar to a keyword type except that its values are treated as dates.

Time type dimensions are scalar dimensions with special features that support ranges of values. For more information on these features refer to Scalar Features.

For more information on faceting with scalar type dimensions refer to Faceting.

<dimension id="last_visited" type="time" format="yyyy-MM-dd'T'HH:mm:ss" timeZone="UTC"/>

The following attributes affect how the values in your time dimension are indexed.

format= "unix" | "<format specification>"

Time dimension values are parsed according to a date/time format specification, which you supply with the format attribute. If your date format does not include a time zone, you are strongly encouraged to specify a timeZone explicitly on your dimension element. The only restriction on the format is that it should not include a space. This restriction may be lifted in a later release.

Specify “unix” if your time values are in Unix date/time format, otherwise compose a date/time format string by referencing the following table.

To include a literal string in the format, enclose the string in single quotes (').

Default: the default US locale format.

Letter  Date or Time Component  Presentation       Examples
G       Era designator          Text               AD
y       Year                    Year               2010; 78
M       Month in year           Month              July; Jul; 07
d       Day in month            Number             10
a       Am/pm marker            Text               PM
H       Hour in day (0-23)      Number             0
k       Hour in day (1-24)      Number             24
K       Hour in am/pm (0-11)    Number             0
h       Hour in am/pm (1-12)    Number             12
m       Minute in hour          Number             30
s       Second in minute        Number             55
S       Millisecond             Number             978
z       Time zone               General time zone  Eastern Standard Time; EST; GMT-05:00
Z       Time zone               RFC 822 time zone  -0800

timeZone="<timezone specification>"

In general, we recommend always specifying a timeZone.

If your format is not unix and does not include a time zone (“Z” or “z”), then you should provide a timeZone to use when indexing or querying dates and times. Note that when you index a date such as “2013-01-01”, the time of 00:00 is implicit. Because of this, timeZone is relevant even if you are only indexing dates.

The time zone UTC is used if you do not specify a timeZone.

Some example time zones are UTC, GMT+1, America/New_York, US/Eastern, US/Pacific.

Documentation about time zone parsing is available online from Oracle at http://docs.oracle.com/javase/6/docs/api/java/util/TimeZone.html. The exact list of valid time zones depends upon the time zone data version installed with your Java environment. More information is available online from Oracle at http://www.oracle.com/technetwork/java/javase/tzdata-versions-138805.html.
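
The format and timeZone rules above could be combined as in the following sketch. The dimension ids are hypothetical, and the format strings use only letters from the table above, with no spaces.

```xml
<!-- Unix date/time values: no format string to compose -->
<dimension id="created_at" type="time" format="unix"/>

<!-- Date-only values: 00:00 is implied, so timeZone determines the instant that is indexed -->
<dimension id="birth_date" type="time" format="yyyy-MM-dd" timeZone="America/New_York"/>

<!-- The data carries an RFC 822 zone (Z), so the zone comes from the value itself -->
<dimension id="updated_at" type="time" format="yyyy-MM-dd'T'HH:mm:ssZ"/>
```

In the last case a value such as 2013-01-01T08:30:00-0800 would match the format; the literal T is quoted because it is a string rather than a format letter.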

Geoloc Type

A dimension whose value consists of a geographical location specified using longitude and latitude, zipcode, or both.

latitude = "<property name>"

This attribute will determine which property contains the latitude double component.

Default: latitude

longitude = "<property name>"

This attribute will determine which property contains the longitude double component.

Default: longitude

Note: To index multiple latitude/longitude coordinates, prepare a parallel array for each of the latitude and longitude values. The first latitude will be paired with the first longitude, and so forth.

zipcode = "<property name>"

This attribute will determine which property contains the US 5-digit zipcode used to geocode the item.

If latitude and longitude are provided, then zipcode will be ignored.

Default: zipcode

<dimension id="user.geoloc" type="geoloc" longitude="user.longitude"
    latitude="user.latitude" zipcode="user.zipcode" />

In the above example, changesets would contain, for each item, properties named “user.longitude”, “user.latitude”, and “user.zipcode”. The property names default to “latitude”, “longitude”, and “zipcode” if left unspecified. The longitude and latitude values in the changeset must be specified in degrees.
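
For comparison, here is a minimal sketch that relies on the default property names, followed by one that overrides only the zipcode property. The dimension ids and the property name user.home_zip are hypothetical.

```xml
<!-- Reads the changeset properties named latitude, longitude and zipcode (the defaults) -->
<dimension id="store.geoloc" type="geoloc"/>

<!-- Overrides only the zipcode property; latitude and longitude keep their default names -->
<dimension id="user.home" type="geoloc" zipcode="user.home_zip"/>
```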

Text Type

The engine supports two kinds of dimensions for searching string data: keyword and text. A text dimension can be searched using full free text searching capabilities. A keyword dimension, on the other hand, can only be used for case-sensitive exact string matches, but avoids the overhead incurred in a full text search.

For example, a changeset property which has a known set of constant values such as US state abbreviations (‘NY’, ‘MA’, ‘CA’, ‘AK’, etc.) would fit nicely into a keyword dimension, whereas a paragraph containing text written by a user about herself would fit into a text dimension.

Text dimension searches can benefit from the engine’s highlighting feature. You can request that the engine highlight all matches against your search text and present those matches in a clear fashion. For more information on how to specify highlighting in the query request, refer to Overview and Highlighting Criterion.

<dimension id="user.state" type="keyword" />
<dimension id="user.aboutme" type="text" />

A text dimension can optionally index multiple changeset properties using the key attribute. This dimension, “freetext,” indexes both the user.aboutme and user.lookingfor changeset properties.

<dimension id="freetext" type="text" key="user.aboutme,user.lookingfor"/>

In addition to holding single constant values, keyword dimensions can also be applied to data with multiple values. In this case, an optional delimiters attribute can be used to determine how the underlying data gets tokenized. In the following example, the data from the changeset is tokenized by splitting the text on ‘,’.

<dimension id="tags" type="keyword" delimiters=","/>

For more information about text dimensions in general, including features related to internationalization, refer to About Text Dimensions.

The following attributes will impact how the text in your dimension is indexed.

noAnalysis-ref= "<id of word set>"

If certain words should be excluded from the analysis process, then a customer can create a list of words to be excluded. For more information on creating a word set, refer to Defining Word Sets. For more information on the text analysis process, refer to Analysis.

New in version 2.8.6.

stemming= "true" |  "false"

Stemming is enabled by default for all text dimensions. Stemming is the process of reducing related words to a common root so that full text searches return more matches. For example, “dog”, “dogs”, “doggy” might all be converted to “dog” when stemming is enabled. For more exact matching, set stemming to “false”. For more information, refer to Stemming.

stemmingExclusion-ref= "<id of word set>"

If stemming is enabled, then a customer can define a list of words that should be excluded from stemming. For more information on creating a word set for stemming exclusion, refer to Defining Word Sets.

New in version 2.8.6.
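
A minimal sketch of the two stemming options follows. The word set id brand_names is hypothetical and is assumed to be defined elsewhere as described in Defining Word Sets.

```xml
<!-- Exact matching: stemming turned off for this dimension -->
<dimension id="user.aboutme" type="text" stemming="false"/>

<!-- Stemming left on, but words in the (hypothetical) brand_names word set are not stemmed -->
<dimension id="user.lookingfor" type="text" stemming="true" stemmingExclusion-ref="brand_names"/>
```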

accentFolding = "true" | "false"

Accent folding is enabled by default for all text dimensions and can also be enabled for keyword dimensions. It applies only to text that contains accented Unicode characters, and it can help improve matches.

For example, when accent folding is enabled, a property with value “Montréal” can be found using a criteria value of “Montreal”. Conversely, a criteria value of “Töt” could find a property value of “Tot”.

For most English language users, accent folding may not be required. For more exact matching, set accentFolding to false.

The accent folding features support composed Unicode characters.
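
As a sketch of both cases mentioned above, the first dimension below turns accent folding on for a keyword dimension and the second turns it off for a text dimension. The dimension id user.city is hypothetical.

```xml
<!-- Keyword dimension with folding on: a criteria value of "Montreal" can find "Montréal" -->
<dimension id="user.city" type="keyword" accentFolding="true"/>

<!-- Text dimension with folding off for more exact matching -->
<dimension id="user.aboutme" type="text" accentFolding="false"/>
```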

normalizeFullWidthChars= "true" |  "false"

If the text to be indexed includes full-width Chinese equivalents for ASCII symbols, letters and numbers, these characters can be automatically translated into ASCII (single byte UTF-8) characters.

Default: false.

stopWords = "<Comma-delimited Stop Words>"
stopWords-ref = "<id of word set>"

Stop words are commonly occurring words such as pronouns, articles and conjunctions that may serve to muddy or bloat a full-text search index. When indexing large blocks of text, including stop words may help improve text matching, particularly for phrase matching.

To not use any stop words, specify stopWords= "".

For more information on stop words, refer to Stop Words and for more information on creating a word set for stop words, refer to Defining Word Sets.
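
The three ways of specifying stop words could look like the following sketch. The word list, the dimension id article.body and the word set id english_stop_words are hypothetical.

```xml
<!-- Inline, comma-delimited list of stop words -->
<dimension id="user.aboutme" type="text" stopWords="a,an,the,and,or"/>

<!-- Empty value: no stop words are used -->
<dimension id="user.lookingfor" type="text" stopWords=""/>

<!-- Reference to a (hypothetical) word set defined elsewhere -->
<dimension id="article.body" type="text" stopWords-ref="english_stop_words"/>
```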

fieldPositionIncrementGap = "<position increment gap>"

When the indexer combines changeset properties listed in the dimension’s key attribute, it separates the text in each property in an attempt to avoid proximity phrase matches on words at the end of one changeset property with words at the beginning of the next property.

There may be cases in which you do want proximity phrase matching to cross changeset properties. For example, if a text dimension includes both a city and a state property and fieldPositionIncrementGap is set to 0, then you could do a proximity search on “Salem, Oregon”.

Default: 100.

New in version 2.8.6.
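
The city and state case described above might be configured as in this sketch. The dimension id and the property names user.city and user.state are hypothetical.

```xml
<!-- Gap of 0: phrase and proximity matches may span the end of user.city and the start of user.state -->
<dimension id="location" type="text" key="user.city,user.state" fieldPositionIncrementGap="0"/>
```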

synonyms-ref = "<id of thesaurus to use for synonym dictionary>"

This attribute will determine the optional synonym dictionary to use when indexing text.

For more information on creating a synonym dictionary, refer to Defining Synonyms/Thesauruses.

New in version 2.9.

didYouMean = "true" | "false"

Set this attribute to “true” to enable the Did You Mean? query suggestion feature. When enabled, the query response will include a didYouMean field that may include suggested queries in response to what the user searched for.

Default: false.

New in version 2.9.

didYouMeanDictionary = "<Comma-delimited Dictionary Words>"
didYouMeanDictionary-ref = "<id of word set to use for dictionary>"

If didYouMean has been enabled, this attribute will determine the optional spelling correction dictionary to use as part of the Did You Mean? suggestion process.

For more information on creating a word set for a Did You Mean? dictionary, refer to Defining Word Sets.

New in version 2.9.
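
A minimal sketch that enables suggestions and points them at a dictionary word set. The word set id spelling_words is hypothetical and is assumed to be defined as described in Defining Word Sets.

```xml
<!-- Did You Mean? enabled, with an optional spelling dictionary supplied by reference -->
<dimension id="freetext" type="text" key="user.aboutme,user.lookingfor"
    didYouMean="true" didYouMeanDictionary-ref="spelling_words"/>
```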

makeParts = "words" | "numbers" | "both" | "none"

This attribute applies when the word delimiter tokenizer is used. It determines which parts of words are generated from compound words. To generate word parts from compound or delimited words, use words. To generate number parts out of compound words, use numbers. To generate both word and number parts, use both.

Default: words.

concatParts = "words" | "numbers" | "both" | "none"

This attribute applies when the word delimiter tokenizer is used. It determines which parts of words are concatenated to form new words. For example, the product number XMY-MD-89-9001 has four parts: two words and two numbers. Using the value words will create XMYMD, and using the value numbers will generate 899001.

Use attribute concatenateAll to concatenate both number and word parts into a single word.

Default: words.

New in version 2.9.

concatenateAll = "true" | "false"

This attribute applies when the word delimiter tokenizer is used. It determines whether all word and number parts are concatenated into a single word. For example, the product number XMY-MD-89-9001 has two word parts and two number parts. With concatenateAll enabled, the indexer will create XMYMD899001.

Default: true.

New in version 2.9.

splitParts = "case" | "numbers" | "both" | "none"

This attribute applies when the word delimiter tokenizer is used. Use it to enable or disable splitting words at case changes or letter-number transitions. With case transitions, McDonald becomes Mc and Donald.

Default: case.

New in version 2.9.

stemEnglishPossessive = "true" | "false"

This attribute applies when the word delimiter tokenizer is used. When it is enabled, O'Neil's becomes O'Neil.

Default: true.

New in version 2.9.
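
The word delimiter attributes above can be set together on one dimension. The following is only a sketch: the dimension id product.number is hypothetical, and the comments restate the behavior described in the attribute sections above rather than adding anything new.

```xml
<dimension id="product.number" type="text"
    tokenizer="wordDelimiter"
    makeParts="both"
    concatParts="words"
    concatenateAll="true"
    splitParts="both"
    stemEnglishPossessive="true"/>
<!-- makeParts="both": generate both word parts and number parts from compound words -->
<!-- concatParts="words": for XMY-MD-89-9001, concatenate the word parts into XMYMD -->
<!-- concatenateAll="true": also create XMYMD899001 from all parts -->
<!-- splitParts="both": split at case changes and at letter-number transitions -->
<!-- stemEnglishPossessive="true": O'Neil's becomes O'Neil -->
```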

stripHtml = "true" | "false"

If the text for the dimension includes HTML tags, this attribute will remove HTML tags before indexing. Character entities are also converted to UTF-8.

Default: false

New in version 2.7.2.
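
For a property that holds HTML markup, the attribute could be set as in this sketch. The dimension id article.body is hypothetical.

```xml
<!-- HTML tags are removed and character entities converted before indexing -->
<dimension id="article.body" type="text" stripHtml="true"/>
```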

phoneticAlgorithm = "soundex" | "refinedSoundex" | "metaphone" | "doubleMetaphone"

This attribute is used to apply a phonetic algorithm during the analysis phase of indexing. For more information about phonetic algorithms, refer to Phonetic Analysis.

New in version 2.8.3.
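
A minimal sketch that selects one of the listed algorithms. The dimension id user.lastname is hypothetical.

```xml
<!-- Applies the doubleMetaphone phonetic algorithm during analysis -->
<dimension id="user.lastname" type="text" phoneticAlgorithm="doubleMetaphone"/>
```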

ignoreFieldLength = "true" | "false"

This attribute adjusts the engine's internal relevance calculation. Set it to true if the length of the field should be ignored at query time.

Default: false.

New in version 2.8.2.

ignoreInverseDocumentFrequency = "true" | "false"

This attribute is used to define the default value for query relevance adjustments. Its presence does not change index behavior. If the frequency of terms (in the universe of values for the dimension) is low, then the engine will automatically increase relevance in proportion to the infrequency. This value can be overridden as part of a query criterion.

Default: false for releases earlier than 3.7; true for releases 3.7 and later.

New in version 2.8.2.

tokenizer = "wordDelimiter" | "standard" | "whitespace"

Determines which tokenizer to use when analyzing the text. For more information, refer to Tokenization.

Default: wordDelimiter

New in version 2.8.6.

A text dimension also allows changeset properties to be assigned to specific fields. At query time, the fields can be individually searched and their contributing relevancies adjusted on a field-by-field level.

<dimension id="document" type="text">
    <field id="title" key="customer_name"/>
    <field id="body" key="customer_bio"/>
    <field id="keywords" key="customer_preferences,customer_dislikes"/>
</dimension>

The field element is functionally similar to a text dimension element. All of the attributes that apply to a text dimension element can be used. Any attributes appearing on the parent dimension element automatically apply to all field elements, unless overridden on the individual field element itself.
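
The inheritance rule can be sketched as follows, reusing the field ids and keys from the example above. The attribute values are chosen purely for illustration.

```xml
<dimension id="document" type="text" stemming="true" accentFolding="true">
    <!-- Inherits stemming and accentFolding from the parent dimension -->
    <field id="body" key="customer_bio"/>
    <!-- Overrides the inherited stemming setting for this field only -->
    <field id="title" key="customer_name" stemming="false"/>
</dimension>
```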

Group By Type

A dimension whose sole purpose is to be used in a grouped query. The id or values of the indexed data are used to group the results of a query. In most respects, a groupBy dimension is similar to a keyword type dimension.

To use a groupBy dimension, create a dimension on the values you want to group by. When the engine executes the query, it will group the results by the values in the associated groupBy dimension.

The query response can include property data for the grouped-by value as well as property data for the best-matching items within each groupBy group.

A groupBy type dimension can index one and only one value.

NOTE: groupBy dimensions and queries are only supported in a single-server configuration. Multi-server configurations are not currently supported.

GroupBy is particularly useful for real estate solutions for New Home or Rental Communities, in which a customer searches for a particular floor plan or model but the results are shown grouped by Community, Complex or other Development.

<dimension id="floorplans" type="groupBy" key="floorplan_id"/>

indexesItemId = "true"

Indicates that the groupBy changeset key is an item id. When this attribute is enabled, the query result will include changeset properties for the groupBy value since it has been identified as an item id.

Default: false

legacyGroupBy = "true"

Indicates that the query response for this groupBy dimension should use the format from before version 3.0. This attribute exists for customers who were using groupBy prior to version 3.0 and wish to upgrade their engine without being required to upgrade their query processing logic.

Default: false
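
As a sketch of the two groupBy attributes, the first dimension below groups by a key that is itself an item id, and the second keeps the pre-3.0 response format. The dimension id communities and the key community_id are hypothetical.

```xml
<!-- The groupBy key is an item id, so the query result can include that item's changeset properties -->
<dimension id="communities" type="groupBy" key="community_id" indexesItemId="true"/>

<!-- Query responses for this dimension use the format from before version 3.0 -->
<dimension id="floorplans" type="groupBy" key="floorplan_id" legacyGroupBy="true"/>
```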