Web Services¶

The Discovery Search Engine supports a variety of web service entry points.

Changesets¶

The Changesets web service supports listing stored changesets, posting new changesets, and exporting existing ones.

Listing Stored Changesets¶

List stored changesets. URL Parameters: full - returns extended history past the last snapshot.

http://example.com:8090/ws/changeset [GET or HEAD]

Example Request¶

GET /ws/changeset HTTP/1.1

Example Response¶

HTTP/1.1 200 OK
Content-Type: text/xml; charset=utf-8

<changesets total="1">
    <changeset id="a873e97062faebe9a4eb4402e6952188" snapshot="true"
        date="Wed, 13 Oct 2010 20:36:31 GMT" md5="ed37baa943969c6fadabed9806f58fb9"
        href="http://localhost:8090/ws/changeset/a873e97062faebe9a4eb4402e6952188"
        length="67291575" rawLength="14785239" />
</changesets>
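
A client will usually want the listing as structured data. The following is a minimal Python sketch that parses the response shown above with the standard library. The function name parse_changeset_listing is illustrative only, and the sketch assumes every changeset element carries the attributes seen in the example.

```python
import xml.etree.ElementTree as ET


def parse_changeset_listing(xml_text):
    """Return one dict per <changeset> element in a listing response."""
    root = ET.fromstring(xml_text)
    changesets = []
    for element in root.findall("changeset"):
        changesets.append({
            "id": element.get("id"),
            # Assumption: a missing or non-"true" attribute means "not a snapshot".
            "snapshot": element.get("snapshot") == "true",
            "date": element.get("date"),
            "md5": element.get("md5"),
            "href": element.get("href"),
            "length": int(element.get("length")),
            "raw_length": int(element.get("rawLength")),
        })
    return changesets


# The example response from this section.
SAMPLE = """<changesets total="1">
    <changeset id="a873e97062faebe9a4eb4402e6952188" snapshot="true"
        date="Wed, 13 Oct 2010 20:36:31 GMT" md5="ed37baa943969c6fadabed9806f58fb9"
        href="http://localhost:8090/ws/changeset/a873e97062faebe9a4eb4402e6952188"
        length="67291575" rawLength="14785239" />
</changesets>"""

for changeset in parse_changeset_listing(SAMPLE):
    print(changeset["id"], changeset["length"])
```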

Exporting a Stored Changeset¶

Retrieve the contents of a changeset by identifier id.

http://example.com:8090/ws/changeset/[id] [GET or HEAD]

Parameters¶

id - (Required.) Identifier of changeset to retrieve

Example Request¶

GET /ws/changeset/123 HTTP/1.1

Example Response¶

HTTP/1.1 200 OK
Content-Type: text/xml

[Changeset follows in response body]

Apply/Store Changeset¶

Apply and store a new changeset. The input is a changeset. In the response, the Location header is set to the URL of the changeset, and the body lists the changeset identifier.

http://example.com:8090/ws/changeset [POST]

Query Parameters¶

type - Specifies the type of changeset to post: one of delta (default), snapshot, full, bulk, or reset.

Example Request¶

POST /ws/changeset HTTP/1.1
Content-Type: text/xml

[Changeset follows in request body]

Example Response¶

HTTP/1.1 201 Created
Content-Type: text/plain

3b9897903f544040ba40e48b82d01310

Status Codes¶

201 Changeset was created
204 Changeset was empty (as of version 2.8.5)
400 Changeset was invalid
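
As a rough illustration, posting a changeset from Python could look like the sketch below. The helper names build_changeset_post and describe_status are hypothetical, and the "<changeset/>" body is only a placeholder, not a valid changeset. The request is built but not sent.

```python
import urllib.parse
import urllib.request

# Meanings copied from the status code list in this section.
STATUS_MEANINGS = {
    201: "Changeset was created",
    204: "Changeset was empty",
    400: "Changeset was invalid",
}


def build_changeset_post(base_url, changeset_xml, changeset_type="delta"):
    """Build (but do not send) a POST request that stores a changeset."""
    query = urllib.parse.urlencode({"type": changeset_type})
    return urllib.request.Request(
        url=f"{base_url}/ws/changeset?{query}",
        data=changeset_xml.encode("utf-8"),
        headers={"Content-Type": "text/xml"},
        method="POST",
    )


def describe_status(status_code):
    """Map a response status code to its documented meaning."""
    return STATUS_MEANINGS.get(status_code, "Undocumented status code")


# Placeholder body; a real changeset document goes here.
request = build_changeset_post("http://example.com:8090", "<changeset/>", "snapshot")
print(request.get_method(), request.full_url)
```

To send the request, pass it to urllib.request.urlopen. Note that urlopen raises urllib.error.HTTPError for a 400 response, so the status code of a rejected changeset is read from the exception rather than from the response object.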

Dimensions¶

The Dimensions web service supports importing a new dimensions document or exporting the existing one. The input is a dimensions document when importing.

http://example.com:8090/ws/dimensions [GET or POST]

Export the Dimensions Document¶

Retrieve the currently defined dimensions. The output is a dimension definition document.

http://example.com:8090/ws/dimensions [GET]

Example Usage¶

curl http://example.com:8090/ws/dimensions

Post a new Dimensions Document¶

Set the currently defined dimensions. The input is a dimensions specification document. The content type is text/xml.

http://example.com:8090/ws/dimensions [POST]

Example Usage¶

curl --header 'Content-Type: text/xml' --data-binary @dimensions.xml \
    http://example.com:8090/ws/dimensions

Item¶

Fetches changeset property data for the requested item.

http://example.com:8090/ws/item/[id] [GET]

Parameters¶

id - (Required.) Identifier of item to retrieve

Example Request¶

GET /ws/item/123 HTTP/1.1

Example Response¶

HTTP/1.1 200 OK
Content-Type: application/json

{
    "_id": "123",
    "genres": "Science Fiction,Action,Adventure",
    "release_year": "2009",
    "MPAA_rating": "PG-13",
    "duration": "127",
    "title": "Star Trek",
    "synopsis": "Boldly going where no one has gone before."
}

Items¶

Fetches changeset property data for one or more items, returning the results in a JSON array in the order in which the items were specified in the request.

If any of the requested items are not found, the response will indicate which items are missing.

New in version 2.8.3.

http://example.com:8090/ws/items [GET or POST]

Parameters (GET)¶

items - (Required.) Delimited list of identifiers of items to retrieve.
delimiter - The delimiter used to separate the ids in the request. Default is a comma (,).
properties - Changeset property ids (names) for which data should be returned. Default is all properties.

Example Request (GET)¶

GET /ws/items?items=123,missing_id&delimiter=, HTTP/1.1

Example Request (POST)¶

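The POST method takes a JSON object as its input. The input defines the identifiers and (optionally) the properties to return. In the example below, request.json is a placeholder name for a file that contains the request object.

curl --header 'Content-Type: application/json' --data-binary @request.json \
    http://example.com:8090/ws/items

The format of the JSON request object is: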

{
    "ids": [
            "123",
            "missing_id"
           ],
    "properties":
           [
             "_id",
             "genres",
             "release_year",
             "MPAA_rating",
             "duration",
             "title",
             "synopsis"
           ]
}

Example Response (POST and GET)¶

If a requested id is not found, its entry in the response contains the key “_exists” with the value false.

HTTP/1.1 200 OK
Content-Type: application/json

[
    {
        "_id": "123",
        "genres": "Science Fiction,Action,Adventure",
        "release_year": "2009",
        "MPAA_rating": "PG-13",
        "duration": "127",
        "title": "Star Trek",
        "synopsis": "Boldly going where no one has gone before."
    },
    {
        "_id": "missing_id",
        "_exists": false
    }
]
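
The request and response formats above can be handled with a few lines of Python. This is a minimal sketch, not a client library: build_items_request and split_found_and_missing are hypothetical helper names, and the sample response is a shortened version of the one shown above.

```python
import json


def build_items_request(ids, properties=None):
    """Serialize the JSON body for a POST to /ws/items."""
    payload = {"ids": list(ids)}
    if properties is not None:
        payload["properties"] = list(properties)
    return json.dumps(payload)


def split_found_and_missing(response_text):
    """Separate returned items from ids flagged with "_exists": false."""
    found, missing = [], []
    for item in json.loads(response_text):
        if item.get("_exists") is False:
            missing.append(item["_id"])
        else:
            found.append(item)
    return found, missing


# Shortened version of the example response in this section.
SAMPLE_RESPONSE = """[
    {"_id": "123", "title": "Star Trek", "release_year": "2009"},
    {"_id": "missing_id", "_exists": false}
]"""

found, missing = split_found_and_missing(SAMPLE_RESPONSE)
print([item["_id"] for item in found], missing)
```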

Export¶

Fetches a combination of item metadata, properties, and indexed values for either all items or those that match the search criteria.

This differs from the query API because it is designed to stream out large amounts of data. Because of this, it does not support pagination, sortBy, or groupBy, and the items are streamed in an undefined order that will vary between calls.

http://example.com:8090/ws/export [GET or POST]

Request¶

The POST endpoint expects a payload of type application/json.

A GET is equivalent to a POST with a request of:

{
  "sources": true
}

The payload can have the following fields:

scores

Whether to include scores in the metadata. Boolean that defaults to false.

sources

Whether to include sources (item properties) in the response. Boolean that defaults to false.

criteria

Optional search criteria that follow the same rules as the criteria option of a query request. Defaults to no value, in which case all items are returned.

exactMatchesOnly

Whether to include just exact matches from the query. Boolean that defaults to false.

When true, you will get just exact matches and their metadata will not include an exact field.
When false, you will get exact and fuzzy matches and their metadata will include an exact field.

values

Optional list of dimension ids whose indexed values should be included in the response. An array of strings that defaults to no value.

Response¶

The response is a stream of JSON documents, each terminated with a newline, followed by a final newline at the end of the stream. The declared content type is application/x-ndjson.

{header}\n
{a}\n
{b}\n
{c}\n
{d}\n
\n

The first object in the stream is a header object that describes the layout of the rest of the stream. Subsequent objects come in groups of one or more that represent a single item.

The format attribute of the header object declares the contents of all the groups in the stream. It is an array of strings which describe the contents and order of each object in a group. Valid values are any of meta, values, and sources.

Valid object formats are:

meta: This is always present and contains metadata about the item. Valid attributes of the metadata object are id, score, and exact.
values: This is only present when the request contains values. The dimension object contains the indexed values of the item keyed by the dimension id.
sources: This is only present when the request has sources set to true. The source object is the original JSON document that was indexed.

More simply put, each group starts with a metadata object and then optionally follows with dimension and/or source objects.

Example Request for sources¶

{
    "criteria": [
        {"dimension": "example", "value": "example"}
    ],
    "exactMatchesOnly": true,
    "sources": true
}

Example Response for sources¶

{"format":["meta","sources"]}\n
{"id":"1001"}\n
{"color":"red"}\n
{"id":"1002"}\n
{"color":"green"}\n
\n

Example Request for indexed values¶

{
    "criteria": [
        {"dimension": "example", "value": "example"}
    ],
    "exactMatchesOnly": true,
    "values": ["location"]
}

Example Response for indexed values¶

{"format":["meta","values"]}\n
{"id":"1001"}\n
{"location":[{"latitude":40.1,"longitude":-72.3}]}\n
{"id":"1002"}\n
{"location":[{"latitude":41.0,"longitude":-73.0}]}\n
\n
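
Reading the stream amounts to parsing the header and then collecting objects into groups of the size the header declares. The sketch below shows one way to do that in Python. The function name parse_export_stream is illustrative, and the sketch assumes every group contains exactly one object per entry of the format array, in that order.

```python
import json


def parse_export_stream(lines):
    """Yield one dict per item, keyed by the names in the header's format array."""
    iterator = iter(lines)
    header = json.loads(next(iterator))
    layout = header["format"]  # e.g. ["meta", "values"]
    group = []
    for line in iterator:
        line = line.strip()
        if not line:
            continue  # the stream ends with an empty line
        group.append(json.loads(line))
        if len(group) == len(layout):
            yield dict(zip(layout, group))
            group = []


# The example response for indexed values, one list entry per line.
SAMPLE_STREAM = [
    '{"format":["meta","values"]}',
    '{"id":"1001"}',
    '{"location":[{"latitude":40.1,"longitude":-72.3}]}',
    '{"id":"1002"}',
    '{"location":[{"latitude":41.0,"longitude":-73.0}]}',
    '',
]

for item in parse_export_stream(SAMPLE_STREAM):
    print(item["meta"]["id"], item["values"]["location"])
```

Because the function is a generator, it can consume a response line by line without holding the whole export in memory, which matches the streaming intent of this web service.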

DidYouMean¶

Fetches suggested query values for the requested query string. Did You Mean? suggestions can also be requested via a search criterion.

See: DidYouMean Criterion for a description of the Did You Mean? request API, including additional information, data types, defaults, and valid values.

http://example.com:8090/ws/didyoumean/[dimension] [GET or POST]

Parameters¶

query - (Required.) Query string to process for suggestions
maxSuggestions - Maximum number of suggestions to return. integer
distanceAlgorithm - Word distance algorithm to use. string
morePopular - Uses indexed term relevance to identify best suggestions. boolean
highlighting - Enables or disables highlighting of the suggested queries.
highlighting.preTemplate - If highlighting is enabled, specifies the string to place before the replaced query term.
highlighting.postTemplate - If highlighting is enabled, specifies the string to place after the replaced query term.
escapeHtml - Sanitizes any HTML in the original query and suggestions. boolean

Example Request¶

GET /ws/didyoumean/freetext?query=st+marys&distanceAlgorithm=levenshtein
    &highlighting.preTemplate=%3Ci%3E&highlighting.postTemplate=%3C/i%3E HTTP/1.1

Example Response¶

See: DidYouMean for a description of the Did You Mean? response JSON format.

HTTP/1.1 200 OK
Content-Type: application/json

{
  "tokenCount": 2,
  "uncertainCount": 1,
  "query": {
    "value": "st marys",
    "label": "<i>st</i> <i>marys</i>"
  },
  "suggestions": [
    {
      "value": "st mary",
      "label": "st <i>mary</i>"
    },
    {
      "value": "st maria",
      "label": "st <i>maria</i>"
    },
    {
      "value": "st mark",
      "label": "st <i>mark</i>"
    },
    {
      "value": "st mar",
      "label": "st <i>mar</i>"
    }
  ]
}
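
The query string in the example request needs URL encoding, which is easy to get wrong by hand. As a sketch, the URL can be built with Python's standard library as follows. The function name build_didyoumean_url is hypothetical; the parameter names are the ones listed above.

```python
import urllib.parse


def build_didyoumean_url(base_url, dimension, query, options=None):
    """Build a GET URL for /ws/didyoumean/[dimension]."""
    params = {"query": query}
    params.update(options or {})
    path = urllib.parse.quote(dimension, safe="")
    return f"{base_url}/ws/didyoumean/{path}?{urllib.parse.urlencode(params)}"


url = build_didyoumean_url(
    "http://example.com:8090",
    "freetext",
    "st marys",
    {
        "distanceAlgorithm": "levenshtein",
        "highlighting.preTemplate": "<i>",
        "highlighting.postTemplate": "</i>",
    },
)
print(url)
```

urlencode encodes the space in the query as + and percent-encodes the angle brackets and the slash in the templates.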

Metrics¶

The Discovery Search Engine exposes Prometheus metrics on /metrics.

http://example.com:8090/metrics [GET]

There are various types of exposed metrics, some standard ones provided by the Prometheus client library and others custom to the engine.

Standard Metrics¶

Standard¶

name	type	labels
process_cpu_seconds_total	counter
process_start_time_seconds	gauge
process_open_fds	gauge
process_max_fds	gauge
process_virtual_memory_bytes*	gauge
process_resident_memory_bytes*	gauge

* only available when running on Linux

Memory Pools¶

name	type	labels
jvm_memory_bytes_used	gauge	area
jvm_memory_bytes_committed	gauge	area
jvm_memory_bytes_max	gauge	area
jvm_memory_pool_bytes_used	gauge	pool
jvm_memory_pool_bytes_committed	gauge	pool
jvm_memory_pool_bytes_max	gauge	pool

Labels:

area: one of heap or nonheap
pool: always default

Garbage Collector¶

name	type	labels
jvm_gc_collection_seconds	summary	gc

Labels:

gc: name of the garbage collector, e.g. PS1

Thread¶

name	type	labels
jvm_threads_current	gauge
jvm_threads_daemon	gauge
jvm_threads_peak	gauge
jvm_threads_started_total	counter
jvm_threads_deadlocked	gauge
jvm_threads_deadlocked_monitor	gauge

Class Loading¶

name	type	labels
jvm_classes_loaded	gauge
jvm_classes_loaded_total	counter
jvm_classes_unloaded_total	counter

Version Info¶

name	type	labels
jvm_info	gauge*	version, vendor

* although specified as a gauge, the value is always 1.

Labels:

version: The JVM version from the system property java.runtime.version
vendor: The JVM vendor from the system property java.vm.vendor

Jetty Statistics¶

name	type	labels
jetty_requests_total	counter
jetty_requests_active	gauge
jetty_requests_active_max	gauge
jetty_request_time_max_seconds	gauge
jetty_request_time_seconds_total	counter
jetty_dispatched_total	counter
jetty_dispatched_active	gauge
jetty_dispatched_active_max	gauge
jetty_dispatched_time_max	gauge
jetty_dispatched_time_seconds_total	counter
jetty_async_requests_total	counter
jetty_async_requests_waiting	gauge
jetty_async_requests_waiting_max	gauge
jetty_async_dispatches_total	counter
jetty_expires_total	counter
jetty_responses_total	counter	code
jetty_stats_seconds	gauge
jetty_responses_bytes_total	counter
jetty_queued_thread_pool_threads	gauge
jetty_queued_thread_pool_threads_idle	gauge
jetty_queued_thread_pool_jobs	gauge

Labels:

code: The HTTP response code, one of 1xx, 2xx, 3xx, 4xx, or 5xx

Custom Metrics¶

Web Application¶

These custom metrics expose traffic broken down by path.

name	type	labels
webapp_requests_active	gauge	method, path
webapp_requests_total	counter	method, path
webapp_request_bytes_total	counter	method, path
webapp_response_bytes_total	counter	method, path, code
webapp_response_seconds_total	histogram	method, path, code

Labels:

method: HTTP method, one of GET, HEAD, POST, PUT, PATCH, DELETE, OPTIONS, or TRACE
path: First two components of the path, e.g. /ws/query
code: HTTP response code, one of 1xx, 2xx, 3xx, 4xx, or 5xx

Periodically Updated¶

These metrics are refreshed every 30 seconds.

name	type	labels
discovery_dimensions_count	gauge
discovery_items_count	gauge
discovery_index_bytes	gauge

Instantly Updated¶

These metrics are always up to date.

name	type	labels
discovery_applied_changeset_action_total	counter	action
discovery_changeset_bytes_total	counter	type
discovery_checkpoint_seconds_total	counter

Labels:

action: How the data was changed during changeset application, one of create, update, or delete
type: The type of changeset ingested, one of reset, delta, snapshot, checkpoint, full, or bulk
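
For quick checks without a Prometheus server, the text returned by /metrics can be read directly. The following is a deliberately simple Python sketch, not a full parser for the Prometheus exposition format: it assumes samples carry no timestamps and that label values contain no commas or escaped quotes. The function name parse_metrics and the sample values are made up for illustration.

```python
def parse_metrics(text):
    """Parse Prometheus text exposition lines into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, value = line.rsplit(" ", 1)
        name, _, label_text = series.partition("{")
        labels = {}
        for pair in label_text.rstrip("}").split(","):
            if pair:
                key, _, raw = pair.partition("=")
                labels[key.strip()] = raw.strip().strip('"')
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical scrape output using metric names from the tables above.
SAMPLE = """# TYPE discovery_items_count gauge
discovery_items_count 2381.0
webapp_requests_total{method="GET",path="/ws/query"} 12.0
"""

for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels, value)
```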

Statistics¶

DEPRECATED Replaced by Metrics.

The Discovery Search Engine exposes a number of data points of interest through a web service at /ws/statistics. The exposed statistics are transient: all counters reset when the engine stops.

http://example.com:8090/ws/statistics [GET]

We provide a Munin plugin that can be used to generate graphs based on the exposed data.

The project page is hosted on GitHub at:

http://github.com/t11e/discovery_munin

You can download the latest source as a zip file from:

http://github.com/t11e/discovery_munin/releases

Example Usage¶

To see the available data points:

$ curl http://localhost:8090/ws/statistics
http
changeset
item
checkpoint
changeset.apply
index
query
json

To get the data itself, send a GET request to /ws/statistics/fetch/${datapoint}, where datapoint is a space-delimited list of names taken from the output of /ws/statistics. In a URL the space is encoded as +, as in the combined example below.

Get the HTTP statistics:

$ curl http://localhost:8090/ws/statistics/fetch/http
http.time.count: 2663
http.time.mean: 88.51032669921146
http.time.min: 0
http.time.max: 34629
http.time.variance: 480464.4911592846
http.time.stddev: 693.1554595898994
http.time.sum: 235703
http.uncaught.io: 0
http.uncaught.runtime: 0
http.uncaught.error: 0

Get the item statistics:

$ curl http://localhost:8090/ws/statistics/fetch/item
item.count: 2381
item.disk: 8518018

Get both the HTTP and item statistics:

$ curl http://localhost:8090/ws/statistics/fetch/item+http
item.count: 2381
item.disk: 8518018
http.time.count: 2668
http.time.mean: 88.34632683658175
http.time.min: 0
http.time.max: 34629
http.time.variance: 479578.0629898773
http.time.stddev: 692.515749272085
http.time.sum: 235708
http.uncaught.io: 0
http.uncaught.runtime: 0
http.uncaught.error: 0

The returned data expands as needed, so if an engine has only serviced one query it will look like this:

$ curl http://localhost:8090/ws/statistics/fetch/query
query.regular.count: 1
query.regular.size.mean: 22555
query.regular.size.sum: 22555
query.regular.time.mean: 91
query.regular.time.sum: 91

If it has serviced two queries then you’ll get more information:

$ curl http://localhost:8090/ws/statistics/fetch/query
query.regular.count: 2
query.regular.size.mean: 22556.0
query.regular.size.min: 22555
query.regular.size.max: 22557
query.regular.size.variance: 2.0
query.regular.size.stddev: 1.4142135623730951
query.regular.size.sum: 45112
query.regular.time.mean: 70.0
query.regular.time.min: 49
query.regular.time.max: 91
query.regular.time.variance: 882.0
query.regular.time.stddev: 29.698484809834994
query.regular.time.sum: 140
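
The fetch output is a plain list of "name: value" lines, so it is straightforward to load into a dictionary. The Python sketch below does that and recomputes the mean from the sum and count as a sanity check. The helper names are hypothetical, and the sample is an excerpt of the HTTP statistics shown earlier in this section.

```python
def parse_statistics(text):
    """Parse "name: value" lines from /ws/statistics/fetch into a dict of floats."""
    stats = {}
    for line in text.splitlines():
        name, separator, value = line.partition(":")
        if separator:
            stats[name.strip()] = float(value)
    return stats


def mean_request_time(stats):
    """Recompute http.time.mean from the sum and count fields."""
    return stats["http.time.sum"] / stats["http.time.count"]


# Excerpt of the HTTP statistics example in this section.
SAMPLE = """http.time.count: 2663
http.time.mean: 88.51032669921146
http.time.sum: 235703
http.uncaught.io: 0
"""

stats = parse_statistics(SAMPLE)
print(stats["http.time.count"], mean_request_time(stats))
```

Since fields such as min, max, and variance only appear once enough samples exist, callers should look them up with dict.get rather than assume they are present.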

Field details¶

The most interesting fields are described here.

Total time taken for all queries returning non empty results

query.regular.time.sum

Total time taken for queries returning empty results

query.empty.time.sum

Number of items in the current partition

index.items

Number of indexes

index.count

Number of items in the dataset

item.count

Size of the dataset on disk (db/items directory)

item.disk

Number of created changesets by type:

changeset.reset.size.count

changeset.delta.size.count

changeset.snapshot.size.count

changeset.bulk.size.count

changeset.checkpoint.size.count

Total uncompressed size:

changeset.reset.size.sum

changeset.delta.size.sum

changeset.snapshot.size.sum

changeset.bulk.size.sum

changeset.checkpoint.size.sum

Total compressed size:

changeset.reset.compressed.sum

changeset.delta.compressed.sum

changeset.snapshot.compressed.sum

changeset.bulk.compressed.sum

changeset.checkpoint.compressed.sum

Total number of applied changesets (those that are written to the DB file)

changeset.apply.count

Breakdown of item actions across the changeset applications:

changeset.apply.created.sum

changeset.apply.modified.sum

changeset.apply.deleted.sum

Total time taken generating checkpoints:

checkpoint.time.sum

Total number of checkpoints generated:

checkpoint.time.count

Total number of HTTP requests served:

http.time.count

Total time to serve them:

http.time.sum

Checkpoint¶

The checkpoint web service will create a checkpoint on demand. This web service performs an action and takes no parameters.

New in version 2.8.3.

http://example.com:8090/ws/checkpoint [POST]

Example Usage¶

To create a checkpoint:

$ curl -X POST http://example.com:8090/ws/checkpoint

Queryable¶

The Queryable web service can be used to determine whether the engine is in a state in which it can respond to queries, i.e. “queryable.”

The service is meant to be used by smart load-balancers (such as Amazon’s elastic load balancer), though it could be used by other applications interacting with the engine.

When the system is able to respond to queries, it will return status code 204 (No content). When it is not able to respond with query results, it will return status code 503 (Service Unavailable).

This web service is also available on the path /queryable for compatibility with older releases. Release 3.8 introduced the current path of /ws/queryable.

Optional URL Parameters¶

The following options can be specified as query parameters on the URL:

success - An integer status code value that determines the status code returned if the engine is “queryable.” For example, if using an Amazon elastic load balancer that does not support status code 204, success=200 might be an appropriate option.

Default: 204 (No Content)

error - An integer status code value that determines the status code returned if the engine is not “queryable”.

Default: 503 (Service Unavailable)

New in version 3.8.

retryAfter - Optional value for the response header Retry-After for when the engine is not “queryable”.

Default: 30

New in version 3.8.

http://example.com:8090/ws/queryable [GET]

Example Usage¶

To check if the engine can respond to queries:

$ curl --head http://example.com:8090/ws/queryable
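
An application that wants to perform the same check as a load balancer could do so along these lines. This is a sketch under stated assumptions: build_queryable_url and is_queryable are hypothetical helper names, the check uses a HEAD request like the curl example, and any connection failure is treated as "not queryable".

```python
import urllib.parse
import urllib.request


def build_queryable_url(base_url, success=None, error=None, retry_after=None):
    """Build the /ws/queryable URL, adding only the options that were given."""
    options = {"success": success, "error": error, "retryAfter": retry_after}
    given = {key: value for key, value in options.items() if value is not None}
    url = f"{base_url}/ws/queryable"
    return f"{url}?{urllib.parse.urlencode(given)}" if given else url


def is_queryable(url, expected_success=204, timeout=5.0):
    """Return True when the engine answers with the expected success code."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == expected_success
    except OSError:
        # HTTPError (e.g. 503), URLError, and timeouts are all OSError subclasses.
        return False


print(build_queryable_url("http://example.com:8090", success=200))
```

When the success option is overridden, pass the same value as expected_success so that the client and the engine agree on what counts as healthy.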

System State¶

http://example.com:8090/ws/info/system-state [GET]

Example Response¶

<system-state>
    <running>
        <true />
    </running>
    <queryable>
        <true />
    </queryable>
</system-state>
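
The response can be turned into a dictionary of booleans with a short Python sketch like the one below. The function name parse_system_state is illustrative. The example above only shows the true case, so the sketch assumes that a state without a <true /> child is false.

```python
import xml.etree.ElementTree as ET


def parse_system_state(xml_text):
    """Map each child of <system-state> to a boolean."""
    root = ET.fromstring(xml_text)
    # Assumption: any state lacking a <true /> child counts as False.
    return {flag.tag: flag.find("true") is not None for flag in root}


# The example response from this section.
SAMPLE = """<system-state>
    <running>
        <true />
    </running>
    <queryable>
        <true />
    </queryable>
</system-state>"""

print(parse_system_state(SAMPLE))
```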