Data Tool

The Discovery Data Tool provides an implementation of a Changeset Publisher as a stand-alone, cross-platform Java application.

It uses JDBC, and is thus compatible with most major databases.

This open source component is hosted on github at:

You can download the latest source as well as the built distribution from:


  1. Configure the Data Tool by editing discovery_datatool.xml

  2. Create a file.

    1. Optionally specify HTTP port, bind address, JVM memory, or other overrides.
    2. Optionally configure the Data Tool to use an SSL certificate.
  3. Run the Data Tool with the discovery_datatool unix shell script.

  4. Pull changesets or configure the engine to pull from the publisher at


    If you are using access control, then the URL would be


    Of course, if you are using SSL, then the URL would use the https scheme.

Optional URL Parameters

The following options can be specified as query parameters on the URL:

  • forceSnapshot - If set to “true” then a snapshot will be generated regardless of the profile’s last_run timestamp. This option is passed from the Discovery Search Engine if a user has checked the Restart feed from fresh snapshot now checkbox on the Feeds tab of the administrative web interface.
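
As an illustration, the parameter is appended to the publisher URL like any other query parameter. This Python sketch only builds such a URL; the publisher name "production" and profile name "profile" are hypothetical, and 8089 is the Data Tool's default HTTP port:

```python
from urllib.parse import urlencode

# Hypothetical publisher ("production") and profile ("profile") names;
# 8089 is the Data Tool's default HTTP port.
base = "http://localhost:8089/ws/publisher/production"
params = {"profile": "profile", "forceSnapshot": "true"}

url = f"{base}?{urlencode(params)}"
print(url)  # http://localhost:8089/ws/publisher/production?profile=profile&forceSnapshot=true
```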


Note that the XML schema for the configuration is versioned. Earlier versions of the configuration XML continue to be supported, however we recommend using the latest version of the XML schema in order to take full advantage of the newest features.

The configuration is defined by the XML Schema located in

The examples/v7 directory of the distribution contains sample XML configuration files for various popular databases. You can copy any of these to discovery_datatool.xml to get started. Examples for the earlier schema versions are in the examples/v* directories.

As the Discovery Datatool is changed, we strive to maintain backwards compatibility. One way that we do this is by introducing new XML schema versions as new features are added. The Data Tool continues to support earlier versions of the configuration XML by using the schema namespace to recognize how to parse the file.

In general, moving to the newer schema is trivial. There was, however, a major structural change in the XML schema between version 3 and version 4. To ease migration, utilities/configuration_converter_v3_to_v4.xslt can be used to migrate from XML schema version 3 to version 4. You should do this if you want to use any of the features added since version 3, such as subqueries, bulk/full feeds, or vertical table support.

Here is an example of how you would use this stylesheet with xsltproc and xmllint to convert a version 3 discovery_datatool.xml. This assumes that you have the command line tools xsltproc and xmllint installed:

$ xsltproc utilities/configuration_converter_v3_to_v4.xslt discovery_datatool.xml | xmllint --format - > discovery_datatool-4.xml

The configuration XML file contains four main sections: accessControl, dataSources, profiles and publishers. It contains all the SQL statements that are used, giving you complete control to optimize them as needed.

    <accessControl>
      <user name="engine1" password="engine1secret"/>
    </accessControl>




The accessControl element is optional. If it is present, HTTP basic authentication will be used to restrict access to the Data Tool URLs.

It accepts any number of child user elements. Each must have name and password attributes.


A DataSource describes how the Data Tool should connect to the database. You can use a javax.sql.DataSource or java.sql.Driver.

MySQL DataSource

<dataSource name="dataSource"

MySQL Driver

<driver name="dataSource"

SQL Server DataSource

<dataSource name="dataSource"
  <!-- 1: sqlserver, 2: sybase -->

SQL Server Driver

<driver name="dataSource"


Profiles describe how the state of changeset pulls is stored.

Profiles are expected to exist before use. The Discovery Data Tool does not automatically create them unless you provide the SQL for this in a createSql element.


This profile stores state in the database.

You would create the following table to support it.

create table changeset_profile (
  name varchar(20) not null unique primary key,
  last_run timestamp
);

And configure it like this.

<sqlProfile name="profile" dataSource="dataSource">
  <!-- createSql is optional. By including it we enable automatic profile creation. -->
  <createSql>
    insert into changeset_profile (name) values (:name)
  </createSql>
  <retrieveSql startColumn="startTime" endColumn="endTime">
    select last_run as startTime, CURRENT_TIMESTAMP as endTime
    from changeset_profile
    where name = :name
  </retrieveSql>
  <updateSql>
    update changeset_profile
    set last_run = :lastRun
    where name = :name
  </updateSql>
</sqlProfile>

The interaction with the SqlProfile is as follows.

Whenever a request for /ws/publisher/PUBLISHER_NAME?profile=PROFILE_NAME is made, the retrieveSql is executed. It must return start and end timestamps in the columns corresponding to retrieveSql/@startColumn and retrieveSql/@endColumn. The query must return either one row or no rows.

If no rows are returned and there is no createSql, then the changeset pull will fail with a NoSuchProfileException. If there is a createSql, then it is executed with the profile name bound as the “name” parameter. As you might expect, this query is responsible for creating the named changeset profile. After that query executes, then the retrieveSql is executed to obtain the start and end timestamps.

After the successful generation of the changeset, the updateSql is executed in order to update the named profile’s lastRun to the end timestamp that was used in generating the changeset. The updateSql is passed the profile name as the “name” parameter and the new last run timestamp as the “lastRun” parameter. The updateSql is expected to modify exactly one row. If it does not, then a RuntimeException is thrown.
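
The retrieve/create/update cycle described above can be sketched as follows. This is an illustrative Python/SQLite model of the documented behavior, not the Data Tool's actual Java/JDBC implementation; the table and SQL mirror the sqlProfile example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table changeset_profile ("
             "name varchar(20) not null primary key, last_run timestamp)")

def retrieve(name):
    # retrieveSql: returns at most one row of (startTime, endTime)
    return conn.execute(
        "select last_run as startTime, CURRENT_TIMESTAMP as endTime "
        "from changeset_profile where name = :name", {"name": name}).fetchone()

def pull_changeset(name):
    row = retrieve(name)
    if row is None:
        # createSql is present: auto-create the profile, then retrieve again.
        # Without createSql this would fail with NoSuchProfileException instead.
        conn.execute("insert into changeset_profile (name) values (:name)",
                     {"name": name})
        row = retrieve(name)
    start, end = row
    # ... the changeset for the window [start, end) is generated here ...
    # updateSql: record the end timestamp; it must modify exactly one row.
    cur = conn.execute("update changeset_profile set last_run = :lastRun "
                       "where name = :name", {"lastRun": end, "name": name})
    if cur.rowcount != 1:
        raise RuntimeError("updateSql did not modify exactly one row")
    return start, end

start1, _ = pull_changeset("profile")   # first pull: no last_run yet
start2, _ = pull_changeset("profile")   # second pull: start is the previous end
print(start1 is None, start2 is not None)  # True True
```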


A publisher is a named entry point for pulling changesets. Each publisher is accessible via a different URL based on its name.



A sqlPublisher element requires a dataSource. A profile is optional. If no profile is specified, then the publisher will never have a start or end time available for its queries. An optional propertyCase attribute specifies how the case of item properties should be handled.

The sqlPublisher element optionally contains an element specifying the queries for a snapshot, full, or bulk changeset, as well as any number of elements defining the queries for incremental (delta) changesets. These elements ultimately contain the SQL queries used to create set-item, add-to-item, and remove-item elements in the resulting changeset.

propertyCase for item property name handling

Starting with schema version 7, you can specify if and how you want item property names to be normalized. With schema version 7, the default is to preserve case. Older schema versions continue to have the legacy behavior of lowercasing property names derived from column names, but not those derived from values pivoted by merge-columns. The valid values for propertyCase are:

  • Use whatever case we get back from the database.
  • All property names are converted to locale lower case.
  • All property names are converted to locale upper case.
  • Property names from SQL column labels are converted to lower case, but property names from vertical table values are preserved (the legacy behavior).

<sqlPublisher name="production"
  dataSource="dataSource" profile="profile">
Changeset Categories

A SqlPublisher can be configured to produce two categories of changesets.

  • Complete changesets: one of snapshot, bulk, or full
  • Incremental changesets: delta

A SqlPublisher may optionally contain one complete changeset definition. A complete changeset definition is provided by the snapshot, bulk, and full elements. As the name suggests, a complete changeset should include all the items in your dataset.

A SqlPublisher may contain any number of incremental changeset definitions, each one defined in a delta element. A delta changeset includes only the items that have been modified or deleted since the previous incremental or complete changeset was created for a particular profile.

The changeset category elements contain one or more set-item elements, which in turn contain a query and optional merge-columns and subquery elements. The snapshot and delta elements may also contain add-to-item and remove-item elements, which in turn contain a query and, in the case of add-to-item, optional merge-columns and subquery elements. The bulk and full elements cannot contain any remove-item or add-to-item elements, but can contain a query and optional merge-columns and subquery elements.

When the Data Tool processes a changeset for a SqlPublisher, the queries inside of the set-item elements are used to create set-item elements in the resulting changeset to represent items that were created or modified. The queries inside of the add-to-item elements are used to create add-to-item elements in the resulting changeset to represent properties that are set on preexisting items. The queries inside of the remove-item elements are used to create remove-item elements in the resulting changeset to record when items are deleted.

In all queries and subqueries, the column names are used for property names unless they are overridden by specifying a column alias.

To include modified or created items in a delta changeset:

  <set-item idColumn="id">
    <query><![CDATA[
      select * from items where last_updated >= :start and last_updated < :end
    ]]></query>
  </set-item>

To track deletions in a delta changeset:

  <remove-item idColumn="id">
    <query><![CDATA[
      select id from deleted_items where last_updated >= :start and last_updated < :end
    ]]></query>
  </remove-item>

To include all items in a snapshot changeset:

  <set-item idColumn="id">
    <query><![CDATA[
      select * from items
    ]]></query>
  </set-item>

Note that a snapshot may contain a remove-item query. Bulk and full must not contain a remove-item.

  <set-item idColumn="id">
    <query><![CDATA[
      select * from items
    ]]></query>
  </set-item>
  <remove-item idColumn="id">
    <query><![CDATA[
      select id from deleted_items
    ]]></query>
  </remove-item>

Note that, unlike snapshots, bulk and full changesets must not include any duplicate items.

Note that calling stored procedures is supported. Calling a stored procedure looks like this.

  <set-item idColumn="id">
    <query><![CDATA[
      call get_updated_items(:start, :end)
    ]]></query>
  </set-item>

The SQL is processed by a org.springframework.jdbc.core.namedparam.NamedParameterJdbcTemplate which supports named parameters.

The following bound parameters are available for your queries. Bound parameter names are not case sensitive. Note that start and end are null for snapshot, bulk and full queries, so you probably do not want to include them in your sql:

  • start The inclusive start of the date range for this changeset as a java.util.Date.
  • end The exclusive end of the date range for this changeset as a java.util.Date.
  • kind ‘snapshot’ or ‘delta’
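
As an illustration of the named-parameter style, here is a sketch using Python's sqlite3 module (which also accepts :name parameters) rather than the Spring NamedParameterJdbcTemplate the Data Tool actually uses; the items table is illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table items (id integer primary key, "
             "name text, last_updated timestamp)")
conn.executemany("insert into items values (?, ?, ?)",
                 [(1, "a", "2010-01-01"), (2, "b", "2010-02-01")])

# The delta window is inclusive of :start and exclusive of :end,
# so item 2 (updated exactly at :end) is excluded.
rows = conn.execute(
    "select id from items "
    "where last_updated >= :start and last_updated < :end",
    {"start": "2010-01-01", "end": "2010-02-01"}).fetchall()
print([r[0] for r in rows])  # [1]
```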

Each set-item can contain one query element, and zero or more subquery elements. Each set-item is executed in sequence. This supports creating changesets that union across various different views or tables.

Each add-to-item can contain one query element, and zero or more subquery elements. Each add-to-item is executed in sequence. This supports creating changesets that union across various different views or tables.

Vertical Table Support

If your database uses a vertical table design, then you will want to pivot multiple rows from your vertical table into one conceptual set-item in the generated changeset. The Data Tool supports this kind of pivot with one or more merge-columns elements.

With merge-columns, your query is expected to return multiple rows for each item. For each row returned, the merge-columns element tells the Data Tool to use the value from keyColumn for the item property key and the value from valueColumn for the item property value. Multiple values returned across multiple rows for the same item and property are combined into an array structure, as shown for the entry “ratings” in this example:

  <set-item id="007">
        <entry name="ratings">

The Data Tool expects the result set to be ordered by item id (the idColumn value), so the SQL query that generates the result set should order its results accordingly.

If the results are not ordered by id, then the resulting set-item entries will not be reliable. No error will be thrown.

Note that any non-merged columns from the query result set will be included in the resulting set-item. Unlike merged columns, only the first row is used to populate them.

You may include subquery elements after your merge-columns. Any subqueries that you include will be run only after the merges are complete.

Here is an example vertical table schema

create table vertical_master_items (
  id varchar(255) primary key not null,
  lastUpdated timestamp not null default current_timestamp);

create table vertical_data (
  id varchar(255) not null references vertical_master_items(id),
  name varchar(255),
  value varchar(255));

And here is an example of the use of the merge-columns configuration element.

<sqlPublisher name="test-simple" dataSource="dataSource">
  <delta>
    <set-item idColumn="id">
      <query><![CDATA[
        SELECT,, data.value
        FROM vertical_master_items master
        JOIN vertical_data data on =
        WHERE master.lastUpdated >= :start and master.lastUpdated < :end
        ORDER BY
      ]]></query>
      <merge-columns keyColumn="name" valueColumn="value"/>
    </set-item>
  </delta>
</sqlPublisher>

With this data in vertical_master_items:

  id  last_updated
  1   2010-01-01
  2   2010-01-01
  3   2010-01-01

and in vertical_data:

  id  name   value
  1   color  red
  1   color  orange
  2   color  yellow

The following changeset will be created:

  <set-item id="1">
        <entry name="id">
        <entry name="color">
  <set-item id="2">
        <entry name="id">
        <entry name="color">
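
The pivot the Data Tool performs for merge-columns can be sketched in Python with itertools.groupby. Like merge-columns, groupby only merges adjacent rows, which is why the query must order its results by id:

```python
from itertools import groupby

# Rows as returned by the example query, ordered by id.
rows = [("1", "color", "red"),
        ("1", "color", "orange"),
        ("2", "color", "yellow")]

items = {}
for item_id, group in groupby(rows, key=lambda r: r[0]):
    props = {}
    for _, key, value in group:
        # Multiple values for the same item and property collapse into a list.
        props.setdefault(key, []).append(value)
    items[item_id] = props

print(items)  # {'1': {'color': ['red', 'orange']}, '2': {'color': ['yellow']}}
```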

The engine requires that data available through one-to-many relationships be “flattened” or denormalized as part of the main item property query. A subquery is typically used to pull in values from child tables. Subqueries provide a very flexible means for managing these relationships.

Subqueries are optional. Each set-item element can contain zero or more subquery elements. Subqueries are executed after the single main required query.

Subqueries are quite powerful and flexible. Here are some ways that you might use a subquery.

  1. Map a list of values from a single column in a child table to an item property as an array of values.
  2. Map a list of values from a single column in a child table to an item property as a single string of delimited values.
  3. Map a list of subquery rows to multiple item properties, with each column mapped to an item property containing an array of values.
  4. Map a list of subquery rows to multiple item properties, with each column mapped to an item property containing a single string of delimited values.
  5. Map a list of subquery rows to a single item property containing a map with each column as a map key for a string of delimited values.
  6. Map a list of subquery rows to a single item property containing a list of maps (one per subquery row), with each column as a map key for that column’s value.
  7. Use a discriminator column returned from the subquery to map a list of subquery rows to a single item property containing a map with each discriminator value used as a map key with its value being a map of column name to value for that discriminator’s row.

The results from each subquery are merged into the set-item element. The column names or aliases are used as property names, overwriting any entries previously populated by the main query. The subquery results can optionally be scoped by specifying either a property or propertyPrefix attribute on the subquery. Specifying a property means use that as the set-item property name. Specifying a propertyPrefix means prepend the prefix to the column name(s) or alias(es) when creating set-item property names.

Subquery results can be represented as arrays of values, a delimited string, arrays of maps, maps of arrays, maps of delimited strings, or mapped to multiple entries on the main set-item with the values as arrays or delimited strings. The choice of how multiple values are mapped is controlled by the type attribute.

If a subquery returns more than one row, then the row values are represented as an array. If a delimiter is specified, then the row values returned by the subquery are joined by the delimiter into a single value.

If a subquery returns more than one column, and it is mapped using a property, then the property value will be represented by a dictionary (map) or an array of dictionaries, with each subquery row consisting of a single dictionary of column name to column value. Each column is mapped to an entry in the struct. Column aliases are respected, so you can re-map the database column names to arbitrary strings in the returned dictionaries.

By specifying a propertyPrefix, each column from the subquery is merged into the main set-item using the propertyPrefix combined with the column name or column alias to create the set-item property name. The values are either a single value (when the subquery returns a single row), an array of values, or a string of delimited values, depending on whether the type specified was array or delimited.

Subqueries that return multiple columns may use the discriminator feature. When a discriminator is used, the values of each row are mapped to a dictionary with the discriminator value as the key. If multiple subquery rows return the same discriminator value, then the values from the last row returned overwrite any previous values for that discriminator.

The default subquery type is array, which builds an array element in the set-item structure for the multiple values of each column returned. The other valid choice for type is delimited, which builds a single delimited string containing the multiple values. For a type of delimited, a delimiter can be specified. If unspecified, the default delimiter is a comma.
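
A minimal sketch of how the two type values shape a multi-row result (illustrative Python, not the Data Tool's internals):

```python
def map_rows(values, subquery_type="array", delimiter=","):
    # type="delimited": join the row values into one delimited string.
    if subquery_type == "delimited":
        return delimiter.join(values)
    # type="array" (the default): keep the row values as an array.
    return values

colors = ["red", "orange"]
print(map_rows(colors))                     # ['red', 'orange']
print(map_rows(colors, "delimited"))        # red,orange
print(map_rows(colors, "delimited", "|"))   # red|orange
```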

All the columns or aliases returned by the main query are available to the subquery as named bound parameters. The bound parameters from the main query (:start, :end, and :kind) are not available to the subquery. Bound parameter names are not case sensitive. If the main query returns null or an empty string for a column, then the corresponding bound parameter will have a null value. Bound parameters in a subquery that do not have a corresponding column in the main query will have a null value.

The subquery examples below use the id returned from the main query as a bound parameter.

<subquery><![CDATA[
  select name as color from SubqueryColors where parent_id=:id order by name
]]></subquery>
<subquery type="delimited" delimiter="|"><![CDATA[
  select name as color_delim from SubqueryColors where parent_id=:id order by name
]]></subquery>
<subquery property="color-default"><![CDATA[
  select name from SubqueryColors where parent_id=:id order by name
]]></subquery>
<subquery property="color-array" type="array"><![CDATA[
  select name from SubqueryColors where parent_id=:id order by name
]]></subquery>
<subquery property="color-delimited" type="delimited"><![CDATA[
  select name from SubqueryColors where parent_id=:id order by name
]]></subquery>
<subquery property="color-delimited" type="delimited" delimiter="|"><![CDATA[
  select name from SubqueryColors where parent_id=:id order by name
]]></subquery>
<subquery property="address" discriminator="addr_type"><![CDATA[
  select addr_type, street, city, state, zipcode
  from SubqueryAddress
  where parent_id=:id
  order by id
]]></subquery>
<subquery propertyPrefix="product"><![CDATA[
  select id, price, description from order_lines where order_id=:id order by id
]]></subquery>

Supported Column Types

The following SQL data types are supported by the Data Tool and will be mapped from your query and subquery results to changeset set-item or add-to-item property values.

  • BIT
  • CHAR
  • CLOB
  • DATE
  • REAL
  • TIME

Only char, varchar, longvarchar and clob columns can be treated as embedded JSON columns.

Unsupported Column Types

The following SQL data types are not supported by the Data Tool. They will be ignored if they are returned from your query or subquery.

  • BLOB
  • NULL
  • REF


To run the Data Tool with SSL support

  1. Create or obtain an RSA key and server certificate (potentially signed by a certificate authority).
    • Note that the Common Name on the certificate must match the hostname in the feed URL to be used by the discovery engine.
  2. Import the key and certificate into a Java Key Store for the Data Tool.
  3. If the server certificate is not signed by a root certificate authority, then you will need to add it to your server’s shared trusted key store: place a copy of the certificate (public key) with a file extension of .pem or .crt into engines/xxxx/certs.
  4. Configure the Data Tool start script to pass in the command line options for https-port, keystore-file, keystore-pass and key-pass. Exactly how to do this is detailed below.

Create the RSA key, Certificate and Java Key Store

The full process for creating an RSA key, X.509 certificate and Java Key Store is well documented on the Internet. These minimal steps will get you started with a self-signed certificate. You can substitute the name datatool with anything you want. Note that you need to have openssl and Java 1.5 or newer installed.

# Generate RSA key.
# Note that the Common Name must match the host name that will be in the public
# feed URL for the Data Tool
openssl genrsa -des3 -out datatool.key

# Create x509 certificate from the key
openssl req -new -x509 -key datatool.key -out datatool.crt

# Combine the key(s) and the certificate into a single pkcs12 file
openssl pkcs12 -inkey datatool.key -in datatool.crt -export -out datatool.pkcs12

# create a Java key store and import the key and certificate
keytool -importkeystore -srckeystore datatool.pkcs12 -srcstoretype PKCS12 \
    -destkeystore datatool.jks

You will need to place the keystore file datatool.jks in a location where the Data Tool will be able to read it when it starts up.

Configuration options in

The Discovery Data Tool is configured with a file which is read by the discovery_datatool shell script and used to launch the application.

The following properties may be specified (none are required).

  • Port on which to listen for HTTP. Defaults to 8089.
  • TCP address to bind to. Listens on all interfaces if not specified.
  • Port on which to listen for HTTPS. If not specified, then the Discovery Data Tool will not serve HTTPS.
  • Path to a Java Key Store file containing the server certificate and private key. Required for HTTPS.
  • Password for the Java Key Store file containing the server certificate and private key. Required for HTTPS.
  • Password for the server private key. Required if your key has a password.
  • Path to a Java Key Store file containing trusted client certificates. Required if you want to verify SSL client certificates.
  • Password for the trust store. Required if using a truststore_file.
  • Maximum heap to allocate in megabytes. Uses the JVM default if not specified.
  • Override for specifying logging configuration. Example: -Djava.util.logging.config.file=/special_location/ If unspecified and a file exists in the same directory as the, then that file is used, otherwise the from the Discovery Data Tool release directory is used.
  • Whether the Discovery Data Tool should validate its sqlPublishers’ SQL statements against the database. While no results are retrieved, this option can slow application start-up. Defaults to true. Set to false to disable validation.

How-To: Bringing it All Together

Given all these details, what would a typical data integration process look like?

Design the Data Integration

Start by determining what your key business entities are. How are they related? What do you plan to search and display? Remember that the Discovery Engine cannot perform joins, so each item that you include in your changeset must include all the data that you want to use for search or display - including any data from related entities. Based on your proposed searches, go ahead and create a dimensions.xml that you’ll use to validate your initial changeset. It’s an iterative process, so this is just a starting point.

If you want to support processing incremental changes in the Discovery Engine, then your entities will need a last-modified timestamp and you will need some mechanism for tracking when items are deleted. Database triggers can be useful to ensure that last-modified timestamps are updated appropriately - especially if you choose to cascade updates on dependent objects to your main items (e.g. updating a broker updates the last-modified timestamp on all of the broker’s listings).

Tracking deletes

For tracking deleted items, two strategies are soft deletion and a deleted items table. To implement soft deletion, rows are not actually deleted; a status flag is updated instead. All queries include the status flag so that you don’t inadvertently include deleted items. If you want to implement a hard delete, where your items are actually deleted, a database trigger can add a row to a deleted items table each time an item is deleted. The deleted items table contains, at a minimum, the item identifier and the timestamp of when the item was deleted. With either of these mechanisms, it is trivial to create SQL queries that return the list of item identifiers that were deleted during a specific timeframe.

A note about identifiers

Item identifiers in the Discovery Engine must be unique. In an SQL database, it is common for each table to have its own primary key sequence. In the engine, all items share a global identifier namespace (as if they were all in one “table”), so you cannot simply use the database primary key as your item id. You can, however, trivially modify your database identifier by appending or prepending a type discriminator (e.g. “broker_1” instead of just “1”).
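
For instance (illustrative Python; the same concatenation can just as easily be done in your SQL):

```python
def item_id(entity_type, primary_key):
    # Brokers and listings may both have primary key 1 in their own tables,
    # but prefixing a type discriminator keeps their item ids unique in the
    # engine's global namespace.
    return f"{entity_type}_{primary_key}"

print(item_id("broker", 1))   # broker_1
print(item_id("listing", 1))  # listing_1
```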

Let’s write some SQL

Once you’ve considered all the above points, you’re ready to build and test the SQL that you will use to export your entities in a denormalized form for your changesets, and the SQL that you will use to export the current deleted items list. Pay particular attention to the scalability of your queries as the size of your data may increase over time. You can test these initial SQL queries out in the tool of your choice.

Remember that there are two kinds of changesets that you can choose to support: complete changesets and incremental changesets.

Complete changesets (snapshot, bulk, full) contain all the data for all the items that should be live. As such, there is no need to include a remove-item query - unless you need to remove items that you just created (and this is not permitted in bulk changesets). Complete changesets have an :end timestamp, but do not have a :start timestamp.

Incremental changesets (delta) contain only information about items that were modified or deleted since the last incremental changeset. Use the :start and :end bound parameters in your SQL. Note that the general approach is WHERE lastUpdated >= :start and lastUpdated < :end.

Testing with the Discovery Data Tool and the Discovery Engine

After you write your basic SQL, create a discovery_datatool.xml with the building blocks described in this document. If you want to create incremental changesets, then you must define a sqlProfile. Create one or more sqlPublishers and wire everything together so that you can test your SQL in the Data Tool and test the changesets it generates with your engine. After all, what good is a changeset if you cannot search or retrieve the data as you planned?

For testing, you can use curl on the command line, redirecting the output to a file. You can test the data tool without parameters to get a complete set of data, and test using a profile. I often use the dryRun=true parameter when testing a profile so that the profile will not be updated in the database. This way you can repeatedly produce delta changesets with the same :start and :end to test your SQL changes against the same subset of items.

If you have access to modify the database, then you can also delete or update the test profiles that you used to test incremental feeds. If you only have read-only db access, then you can still test the delta queries even without a profile by using the startDate and endDate parameters. It should now be clear that many of these parameters exist solely to enable testing.

To test a complete changeset (e.g. full, snapshot, or a delta with all data):

$ curl  'http://localhost:8089/ws/publisher/{publisherName}' > cs.xml

To test a delta with a profile:

$ curl  'http://localhost:8089/ws/publisher/{publisherName}?profile=testProfile&dryRun=true' > cs.xml

When you use curl -qi, the HTTP response headers are included in the output. This is useful for confirming that the changeset type is as expected (delta, full, etc.) by looking at the X-t11e-type header. If you use curl -qi, you will need to edit the output to remove the headers in order to use the changeset (or it will not be valid XML).

Query string parameters useful for testing are:

  • startDate - Test delta with the :start SQL bound parameter in lieu of using a profile.
  • endDate - Test delta with the :end SQL bound parameter in lieu of using a profile.
  • profile - Refers to the profile stored in the database. Depending on the datatool configuration, profiles may be auto-created.
  • dryRun - Do not update the profile database row. This way you can repeatedly test the same delta timespan window.
  • forceSnapshot - Generate a complete feed. Corresponds to the “Restart with complete feed now” engine feed UI checkbox.

Note that startDate and endDate use the date time format yyyy-MM-dd'T'HH:mm:ss
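
That pattern corresponds to strftime's %Y-%m-%dT%H:%M:%S, so test window boundaries can be generated like this (illustrative Python):

```python
from datetime import datetime

def to_datatool_format(ts):
    # yyyy-MM-dd'T'HH:mm:ss, as expected by the startDate/endDate parameters
    return ts.strftime("%Y-%m-%dT%H:%M:%S")

print(to_datatool_format(datetime(2010, 1, 1, 13, 30, 0)))  # 2010-01-01T13:30:00
```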

Some things that we recommend testing

  1. That the sql doesn’t have any runtime errors.

    The datatool will validate the sql at startup, using the database/driver to parse it, and fail to start if it is invalid. To fully validate the sql, you also need to pull data through it. If you have subqueries, then you’ll want to test scenarios where your subqueries return 0, 1, and >1 rows. This is just basic branch coverage of your queries. If your sql has conditions (e.g. case, if), you’ll want to test all the permutations.

  2. The changeset contains all the expected data in the structure that you expect (e.g. if you’re using subqueries, did you want parallel arrays of primitives but get an array of dictionaries?)

  3. Deltas using a profile only contain the modified and deleted items that you expect

  4. All the publishers and sql are used. If you have different SQL for a <full> and <delta> in a publisher, then you need to request and generate both kinds of changesets in order to test the SQL for both.

  5. The XML is valid by running it though xmllint

  6. The changeset can be processed by the engine as expected. You can manually upload the XML via the admin UI. This sometimes uncovers errors that you missed. For example, if you didn’t notice that the XML was invalid due to an exception.

    Upload a dimensions file, and verify that the data in the changeset can be indexed and queried as expected. The data might be in the changeset, but not in a manner that can be indexed, or maybe there’s a mismatch in property names between the dimension and the changeset.

  7. Incremental feeds to the engine work as expected. Configure a recurring feed for the engine via the admin UI using your data tool, make data changes, and verify that the changes reach your engine.

If you request a changeset and the data tool encounters a runtime error in the middle (e.g. sql that failed in a subquery part of the way through), then the resulting changeset will be invalid with the exception trace included in the generated output.


You will see most Data Tool errors in the log/datatool-0.log file, especially errors validating the SQL at startup. The datatool will not start if the SQL fails to validate. When posting a changeset to your Discovery Engine, you may see some errors processing the changeset immediately in the Discovery Engine HTTP response, but some errors will only appear in the Discovery Engine log file log/discovery-0.log.