Data Tool

The Discovery Data Tool provides an implementation of a Changeset Publisher as a stand-alone, cross-platform Java application.

It uses JDBC, and is thus compatible with most major databases.

This open source component is hosted on GitHub at:

http://github.com/t11e/discovery_datatool

You can download the latest source as well as the built distribution from:

http://github.com/t11e/discovery_datatool/releases

Summary

  1. Configure the Data Tool by editing discovery_datatool.xml

  2. Create a datatool.properties file.

    1. Optionally specify HTTP port, bind address, JVM memory, or other overrides.
    2. Optionally configure the Data Tool to use an SSL certificate.
  3. Run the Data Tool with the discovery_datatool Unix shell script.

  4. Pull changesets or configure the engine to pull from the publisher at

    http://hostname:port/ws/publisher/PUBLISHER_NAME?profile=PROFILE_NAME.

    If you are using access control, then the URL would be

    http://user:pass@hostname:port/ws/publisher/PUBLISHER_NAME?profile=PROFILE_NAME.

    Of course, if you are using SSL, then the URL would use the https scheme.
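
    For example, a changeset can be pulled manually with curl; the host, port, publisher name, profile name, and credentials below are placeholders:

    $ curl 'http://localhost:8089/ws/publisher/production?profile=main'
    $ curl -u engine1:engine1secret 'http://localhost:8089/ws/publisher/production?profile=main'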

Optional URL Parameters

The following options can be specified as query parameters on the URL:

  • forceSnapshot - If set to “true” then a snapshot will be generated regardless of the profile’s last_run timestamp. This option is passed from the Discovery Search Engine if a user has checked the Restart feed from fresh snapshot now checkbox on the Feeds tab of the administrative web interface.
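
For example, to force a snapshot on a manual pull with curl (host, port, and names are placeholders):

$ curl 'http://localhost:8089/ws/publisher/production?profile=main&forceSnapshot=true'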

Configuration

Note that the XML schema for the configuration is versioned. Earlier versions of the configuration XML continue to be supported; however, we recommend using the latest version of the XML schema in order to take full advantage of the newest features.

The configuration is defined by the XML Schema located at:

http://transparensee.com/schema/datatool-config-7.xsd

The examples/v7 directory of the distribution contains sample XML configuration files for various popular databases. You can copy any of these to discovery_datatool.xml (in the same directory as datatool.properties) to get started. Examples for the earlier schema versions are in the examples/v* directories.

As the Discovery Data Tool changes, we strive to maintain backwards compatibility. One way we do this is by introducing new XML schema versions as new features are added. The Data Tool continues to support earlier versions of the configuration XML by using the schema namespace to recognize how to parse the file.

In general, moving to the newer schema is trivial. There was, however, a major structural change in the XML schema between version 3 and version 4. To ease migration, the stylesheet utilities/configuration_converter_v3_to_v4.xslt can be used to migrate from XML schema version 3 to version 4. You should do this if you want to use any of the features added since version 3, such as subqueries, bulk/full feeds, and provider, kind, or vertical table support.

Here is an example of how you would use this stylesheet with xsltproc and xmllint to convert an existing version 3 configuration file. This assumes that you have the command line tools xsltproc and xmllint installed:

$ xsltproc utilities/configuration_converter_v3_to_v4.xslt discovery_datatool.xml | xmllint --format - > discovery_datatool-4.xml

The configuration XML file contains four main sections: accessControl, dataSources, profiles, and publishers. It contains all the SQL statements that are used, giving you full control to optimize them as needed.

<config
  xmlns="http://transparensee.com/schema/datatool-config-7"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation=
    "http://transparensee.com/schema/datatool-config-7
     http://transparensee.com/schema/datatool-config-7.xsd">
  <!--
  Optional:
  <accessControl>
    <user name="engine1" password="engine1secret"/>
  </accessControl>
  -->
  <dataSources>
    ...
  </dataSources>

  <profiles>
    ...
  </profiles>

  <publishers>
    ...
  </publishers>
</config>

accessControl

The accessControl element is optional. If it is present, HTTP Basic authentication is used to restrict access to the Data Tool URLs.

It accepts any number of child user elements. Each must have name and password attributes.
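
For example, to allow two engines to authenticate (names and passwords are placeholders):

<accessControl>
  <user name="engine1" password="engine1secret"/>
  <user name="engine2" password="engine2secret"/>
</accessControl>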

DataSources

A DataSource describes how the Data Tool should connect to the database. You can use either a javax.sql.DataSource or a java.sql.Driver.

MySQL DataSource

<dataSource name="dataSource"
  jar="drivers/mysql-connector-java-5.1.10-bin.jar"
  class="com.mysql.jdbc.jdbc2.optional.MysqlDataSource">
  <url>jdbc:mysql://localhost/test</url>
  <user>test</user>
  <password>test</password>
</dataSource>

MySQL Driver

<driver name="dataSource"
  jar="drivers/mysql-connector-java-5.1.10-bin.jar"
  class="com.mysql.jdbc.Driver">
  <url>jdbc:mysql://localhost/test</url>
  <username>test</username>
  <password>test</password>
</driver>

SQL Server DataSource

<dataSource name="dataSource"
  jar="drivers/jtds-1.2.5.jar"
  class="net.sourceforge.jtds.jdbcx.JtdsDataSource">
  <!-- 1: sqlserver, 2: sybase -->
  <serverType>1</serverType>
  <serverName>localhost</serverName>
  <databaseName>test</databaseName>
  <user>test</user>
  <password>test</password>
</dataSource>

SQL Server Driver

<driver name="dataSource"
  jar="drivers/jtds-1.2.5.jar"
  class="net.sourceforge.jtds.jdbc.Driver">
  <url>jdbc:jtds:sqlserver://localhost/test</url>
  <username>test</username>
  <password>test</password>
</driver>
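
PostgreSQL Driver

Other JDBC databases follow the same pattern. Here is a sketch of a PostgreSQL driver configuration; the jar file name is an assumption and should match the driver version you actually download:

<driver name="dataSource"
  jar="drivers/postgresql-42.2.5.jar"
  class="org.postgresql.Driver">
  <!-- The jar path above is illustrative; point it at your downloaded driver jar. -->
  <url>jdbc:postgresql://localhost/test</url>
  <username>test</username>
  <password>test</password>
</driver>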

Profiles

Profiles describe how the state of changeset pulls is stored.

Profiles are expected to exist before use. The Discovery Data Tool does not automatically create them unless you provide the SQL for this in a createSql element.

SqlProfile

This profile stores state in the database.

You would create the following table to support it.

create table changeset_profile (
  name varchar(20) not null unique primary key,
  last_run timestamp
)

And configure it like this.

<sqlProfile name="profile" dataSource="dataSource">
  <!--
    createSql is optional. By including it we enable automatic profile creation.
  -->
  <createSql>
    insert into changeset_profile (name) values (:name)
  </createSql>
  <retrieveSql startColumn="startTime" endColumn="endTime">
    select last_run as startTime, CURRENT_TIMESTAMP as endTime
    from changeset_profile
    where name = :name
  </retrieveSql>
  <updateSql>
    update changeset_profile
    set last_run = :lastRun
    where name = :name
  </updateSql>
</sqlProfile>

The interaction with the SqlProfile is as follows.

Whenever a request to /ws/publisher/PUBLISHER_NAME?profile=PROFILE_NAME is made, the retrieveSql is executed. It must return start and end timestamps in the columns corresponding to retrieveSql/@startColumn and retrieveSql/@endColumn. The query must return either one row or no rows.

If no rows are returned and there is no createSql, then the changeset pull will fail with a NoSuchProfileException. If there is a createSql, then it is executed with the profile name bound as the “name” parameter. As you might expect, this query is responsible for creating the named changeset profile. After that query executes, the retrieveSql is executed again to obtain the start and end timestamps.

After the successful generation of the changeset, the updateSql is executed in order to update the named profile’s lastRun to the end timestamp that was used in generating the changeset. The updateSql is passed the profile name as the “name” parameter and the new last run timestamp as the “lastRun” parameter. The updateSql is expected to modify exactly one row. If it does not, then a RuntimeException is thrown.
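
If you omit createSql, the profile row must exist before the first pull. For example, you could seed it by hand; the profile name below is a placeholder:

insert into changeset_profile (name) values ('production');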

Publishers

A publisher is a named entry point for pulling changesets. Each publisher is accessible via a different URL based on its name.

http://localhost:8089/ws/publisher/production
http://localhost:8089/ws/publisher/stage
http://localhost:8089/ws/publisher/test

SqlPublisher

A sqlPublisher element requires a dataSource. A profile is optional; if no profile is specified, then the publisher will never have a start or end time available for its queries. An optional propertyCase attribute specifies how the case of item property names should be handled.

The sqlPublisher element optionally contains an element specifying the queries for a snapshot, full, or bulk changeset, as well as any number of elements defining the queries for incremental (delta) changesets. These elements ultimately contain the SQL queries used to create set-item, add-to-item, and remove-item elements in the resulting changeset.

propertyCase for item property name handling

Starting with schema version 7, you can specify whether and how item property names are normalized. With schema version 7, the default is to preserve case. Older schema versions keep the legacy behavior of lowercasing property names derived from column names, but not those derived from values pivoted by merge-columns. The valid values for propertyCase are:

preserve
Use whatever case we get back from the database
lower
All property names are converted to locale lower case
upper
All property names are converted to locale upper case
legacy
Property names from SQL column labels are converted to lower case, but property names from vertical table values are preserved.

A minimal sqlPublisher declaration looks like this:

<sqlPublisher name="production"
  dataSource="dataSource" profile="profile">
  ...
</sqlPublisher>
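
For example, to force all property names to lower case regardless of how the database reports column labels (a sketch using the propertyCase attribute described above):

<sqlPublisher name="production"
  dataSource="dataSource" profile="profile" propertyCase="lower">
  ...
</sqlPublisher>
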
Changeset Categories

A SqlPublisher can be configured to produce two categories of changesets.

  • Complete changesets: one of snapshot, bulk, or full
  • Incremental changesets: delta

A SqlPublisher may optionally contain one complete changeset definition, provided by the snapshot, bulk, or full element. As the name suggests, a complete changeset should include all the items in your dataset.

A SqlPublisher may contain any number of incremental changeset definitions, each one defined in a delta element. A delta changeset includes only the items that have been modified or deleted since the previous incremental or complete changeset was created for a particular profile.

The changeset category elements contain one or more set-item elements, which in turn contain a query and optional merge-columns and subquery elements. The snapshot and delta elements may also contain add-to-item and remove-item elements, which in turn contain a query and, in the case of add-to-item, optional merge-columns and subquery elements. The bulk and full elements cannot contain any remove-item or add-to-item elements, but can contain a query and optional merge-columns and subquery elements.

When the Data Tool processes a changeset for a SqlPublisher, the queries inside of the set-item elements are used to create set-item elements in the resulting changeset to represent items that were created or modified. The queries inside of the add-to-item elements are used to create add-to-item elements in the resulting changeset to represent properties that are set on preexisting items. The queries inside of the remove-item elements are used to create remove-item elements in the resulting changeset to record when items are deleted.

In all queries and subqueries, the column names are used for property names unless they are overridden by specifying a column alias.
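
For example, assuming a hypothetical items table with a product_title column, the following query would publish that column as the property title:

select id, product_title as title from items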

To include modified or created items in a delta changeset:

<delta>
  <set-item idColumn="id">
    <query><![CDATA[
    select * from items where last_updated >= :start and last_updated < :end
  ]]></query>
  </set-item>
</delta>

To record deletions in a delta changeset:

<delta>
  <remove-item idColumn="id">
    <query><![CDATA[
    select id from deleted_items where last_updated >= :start and last_updated < :end
  ]]></query>
  </remove-item>
</delta>

To include all items in a snapshot changeset:

<snapshot>
  <set-item idColumn="id">
    <query><![CDATA[
    select * from items
  ]]></query>
  </set-item>
</snapshot>

Note that a snapshot may contain a remove-item query. Bulk and full must not contain a remove-item.

<snapshot>
  <set-item idColumn="id">
    <query><![CDATA[
    select * from items
  ]]></query>
  </set-item>
  <remove-item idColumn="id">
    <query><![CDATA[
    select id from deleted_items
  ]]></query>
  </remove-item>
</snapshot>

Note that, unlike snapshots, bulk and full changesets must not include any duplicate items.

Note that calling stored procedures is supported; a stored procedure call looks like this:

<delta>
  <set-item idColumn="id">
    <query><![CDATA[
    call get_updated_items(:start, :end)
  ]]></query>
  </set-item>
</delta>

The SQL is processed by an org.springframework.jdbc.core.namedparam.NamedParameterJdbcTemplate, which supports named parameters.

The following bound parameters are available to your queries. Bound parameter names are not case sensitive. Note that start and end are null for snapshot, bulk, and full queries, so you probably do not want to reference them in that SQL:

  • start The inclusive start of the date range for this changeset as a java.util.Date.
  • end The exclusive end of the date range for this changeset as a java.util.Date.
  • kind ‘snapshot’ or ‘delta’

Each set-item or add-to-item can contain one query element and zero or more subquery elements. The set-item and add-to-item elements are executed in sequence, which supports creating changesets that union across different views or tables.

Provider and Kind Support

If the identifiers in your data are only unique within the context of a specific content provider and/or data type, then as of Discovery Engine 3.3.0 and Data Tool 1.6 you no longer need to manipulate the value returned for the idColumn to make it unique within your engine. Instead, you can use the providerColumn and kindColumn attributes on the set-item, add-to-item, and remove-item configuration elements.

The presence of these attributes results in the generation of changeset set-item, add-to-item, and remove-item elements with locator, provider, and kind attributes, and without id attributes. The Discovery Engine combines locator, provider, and kind to create a unique identifier attribute for each item.

This feature is compatible with merge-columns and subquery elements - just think of xxxxColumn as the item property key, whether it comes from the main query, merge-columns, or a subquery. The properties for the idColumn, providerColumn, and kindColumn are not included in the set-item or add-to-item properties structure, as this would create redundant data entries in the item properties. The special item properties _id, _locator, _provider, and _kind are included when you request item properties from the Discovery Engine.

Note that providerColumn and kindColumn are each optional; you may choose to include none, one, or both. If either attribute is present, even if its value is empty, then the resulting set-item or add-to-item will use locator instead of id.

Here is an example snapshot configuration that contains two kinds of items (with possibly overlapping ids), and one provider. If you did not want to track the provider, then in this case the providerColumn could be omitted.

<snapshot>
  <set-item idColumn="id" providerColumn="provider" kindColumn="kind">
    <query><![CDATA[
    select m.*, 'imdb' as provider, 'movie' as kind from movies m
  ]]></query>
  </set-item>
  <set-item idColumn="id" providerColumn="provider" kindColumn="kind">
    <query><![CDATA[
    select p.*, 'imdb' as provider, 'person' as kind from person p
  ]]></query>
  </set-item>
</snapshot>

Vertical Table Support

If your database uses a vertical table design, then you will want to pivot multiple rows from your vertical table into one conceptual set-item in the generated changeset. The Data Tool supports this kind of pivot with one or more merge-columns elements.

With merge-columns, your query is expected to return multiple rows for each item. For each row returned, the merge-columns element tells the Data Tool to use the value from keyColumn as the item property key and the value from valueColumn as the item property value. Multiple values returned across multiple rows for the same item and property are combined into an array structure, as shown for the entry “ratings” in this example:

<changeset>
  <set-item id="007">
    <properties>
      <struct>
        <entry name="ratings">
          <array>
            <element><int>1</int></element>
            <element><int>5</int></element>
          </array>
        </entry>
      </struct>
    </properties>
  </set-item>
</changeset>

The Data Tool expects the result set to be ordered by item id (the idColumn value), so the SQL query that generates the result set should include an appropriate ORDER BY clause.

If the results are not ordered by id, then the resulting set-item entries will not be reliable. No error will be thrown.

Note that any non-merged columns from the query result set will be included in the resulting set-item. Unlike merged columns, only the first row is used to populate them.

You may include subquery elements after your merge-columns. Any subqueries that you include will be run only after the merges are complete.
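
For instance, a set-item that combines merge-columns with a trailing subquery might look like this sketch; the item_tags table and tags property are illustrative:

<set-item idColumn="id">
  <query><![CDATA[
    SELECT data.id, data.name, data.value
    FROM vertical_data data
    ORDER BY data.id
  ]]></query>
  <merge-columns keyColumn="name" valueColumn="value"/>
  <subquery property="tags"><![CDATA[
    select tag from item_tags where item_id = :id order by tag
  ]]></subquery>
</set-item>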

Here is an example vertical table schema:

create table vertical_master_items (
  id varchar(255) primary key not null,
  lastUpdated timestamp not null default current_timestamp);

create table vertical_data (
  id varchar(255) not null references vertical_master_items(id),
  name varchar(255),
  value varchar(255));

And here is an example of the use of the merge-columns configuration element.

<sqlPublisher name="test-simple" dataSource="dataSource">
  <delta>
    <set-item idColumn="id">
      <query><![CDATA[
        SELECT data.id, data.name, data.value
        FROM vertical_master_items master
        JOIN vertical_data data on data.id = master.id
        WHERE lastUpdated >= :start and lastUpdated < :end
        ORDER BY data.id
    ]]></query>
      <merge-columns keyColumn="name" valueColumn="value"/>
    </set-item>
  </delta>
</sqlPublisher>

With this data:

vertical_master_items

id  lastUpdated
1   2010-01-01
2   2010-01-01
3   2010-01-01

vertical_data

id  name   value
1   color  red
2   color  orange
2   color  yellow

The following changeset will be created:

<changeset>
  <set-item id="1">
    <properties>
      <struct>
        <entry name="id">
          <string>1</string>
        </entry>
        <entry name="color">
          <string>red</string>
        </entry>
      </struct>
    </properties>
  </set-item>
  <set-item id="2">
    <properties>
      <struct>
        <entry name="id">
          <string>2</string>
        </entry>
        <entry name="color">
          <array>
            <element>
              <string>orange</string>
            </element>
            <element>
              <string>yellow</string>
            </element>
          </array>
        </entry>
      </struct>
    </properties>
  </set-item>
</changeset>

Subqueries

The engine requires that data available through one-to-many relationships be “flattened” or denormalized as part of the main item property query. A subquery is typically used to pull in values from child tables. Subqueries provide a very flexible means for managing these relationships.

Subqueries are optional. Each set-item element can contain zero or more subquery elements. Subqueries are executed after the single main required query.

Subqueries are quite powerful and flexible. Here are some ways that you might use a subquery.

  1. Map a list of values from a single column in a child table to an item property as an array of values.
  2. Map a list of values from a single column in a child table to an item property as a single string of delimited values.
  3. Map a list of subquery rows to multiple item properties, with each column mapped to an item property and each property containing an array of values.
  4. Map a list of subquery rows to multiple item properties, with each column mapped to an item property and each property containing a single string of delimited values.
  5. Map a list of subquery rows to a single item property containing a map, with each column as a map key for a string of delimited values.
  6. Map a list of subquery rows to a single item property containing a list of maps (one per subquery row), with each column as a map key for that column’s value.
  7. Use a discriminator column returned from the subquery to map a list of subquery rows to a single item property containing a map with each discriminator value used as a map key with its value being a map of column name to value for that discriminator’s row.

The results from each subquery are merged into the set-item element. The column names or aliases are used as property names, overwriting any entries previously populated by the main query. The subquery results can optionally be scoped by specifying either a property or propertyPrefix attribute on the subquery. Specifying a property means use that as the set-item property name. Specifying a propertyPrefix means prepend the prefix to the column name(s) or alias(es) when creating set-item property names.

Subquery results can be represented as arrays of values, a delimited string, arrays of maps, maps of arrays, maps of delimited strings, or mapped to multiple entries on the main set-item with the values as arrays or delimited strings. The choice of how multiple values are mapped is controlled by the type attribute.

If a subquery returns more than one row, then the row values are represented as an array. If a delimiter is specified, then the row values returned by the subquery are joined by the delimiter into a single value.

If a subquery returns more than one column and is mapped using a property, then the property value will be represented by a dictionary (map) or an array of dictionaries, with each subquery row consisting of a single dictionary of column name to column value. Each column is mapped to an entry in the struct. Column aliases are respected, so you can re-map the database column names to arbitrary strings in the returned dictionaries.

By specifying a propertyPrefix, each column from the subquery is merged into the main set-item, using the propertyPrefix combined with the column name or column alias to create the set-item property name. The values are either a single value (when the subquery returns a single row), an array of values, or a string of delimited values, depending on whether the type specified was array or delimited.

Subqueries that return multiple columns may use the discriminator feature. When a discriminator is used, the values of each row are mapped to a dictionary with the discriminator value as the key. If multiple subquery rows return the same discriminator value, then the values from the last row returned overwrite any previous values for that discriminator.

The default subquery type is array, which builds an array element in the set-item structure for the multiple values of each column returned. The other valid choice for type is delimited, which builds a single delimited string containing the multiple values. For a type of delimited, a delimiter can be specified; if unspecified, the default delimiter is a comma.

All the columns or aliases returned by the main query are available to the subquery as named bound parameters. The changeset-level bound parameters (:start, :end, and :kind) are not available to the subquery. Bound parameter names are not case sensitive. If the main query returns null or an empty string for a column, then the corresponding bound parameter will have a null value. Bound parameters in a subquery that do not have a corresponding column in the main query will also have a null value.

The subquery examples below use the id returned from the main query as a bound parameter.

<subquery><![CDATA[
  select name as color from SubqueryColors where parent_id=:id order by name
]]></subquery>
<subquery type="delimited" delimiter="|"><![CDATA[
  select name as color_delim from SubqueryColors where parent_id=:id order by name
]]></subquery>
<subquery property="color-default"><![CDATA[
  select name from SubqueryColors where parent_id=:id order by name
]]></subquery>
<subquery property="color-array" type="array"><![CDATA[
  select name from SubqueryColors where parent_id=:id order by name
]]></subquery>
<subquery property="color-delimited" type="delimited"><![CDATA[
  select name from SubqueryColors where parent_id=:id order by name
]]></subquery>
<subquery property="color-delimited" type="delimited" delimiter="|"><![CDATA[
  select name from SubqueryColors where parent_id=:id order by name
]]></subquery>
<subquery property="address" discriminator="addr_type"><![CDATA[
  select addr_type, street, city, state, zipcode
  from SubqueryAddress
  where parent_id=:id
  order by id
]]></subquery>
<subquery propertyPrefix="product"><![CDATA[
  select id, price, description from order_lines where order_id=:id order by id
]]></subquery>
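
For instance, if the first subquery above returned the rows blue and red for an item, the resulting set-item would carry a color property as an array. This is an illustrative sketch of the generated structure:

<entry name="color">
  <array>
    <element><string>blue</string></element>
    <element><string>red</string></element>
  </array>
</entry>
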
Supported Column Types

The following SQL data types are supported by the Data Tool and will be mapped from your query and subquery results to changeset set-item or add-to-item property values.

  • BIGINT
  • BIT
  • BOOLEAN
  • CHAR
  • CLOB
  • DATE
  • DECIMAL
  • DOUBLE
  • FLOAT
  • INTEGER
  • LONGVARCHAR
  • NUMERIC
  • REAL
  • SMALLINT
  • TIME
  • TIMESTAMP
  • TINYINT
  • VARCHAR

Only char, varchar, longvarchar and clob columns can be treated as embedded JSON columns.

Unsupported Column Types

The following SQL data types are not supported by the Data Tool. They will be ignored if they are returned from your query or subquery.

  • ARRAY
  • BINARY
  • BLOB
  • DATALINK
  • DISTINCT
  • JAVA_OBJECT
  • LONGVARBINARY
  • NULL
  • OTHER
  • REF
  • STRUCT
  • VARBINARY

SSL

To run the Data Tool with SSL support:

  1. Create or obtain an RSA key and server certificate (potentially signed by a certificate authority).
    • Note that the Common Name on the certificate must match the hostname in the feed URL to be used by the discovery engine.
  2. Import the key and certificate into a Java Key Store for the Data Tool.
  3. If the server certificate is not signed by a root certificate authority, then you will need to add it to your server’s shared trusted key store by placing a copy of the certificate (public key) with a file extension of .pem or .crt into engines/xxxx/certs.
  4. Configure the Data Tool start script to pass in the command line options for https_port, keystore_file, keystore_pass, and key_pass. Exactly how to do this is detailed below.

Create the RSA key, Certificate and Java Key Store

The full process for creating an RSA key, X.509 certificate, and Java Key Store is well documented on the Internet. These minimal steps will get you started with a self-signed certificate. You can substitute the name datatool with anything you want. Note that you need to have openssl and Java 1.5 or newer installed.

# Generate RSA key.
openssl genrsa -des3 -out datatool.key

# Create x509 certificate from the key.
# Note that the Common Name must match the host name that will be in the public
# feed URL for the Data Tool.
openssl req -new -x509 -key datatool.key -out datatool.crt

# Combine the key(s) and the certificate into a single PKCS12 file
openssl pkcs12 -inkey datatool.key -in datatool.crt -export -out datatool.pkcs12

# Create a Java Key Store and import the key and certificate
keytool -importkeystore -srckeystore datatool.pkcs12 -srcstoretype PKCS12 \
    -destkeystore datatool.jks

You will need to place the keystore file datatool.jks in a location where the Data Tool will be able to read it when it starts up.

Configuration options in datatool.properties

The Discovery Data Tool is configured with a datatool.properties file, which is read by the discovery_datatool shell script and used to launch the application.

The following properties may be specified (none are required).

http_port
Port on which to listen for HTTP. Defaults to 8089.
bind_address
TCP address to bind to. Listens on all interfaces if not specified.
https_port
Port on which to listen for HTTPS. If not specified, then Discovery Data Tool will not serve HTTPS.
keystore_file
Path to a Java Key Store file containing the server certificate and private key. Required for HTTPS.
keystore_pass
Password for the Java Key Store file containing the server certificate and private key. Required for HTTPS.
key_pass
Password for the server private key. Required if your key has a password.
truststore_file
Path to a Java Key Store file containing trusted client certificates. Required if you want to verify SSL client certificates.
truststore_pass
Password for the trust store. Required if using a truststore_file.
jvm.memory
Maximum heap to allocate in megabytes. Uses JVM default if not specified.
jvm.logging
Override for specifying the logging configuration. Example: -Djava.util.logging.config.file=/special_location/logging.properties. If unspecified and a logging.properties file exists in the same directory as datatool.properties, then that file is used; otherwise, the logging.properties from the Discovery Data Tool release directory is used.
validate_sql
Whether the Discovery Data Tool should validate its sqlPublishers’ SQL statements against the database. Although no results are retrieved, validation can slow application start-up. Defaults to true. Set to false to disable validation.
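
For example, a minimal datatool.properties might look like this; all values shown are illustrative:

http_port=8089
jvm.memory=512

# Uncomment to enable HTTPS (paths and passwords are placeholders).
#https_port=8443
#keystore_file=/path/to/datatool.jks
#keystore_pass=changeit
#key_pass=changeit

# Uncomment to skip SQL validation at start-up.
#validate_sql=false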