Kuha Document Store

Kuha Document Store is an HTTP backend API written in Python for serving Document Store records to multiple repo handlers. The Document Store uses MongoDB for persistent storage and provides multiple endpoints for managing the database documents.

Kuha Document Store is part of the open source software bundle Kuha2.

Features

Import records from DDI XML

Kuha Document Store can import multiple records at once: submit a DDI file to an import endpoint, and the Document Store imports all records found in the file, inserting new records and updating existing ones.

REST API for full control

Kuha Document Store has a REST API that gives end users full control of the records stored in the Document Store. The REST API may be used to build functionality for specific needs, for example to automatically update a record when it changes in a third-party storage system.

With the REST API, end users are not tied to using DDI, but may use arbitrary metadata formats and submit their records to the Document Store using HTTP with a JSON payload.

Flexible query support

Kuha Document Store provides an endpoint for selectively querying stored records. The Query API is used by client applications which are a part of the Kuha2 software bundle.

Dependencies & requirements

  • Python 3.5 or newer
  • MongoDB 3.4 or newer (License: GNU AGPL v3.0)
  • Recommended: python3-venv 3.5.1 or newer

The software is continuously tested against Python versions 3.5, 3.6, 3.7, 3.8 and 3.9.

MongoDB 3.4 is the first supported version, but the software is also known to work with 3.6 and 4.2. Intermediate versions are most likely suitable.

Python packages

The following packages can be obtained from the Python Package Index.

  • motor (License: Apache License 2.0)
  • pymongo (License: Apache License 2.0)
  • tornado (License: Apache License 2.0)
  • Cerberus (License: ISC)
  • python-dateutil (License: Simplified BSD)

Kuha Common is a library used with Kuha2 software. It can be obtained from https://bitbucket.org/tietoarkisto/kuha_common

  • kuha_common (License: EUPL)

License

Kuha Document Store is available under the EUPL. See LICENSE.txt for the full license.

Configuration

The application can be configured with a configuration file, via command line arguments or by environment variables. All configuration options have default values. If a configuration option is specified in more than one place, then command line values override environment variables which override configuration file values which override defaults.
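The precedence rule can be illustrated with a minimal sketch. This is not the application's own configuration code; the `resolve` helper and the option keys are hypothetical, and only the default values 6001 and localhost come from this document. Real environment variables arrive as strings, which the sketch ignores for brevity.

```python
from collections import ChainMap

# Hypothetical subset of defaults; the values are the documented ones.
DEFAULTS = {"document_store_port": 6001, "database_host": "localhost"}


def resolve(cli=None, env=None, config_file=None):
    """Merge option sources so that earlier sources win.

    ChainMap looks keys up from left to right, which mirrors the
    documented order: command line > environment > config file > defaults.
    """
    return dict(ChainMap(cli or {}, env or {}, config_file or {}, DEFAULTS))


settings = resolve(
    cli={"document_store_port": 7001},
    env={"document_store_port": 6500, "database_host": "db.example.org"},
    config_file={"database_host": "10.0.0.5"},
)
print(settings)
```

Here the port comes from the command line and the host from the environment, because each is the highest-priority source that sets the option.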

The following configuration options are available:

-h, --help

Show help message and exit.

--print-configuration

Print active configuration and exit.

--document-store-port <port>

Port on which the Document Store server listens. Defaults to 6001. May also be controlled by setting environment variable: KUHA_DS_PORT.

--document-store-api-version <api_version>

API version for the Document Store. This gets prepended to the URL path. Defaults to v0. May also be controlled by setting environment variable: KUHA_DS_API_VERSION.

--database-host <database_host>

Host/IP of the Document Store database. Defaults to localhost. May also be controlled by setting environment variable: KUHA_DS_DBHOST.

--database-port <port>

Port of the Document Store database. Defaults to 27017. May also be controlled by setting environment variable: KUHA_DS_DBPORT.

--database-name <name>

Name of the Document Store database. Defaults to kuha_document_store. May also be controlled by setting environment variable: KUHA_DS_DBNAME.

--database-user-reader <user>

Username for the database user having read-only rights. Defaults to reader. May also be controlled by setting environment variable: KUHA_DS_DBUSER_READER.

--database-pass-reader <password>

Password for the database user having read-only rights. Defaults to reader. May also be controlled by setting environment variable: KUHA_DS_DBPASS_READER.

--database-user-editor <user>

Username for the database user having editing rights. Defaults to editor. May also be controlled by setting environment variable: KUHA_DS_DBUSER_EDITOR.

--database-pass-editor <password>

Password for the database user having editing rights. Defaults to editor. May also be controlled by setting environment variable: KUHA_DS_DBPASS_EDITOR.

--loglevel <loglevel>

Lowest level of log messages that are output. Valid values are the levels supported by Python's logging module: CRITICAL, ERROR, WARNING, INFO, DEBUG. Defaults to INFO. May also be controlled by setting environment variable: KUHA_LOGLEVEL.

--logformat <logformat>

Log message format, as supported by Python's logging module. Defaults to %(asctime)s %(levelname)s(%(name)s): %(message)s. May also be controlled by setting environment variable: KUHA_LOGFORMAT.
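To see what the default format produces, the following sketch formats one record with Python's standard logging module. The logger name kuha_document_store, the message text and the `format_example` helper are assumptions made for illustration only.

```python
import logging

# Default format string as documented above.
DEFAULT_LOGFORMAT = "%(asctime)s %(levelname)s(%(name)s): %(message)s"


def format_example(level_name="INFO"):
    """Format a single hand-made log record with the default format."""
    formatter = logging.Formatter(DEFAULT_LOGFORMAT)
    record = logging.LogRecord(
        name="kuha_document_store",        # hypothetical logger name
        level=getattr(logging, level_name),
        pathname="example.py",
        lineno=0,
        msg="server started",              # hypothetical message
        args=None,
        exc_info=None,
    )
    return formatter.format(record)


print(format_example("INFO"))
```

The output is a timestamp followed by `INFO(kuha_document_store): server started`.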

Configuration file

Arguments that start with '--' (e.g. --document-store-port) can also be set in a configuration file. The configuration file is looked up in the current working directory and in the package directory. The name of the configuration file is kuha_document_store.ini.

Note

Invoke with --help to print out config file lookup paths.

Environment variables

If the program is run using the scripts provided in the scripts subdirectory, the runtime environment can be controlled via scripts/runtime_env. This file is created at installation time by scripts/install_kuha_document_store_virtualenv.sh, which copies it from scripts/runtime_env.dist.

Running the program

This guide uses the convenience scripts from the scripts subdirectory. It is assumed that the program was installed by using scripts/install_kuha_document_store_virtualenv.sh.

Run Document Store server:

./scripts/run_kuha_document_store.sh

The script sources scripts/runtime_env and activates the installed virtualenv. Finally, it calls kuha_ds_serve with the given command line arguments.

HTTP API endpoints

Root for the requests is configurable and defaults to localhost:6001/v0. Every endpoint will return HTTP status code 500 on internal errors.

Note

Responses with multiple objects will be streamed one object at a time.

REST API

The Document Store REST API provides CRUD support for the underlying documents.
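As a rough client-side illustration of the endpoints described in this section, the sketch below builds requests with Python's standard library. It assumes the default root localhost:6001/v0 served over plain HTTP; the `build_request` helper is hypothetical and not part of Kuha. Nothing is sent until a request is passed to `urllib.request.urlopen`.

```python
import json
import urllib.request

BASE_URL = "http://localhost:6001/v0"  # documented default root; adjust as needed


def build_request(method, collection, document_id=None, payload=None):
    """Build (but do not send) a request for /(collection)/(document_id)."""
    url = "{}/{}".format(BASE_URL, collection)
    if document_id is not None:
        url += "/{}".format(document_id)
    data = None
    headers = {}
    if payload is not None:
        # JSON payloads are sent as UTF-8 encoded request bodies.
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(url, data=data, headers=headers, method=method)


create = build_request("POST", "studies", payload={"study_number": "study_1"})
fetch = build_request("GET", "studies", "5a82e76e6fb71d06fef00e69")
remove = build_request("DELETE", "studies", "5a82e76e6fb71d06fef00e69")
# To send against a running server: urllib.request.urlopen(create)
```

Keeping request construction separate from sending makes the URL and payload easy to inspect before touching a real database.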

GET /(collection)/(document_id)

Get an object from collection by document_id. If document_id is not given, the endpoint returns all objects in the collection.

Parameters:
  • collection (str) – Document collection. One of studies, variables, questions or study_groups.
  • document_id (str) – Optional document ID. A 24-character hex string.

POST /studies

Create a new object in the studies collection from the JSON request body.

Example request:

POST /studies HTTP/1.1
Content-Type: application/json

{"study_number": "study_1"}

Example response:

HTTP/1.1 201 Created
Content-Type:  application/json; charset=UTF-8

{
    "result": "insert_successful",
    "error": null,
    "affected_resource": "5a82e76e6fb71d06fef00e69"
}

Request JSON Object:
  • study_number (string) – Required study number. Used as an identifier. Must be unique within the collection.
Response JSON Object:
  • result (string) – Operation outcome.
  • error (string) – Errors during operation.
  • affected_resource (string) – document_id of the created object.

POST /variables

Create a new object in the variables collection from the JSON request body.

Example request:

POST /variables HTTP/1.1
Content-Type: application/json

{
    "study_number": "study_1",
    "variable_name": "variable_1"
}

Example response:

HTTP/1.1 201 Created
Content-Type:  application/json; charset=UTF-8

{
    "result": "insert_successful",
    "error": null,
    "affected_resource": "5a82ecf16fb71d06fef00e6a"
}

Request JSON Object:
  • study_number (string) – Required study number. Used as an identifier combined with variable_name. Their combination must be unique within the collection.
  • variable_name (string) – Required variable name. Used as an identifier combined with study_number. Their combination must be unique within the collection.
Response JSON Object:
  • result (string) – Operation outcome.
  • error (string) – Errors during operation.
  • affected_resource (string) – document_id of the created object.

POST /questions

Create a new object in the questions collection from the JSON request body.

Example request:

POST /questions HTTP/1.1
Content-Type: application/json

{
    "study_number": "study_1",
    "question_identifier": "question_1"
}

Example response:

HTTP/1.1 201 Created
Content-Type:  application/json; charset=UTF-8

{
    "result": "insert_successful",
    "error": null,
    "affected_resource": "5a82ee1a6fb71d06fef00e6b"
}

Request JSON Object:
  • study_number (string) – Required study number. Used as an identifier combined with question_identifier. Their combination must be unique within the collection.
  • question_identifier (string) – Required question identifier. Used as an identifier combined with study_number. Their combination must be unique within the collection.
Response JSON Object:
  • result (string) – Operation outcome.
  • error (string) – Errors during operation.
  • affected_resource (string) – document_id of the created object.

POST /study_groups

Create a new object in the study_groups collection from the JSON request body.

Example request:

POST /study_groups HTTP/1.1
Content-Type: application/json

{
    "study_group_identifier": "study_group_1"
}

Example response:

HTTP/1.1 201 Created
Content-Type:  application/json; charset=UTF-8

{
    "result": "insert_successful",
    "error": null,
    "affected_resource": "5a82ee876fb71d06fef00e6c"
}

Request JSON Object:
  • study_group_identifier (string) – Required. Used as an identifier and must be unique within the collection.
Response JSON Object:
  • result (string) – Operation outcome.
  • error (string) – Errors during operation.
  • affected_resource (string) – document_id of the created object.

DELETE /(collection)/(document_id)

Delete a single document or all documents within collection. If the optional document_id is left out, all documents within the collection are deleted.

Response JSON Object:
  • result (string) – Operation outcome.
  • error (string) – Errors during operation.
  • affected_resource (string) – document_id of the deleted object, or the number of deleted documents if document_id is not given in the request.

PUT /(collection)/(document_id)

Replace a document within collection. See Documents for information on the payload.

Note:

Leave the _metadata field out of the request to let the Document Store handle the updated timestamp automatically.

Response JSON Object:
  • result (string) – Operation outcome.
  • error (string) – Errors during operation.
  • affected_resource (string) – document_id of the replaced object.

Import API

Documents may be imported to Document Store by using the import API. See Importers for more information on how documents are parsed.

POST /import/(importer_id)/(collection)

Import a document using the importer specified with importer_id. An optional collection may be given to limit the import to a specific collection. If collection is not given, the importer imports to all collections that are applicable to the posted file.

Note:

Importer API may only be used to initially import documents to the database. After the initial import the documents may be updated via the REST API.

Parameters:
  • importer_id (str) – Mandatory parameter to select the importer.

    • ddi_31 imports a DDI 3.1 file.
    • ddi_c imports a DDI 2.5 file.
    • ddi_122_nesstar imports a DDI 1.2.2 Nesstar file.
  • collection (str) – Optional parameter to limit the import to specific collection. One of studies, variables, questions or study_groups.
Request body:

DDI file contents

Response JSON Object:
  • result (string) – Operation result.
  • imported_docs (array) – Imported document IDs.
  • error (string) – Errors found during import.
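A client-side sketch of an import request, using only the standard library. It assumes the default root localhost:6001/v0; the `build_import_request` helper is hypothetical, and the Content-Type value is an assumption, since the required request headers are not listed here. Check it against your server before relying on it.

```python
import urllib.request

BASE_URL = "http://localhost:6001/v0"  # documented default root
IMPORTERS = ("ddi_31", "ddi_c", "ddi_122_nesstar")
COLLECTIONS = ("studies", "variables", "questions", "study_groups")


def build_import_request(importer_id, ddi_bytes, collection=None,
                         content_type="application/xml"):
    """Build (but do not send) a POST to /import/(importer_id)/(collection).

    content_type is an ASSUMPTION; this document does not state which
    request headers the import endpoint expects.
    """
    if importer_id not in IMPORTERS:
        raise ValueError("unknown importer_id: {}".format(importer_id))
    if collection is not None and collection not in COLLECTIONS:
        raise ValueError("unknown collection: {}".format(collection))
    url = "{}/import/{}".format(BASE_URL, importer_id)
    if collection is not None:
        url += "/{}".format(collection)
    return urllib.request.Request(
        url, data=ddi_bytes, headers={"Content-Type": content_type}, method="POST")


# Placeholder bytes standing in for the contents of a real DDI 2.5 file.
request = build_import_request("ddi_c", b"<codeBook/>", collection="studies")
```

Validating importer_id and collection on the client side catches typos before any data is posted.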

Query API

Query documents or information on documents from collection.

POST /query/(collection)

Execute query against collection and return results in JSON.

Query Parameters:
  • query_type (string) – Optional query parameter to select the query type.

    • select is the default query type. It returns all documents that match the filter.
    • count returns the number of documents that match the filter.
    • distinct returns the distinct values of a given field among the documents that match the filter.
Request JSON Object:
  • _filter (object) – Query filter. Used for all query types. A request may specify multiple filter conditions.

Example request with multiple filter conditions:

POST /query/variables HTTP/1.1
Content-Type: application/json

{
    "_filter": {
        "study_number": "study_1",
        "variable_name": "variable_1"
    }
}

Request JSON Object:
  • fields (array) – Optional. Selects the returned fields. Used in select query type. _id is always returned. If not set, the full document is returned.
  • skip (int) – Optional. Skip documents from the beginning. Used in select query type.
  • limit (int) – Optional. Limit the number of returned documents. Used in select query type.
  • sort_by (string) – Optional. Sort the queried documents by field. Used in select query type.
  • sort_order (int) – Optional. Sort order of the queried documents. Used in select query type.

    • 1: Ascending sort order.
    • -1: Descending sort order.
  • fieldname (string) – Mandatory for distinct query type. Return distinct values for this field.

Result depends on the requested query_type.
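The request parameters above can be combined as in the following sketch, which builds a query request with the standard library. It assumes the default root localhost:6001/v0; the `build_query_request` helper and its client-side checks are hypothetical conveniences, not part of the Query API itself.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:6001/v0"  # documented default root
QUERY_TYPES = ("select", "count", "distinct")


def build_query_request(collection, query_type="select", **body):
    """Build (but do not send) a POST request for /query/(collection)."""
    if query_type not in QUERY_TYPES:
        raise ValueError("unknown query_type: {}".format(query_type))
    if query_type == "distinct" and "fieldname" not in body:
        raise ValueError("fieldname is mandatory for distinct queries")
    url = "{}/query/{}?{}".format(
        BASE_URL, collection, urllib.parse.urlencode({"query_type": query_type}))
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Select the names of the first ten variables of one study, in ascending order.
request = build_query_request(
    "variables",
    _filter={"study_number": "study_1"},
    fields=["variable_name"],
    sort_by="variable_name",
    sort_order=1,
    limit=10,
)
```

A count query would reuse the same helper with `query_type="count"` and only a `_filter`.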

JSON response for select query-type

Results will be streamed one object at a time. The object is a document with requested fields.
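How the streamed objects are delimited is not specified here. The sketch below assumes the response body is a sequence of JSON objects written back to back, optionally separated by whitespace, and splits an already received text into documents. The `iter_json_objects` helper is hypothetical; adapt it if the server frames objects differently.

```python
import json


def iter_json_objects(text):
    """Yield each JSON value from a string of concatenated JSON values.

    ASSUMPTION: objects are written one after another with optional
    whitespace in between; the exact framing is not documented here.
    """
    decoder = json.JSONDecoder()
    position = 0
    end = len(text)
    while position < end:
        # Skip whitespace between objects; raw_decode does not accept it.
        while position < end and text[position].isspace():
            position += 1
        if position >= end:
            break
        document, position = decoder.raw_decode(text, position)
        yield document


body = '{"_id": "a", "study_number": "study_1"}{"_id": "b", "study_number": "study_2"}\n'
documents = list(iter_json_objects(body))
```

This works on buffered text. A truly incremental reader would additionally need to keep incomplete trailing data until the next chunk arrives.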

JSON response for count query-type

Response JSON Object:
  • count (int) – Number of documents found with _filter.

JSON response for distinct query-type

If fieldname points to a leaf node of the document, the response has the following format.

HTTP/1.1 200 OK
Content-Type:  application/json; charset=UTF-8

{
    "<fieldname>": ["<list-of-distinct-values>"]
}

If fieldname points to a branch node of the document, the response has the following format.

HTTP/1.1 200 OK
Content-Type:  application/json; charset=UTF-8

{
    "<fieldname>": ["<list-of-distinct-objects>"]
}

Example requests and responses for distinct query-type

POST /query/studies?query_type=distinct HTTP/1.1
Content-Type: application/json

{
    "fieldname": "_metadata.updated"
}

HTTP/1.1 200 OK
Content-Type:  application/json; charset=UTF-8

{
    "_metadata.updated": [
        "2018-02-13T13:49:37Z",
        "2018-02-08T10:55:41Z"
    ]
}

POST /query/studies?query_type=distinct HTTP/1.1
Content-Type: application/json

{
    "fieldname": "_metadata"
}

HTTP/1.1 200 OK
Content-Type:  application/json; charset=UTF-8

{
    "_metadata": [
        {
            "updated": "2017-11-09T12:07:48Z",
            "cmm_type": "study",
            "created": "2017-11-09T11:06:03Z"
        },
        {
            "updated": "2017-11-09T11:37:16Z",
            "cmm_type": "study",
            "created": "2017-11-09T11:37:16Z"
        }
    ]
}

Note

Distinct queries on datetime fields may not work as expected, because MongoDB and the Document Store JSON use different precision: MongoDB stores datetimes with millisecond precision, while Document Store JSON supports only second precision.
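One plausible consequence of this precision mismatch is shown in the sketch below: two timestamps that differ only in milliseconds are distinct values in the database but serialize to the same string. The serialization helper is hypothetical; it only mimics the timestamp format used in the examples above.

```python
from datetime import datetime, timezone


def to_document_store_json(value):
    """Serialize a datetime with second precision, dropping milliseconds."""
    return value.strftime("%Y-%m-%dT%H:%M:%SZ")


# Two values that differ only in their millisecond part.
first = datetime(2018, 2, 13, 13, 49, 37, 120000, tzinfo=timezone.utc)
second = datetime(2018, 2, 13, 13, 49, 37, 870000, tzinfo=timezone.utc)

distinct_in_database = {first, second}
distinct_in_json = {to_document_store_json(first), to_document_store_json(second)}

print(len(distinct_in_database), len(distinct_in_json))
```

Under this assumption a distinct query could return what looks like the same timestamp more than once.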

Documents

Documents are objects stored in a collection. Documents support four different types of fields:

  1. key-value pair:

    {"study_number": "1200"}
    
  2. contained key-value pairs:

    {"_metadata": {
        "updated": "2018-01-31T11:37:34Z",
        "cmm_type": "study",
        "created": "2018-01-31T11:37:27Z"
        }
    }
    
  3. localized contained key-value pairs:

    {"study_titles": [
       {
           "language": "en",
           "study_title": "Study 1983"
       },
       {
           "language": "fi",
           "study_title": "Tutkimus 1983"
       }
    ]}
    
  4. list of unique values:

    {"study_numbers": ["1210", "3134", "1175", "2290", "2498"]}
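The four field types can be combined in a single document. The sketch below builds such a document as a Python dictionary, reusing the example values above, and reads a localized value from it. The `localized` helper is hypothetical, and the combined document is illustrative rather than a complete study record.

```python
import json

study = {
    # 1. key-value pair
    "study_number": "1200",
    # 2. contained key-value pairs
    "_metadata": {
        "updated": "2018-01-31T11:37:34Z",
        "cmm_type": "study",
        "created": "2018-01-31T11:37:27Z",
    },
    # 3. localized contained key-value pairs
    "study_titles": [
        {"language": "en", "study_title": "Study 1983"},
        {"language": "fi", "study_title": "Tutkimus 1983"},
    ],
    # 4. list of unique values
    "study_numbers": ["1210", "3134", "1175", "2290", "2498"],
}


def localized(entries, key, language, default=None):
    """Return the value of key from the entry written in the given language."""
    for entry in entries:
        if entry.get("language") == language:
            return entry.get(key, default)
    return default


finnish_title = localized(study["study_titles"], "study_title", "fi")
payload = json.dumps(study)
```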
    

Importers

There are importers for DDI 3.1, DDI 2.5 and DDI 1.2.2, which can be used to initially import DDI XML files to the Document Store.

The importer tries to update documents if they are already found in the database. However, this is not guaranteed to work properly in cases where an ID element for a field is not found in the DDI. Therefore, it is best to use the importer only for the initial import of records and afterwards use the REST API to update the documents.

The importer reads xml:lang attributes from the XML elements to get the language of each element's content. If an element has no xml:lang attribute, the language is read from the xml:lang attribute of the root XML element. If the root element has no xml:lang attribute either, the content is assumed to be in English, and en is used for the language.
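The language fallback can be sketched with Python's standard XML parser. This is an illustration of the rule only, not the importer's actual code; the `element_language` helper and the tiny XML snippets are hypothetical and much simpler than a real DDI file.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def element_language(element, root, default="en"):
    """Resolve language: the element's xml:lang, then the root's, then English."""
    return element.get(XML_LANG) or root.get(XML_LANG) or default


root = ET.fromstring(
    '<codeBook xml:lang="fi">'
    '<titl xml:lang="en">Study 1983</titl>'
    '<titl>Tutkimus 1983</titl>'
    '</codeBook>'
)
languages = [element_language(child, root) for child in root]

# Without xml:lang on either the element or the root, "en" is assumed.
bare_root = ET.fromstring("<codeBook><titl>Study 1983</titl></codeBook>")
fallback = element_language(bare_root[0], bare_root)
```

The sketch checks only the element and the root, matching the rule as described; it does not look at intermediate ancestors.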