Kuha Document Store

Kuha Document Store is an HTTP backend API written in Python that serves Document Store records to multiple repo handlers. The Document Store uses MongoDB for persistent storage and provides multiple endpoints for managing the database documents.

Kuha Document Store is part of the open source software bundle Kuha2.

Features

REST API for full control

Kuha Document Store has a REST API that gives end users full control of the records stored in the Document Store. The REST API may be used to build functionality for specific needs, for example, to automatically update a record when the corresponding record changes in a third-party storage system.

With the REST API, users can submit their records to the Document Store using HTTP with a JSON payload.
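As a minimal sketch of such a submission, the following builds a POST request with the Python standard library. The base URL is an assumption taken from the documented default root (localhost:6001/v0), and the helper name build_create_request is hypothetical, not part of Kuha.

```python
import json
import urllib.request

# Assumption: the server is reachable at the documented default root.
BASE_URL = "http://localhost:6001/v0"


def build_create_request(collection, record, base_url=BASE_URL):
    """Build (but do not send) a POST request that creates a record."""
    body = json.dumps(record).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{collection}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_create_request("studies", {"study_number": "study_1"})

# To submit against a running server, uncomment:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Building the request separately from sending it makes the payload easy to inspect before anything is written to the Document Store.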

Flexible query support

Kuha Document Store provides an endpoint for selectively querying stored records. The Query API is used by client applications which are a part of the Kuha2 software bundle.

Dependencies & requirements

  • Python 3.8 or newer

  • MongoDB 3.4 or newer (License: GNU AGPL v3.0)

The software is continuously tested with supported Python versions.

MongoDB 3.4 is the first supported version, but the software is also known to work with 3.6, 4.2 and 5.0. Intermediate versions are most likely suitable.

Python packages

The following packages can be obtained from the Python Package Index.

  • motor (License: Apache License 2.0)

  • pymongo (License: Apache License 2.0)

  • tornado (License: Apache License 2.0)

  • Cerberus (License: ISC)

  • python-dateutil (License: Simplified BSD)

Kuha Common is a library used with Kuha2 software. It can be obtained from https://bitbucket.org/tietoarkisto/kuha_common

  • kuha_common (License: EUPL)

License

Kuha Document Store is available under the EUPL. See LICENSE.txt for the full license.

Configuration

The application can be configured with a configuration file, via command line arguments or by environment variables. All configuration options have default values. If a configuration option is specified in more than one place, then command line values override environment variables which override configuration file values which override defaults.
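The precedence order can be sketched with collections.ChainMap, which looks a key up in the first mapping that contains it. This only illustrates the rule; it is not how the application itself parses options, and all values below are hypothetical.

```python
from collections import ChainMap


def resolve_options(cli_args, env_vars, file_values, defaults):
    """Merge option sources; earlier mappings win over later ones."""
    return dict(ChainMap(cli_args, env_vars, file_values, defaults))


# Hypothetical values for illustration only.
defaults = {"port": 6001, "replica": []}
file_values = {"port": 6002}
env_vars = {"port": 6003}
cli_args = {}  # nothing given on the command line

options = resolve_options(cli_args, env_vars, file_values, defaults)
```

Here the environment variable wins for port because no command line value is given, while replica falls through to its default.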

The following lists some of the available configuration options. Use --help to list all available options.

-h, --help

Show help message and exit.

--print-configuration

Print active configuration and exit.

--port <port>

Port of Kuha document store database. Defaults to 6001. May also be controlled by setting environment variable: KUHA_DS_PORT.

--replica <replica host + port>

MongoDB replica host and port. Repeat for multiple replicas. May also be controlled by setting environment variable: KUHA_DS_DBREPLICAS.

Configuration file

Arguments that start with '--' (e.g. --document-store-port) can also be set in a configuration file. The lookup searches for the file in the current working directory and in the package directory. The default configuration file name is kuha_document_store.ini; a different file can be set via the configuration option --config.

Note

Invoke with --help to print out config file lookup paths.

Environment variables

If the program is run using the scripts provided in the scripts subdirectory, the runtime environment can be controlled via scripts/runtime_env. That file is created at installation time by scripts/install_kuha_document_store_virtualenv.sh, which copies it from scripts/runtime_env.dist.

Running the program

This guide uses the convenience scripts from the scripts subdirectory. It assumes that the program was installed using scripts/install_kuha_document_store_virtualenv.sh.

Run Document Store server:

./scripts/run_kuha_document_store.sh

The script sources scripts/runtime_env and activates the installed virtualenv. Finally, it calls kuha_ds_serve with the given command line arguments.

HTTP API endpoints

Root for the requests is configurable and defaults to localhost:6001/v0. Every endpoint will return HTTP status code 500 on internal errors.

Note

Responses with multiple objects will be streamed one object at a time.
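A client therefore needs to read several JSON objects from one response body. The sketch below assumes the objects arrive as concatenated JSON values, optionally separated by whitespace; the exact framing is not specified here, so treat that as an assumption. The function name iter_json_objects is hypothetical.

```python
import json


def iter_json_objects(text):
    """Yield each JSON value from a string of concatenated JSON values."""
    decoder = json.JSONDecoder()
    position = 0
    length = len(text)
    while position < length:
        # raw_decode does not skip leading whitespace, so skip it here.
        while position < length and text[position].isspace():
            position += 1
        if position >= length:
            break
        obj, position = decoder.raw_decode(text, position)
        yield obj


# Hypothetical response body containing two streamed objects.
sample = '{"study_number": "study_1"}\n{"study_number": "study_2"}'
records = list(iter_json_objects(sample))
```

Calling json.loads on the whole body would fail in this situation, because the body as a whole is not a single JSON document.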

REST API

Document Store REST API provides CRUD support to the underlying documents.

GET /(collection)/(document_id)

Get an object from a collection by document_id. If document_id is not given, the endpoint returns all objects in the collection.

Parameters
  • collection (str) – Document collection. One of studies, variables, questions or study_groups.

  • document_id (str) – Optional document ID. 24-character hex string.
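A client may want to validate these parameters before sending a request. The following sketch checks the collection name and the 24-character hex format described above. The helper resource_path is hypothetical, and accepting uppercase hex digits is an assumption.

```python
import re

# Collection names as listed in the parameter description.
COLLECTIONS = ("studies", "variables", "questions", "study_groups")

# Assumption: both lowercase and uppercase hex digits are accepted.
DOCUMENT_ID_PATTERN = re.compile(r"[0-9a-fA-F]{24}")


def resource_path(collection, document_id=None):
    """Return the request path for a collection or a single document."""
    if collection not in COLLECTIONS:
        raise ValueError(f"unknown collection: {collection!r}")
    if document_id is None:
        return f"/{collection}"
    if not DOCUMENT_ID_PATTERN.fullmatch(document_id):
        raise ValueError(f"invalid document_id: {document_id!r}")
    return f"/{collection}/{document_id}"
```

For example, resource_path("studies") addresses the whole collection, while passing a document_id addresses one document.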

Status Codes

POST /studies

Create a new object in the studies collection from the JSON request body.

Example request:

POST /studies HTTP/1.1
Content-Type: application/json

{"study_number": "study_1"}

Example response:

HTTP/1.1 201 Created
Content-Type:  application/json; charset=UTF-8

{
    "result": "insert_successful",
    "error": null,
    "affected_resource": "5a82e76e6fb71d06fef00e69"
}

Request Headers
Request JSON Object
  • study_number (string) – Required study number. Used as an identifier. Must be unique within collection.

Response JSON Object
  • result (string) – Operation outcome.

  • error (string) – Errors during operation.

  • affected_resource (string) – document_id of the created object.

Status Codes

POST /variables

Create a new object in the variables collection from the JSON request body.

Example request:

POST /variables HTTP/1.1
Content-Type: application/json

{
    "study_number": "study_1",
    "variable_name": "variable_1"
}

Example response:

HTTP/1.1 201 Created
Content-Type:  application/json; charset=UTF-8

{
    "result": "insert_successful",
    "error": null,
    "affected_resource": "5a82ecf16fb71d06fef00e6a"
}

Request Headers
Request JSON Object
  • study_number (string) – Required study number. Used as an identifier combined with variable_name. Their combination must be unique within collection.

  • variable_name (string) – Required variable name. Used as an identifier combined with study_number. Their combination must be unique within collection.

Response JSON Object
  • result (string) – Operation outcome.

  • error (string) – Errors during operation.

  • affected_resource (string) – document_id of the created object.

Status Codes

POST /questions

Create a new object in the questions collection from the JSON request body.

Example request:

POST /questions HTTP/1.1
Content-Type: application/json

{
    "study_number": "study_1",
    "question_identifier": "question_1"
}

Example response:

HTTP/1.1 201 Created
Content-Type:  application/json; charset=UTF-8

{
    "result": "insert_successful",
    "error": null,
    "affected_resource": "5a82ee1a6fb71d06fef00e6b"
}

Request Headers
Request JSON Object
  • study_number (string) – Required study number. Used as an identifier combined with question_identifier. Their combination must be unique within collection.

  • question_identifier (string) – Required question identifier. Used as an identifier combined with study_number. Their combination must be unique within collection.

Response JSON Object
  • result (string) – Operation outcome.

  • error (string) – Errors during operation.

  • affected_resource (string) – document_id of the created object.

Status Codes

POST /study_groups

Create a new object in the study_groups collection from the JSON request body.

Example request:

POST /study_groups HTTP/1.1
Content-Type: application/json

{
    "study_group_identifier": "study_group_1"
}

Example response:

HTTP/1.1 201 Created
Content-Type:  application/json; charset=UTF-8

{
    "result": "insert_successful",
    "error": null,
    "affected_resource": "5a82ee876fb71d06fef00e6c"
}

Request Headers
Request JSON Object
  • study_group_identifier (string) – Required. Used as an identifier and must be unique within collection.

Response JSON Object
  • result (string) – Operation outcome.

  • error (string) – Errors during operation.

  • affected_resource (string) – document_id of the created object.

Status Codes

DELETE /(collection)/(document_id)

Delete a single document or all documents within a collection. If the optional document_id is left out, all documents within the collection are deleted.

Query Parameters
  • delete_type (string) –

    Optional delete_type parameter. Defaults to soft.

    • soft is the default delete type. It results in logically deleted records.

    • hard will physically delete records.

Response JSON Object
  • result (string) – Operation outcome.

  • error (string) – Errors during operation.

  • affected_resource (string) – document_id of the deleted object, or the number of deleted documents if document_id is not given in the request.

Status Codes
  • 200 OK – Delete successful.

  • 404 Not Found – Resource not found.

  • 409 Conflict – Attempted a logical delete on a record that is already logically deleted.
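The delete_type parameter and the status codes above can be handled on the client side roughly as follows. This is a sketch with hypothetical helper names (delete_url, describe_delete); the 500 entry comes from the general statement that every endpoint returns 500 on internal errors.

```python
from urllib.parse import urlencode

# Status codes documented for DELETE, plus the generic 500 for internal errors.
DELETE_OUTCOMES = {
    200: "deleted",
    404: "not found",
    409: "already logically deleted",
    500: "internal error",
}


def delete_url(base_url, collection, document_id=None, delete_type="soft"):
    """Build the URL for a DELETE request with an explicit delete_type."""
    if delete_type not in ("soft", "hard"):
        raise ValueError(f"unsupported delete_type: {delete_type!r}")
    path = f"/{collection}"
    if document_id is not None:
        path = f"{path}/{document_id}"
    query = urlencode({"delete_type": delete_type})
    return f"{base_url}{path}?{query}"


def describe_delete(status_code):
    """Map a response status code to a short description."""
    return DELETE_OUTCOMES.get(status_code, "unexpected status")
```

Note that omitting document_id targets every document in the collection, so a caller may want to require it explicitly before issuing a hard delete.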

PUT /(collection)/(document_id)

Replace a document within a collection. See Documents for information on the payload.

Note

Leave the _metadata field out of the request to let the Document Store handle the updated timestamp automatically.
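When a previously fetched document is edited and sent back, the _metadata field can be dropped first. A minimal sketch, with a hypothetical helper name and a hypothetical fetched document:

```python
def without_metadata(document):
    """Return a shallow copy of the document without its _metadata field."""
    return {key: value for key, value in document.items() if key != "_metadata"}


# Hypothetical document as it might be returned by a GET request.
fetched = {
    "study_number": "study_1",
    "_metadata": {
        "updated": "2018-01-31T11:37:34Z",
        "cmm_type": "study",
        "created": "2018-01-31T11:37:27Z",
    },
}

payload = without_metadata(fetched)
```

The original dictionary is left untouched, so the fetched timestamps remain available for comparison after the update.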

Request Headers
Response JSON Object
  • result (string) – Operation outcome.

  • error (string) – Errors during operation.

  • affected_resource (string) – document_id of the replaced object.

Status Codes

Query API

Query documents or information on documents from collection.

POST /query/(collection)

Execute query against collection and return results in JSON.

Request Headers
Query Parameters
  • query_type (string) –

    Optional query parameter to select the query type.

    • select is the default query type. It returns all documents found by filter.

    • count returns the number of documents which match the filter.

    • distinct returns the distinct values of a given field among the documents that match the filter.

Request JSON Object
  • _filter (object) – Query filter. Used for all query types. Request may specify multiple filter conditions.

Example request with multiple filter conditions:

POST /query/variables HTTP/1.1
Content-Type: application/json

{
    "_filter": {
        "study_number": "study_1",
        "variable_name": "variable_1"
    }
}
Request JSON Object
  • fields (array) – Optional. Select returned fields. Used in select query type. _id will always be returned. If not set, full document will be returned.

  • skip (int) – Optional. Skip documents from the beginning. Used in select query type.

  • limit (int) – Optional. Limit the number of returned documents. Used in select query type.

  • sort_by (string) – Optional. Sort the queried documents by field. Used in select query type.

  • sort_order (int) –

    Optional. Sort order of the queried documents. Used in select query type.

    • 1: Ascending sort order.

    • -1: Descending sort order.

  • fieldname (string) – Mandatory for distinct query type. Return distinct values for this field.

Result depends on the requested query_type.
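A request body for the select query type can be assembled from the parameters above. In this sketch the function name select_query is hypothetical, and leaving unset parameters out of the body entirely is an assumption about what the server accepts.

```python
def select_query(_filter=None, fields=None, skip=None, limit=None,
                 sort_by=None, sort_order=None):
    """Build the JSON body for a select query, omitting unset parameters."""
    if sort_order not in (None, 1, -1):
        raise ValueError("sort_order must be 1 (ascending) or -1 (descending)")
    candidates = {
        "_filter": _filter,
        "fields": fields,
        "skip": skip,
        "limit": limit,
        "sort_by": sort_by,
        "sort_order": sort_order,
    }
    return {key: value for key, value in candidates.items() if value is not None}


body = select_query(
    _filter={"study_number": "study_1"},
    fields=["variable_name"],
    limit=10,
    sort_by="variable_name",
    sort_order=-1,
)
```

The resulting dictionary can be serialized with json.dumps and posted to /query/variables.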

JSON response for select query-type

Results will be streamed one object at a time. The object is a document with requested fields.

JSON response for count query-type

Response JSON Object
  • count (int) – Number of documents found with _filter.

JSON response for distinct query-type

If fieldname points to a leaf node of the document, the response is in the following format.

HTTP/1.1 200 OK
Content-Type:  application/json; charset=UTF-8

{
    "<fieldname>": ["<list-of-distinct-values>"]
}

If fieldname points to a branch node of the document, the response is in the following format.

HTTP/1.1 200 OK
Content-Type:  application/json; charset=UTF-8

{
    "<fieldname>": ["<list-of-distinct-objects>"]
}

Example requests and responses for distinct query-type

POST /query/studies?query_type=distinct HTTP/1.1
Content-Type: application/json

{
    "fieldname": "_metadata.updated"
}

HTTP/1.1 200 OK
Content-Type:  application/json; charset=UTF-8

{
    "_metadata.updated": [
        "2018-02-13T13:49:37Z",
        "2018-02-08T10:55:41Z"
    ]
}

POST /query/studies?query_type=distinct HTTP/1.1
Content-Type: application/json

{
    "fieldname": "_metadata"
}

HTTP/1.1 200 OK
Content-Type:  application/json; charset=UTF-8

{
    "_metadata": [
        {
            "updated": "2017-11-09T12:07:48Z",
            "cmm_type": "study",
            "created": "2017-11-09T11:06:03Z"
        },
        {
            "updated": "2017-11-09T11:37:16Z",
            "cmm_type": "study",
            "created": "2017-11-09T11:37:16Z"
        }
    ]
}

Note

Distinct queries for datetime fields will not work as expected because MongoDB and Document Store JSON use different precision. MongoDB stores datetimes with millisecond precision, while Document Store JSON supports second precision.
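One plausible consequence of this mismatch is sketched below: two timestamps that differ only in their milliseconds are distinct values in the database but become identical once formatted with second precision. The formatting function is an assumption modelled on the timestamps shown in the examples, not the Document Store's actual serializer.

```python
from datetime import datetime, timezone


def to_second_precision(value):
    """Format an aware datetime as UTC with second precision."""
    return value.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# Two hypothetical timestamps that differ only in their milliseconds.
first = datetime(2018, 2, 13, 13, 49, 37, 120000, tzinfo=timezone.utc)
second = datetime(2018, 2, 13, 13, 49, 37, 870000, tzinfo=timezone.utc)

distinct_with_milliseconds = {first, second}
distinct_with_seconds = {to_second_precision(first), to_second_precision(second)}
```

Under this assumption, a distinct query could return what looks like the same value more than once, which is why results for datetime fields should be deduplicated or treated with care on the client.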

Status Codes

Documents

Documents are objects stored in a collection. Documents support four different types of fields:

  1. key-value pair:

    {"study_number": "1200"}
    
  2. contained key-value pairs:

    {"_metadata": {
        "updated": "2018-01-31T11:37:34Z",
        "cmm_type": "study",
        "created": "2018-01-31T11:37:27Z"
        }
    }
    
  3. localized contained key-value pairs:

    {"study_titles": [
       {
           "language": "en",
           "study_title": "Study 1983"
       },
       {
           "language": "fi",
           "study_title": "Tutkimus 1983"
       }
    ]}
    
  4. list of unique values:

    {"study_numbers": ["1210", "3134", "1175", "2290", "2498"]}
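Reading a localized field usually means picking the entry for one language. The sketch below does that for the study_titles example above; the helper localized_value and its fallback behaviour are hypothetical.

```python
def localized_value(entries, field, language, fallback=None):
    """Return the value of field from the entry matching language."""
    for entry in entries:
        if entry.get("language") == language and field in entry:
            return entry[field]
    return fallback


document = {
    "study_titles": [
        {"language": "en", "study_title": "Study 1983"},
        {"language": "fi", "study_title": "Tutkimus 1983"},
    ]
}

finnish_title = localized_value(document["study_titles"], "study_title", "fi")
swedish_title = localized_value(document["study_titles"], "study_title", "sv")
```

Returning a fallback instead of raising keeps the caller free to decide how to handle a missing translation.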