Kuha Document Store¶
Kuha Document Store is an HTTP backend API written in Python for serving Document Store records to multiple repo handlers. The Document Store uses MongoDB as persistent storage and provides multiple endpoints for managing the database documents.
Kuha Document Store is part of the open source software bundle Kuha2.
Features¶
Import records from DDI XML
Kuha Document Store can import multiple records at once: submit a DDI file to an import endpoint, and the Document Store imports all records found in the file, inserting new records and updating existing ones.
REST API for full control
Kuha Document Store has a REST API that gives end users full control of the records stored in the Document Store. The REST API may be used to build functionality for specific needs, for example, to automatically update a record when it is changed in a third-party storage system.
With the REST API, end users are not tied to using DDI, but may use arbitrary metadata formats and submit their records to Document Store using HTTP with JSON payload.
Flexible query support
Kuha Document Store provides an endpoint for selectively querying stored records. The Query API is used by client applications which are a part of the Kuha2 software bundle.
Dependencies & requirements¶
- Python 3.5 or newer
- MongoDB 3.4 or newer (License: GNU AGPL v3.0)
- Recommended: python3-venv 3.5.1 or newer
The software is continuously tested against Python versions 3.5, 3.6, 3.7, 3.8 and 3.9.
MongoDB 3.4 is the first supported version, but the software is also known to work with 3.6 and 4.2. Intermediate versions are most likely suitable.
Python packages
The following can be obtained from the Python Package Index.
- motor (License: Apache License 2.0)
- pymongo (License: Apache License 2.0)
- tornado (License: Apache License 2.0)
- Cerberus (License: ISC)
- python-dateutil (License: Simplified BSD)
Kuha Common is a library used with Kuha2 software. It can be obtained from https://bitbucket.org/tietoarkisto/kuha_common
- kuha_common (License: EUPL)
License¶
Kuha Document Store is available under the EUPL. See LICENSE.txt for the full license.
Configuration¶
The application can be configured with a configuration file, via command line arguments or by environment variables. All configuration options have default values. If a configuration option is specified in more than one place, then command line values override environment variables which override configuration file values which override defaults.
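The precedence described above can be sketched in a few lines of Python. This only illustrates the lookup order and is not the application's actual implementation: the helper `resolve_option` is hypothetical, and each source is modelled as a plain dictionary keyed by option name (real environment variables use names such as `KUHA_DS_PORT`).

```python
def resolve_option(name, defaults, config_file=None, environ=None, cli_args=None):
    """Return the effective value for ``name`` using the documented precedence.

    Hypothetical helper: every source is a dict keyed by option name.
    """
    # Highest priority first: command line, environment, config file, defaults.
    for source in (cli_args, environ, config_file, defaults):
        if source and name in source:
            return source[name]
    raise KeyError(name)


defaults = {"document-store-port": 6001}
config_file = {"document-store-port": 6002}
environ = {"document-store-port": 6003}

# The environment value wins over the configuration file and the default.
print(resolve_option("document-store-port", defaults, config_file, environ))
# Without an environment value, the configuration file wins over the default.
print(resolve_option("document-store-port", defaults, config_file))
```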
The following configuration options are available:
-
-h
,
--help
¶
Show help message and exit.
-
--print-configuration
¶
Print active configuration and exit.
-
--document-store-port
<port>
¶ Port of Kuha document store database. Defaults to 6001. May also be controlled by setting environment variable:
KUHA_DS_PORT
.
-
--document-store-api-version
<api_version>
¶ API version for document store. This gets prepended to the URL path. Defaults to
v0
. May also be controlled by setting environment variable: KUHA_DS_API_VERSION
.
-
--database-host
<database_host>
¶ Host/IP of the Document Store database. Defaults to
localhost
. May also be controlled by setting environment variable: KUHA_DS_DBHOST
-
--database-port
<port>
¶ Port of the Document Store database. Defaults to
27017
. May also be controlled by setting environment variable: KUHA_DS_DBPORT
-
--database-name
<name>
¶ Name of Document Store database. Defaults to
kuha_document_store
. May also be controlled by setting environment variable: KUHA_DS_DBNAME
-
--database-user-reader
<user>
¶ Username for database user having read-only rights. Defaults to
reader
. May also be controlled by setting environment variable: KUHA_DS_DBUSER_READER
-
--database-pass-reader
<password>
¶ Password for database user having read-only rights. Defaults to
reader
. May also be controlled by setting environment variable: KUHA_DS_DBPASS_READER
-
--database-user-editor
<user>
¶ Username for database user having editing rights. Defaults to
editor
. May also be controlled by setting environment variable: KUHA_DS_DBUSER_EDITOR
-
--database-pass-editor
<password>
¶ Password for database user having editing rights. Defaults to
editor
. May also be controlled by setting environment variable: KUHA_DS_DBPASS_EDITOR
-
--loglevel
<loglevel>
¶ Lowest logging level of log messages that get output. Valid values are logging levels supported by Python’s
logging
[CRITICAL,ERROR,WARNING,INFO,DEBUG]
. Defaults to INFO
. May also be controlled by setting environment variable: KUHA_LOGLEVEL
-
--logformat
<logformat>
¶ Logging format supported by
logging
. Defaults to %(asctime)s %(levelname)s(%(name)s): %(message)s
. May also be controlled by setting environment variable: KUHA_LOGFORMAT
Configuration file
Arguments that start with '--' (e.g. --document-store-port) can also be set
in a config file. The configuration file is looked up in the
current working directory and in the package directory.
The name of the configuration file is kuha_document_store.ini
.
Note
Invoke with --help
to print out config file lookup paths.
Environment variables
If the program is run using the scripts provided in the scripts
subdirectory, the runtime environment can be controlled via scripts/runtime_env
,
which is created by copying scripts/runtime_env.dist
at
installation time by scripts/install_kuha_document_store_virtualenv.sh
.
Running the program¶
This guide uses the convenience scripts from the scripts
subdirectory.
It is assumed that the program was installed by using
scripts/install_kuha_document_store_virtualenv.sh
.
Run Document Store server:
./scripts/run_kuha_document_store.sh
The script will source scripts/runtime_env
and activate the installed
virtualenv. Finally it calls kuha_ds_serve
, with given command line arguments.
HTTP API endpoints¶
Root for the requests is configurable and defaults to localhost:6001/v0
.
Every endpoint will return HTTP status code 500
on internal errors.
Note
Responses with multiple objects will be streamed one object at a time.
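A client has to split a streamed response back into individual objects. The sketch below assumes that the objects arrive as consecutive JSON values separated by optional whitespace; this framing is an assumption, since the exact stream format is not specified here. The helper name `iter_json_objects` is hypothetical.

```python
import json


def iter_json_objects(text):
    """Yield each JSON value found in ``text``, in order.

    Assumes the body is a sequence of JSON values separated by optional
    whitespace (an assumption about the stream format, not a documented fact).
    """
    decoder = json.JSONDecoder()
    position = 0
    while position < len(text):
        # raw_decode() does not skip leading whitespace, so skip it here.
        if text[position].isspace():
            position += 1
            continue
        obj, position = decoder.raw_decode(text, position)
        yield obj


body = '{"study_number": "study_1"}\n{"study_number": "study_2"}'
for document in iter_json_objects(body):
    print(document["study_number"])
```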
REST API¶
Document Store REST API provides CRUD support to the underlying documents.
-
GET
/
(collection)/
(document_id)¶ Get object from collection with optional document_id. If document_id is not given, endpoint will return all objects in collection.
Parameters: - collection (str) – Document collection. One of studies, variables, questions or study_groups.
- document_id (str) – Optional document ID. 24-character hex string.
Status Codes: - 200 OK – Success
- 400 Bad Request – Invalid parameters
- 404 Not Found – Resource not found
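As a rough illustration of how a client might address this endpoint, the following sketch builds the request URL and checks the parameters locally before any request is sent. The helper `resource_url` is hypothetical, and the `http` scheme is an assumption; the host, port and API version are the documented defaults.

```python
import re

BASE_URL = "http://localhost:6001/v0"  # documented default root; scheme assumed
COLLECTIONS = ("studies", "variables", "questions", "study_groups")


def resource_url(collection, document_id=None, base_url=BASE_URL):
    """Build the URL of a collection, or of one document inside it."""
    if collection not in COLLECTIONS:
        raise ValueError("unknown collection: %s" % collection)
    if document_id is None:
        # Without a document ID the endpoint addresses the whole collection.
        return "%s/%s" % (base_url, collection)
    # A document ID is documented as a 24-character hex string.
    if not re.fullmatch(r"[0-9a-fA-F]{24}", document_id):
        raise ValueError("invalid document_id: %s" % document_id)
    return "%s/%s/%s" % (base_url, collection, document_id)


print(resource_url("studies"))
print(resource_url("studies", "5a82e76e6fb71d06fef00e69"))
```

The resulting URL can be passed to any HTTP client, for example `urllib.request.urlopen`.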
-
POST
/studies
¶ Create a new object to studies-collection from JSON request body.
Example request:
POST /studies HTTP/1.1
Content-Type: application/json

{"study_number": "study_1"}
Example response:
HTTP/1.1 201 Created
Content-Type: application/json; charset=UTF-8

{
  "result": "insert_successful",
  "error": null,
  "affected_resource": "5a82e76e6fb71d06fef00e69"
}
Request Headers: - Content-Type – application/json
Request JSON Object: - study_number (string) – Required study number. Used as an identifier. Must be unique within collection.
Response JSON Object: - result (string) – Operation outcome.
- error (string) – Errors during operation.
- affected_resource (string) – document_id of the created object.
Status Codes: - 201 Created – Created successfully.
- 415 Unsupported Media Type – Invalid content type.
- 400 Bad Request – Invalid JSON, Validation failed, Duplicate unique value.
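The example request above could be produced from Python roughly as follows. This is a minimal sketch using only the standard library; the helper `build_create_request` is hypothetical, the base URL uses the documented defaults with an assumed `http` scheme, and the request is built but not sent.

```python
import json
import urllib.request


def build_create_request(collection, document,
                         base_url="http://localhost:6001/v0"):
    """Build (but do not send) a POST request that creates ``document``."""
    return urllib.request.Request(
        "%s/%s" % (base_url, collection),
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_create_request("studies", {"study_number": "study_1"})
print(request.get_method(), request.full_url)

# To send it against a running Document Store:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["affected_resource"])
```

The same helper works for the other collections, as long as the payload carries the identifier fields required by that collection.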
-
POST
/variables
¶ Create a new object to variables-collection from JSON request body.
Example request:
POST /variables HTTP/1.1
Content-Type: application/json

{
  "study_number": "study_1",
  "variable_name": "variable_1"
}
Example response:
HTTP/1.1 201 Created
Content-Type: application/json; charset=UTF-8

{
  "result": "insert_successful",
  "error": null,
  "affected_resource": "5a82ecf16fb71d06fef00e6a"
}
Request Headers: - Content-Type – application/json
Request JSON Object: - study_number (string) – Required study number. Used as an identifier combined with variable_name. Their combination must be unique within collection.
- variable_name (string) – Required variable name. Used as an identifier combined with study_number. Their combination must be unique within collection.
Response JSON Object: - result (string) – Operation outcome.
- error (string) – Errors during operation.
- affected_resource (string) – document_id of the created object.
Status Codes: - 201 Created – Created successfully.
- 415 Unsupported Media Type – Invalid content type.
- 400 Bad Request – Invalid JSON, Validation failed, Duplicate unique value.
-
POST
/questions
¶ Create a new object to questions-collection from JSON request body.
Example request:
POST /questions HTTP/1.1
Content-Type: application/json

{
  "study_number": "study_1",
  "question_identifier": "question_1"
}
Example response:
HTTP/1.1 201 Created
Content-Type: application/json; charset=UTF-8

{
  "result": "insert_successful",
  "error": null,
  "affected_resource": "5a82ee1a6fb71d06fef00e6b"
}
Request Headers: - Content-Type – application/json
Request JSON Object: - study_number (string) – Required study number. Used as an identifier combined with question_identifier. Their combination must be unique within collection.
- question_identifier (string) – Required question identifier. Used as an identifier combined with study_number. Their combination must be unique within collection.
Response JSON Object: - result (string) – Operation outcome.
- error (string) – Errors during operation.
- affected_resource (string) – document_id of the created object.
Status Codes: - 201 Created – Created successfully.
- 415 Unsupported Media Type – Invalid content type.
- 400 Bad Request – Invalid JSON, Validation failed, Duplicate unique value.
-
POST
/study_groups
¶ Create a new object to study_groups-collection from JSON request body.
Example request:
POST /study_groups HTTP/1.1
Content-Type: application/json

{
  "study_group_identifier": "study_group_1"
}
Example response:
HTTP/1.1 201 Created
Content-Type: application/json; charset=UTF-8

{
  "result": "insert_successful",
  "error": null,
  "affected_resource": "5a82ee876fb71d06fef00e6c"
}
Request Headers: - Content-Type – application/json
Request JSON Object: - study_group_identifier (string) – Required. Used as an identifier and must be unique within collection.
Response JSON Object: - result (string) – Operation outcome.
- error (string) – Errors during operation.
- affected_resource (string) – document_id of the created object.
Status Codes: - 201 Created – Created successfully.
- 415 Unsupported Media Type – Invalid content type.
- 400 Bad Request – Invalid JSON, Validation failed, Duplicate unique value.
-
DELETE
/
(collection)/
(document_id)¶ Delete document or all documents within collection. If optional document_id is left out, will delete all documents within collection.
Response JSON Object: - result (string) – Operation outcome.
- error (string) – Errors during operation.
- affected_resource (string) – document_id of the deleted object, or the number of deleted documents if document_id is not given in the request.
Status Codes: - 200 OK – Delete successful.
- 404 Not Found – Resource not found.
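Because omitting document_id deletes every document in the collection, a client may want to make that case an explicit opt-in. The sketch below shows one way to do so; the helper `build_delete_request` and its `allow_delete_all` flag are hypothetical and not part of the Document Store, and the request is built but not sent.

```python
import urllib.request


def build_delete_request(collection, document_id=None, allow_delete_all=False,
                         base_url="http://localhost:6001/v0"):
    """Build (but do not send) a DELETE request.

    Omitting ``document_id`` targets every document in the collection, so the
    caller must opt in explicitly with ``allow_delete_all=True``.
    """
    if document_id is None and not allow_delete_all:
        raise ValueError("refusing to delete the whole collection")
    parts = [base_url, collection]
    if document_id is not None:
        parts.append(document_id)
    return urllib.request.Request("/".join(parts), method="DELETE")


single = build_delete_request("studies", "5a82e76e6fb71d06fef00e69")
print(single.get_method(), single.full_url)
```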
-
PUT
/
(collection)/
(document_id)¶ Replace document within collection. See Documents for information on payload.
Note: Leave
_metadata
field out of the request, to let document store handle updated-timestamp automatically.
Request Headers: - Content-Type – application/json
Response JSON Object: - result (string) – Operation outcome.
- error (string) – Errors during operation.
- affected_resource (string) – document_id of the replaced object.
Status Codes: - 200 OK – Replace successful
- 404 Not Found – Resource not found
- 400 Bad Request – Invalid JSON, Validation failed, Duplicate unique value.
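A typical update cycle is to fetch a document, change it, and send it back. The sketch below drops the _metadata field before building the replace request, as the note above recommends. The helper `build_replace_request` is hypothetical, and whether other server-generated fields also need to be removed is not covered here.

```python
import json
import urllib.request


def build_replace_request(collection, document_id, document,
                          base_url="http://localhost:6001/v0"):
    """Build (but do not send) a PUT request that replaces a document."""
    # Leave _metadata out so the Document Store can manage the
    # updated-timestamp itself.
    payload = {key: value for key, value in document.items()
               if key != "_metadata"}
    return urllib.request.Request(
        "%s/%s/%s" % (base_url, collection, document_id),
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


fetched = {
    "study_number": "1200",
    "_metadata": {"updated": "2018-01-31T11:37:34Z", "cmm_type": "study"},
}
request = build_replace_request("studies", "5a82e76e6fb71d06fef00e69", fetched)
print(request.get_method(), request.data.decode("utf-8"))
```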
Import API¶
Documents may be imported to Document Store by using the import API. See Importers for more information on how documents are parsed.
-
POST
/import/
(importer_id)/
(collection)¶ Import document using importer specified with importer_id. Optional collection may be given to limit the import to a specific collection. If collection is not given the importer will import to collections that are applicable to the posted file.
Note: Importer API may only be used to initially import documents to the database. After the initial import the documents may be updated via the REST API.
Parameters: - importer_id (str) – Mandatory parameter to select the importer.
  ddi_31 imports a DDI 3.1 file.
  ddi_c imports a DDI 2.5 file.
  ddi_122_nesstar imports a DDI 1.2.2 Nesstar file.
- collection (str) – Optional parameter to limit the import to specific collection. One of studies, variables, questions or study_groups.
Request Headers: - Content-Type – text/xml
Request body: DDI file contents
Response JSON Object: - result (string) – Operation result
- imported_docs (array) – Imported document IDs
- error (string) – Errors found during import
Status Codes: - 200 OK – Import successful
- 400 Bad Request – Empty or invalid request body
- 404 Not Found – Invalid importer id
- 415 Unsupported Media Type – Invalid content type
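An import request could be assembled along the following lines. This is a sketch under the same assumptions as the other examples (documented default root, assumed `http` scheme); the helper `build_import_request` is hypothetical, and the XML passed in the usage lines is a placeholder rather than a valid DDI file.

```python
import urllib.request

IMPORTERS = ("ddi_31", "ddi_c", "ddi_122_nesstar")


def build_import_request(importer_id, ddi_xml, collection=None,
                         base_url="http://localhost:6001/v0"):
    """Build (but do not send) a POST request for the import endpoint.

    ``ddi_xml`` is the DDI file content as bytes.
    """
    if importer_id not in IMPORTERS:
        raise ValueError("unknown importer: %s" % importer_id)
    url = "%s/import/%s" % (base_url, importer_id)
    if collection is not None:
        # Optionally limit the import to a single collection.
        url += "/" + collection
    return urllib.request.Request(
        url,
        data=ddi_xml,
        headers={"Content-Type": "text/xml"},
        method="POST",
    )


# Placeholder body; a real call would read the bytes of a DDI file.
request = build_import_request("ddi_c", b"<codeBook/>", collection="studies")
print(request.get_method(), request.full_url)
```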
Query API¶
Query documents or information on documents from collection.
-
POST
/query/
(collection)¶ Execute query against collection and return results in JSON.
Request Headers: - Content-Type – application/json
Query Parameters: - query_type (string) – Optional query parameter to select the query type.
  select is the default query type. It returns all documents matched by the filter.
  count returns the number of documents which match the filter.
  distinct returns the distinct values of a given field among documents which match the filter.
Request JSON Object: - _filter (object) – Query filter. Used for all query types. Request may specify multiple filter conditions.
Example request with multiple filter conditions:
POST /query/variables HTTP/1.1
Content-Type: application/json

{
  "_filter": {
    "study_number": "study_1",
    "variable_name": "variable_1"
  }
}
Request JSON Object: - fields (array) – Optional. Select returned fields. Used in select query type. _id will always be returned. If not set, full document will be returned.
- skip (int) – Optional. Skip documents from the beginning. Used in select query type.
- limit (int) – Optional. Limit the number of returned documents. Used in select query type.
- sort_by (string) – Optional. Sort the queried documents by field. Used in select query type.
- sort_order (int) – Optional. Sort order of the queried documents. Used in select query type.
  1: Ascending sort order.
  -1: Descending sort order.
- fieldname (string) – Mandatory for distinct query type. Return distinct values for this field.
Result depends on the requested query_type.
JSON response for select query-type
Results will be streamed one object at a time. The object is a document with requested fields.
JSON response for count query-type
Response JSON Object: - count (int) – Number of documents found with
_filter
.
JSON response for distinct query-type
If fieldname points to a leaf node of the document, the response is in the following format.
HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8

{ "<fieldname>": ["<list-of-distinct-values>"] }
If fieldname points to a branch node of the document, the response is in the following format.
HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8

{ "<fieldname>": ["<list-of-distinct-objects>"] }
Example requests and responses for distinct query-type
POST /query/studies?query_type=distinct HTTP/1.1
Content-Type: application/json

{ "fieldname": "_metadata.updated" }

HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8

{
  "_metadata.updated": [
    "2018-02-13T13:49:37Z",
    "2018-02-08T10:55:41Z"
  ]
}

POST /query/studies?query_type=distinct HTTP/1.1
Content-Type: application/json

{ "fieldname": "_metadata" }

HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8

{
  "_metadata": [
    {
      "updated": "2017-11-09T12:07:48Z",
      "cmm_type": "study",
      "created": "2017-11-09T11:06:03Z"
    },
    {
      "updated": "2017-11-09T11:37:16Z",
      "cmm_type": "study",
      "created": "2017-11-09T11:37:16Z"
    }
  ]
}
Note
Distinct queries for datetime fields will not work as expected, due to different precision in MongoDB and Document Store JSON. MongoDB stores datetimes with millisecond precision, while Document Store JSON supports only second precision.
Status Codes: - 200 OK – OK
- 400 Bad Request – Message body empty, invalid query_type, invalid query parameters for query type.
- 415 Unsupported Media Type – Invalid Content-Type
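The query parameters and request body described above can be combined as in the following sketch. The helper `build_query_request` is hypothetical and only performs local checks that mirror the documented rules; treating `_filter` as optional is an assumption based on the distinct examples, which omit it. The request is built but not sent.

```python
import json
import urllib.parse
import urllib.request

QUERY_TYPES = ("select", "count", "distinct")
BODY_OPTIONS = ("fields", "skip", "limit", "sort_by", "sort_order", "fieldname")


def build_query_request(collection, query_filter=None, query_type="select",
                        base_url="http://localhost:6001/v0", **options):
    """Build (but do not send) a POST request for the query endpoint."""
    if query_type not in QUERY_TYPES:
        raise ValueError("invalid query_type: %s" % query_type)
    unknown = set(options) - set(BODY_OPTIONS)
    if unknown:
        raise ValueError("unknown options: %s" % sorted(unknown))
    if query_type == "distinct" and "fieldname" not in options:
        raise ValueError("fieldname is mandatory for distinct queries")
    payload = dict(options)
    if query_filter is not None:
        payload["_filter"] = query_filter
    url = "%s/query/%s?%s" % (
        base_url, collection,
        urllib.parse.urlencode({"query_type": query_type}),
    )
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_query_request(
    "variables", {"study_number": "study_1"},
    fields=["variable_name"], sort_by="variable_name", sort_order=1, limit=10,
)
print(request.full_url)
print(request.data.decode("utf-8"))
```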
Documents¶
Documents are objects stored in a collection. Documents support four different types of fields:
key-value pair:
{"study_number": "1200"}
contained key-value pairs:
{"_metadata": { "updated": "2018-01-31T11:37:34Z", "cmm_type": "study", "created": "2018-01-31T11:37:27Z" } }
localized contained key-value pairs:
{"study_titles": [ { "language": "en", "study_title": "Study 1983" }, { "language": "fi", "study_title": "Tutkimus 1983" } ]}
list of unique values:
{"study_numbers": ["1210", "3134", "1175", "2290", "2498"]}
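To show how a client might read one of these field types, the sketch below picks a value by language from localized contained key-value pairs. The helper `localized_value` is hypothetical, and the sample document simply combines the field examples above for illustration.

```python
def localized_value(document, field, key, language):
    """Return ``key`` from the entry of ``field`` in ``language``, else None."""
    for entry in document.get(field, []):
        if entry.get("language") == language:
            return entry.get(key)
    return None


# Illustrative document assembled from the field examples above.
study = {
    "study_number": "1200",
    "study_titles": [
        {"language": "en", "study_title": "Study 1983"},
        {"language": "fi", "study_title": "Tutkimus 1983"},
    ],
}

print(localized_value(study, "study_titles", "study_title", "fi"))
```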
Importers¶
There are importers for DDI 3.1, DDI 2.5 and DDI 1.2.2, which can be used to initially import DDI XML files to the document store.
The importer tries to update documents if they are already found in the database. However, it is not guaranteed to work properly in cases where an ID element for a field is not found in the DDI. Therefore it is best to use the importer only for the initial import of records and afterwards use the REST API to update the documents.
Importer reads xml:lang attributes from the XML-elements to get the language of the
element’s content. If an element should have no xml:lang attribute, the language is read
from the root XML-element’s xml:lang. If the root element has no xml:lang attribute the
content is assumed to be in English, and en
is used for the language.
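The language fallback described above can be sketched with the standard library's ElementTree. This is an illustration of the rule and not the importer's actual code: the helper `element_language` is hypothetical, and the XML fragment is made up for the example and is not a valid DDI document.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def element_language(element, root, default="en"):
    """Resolve language: the element's xml:lang, then the root's, then default."""
    return element.get(XML_LANG) or root.get(XML_LANG) or default


# Made-up fragment for illustration only.
root = ET.fromstring(
    '<codeBook xml:lang="fi">'
    '<titl xml:lang="en">Study 1983</titl>'
    '<abstract>Kuvaus</abstract>'
    '</codeBook>'
)
for child in root:
    print(child.tag, element_language(child, root))
```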