Datasets API

General

Datasets, like all objects accessible through the different APIs in Metax, have an internal identifier field, identifier, which uniquely identifies a record within Metax.

The standard way to retrieve a single dataset is to send a request to the API GET /rest/v2/datasets/<pid>, where <pid> is the record’s internal identifier. The result contains various information about the state of the dataset, such as last-modified timestamps, PAS state, and other data. It also includes the field that is probably of most interest to end users: the research_dataset field, which contains the actual user-provided metadata descriptions of the dataset itself.

Datasets can be listed and browsed using the API GET /rest/v2/datasets. Retrieving a dataset or listing datasets can be augmented in various ways by using additional parameters. For details, see swagger’s section about datasets.
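
The two read endpoints described above can be sketched as plain URL builders. This is a minimal illustration, not an official client: the host is the test instance used in other examples of this guide, abc123 is a placeholder identifier, and the limit and offset query parameters are assumptions that should be checked against the swagger documentation.

```python
from urllib.parse import quote, urlencode

# Test instance used in other examples of this guide.
BASE_URL = "https://metax.fd-test.csc.fi"


def dataset_url(pid):
    """URL for retrieving a single dataset by its internal identifier."""
    return f"{BASE_URL}/rest/v2/datasets/{quote(pid, safe='')}"


def dataset_list_url(**params):
    """URL for listing datasets; keyword arguments become query parameters."""
    url = f"{BASE_URL}/rest/v2/datasets"
    return f"{url}?{urlencode(params)}" if params else url


print(dataset_url("abc123"))
# "limit" and "offset" are assumed parameter names, shown for illustration only.
print(dataset_list_url(limit=10, offset=20))
```

The resulting URLs can be passed to any HTTP client, for example requests.get(url, headers=headers).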

When creating a new dataset, it is recommended to always create it first in draft state by using the optional query parameter ?draft=true. While the dataset is in draft state, you can freely modify it and add or remove files, until you are satisfied with the result and ready to publish the dataset for the whole world to find. Unlike published datasets, draft datasets can also be permanently deleted at any time without leaving any trace. If the query parameter ?draft=true is left out (or its value is false), the dataset is published immediately upon creation. Publishing immediately can be particularly useful for large automated jobs, where creating a draft first may be an unnecessary intermediate step.
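
As a rough sketch of the draft-first recommendation, the helper below prepares a create request with or without the ?draft=true parameter. The function name build_create_request and the bearer token value are illustrative placeholders; the request is only constructed here, not sent.

```python
import json
from urllib.request import Request


def build_create_request(dataset, token, draft=True):
    """Prepare (but do not send) a POST request that creates a dataset."""
    url = "https://metax.fd-test.csc.fi/rest/v2/datasets"
    if draft:
        # Leaving this parameter out publishes the dataset immediately.
        url += "?draft=true"
    return Request(
        url,
        data=json.dumps(dataset).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder payload and token, for illustration only.
req = build_create_request({"research_dataset": {}}, token="abc.def.ghi")
print(req.get_method(), req.full_url)
```

The prepared request could then be sent with urllib.request.urlopen(req); a successful creation is expected to answer with status 201, as in the other examples of this guide.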

Data model visualization

The dataset data model visualization can be seen at https://tietomallit.suomi.fi/model/mrd. It is very helpful to keep the visualization open when creating dataset metadata descriptions, as it shows all the possible fields and relations that can be used, which fields are mandatory, and so on.

Additionally, the chosen data catalog may place some additional restrictions on the schema, such as fewer relations or fewer mandatory fields. Read more about data catalogs and their implications in the section Data Catalogs.

Dataset schemas

JSON schema files for dataset metadata descriptions (field research_dataset):

There are also other schemas for datasets in other data catalogs for specialized use, such as for harvesting.

Common schema validation errors

Schema validation errors can sometimes be difficult to decipher for the untrained eye, and in some cases they simply don’t tell exactly what’s wrong with some object or value (e.g. when an object does not conform to any of the oneOf objects). Here are some common hard-to-understand error messages, and tips on what to look for.

“<object> is not valid under any of the given schemas”

Can happen with oneOf objects (ResearchAgent is one of Person or Organization objects). Depending on which type of object you have used (person or organization), double check that the fields on that object conform to either the Person object or the Organization object in the dataset schema. Ensure:

  • field values are of the correct type

  • mandatory fields are present

  • relation field cardinalities are correct (is the relation field an array, or a single object?)

Unfortunately the oneOf errors are not very detailed in the current schema validation library.
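
Since the validator’s oneOf message says little, a small local pre-check can point at the likely culprit before sending the request. The sketch below is a hypothetical helper, not part of Metax, and it only covers the rules shown in this guide: the "@type" field selects Person or Organization, a Person name is a plain string, and an Organization name is an object with at least one language. Treating "name" as mandatory is an assumption; the dataset JSON schema remains the authority.

```python
def diagnose_agent(agent):
    """Return a list of likely reasons why a ResearchAgent object fails oneOf."""
    problems = []
    agent_type = agent.get("@type")
    if agent_type not in ("Person", "Organization"):
        problems.append('"@type" must be "Person" or "Organization"')
        return problems

    name = agent.get("name")
    if name is None:
        # Assumption: "name" is mandatory for both object types.
        problems.append('field "name" is missing')
    elif agent_type == "Person" and not isinstance(name, str):
        problems.append('Person "name" must be a string')
    elif agent_type == "Organization" and not (isinstance(name, dict) and name):
        problems.append('Organization "name" must be an object with at least one language')
    return problems


# An organization whose name was given as a plain string by mistake.
print(diagnose_agent({"@type": "Organization", "name": "Organization X"}))
```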

Terminology

Records, catalog records

The results returned from the API GET /rest/v2/datasets/<pid> are also sometimes called “catalog records”, or “records”. At the top level there are Data Catalogs, and Data Catalogs contain Catalog Records. Catalog records can be considered the “technical” name of a dataset inside Metax.

Identifier

Usually when identifier is mentioned in the documentation, by default it refers to the internal Metax identifier of an object. The internal identifier field always resides on the root level of the object retrieved from Metax API.

Preferred Identifier

Preferred identifier is the “public” identifier of a dataset. When referring to a dataset in publications, tweets, or wherever in the outside world, the preferred identifier is the identifier to use. When creating datasets in Metax, preferred identifiers are always automatically generated by Metax. Harvested datasets are an exception: they use the identifier from the original source as the preferred identifier.

User metadata

When a user has added some files to a dataset, the user can choose to write additional descriptions to those files. The files already include various automatically generated technical metadata, such as byte sizes, mime types, checksum values and algorithms and such, but any extra metadata that the user wishes to enter about some file is called “user metadata”.

Data Catalogs

Every dataset belongs to a Data Catalog. Data catalogs house datasets with different origins (harvested vs. Fairdata user provided datasets) and slightly different schemas (IDA and ATT catalogs, for example), and datasets in some catalogs are automatically versioned. While anybody can read datasets from all catalogs (save for some data which might be considered sensitive, such as personal information), adding datasets to catalogs can be restricted: some catalogs allow additions only by known services, while others also allow additions by end users.

Data catalogs can be browsed by using the API /rest/datacatalogs. The data catalog data model visualization can be found here https://tietomallit.suomi.fi/model/mdc. The data catalog JSON schema file can be found here.

The official Fairdata data catalogs with end user write access are:

Catalog | Purpose | Identifier
IDA | Store datasets which have files stored in the IDA Fairdata service. | urn:nbn:fi:att:data-catalog-ida
ATT | Store datasets which have data stored elsewhere than in the IDA Fairdata service. | urn:nbn:fi:att:data-catalog-att
PAS | Store datasets which have data (or a copy of the data) stored in the Fairdata PAS service. | urn:nbn:fi:att:data-catalog-pas

Other data catalogs where End Users can directly store dataset metadata:

Catalog | Purpose | Identifier
Legacy | Store legacy datasets that are published elsewhere. Published datasets may not have all of the required metadata to qualify as a Fairdata dataset. Identifiers are not generated by Metax: the user has to provide any identifiers. | urn:nbn:fi:att:data-catalog-legacy
Draft | Store datasets which are in draft state and whose data catalog is not yet decided. This catalog is used as the default catalog when creating datasets. Note that you must change the catalog to one of the above before adding files or publishing. | urn:nbn:fi:att:data-catalog-dft

Choosing the right Data Catalog

Other than the harvested data catalogs managed by Fairdata harvesters, the two most interesting data catalogs are probably the IDA catalog and the ATT catalog, commonly referred to as “the Fairdata catalogs”. What these catalogs also have in common is that end users can add datasets to them. For the most part these two catalogs are behaviourally identical, but they do serve different purposes, and have one critical technical difference.

IDA catalog

The IDA catalog hosts datasets which have their files stored in the Fairdata IDA service. The datasets stored in this catalog use a schema which allows the fields research_dataset.files (dataset file data model) and research_dataset.directories (dataset directory data model), which are used to list and describe related files in IDA. On the other hand, the schema is missing the field research_dataset.remote_resources, meaning it does not allow listing files stored in file storages other than IDA.

Note

For end users it is important to note, that you will never be “creating” or “storing” new files in Metax or in IDA by using Metax API: Files are always stored by using the IDA service (https://www.fairdata.fi/en/ida/). Once the files have been stored (frozen) using IDA, the metadata of the stored files is automatically sent to Metax. Then, using Metax APIs, the metadata of the files can be browsed, and linked to datasets, and finally published to the world as part of a dataset.

ATT catalog

The ATT catalog is the opposite of the IDA catalog: it hosts datasets whose files are stored elsewhere than in the Fairdata IDA service. The datasets in this catalog use a schema which allows the field research_dataset.remote_resources (dataset remote resource data model), while missing the IDA related fields.
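
The schema difference between the IDA and ATT catalogs can be checked locally before sending a dataset. The following is a minimal sketch based only on the fields named above; the helper misplaced_fields is hypothetical, and the real validation is always done by Metax against the catalog’s JSON schema.

```python
IDA = "urn:nbn:fi:att:data-catalog-ida"
ATT = "urn:nbn:fi:att:data-catalog-att"

# Fields of research_dataset that each catalog's schema is missing.
DISALLOWED_FIELDS = {
    IDA: {"remote_resources"},
    ATT: {"files", "directories"},
}


def misplaced_fields(data_catalog, research_dataset):
    """List research_dataset fields that the given catalog does not allow."""
    disallowed = DISALLOWED_FIELDS.get(data_catalog, set())
    return sorted(disallowed & set(research_dataset))


# A dataset aimed at the ATT catalog that mistakenly lists IDA files.
print(misplaced_fields(ATT, {"title": {"en": "Example"}, "files": []}))
```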

PAS catalog

The PAS catalog contains metadata of datasets that have either all their data or a copy of the data stored in the Fairdata PAS service. Datasets cannot be freely created in this catalog; doing so requires a special contract with the PAS service.

Attaching a dataset to a catalog

When creating a new dataset and wishing to use for example the ATT catalog, the dataset would be linked to it in the following way:

import requests

dataset_data = {
    "data_catalog": "urn:nbn:fi:att:data-catalog-att",
    "research_dataset": {
        # lots of content...
    }
}

headers = { 'Authorization': 'Bearer abc.def.ghi' }
response = requests.post('https://metax.fd-test.csc.fi/rest/v2/datasets', json=dataset_data, headers=headers)
assert response.status_code == 201, response.content

For more involved examples, see the examples section for datasets.

Dataset lifecycle in Metax

  1. A dataset is created as a draft. When in draft state:
    • Files can be added and removed from the dataset freely.

    • Metadata descriptions can be edited.

    • The dataset is not publicly findable.

    • The dataset can be permanently deleted at any time by the user.

  2. A dataset is published. When the dataset is published:
    • The dataset’s metadata descriptions can still be updated at any time.

    • Files can no longer be freely added or removed (a couple of exceptions remain, see Dataset versioning).

    • The dataset becomes publicly findable (any selected access restrictions, such as embargo, apply).

    • Dataset receives permanent resolvable identifiers.

    • Dataset can no longer be permanently deleted. A tombstone page will remain after deletion.

    • New versions can be created from the datasets, where files can again be freely added or removed until dataset is published.

  3. Dataset is stored to PAS (long-term preservation) through the PAS process.
    • A PAS contract is needed to store datasets into PAS.

    • Is not a mandatory step in the lifecycle of all datasets.

    • If an IDA dataset is taken to PAS, the original dataset continues its life as a separate dataset.

    • Datasets can also be created directly into PAS (directly into the PAS catalog).

  4. A dataset is implicitly deprecated as a result of someone deleting a dataset’s files from the file storage.

  5. A dataset is explicitly deleted by the user.

Read-only metadata fields

In the field research_dataset, the following metadata fields are generally considered read-only for the user:

  • total_files_byte_size (calculated by Metax)

  • total_remote_resources_byte_size (calculated by Metax)

  • metadata_version_identifier (generated by Metax)

  • preferred_identifier

For preferred_identifier, exceptions exist: For harvested datasets, the harvester must set the value, and in certain data catalogs, the user must provide the value. In cases where the value is missing when required to be provided, Metax will raise an error to inform the user.
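
When sending back metadata that was first retrieved from the API, it can be convenient to drop the read-only fields listed above. The helper below is only a sketch: without_read_only is a hypothetical name, and the keep_preferred_identifier switch reflects the exceptions mentioned for harvested datasets and certain catalogs.

```python
READ_ONLY_FIELDS = (
    "total_files_byte_size",
    "total_remote_resources_byte_size",
    "metadata_version_identifier",
    "preferred_identifier",
)


def without_read_only(research_dataset, keep_preferred_identifier=False):
    """Return a copy of research_dataset without the read-only fields."""
    skip = set(READ_ONLY_FIELDS)
    if keep_preferred_identifier:
        # Harvesters and some catalogs must provide this value themselves.
        skip.discard("preferred_identifier")
    return {key: value for key, value in research_dataset.items() if key not in skip}


metadata = {
    "title": {"en": "Example"},
    "total_files_byte_size": 1024,
    "preferred_identifier": "urn:example:placeholder",  # placeholder value
}
print(without_read_only(metadata))
```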

End User API: Writable fields

When using the End User API, restrictions apply to which fields can be set or modified by the user.

When creating a record using the REST API, the following catalog record root-level fields can be set:

  • data_catalog

  • research_dataset

  • cumulative_state

When updating a record using the REST API, the following catalog record root-level fields can be updated:

  • research_dataset

When using the RPC API, some fields are automatically updated as a result, such as when publishing a dataset (state is updated), or when changing cumulative state of the dataset, e.g. closing a cumulative period (cumulative_state is updated, date_cumulation_ended is updated). See the swagger doc pages for details about available RPC API endpoints.
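
The writable root-level fields listed above can be expressed as a simple filter. This is an illustrative sketch (end_user_payload is a hypothetical helper); how Metax itself treats other fields in a request is not covered here.

```python
WRITABLE_FIELDS = {
    "create": {"data_catalog", "research_dataset", "cumulative_state"},
    "update": {"research_dataset"},
}


def end_user_payload(record, operation):
    """Keep only the root-level fields an end user may set for the operation."""
    allowed = WRITABLE_FIELDS[operation]
    return {key: value for key, value in record.items() if key in allowed}


record = {
    "identifier": "abc123",  # placeholder internal identifier
    "data_catalog": "urn:nbn:fi:att:data-catalog-att",
    "research_dataset": {"title": {"en": "Example"}},
}
print(sorted(end_user_payload(record, "create")))
print(sorted(end_user_payload(record, "update")))
```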

If-Modified-Since header in dataset API

If-Modified-Since header can be used in GET /rest/v2/datasets, GET|PUT|PATCH /rest/v2/datasets/<pid>, or GET /rest/v2/datasets/identifiers requests. This will return the result(s) only if the resources have been modified after the date specified in the header. In update operations the use of the header works as with other types of resources in Metax API. The format of the header should follow guidelines mentioned in https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/If-Modified-Since

If the requested resource has not been modified after the date specified in the header, the response will be 304 Not Modified.
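
A header value in the expected HTTP date format can be produced with the Python standard library. The sketch below only prepares the request; the dataset identifier abc123 is a placeholder.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request


def if_modified_since(moment):
    """Format a timezone-aware datetime as an HTTP date in GMT."""
    return format_datetime(moment.astimezone(timezone.utc), usegmt=True)


header_value = if_modified_since(datetime(2021, 3, 1, 12, 0, tzinfo=timezone.utc))
print(header_value)

request = Request(
    "https://metax.fd-test.csc.fi/rest/v2/datasets/abc123",
    headers={"If-Modified-Since": header_value},
)
# Note: urllib.request.urlopen(request) reports a 304 Not Modified response
# by raising urllib.error.HTTPError with code 304.
```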

Dataset versioning

General

What does dataset versioning mean?

At the core of dataset versioning is the need to enforce immutability of files that a dataset consists of. When a dataset is created into draft state, files can be freely added or removed from it. Once the draft dataset is published, the set of files becomes permanent, and files can no longer be freely added or removed.

Exceptions exist to the rule of not being able to change the files of a published dataset:

  1. Cumulative datasets
    • If a dataset has been marked as cumulative dataset, files can be freely added to it as long as the cumulative period remains open. Removing files is not permitted. Once the cumulative period is closed, adding new files to the dataset is no longer permitted.

  2. Dataset is published, but has 0 files in it
    • It’s possible to publish a dataset without any files in it. In this case, it will be possible to add files to the dataset one time. After that, normal restrictions will apply. When using the API, this means that the files should be added in a single request to the API.
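
The rule and its two exceptions can be summarised as a small decision function. This is a sketch of the logic as described above, not Metax code; the state values "draft" and "published" and the parameter names are assumptions made for illustration.

```python
def can_add_files(state, cumulative_period_open=False, current_file_count=0):
    """Decide whether files may be added to a dataset, per the rules above."""
    if state == "draft":
        # Drafts can be modified freely.
        return True
    if cumulative_period_open:
        # Cumulative datasets accept new files while the period is open.
        return True
    # A published dataset with no files accepts files once, in a single request.
    return current_file_count == 0


print(can_add_files("draft", current_file_count=5))
print(can_add_files("published", current_file_count=5))
print(can_add_files("published", cumulative_period_open=True, current_file_count=5))
print(can_add_files("published", current_file_count=0))
```

Removing files follows a stricter rule: according to the text above, it is not permitted even for cumulative datasets once they are published.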

As a slightly less significant form of versioning, when updating the contents of field research_dataset, the previous metadata version is archived so it may be accessed or restored later.

Note

As an end user who is editing the descriptions of their datasets, you generally don’t need to care that new metadata versions are being created. It does not affect your current dataset’s identifiers, validity, or your ability to access it or refer to it elsewhere. The old metadata is simply archived so that it may be accessed or restored later. Bear in mind, though, that old metadata versions are just as public as everything in the most recent version.

How to create a new version of a dataset?

Creating a new version of a dataset is a manual operation. A new dataset version can be created by calling a special RPC API POST /rpc/v2/datasets/create_new_version?identifier=<dataset_identifier>, which creates a new version of the targeted dataset and creates links between the new and the old version. The new version is saved into draft state, and needs to be separately published by using the designated RPC API.
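
A call to the version-creating RPC endpoint could be prepared as follows. The sketch only builds the request; the identifier abc123 and the bearer token are placeholders, and the shape of the response is not shown because it is not described here.

```python
from urllib.parse import urlencode
from urllib.request import Request


def create_new_version_request(dataset_identifier, token):
    """Prepare (but do not send) the RPC call that creates a new dataset version."""
    query = urlencode({"identifier": dataset_identifier})
    return Request(
        f"https://metax.fd-test.csc.fi/rpc/v2/datasets/create_new_version?{query}",
        headers={"Authorization": f"Bearer {token}"},
        method="POST",
    )


rpc_request = create_new_version_request("abc123", token="abc.def.ghi")
print(rpc_request.get_method(), rpc_request.full_url)
```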

It should be noted that a dataset can have only one “next version” existing at a time. As long as the newer version is still in draft state, it can be freely deleted, but once the new version is published, the original version can no longer have new versions created from it. It’s still possible to manually create a completely new dataset using the original dataset as a template, and to describe manually in the metadata that this new dataset is related to another dataset, but the automatic versioning links that are created by using the RPC API will not be there.

It is also possible to create a new dataset version from a deprecated dataset. In this situation, the version creation process creates a new dataset, and automatically removes all files from the dataset that are marked as having been deleted (which caused the original dataset to be marked as deprecated in the first place). While deprecated datasets themselves cannot be restored, a new version can be created where the missing files are removed, in addition to any other corrective measures made by the user.

Deleting files in a file storage

In order to be able to add files to a dataset, the files first have to be uploaded to a supported file storage (such as Fairdata IDA), and the file metadata uploaded to Metax. If, for some reason, the files are deleted from the related file storage, and Metax has been made aware of the deletion, then the related datasets are marked as “deprecated”, since Metax can no longer guarantee that the files of the dataset exist anywhere. It is still possible that the dataset is findable and fully downloadable from somewhere else, but as far as Metax knows, the dataset is broken.

Terminology

  • Metadata version: Only metadata descriptions differ between metadata versions. Identifiers do not change between metadata versions.

  • Dataset version: The associated set of files differs between different dataset versions of the same record. Identifiers change between versions.

  • Deprecated dataset: When some of the dataset’s files have been physically deleted in the related file storage, then that dataset is marked as “deprecated”. Deprecated datasets are still publicly findable, but they are no longer downloadable. It’s possible that a deprecated dataset is still findable and downloadable from some other service than Fairdata.

How to enable versioning in a dataset?

A data catalog has the setting dataset_versioning (boolean) which indicates whether or not datasets saved to that catalog should enforce rules related to versioning. In general, versioning is only enabled for IDA catalogs. Versioning cannot be enabled for harvested data catalogs (an error is raised if it is attempted, to prevent accidents).

Browsing a dataset’s versions

Browsing metadata versions

The API GET /rest/v2/datasets/<pid>/metadata_versions can be used to list metadata versions of a specific dataset. Access details of a specific version using the API GET /rest/v2/datasets/<pid>/metadata_versions/<metadata_version_identifier>.
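
Both metadata version endpoints can be addressed with one small URL builder. This is a minimal sketch; the identifiers used are placeholders.

```python
from urllib.parse import quote

BASE_URL = "https://metax.fd-test.csc.fi"


def metadata_versions_url(pid, metadata_version_identifier=None):
    """URL for listing metadata versions, or for one specific version."""
    url = f"{BASE_URL}/rest/v2/datasets/{quote(pid, safe='')}/metadata_versions"
    if metadata_version_identifier is not None:
        url += "/" + quote(metadata_version_identifier, safe="")
    return url


print(metadata_versions_url("abc123"))
print(metadata_versions_url("abc123", "version456"))
```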

Browsing dataset versions

When retrieving a single dataset record, the following version-related fields are always present if other versions exist:

Field | Purpose
dataset_version_set | A list of all other dataset versions of the dataset.
next_dataset_version | Link to the next dataset version.
previous_dataset_version | Link to the previous dataset version.

Using the identifiers provided by the above fields, it’s possible to retrieve information about a specific dataset version using the standard datasets API GET /rest/v2/datasets/<pid>.

Note that if the next version of a dataset is still in draft state, then the next_dataset_version field will only be visible for authorized users (the owner of the dataset), with the field state present (when the next version is published, state field will not normally be there). The field dataset_version_set always only lists published datasets, for all users!
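
Collecting the identifiers of related versions from a retrieved record might look like the sketch below. The exact shape of the version fields is not described in this section, so the code assumes that each link is an object carrying an "identifier" key; verify this against a real response before relying on it.

```python
def version_identifiers(record):
    """Collect identifiers of related dataset versions from a record.

    Assumption: each version link is an object with an "identifier" key.
    """
    result = {}
    for field in ("previous_dataset_version", "next_dataset_version"):
        link = record.get(field)
        if link:
            result[field] = link["identifier"]
    result["dataset_version_set"] = [
        version["identifier"] for version in record.get("dataset_version_set", [])
    ]
    return result


# Hand-written sample record with placeholder identifiers.
sample_record = {
    "identifier": "v2",
    "previous_dataset_version": {"identifier": "v1"},
    "dataset_version_set": [{"identifier": "v2"}, {"identifier": "v1"}],
}
print(version_identifiers(sample_record))
```

Each collected identifier can then be used as <pid> in GET /rest/v2/datasets/<pid>.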

Uniqueness of datasets

Non-harvested data catalogs

In non-harvested data catalogs, the uniqueness of a dataset is generally determined by two fields:

  • Identifier of the record object (catalogrecord.identifier), the value of which is unique globally, and generated server-side when the dataset is created. This is an internal identifier, used to identify and access a particular record in Metax.

  • Identifier of the dataset (catalogrecord.research_dataset.preferred_identifier). This is the identifier of “The Dataset”, i.e. the actual data and metadata you care about. The value is generated server-side when the dataset is created.

Harvested data catalogs

In harvested data, the value of preferred_identifier can and should be extracted from the harvested dataset’s source data. The harvester is allowed to set the preferred_identifier for the datasets it creates in Metax, so the harvest source organization should indicate which field they would like to use as the preferred_identifier.

The value of preferred_identifier is unique within its data catalog, so, for example, three datasets with the same preferred_identifier value can co-exist in three different data catalogs. When retrieving details of a single record using the API, information about these “alternate records” is included in the read-only field alternate_record_set, which contains a list of Metax internal identifiers of the other records.

If the field alternate_record_set is missing from a record, it means there are no alternate records sharing the same preferred_identifier in different data catalogs.
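
Because the field may be absent, code reading it should fall back to an empty list. A minimal sketch, with a hypothetical helper name and placeholder identifiers:

```python
def alternate_record_urls(record, base_url="https://metax.fd-test.csc.fi"):
    """URLs of alternate records; an empty list when the field is missing."""
    return [
        f"{base_url}/rest/v2/datasets/{identifier}"
        for identifier in record.get("alternate_record_set", [])
    ]


print(alternate_record_urls({"identifier": "abc123"}))
print(alternate_record_urls({"identifier": "abc123", "alternate_record_set": ["def456"]}))
```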

Using an existing dataset as a template

If you want to use an existing dataset as a template for a new dataset, you can retrieve a dataset from the API, remove two particular identifying fields from the returned object, and then use the resulting object in a new create request to Metax API. Example:

import requests

headers = { 'Authorization': 'Bearer abc.def.ghi' }
response = requests.get('https://metax.fd-test.csc.fi/rest/v2/datasets/abc123', headers=headers)
assert response.status_code == 200, response.content

# requests exposes the parsed JSON body through response.json()
new_dataset = response.json()
print('Retrieved a dataset that has identifier: %s' % new_dataset['identifier'])

# remove the two identifying fields, so that Metax generates new ones
del new_dataset['identifier']
del new_dataset['research_dataset']['preferred_identifier']

# note: uses the ?draft=true optional query param, leaving the new dataset into draft state
response = requests.post('https://metax.fd-test.csc.fi/rest/v2/datasets?draft=true', json=new_dataset, headers=headers)
assert response.status_code == 201, response.content
print('Created a new dataset that has identifier: %s' % response.json()['identifier'])

Reference data guide

A dataset’s metadata descriptions require the use of reference data in quite a few places; even the bare minimum accepted dataset already uses reference data in three different fields.

This sub-section contains a table (in practice, a Python dictionary) that shows which relations and fields of the field research_dataset require or offer the option to use reference data. For example, research_dataset.language is a relation, while research_dataset.language.identifier is a field of that relation. The table is best read alongside the visualization at https://tietomallit.suomi.fi/model/mrd, which is a visualization of the schema of field research_dataset (plus the main record object, CatalogRecord, which is actually what the API GET /rest/v2/datasets returns).

About ResearchAgent, Organization, and Person

Before diving into the reference data table, a few things should be mentioned about the person and organization -type objects in the dataset schema.

In the schema visualization at https://tietomallit.suomi.fi/model/mrd, there are various relations leading from the object ResearchDataset to the object ResearchAgent (research agent data model). The visualization tool is currently unable to visualize “oneOf” relations of JSON schemas. If you open one of the actual dataset schema files provided by the API /rest/schemas, such as https://metax.fd-test.csc.fi/rest/v2/schemas/ida_dataset, and search for the string “oneOf” inside that file, you will see that the object ResearchAgent is actually an instance of either the Person (person data model) or the Organization (organization data model) object. This means that when setting, for example, the research_dataset.curator relation (which is an array), the contents of the curator field can be either a person, an organization, or a mix of persons and organizations.

To specify whether some ResearchAgent object should be of type Person or of type Organization, do the following:

# ... other fields
"curator": [{
    "name": "John Doe",

    # this special field dictates the type. the curator object is of type person.
    "@type": "Person"
}]
# ... other fields

Likewise, to specify an Organization object:

# ... other fields
"curator": [{
    # note! for organizations, the "name" field supports translations, and has to specify at least one language!
    "name": {
        "en": "Organization X",
        "fi": "Organisaatio X",
    },

    # this special field dictates the type. the curator object is of type organization.
    "@type": "Organization"
}]
# ... other fields

In the above example, the curator field is actually an array, so the list of curators can even be a mix of objects where some are persons, and some are organizations.
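
A mixed curator list of this kind can be built with two small helpers. The functions person and organization are hypothetical conveniences, not part of any Metax client; they only encode the two rules shown above.

```python
def person(name):
    """A ResearchAgent of type Person; the name is a plain string."""
    return {"@type": "Person", "name": name}


def organization(**names):
    """A ResearchAgent of type Organization; names are keyed by language code."""
    if not names:
        raise ValueError("an organization name needs at least one language")
    return {"@type": "Organization", "name": dict(names)}


curators = [
    person("John Doe"),
    organization(en="Organization X", fi="Organisaatio X"),
]
print(curators)
```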

All this needs to be taken into account when looking which reference data to use, when dealing with Person or Organization objects in the schema.

Reference data table

In the table, the left hand side names the relation field which uses reference data, and the right hand side gives its mode and url. Note that one or several of the relations can be an array of objects instead of a single object. The mode is either required or optional. Required means the relation’s identifier field only accepts values from reference data; all other values result in a validation error. Optional means a value from reference data can be used as the identifier’s value, but custom values are also accepted (such as custom identifiers of organizations, if you have any). Finally, the url is the address where the reference data can be found in ElasticSearch.
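
The difference between the two modes can be illustrated with a short function. It is a sketch of the rule as described here, not of Metax’s actual validation; the reference identifiers in the example are made-up placeholders.

```python
def accepts_identifier(identifier, mode, reference_identifiers):
    """Tell whether an identifier value would be accepted under the given mode."""
    if mode == "required":
        # Only values found in reference data are accepted.
        return identifier in reference_identifiers
    if mode == "optional":
        # Reference data may be used, but custom values are accepted too.
        return True
    raise ValueError(f"unknown mode: {mode!r}")


# Placeholder reference data identifiers, for illustration only.
known = {"http://example.org/ref/a", "http://example.org/ref/b"}
print(accepts_identifier("http://example.org/ref/a", "required", known))
print(accepts_identifier("my-custom-id", "required", known))
print(accepts_identifier("my-custom-id", "optional", known))
```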

Some of the reference data can also be browsed using the koodistot.suomi.fi service: https://koodistot.suomi.fi/registry;registryCode=fairdata. It is important to note that not all reference data indexes are available in that service, but for what’s in there, it can be helpful.

In the below table, the person- and organization-related relations have been separated from the rest of the fields that use reference data, to make it easier to find out which reference data to use depending on what kind of object is being used.

It helps to have the research_dataset data model visualization open while looking at the table. To help with recognizing which relations are single objects and which are arrays, the table below uses a trailing [] in field names to signal that the field is actually an array. While effort is made to keep this table up to date, if it looks like it contains mistakes (e.g. some field is actually not an array, or vice versa), the truth is always found in the related dataset JSON schema file.

Note

The reference data urls below contain the ?pretty=true parameter, which formats the output into a more readable form. The default page only shows a few results, so be sure to check out Querying Reference Data for more examples of how to browse reference data in general.

{
    "research_dataset.access_rights.access_type.identifier":           { "mode": "required", "url": "https://metax.fd-test.csc.fi/es/reference_data/access_type/_search?pretty=true" },
    "research_dataset.access_rights.license[].identifier":             { "mode": "required", "url": "https://metax.fd-test.csc.fi/es/reference_data/license/_search?pretty=true" },
    "research_dataset.access_rights.restriction_grounds[].identifier": { "mode": "required", "url": "https://metax.fd-test.csc.fi/es/reference_data/restriction_grounds/_search?pretty=true" },
    "research_dataset.directories[].use_category.identifier":          { "mode": "required", "url": "https://metax.fd-test.csc.fi/es/reference_data/use_category/_search?pretty=true" },
    "research_dataset.field_of_science[].identifier":                  { "mode": "required", "url": "https://metax.fd-test.csc.fi/es/reference_data/field_of_science/_search?pretty=true" },
    "research_dataset.files[].file_type.identifier":                   { "mode": "required", "url": "https://metax.fd-test.csc.fi/es/reference_data/file_type/_search?pretty=true" },
    "research_dataset.files[].use_category.identifier":                { "mode": "required", "url": "https://metax.fd-test.csc.fi/es/reference_data/use_category/_search?pretty=true" },
    "research_dataset.infrastructure[].identifier":                    { "mode": "required", "url": "https://metax.fd-test.csc.fi/es/reference_data/research_infra/_search?pretty=true" },
    "research_dataset.language[].identifier":                          { "mode": "required", "url": "https://metax.fd-test.csc.fi/es/reference_data/language/_search?pretty=true" },
    "research_dataset.other_identifier[].type.identifier":             { "mode": "required", "url": "https://metax.fd-test.csc.fi/es/reference_data/identifier_type/_search?pretty=true" },
    "research_dataset.provenance[].event_outcome.identifier":          { "mode": "required", "url": "https://metax.fd-test.csc.fi/es/reference_data/event_outcome/_search?pretty=true" },
    "research_dataset.provenance[].lifecycle_event.identifier":        { "mode": "required", "url": "https://metax.fd-test.csc.fi/es/reference_data/lifecycle_event/_search?pretty=true" },
    "research_dataset.provenance[].preservation_event.identifier":     { "mode": "required", "url": "https://metax.fd-test.csc.fi/es/reference_data/preservation_event/_search?pretty=true" },
    "research_dataset.provenance[].spatial.place_uri.identifier":      { "mode": "required", "url": "https://metax.fd-test.csc.fi/es/reference_data/location/_search?pretty=true" },
    "research_dataset.provenance[].used_entity[].type.identifier":     { "mode": "required", "url": "https://metax.fd-test.csc.fi/es/reference_data/resource_type/_search?pretty=true" },
    "research_dataset.relation[].entity.type.identifier":              { "mode": "required", "url": "https://metax.fd-test.csc.fi/es/reference_data/resource_type/_search?pretty=true" },
    "research_dataset.relation[].relation_type.identifier":            { "mode": "required", "url": "https://metax.fd-test.csc.fi/es/reference_data/relation_type/_search?pretty=true" },
    "research_dataset.remote_resources[].file_type.identifier":        { "mode": "required", "url": "https://metax.fd-test.csc.fi/es/reference_data/file_type/_search?pretty=true" },
    "research_dataset.remote_resources[].license[].identifier":        { "mode": "required", "url": "https://metax.fd-test.csc.fi/es/reference_data/license/_search?pretty=true" },
    "research_dataset.remote_resources[].media_type":                  { "mode": "optional", "url": "https://metax.fd-test.csc.fi/es/reference_data/mime_type/_search?pretty=true" },
    "research_dataset.remote_resources[].resource_type.identifier":    { "mode": "required", "url": "https://metax.fd-test.csc.fi/es/reference_data/resource_type/_search?pretty=true" },
    "research_dataset.remote_resources[].use_category.identifier":     { "mode": "required", "url": "https://metax.fd-test.csc.fi/es/reference_data/use_category/_search?pretty=true" },
    "research_dataset.spatial[].place_uri.identifier":                 { "mode": "required", "url": "https://metax.fd-test.csc.fi/es/reference_data/location/_search?pretty=true" },
    "research_dataset.theme[].identifier":                             { "mode": "required", "url": "https://metax.fd-test.csc.fi/es/reference_data/keyword/_search?pretty=true" },

    # organizations. note! can be recursive through the organization-object's `is_part_of` relation
    "research_dataset.contributor[].contributor_type[].identifier":     { "mode": "required", "url": "https://metax.fd-test.csc.fi/es/reference_data/contributor_type/_search?pretty=true" },
    "research_dataset.contributor[].identifier":                        { "mode": "optional", "url": "https://metax.fd-test.csc.fi/es/organization_data/organization/_search?pretty=true" },
    "research_dataset.creator[].contributor_type[].identifier":         { "mode": "required", "url": "https://metax.fd-test.csc.fi/es/reference_data/contributor_type/_search?pretty=true" },
    "research_dataset.creator[].identifier":                            { "mode": "optional", "url": "https://metax.fd-test.csc.fi/es/organization_data/organization/_search?pretty=true" },
    "research_dataset.curator[].contributor_type[].identifier":         { "mode": "required", "url": "https://metax.fd-test.csc.fi/es/reference_data/contributor_type/_search?pretty=true" },
    "research_dataset.curator[].identifier":                            { "mode": "optional", "url": "https://metax.fd-test.csc.fi/es/organization_data/organization/_search?pretty=true" },
    "research_dataset.is_output_of[].funder_type.identifier":           { "mode": "required", "url": "https://metax.fd-test.csc.fi/es/organization_data/organization/_search?pretty=true" },
    "research_dataset.is_output_of[].has_funding_agency[].identifier":  { "mode": "optional", "url": "https://metax.fd-test.csc.fi/es/organization_data/organization/_search?pretty=true" },
    "research_dataset.is_output_of[].source_organization[].identifier": { "mode": "optional", "url": "https://metax.fd-test.csc.fi/es/organization_data/organization/_search?pretty=true" },
    "research_dataset.other_identifier[].provider.identifier":          { "mode": "required", "url": "https://metax.fd-test.csc.fi/es/organization_data/organization/_search?pretty=true" },
    "research_dataset.provenance[].was_associated_with.contributor_type[].identifier": { "mode": "optional", "url": "https://metax.fd-test.csc.fi/es/reference_data/contributor_type/_search?pretty=true" },
    "research_dataset.publisher[].contributor_type[].identifier":       { "mode": "required", "url": "https://metax.fd-test.csc.fi/es/reference_data/contributor_type/_search?pretty=true" },
    "research_dataset.publisher[].identifier":                          { "mode": "optional", "url": "https://metax.fd-test.csc.fi/es/organization_data/organization/_search?pretty=true" },
    "research_dataset.rights_holder[].contributor_type[].identifier":   { "mode": "required", "url": "https://metax.fd-test.csc.fi/es/reference_data/contributor_type/_search?pretty=true" },
    "research_dataset.rights_holder[].identifier":                      { "mode": "optional", "url": "https://metax.fd-test.csc.fi/es/organization_data/organization/_search?pretty=true" },

    # persons
    "research_dataset.contributor[].contributor_role[].identifier":   { "mode": "required", "url": "https://metax.fd-test.csc.fi/es/reference_data/contributor_role/_search?pretty=true" },
    "research_dataset.contributor[].contributor_type[].identifier":   { "mode": "required", "url": "https://metax.fd-test.csc.fi/es/reference_data/contributor_type/_search?pretty=true" },
    "research_dataset.contributor[].member_of.identifier":            { "mode": "optional", "url": "https://metax.fd-test.csc.fi/es/organization_data/organization/_search?pretty=true" },
    "research_dataset.creator[].contributor_role[].identifier":       { "mode": "required", "url": "https://metax.fd-test.csc.fi/es/reference_data/contributor_role/_search?pretty=true" },
    "research_dataset.creator[].contributor_type[].identifier":       { "mode": "required", "url": "https://metax.fd-test.csc.fi/es/reference_data/contributor_type/_search?pretty=true" },
    "research_dataset.creator[].member_of.identifier":                { "mode": "optional", "url": "https://metax.fd-test.csc.fi/es/organization_data/organization/_search?pretty=true" },
    "research_dataset.curator[].contributor_role[].identifier":       { "mode": "required", "url": "https://metax.fd-test.csc.fi/es/reference_data/contributor_role/_search?pretty=true" },
    "research_dataset.curator[].contributor_type[].identifier":       { "mode": "required", "url": "https://metax.fd-test.csc.fi/es/reference_data/contributor_type/_search?pretty=true" },
    "research_dataset.curator[].member_of.identifier":                { "mode": "optional", "url": "https://metax.fd-test.csc.fi/es/organization_data/organization/_search?pretty=true" },
    "research_dataset.publisher[].contributor_role[].identifier":     { "mode": "required", "url": "https://metax.fd-test.csc.fi/es/reference_data/contributor_role/_search?pretty=true" },
    "research_dataset.publisher[].contributor_type[].identifier":     { "mode": "required", "url": "https://metax.fd-test.csc.fi/es/reference_data/contributor_type/_search?pretty=true" },
    "research_dataset.publisher[].member_of.identifier":              { "mode": "optional", "url": "https://metax.fd-test.csc.fi/es/organization_data/organization/_search?pretty=true" },
    "research_dataset.provenance[].was_associated_with[].contributor_role[].identifier": { "mode": "required", "url": "https://metax.fd-test.csc.fi/es/reference_data/contributor_role/_search?pretty=true" },
    "research_dataset.provenance[].was_associated_with[].contributor_type[].identifier": { "mode": "required", "url": "https://metax.fd-test.csc.fi/es/reference_data/contributor_typ/_search?pretty=truee" }
    "research_dataset.provenance[].was_associated_with[].member_of.identifier":          { "mode": "optional", "url": "https://metax.fd-test.csc.fi/es/organization_data/organization/_search?pretty=true" },
    "research_dataset.rights_holder[].contributor_role[].identifier": { "mode": "required", "url": "https://metax.fd-test.csc.fi/es/reference_data/contributor_role/_search?pretty=true" },
    "research_dataset.rights_holder[].contributor_type[].identifier": { "mode": "required", "url": "https://metax.fd-test.csc.fi/es/reference_data/contributor_type/_search?pretty=true" },
    "research_dataset.rights_holder[].member_of.identifier":          { "mode": "optional", "url": "https://metax.fd-test.csc.fi/es/organization_data/organization/_search?pretty=true" },
}

Note

A special note on the relations contributor_type and contributor_role. In ResearchAgent relations of type Organization, only the relation contributor_type can be used. In the same relations where type Person is used instead, both contributor_type and contributor_role can be used. This is also communicated in the schema, but since persons and organizations can often be used in place of each other, this small difference can easily slip by unnoticed. There are of course other differences in the schema as well, but this one is among the less obvious.
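The rule in the note above can be checked locally before a dataset is sent to Metax. The following is a minimal sketch only: the helper misplaced_relations and the placeholder identifier example-role are hypothetical and not part of the Metax API, and the schema remains the authoritative source.

```python
# Relations allowed per agent type, as described in the note above.
ALLOWED_RELATIONS = {
    "Person": {"contributor_type", "contributor_role"},
    "Organization": {"contributor_type"},
}
CHECKED_RELATIONS = ("contributor_type", "contributor_role")


def misplaced_relations(agent):
    """Return the contributor_* relations that the agent's @type does not allow."""
    allowed = ALLOWED_RELATIONS.get(agent.get("@type"))
    if allowed is None:
        # unknown agent type: nothing to say about it in this sketch
        return []
    return [rel for rel in CHECKED_RELATIONS if rel in agent and rel not in allowed]


# "example-role" is a placeholder, not a real reference data identifier.
person = {
    "@type": "Person",
    "name": "Teppo Testaaja",
    "contributor_role": [{"identifier": "example-role"}],
}
organization = {
    "@type": "Organization",
    "name": {"und": "School Services, BIZ"},
    "contributor_role": [{"identifier": "example-role"}],
}

print(misplaced_relations(person))        # []
print(misplaced_relations(organization))  # ['contributor_role']
```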

Using REMS

REMS can be used to grant individual users access to download dataset files. When dataset access is managed by REMS, the dataset owner can decide which users are able to download the files associated with the dataset.

To enable REMS, set access_type to permit, and ensure that the dataset belongs to the IDA catalog and has at least one license defined. You can enable REMS when creating a new dataset, or later when updating an existing dataset.
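The three preconditions can be checked on the client side before sending the dataset. This is a rough sketch, not Metax's own validation: the helper rems_problems is hypothetical, and the IDA catalog identifier used here is an assumption that should be verified against your environment.

```python
PERMIT = "http://uri.suomi.fi/codelist/fairdata/access_type/code/permit"
# Assumption: identifier of the IDA data catalog; verify against your environment.
IDA_CATALOG = "urn:nbn:fi:att:data-catalog-ida"


def rems_problems(dataset):
    """List the REMS preconditions that the dataset dict does not satisfy."""
    problems = []
    access_rights = dataset.get("research_dataset", {}).get("access_rights", {})
    if access_rights.get("access_type", {}).get("identifier") != PERMIT:
        problems.append("access_type is not permit")
    if not access_rights.get("license"):
        problems.append("no license defined")
    catalog = dataset.get("data_catalog")
    # a retrieved record carries the catalog as an object, a request body as a string
    if isinstance(catalog, dict):
        catalog = catalog.get("identifier")
    if catalog != IDA_CATALOG:
        problems.append("dataset is not in the IDA catalog")
    return problems


candidate = {
    "data_catalog": IDA_CATALOG,
    "research_dataset": {
        "access_rights": {
            "access_type": {"identifier": PERMIT},
            "license": [
                {"identifier": "http://uri.suomi.fi/codelist/fairdata/license/code/CC0-1.0"}
            ],
        }
    },
}
print(rems_problems(candidate))  # []
```

An empty list only means the three conditions named above hold locally; Metax may still reject the dataset for other reasons.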

Changing access type

When access_type is set to permit, dataset downloads are managed by REMS. If this functionality is no longer wanted, changing the access_type to any other access type disables REMS for the dataset. Example of defining the permit access type:

# ... other fields
"access_rights": {
    # ... other access rights
    "access_type": {
        "identifier": "http://uri.suomi.fi/codelist/fairdata/access_type/code/permit"
    }
}
# ... other fields

More information about updating a dataset can be found in Update examples.

Changing license

A license is a required property for datasets that are managed by REMS. This license is what a downloading user must agree to. If the dataset describes multiple licenses, REMS only considers the first one, so changing the license in REMS means changing the first license in the dataset. Example of defining a license:

# ... other fields
"access_rights": {
    # ... other access rights
    "license": [
        {
        "identifier": "http://uri.suomi.fi/codelist/fairdata/license/code/CC0-1.0"
        }
    ]
}
# ... other fields

Please refer to Update examples for more information about the update process.

Note

Changing the license of a REMS-managed dataset closes all existing download accesses to the dataset.
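Since REMS only considers the first license, changing the REMS license amounts to reordering (or extending) the license list. A minimal sketch follows, assuming a hypothetical helper make_first_license; the CC-BY-4.0 identifier is used for illustration only and should be checked against the license reference data.

```python
CC0 = "http://uri.suomi.fi/codelist/fairdata/license/code/CC0-1.0"
# Illustrative second license; check the identifier against the license reference data.
CC_BY = "http://uri.suomi.fi/codelist/fairdata/license/code/CC-BY-4.0"


def make_first_license(access_rights, license_identifier):
    """Return a copy of access_rights with the given license moved to the front."""
    licenses = list(access_rights.get("license", []))
    chosen = [lic for lic in licenses if lic.get("identifier") == license_identifier]
    if not chosen:
        # the license was not in the list yet, so add it
        chosen = [{"identifier": license_identifier}]
    rest = [lic for lic in licenses if lic.get("identifier") != license_identifier]
    return {**access_rights, "license": chosen[:1] + rest}


access_rights = {"license": [{"identifier": CC_BY}, {"identifier": CC0}]}
updated = make_first_license(access_rights, CC0)
print([lic["identifier"] for lic in updated["license"]])
```

The updated access_rights would then be sent in a normal dataset update. Keep the note above in mind: the change closes all existing download accesses.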

Access granter

Metax stores the necessary user information about the access granter in a separate field on CatalogRecord. When making a dataset REMS managed, end users do not need to worry about this, because the information is automatically gathered from the access token. Service users need to provide this information in the request body, because it is a required property when making a dataset REMS managed. The access granter is visible via the API only to the owner of the dataset. Example:

access_granter = {
    "userid": "jodoe1",
    "name": "John Doe",
    "email": "john.doe@example.com"
}
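For service users, attaching the access granter to a request body could look roughly like the sketch below. It assumes that access_granter is a top-level key of the request body (inferred from it being a separate field on CatalogRecord) and that all three keys shown above are needed; the helper with_access_granter is hypothetical.

```python
# Assumption: all three keys from the example above are required.
REQUIRED_KEYS = ("userid", "name", "email")


def with_access_granter(request_body, granter):
    """Return a copy of the request body with a validated access_granter attached."""
    missing = [key for key in REQUIRED_KEYS if not granter.get(key)]
    if missing:
        raise ValueError("access_granter is missing: " + ", ".join(missing))
    # assumption: access_granter sits at the top level, next to research_dataset
    return {**request_body, "access_granter": dict(granter)}


body = with_access_granter(
    {"research_dataset": {}},
    {"userid": "jodoe1", "name": "John Doe", "email": "john.doe@example.com"},
)
print(sorted(body))  # ['access_granter', 'research_dataset']
```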

Examples

These code examples are from the point of view of an end user. Using the API as an end user requires that the user logs in to https://metax.fd-test.csc.fi/secure in order to get a valid access token, which will be used to authenticate with the API. The process for end user authentication is described on the page End User Access.

When service accounts interact with Metax, the services have the additional responsibility of providing values for fields related to the current user modifying or creating resources, and generally taking care that the user is permitted to do whatever it is that they are doing.

Retrieve minimal valid dataset template

The API GET /rpc/v2/datasets/get_minimal_dataset_template returns a valid minimal dataset that can be used as-is to create a dataset in Metax. The example below requests the template with type enduser; a PAS template can be fetched with type enduser_pas.

import requests

response = requests.get('https://metax.fd-test.csc.fi/rpc/v2/datasets/get_minimal_dataset_template?type=enduser')
assert response.status_code == 200, response.content

# dataset_data can now be used in a POST request to create a new dataset!
dataset_data = response.json()

headers = { 'Authorization': 'Bearer abc.def.ghi' }
response = requests.post('https://metax.fd-test.csc.fi/rest/v2/datasets?draft=true', json=dataset_data, headers=headers)
assert response.status_code == 201, response.content
print(response.json())

Important

The other code examples below contain the full dataset in written form to give you an idea of what the dataset contents really look like. While these textual examples can sometimes get outdated, the dataset template from the API is always kept up to date, and serves as a good starting point for your own dataset.

Creating datasets

Create a dataset with minimum required fields.

import requests

dataset_data = {
    "data_catalog": "urn:nbn:fi:att:data-catalog-att",
    "research_dataset": {
        "title": {
            "en": "Test Dataset Title"
        },
        "description": {
            "en": "A descriptive description describing the contents of this dataset. Must be descriptive."
        },
        "creator": [
            {
                "name": "Teppo Testaaja",
                "@type": "Person",
                "member_of": {
                    "name": {
                        "fi": "Mysteeriorganisaatio"
                    },
                    "@type": "Organization"
                }
            }
        ],
        "curator": [
            {
                "name": {
                    "und": "School Services, BIZ"
                },
                "@type": "Organization",
                "identifier": "http://uri.suomi.fi/codelist/fairdata/organization/code/01901"
            }
        ],
        "language":[{
            "title": { "en": "en" },
            "identifier": "http://lexvo.org/id/iso639-3/aar"
        }],
        "access_rights": {
            "access_type": {
                "identifier": "http://uri.suomi.fi/codelist/fairdata/access_type/code/open"
            }
        }
    }
}

headers = { 'Authorization': 'Bearer abc.def.ghi' }
response = requests.post('https://metax.fd-test.csc.fi/rest/v2/datasets?draft=true', json=dataset_data, headers=headers)
assert response.status_code == 201, response.content
print(response.json())

The response should look something like below:

{
    "id": 9152,
    "identifier": "54efa8b4-f03f-4155-9814-7de6aed4adce",
    "data_catalog": {
        "id": 1,
        "identifier": "urn:nbn:fi:att:data-catalog-att"
    },
    "dataset_version_set": [
        {
            "identifier": "54efa8b4-f03f-4155-9814-7de6aed4adce",
            "preferred_identifier": "urn:nbn:fi:att:58757004-e9b8-4ac6-834c-f5affaa7ec29",
            "removed": false,
            "date_created": "2018-09-10T12:18:38+03:00"
        }
    ],
    "deprecated": false,
    "metadata_owner_org": "myorganization.fi",
    "metadata_provider_org": "myorganization.fi",
    "metadata_provider_user": "myuserid",
    "research_dataset": {
        "title": {
            "en": "Test Dataset Title"
        },

        # <... all the other content that you uploaded ...>

        "preferred_identifier": "draft:54efa8b4-f03f-4155-9814-7de6aed4adce",
        "metadata_version_identifier": "49de6002-df1c-4090-9af6-d4e970904a5b"
    },
    "state": "draft",
    "cumulative_state": 0,
    "preservation_state": 0,
    "removed": False,
    "date_created": "2018-09-10T12:18:38+03:00",
    "user_created": "myuserid"
}

Explanation of all the fields in the received response/newly created dataset:

  • id An internal database identifier in Metax.

  • identifier The unique identifier of the created record in Metax. This is the identifier to use when interacting with the dataset in Metax in any subsequent requests, such as when retrieving, updating, or deleting the dataset.

  • dataset_version_set List of dataset versions associated with this record. Since the record was just created, only one version is listed.

  • deprecated When files are deleted or unfrozen from IDA, any datasets containing those files are marked as “deprecated”, and the value of this field will be set to True. The value of this field may have an effect in other services, when displaying the dataset contents.

  • metadata_owner_org, metadata_provider_org, metadata_provider_user Information about the creator of the metadata, and the associated organization. These are automatically placed according to the information available from the authentication token.

  • research_dataset Now has two new fields generated by Metax:

    • preferred_identifier The persistent identifier of the dataset. This is the persistent identifier to use when externally referring to the dataset, in publications etc. When the dataset is in draft state, the value is “draft:<identifier>”, which is NOT a real persistent identifier.

    • metadata_version_identifier The identifier of the specific metadata version. Will be generated by Metax each time the contents of the field research_dataset changes.

  • state State of the dataset. Value is “draft” or “published”.

  • cumulative_state Cumulative state of the dataset.

  • preservation_state The PAS status of the record.

  • removed Value will be True when the record is deleted.

  • date_created Date when record was created.

  • user_created Identifier of the user who created the record.
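Since the preferred_identifier of a draft is not a real persistent identifier, client code may want to guard against citing it. A small sketch, using only the fields explained above; the helper citable_identifier is hypothetical.

```python
def citable_identifier(record):
    """Return the persistent identifier, or None while the dataset is a draft."""
    pid = record["research_dataset"]["preferred_identifier"]
    if record.get("state") == "draft" or pid.startswith("draft:"):
        return None
    return pid


draft_record = {
    "state": "draft",
    "research_dataset": {
        "preferred_identifier": "draft:54efa8b4-f03f-4155-9814-7de6aed4adce"
    },
}
published_record = {
    "state": "published",
    "research_dataset": {
        "preferred_identifier": "urn:nbn:fi:att:58757004-e9b8-4ac6-834c-f5affaa7ec29"
    },
}

print(citable_identifier(draft_record))      # None
print(citable_identifier(published_record))  # urn:nbn:fi:att:58757004-e9b8-4ac6-834c-f5affaa7ec29
```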

Caution

In test environments, the internal id fields work in place of the string-form unique identifiers (the identifier field), and are very handy for that purpose. In production environments they should never be used, since in some situations they can change without notice, which may result in errors or in accidentally referring to unintended objects. The longer identifiers are persistent and always safe to use. Example of using the internal id field to retrieve a dataset: https://metax.fd-test.csc.fi/rest/v2/datasets/12 (note: assuming a record with the id 12 exists)

Errors: Required fields missing

Try to create a dataset with required fields missing. The example below is missing the required field data_catalog.

import requests

dataset_data = {
    "research_dataset": {
        "title": {
            "en": "Test Dataset Title"
        },
        "description": {
            "en": "A descriptive description describing the contents of this dataset. Must be descriptive."
        },
        "creator": [
            {
                "name": "Teppo Testaaja",
                "@type": "Person",
                "member_of": {
                    "name": {
                        "fi": "Mysteeriorganisaatio"
                    },
                    "@type": "Organization"
                }
            }
        ],
        "curator": [
            {
                "name": {
                    "und": "School Services, BIZ"
                },
                "@type": "Organization",
                "identifier": "http://uri.suomi.fi/codelist/fairdata/organization/code/01901"
            }
        ],
        "language":[{
            "title": { "en": "en" },
            "identifier": "http://lexvo.org/id/iso639-3/aar"
        }],
        "access_rights": {
            "access_type": {
                "identifier": "http://uri.suomi.fi/codelist/fairdata/access_type/code/open"
            }
        }
    }
}

headers = { 'Authorization': 'Bearer abc.def.ghi' }
response = requests.post('https://metax.fd-test.csc.fi/rest/v2/datasets?draft=true', json=dataset_data, headers=headers)
assert response.status_code == 400, response.content
print(response.json())

The error response should look something like this:

{
    "data_catalog": [
        "This field is required."
    ],
    "error_identifier": "2018-09-10T08:52:24-4c755256"
}
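An error response of this shape mixes per-field messages with the error_identifier. When logging or reporting errors, it can help to separate the two. The sketch below works on the parsed JSON body; the helper split_error_response is hypothetical.

```python
def split_error_response(body):
    """Separate the error_identifier from the per-field error messages."""
    field_errors = {key: value for key, value in body.items() if key != "error_identifier"}
    return body.get("error_identifier"), field_errors


error_body = {
    "data_catalog": ["This field is required."],
    "error_identifier": "2018-09-10T08:52:24-4c755256",
}
error_id, field_errors = split_error_response(error_body)
for field, messages in field_errors.items():
    print("%s: %s (error id %s)" % (field, "; ".join(messages), error_id))
```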

Errors: JSON validation error in field research_dataset

Try to create a dataset when JSON schema validation fails for field research_dataset. In the below example, the required field title is missing from the JSON blob inside field research_dataset.

Important

The contents of the field research_dataset are validated directly against the relevant schema from GET /rest/v2/schemas, so probably either the ida schema or the att schema, depending on whether you are going to include files from the Fairdata IDA service in your dataset. When schema validation fails, the entire output from the validator is returned. To the untrained eye, it can be difficult to find the relevant parts of the output. For that reason, it is strongly recommended that you:

  • Periodically upload your dataset to Metax using the optional query parameter ?dryrun=true, which executes all validations on the dataset and returns the same result it normally would, except that nothing is actually saved into the Metax database. If you are working on a draft dataset, using the dryrun parameter may not be relevant for you.

  • Start with a bare minimum working dataset description, and add new fields and descriptions incrementally, validating the contents periodically. This way, it will be a lot easier to backtrack and find any mistakes in the JSON structure.

import requests

dataset_data = {
    "data_catalog": "urn:nbn:fi:att:data-catalog-att",
    "research_dataset": {
        "description": {
            "en": "A descriptive description describing the contents of this dataset. Must be descriptive."
        },
        "creator": [
            {
                "name": "Teppo Testaaja",
                "@type": "Person",
                "member_of": {
                    "name": {
                        "fi": "Mysteeriorganisaatio"
                    },
                    "@type": "Organization"
                }
            }
        ],
        "curator": [
            {
                "name": {
                    "und": "School Services, BIZ"
                },
                "@type": "Organization",
                "identifier": "http://uri.suomi.fi/codelist/fairdata/organization/code/01901"
            }
        ],
        "language":[{
            "title": { "en": "en" },
            "identifier": "http://lexvo.org/id/iso639-3/aar"
        }],
        "access_rights": {
            "access_type": {
                "identifier": "http://uri.suomi.fi/codelist/fairdata/access_type/code/open"
            }
        }
    }
}

headers = { 'Authorization': 'Bearer abc.def.ghi' }
response = requests.post('https://metax.fd-test.csc.fi/rest/v2/datasets', json=dataset_data, headers=headers)
assert response.status_code == 400, response.content
print(response.json())

The error response should look something like this:

{
    "research_dataset": [
        "'title' is a required property. Json path: []. Schema: { ... <very long output here>"
    ],
    "error_identifier": "2018-09-10T09:04:41-54fb4e22"
}
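The dryrun recommendation above can be wrapped into a small helper that is called after each incremental change to the dataset description. This is a sketch under assumptions: the helper dry_run_create and its post parameter are hypothetical, and the exact status code returned for a successful dry run is not asserted here.

```python
def dry_run_create(dataset, token, post=None, base_url="https://metax.fd-test.csc.fi"):
    """POST the dataset with ?dryrun=true and return (status_code, parsed body)."""
    if post is None:
        import requests  # imported lazily, so the helper can be exercised with a stub
        post = requests.post
    response = post(
        base_url + "/rest/v2/datasets",
        params={"dryrun": "true"},
        json=dataset,
        headers={"Authorization": "Bearer " + token},
    )
    return response.status_code, response.json()


def report(status_code, body):
    """Print the validation errors of the research_dataset field, if any."""
    if status_code == 400:
        for message in body.get("research_dataset", []):
            # the validator output is long; the first line usually names the problem
            print(message.splitlines()[0][:200])
```

A typical use would be report(*dry_run_create(dataset_data, 'abc.def.ghi')) after each new field is added to dataset_data.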

Retrieving datasets

Retrieving an existing dataset using a dataset’s internal Metax identifier:

import requests

response = requests.get('https://metax.fd-test.csc.fi/rest/v2/datasets/abc123')
assert response.status_code == 200, response.content
print(response.json())

Here, abc123 is the internal Metax identifier of the record (field identifier). The retrieved content should look exactly the same as the response received when creating a dataset. See above.

By default, the received data does not include the user metadata of files and directories. In order to include the user metadata, use the optional query parameter ?include_user_metadata=true. Then, the user metadata can be found in research_dataset.files and research_dataset.directories.

Updating datasets

Update metadata

Update an existing dataset using a PUT request:

import requests

# first retrieve a dataset that you are the owner of
headers = { 'Authorization': 'Bearer abc.def.ghi' }
response = requests.get('https://metax.fd-test.csc.fi/rest/v2/datasets/abc123', headers=headers)
assert response.status_code == 200, response.content

modified_data = response.json()
modified_data['research_dataset']['description']['en'] = 'A More Accurate Description'

response = requests.put('https://metax.fd-test.csc.fi/rest/v2/datasets/abc123', json=modified_data, headers=headers)
assert response.status_code == 200, response.content
print(response.json())

A successful update operation will return the dataset with updated content.

Caution

When updating a dataset, be sure to authenticate with the API when retrieving the dataset, since some sensitive fields from the dataset are filtered out when retrieved without authentication (or by the general public). Otherwise, when saving the dataset, you may accidentally lose some data when you upload the modified dataset!

The exact same result can be achieved using a PATCH request, which allows you to only update specific fields. In the below example, we are updating only the field research_dataset. While you can always use either PUT or PATCH for update, PATCH is always less risky in the sense that you will not accidentally modify fields you didn’t intend to. Using PATCH is more relevant to service accounts, since end user API users already have pretty strict restrictions in place for what fields can be modified.

# ... the beginning is the same as in the above example

# only updating the field research_dataset
modified_data = {
    'research_dataset': response.json()['research_dataset']
}

modified_data['research_dataset']['description']['en'] = 'A More Accurate Description'

# add the HTTP Authorization header, since authentication will be required when executing write operations in the API.
headers = { 'Authorization': 'Bearer abc.def.ghi' }
response = requests.patch('https://metax.fd-test.csc.fi/rest/v2/datasets/abc123', json=modified_data, headers=headers)

# ... the rest is the same as in the above example

The outcome of the update operation should be the same as in the above example.
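One way to keep a PATCH body minimal is to compare the retrieved record with the locally modified copy and send only the top-level fields that differ. The sketch below illustrates the idea; the helper changed_fields is hypothetical, and the comparison is deliberately limited to top-level fields.

```python
import copy


def changed_fields(original, modified):
    """Return only the top-level fields whose values differ from the original."""
    return {
        key: value
        for key, value in modified.items()
        if key not in original or original[key] != value
    }


original = {
    "identifier": "abc123",
    "state": "draft",
    "research_dataset": {"description": {"en": "Old description"}},
}
modified = copy.deepcopy(original)
modified["research_dataset"]["description"]["en"] = "A More Accurate Description"

patch_body = changed_fields(original, modified)
print(list(patch_body))  # ['research_dataset']
```

The resulting patch_body can be passed as the json argument of requests.patch, as in the example above.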

Working with files: Add and exclude files

Create a new dataset, while adding files to it

It’s possible to add files to a dataset in the same initial request in which the dataset is created. More files can then be added or excluded in subsequent requests using a different related API. See other examples.

import requests

headers = { 'Authorization': 'Bearer abc.def.ghi' }

# let's assume cr_data contains all the other necessary minimum fields to create a dataset.

# note: this entry only tells Metax to add this file to the dataset. the entry itself is not persisted anywhere after
# processing of the dataset has finished.
cr_data['research_dataset']['files'] = [
    { 'identifier': '5105ab9839f63a909893183c14f9b55n' }
]

response = requests.post('https://metax.fd-test.csc.fi/rest/v2/datasets?draft=true', json=cr_data, headers=headers)
assert response.status_code == 201, response.content

# retrieve the list of technical file metadata of the newly created dataset
identifier = response.json()['identifier']
response = requests.get('https://metax.fd-test.csc.fi/rest/v2/datasets/%s/files' % identifier, headers=headers)
assert response.status_code == 200, response.content
assert len(response.json()) == 1, response.json()

Add new files to a draft dataset

The example assumes a draft dataset has been previously created, without any files.

import requests

headers = { 'Authorization': 'Bearer abc.def.ghi' }

# note: this entry only tells Metax to add these files to the dataset. the entries are not persisted anywhere after
# processing of the dataset has finished.
file_changes = {
    'files': [
        { 'identifier': '5105ab9839f63a909893183c14f9e9db' },
        { 'identifier': '5105ab9839f63a909893183c14f9h37f' },
    ]
}

response = requests.post('https://metax.fd-test.csc.fi/rest/v2/datasets/abc123/files', json=file_changes, headers=headers)
assert response.status_code == 200, response.content
assert response.json()['files_added'] == 2, response.json()

# retrieve the list of technical file metadata of the dataset
response = requests.get('https://metax.fd-test.csc.fi/rest/v2/datasets/abc123/files', headers=headers)
assert response.status_code == 200, response.content
assert len(response.json()) == 2, response.json()

added_file_identifiers = [ f['identifier'] for f in response.json() ]
assert '5105ab9839f63a909893183c14f9e9db' in added_file_identifiers, added_file_identifiers
assert '5105ab9839f63a909893183c14f9h37f' in added_file_identifiers, added_file_identifiers

Add a directory of files to a dataset

Functionally, adding a directory to a dataset works the exact same way as adding a single file. The effect of adding a directory vs. a single file is a lot greater though, since all the files included in that directory, and its sub-directories, are added to the dataset.

Below is an example similar to the first example where we added files. The dataset in its initial state does not have any files added to it.

import requests

headers = { 'Authorization': 'Bearer abc.def.ghi' }

# let's assume the example directories contain a total of 10 files

file_changes = {
    'directories': [
        { 'identifier': '5105ab9839f63a909893183c14f9kk3h' },
        { 'identifier': '5105ab9839f63a909893183c14f9br77' },
    ]
}

response = requests.post('https://metax.fd-test.csc.fi/rest/v2/datasets/abc123/files', json=file_changes, headers=headers)
assert response.status_code == 200, response.content
assert response.json()['files_added'] == 10, response.json()

# retrieve the list of technical file metadata of the dataset
response = requests.get('https://metax.fd-test.csc.fi/rest/v2/datasets/abc123/files', headers=headers)
assert response.status_code == 200, response.content
assert len(response.json()) == 10, response.json()

Excluding files

When adding files en masse by adding a directory, it’s possible to exclude individual files or directories of files.

When adding and excluding directories in the same request, the entries are processed in the order they are provided in the request. For example, if a list of directory entries contains some exclusions and ends with an entry that adds a root directory, the earlier exclusion entries have no effect, since the root directory adds the excluded files back. File entries are processed after directory entries.

import requests

headers = { 'Authorization': 'Bearer abc.def.ghi' }

# let's assume the added directory contains a total of 10 files, of which the excluded directory contains 2 files.
# one more file is excluded individually, so the total number of added files should be 7.

file_changes = {
    'files': [
        { 'identifier': '5105ab9839f63a909893183c14f9b55n', 'exclude': True },
    ],
    'directories': [
        { 'identifier': '5105ab9839f63a909893183c14f9kk3h' }, # a directory that contains the other directory, and the other file
        { 'identifier': '5105ab9839f63a909893183c14f9br77', 'exclude': True },
    ]
}

response = requests.post('https://metax.fd-test.csc.fi/rest/v2/datasets/abc123/files', json=file_changes, headers=headers)
assert response.status_code == 200, response.content
assert response.json()['files_added'] == 7, response.json()

# retrieve the list of technical file metadata of the dataset
response = requests.get('https://metax.fd-test.csc.fi/rest/v2/datasets/abc123/files', headers=headers)
assert response.status_code == 200, response.content
assert len(response.json()) == 7, response.json()
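The processing order described above can be illustrated with a toy model that runs entirely locally. This is not Metax's implementation, only a sketch of the stated rules (directory entries in request order, then file entries), applied to a dataset that starts without files. The function resolve_file_changes and the identifiers root, sub, and f0 to f9 are made up for the illustration.

```python
def resolve_file_changes(directory_contents, file_changes):
    """Apply directory entries in request order, then file entries, to a set of file ids."""
    selected = set()
    for entry in file_changes.get("directories", []):
        files = directory_contents[entry["identifier"]]
        if entry.get("exclude"):
            selected -= files
        else:
            selected |= files
    for entry in file_changes.get("files", []):
        if entry.get("exclude"):
            selected.discard(entry["identifier"])
        else:
            selected.add(entry["identifier"])
    return selected


# Hypothetical contents: "root" holds 10 files, two of which live in its sub-directory "sub".
contents = {
    "root": {"f%d" % i for i in range(10)},
    "sub": {"f0", "f1"},
}
excluded_file = {"identifier": "f2", "exclude": True}

add_then_exclude = {
    "directories": [{"identifier": "root"}, {"identifier": "sub", "exclude": True}],
    "files": [excluded_file],
}
exclude_then_add = {
    "directories": [{"identifier": "sub", "exclude": True}, {"identifier": "root"}],
    "files": [excluded_file],
}

print(len(resolve_file_changes(contents, add_then_exclude)))  # 7
print(len(resolve_file_changes(contents, exclude_then_add)))  # 9
```

In the second request the directory exclusion comes first and is undone by the later root entry, so only the individually excluded file is left out.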

Add files while including user metadata

When adding files to a dataset, it’s possible to include user metadata for those files in the same request body. User metadata can additionally be updated or deleted using a separate API endpoint.

import requests

headers = { 'Authorization': 'Bearer abc.def.ghi' }

file_changes = {
    'files': [
        {
            'identifier': '5105ab9839f63a909893183c14f9b55n',
            'title': 'Example file',
            'description': 'Detailed description of example file.',
            'use_category': {
                'identifier': 'source'
            }
        }
    ]
}

response = requests.post('https://metax.fd-test.csc.fi/rest/v2/datasets/abc123/files', json=file_changes, headers=headers)
assert response.status_code == 200, response.content
assert response.json()['files_added'] == 1, response.json()

# the user metadata of the file should now be available from the research_dataset.files relation
response = requests.get('https://metax.fd-test.csc.fi/rest/v2/datasets/abc123', headers=headers)
assert response.status_code == 200, response.content
assert len(response.json()['research_dataset']['files']) == 1, response.json()
assert response.json()['research_dataset']['files'][0]['title'] == 'Example file', response.json()

Retrieve technical metadata of a single file

Retrieve full technical metadata of a single file of a dataset.

import requests

headers = { 'Authorization': 'Bearer abc.def.ghi' }

# retrieve technical metadata of a file
response = requests.get('https://metax.fd-test.csc.fi/rest/v2/datasets/abc123/files/5105ab9839f63a909893183c14f9b55n', headers=headers)
assert response.status_code == 200, response.content
assert response.json()['identifier'] == '5105ab9839f63a909893183c14f9b55n', response.json()

Working with files: Updating user metadata

In addition to being included when adding files, user metadata can be updated or deleted using a separate API endpoint.

Important

Using this API assumes that the files have been previously added to the dataset. Adding new files to the dataset using this API is NOT possible! Trying to add user metadata for files that have not been added to the dataset will result in an error.

PUT can be used to fully replace user metadata. When initially adding user metadata to a file, the minimum required fields should always be present. After a file already has some user metadata in place, PATCH can be used to update individual fields of it.

Add or replace user metadata

import requests

headers = { 'Authorization': 'Bearer abc.def.ghi' }

file_changes = {
    'files': [
        {
            'identifier': '5105ab9839f63a909893183c14f9b55n',
            'title': 'Example file',
            'description': 'Detailed description of example file.',
            'use_category': {
                'identifier': 'source'
            }
        }
    ]
}

response = requests.put('https://metax.fd-test.csc.fi/rest/v2/datasets/abc123/files/user_metadata', json=file_changes, headers=headers)
assert response.status_code == 200, response.content

# the user metadata of the file should now be available from the research_dataset.files relation
response = requests.get('https://metax.fd-test.csc.fi/rest/v2/datasets/abc123', headers=headers)
assert response.status_code == 200, response.content
assert len(response.json()['research_dataset']['files']) == 1, response.json()
assert response.json()['research_dataset']['files'][0]['title'] == 'Example file', response.json()

Partially update user metadata

The example assumes the files have already had user metadata added previously.

import requests

headers = { 'Authorization': 'Bearer abc.def.ghi' }

file_changes = {
    'files': [
        {
            'identifier': '5105ab9839f63a909893183c14f9b55n',
            'description': 'An improved, more detailed description of example file.',
        }
    ]
}

response = requests.patch('https://metax.fd-test.csc.fi/rest/v2/datasets/abc123/files/user_metadata', json=file_changes, headers=headers)
assert response.status_code == 200, response.content

# the file's updated user metadata should now be available from the research_dataset.files relation
response = requests.get('https://metax.fd-test.csc.fi/rest/v2/datasets/abc123', headers=headers)
assert response.status_code == 200, response.content
assert len(response.json()['research_dataset']['files']) == 1, response.json()
assert response.json()['research_dataset']['files'][0]['description'].startswith('An improved'), response.json()
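The difference between PUT and PATCH described above can be illustrated with a small local model. This is only a sketch of the semantics as described in this section, not Metax's implementation: the function names are hypothetical, and the shallow merge used for PATCH is an assumption, since this page does not say how nested objects are merged.

```python
import copy


def apply_put(stored, entry):
    """Model of PUT: the stored user metadata is fully replaced by the entry."""
    return copy.deepcopy(entry)


def apply_patch(stored, entry):
    """Model of PATCH: only fields present in the entry are overwritten.

    Assumption: top-level fields are merged shallowly.
    """
    merged = copy.deepcopy(stored)
    merged.update(copy.deepcopy(entry))
    return merged


stored = {
    'identifier': '5105ab9839f63a909893183c14f9b55n',
    'title': 'Example file',
    'description': 'Detailed description of example file.',
    'use_category': {'identifier': 'source'},
}
change = {
    'identifier': '5105ab9839f63a909893183c14f9b55n',
    'description': 'An improved, more detailed description of example file.',
}

patched = apply_patch(stored, change)   # title and use_category are kept
replaced = apply_put(stored, change)    # title and use_category are gone

print(sorted(patched))
print(sorted(replaced))
```

In this model, sending the partial body with PUT would drop title and use_category, which is why the minimum required fields should be present whenever PUT is used.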

Deleting user metadata

A file's user metadata can be deleted by adding the key delete with the value True to its entry in the request body. The key can be used in both PUT and PATCH requests to the user_metadata API endpoint, and it works the same way for directories. The example assumes the files have already had user metadata added previously.

import requests

headers = { 'Authorization': 'Bearer abc.def.ghi' }

file_changes = {
    'files': [
        {
            'identifier': '5105ab9839f63a909893183c14f9b55n', 'delete': True
        }
    ]
}

response = requests.put('https://metax.fd-test.csc.fi/rest/v2/datasets/abc123/files/user_metadata', json=file_changes, headers=headers)
assert response.status_code == 200, response.content

# the file's user metadata should no longer be available from the research_dataset.files relation
response = requests.get('https://metax.fd-test.csc.fi/rest/v2/datasets/abc123', headers=headers)
assert response.status_code == 200, response.content
assert 'files' not in response.json()['research_dataset'], response.json()
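Because the delete key works the same way for directories, a request body that deletes user metadata for several files and directories at once could be assembled as sketched below. The helper name build_delete_payload is hypothetical, and the directory identifier is only an illustrative value. The files and directories keys are the ones used with the user_metadata endpoint elsewhere on this page.

```python
def build_delete_payload(file_ids=(), directory_ids=()):
    """Build a user_metadata request body that marks each entry for deletion."""
    payload = {}
    if file_ids:
        payload['files'] = [
            {'identifier': identifier, 'delete': True} for identifier in file_ids
        ]
    if directory_ids:
        payload['directories'] = [
            {'identifier': identifier, 'delete': True} for identifier in directory_ids
        ]
    return payload


payload = build_delete_payload(
    file_ids=['5105ab9839f63a909893183c14f9b55n'],
    directory_ids=['5105ab9839f63a909893183c14f9e113'],
)
print(payload)
```

The resulting dictionary can be passed as the json argument of the PUT or PATCH request shown above.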

Retrieve user metadata of a single file

Retrieve user metadata of a single file of a dataset.

import requests

headers = { 'Authorization': 'Bearer abc.def.ghi' }

# retrieve user metadata of a file
response = requests.get('https://metax.fd-test.csc.fi/rest/v2/datasets/abc123/files/5105ab9839f63a909893183c14f9b55n/user_metadata', headers=headers)
assert response.status_code == 200, response.content
assert response.json()['identifier'] == '5105ab9839f63a909893183c14f9b55n', response.json()

Retrieve user metadata of a single directory

Retrieve user metadata of a single directory of a dataset using the same API endpoint, but additionally provide the ?directory=true query parameter, in which case the provided identifier is treated as the identifier of a directory instead.

import requests

headers = { 'Authorization': 'Bearer abc.def.ghi' }

# retrieve user metadata of a directory
response = requests.get('https://metax.fd-test.csc.fi/rest/v2/datasets/abc123/files/5105ab9839f63a909893183c14f9k228/user_metadata?directory=true', headers=headers)
assert response.status_code == 200, response.content
assert response.json()['identifier'] == '5105ab9839f63a909893183c14f9k228', response.json()
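Since files and directories share the same endpoint and differ only in the ?directory=true parameter, the URL can be built by one small helper. This is a sketch only: user_metadata_url and BASE_URL are hypothetical names, and the base address is the test environment used in the examples on this page.

```python
from urllib.parse import quote, urlencode

BASE_URL = 'https://metax.fd-test.csc.fi/rest/v2'  # test environment from the examples


def user_metadata_url(dataset_id, object_id, directory=False):
    """Build the URL for retrieving user metadata of a file or a directory."""
    dataset_part = quote(dataset_id, safe='')
    object_part = quote(object_id, safe='')
    url = f'{BASE_URL}/datasets/{dataset_part}/files/{object_part}/user_metadata'
    if directory:
        # the identifier is then interpreted as a directory identifier
        url += '?' + urlencode({'directory': 'true'})
    return url


print(user_metadata_url('abc123', '5105ab9839f63a909893183c14f9b55n'))
print(user_metadata_url('abc123', '5105ab9839f63a909893183c14f9k228', directory=True))
```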

Deleting datasets

Deleting a draft dataset

Delete a draft dataset using a DELETE request:

import requests

headers = { 'Authorization': 'Bearer abc.def.ghi' }
response = requests.delete('https://metax.fd-test.csc.fi/rest/v2/datasets/abc123', headers=headers)
assert response.status_code == 204, response.content

# the dataset is now removed from the general API results
response = requests.get('https://metax.fd-test.csc.fi/rest/v2/datasets/abc123')
assert response.status_code == 404, 'metax should return 404 due to dataset not found'

# the dataset should not be findable even if using the ?removed=true parameter
response = requests.get('https://metax.fd-test.csc.fi/rest/v2/datasets/abc123?removed=true')
assert response.status_code == 404, 'dataset should have been permanently deleted'

Deleting a published dataset

Delete a published dataset using a DELETE request:

import requests

headers = { 'Authorization': 'Bearer abc.def.ghi' }
response = requests.delete('https://metax.fd-test.csc.fi/rest/v2/datasets/abc123', headers=headers)
assert response.status_code == 204, response.content

# the dataset is now removed from the general API results
response = requests.get('https://metax.fd-test.csc.fi/rest/v2/datasets/abc123')
assert response.status_code == 404, 'metax should return 404 due to dataset not found'

# removed datasets are still findable using the ?removed=true parameter
response = requests.get('https://metax.fd-test.csc.fi/rest/v2/datasets/abc123?removed=true')
assert response.status_code == 200, 'metax should have returned a dataset'
assert response.json()['removed'] is True, 'dataset should be marked as removed'
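The two deletion examples differ only in what the ?removed=true request returns afterwards. The sketch below summarizes that difference as a pure function over the two status codes. The function name and the returned labels are hypothetical, and a 404 from both requests cannot distinguish a permanently deleted draft from a dataset that never existed.

```python
def deletion_state(plain_status, removed_status):
    """Interpret the status codes of GET <dataset> and GET <dataset>?removed=true."""
    if plain_status == 200:
        return 'available'
    if plain_status == 404 and removed_status == 200:
        # a deleted published dataset is still findable with ?removed=true
        return 'removed'
    if plain_status == 404 and removed_status == 404:
        # a deleted draft leaves no trace behind
        return 'permanently deleted or never existed'
    return 'unknown'


print(deletion_state(404, 200))
print(deletion_state(404, 404))
```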

Publishing datasets

If a dataset was initially created in draft state, it must be published in order to become publicly findable and to receive persistent resolvable identifiers. Publishing a dataset is done using a special RPC API endpoint, which is only usable by the owner of the dataset. The response from the request should contain the newly generated persistent identifier of the dataset, which is from then on found in the research_dataset.preferred_identifier field.

import requests

headers = { 'Authorization': 'Bearer abc.def.ghi' }
response = requests.post('https://metax.fd-test.csc.fi/rpc/v2/datasets/publish_dataset?identifier=abc123', headers=headers)
assert response.status_code == 200, response.content
assert 'preferred_identifier' in response.json(), 'response should include the newly generated preferred_identifier'

response = requests.get('https://metax.fd-test.csc.fi/rest/v2/datasets/abc123')
assert response.status_code == 200, response.content
assert response.json()['state'] == 'published', 'dataset state should now be published'

Creating a new version of a dataset

When a dataset has been published, a new version of it can be created using a special RPC API endpoint, which is only usable by the owner of the dataset. The new dataset version is created into draft state.

Creating a new version of a dataset in an automated fashion using this API requires that the dataset has been created in a data catalog that supports dataset versioning, such as the Fairdata IDA catalog.

import requests

headers = { 'Authorization': 'Bearer abc.def.ghi' }
response = requests.post('https://metax.fd-test.csc.fi/rpc/v2/datasets/create_new_version?identifier=abc123', headers=headers)
assert response.status_code == 201, response.content
assert 'identifier' in response.json(), 'response should include the internal identifier of the new dataset version'
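The publish and new-version endpoints follow the same pattern: an RPC method name plus the dataset identifier as a query parameter. A minimal sketch of a URL builder for these two calls is shown below. The names rpc_url, RPC_BASE and EXPECTED_STATUS are hypothetical, and the expected status codes are simply the ones asserted in the examples above.

```python
from urllib.parse import urlencode

RPC_BASE = 'https://metax.fd-test.csc.fi/rpc/v2/datasets'  # test environment from the examples

# status codes asserted in the examples on this page
EXPECTED_STATUS = {
    'publish_dataset': 200,
    'create_new_version': 201,
}


def rpc_url(method, identifier):
    """Build the URL of a dataset RPC call covered in this section."""
    if method not in EXPECTED_STATUS:
        raise ValueError(f'unsupported RPC method in this sketch: {method}')
    return f'{RPC_BASE}/{method}?' + urlencode({'identifier': identifier})


print(rpc_url('publish_dataset', 'abc123'))
print(rpc_url('create_new_version', 'abc123'))
```

Both URLs are intended for POST requests sent with the dataset owner's Authorization header.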

Browsing a dataset’s files

File metadata of a dataset can be browsed in two ways.

The first way is to retrieve a flat list of file metadata of all the files included in the dataset. Be advised: the API endpoint below does not use paging, so if the dataset contains a very large number of files, the amount of data downloaded by default can be very large. It is therefore highly recommended to use the query parameter file_fields=field_1,field_2,field_3... to retrieve only the information you are interested in:

import requests

# retrieve all file metadata
response = requests.get('https://metax.fd-test.csc.fi/rest/v2/datasets/abc123/files')
assert response.status_code == 200, response.content

# retrieve only specified fields from file metadata
response = requests.get('https://metax.fd-test.csc.fi/rest/v2/datasets/abc123/files?file_fields=identifier,file_path')
assert response.status_code == 200, response.content

In addition to the above, individual files can be retrieved in the following manner:

import requests

# retrieve the metadata of a single file
response = requests.get('https://metax.fd-test.csc.fi/rest/v2/datasets/abc123/files/5105ab9839f63a909893183c14f9b55n')
assert response.status_code == 200, response.content
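The file_fields parameter described above is a comma-separated list, so it can be assembled from a Python list of field names. The helper below is a sketch with a hypothetical name. It only builds the URL and leaves the request itself to the caller.

```python
from urllib.parse import urlencode

BASE_URL = 'https://metax.fd-test.csc.fi/rest/v2'  # test environment from the examples


def dataset_files_url(dataset_id, file_fields=None):
    """Build the URL for the flat file list of a dataset.

    file_fields is an optional list of field names used to limit the response.
    """
    url = f'{BASE_URL}/datasets/{dataset_id}/files'
    if file_fields:
        # keep commas unescaped so the value reads field_1,field_2,...
        url += '?' + urlencode({'file_fields': ','.join(file_fields)}, safe=',')
    return url


print(dataset_files_url('abc123'))
print(dataset_files_url('abc123', ['identifier', 'file_path']))
```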

The second way is to use the same API that is used to browse the files of a project in general (see Browsing files). Browsing the files of a dataset works the same way, except that an additional query parameter cr_identifier=<dataset_identifier> should be provided in order to retrieve only those files and directories that are included in the specified dataset.

Example:

import requests

response = requests.get('https://metax.fd-test.csc.fi/rest/v2/directories/dir123/files?cr_identifier=abc123')
assert response.status_code == 200, response.content

Hint

Etsin, a Fairdata service, provides a nice graphical UI for browsing files of published datasets.

Note

When browsing the files of a dataset, authentication with the API is not required, since if a dataset is retrievable from the API, it means it has been published, and its files are now public information.

When browsing files for the purpose of editing a dataset, the query parameter ?not_cr_identifier=<dataset_identifier> can be useful to browse only files that have NOT been added to the dataset. Using this parameter requires that the user is the owner of the dataset, and a member of the project of files being browsed. Example:

import requests

headers = { 'Authorization': 'Bearer abc.def.ghi' }

response = requests.get('https://metax.fd-test.csc.fi/rest/v2/directories/dir123/files?not_cr_identifier=abc123', headers=headers)
assert response.status_code == 200, response.content
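The two browsing modes, cr_identifier and not_cr_identifier, can be captured in one helper that also records whether the call is restricted to the dataset owner. This is a sketch under assumptions: the function name and the returned keys are hypothetical, and refusing to combine both parameters is a choice made for this example, not documented Metax behaviour.

```python
from urllib.parse import urlencode

BASE_URL = 'https://metax.fd-test.csc.fi/rest/v2'  # test environment from the examples


def directory_browse_request(directory_id, cr_identifier=None, not_cr_identifier=None):
    """Describe a request for browsing a directory in the context of a dataset."""
    if cr_identifier and not_cr_identifier:
        # assumption of this sketch: use one filter at a time
        raise ValueError('use either cr_identifier or not_cr_identifier, not both')
    params = {}
    if cr_identifier:
        params['cr_identifier'] = cr_identifier
    if not_cr_identifier:
        params['not_cr_identifier'] = not_cr_identifier
    url = f'{BASE_URL}/directories/{directory_id}/files'
    if params:
        url += '?' + urlencode(params)
    # not_cr_identifier requires the dataset owner, who is also a project member
    return {'url': url, 'owner_only': not_cr_identifier is not None}


print(directory_browse_request('dir123', cr_identifier='abc123'))
print(directory_browse_request('dir123', not_cr_identifier='abc123'))
```

When owner_only is True, the request needs to be sent with the owner's Authorization header, as in the example above.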

Using reference data

This section shows how to modify research_dataset so that it contains data that depends on reference data.

Be sure to also check out Querying Reference Data for useful examples of how to browse reference data in general.

Add a directory

The example below assumes an existing bare-minimum draft dataset to which some files have already been added, and adds user metadata to a directory of that dataset. The directory object has a mandatory field called use_category, which requires a value from reference data in its identifier field. In the dataset reference data table on this same page (Reference data table), we should be able to find this row:

{
    # ...
    "research_dataset.directories[].use_category.identifier":             { "mode": "required", "url": "https://metax.fd-test.csc.fi/es/reference_data/use_category/_search?pretty=true" },
    # ...
}

This means that the field research_dataset.directories[].use_category.identifier uses reference data, and the mode field in the table indicates that the value for identifier must come from reference data: custom values are not allowed. The url shows where valid values can be found: https://metax.fd-test.csc.fi/es/reference_data/use_category/_search?pretty=true. So we browse the reference data and, in this example, decide that “source code” is a fitting use category for the directory. The value to use for the field research_dataset.directories[].use_category.identifier is then the uri field of the selected reference data entry: “http://uri.suomi.fi/codelist/fairdata/use_category/code/source”. Below is an example of how to use the value.

Note: Instead of using the uri value, code would work just as well.

import requests

headers = { 'Authorization': 'Bearer abc.def.ghi' }
file_changes = {
    'directories': [
        {
            "identifier": "5105ab9839f63a909893183c14f9e113",
            "title": "Directory Title",
            "description": "What is this directory about",
            "use_category": {
                # the value to the below field is from reference data
                "identifier": "http://uri.suomi.fi/codelist/fairdata/use_category/code/source",
            }
        }
    ]
}

response = requests.put('https://metax.fd-test.csc.fi/rest/v2/datasets/abc123/files/user_metadata', json=file_changes, headers=headers)
assert response.status_code == 200, response.content

When the dataset is updated, some fields inside the field use_category will have been populated by Metax according to the used reference data.
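Picking the uri for a given code from the reference data search results can also be done in code. The sketch below is an assumption-heavy illustration: it presumes the reference data URL returns a standard Elasticsearch search response, with entries under hits.hits[]._source that carry code and uri fields. The function name and the sample response are hypothetical stand-ins, not real output.

```python
def uri_for_code(search_response, code):
    """Return the uri of the reference data entry with the given code, or None.

    Assumes an Elasticsearch-style body: {'hits': {'hits': [{'_source': {...}}]}}.
    """
    for hit in search_response.get('hits', {}).get('hits', []):
        source = hit.get('_source', {})
        if source.get('code') == code:
            return source.get('uri')
    return None


# illustrative stand-in for a parsed search response (assumed shape)
sample_response = {
    'hits': {
        'hits': [
            {
                '_source': {
                    'code': 'source',
                    'uri': 'http://uri.suomi.fi/codelist/fairdata/use_category/code/source',
                }
            },
        ]
    }
}

print(uri_for_code(sample_response, 'source'))
print(uri_for_code(sample_response, 'no-such-code'))
```

The returned uri can then be used as the value of use_category.identifier in the request body shown above.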

For more information about reference data, see Reference Data.