Concept

A dataset is a collection of datapoints. It can be used for the following purposes:
  1. Storing data for future fine-tuning or prompt-tuning.
  2. Providing inputs and expected outputs for Evaluations.

Format

Every datapoint has two fixed JSON objects, data and target, each with arbitrary keys, alongside a metadata object. target is only used in evaluations.
  • data – the actual datapoint data.
  • target – data additionally sent to the evaluator function.
  • metadata – arbitrary key-value metadata about the datapoint.
For every key inside data and target, the value can be any JSON value.

Example

This is an example of a valid datapoint.
{
    "data": {
        "color": "red",
        "size": "large",
        "messages": [
            {
                "role": "user",
                "content": "Hello, can you help me choose a T-shirt?"
            },
            {
                "role": "assistant",
                "content": "I'm afraid we don't sell T-shirts."
            }
        ]
    },
    "target": {
        "expected_output": "Of course! What size and color are you looking for?"
    }
}
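
The format rules above can be checked before pushing data. The following is a minimal sketch, not part of any SDK: validate_datapoint is a hypothetical helper, and it assumes that data is required while target and metadata are optional.

```python
import json


def validate_datapoint(raw: str) -> dict:
    """Parse a JSON string and check it against the datapoint format.

    Hypothetical helper. Assumption: "data" is required, while "target"
    and "metadata" are optional but must be JSON objects when present.
    """
    datapoint = json.loads(raw)
    if not isinstance(datapoint, dict):
        raise ValueError("a datapoint must be a JSON object")
    if not isinstance(datapoint.get("data"), dict):
        raise ValueError("'data' must be a JSON object")
    for key in ("target", "metadata"):
        if key in datapoint and not isinstance(datapoint[key], dict):
            raise ValueError(f"'{key}' must be a JSON object")
    # Values under each key may be any JSON value, so no deeper checks are made.
    return datapoint


example = validate_datapoint(
    '{"data": {"color": "red", "size": "large"},'
    ' "target": {"expected_output": "Of course!"}}'
)
print(sorted(example))
```
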

Editing

Datasets are editable. To edit a datapoint, click on it and edit its data as JSON. Changes are saved as a new datapoint version.

Versioning

Each datapoint has a unique id and a created_at timestamp. Every time you edit a datapoint, a new datapoint version is created under the hood with the same id and an updated created_at timestamp. The version stack is push-only: when you revert to a previous version, a copy of that version is created and added as the current version. Example:
  • Initial version (v1):
{
  "id": "019a3122-ca78-7d75-91a7-a860526895b2",
  "created_at": "2025-01-01T00:00:00.000Z",
  "data": { "key": "initial value" }
}
  • Version 2 (v2):
{
  "id": "019a3122-ca78-7d75-91a7-a860526895b2",
  "created_at": "2025-01-05T00:00:05.000Z",
  "data": { "key": "value at v2" }
}
  • Version 3 (v3):
{
  "id": "019a3122-ca78-7d75-91a7-a860526895b2",
  "created_at": "2025-01-10T00:00:10.000Z",
  "data": { "key": "value at v3" }
}
Suppose you now revert to version 1 (the initial version). This creates a new version (v4) with the same id, the data from v1, and an updated created_at timestamp.
  • Version 4 (v4):
{
  "id": "019a3122-ca78-7d75-91a7-a860526895b2",
  "created_at": "2025-01-15T00:00:15.000Z",
  "data": { "key": "initial value" }
}
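
The push-only behavior can be modeled in a few lines. This is an illustrative sketch only, not the actual storage implementation: DatapointVersion, DatapointHistory, push, and revert_to are hypothetical names.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class DatapointVersion:
    id: str
    created_at: datetime
    data: dict


@dataclass
class DatapointHistory:
    """Hypothetical model of a push-only version stack for one datapoint."""

    id: str
    versions: list = field(default_factory=list)

    def push(self, data: dict) -> DatapointVersion:
        # Every edit appends a new version: same id, fresh created_at.
        version = DatapointVersion(self.id, datetime.now(timezone.utc), dict(data))
        self.versions.append(version)
        return version

    def revert_to(self, version_number: int) -> DatapointVersion:
        # Reverting never removes anything: it copies the old data
        # into a new version on top of the stack.
        source = self.versions[version_number - 1]
        return self.push(source.data)

    @property
    def current(self) -> DatapointVersion:
        return self.versions[-1]


history = DatapointHistory(id="019a3122-ca78-7d75-91a7-a860526895b2")
history.push({"key": "initial value"})  # v1
history.push({"key": "value at v2"})    # v2
history.push({"key": "value at v3"})    # v3
history.revert_to(1)                    # v4, a copy of v1
print(len(history.versions), history.current.data)
# prints: 4 {'key': 'initial value'}
```

Because reverting appends rather than truncates, v2 and v3 remain in the history and can themselves be reverted to later.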

Datapoint id

When you push a new datapoint to a dataset, a UUIDv7 is generated for it. Because a UUIDv7 begins with a timestamp, datapoints can be sorted by id to recover their creation order, which preserves the order of insertion.
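
To see why UUIDv7 ids sort by creation time, note that the first 48 bits of a UUIDv7 hold a Unix timestamp in milliseconds. The sketch below uses only the Python standard library. uuid7_timestamp_ms is a hypothetical helper, and two of the three ids are made up for illustration.

```python
import uuid


def uuid7_timestamp_ms(datapoint_id: str) -> int:
    """Return the Unix timestamp (ms) embedded in a UUIDv7 string."""
    parsed = uuid.UUID(datapoint_id)
    if parsed.version != 7:
        raise ValueError(f"{datapoint_id} is not a UUIDv7")
    # A UUID is 128 bits; the top 48 bits are the millisecond timestamp.
    return parsed.int >> 80


ids = [
    "019a3123-0000-7000-8000-000000000002",  # hypothetical id
    "019a3122-ca78-7d75-91a7-a860526895b2",  # id from the versioning example
    "019a3122-ca79-7000-8000-000000000001",  # hypothetical id
]

# Plain string sorting matches sorting by the embedded timestamp.
by_string = sorted(ids)
by_timestamp = sorted(ids, key=uuid7_timestamp_ms)
print(by_string == by_timestamp)
```

Ids generated within the same millisecond share a timestamp prefix, so their relative order depends on how the generator fills the remaining bits.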