CBOR Metadata Schema

Tensogram v2 uses CBOR (Concise Binary Object Representation) for all structured metadata. There are four kinds of CBOR structures in a message, each living in its own frame:

  1. GlobalMetadata — in header or footer metadata frames
  2. DataObjectDescriptor — inside each data object frame
  3. IndexFrame — in header or footer index frames
  4. HashFrame — in header or footer hash frames

All CBOR maps use deterministic encoding with canonical key ordering per RFC 8949 section 4.2. Keys are sorted by the byte representation of their CBOR-encoded key, applied recursively to nested maps. This means the same metadata always produces the same bytes — important if you hash messages or compare them by digest.

GlobalMetadata

The global metadata frame contains a single CBOR map. The only required key is version; everything else is optional.

| Key | Type | Required | Description |
|---|---|---|---|
| version | uint | Yes | Format version. Currently 2 |
| base | array of maps | No | Per-object metadata: one entry per data object; each entry holds ALL metadata for that object independently |
| _reserved_ | map | No | Library internals (provenance: encoder, time, uuid). Client code MUST NOT write to this. |
| _extra_ | map | No | Client-writable catch-all for ad-hoc message-level annotations |
| any unknown key | any | No | Silently ignored on decode (forward compatibility) |

Each data object is self-describing via its own per-frame descriptor (see below). The base array provides per-object metadata at the message level so readers can discover object metadata from the global frame alone, without opening each data object frame.
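
The key rules in the table above can be sketched as a small pre-encode check. This is a hypothetical helper written for illustration, not part of the Tensogram API: the function name and error messages are assumptions, and the real encoder may validate more (or differently).

```python
def check_client_metadata(meta: dict) -> dict:
    """Hypothetical pre-encode check; NOT the Tensogram API."""
    version = meta.get("version")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(version, bool) or not isinstance(version, int) or version < 0:
        raise ValueError("'version' is required and must be an unsigned integer")

    # _reserved_ is library-managed at the message level.
    if "_reserved_" in meta:
        raise ValueError("'_reserved_' must not be written by client code")

    base = meta.get("base", [])
    if not isinstance(base, list) or not all(isinstance(e, dict) for e in base):
        raise ValueError("'base' must be an array of maps")

    if not isinstance(meta.get("_extra_", {}), dict):
        raise ValueError("'_extra_' must be a map")

    # Unknown keys are deliberately left alone (forward compatibility).
    return meta


check_client_metadata({"version": 2, "base": [{"mars": {"param": "2t"}}]})
```

Unknown top-level keys pass through untouched, mirroring the "silently ignored on decode" rule.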

The base Array

The base array has one entry per data object. Each entry is a CBOR map holding all structured metadata for that object. The encoder auto-populates _reserved_.tensor (containing ndim, shape, strides, dtype) in each entry and preserves application keys (e.g. "mars"):

{
  "base": [
    {
      "mars": { "class": "od", "stream": "oper", "param": "2t", "date": "20260404" },
      "_reserved_": {
        "tensor": { "ndim": 2, "shape": [721, 1440], "strides": [1440, 1], "dtype": "float32" }
      }
    },
    {
      "mars": { "class": "od", "stream": "oper", "param": "10u", "date": "20260404" },
      "_reserved_": {
        "tensor": { "ndim": 2, "shape": [721, 1440], "strides": [1440, 1], "dtype": "float32" }
      }
    }
  ]
}

Each entry corresponds to one data object in order. Entries are independent — there is no tracking of which keys are common across objects. If you need to extract commonalities (e.g. for display or merge operations), use the compute_common() utility in software after decoding.
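
To make the idea of "extracting commonalities" concrete, here is a minimal sketch that keeps only the key/value pairs shared by every base entry. It is an illustration of the concept, not the library's compute_common() implementation; the helper name common_keys and its recursive behaviour are assumptions.

```python
def common_keys(entries: list[dict]) -> dict:
    """Return key/value pairs identical across all entries (nested maps are compared recursively).

    Hypothetical helper; the real compute_common() may behave differently.
    """
    if not entries:
        return {}
    first, rest = entries[0], entries[1:]
    common = {}
    for key, value in first.items():
        if not all(key in e for e in rest):
            continue
        others = [e[key] for e in rest]
        if isinstance(value, dict) and all(isinstance(o, dict) for o in others):
            nested = common_keys([value, *others])
            if nested:
                common[key] = nested
        elif all(o == value for o in others):
            common[key] = value
    return common


base = [
    {"mars": {"class": "od", "stream": "oper", "param": "2t", "date": "20260404"}},
    {"mars": {"class": "od", "stream": "oper", "param": "10u", "date": "20260404"}},
]
print(common_keys(base))
# {'mars': {'class': 'od', 'stream': 'oper', 'date': '20260404'}}
```

Only param differs between the two entries, so it is the one key dropped from the result.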

Key difference from earlier versions: There is no common/payload split. Every base[i] entry is self-contained. MARS keys that are shared across all objects (e.g. class, stream, date) are simply repeated in each entry.

The _reserved_ Section

The _reserved_ section at the message level holds library-managed provenance information. Client code can read these values but must not write to _reserved_ — the encoder validates this and rejects messages where client code has written to it.

{
  "_reserved_": {
    "encoder": { "name": "tensogram", "version": "0.1.0" },
    "time": "2026-04-06T12:00:00Z",
    "uuid": "550e8400-e29b-41d4-a716-446655440000"
  }
}

Note: _reserved_.encoder.version is set to the library's crate version at compile time via env!("CARGO_PKG_VERSION"), so the version strings shown in the examples on this page are illustrative and depend on the tensogram version in use.

Within each base[i] entry, the encoder also auto-populates _reserved_.tensor:

{
  "_reserved_": {
    "tensor": {
      "ndim": 2,
      "shape": [721, 1440],
      "strides": [1440, 1],
      "dtype": "float32"
    }
  }
}
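
The tensor examples on this page pair shape [721, 1440] with strides [1440, 1], which is what a contiguous row-major layout with strides counted in elements would give. The sketch below derives such strides from a shape. It assumes row-major, element-based strides; the function name is hypothetical and the encoder may record other layouts.

```python
def row_major_strides(shape: list[int]) -> list[int]:
    """Element strides for a contiguous row-major (C-order) array.

    Assumption: strides are in elements, not bytes.
    """
    strides = [1] * len(shape)
    # Walk from the second-to-last dimension back to the first.
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides


print(row_major_strides([721, 1440]))  # [1440, 1]
```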

The _extra_ Section

The _extra_ section is a client-writable catch-all for ad-hoc message-level annotations:

{
  "_extra_": {
    "source": "ifs-cycle49r2",
    "experiment_tag": "alpha-run-003"
  }
}

Example GlobalMetadata

A complete example with two data objects (temperature and wind fields):

{
  "version": 2,
  "base": [
    {
      "mars": {
        "class": "od", "stream": "oper", "expver": "0001",
        "date": "20260404", "time": "0000", "step": "0",
        "levtype": "sfc", "grid": "regular_ll", "param": "2t"
      },
      "_reserved_": {
        "tensor": { "ndim": 2, "shape": [721, 1440], "strides": [1440, 1], "dtype": "float32" }
      }
    },
    {
      "mars": {
        "class": "od", "stream": "oper", "expver": "0001",
        "date": "20260404", "time": "0000", "step": "0",
        "levtype": "sfc", "grid": "regular_ll", "param": "10u"
      },
      "_reserved_": {
        "tensor": { "ndim": 2, "shape": [721, 1440], "strides": [1440, 1], "dtype": "float32" }
      }
    }
  ],
  "_reserved_": {
    "encoder": { "name": "tensogram", "version": "0.6.0" },
    "time": "2026-04-06T12:00:00Z",
    "uuid": "550e8400-e29b-41d4-a716-446655440000"
  },
  "_extra_": {
    "source": "ifs-cycle49r2"
  }
}

Each base[i] entry is fully self-contained. The only key that varies between the two entries above is param. All other MARS keys are repeated — this is by design. Commonalities can be computed in software via compute_common() when needed.

Optional: Full GRIB Namespace Keys

When the GRIB importer runs with preserve_all_keys (CLI: --all-keys), all non-mars ecCodes namespace keys are stored under a "grib" sub-object within each base[i] entry:

{
  "base": [
    {
      "mars": { "class": "od", "grid": "regular_ll", "param": "2t", "..." : "..." },
      "grib": {
        "geography": { "Ni": 1440, "Nj": 721, "gridType": "regular_ll" },
        "time":      { "dataDate": 20260404, "dataTime": 0 },
        "ls":        { "edition": 2, "centre": "ecmf", "packingType": "grid_ccsds" },
        "parameter":  { "paramId": 167, "shortName": "2t", "units": "K" },
        "statistics": { "max": 311.03, "min": 212.84, "avg": 277.6 }
      },
      "_reserved_": {
        "tensor": { "ndim": 2, "shape": [721, 1440], "strides": [1440, 1], "dtype": "float64" }
      }
    }
  ]
}

The namespaces captured are: ls, geography, time, vertical, parameter, statistics. Keys may overlap between namespaces (e.g. gridType appears in both ls and geography); each namespace stores its own copy. Empty namespaces are omitted.

DataObjectDescriptor

Each data object frame contains its own CBOR descriptor. This descriptor fully describes how to decode the payload — its type, shape, encoding pipeline, and optional per-object metadata. It lives inside the data object frame (not in a central metadata block).

| Key | Type | Required | Description |
|---|---|---|---|
| type | text | Yes | Object type, e.g. "ntensor" (Rust field: obj_type) |
| ndim | uint | Yes | Number of dimensions |
| shape | array of uint | Yes | Size of each dimension |
| strides | array of uint | Yes | Element stride per dimension |
| dtype | text | Yes | Data type string (see Data Types) |
| byte_order | text | Yes | "big" or "little" |
| encoding | text | Yes | "none" or "simple_packing" |
| filter | text | Yes | "none" or "shuffle" |
| compression | text | Yes | "none", "szip", "zstd", "lz4", "blosc2", "zfp", or "sz3" |
| hash | map | No | Integrity hash of the payload (see below) |
| masks | map | No | NaN / Inf bitmask companion descriptors (see below) |
| encoding params | various | Conditional | Required when encoding != "none" |
| filter params | various | Conditional | Required when filter != "none" |
| compression params | various | Conditional | Required when compression != "none" |
| any other key | any | No | Per-object encoding parameters |

Example: Temperature Field Descriptor

Here is what a descriptor might look like for a global temperature field at 0.25-degree resolution, compressed with zstd:

{
  "type": "ntensor",
  "ndim": 2,
  "shape": [721, 1440],
  "strides": [1440, 1],
  "dtype": "float32",
  "byte_order": "little",
  "encoding": "simple_packing",
  "reference_value": 193.72,
  "binary_scale_factor": -16,
  "decimal_scale_factor": 0,
  "bits_per_value": 16,
  "filter": "none",
  "compression": "zstd",
  "zstd_level": 3,
  "hash": {
    "type": "xxh3",
    "value": "a1b2c3d4e5f60718"
  }
}

The params field in DataObjectDescriptor holds encoding-pipeline parameters only (e.g. reference_value, bits_per_value). MARS keys and other application metadata are stored in the global metadata, under base[i]["mars"].

Encoding Parameters (simple_packing)

| Key | Type | Description |
|---|---|---|
| reference_value | float | Minimum value in the original data |
| binary_scale_factor | int | Power-of-2 scaling factor |
| decimal_scale_factor | int | Power-of-10 scaling factor |
| bits_per_value | uint | Number of bits per packed value (1-64) |
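
To show how these four parameters fit together, the sketch below reconstructs values from packed integers using the GRIB-style simple packing formula, value = (R + X * 2^E) / 10^D. That Tensogram uses exactly this formula is an assumption made for illustration, and unpack_simple is a hypothetical name, not a library function.

```python
def unpack_simple(packed, reference_value, binary_scale_factor, decimal_scale_factor):
    """Decode packed unsigned integers, assuming the GRIB-style formula
    value = (R + X * 2**E) / 10**D. Illustrative only.
    """
    binary_scale = 2.0 ** binary_scale_factor
    decimal_scale = 10.0 ** decimal_scale_factor
    return [(reference_value + x * binary_scale) / decimal_scale for x in packed]


# Parameters taken from the temperature descriptor example.
values = unpack_simple([0, 32768], reference_value=193.72,
                       binary_scale_factor=-16, decimal_scale_factor=0)
print(values)  # approximately [193.72, 194.22]
```

A packed value of 0 maps back to reference_value, which matches its description as the minimum of the original data.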

Filter Parameters (shuffle)

| Key | Type | Description |
|---|---|---|
| shuffle_element_size | uint | Byte width of each element (e.g., 4 for float32) |
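
For readers unfamiliar with byte shuffling, here is a minimal sketch of the usual technique (as found in HDF5 and Blosc): byte 0 of every element is grouped together, then byte 1, and so on, which tends to help the downstream compressor. Whether Tensogram's shuffle filter uses exactly this layout is an assumption; the function names are hypothetical.

```python
def shuffle(data: bytes, element_size: int) -> bytes:
    """Group the k-th byte of every element together (generic byte shuffle)."""
    if element_size <= 0 or len(data) % element_size:
        raise ValueError("data length must be a multiple of element_size")
    return b"".join(data[k::element_size] for k in range(element_size))


def unshuffle(data: bytes, element_size: int) -> bytes:
    """Inverse of shuffle(): reassemble each element from its byte planes."""
    if element_size <= 0 or len(data) % element_size:
        raise ValueError("data length must be a multiple of element_size")
    count = len(data) // element_size  # number of elements
    return b"".join(data[j::count] for j in range(count))


raw = bytes([1, 2, 3, 4, 5, 6, 7, 8])  # two 4-byte elements
mixed = shuffle(raw, 4)
print(list(mixed))                      # [1, 5, 2, 6, 3, 7, 4, 8]
assert unshuffle(mixed, 4) == raw
```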

Compression Parameters

szip:

| Key | Type | Description |
|---|---|---|
| szip_rsi | uint | Reference sample interval |
| szip_block_size | uint | Block size (typically 8 or 16) |
| szip_flags | uint | AEC encoding flags |
| szip_block_offsets | array of uint | Bit offsets of RSI block boundaries (computed by the library or provided via encode_pre_encoded, see Pre-encoded Payloads) |

zstd:

| Key | Type | Default | Description |
|---|---|---|---|
| zstd_level | int | 3 | Compression level (1-22) |

lz4: No additional parameters required.

blosc2:

| Key | Type | Default | Description |
|---|---|---|---|
| blosc2_codec | text | "lz4" | Internal codec: blosclz, lz4, lz4hc, zlib, zstd |
| blosc2_clevel | int | 5 | Compression level (0-9) |
| blosc2_typesize | uint | (auto) | Element byte width for shuffle optimization |

zfp:

| Key | Type | Description |
|---|---|---|
| zfp_mode | text | "fixed_rate", "fixed_precision", or "fixed_accuracy" |
| zfp_rate | float | Bits per value (only for fixed_rate) |
| zfp_precision | uint | Bit planes to keep (only for fixed_precision) |
| zfp_tolerance | float | Max absolute error (only for fixed_accuracy) |

sz3:

| Key | Type | Description |
|---|---|---|
| sz3_error_bound_mode | text | "abs", "rel", or "psnr" |
| sz3_error_bound | float | Error bound value |

Hash Descriptor

The optional hash field records an integrity digest of the raw payload bytes.

| Key | Type | Description |
|---|---|---|
| type | text | "xxh3" |
| value | text | Hex-encoded digest |

NaN / Inf mask companion (masks)

When the object was encoded with allow_nan=true and/or allow_inf=true AND the payload actually contained at least one matching non-finite value, the descriptor carries a masks sub-map. Each kind (nan, inf+, inf-) is independently optional — only the kinds that appeared are present.

{
  ... standard DataObjectDescriptor fields ...,
  "masks": {
    "nan": {
      "method": "roaring",
      "offset": 40,
      "length": 12
    },
    "inf+": {
      "method": "rle",
      "offset": 52,
      "length": 3
    }
  }
}

Each entry:

| Key | Type | Description |
|---|---|---|
| method | text | One of "none", "rle", "roaring", "blosc2", "zstd", or "lz4": the compression method actually used (may differ from the requested method due to the small-mask auto-fallback) |
| offset | uint | Byte offset of the mask blob from the start of the payload region (= first byte after the 16-byte frame header) |
| length | uint | Byte length of the mask blob on disk |
| params | map | Optional method-specific parameters (e.g. {"level": 3} for zstd, {"codec": "lz4", "level": 5} for blosc2) |

Canonical key order for masks is the byte-lex sort inf+ < inf- < nan. The encoder writes mask blobs between the payload and the CBOR descriptor in the same canonical order. See NaN / Inf Handling for the encode / decode semantics.
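
The offset and length fields are enough to locate each mask blob inside the payload region. The sketch below slices the blobs out of an in-memory payload region; it is a hypothetical helper, assumes the caller has already skipped the 16-byte frame header, and does not decompress the blobs (that depends on each entry's method).

```python
def slice_mask_blobs(payload_region: bytes, masks: dict) -> dict:
    """Return the still-compressed mask blob for each kind present in masks.

    Hypothetical helper; offsets are relative to the start of payload_region.
    """
    blobs = {}
    for kind, entry in masks.items():
        start = entry["offset"]
        end = start + entry["length"]
        if end > len(payload_region):
            raise ValueError(f"mask {kind!r} extends past the payload region")
        blobs[kind] = payload_region[start:end]
    return blobs


masks = {
    "nan": {"method": "roaring", "offset": 40, "length": 12},
    "inf+": {"method": "rle", "offset": 52, "length": 3},
}
region = bytes(range(60))  # stand-in for a real payload region
blobs = slice_mask_blobs(region, masks)
print(len(blobs["nan"]), len(blobs["inf+"]))  # 12 3
```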

IndexFrame

Index frames (header or footer) contain a CBOR map that lets readers jump directly to any data object without scanning.

| Key | Type | Description |
|---|---|---|
| object_count | uint | Number of data objects in the message |
| offsets | array of uint | Byte offset of each data object frame from message start |
| lengths | array of uint | Byte length of each data object frame |

Example IndexFrame

{
  "object_count": 3,
  "offsets": [256, 1048832, 2097408],
  "lengths": [1048576, 1048576, 524288]
}

The offsets array gives O(1) random access to any object — seek to offsets[i] and read lengths[i] bytes.
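
The seek-and-read pattern described above can be sketched as follows. The helper name is hypothetical, and the sketch assumes the stream's position 0 is the start of the message and that the index map has already been decoded from CBOR into a Python dict.

```python
import io


def read_object_frame(stream, index: dict, i: int) -> bytes:
    """Read the raw bytes of data object frame i using a decoded IndexFrame.

    Assumption: stream offset 0 corresponds to the start of the message.
    """
    count = index["object_count"]
    if len(index["offsets"]) != count or len(index["lengths"]) != count:
        raise ValueError("offsets/lengths do not match object_count")
    if not 0 <= i < count:
        raise IndexError(f"object {i} out of range (object_count={count})")
    stream.seek(index["offsets"][i])
    frame = stream.read(index["lengths"][i])
    if len(frame) != index["lengths"][i]:
        raise EOFError("message truncated before end of frame")
    return frame


# Tiny stand-in message: 16 bytes, two "frames" at offsets 4 and 10.
message = io.BytesIO(bytes(range(16)))
index = {"object_count": 2, "offsets": [4, 10], "lengths": [6, 3]}
print(list(read_object_frame(message, index, 1)))  # [10, 11, 12]
```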

HashFrame

Hash frames (header or footer) store per-object integrity hashes, allowing verification without reading the individual descriptors.

| Key | Type | Description |
|---|---|---|
| object_count | uint | Number of data objects |
| hash_type | text | Hash algorithm: "xxh3" |
| hashes | array of text | Hex-encoded digest for each object, in order |

Example HashFrame

{
  "object_count": 3,
  "hash_type": "xxh3",
  "hashes": [
    "a1b2c3d4e5f60718",
    "b2c3d4e5f6071829",
    "c3d4e5f60718293a"
  ]
}

Canonical Encoding

All CBOR maps are encoded with keys sorted by the byte representation of their CBOR-encoded key (RFC 8949 section 4.2). This sorting is applied recursively — nested maps are also sorted.

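For text-string keys (the common case), the encoded key begins with its length, so shorter keys sort before longer ones and keys of equal length sort bytewise. For example, "base" sorts before "_extra_" even though "_" precedes "b" in plain byte order. For non-string keys, the CBOR byte encoding likewise determines the order.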

Why does this matter? If you hash an entire message or compare messages by digest, deterministic encoding ensures that logically identical messages produce identical bytes even if the keys were inserted in different order during construction.
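
As an illustration of the RFC 8949 section 4.2 rule, the sketch below encodes text keys as CBOR text strings by hand and sorts them by those bytes. It covers text-string keys only and assumes the encoder follows the RFC ordering literally; the function names are hypothetical and no CBOR library is involved.

```python
def encode_text_key(key: str) -> bytes:
    """CBOR encoding of a text string (major type 3) with the shortest head."""
    data = key.encode("utf-8")
    n = len(data)
    if n < 24:
        head = bytes([0x60 | n])
    elif n < 0x100:
        head = bytes([0x78, n])
    elif n < 0x10000:
        head = b"\x79" + n.to_bytes(2, "big")
    elif n < 0x100000000:
        head = b"\x7a" + n.to_bytes(4, "big")
    else:
        head = b"\x7b" + n.to_bytes(8, "big")
    return head + data


def canonical_key_order(keys):
    """Sort keys by the bytewise order of their CBOR encodings."""
    return sorted(keys, key=encode_text_key)


print(canonical_key_order(["version", "base", "_reserved_", "_extra_"]))
# ['base', '_extra_', 'version', '_reserved_']
```

Because the length is part of the encoded head, the result is length-first: the 4-byte key comes first, the two 7-byte keys are ordered bytewise, and the 10-byte key comes last.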