
Introduction

Tensogram is a binary message format for N-dimensional scientific tensors — the kind of data that appears in weather and climate forecasting, Earth observation, medical and microscopy imaging, genomics, particle physics, materials simulation, and machine-learning pipelines. It carries its own metadata, supports arbitrary tensor dimensions, and is fast to encode and decode.

What Tensogram gives you

  • Self-describing messages. Every message carries the metadata needed to decode it — shape, dtype, encoding pipeline, application annotations — using CBOR. No external schema required.
  • Any number of dimensions. A single message can carry multiple tensors, each with its own shape, dtype, and encoding. A 3-D spectrum, a 2-D field, and a 4-D ensemble tensor can coexist in one message.
  • Vocabulary-agnostic. The library never interprets metadata keys. Application layers (MARS at ECMWF, CF in climate, BIDS in neuroimaging, your in-house taxonomy) own key names.
  • Transport and file in one format. The same bytes that traverse a socket can be appended to a .tgm file; both support O(1) random access to any object.
  • Interop with existing formats. Importers for GRIB and NetCDF let you bring existing data into Tensogram pipelines without a lossy re-modelling step.
  • Partial range decode. Extract sub-tensor slices without decoding the whole object — useful for remote data at scale.

Tensogram is developed and maintained by ECMWF and is used in operational weather-forecasting workloads, but nothing in the format is weather-specific. The design targets the N-tensor-at-scale problem common to many scientific domains.

Crate Layout

Four primary Rust crates make up the default workspace build:

tensogram/
├── rust/
│   ├── tensogram             ← encode, decode, framing, file API,
│   │                           validation, remote object store
│   ├── tensogram-encodings   ← simple_packing, shuffle, compression
│   ├── tensogram-cli         ← `tensogram` command-line tool
│   └── tensogram-ffi         ← C FFI layer for C/C++ callers
├── python/
│   └── bindings/             ← Python bindings (PyO3 / maturin)
└── cpp/
    └── include/              ← C++ wrapper header + C header
On top of those, the repository ships several opt-in crates: the tensogram-grib / tensogram-netcdf importers (exposed as the convert-grib / convert-netcdf CLI subcommands); the tensogram-wasm WebAssembly bindings; the pure-Rust tensogram-szip, tensogram-sz3, and tensogram-sz3-sys compression crates; the separate Python packages tensogram-xarray (xarray backend) and tensogram-zarr (Zarr v3 store backend); and a tensogram-benchmarks crate. See plans/ARCHITECTURE.md for the full crate list and build recipes.

Most users interact with tensogram and the CLI. The encodings crate is used internally by the core but is also importable directly if you need to call the encoding functions outside of a full message.

Installation

Rust:

cargo add tensogram

Python:

pip install tensogram          # core
pip install tensogram[all]     # with xarray + zarr backends

CLI:

cargo install tensogram-cli

See the Quick Start for feature flags, optional dependencies, and detailed setup.

Quick Example

#![allow(unused)]
fn main() {
use std::collections::BTreeMap;
use tensogram::{
    encode, decode, GlobalMetadata, DataObjectDescriptor,
    ByteOrder, Dtype, EncodeOptions, DecodeOptions,
};

// Describe what you're storing: a 100×200 grid of f32 values
let desc = DataObjectDescriptor {
    obj_type: "ntensor".to_string(),
    ndim: 2,
    shape: vec![100, 200],
    strides: vec![200, 1],
    dtype: Dtype::Float32,
    byte_order: ByteOrder::Big,
    encoding: "none".to_string(),
    filter: "none".to_string(),
    compression: "none".to_string(),
    params: BTreeMap::new(),
    hash: None,
};

let global_meta = GlobalMetadata {
    version: 2,
    ..Default::default()
};

// Your raw bytes (100 × 200 × 4 bytes = 80,000 bytes)
let data = vec![0u8; 100 * 200 * 4];

// Encode into a self-contained message
let message = encode(&global_meta, &[(&desc, &data)], &EncodeOptions::default()).unwrap();

// Decode it back
let (meta, objects) = decode(&message, &DecodeOptions::default()).unwrap();
assert_eq!(objects[0].0.shape, vec![100, 200]);
assert_eq!(objects[0].1, data);
}

The message bytes can be written to a file, sent over a socket, or stored in a database. The receiver does not need any external schema — everything is self-describing.

What is a Message?

A Tensogram message is a single, self-contained binary blob. It carries:

  1. A Preamble – fixed-size header with magic bytes, version, flags, and total length
  2. Optional header frames – metadata, index, and hash frames for fast random access
  3. One or more data object frames – each containing a CBOR descriptor and the actual tensor bytes
  4. Optional footer frames – metadata, index, and hash frames (used in streaming mode)
  5. A Postamble – footer offset and terminator magic

Every message begins with the ASCII string TENSOGRM and ends with 39277777. This makes it trivial to find message boundaries even in a file containing hundreds of concatenated messages.

Structure at a Glance

block-beta
    columns 1
    A["PREAMBLE (24 bytes)\nTENSOGRM · version · flags · total_length"]
    B["Header Metadata Frame (optional)\nCBOR GlobalMetadata"]
    C["Header Index Frame (optional)\nobject count + offsets"]
    D["Header Hash Frame (optional)\nobject count + hash type + hashes"]
    E["Data Object Frame 0\nCBOR descriptor + payload bytes"]
    F["Data Object Frame 1 (if present)\nCBOR descriptor + payload bytes"]
    G["... (more data object frames)"]
    H["Footer Hash / Index / Metadata Frames (optional)"]
    I["POSTAMBLE (16 bytes)\nfirst_footer_offset · 39277777"]

Frame-Based Design

The v2 wire format is entirely frame-based. Every piece of data between the Preamble and Postamble is wrapped in a frame. Each frame starts with a 4-byte marker (FR + a uint16 frame type), a version, flags, and a length field. This uniform structure means a decoder can skip any frame it does not understand by jumping over its declared length.

Frame types:

Type ID   Name                      Location
1         Header Metadata Frame     Header
2         Header Index Frame        Header
3         Header Hash Frame         Header
4         Data Object Frame         Body
5         Footer Hash Frame         Footer
6         Footer Index Frame        Footer
7         Footer Metadata Frame     Footer
8         Preceder Metadata Frame   Body (before a Data Object)
Padding between frames is allowed (from ENDF to the next FR marker) for 64-bit memory alignment.
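
The skip-unknown-frames logic is small enough to sketch in plain Rust. This is a sketch against the 16-byte frame-header layout detailed in the Wire Format chapter; `skip_frame` is a hypothetical helper, not a library API:

```rust
/// Hypothetical helper: given a buffer positioned at a frame's "FR" marker,
/// return the offset just past that frame using only the generic header.
fn skip_frame(buf: &[u8], pos: usize) -> Option<usize> {
    let hdr = buf.get(pos..pos + 16)?;
    if &hdr[0..2] != b"FR" {
        return None; // not a frame start
    }
    // Bytes 8..16 hold the frame length (uint64 BE): the offset from the
    // start of the frame to its end, so unknown types can be jumped over.
    let len = u64::from_be_bytes(hdr[8..16].try_into().ok()?) as usize;
    pos.checked_add(len).filter(|&end| end <= buf.len())
}

fn main() {
    // A fabricated 20-byte frame: "FR", type 99, version 1, flags 0, length 20.
    let mut frame = Vec::new();
    frame.extend_from_slice(b"FR");
    frame.extend_from_slice(&99u16.to_be_bytes());
    frame.extend_from_slice(&1u16.to_be_bytes());
    frame.extend_from_slice(&0u16.to_be_bytes());
    frame.extend_from_slice(&20u64.to_be_bytes());
    frame.extend_from_slice(&[0u8; 4]); // stand-in body + tail
    assert_eq!(skip_frame(&frame, 0), Some(20));
}
```

A real decoder would additionally verify the frame tail and, because padding is permitted, scan forward to the next FR marker rather than assume the next frame starts immediately.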

Why Header Frames?

When a message is encoded in a single buffer (the common case), the index and hash frames are placed in the header, right after the Preamble. A decoder reads the Preamble, then the metadata frame, then the index frame, and can immediately seek to any data object by offset. That is O(1) random access, which matters when a message carries many large tensors.

Streaming Support

When encoding in streaming mode, the producer may not know in advance how many data objects the message will contain. In this case:

  • total_length in the Preamble is set to 0 (unknown)
  • Index and hash frames are written in the footer instead of the header
  • The Postamble’s first_footer_offset field points back to where the footer frames begin

A decoder reading a streamed message seeks to the end, reads the Postamble, then jumps to the footer frames to find the index. Both paths (header index and footer index) give O(1) access to any object.
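
The footer path can be sketched in a few lines, assuming the 16-byte postamble layout shown above: a u64 BE first_footer_offset followed by the 8-byte ASCII terminator 39277777 (a sketch, not the library's reader):

```rust
/// Sketch: find where the footer frames of a streamed message begin by
/// reading the 16-byte postamble at the end of the buffer.
fn first_footer_offset(message: &[u8]) -> Option<u64> {
    if message.len() < 16 {
        return None;
    }
    let tail = &message[message.len() - 16..];
    if &tail[8..16] != b"39277777" {
        return None; // terminator magic missing
    }
    Some(u64::from_be_bytes(tail[0..8].try_into().ok()?))
}

fn main() {
    let mut msg = vec![0u8; 40];                 // stand-in for preamble + frames
    msg.extend_from_slice(&24u64.to_be_bytes()); // footer frames start at byte 24
    msg.extend_from_slice(b"39277777");
    assert_eq!(first_footer_offset(&msg), Some(24));
}
```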

Data Object Frames

Each data object is self-contained in its own frame. The frame carries:

  • A CBOR descriptor (DataObjectDescriptor) describing the tensor shape, dtype, encoding pipeline, and optional hash
  • The binary payload (the actual encoded tensor bytes)

The CBOR descriptor can appear before or after the payload within the frame. By default it is placed after the payload, since some encoding parameters (like hash values) are only known after the payload has been written. A flag in the frame header indicates the position.

Messages vs Files

A .tgm file is just a sequence of messages written one after another:

[message 1][message 2][message 3]...

There is no file-level index or header. The TensogramFile API scans the file once (lazily, on first access) and builds an in-memory list of (offset, length) pairs for each message. After that, reading any message is a seek + read – no scan needed.

To find message boundaries in a file:

  1. Scan for TENSOGRM magic (8 bytes)
  2. If total_length is non-zero, use it to advance to the next message
  3. Otherwise, walk frames using their length fields until the next magic or EOF
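
Step 1 can be sketched with a naive scan. This is illustrative only; a real scanner follows steps 2 and 3 and jumps by total_length or frame lengths, since raw payload bytes could coincidentally contain the magic:

```rust
/// Naive sketch of step 1: collect every offset at which the 8-byte
/// "TENSOGRM" magic appears in a buffer of concatenated messages.
fn message_starts(file: &[u8]) -> Vec<usize> {
    const MAGIC: &[u8] = b"TENSOGRM";
    file.windows(MAGIC.len())
        .enumerate()
        .filter_map(|(i, w)| (w == MAGIC).then_some(i))
        .collect()
}

fn main() {
    let mut file = Vec::new();
    file.extend_from_slice(b"TENSOGRM"); // message 1 starts at 0
    file.extend_from_slice(&[0u8; 16]);  // fake frames + postamble
    file.extend_from_slice(b"TENSOGRM"); // message 2 starts at 24
    file.extend_from_slice(&[0u8; 8]);
    assert_eq!(message_starts(&file), vec![0, 24]);
}
```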

Self-Description

Every message carries all the information needed to decode it:

  • The dtype of every object (float32, int16, etc.)
  • The shape and strides (dimensions and memory layout)
  • The full encoding pipeline applied to the payload (encoding, filter, compression)
  • The byte order of each object’s data
  • Any application-level metadata (MARS keys, units, timestamps, etc.)

This means a decoder never needs an external schema. You can receive a Tensogram message on a new machine, years after it was encoded, and decode it correctly.

Edge Case: Zero-Object Messages

A message with no data object frames is valid. It contains only the Preamble, a metadata frame, and the Postamble. This is useful for sending pure metadata (e.g. a control message or an acknowledgement with provenance information) without any tensor payload.

#![allow(unused)]
fn main() {
use tensogram::{encode, EncodeOptions, GlobalMetadata};

let metadata = GlobalMetadata {
    version: 2,
    ..Default::default()
};
let msg = encode(&metadata, &[], &EncodeOptions::default()).unwrap();
}

Metadata

Metadata in Tensogram is stored as CBOR – Concise Binary Object Representation (RFC 8949). Think of it as a compact, binary version of JSON. It supports the same types (strings, integers, floats, booleans, arrays, maps), but is smaller and faster to parse.

Metadata Locations

In v2, metadata lives in two distinct places:

Level        Where it lives                             What it contains
Global       Header or footer metadata frame            GlobalMetadata: version + base (per-object metadata array) + _reserved_ (library internals) + _extra_ (client annotations)
Per-object   Each data object frame’s CBOR descriptor   DataObjectDescriptor: tensor shape, encoding pipeline, hash, plus params for encoding parameters

Each data object carries its own descriptor inline within its frame.

GlobalMetadata

The global metadata frame contains a GlobalMetadata struct with three named sections:

#![allow(unused)]
fn main() {
let meta = GlobalMetadata {
    version: 2,
    base: Vec::new(),              // one BTreeMap per data object (independent entries)
    reserved: BTreeMap::new(),     // library internals (_reserved_ in CBOR)
    extra: BTreeMap::new(),        // client-writable catch-all (_extra_ in CBOR)
};
}

In CBOR, this looks like (using ECMWF MARS keys as one concrete example vocabulary):

{
  "version": 2,
  "base": [
    {
      "mars": {
        "class": "od", "type": "fc",
        "date": "20260401", "time": "1200", "param": "2t"
      }
    }
  ],
  "_extra_": {
    "source": "ifs-cycle49r2"
  }
}

The same mechanism works for any application vocabulary. A neuroimaging pipeline might use a BIDS namespace:

{
  "version": 2,
  "base": [{
    "bids": { "subject": "sub-01", "session": "ses-01",
              "task": "rest", "run": 1 }
  }]
}

A materials-simulation pipeline might use a custom namespace:

{
  "version": 2,
  "base": [{
    "material": { "composition": "Fe3O4", "lattice": "cubic", "T_K": 300.0 }
  }]
}

The library does not know or care which vocabulary is used — it simply stores, serialises, and returns the keys you supply.

The version field is required (u16). The base array holds per-object metadata. _extra_ is a free-form catch-all – you can add any key using any CBOR value type. The library does not interpret or validate these keys. Your application layer assigns meaning.

Per-Object Metadata in base

The base section is a CBOR array of maps — one entry per data object. Each entry holds ALL structured metadata for that object independently. Entries are self-contained — there is no tracking of which keys are common across objects.

The encoder auto-populates _reserved_.tensor (with ndim, shape, strides, dtype) in each entry when you call encode() or StreamingEncoder::finish(). Application keys are preserved:

{
  "base": [
    {
      "mars": { "class": "od", "type": "fc", "param": "2t", "levtype": "sfc" },
      "_reserved_": {
        "tensor": { "ndim": 2, "shape": [721, 1440], "strides": [1440, 1], "dtype": "float64" }
      }
    },
    {
      "mars": { "class": "od", "type": "fc", "param": "10u", "levtype": "sfc" },
      "_reserved_": {
        "tensor": { "ndim": 2, "shape": [721, 1440], "strides": [1440, 1], "dtype": "float64" }
      }
    }
  ]
}

This lets readers discover the shape, type, and per-object metadata of every object by reading only the global metadata frame — without opening each data object frame.

No common/varying split: Every base[i] entry is self-contained. MARS keys shared across all objects (e.g. class, type) are simply repeated in each entry. If you need to extract commonalities (e.g. for display or merges), use the compute_common() utility after decoding.
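
The extraction itself is simple. A hand-rolled sketch over plain string maps — illustrative only, not the library's compute_common(), which operates on CBOR values:

```rust
use std::collections::BTreeMap;

/// Illustrative sketch: keep only the key/value pairs that appear with the
/// same value in every entry of a per-object metadata array.
fn common_keys(entries: &[BTreeMap<String, String>]) -> BTreeMap<String, String> {
    let mut common = BTreeMap::new();
    let Some((first, rest)) = entries.split_first() else {
        return common; // no entries, nothing in common
    };
    for (k, v) in first {
        if rest.iter().all(|e| e.get(k) == Some(v)) {
            common.insert(k.clone(), v.clone());
        }
    }
    common
}

fn main() {
    let a = BTreeMap::from([
        ("class".to_string(), "od".to_string()),
        ("param".to_string(), "2t".to_string()),
    ]);
    let b = BTreeMap::from([
        ("class".to_string(), "od".to_string()),
        ("param".to_string(), "10u".to_string()),
    ]);
    let common = common_keys(&[a, b]);
    assert_eq!(common.get("class").map(String::as_str), Some("od"));
    assert!(!common.contains_key("param")); // varies per object
}
```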

DataObjectDescriptor

The params field of each DataObjectDescriptor is a BTreeMap<String, ciborium::Value> for encoding parameters only (e.g. reference_value, bits_per_value). These are flattened into the CBOR descriptor alongside the fixed tensor fields.

For example, a data object’s CBOR descriptor might look like:

{
  "type": "ntensor",
  "ndim": 2,
  "shape": [721, 1440],
  "strides": [1440, 1],
  "dtype": "float32",
  "byte_order": "big",
  "encoding": "simple_packing",
  "filter": "none",
  "compression": "szip",
  "reference_value": 230.5,
  "bits_per_value": 16,
  "hash": { "type": "xxh3", "value": "a1b2c3d4e5f6..." }
}

Here, reference_value and bits_per_value live in the params map. Application metadata such as MARS keys belongs in base[i]["mars"] in the global metadata.

Namespaced Keys

Convention: application-layer keys are grouped under a namespace key, so that multiple vocabularies can coexist in the same message. For example, ECMWF’s MARS vocabulary lives under "mars":

{
  "version": 2,
  "base": [
    {
      "mars": {
        "class": "od", "type": "fc",
        "param": "2t", "date": "20260401", "step": 6
      }
    }
  ]
}

Other pipelines use other namespaces — "cf" for CF conventions, "bids" for neuroimaging, "dicom" for medical imaging, or anything your application defines. This convention applies at both levels — global metadata and per-object params.

Filtering with the CLI

The -w flag on ls, dump, get, and copy uses dot-notation to filter messages on any namespace. The examples below use the MARS vocabulary, but the same syntax works with any application namespace (e.g. bids.subject, dicom.Modality, product.name):

# Only messages where mars.param equals "2t" or "10u"
tensogram ls data.tgm -w "mars.param=2t/10u"

# Exclude messages where mars.class equals "od"
tensogram ls data.tgm -w "mars.class!=od"

The / character separates OR values. Key lookup searches base[i] entries first (skipping _reserved_, first match across entries), then _extra_ for backwards compatibility.

Preceder Metadata Frames

In streaming mode, per-object metadata is normally only available in the footer metadata frame (written after all objects). A Preceder Metadata Frame (frame type 8) allows producers to send per-object metadata before the data object, without waiting for the footer.

A preceder carries a GlobalMetadata CBOR with a single-entry base array for the next data object:

{
  "version": 2,
  "base": [{"product": {"name": "temperature"}, "units": "K"}]
}

Merge rule: On decode, preceder keys override footer base[i] keys on conflict. Structural keys auto-populated by the encoder (in _reserved_.tensor: ndim, shape, strides, dtype) are preserved from the footer when absent from the preceder. The consumer sees a unified GlobalMetadata.base — the preceder/footer distinction is transparent.

Use StreamingEncoder::write_preceder() before write_object() to emit a preceder frame. Preceders are optional per-object: some objects may have them, others may not.

Value Type Rules

Keys must be text strings. Values must be JSON-compatible CBOR types: string, integer, float, boolean, null, array, or map. Byte strings, CBOR tags, undefined, and half-precision floats are not allowed. See Metadata Value Types for the full rules and rationale.

Deterministic Encoding

When Tensogram encodes metadata to CBOR, it sorts all map keys by their CBOR byte representation (RFC 8949 Section 4.2 canonical form). This guarantees that the same metadata always produces the same bytes, regardless of the order you inserted keys in your application code. This matters for hashing and reproducibility.

Edge case: Nested maps are also sorted recursively. Even metadata stored inside a CBOR map value (like the "mars" namespace) gets canonical ordering.
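
For text-string keys, this ordering can be mimicked without a CBOR library: the string length is part of the encoded head, so shorter keys sort before longer ones, and equal lengths compare bytewise. A toy sketch of that comparison (illustrative only, not the library's serialiser):

```rust
/// Toy model of RFC 8949 §4.2 map-key ordering for text-string keys:
/// shorter encoded keys sort first, then bytewise comparison.
fn canonical_sort(keys: &mut Vec<&str>) {
    keys.sort_by(|a, b| {
        a.len().cmp(&b.len()).then_with(|| a.as_bytes().cmp(b.as_bytes()))
    });
}

fn main() {
    let mut keys = vec!["version", "base", "_extra_", "_reserved_"];
    canonical_sort(&mut keys);
    // "base" is shortest; "_extra_" beats "version" bytewise ('_' = 0x5F < 'v' = 0x76).
    assert_eq!(keys, vec!["base", "_extra_", "version", "_reserved_"]);
}
```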

Objects and Dtypes

An object is one N-dimensional tensor inside a message. A message can carry multiple objects. In v2, each object is fully described by a single struct:

  • A DataObjectDescriptor carrying tensor metadata, encoding pipeline, and integrity hash – all in one place
  • The actual binary payload within the object’s frame

There is no separate “payload descriptor” array. The descriptor travels with the data inside the same frame.

DataObjectDescriptor

#![allow(unused)]
fn main() {
DataObjectDescriptor {
    // ── Tensor metadata ──
    obj_type: "ntensor",           // always "ntensor" for now
    ndim: 2,                       // number of dimensions
    shape: vec![100, 200],         // size of each dimension
    strides: vec![200, 1],         // elements to skip per dimension step
    dtype: Dtype::Float32,         // element type

    // ── Encoding pipeline ──
    byte_order: ByteOrder::Big,    // big or little endian
    encoding: "simple_packing",    // or "none"
    filter: "shuffle",             // or "none"
    compression: "szip",           // or "none", "zstd", "lz4", etc.

    // ── Flexible parameters (encoding only) ──
    params: BTreeMap::from([       // BTreeMap<String, ciborium::Value>
        ("reference_value".into(), ciborium::Value::Float(230.5)),
        ("bits_per_value".into(), ciborium::Value::Integer(16.into())),
    ]),

    // ── Integrity ──
    hash: Some(HashDescriptor {
        hash_type: "xxh3",
        value: "a1b2c3d4e5f6...",
    }),
}
}

The params map is flattened into the CBOR alongside the fixed fields, so the on-wire CBOR is a single flat map. This keeps things simple for decoders – no nested “encoding” or “tensor” sub-objects to navigate.

Each data object has its own descriptor, so different objects in the same message can use different encodings, byte orders, and hash algorithms.

Strides

Strides tell you how to navigate the memory layout. For a C-contiguous (row-major) array of shape [100, 200]:

  • Advancing along axis 0 (rows) skips 200 elements
  • Advancing along axis 1 (columns) skips 1 element

So strides = [200, 1]. For a Fortran-contiguous (column-major) array the strides would be reversed: [1, 100].

To compute C-contiguous strides from shape:

#![allow(unused)]
fn main() {
fn compute_strides(shape: &[u64]) -> Vec<u64> {
    let mut strides = vec![1u64; shape.len()];
    // saturating_sub avoids an underflow panic when shape is empty
    for i in (0..shape.len().saturating_sub(1)).rev() {
        strides[i] = strides[i + 1] * shape[i + 1];
    }
    strides
}
// shape [100, 200] → strides [200, 1]
// shape [4, 5, 6]  → strides [30, 6, 1]
}

Supported Data Types

Name         Size       Description
float16      2 bytes    IEEE 754 half-precision float
bfloat16     2 bytes    Brain float (truncated float32)
float32      4 bytes    IEEE 754 single-precision float
float64      8 bytes    IEEE 754 double-precision float
complex64    8 bytes    Two float32 (real + imag)
complex128   16 bytes   Two float64 (real + imag)
int8         1 byte     Signed integer
int16        2 bytes    Signed integer
int32        4 bytes    Signed integer
int64        8 bytes    Signed integer
uint8        1 byte     Unsigned integer
uint16       2 bytes    Unsigned integer
uint32       4 bytes    Unsigned integer
uint64       8 bytes    Unsigned integer
bitmask      < 1 byte   Packed bits (sub-byte; size depends on element count)

Edge case: bitmask returns 0 from byte_width(). Callers that need the actual byte count must compute it from the element count: (num_elements + 7) / 8.
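
In code, that rule is a one-liner (bitmask_bytes is a hypothetical helper name, not a library function):

```rust
/// Bytes needed to store `num_elements` packed bits: ceil(n / 8).
fn bitmask_bytes(num_elements: u64) -> u64 {
    (num_elements + 7) / 8
}

fn main() {
    assert_eq!(bitmask_bytes(0), 0);
    assert_eq!(bitmask_bytes(1), 1);
    assert_eq!(bitmask_bytes(8), 1);
    assert_eq!(bitmask_bytes(9), 2); // a ninth bit spills into a second byte
}
```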

Multiple Objects in One Message

A message can carry several related tensors. Two concrete examples:

  • A wave-spectrum message with the spectrum itself as a 3-tensor and a land/sea mask as a 2-tensor.
  • A medical-imaging message with a 4-D time-series volume, a 3-D segmentation mask, and a 1-D array of acquisition timestamps.
block-beta
    columns 3
    A["Object 0\nSpectrum\nf32 · 721×1440×30\nencoding: simple_packing"]:2
    B["Object 1\nLand mask\nuint8 · 721×1440\nencoding: none"]:1

All objects live in the same message. Each object has its own DataObjectDescriptor embedded in its frame and its own entry in GlobalMetadata.base holding per-object application metadata. Different objects can use completely different encoding pipelines.

Edge case: The number of DataObjectDescriptor entries and the data slices passed to encode() must be equal. The encoder returns an error if they do not match.

The Encoding Pipeline

Every object payload passes through a three-stage pipeline on the way in (encoding) and out (decoding). The stages always run in the same order:

flowchart TD
    subgraph Encode["Encode Path"]
        direction TB
        A["Raw bytes"]
        B["Stage 1 — Encoding
        (lossy quantization)"]
        C["Stage 2 — Filter
        (byte shuffle)"]
        D["Stage 3 — Compression
        (szip / zstd / lz4 / blosc2 / zfp / sz3)"]
        A --> B --> C --> D
    end

    S[("Stored bytes")]

    subgraph Decode["Decode Path"]
        direction TB
        F["Stage 3 — Decompress"]
        G["Stage 2 — Unshuffle"]
        H["Stage 1 — Dequantize"]
        I["Raw bytes"]
        F --> G --> H --> I
    end

    D --> S --> F

    style A fill:#e8f5e9,stroke:#388e3c
    style S fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    style I fill:#e8f5e9,stroke:#388e3c
    style Encode fill:#e3f2fd,stroke:#1565c0,color:#1565c0
    style Decode fill:#fce4ec,stroke:#c62828,color:#c62828

Each stage is independently configurable per object via fields in the DataObjectDescriptor. Set a stage to "none" to skip it. For callers with already-encoded payloads, a pipeline-bypass option exists via encode_pre_encoded (see Pre-encoded Payloads).

Stage 1: Encoding

Encoding transforms values to reduce the number of bits needed to represent them. The only supported encoding right now is simple_packing — a lossy quantisation that maps a bounded range of floating-point values onto N-bit integers. The bit layout matches GRIB 2 simple_packing so quantised payloads are interoperable with existing GRIB tooling.

Value              Meaning
"none"             Pass through unchanged
"simple_packing"   Lossy quantization (see Simple Packing)

Stage 2: Filter

Filters rearrange bytes to improve compression ratios. The shuffle filter reorders bytes by their significance level (all most-significant bytes first, then all second-most-significant bytes, etc.), which makes float data much more compressible because nearby values have similar high bytes.

Value       Meaning
"none"      Pass through unchanged
"shuffle"   Byte-level shuffle (see Byte Shuffle Filter)
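
The transform itself fits in a few lines. An illustrative sketch (not the library's implementation) that groups bytes by their position within each fixed-width element:

```rust
/// Illustrative byte shuffle: for elements of `width` bytes, gather all
/// first bytes, then all second bytes, and so on. The inverse
/// ("unshuffle") swaps the two index roles.
fn shuffle(data: &[u8], width: usize) -> Vec<u8> {
    assert!(width > 0 && data.len() % width == 0);
    let n = data.len() / width; // number of elements
    let mut out = vec![0u8; data.len()];
    for e in 0..n {
        for b in 0..width {
            out[b * n + e] = data[e * width + b];
        }
    }
    out
}

fn main() {
    // Two 3-byte "elements": [A0 A1 A2] and [B0 B1 B2].
    let shuffled = shuffle(&[0xA0, 0xA1, 0xA2, 0xB0, 0xB1, 0xB2], 3);
    // All first bytes, then second bytes, then third bytes:
    assert_eq!(shuffled, vec![0xA0, 0xB0, 0xA1, 0xB1, 0xA2, 0xB2]);
}
```

After the shuffle, the similar high bytes of nearby values sit next to each other, which is exactly what gives a downstream compressor longer runs to exploit.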

Stage 3: Compression

Compression reduces the total byte count. Seven compressors are implemented:

Value      Type           Random Access      Notes
"none"     Pass-through   Yes                No compression
"szip"     Lossless       Yes                CCSDS 121.0-B-3 via libaec
"zstd"     Lossless       No                 Excellent ratio/speed tradeoff
"lz4"      Lossless       No                 Fastest decompression
"blosc2"   Lossless       Yes                Multi-codec, chunk-level access
"zfp"      Lossy          Yes (fixed-rate)   Floating-point arrays
"sz3"      Lossy          No                 Error-bounded scientific data
See Compression for full details on each compressor, including parameters and random access support.

Note: ZFP and SZ3 operate directly on typed floating-point data. Use them with encoding: "none" and filter: "none" – they replace both encoding and compression.

Typical Combinations

Use case                       encoding         filter    compression
Exact integers (e.g. a mask)   none             none      none
Lossy bounded-range floats     simple_packing   none      szip
Best lossless (floats)         none             shuffle   szip or blosc2
GRIB 2 CCSDS-interoperable     simple_packing   none      szip
Real-time streaming            none             none      lz4
Archival storage               none             shuffle   zstd
ML model weights               none             none      blosc2
Lossy float w/ random access   none             none      zfp (fixed_rate)
Error-bounded science          none             none      sz3

How It Looks in Code

The entire pipeline is configured through the DataObjectDescriptor:

#![allow(unused)]
fn main() {
DataObjectDescriptor {
    obj_type: "ntensor".into(),
    ndim: 2,
    shape: vec![721, 1440],
    strides: vec![1440, 1],
    dtype: Dtype::Float32,
    byte_order: ByteOrder::Big,
    encoding: "simple_packing".into(),
    filter: "none".into(),
    compression: "szip".into(),
    params: BTreeMap::from([
        ("reference_value".into(), Value::Float(230.5)),
        ("bits_per_value".into(), Value::Integer(16.into())),
    ]),
    hash: None, // set automatically during encoding
}
}

All encoding parameters (reference_value, bits_per_value, szip_block_offsets, etc.) go into the params map. The encoder populates additional params during encoding (like block offsets for szip), and the decoder reads them back.

Integrity Hashing

After all three stages, the stored bytes can be hashed. The hash is stored in the DataObjectDescriptor’s hash field alongside the encoded bytes. On decode, if verify_hash: true is set, the hash is recomputed and compared.

Algorithm   Hash length             Notes
xxh3        16 hex chars (64-bit)   Default. Fast, non-cryptographic

Edge case: The hash covers the stored bytes (after encoding + filter + compression), not the original raw bytes. This means a hash mismatch always indicates storage or transmission corruption, not a quantization difference from lossy encoding.

Wire Format (v3)

This page describes the exact byte layout of a Tensogram v3 message — the format shipped in 0.17.0. You need this if you are implementing a reader in another language, debugging a corrupted file, or just want to understand what is happening under the hood. For the normative specification, see plans/WIRE_FORMAT.md.

All integer fields are big-endian (network byte order).

Overview

A Tensogram message is built from three sections: a header (preamble + optional frames), one or more data object frames, and a footer (optional frames + postamble).

┌────────────────────────────────────────────────────────────────────┐
│  PREAMBLE                  magic, version, flags, length  (24 B)   │
├────────────────────────────────────────────────────────────────────┤
│  HEADER METADATA FRAME     CBOR global metadata      (optional)    │
├────────────────────────────────────────────────────────────────────┤
│  HEADER INDEX FRAME        CBOR object offsets       (optional)    │
├────────────────────────────────────────────────────────────────────┤
│  HEADER HASH FRAME         CBOR object hashes        (optional)    │
├────────────────────────────────────────────────────────────────────┤
│  PRECEDER METADATA FRAME   per-object metadata       (optional)    │
│  DATA OBJECT FRAME 0       header + payload + descriptor           │
│  PRECEDER METADATA FRAME   per-object metadata       (optional)    │
│  DATA OBJECT FRAME 1       ...                                     │
│  DATA OBJECT FRAME 2       (no preceder)                           │
│  ...                       (any number of objects)                 │
├────────────────────────────────────────────────────────────────────┤
│  FOOTER HASH FRAME         CBOR object hashes        (optional)    │
├────────────────────────────────────────────────────────────────────┤
│  FOOTER INDEX FRAME        CBOR object offsets       (optional)    │
├────────────────────────────────────────────────────────────────────┤
│  FOOTER METADATA FRAME     CBOR global metadata      (optional)    │
├────────────────────────────────────────────────────────────────────┤
│  POSTAMBLE   first_footer_offset, total_length, end_magic  (24 B)  │
└────────────────────────────────────────────────────────────────────┘

At least one metadata frame (header or footer) must be present — messages cannot exist without metadata. Index and hash frames are optional but highly encouraged. By default, the encoder places them in the header when writing to a buffer, or in the footer when streaming.

Frame ordering: The decoder enforces that frames appear in order: header frames, then data object frames, then footer frames. A header frame appearing after a data object frame, or a data object frame appearing after a footer frame, is rejected as malformed.

Preamble (24 bytes)

The preamble is the fixed-size start of every message.

Offset  Size    Field
──────  ──────  ─────────────────────────────────
0       8       Magic: "TENSOGRM" (ASCII)
8       2       Version (uint16 BE) — must be 3 in v3
10      2       Flags (uint16 BE)
12      4       Reserved (uint32 BE) — set to zero
16      8       Total length (uint64 BE)

Total length is the byte count of the entire message from the first byte of the preamble to the last byte of the postamble. A value of zero means the encoder is in streaming mode — the total length was not known when the preamble was written.

Version compatibility. v3 decoders reject any preamble whose version field is not exactly 3. Older v1/v2 messages must be re-encoded.

Preamble flags

The flags field is a bitmask indicating which optional frames are present and, new in v3, whether inline per-frame hash slots are populated:

Bit   Flag                Meaning
0     HEADER_METADATA     A HeaderMetadata frame is present.
1     FOOTER_METADATA     A FooterMetadata frame is present.
2     HEADER_INDEX        A HeaderIndex frame is present.
3     FOOTER_INDEX        A FooterIndex frame is present.
4     HEADER_HASHES       A HeaderHash aggregate frame is present.
5     FOOTER_HASHES       A FooterHash aggregate frame is present.
6     PRECEDER_METADATA   At least one PrecederMetadata frame is present.
7     HASHES_PRESENT      Every frame’s inline hash slot is populated with a non-zero xxh3-64 digest (new in v3).

Unused flag bits must be set to zero.
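
A reader in any language only needs the table above to parse the fixed preamble. A plain-Rust sketch (parse_preamble is a hypothetical helper, not part of the library):

```rust
/// Sketch: parse the fixed 24-byte v3 preamble.
/// Field order: magic(8) · version(u16) · flags(u16) · reserved(u32) · total_length(u64).
fn parse_preamble(buf: &[u8]) -> Option<(u16, u16, u64)> {
    let p = buf.get(..24)?;
    if &p[0..8] != b"TENSOGRM" {
        return None; // wrong magic
    }
    let version = u16::from_be_bytes(p[8..10].try_into().ok()?);
    let flags = u16::from_be_bytes(p[10..12].try_into().ok()?);
    let total_length = u64::from_be_bytes(p[16..24].try_into().ok()?);
    Some((version, flags, total_length))
}

fn main() {
    let mut p = Vec::new();
    p.extend_from_slice(b"TENSOGRM");
    p.extend_from_slice(&3u16.to_be_bytes());           // version 3
    p.extend_from_slice(&0b0000_0101u16.to_be_bytes()); // HEADER_METADATA | HEADER_INDEX
    p.extend_from_slice(&0u32.to_be_bytes());           // reserved, must be zero
    p.extend_from_slice(&1024u64.to_be_bytes());        // total length
    let (version, flags, total) = parse_preamble(&p).unwrap();
    assert_eq!((version, total), (3, 1024));
    assert!(flags & (1 << 2) != 0); // HEADER_INDEX (bit 2) is set
}
```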

Frames

Every frame (header, footer, and data object) shares a common 16-byte frame header and ends with a type-specific footer whose last 12 bytes are always [hash u64][ENDF 4] (new in v3).

Frame header (16 bytes)

Offset  Size    Field
──────  ──────  ─────────────────────────────────
0       2       Start marker: "FR" (ASCII)
2       2       Frame type (uint16 BE)
4       2       Frame version (uint16 BE)
6       2       Reserved flags (uint16 BE)
8       8       Frame length — offset to end of frame (uint64 BE)

Frame versions are independent from the message version and from each other.

Every frame ends with this fixed-size tail:

Offset (from frame end)  Size    Field
───────────────────────  ──────  ─────────────────────────────────
-12                      8       hash (uint64 BE) — xxh3-64 digest of the frame body, or 0x0000000000000000 when HASHES_PRESENT = 0
-4                       4       End marker: "ENDF" (ASCII)

Data-object frames (type 9) have a larger 20-byte footer that adds an 8-byte cbor_offset field before the common tail.

Frame types

Type   Name                Contents
1      Header Metadata     CBOR global metadata map
2      Header Index        CBOR index of data object offsets
3      Header Hash         CBOR aggregate of per-object hashes
4      (reserved)          Occupied by the obsolete v2 NTensorFrame; any v3 decoder errors on read
5      Footer Hash         CBOR aggregate of per-object hashes
6      Footer Index        CBOR index of data object offsets
7      Footer Metadata     CBOR global metadata map
8      Preceder Metadata   Per-object CBOR metadata (see below)
9      NTensorFrame        Descriptor + payload + optional NaN / Inf bitmask companion sections (see NaN / Inf Handling)
The body phase of a v3 message carries one or more data-object frames. In v3 only NTensorFrame (type 9) is defined; future types can slot in at fresh unused numbers without bumping the wire version.

Padding between frames

It is valid to have padding bytes between a frame’s ENDF marker and the next frame’s FR marker. This allows encoders to align frame starts to 8-byte (64-bit) boundaries for memory-mapped access.
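The amount of padding an encoder needs is a simple modular computation. A minimal sketch (the helper name is illustrative, not crate API):

```rust
/// Bytes of padding needed so the next frame's "FR" marker starts
/// on an `align`-byte boundary (8 for 64-bit alignment).
fn padding_to_align(pos: u64, align: u64) -> u64 {
    (align - (pos % align)) % align
}
```

A decoder simply skips any bytes between an `ENDF` marker and the next `FR` marker, so the padding contents are irrelevant (zeros are conventional).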

Data Object Frames

A data object frame wraps one tensor’s payload together with its CBOR descriptor. v3 defines exactly one concrete data-object type, NTensorFrame (type 9). The descriptor can go either before or after the payload — flag bit 0 in the frame header controls this. The default is after, because when encoding the descriptor is sometimes only fully known once the payload has been written (e.g. after computing a hash or determining compressed size).

NTensorFrame (type 9) — v3 canonical layout

┌──────────────────────────────────────────────────────────────┐
│  FRAME HEADER       "FR" + type(9) + ver + flags + len (16 B)│
├──────────────────────────────────────────────────────────────┤
│  DATA PAYLOAD       raw or compressed bytes, NaN/Inf         │
│                     positions substituted with 0.0           │
├──────────────────────────────────────────────────────────────┤
│  mask_nan blob      OPTIONAL — compressed NaN position mask  │
├──────────────────────────────────────────────────────────────┤
│  mask_inf+ blob     OPTIONAL — compressed +Inf position mask │
├──────────────────────────────────────────────────────────────┤
│  mask_inf- blob     OPTIONAL — compressed -Inf position mask │
├──────────────────────────────────────────────────────────────┤
│  CBOR DESCRIPTOR    carries a top-level "masks" sub-map      │
│                     when any mask is present (see below)     │
├──────────────────────────────────────────────────────────────┤
│  cbor_offset (uint64 BE, 8 B)                                │
│  hash        (uint64 BE, 8 B)   xxh3-64 of body              │
│  "ENDF"      (4 B)                                           │
└──────────────────────────────────────────────────────────────┘

The data-object footer is 20 bytes: [cbor_offset u64] [hash u64][ENDF 4]. The cbor_offset field points at the CBOR descriptor’s start relative to the frame’s first byte. The inline hash slot carries the xxh3-64 of the frame body (everything between the 16-byte header and this 20-byte footer) when the message’s HASHES_PRESENT preamble flag is set; otherwise it is 0x0000000000000000.

Hash scope includes payload + masks + CBOR. It does NOT include the header, the cbor_offset field, the hash slot itself, or ENDF.
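Given a complete type-9 frame in memory, the footer fields and the hashed body span can be recovered from the fixed offsets above. A minimal sketch (illustrative helper, not the crate's API):

```rust
/// Split a data-object frame into (cbor_offset, hash, body).
/// The body is everything between the 16-byte header and the
/// 20-byte footer — exactly the span the inline xxh3-64 covers.
fn split_data_object(frame: &[u8]) -> Option<(u64, u64, &[u8])> {
    // Minimum size: 16-byte header + 20-byte footer.
    if frame.len() < 36 || &frame[frame.len() - 4..] != b"ENDF" {
        return None;
    }
    let f = frame.len() - 20; // footer start
    let cbor_offset = u64::from_be_bytes(frame[f..f + 8].try_into().ok()?);
    let hash = u64::from_be_bytes(frame[f + 8..f + 16].try_into().ok()?);
    Some((cbor_offset, hash, &frame[16..f]))
}
```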

The CBOR descriptor fully describes the data object: its type, shape, strides, data type, byte order, encoding pipeline, and optional per-object metadata. See the CBOR Metadata page for the schema.

See NaN / Inf Handling for the mask encode / decode semantics and the documented lossy-reconstruction caveat.

Preceder Metadata Frame

A Preceder Metadata Frame (type 8) optionally appears immediately before a Data Object Frame. It carries per-object metadata for the following data object, using the same GlobalMetadata CBOR format but with a single-entry base array.

Use case: Streaming producers that do not know ahead of time when the message will end can emit per-object metadata early via preceders, rather than waiting for the footer.

Ordering rules:

  • Must appear in the data objects phase (after headers, before footers).
  • Must be followed by exactly one Data Object Frame.
  • Two consecutive preceders without an intervening DataObject are invalid.
  • A dangling preceder (not followed by a DataObject) is invalid.
  • Preceders are optional per-object.

CBOR structure:

{
  "version": 2,
  "base": [{"mars": {"param": "2t"}, "units": "K"}]
}

Merge on decode: Preceder keys override footer base[i] keys on conflict. Footer-only keys (e.g., auto-populated _reserved_.tensor with ndim, shape, strides, dtype) are preserved. The consumer sees a unified GlobalMetadata.base — the preceder/footer distinction is transparent.

Postamble (16 bytes)

The postamble sits at the very end of every message.

Offset  Size    Field
──────  ──────  ─────────────────────────────────
0       8       first_footer_offset (uint64 BE)
8       8       End magic: "39277777" (ASCII)

first_footer_offset is the byte offset (from the start of the message) to the first footer frame. This is never zero:

  • If footer frames exist, it points to the start of the first one (e.g., the Footer Hash Frame).
  • If no footer frames exist, it points to the postamble itself.

This guarantee means a reader can always distinguish “no footer frames” from “footer at offset 0” without ambiguity.

The end magic 39277777 was chosen because it is unlikely to appear naturally in floating-point or integer data, making it useful as a corruption boundary detector.

Random Access Patterns

With a header index (most common)

When a message was written in non-streaming mode, the index is in the header. This is the fastest path — no seeking to the end required.

1. Read preamble (24 B) → check flags
2. Read header metadata frame → global context
3. Read header index frame → offsets[], lengths[]
4. Seek to offsets[N], read data object frame → decode

When a message was written in streaming mode, the encoder did not know the object count or offsets up front. The index lives in the footer.

1. Seek to end − 16, read postamble → first_footer_offset
2. Seek to first_footer_offset, scan footer frames → find index
3. Read footer index frame → offsets[], lengths[]
4. Seek to offsets[N], read data object frame → decode

Both paths give O(1) access to any data object by index. The object count is derived from offsets.len().
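Once the index is decoded, step 4 is a bounds-checked slice. A minimal sketch of the lookup, assuming the whole message is in memory (the helper name is illustrative, not crate API):

```rust
/// Return the bytes of data object `n`, given the decoded index.
fn object_bytes<'a>(
    msg: &'a [u8],
    offsets: &[u64],
    lengths: &[u64],
    n: usize,
) -> Option<&'a [u8]> {
    if offsets.len() != lengths.len() {
        return None; // a real decoder reports a MetadataError here
    }
    let start = *offsets.get(n)? as usize;
    let end = start.checked_add(*lengths.get(n)? as usize)?;
    msg.get(start..end)
}
```

With a file or remote store, the same `offsets[n]` / `lengths[n]` pair becomes a seek-plus-read or an HTTP range request.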

Scanning a Multi-Message File

Multiple messages can be concatenated into a single .tgm file. To find message boundaries:

  1. Scan forward for the TENSOGRM magic (8 bytes).
  2. Read total_length from the preamble.
    • If total_length is non-zero, advance by that many bytes to reach the next message.
    • If total_length is zero (streaming mode), use the header index frame length if present.
  3. If neither total length nor header index is available, walk frame-by-frame — each frame header contains a length field — until the next TENSOGRM magic or EOF.
  4. Verify the 39277777 end magic at the expected position to confirm message integrity.

flowchart TD
    A[Start of file] --> B{Find TENSOGRM?}
    B -- No --> Z[End of scan]
    B -- Yes --> C[Read total_length at +16]
    C --> D{total_length > 0?}
    D -- Yes --> E[Advance to offset + total_length]
    D -- No --> F[Walk frame-by-frame to next magic]
    E --> G[Verify 39277777 end magic]
    F --> G
    G -- Valid --> H[Record message]
    H --> B
    G -- Invalid --> I[Skip 1 byte, resume scan]
    I --> B

If the end magic does not match, the message is likely corrupt. The scanner skips one byte and resumes searching — this is the corruption recovery path.

A Note on CBOR

Frames that contain CBOR data (metadata, index, hash) use length-prefixed CBOR encoding — there are no explicit start/end markers within the CBOR stream itself. The CBOR decoder reads the first byte to determine the data type and item count, then consumes exactly that many bytes. The frame boundaries (the FR and ENDF markers) provide the outer containment.

All CBOR maps use deterministic encoding with canonical key ordering (RFC 8949 section 4.2). See CBOR Metadata for details.

CBOR Metadata Schema

Tensogram uses CBOR (Concise Binary Object Representation) for all structured metadata. There are four kinds of CBOR structures in a message, each living in its own frame:

  1. GlobalMetadata — in header or footer metadata frames
  2. DataObjectDescriptor — inside each data object frame
  3. IndexFrame — in header or footer index frames
  4. HashFrame — in header or footer hash frames

All CBOR maps use deterministic encoding with canonical key ordering per RFC 8949 section 4.2. Keys are sorted by the byte representation of their CBOR-encoded key, applied recursively to nested maps. This means the same metadata always produces the same bytes — important if you hash messages or compare them by digest.

GlobalMetadata

The global metadata frame contains a single CBOR map. The only required key is version; everything else is optional.

Key              Type           Required  Description
───────────────  ─────────────  ────────  ──────────────────────────────────────────────
version          uint           Yes       Format version. Currently 2
base             array of maps  No        Per-object metadata — one entry per data object; each entry holds ALL metadata for that object independently
_reserved_       map            No        Library internals (provenance: encoder, time, uuid). Client code MUST NOT write to this.
_extra_          map            No        Client-writable catch-all for ad-hoc message-level annotations
any unknown key  any            No        Silently ignored on decode (forward compatibility)

Each data object is self-describing via its own per-frame descriptor (see below). The base array provides per-object metadata at the message level so readers can discover object metadata from the global frame alone, without opening each data object frame.

The base Array

The base array is one entry per data object. Each entry is a CBOR map holding ALL structured metadata for that object. The encoder auto-populates _reserved_.tensor (containing ndim, shape, strides, dtype) in each entry. Application keys (e.g. "mars") are preserved:

{
  "base": [
    {
      "mars": { "class": "od", "stream": "oper", "param": "2t", "date": "20260404" },
      "_reserved_": {
        "tensor": { "ndim": 2, "shape": [721, 1440], "strides": [1440, 1], "dtype": "float32" }
      }
    },
    {
      "mars": { "class": "od", "stream": "oper", "param": "10u", "date": "20260404" },
      "_reserved_": {
        "tensor": { "ndim": 2, "shape": [721, 1440], "strides": [1440, 1], "dtype": "float32" }
      }
    }
  ]
}

Each entry corresponds to one data object in order. Entries are independent — there is no tracking of which keys are common across objects. If you need to extract commonalities (e.g. for display or merge operations), use the compute_common() utility in software after decoding.

Key difference from earlier versions: There is no common/payload split. Every base[i] entry is self-contained. MARS keys that are shared across all objects (e.g. class, stream, date) are simply repeated in each entry.
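A compute_common-style utility boils down to intersecting the entries and keeping only keys whose values agree everywhere. A minimal sketch of that idea — simplified to string values, whereas the real utility operates on CBOR values, so treat this as an assumption-laden illustration rather than the library's implementation:

```rust
use std::collections::BTreeMap;

/// Keep only the key/value pairs shared (with equal values) by
/// every base entry. Returns an empty map for an empty message.
fn compute_common(entries: &[BTreeMap<String, String>]) -> BTreeMap<String, String> {
    let mut iter = entries.iter();
    let mut common = match iter.next() {
        Some(first) => first.clone(),
        None => return BTreeMap::new(),
    };
    for entry in iter {
        // Drop keys that are missing or differ in this entry.
        common.retain(|k, v| entry.get(k) == Some(v));
    }
    common
}
```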

The _reserved_ Section

The _reserved_ section at the message level holds library-managed provenance information. Client code can read these values but must not write to _reserved_ — the encoder validates this and rejects messages where client code has written to it.

{
  "_reserved_": {
    "encoder": { "name": "tensogram", "version": "0.1.0" },
    "time": "2026-04-06T12:00:00Z",
    "uuid": "550e8400-e29b-41d4-a716-446655440000"
  }
}

Note: _reserved_.encoder.version is set to the library’s crate version at compile time via env!("CARGO_PKG_VERSION") — the value above reflects the tensogram version in use.

Within each base[i] entry, the encoder also auto-populates _reserved_.tensor:

{
  "_reserved_": {
    "tensor": {
      "ndim": 2,
      "shape": [721, 1440],
      "strides": [1440, 1],
      "dtype": "float32"
    }
  }
}

The _extra_ Section

The _extra_ section is a client-writable catch-all for ad-hoc message-level annotations:

{
  "_extra_": {
    "source": "ifs-cycle49r2",
    "experiment_tag": "alpha-run-003"
  }
}

Example GlobalMetadata

A complete example with two data objects (temperature and wind fields):

{
  "version": 2,
  "base": [
    {
      "mars": {
        "class": "od", "stream": "oper", "expver": "0001",
        "date": "20260404", "time": "0000", "step": "0",
        "levtype": "sfc", "grid": "regular_ll", "param": "2t"
      },
      "_reserved_": {
        "tensor": { "ndim": 2, "shape": [721, 1440], "strides": [1440, 1], "dtype": "float32" }
      }
    },
    {
      "mars": {
        "class": "od", "stream": "oper", "expver": "0001",
        "date": "20260404", "time": "0000", "step": "0",
        "levtype": "sfc", "grid": "regular_ll", "param": "10u"
      },
      "_reserved_": {
        "tensor": { "ndim": 2, "shape": [721, 1440], "strides": [1440, 1], "dtype": "float32" }
      }
    }
  ],
  "_reserved_": {
    "encoder": { "name": "tensogram", "version": "0.6.0" },
    "time": "2026-04-06T12:00:00Z",
    "uuid": "550e8400-e29b-41d4-a716-446655440000"
  },
  "_extra_": {
    "source": "ifs-cycle49r2"
  }
}

Each base[i] entry is fully self-contained. The only key that varies between the two entries above is param. All other MARS keys are repeated — this is by design. Commonalities can be computed in software via compute_common() when needed.

Optional: Full GRIB Namespace Keys

When the GRIB importer runs with preserve_all_keys (CLI: --all-keys), all non-mars ecCodes namespace keys are stored under a "grib" sub-object within each base[i] entry:

{
  "base": [
    {
      "mars": { "class": "od", "grid": "regular_ll", "param": "2t", "..." : "..." },
      "grib": {
        "geography": { "Ni": 1440, "Nj": 721, "gridType": "regular_ll" },
        "time":      { "dataDate": 20260404, "dataTime": 0 },
        "ls":        { "edition": 2, "centre": "ecmf", "packingType": "grid_ccsds" },
        "parameter":  { "paramId": 167, "shortName": "2t", "units": "K" },
        "statistics": { "max": 311.03, "min": 212.84, "avg": 277.6 }
      },
      "_reserved_": {
        "tensor": { "ndim": 2, "shape": [721, 1440], "strides": [1440, 1], "dtype": "float64" }
      }
    }
  ]
}

The namespaces captured are: ls, geography, time, vertical, parameter, statistics. Keys may overlap between namespaces (e.g. gridType appears in both ls and geography); each namespace stores its own copy. Empty namespaces are omitted.

DataObjectDescriptor

Each data object frame contains its own CBOR descriptor. This descriptor fully describes how to decode the payload — its type, shape, encoding pipeline, and optional per-object metadata. It lives inside the data object frame (not in a central metadata block).

Key                 Type           Required     Description
──────────────────  ─────────────  ───────────  ──────────────────────────────────────────────
type                text           Yes          Object type, e.g. "ntensor" (Rust field: obj_type)
ndim                uint           Yes          Number of dimensions
shape               array of uint  Yes          Size of each dimension
strides             array of uint  Yes          Element stride per dimension
dtype               text           Yes          Data type string (see Data Types)
byte_order          text           Yes          "big" or "little"
encoding            text           Yes          "none" or "simple_packing"
filter              text           Yes          "none" or "shuffle"
compression         text           Yes          "none", "szip", "zstd", "lz4", "blosc2", "zfp", or "sz3"
hash                map            No           Integrity hash of the payload (see below)
masks               map            No           NaN / Inf bitmask companion descriptors (see below)
encoding params     various        Conditional  Required when encoding != "none"
filter params       various        Conditional  Required when filter != "none"
compression params  various        Conditional  Required when compression != "none"
any other key       any            No           Per-object encoding parameters

Example: Temperature Field Descriptor

Here is what a descriptor might look like for a global temperature field at 0.25-degree resolution, compressed with zstd:

{
  "type": "ntensor",
  "ndim": 2,
  "shape": [721, 1440],
  "strides": [1440, 1],
  "dtype": "float32",
  "byte_order": "little",
  "encoding": "simple_packing",
  "reference_value": 193.72,
  "binary_scale_factor": -16,
  "decimal_scale_factor": 0,
  "bits_per_value": 16,
  "filter": "none",
  "compression": "zstd",
  "zstd_level": 3,
  "hash": {
    "type": "xxh3",
    "value": "a1b2c3d4e5f60718"
  }
}

The params field in DataObjectDescriptor is for encoding parameters only (e.g. reference_value, bits_per_value). MARS keys and other application metadata are stored in the global metadata base[i]["mars"].

Encoding Parameters (simple_packing)

Key                   Type   Description
────────────────────  ─────  ──────────────────────────────────────
reference_value       float  Minimum value in the original data
binary_scale_factor   int    Power-of-2 scaling factor
decimal_scale_factor  int    Power-of-10 scaling factor
bits_per_value        uint   Number of bits per packed value (1-64)
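These parameters suggest the classic GRIB-style transform: a value Y is stored as the integer X = round((Y·10^D − R) / 2^E) and recovered as (R + X·2^E) / 10^D, with R = reference_value, E = binary_scale_factor, D = decimal_scale_factor. A minimal sketch under that assumption (`pack` / `unpack` are illustrative names, not the crate's implementation, and bit-width packing of X is omitted):

```rust
/// Quantize a value to its packed integer. Since R is the data
/// minimum, the result is non-negative.
fn pack(y: f64, r: f64, e: i32, d: i32) -> u64 {
    ((y * 10f64.powi(d) - r) / 2f64.powi(e)).round() as u64
}

/// Reconstruct an approximation of the original value.
fn unpack(x: u64, r: f64, e: i32, d: i32) -> f64 {
    (r + x as f64 * 2f64.powi(e)) / 10f64.powi(d)
}
```

With E = −16 the quantization step is 2⁻¹⁶ ≈ 1.5e-5, so a round-trip reproduces the input to within half that step.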

Filter Parameters (shuffle)

Key                   Type  Description
────────────────────  ────  ──────────────────────────────────────────────
shuffle_element_size  uint  Byte width of each element (e.g., 4 for float32)
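The shuffle filter regroups element bytes so all first bytes come first, then all second bytes, and so on — similar bytes end up adjacent, which usually helps the downstream compressor. A minimal sketch of the transform (illustrative, not the crate's code):

```rust
/// Byte-shuffle: gather byte `b` of every element into one run.
fn shuffle(data: &[u8], elem_size: usize) -> Vec<u8> {
    let n = data.len() / elem_size;
    let mut out = vec![0u8; data.len()];
    for i in 0..n {
        for b in 0..elem_size {
            out[b * n + i] = data[i * elem_size + b];
        }
    }
    out
}

/// Inverse transform: scatter the runs back into elements.
fn unshuffle(data: &[u8], elem_size: usize) -> Vec<u8> {
    let n = data.len() / elem_size;
    let mut out = vec![0u8; data.len()];
    for i in 0..n {
        for b in 0..elem_size {
            out[i * elem_size + b] = data[b * n + i];
        }
    }
    out
}
```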

Compression Parameters

szip:

Key                 Type           Description
──────────────────  ─────────────  ──────────────────────────────────────────────
szip_rsi            uint           Reference sample interval
szip_block_size     uint           Block size (typically 8 or 16)
szip_flags          uint           AEC encoding flags
szip_block_offsets  array of uint  Bit offsets of RSI block boundaries (computed by the library or provided via encode_pre_encoded, see Pre-encoded Payloads)

zstd:

Key         Type  Default  Description
──────────  ────  ───────  ────────────────────────
zstd_level  int   3        Compression level (1-22)

lz4: No additional parameters required.

blosc2:

Key              Type  Default  Description
───────────────  ────  ───────  ──────────────────────────────────────────────
blosc2_codec     text  "lz4"    Internal codec: blosclz, lz4, lz4hc, zlib, zstd
blosc2_clevel    int   5        Compression level (0-9)
blosc2_typesize  uint  (auto)   Element byte width for shuffle optimization

zfp:

Key            Type   Description
─────────────  ─────  ──────────────────────────────────────────────
zfp_mode       text   "fixed_rate", "fixed_precision", or "fixed_accuracy"
zfp_rate       float  Bits per value (only for fixed_rate)
zfp_precision  uint   Bit planes to keep (only for fixed_precision)
zfp_tolerance  float  Max absolute error (only for fixed_accuracy)

sz3:

Key                   Type   Description
────────────────────  ─────  ────────────────────────
sz3_error_bound_mode  text   "abs", "rel", or "psnr"
sz3_error_bound       float  Error bound value

Hash Descriptor

The optional hash field records an integrity digest of the raw payload bytes.

Key    Type  Description
─────  ────  ──────────────────
type   text  "xxh3"
value  text  Hex-encoded digest

NaN / Inf mask companion (masks)

When the object was encoded with allow_nan=true and/or allow_inf=true AND the payload actually contained at least one matching non-finite value, the descriptor carries a masks sub-map. Each kind (nan, inf+, inf-) is independently optional — only the kinds that appeared are present.

{
  ... standard DataObjectDescriptor fields ...,
  "masks": {
    "nan": {
      "method": "roaring",
      "offset": 40,
      "length": 12
    },
    "inf+": {
      "method": "rle",
      "offset": 52,
      "length": 3
    }
  }
}

Each entry:

Key     Type  Description
──────  ────  ──────────────────────────────────────────────
method  text  "none" | "rle" | "roaring" | "blosc2" | "zstd" | "lz4" — compression method actually used (may differ from the requested method due to the small-mask auto-fallback)
offset  uint  Byte offset of the mask blob from the start of the payload region (= first byte after the 16-byte frame header)
length  uint  Byte length of the mask blob on disk
params  map   Optional method-specific parameters (e.g. {"level": 3} for zstd, {"codec": "lz4", "level": 5} for blosc2)

Canonical key order for masks is the byte-lex sort inf+ < inf- < nan. The encoder writes mask blobs between the payload and the CBOR descriptor in the same canonical order. See NaN / Inf Handling for the encode / decode semantics.

IndexFrame

Index frames (header or footer) contain a CBOR map that lets readers jump directly to any data object without scanning.

Key      Type           Description
───────  ─────────────  ──────────────────────────────────────────────
offsets  array of uint  Byte offset of each data object frame from message start
lengths  array of uint  Byte length of each data object frame

Object count is derived from offsets.len(); lengths.len() must equal offsets.len() or the decoder emits a MetadataError.

Example IndexFrame

{
  "offsets": [256, 1048832, 2097408],
  "lengths": [1048576, 1048576, 524288]
}

The offsets array gives O(1) random access to any object — seek to offsets[i] and read lengths[i] bytes.

HashFrame

Hash frames (header or footer) mirror the per-object inline hash slots of each data-object frame’s footer (see wire-format.md §2.4), so readers can inspect the aggregate without walking every frame.

Key        Type           Description
─────────  ─────────────  ──────────────────────────────────────────────
algorithm  text           Hash algorithm name. "xxh3" is the only value a v3 encoder emits.
hashes     array of text  Hex-encoded digest for each object, in emission order.

Object count is derived from hashes.len(). An unknown algorithm value triggers an UnknownHashAlgorithm warning at validate time; the inline slots remain the authoritative check.

Example HashFrame

{
  "algorithm": "xxh3",
  "hashes": [
    "a1b2c3d4e5f60718",
    "b2c3d4e5f6071829",
    "c3d4e5f60718293a"
  ]
}

Canonical Encoding

All CBOR maps are encoded with keys sorted by the byte representation of their CBOR-encoded key (RFC 8949 section 4.2). This sorting is applied recursively — nested maps are also sorted.

For short string keys (the common case), this is equivalent to sorting by the key string itself. For long keys or non-string keys, the CBOR byte encoding determines the order.

Why does this matter? If you hash an entire message or compare messages by digest, deterministic encoding ensures that logically identical messages produce identical bytes even if the keys were inserted in different order during construction.
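In Rust this determinism falls out naturally from using a `BTreeMap`, which always iterates in key order regardless of insertion order — for short ASCII string keys that matches the canonical order described above. A minimal sketch (the helper is illustrative, not crate API):

```rust
use std::collections::BTreeMap;

/// Collect key/value pairs into a BTreeMap and return the keys in
/// the order a canonical encoder would emit them.
fn keys_in_emit_order(pairs: &[(&str, &str)]) -> Vec<String> {
    let map: BTreeMap<_, _> = pairs.iter().cloned().collect();
    map.keys().map(|k| k.to_string()).collect()
}
```

Two maps built with different insertion orders therefore serialize to identical bytes.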

Metadata Value Types

All Tensogram metadata — whether in GlobalMetadata, the base / _reserved_ / _extra_ sections, or per-object params — is stored as CBOR. This page describes which value types are valid, which are forbidden, and why.

Allowed Types

Use only the subset of CBOR types that have direct JSON equivalents:

CBOR type    Rust / Python equivalent        Example
───────────  ──────────────────────────────  ──────────────────────────────────────────
text string  String / str                    "imaging", "2026-01-12"
integer      i64 / int                       850, -1, 0
float        f64 / float                     3.14, -273.15
boolean      bool / bool                     true, false
null         None / None                     (absence of a value)
array        Vec<Value> / list               [1440, 721], ["t2", "flair"]
map          BTreeMap<String, Value> / dict  {"device": "mri", "sequence": "t2_flair"}

Map keys must be text strings. Nested arrays and maps are allowed and encoded recursively.

Forbidden Types

The following CBOR types are not allowed in Tensogram metadata:

Type                         Reason
───────────────────────────  ──────────────────────────────────────────────
byte strings                 Opaque blobs break cross-language interoperability; use base64 text instead
CBOR tags                    Tags (#6.<n>) are not parsed by most CBOR libraries and can change value semantics
undefined                    Only valid in streaming CBOR; never appears in map values
half-precision floats (f16)  Not supported by many JSON bridges; use f64
non-string map keys          Integer or binary keys are non-canonical and not searchable

The base Section

The base section of GlobalMetadata is a CBOR array of maps — one entry per data object. Each entry holds ALL structured metadata for that object independently. The encoder auto-populates _reserved_.tensor (with ndim, shape, strides, dtype) in each entry when you call encode() or StreamingEncoder::finish(). Any other keys the application placed in a base entry before encoding (e.g. a per-object vocabulary namespace) are preserved. The example below uses the MARS vocabulary; any application namespace works the same way:

{
  "version": 2,
  "base": [
    {
      "mars": { "class": "od", "type": "fc", "grid": "O1280", "param": "2t", "levtype": "sfc" },
      "_reserved_": {
        "tensor": { "ndim": 2, "shape": [721, 1440], "strides": [1440, 1], "dtype": "float64" }
      }
    },
    {
      "mars": { "class": "od", "type": "fc", "grid": "O1280", "param": "lnsp", "levtype": "ml" },
      "_reserved_": {
        "tensor": { "ndim": 1, "shape": [137], "strides": [1], "dtype": "float64" }
      }
    }
  ]
}

Each entry is fully self-contained — all keys for that object appear in its entry. There is no separate “common” section for shared keys. If you need to extract commonalities (e.g. for display), use the compute_common() utility in software after decoding.

Note: base describes the collection of objects at the message level. Individual tensor encoding details (encoding pipeline, hash) remain in each object’s own DataObjectDescriptor. The DataObjectDescriptor.params field is reserved for encoding parameters only — it does not carry application metadata.

Practical Guidance

  • Prefer integers for numeric identifiers (paramId, date, run_id).
  • Use text strings for classification codes even if they happen to be numeric-looking — consistency with your chosen vocabulary is more important than type optimisation.
  • Use nested maps for namespaced keys (e.g., "mars": {...}, "bids": {...}, "dicom": {...}).
  • Keep individual values small. Avoid storing large arrays (e.g., grid coordinates) in metadata — they belong in data objects.

See Also

Data Types

The dtype field in an object descriptor names the element type of the tensor. It is stored as a lowercase text string in CBOR.

Type Table

CBOR string  Rust variant       Bytes per element  Notes
───────────  ─────────────────  ─────────────────  ──────────────────────────────────────────────
float16      Dtype::Float16     2                  IEEE 754 half-precision
bfloat16     Dtype::Bfloat16    2                  Brain float — same exponent range as float32, less mantissa precision
float32      Dtype::Float32     4                  IEEE 754 single-precision
float64      Dtype::Float64     8                  IEEE 754 double-precision
complex64    Dtype::Complex64   8                  Pair of float32 (real, imaginary)
complex128   Dtype::Complex128  16                 Pair of float64 (real, imaginary)
int8         Dtype::Int8        1                  Signed
int16        Dtype::Int16       2                  Signed
int32        Dtype::Int32       4                  Signed
int64        Dtype::Int64       8                  Signed
uint8        Dtype::Uint8       1                  Unsigned
uint16       Dtype::Uint16      2                  Unsigned
uint32       Dtype::Uint32      4                  Unsigned
uint64       Dtype::Uint64      8                  Unsigned
bitmask      Dtype::Bitmask     0*                 Packed bits

*bitmask returns 0 from byte_width() — see the edge case note below.

Byte Order

The byte_order field in the payload descriptor specifies whether multi-byte elements are stored in big-endian ("big") or little-endian ("little") order. This applies to the stored payload bytes after encoding.

Single-byte types (int8, uint8, bitmask) are unaffected by byte order.

Bitmask Edge Case

Dtype::Bitmask is for packing boolean or categorical data sub-byte. The payload size is ceil(num_elements / 8) bytes. The byte_width() method returns 0 as a sentinel; callers that need the actual payload size must compute it:

#![allow(unused)]
fn main() {
let payload_bytes = if dtype == Dtype::Bitmask {
    (num_elements + 7) / 8
} else {
    num_elements * dtype.byte_width()
};
}
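Packing the bits themselves is equally short. A minimal sketch that packs booleans LSB-first within each byte — the bit order here is an assumption for illustration; consult the format spec for the normative order:

```rust
/// Pack booleans into ceil(n / 8) bytes, LSB-first per byte
/// (bit order is an illustrative assumption, not the spec).
fn pack_bits(flags: &[bool]) -> Vec<u8> {
    let mut out = vec![0u8; (flags.len() + 7) / 8];
    for (i, &f) in flags.iter().enumerate() {
        if f {
            out[i / 8] |= 1 << (i % 8);
        }
    }
    out
}
```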

Choosing a dtype

Situation                                    Recommended dtype
───────────────────────────────────────────  ────────────────────
Temperature, wind speed, pressure (weather)  float32
High-precision scientific analysis           float64
ML model weights                             bfloat16 or float16
Integer indices, counts                      int32 or int64
Land-sea masks, validity flags               uint8 or bitmask
Complex wave spectra                         complex64

Quick Start

This page walks you through encoding and decoding a real tensor — a 2D temperature field — in about 20 lines of Rust.

Installation

Rust

cargo add tensogram

Or add it to your Cargo.toml manually:

[dependencies]
tensogram = "0.15"

Optional features:

Feature    What it adds
─────────  ──────────────────────────────────────
mmap       Zero-copy memory-mapped file reads
async      Async I/O via tokio
remote     Read from S3, GCS, Azure Blob, or HTTP
szip-pure  Pure-Rust szip (no C dependency)
zstd-pure  Pure-Rust zstd (no C dependency)

All compression codecs (szip, zstd, lz4, blosc2, zfp, sz3) and multi-threading are enabled by default.

cargo add tensogram --features mmap,async,remote

Python

pip install tensogram

With xarray and Zarr backends:

pip install tensogram[all]      # everything
pip install tensogram[xarray]   # xarray backend only
pip install tensogram[zarr]     # Zarr v3 store only

CLI

cargo install tensogram-cli

Encode a 2-D Float Field

This example encodes a 100×200 float32 grid — representative of many scientific 2-D fields (temperature, pressure, intensity, density, …).

use std::collections::BTreeMap;
use tensogram::{
    encode, decode, GlobalMetadata, DataObjectDescriptor,
    ByteOrder, Dtype, EncodeOptions, DecodeOptions,
};

fn main() {
    // 1. Make some synthetic data: 100×200 float32 grid
    //    In production, this would come from your model output, sensor,
    //    or upstream pipeline.
    let shape = vec![100u64, 200];
    let strides = vec![200u64, 1]; // C-contiguous (row-major)
    let num_elements = 100 * 200;
    let data: Vec<u8> = (0..num_elements)
        .flat_map(|i| (273.15f32 + (i as f32 / 100.0)).to_be_bytes())
        .collect();

    // 2. Describe the tensor
    let global = GlobalMetadata {
        version: 2,
        ..Default::default()
    };

    let desc = DataObjectDescriptor {
        obj_type: "ntensor".to_string(),
        ndim: 2,
        shape,
        strides,
        dtype: Dtype::Float32,
        byte_order: ByteOrder::Big,
        encoding: "none".to_string(),
        filter: "none".to_string(),
        compression: "none".to_string(),
        params: BTreeMap::new(),
        hash: None, // hash is added automatically by EncodeOptions::default()
    };

    // 3. Encode — produces a self-contained message
    let message = encode(&global, &[(&desc, &data)], &EncodeOptions::default()).unwrap();

    println!("Encoded {} bytes", message.len());

    // 4. Decode it back
    let (meta, objects) = decode(&message, &DecodeOptions::default()).unwrap();

    println!(
        "Decoded: {} objects, shape {:?}, dtype {}",
        objects.len(),
        objects[0].0.shape,
        objects[0].0.dtype,
    );
    assert_eq!(objects[0].1, data);
}

Add Application Metadata

Real messages need application-layer metadata so downstream tools know what the data represents. Per-object metadata goes into the base array — one entry per data object — and is organised under a namespace key so that multiple vocabularies can coexist.

The example below uses ECMWF’s MARS vocabulary for concreteness. The same mechanism works with any vocabulary: CF conventions ("cf"), BIDS ("bids"), DICOM ("dicom"), or your own ("product", "experiment", "device", …).

#![allow(unused)]
fn main() {
use ciborium::Value;

// Build a "mars" namespace for the object — one concrete vocabulary example.
// You can just as easily use "bids", "dicom", "product", or any custom name.
let mars_map = vec![
    (Value::Text("class".into()), Value::Text("od".into())),
    (Value::Text("date".into()),  Value::Text("20260401".into())),
    (Value::Text("step".into()),  Value::Integer(6.into())),
    (Value::Text("type".into()),  Value::Text("fc".into())),
    (Value::Text("param".into()), Value::Text("2t".into())),
];

let mut entry = BTreeMap::new();
entry.insert("mars".to_string(), Value::Map(mars_map));

let global = GlobalMetadata {
    version: 2,
    base: vec![entry], // one entry per data object
    ..Default::default()
};

let desc = DataObjectDescriptor {
    obj_type: "ntensor".to_string(),
    ndim: 2,
    shape: vec![100, 200],
    strides: vec![200, 1],
    dtype: Dtype::Float32,
    byte_order: ByteOrder::Big,
    encoding: "none".to_string(),
    filter: "none".to_string(),
    compression: "none".to_string(),
    params: BTreeMap::new(),
    hash: None,
};
}

What’s Next?

  • Use simple_packing to reduce payload size by 4-8x
  • Use the File API to append many messages to a .tgm file
  • Use the CLI to inspect files without writing any code

Vocabularies

Tensogram is vocabulary-agnostic: the library never interprets metadata keys. The same message can carry any combination of application-defined namespaces alongside the auto-populated library-reserved keys. This page collects example vocabularies that have been (or could naturally be) used with Tensogram, so you can pick a convention that matches your domain — or invent your own.

How metadata is structured

A Tensogram message’s per-object metadata lives in base[i], a BTreeMap<String, ciborium::Value>. By convention, each application vocabulary sits under its own top-level namespace key so that multiple vocabularies can coexist without collision:

{
  "version": 2,
  "base": [{
    "mars":   { "class": "od", "param": "2t" },
    "cf":     { "standard_name": "air_temperature", "units": "K" },
    "custom": { "experiment": "run-042" }
  }]
}

All three namespaces above are valid, visible to tooling, and survive round-trip. The library never reads or validates their contents.

Example vocabularies

MARS (ECMWF, weather forecasting)

Used internally at ECMWF and by downstream consumers of ECMWF’s MARS archive. Keys describe the operational provenance of a forecast field: class, stream, type, parameter, level, date/time, step, etc.

{
  "mars": {
    "class": "od", "stream": "oper", "type": "fc",
    "date": "20260401", "time": "1200", "step": 6,
    "param": "2t", "levtype": "sfc"
  }
}

The GRIB importer (tensogram convert-grib) automatically populates this namespace from GRIB MARS keys. See MARS Key Mapping for the full key list.

CF Conventions (climate, ocean, atmospheric)

CF Conventions are the standard attribute vocabulary for climate and forecast data in NetCDF. The NetCDF importer (tensogram convert-netcdf --cf) lifts the CF allow-list into a "cf" sub-map. See NetCDF CF Metadata Mapping.

{
  "cf": {
    "standard_name": "air_temperature",
    "long_name": "2 metre temperature",
    "units": "K",
    "cell_methods": "time: mean"
  }
}

BIDS (neuroimaging)

The Brain Imaging Data Structure organises neuroimaging datasets with entity-level metadata, a natural fit for Tensogram messages carrying fMRI, dMRI, or EEG tensors.

{
  "bids": {
    "subject": "sub-01", "session": "ses-01",
    "task": "rest", "run": 1, "acq": "hires"
  }
}

DICOM (medical imaging)

DICOM tags are the standard descriptors for medical imaging studies. They can be mapped into a "dicom" namespace for use with Tensogram messages carrying imaging volumes, time-series, or segmentation masks.

{
  "dicom": {
    "Modality": "MR", "SeriesDescription": "T2_FLAIR",
    "SliceThickness": 1.0, "RepetitionTime": 8000
  }
}

Zarr attributes (generic)

Zarr v3 attribute maps are generic key-value stores. When using the Zarr backend (tensogram-zarr), group-level and array-level attributes are surfaced through _extra_ and per-array descriptor params.

Custom namespaces

For any domain that does not have an established vocabulary, or when a pipeline wants to carry bespoke fields alongside a standard namespace, invent your own:

{
  "experiment": {
    "id": "run-042",
    "operator": "alice",
    "hypothesis": "beam stability",
    "started_at": "2026-04-18T10:30:00Z"
  }
}

Suggested conventions for custom namespaces:

  • Use a short, lowercase namespace key ("product", "instrument", "run", "experiment", "device").
  • Group related fields under a single namespace rather than scattering them at the top level of base[i].
  • Prefer ISO 8601 timestamps, SI units in units fields, and UTF-8 text for identifiers.
  • Document your namespace schema somewhere versioned (a README, a JSON schema, a wiki page) so downstream consumers can interpret it consistently.

Multiple vocabularies in one message

You can freely mix vocabularies in the same base[i] entry — the library preserves all of them:

{
  "base": [{
    "mars":       { "param": "2t", "levtype": "sfc" },
    "cf":         { "standard_name": "air_temperature", "units": "K" },
    "provenance": { "pipeline_id": "pp-17", "stage": "post-process" }
  }]
}

This lets one team’s producers emit messages that are simultaneously interpretable by tools expecting MARS, CF-aware tooling, and an internal provenance tracker.

Looking up keys

The dotted-path helpers exposed by each binding vary. The CLI, the C FFI (tgm_metadata_get_string / _get_int / _get_float), the C++ wrapper (metadata::get_string / get_int / get_float), and the TypeScript package (getMetaKey) all accept a full dotted path. The Rust crate and the Python package do not expose a dotted-path helper at this time; use direct nested access instead.

TypeScript — dotted path

import { getMetaKey } from '@ecmwf/tensogram';

const param   = getMetaKey(meta, 'mars.param');
const subject = getMetaKey(meta, 'bids.subject');

CLI — dotted path

# Filter messages on a namespaced key
tensogram ls data.tgm -w "mars.param=2t/10u"
tensogram ls data.tgm -w "bids.subject=sub-01"

# Print specific keys
tensogram get -p "cf.standard_name,cf.units" data.tgm

Python — dict-style nested access

# Metadata.__getitem__ does a top-level search across base[i] (skipping
# _reserved_) and falls back to the message-level _extra_ map. The returned
# value is a plain Python dict, so the next lookup is standard dict access.
param   = meta["mars"]["param"]
subject = meta["bids"]["subject"]

# meta.base[i], meta.reserved, and meta.extra are also available directly
# if you want the raw per-object / reserved / extra dicts.
first_base = meta.base[0]

Rust — pattern-match on ciborium::Value

#![allow(unused)]
fn main() {
use ciborium::Value;
use tensogram::GlobalMetadata;

// `meta.base` is `Vec<BTreeMap<String, Value>>`. Find the namespace on
// the first-matching base entry, then pull a text field from the nested
// map. Falls back to `meta.extra` for message-level annotations.
fn get_text<'a>(meta: &'a GlobalMetadata,
                namespace: &str, field: &str) -> Option<&'a str> {
    let pull = |map: &'a [(Value, Value)]| -> Option<&'a str> {
        map.iter().find_map(|(k, v)| match (k, v) {
            (Value::Text(k), Value::Text(v)) if k == field => Some(v.as_str()),
            _ => None,
        })
    };
    for entry in &meta.base {
        if let Some(Value::Map(items)) = entry.get(namespace)
            && let Some(val) = pull(items)
        {
            return Some(val);
        }
    }
    if let Some(Value::Map(items)) = meta.extra.get(namespace) {
        return pull(items);
    }
    None
}

let param = get_text(&meta, "mars", "param");
}

Tensogram keeps the Rust surface small on purpose. If your pipeline needs dotted-path lookup in Rust, wrap the snippet above in a helper of your own, or call out to the CLI.

Lookup semantics (all bindings that support dotted paths)

First match across base[0], base[1], … (skipping _reserved_ within each entry), then fall back to the message-level _extra_ map. An explicit _extra_.key (or extra.key) prefix bypasses the base search.
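Those search rules are easy to replicate in bindings that lack a dotted-path helper. A plain-Python sketch over a metadata dict; get_meta_key here is a hypothetical name, not the library's helper:

```python
def get_meta_key(meta, path):
    """Dotted-path lookup following the documented order:
    base[0], base[1], ... (skipping _reserved_), then _extra_."""
    parts = path.split(".")

    def descend(node, keys):
        for k in keys:
            if not isinstance(node, dict) or k not in node:
                return None
            node = node[k]
        return node

    # An explicit _extra_./extra. prefix bypasses the base search.
    if parts[0] in ("_extra_", "extra"):
        return descend(meta.get("_extra_", {}), parts[1:])
    if parts[0] != "_reserved_":
        for entry in meta.get("base", []):
            found = descend(entry, parts)
            if found is not None:
                return found
    # Fall back to the message-level _extra_ map.
    return descend(meta.get("_extra_", {}), parts)
```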

Jupyter Notebook Walk-through

The examples/jupyter/ directory carries a curated set of narrative notebooks that introduce Tensogram interactively, with live visualisations. Unlike the flat .py examples under examples/python/ — which are minimal reference snippets for copy-paste — the notebooks are for learning.

Every notebook is executed end-to-end on every PR by the notebooks CI job, so they cannot rot.

The five journeys

  1. 01_quickstart_and_mars.ipynb: Encode & decode a 2D tensor, visualise it, attach MARS metadata, walk the base / _reserved_ / _extra_ layout.
  2. 02_encoding_and_fidelity.ipynb: Sweep every encoding × filter × compression combination and plot ratio vs time vs fidelity.
  3. 03_from_grib_to_tensogram.ipynb: Convert a real ECMWF opendata GRIB2 file with the new Python API (tensogram.convert_grib + tensogram.convert_grib_buffer).
  4. 04_from_netcdf_to_tensogram.ipynb: Build a small CF-compliant NetCDF in-process, convert it with tensogram.convert_netcdf, and open the result as an xarray Dataset via engine="tensogram".
  5. 05_validation_and_parallelism.ipynb: Run the four validation levels, inject corruption, sweep threads=0…N and plot the speedup.

Running the notebooks locally

Option 1 — uv + maturin develop

# Build the Python bindings with GRIB and NetCDF support.
# Requires libeccodes + libnetcdf installed at the OS level.
uv venv .venv --python 3.13
source .venv/bin/activate
uv pip install maturin
cd python/bindings
maturin develop --features grib,netcdf
cd ../..

# Install notebook-only dependencies + the xarray backend.
uv pip install -e examples/jupyter

# Launch JupyterLab.
jupyter lab examples/jupyter/

Option 2 — conda env create

conda env create -f examples/jupyter/environment.yml
conda activate tensogram-jupyter
jupyter lab examples/jupyter/

Option 3 — Binder / Colab

Launch badges in the notebook directory’s README.md — zero local install.

OS-level dependencies

Notebooks 03 (GRIB) and 04 (NetCDF) need C libraries installed at the operating system level. They are not Python packages.

Library              Needed by    macOS (Homebrew)          Debian / Ubuntu
libeccodes           notebook 03  brew install eccodes      apt install libeccodes-dev
libnetcdf + libhdf5  notebook 04  brew install netcdf hdf5  apt install libnetcdf-dev libhdf5-dev

The official PyPI wheels (pip install tensogram) do not ship GRIB / NetCDF support: the manylinux_2_28 base image lacks the C libraries. If you try to call tensogram.convert_grib(...) on a wheel without the feature, you get a clean RuntimeError("tensogram was built without GRIB support...") that points you at this page.

To enable the feature, rebuild from source:

git clone https://github.com/ecmwf/tensogram
cd tensogram/python/bindings
maturin develop --features grib,netcdf

Running the notebooks in CI

The repository runs the notebooks end-to-end on every PR via a dedicated notebooks job. The gate is:

pytest --nbval-lax examples/jupyter/ -v

--nbval-lax executes every cell in every notebook and fails the build on any exception. Cell outputs are not compared — we commit the notebooks with empty outputs (enforced by the python/tests/test_jupyter_structure.py guard).

Output hygiene

Committed notebooks must have empty cell outputs. Install the nbstripout pre-commit hook once:

uv pip install nbstripout
nbstripout --install

With the hook installed, git commit automatically strips outputs.

Adding a new notebook

  1. Copy an existing .ipynb as a template.
  2. First cell must be a markdown license banner mentioning “ECMWF” or “Apache”.
  3. Last cell must be a “Where to go next” markdown pointer.
  4. If you import matplotlib, call matplotlib.use("Agg") before the first import matplotlib.pyplot.
  5. Update EXPECTED_NOTEBOOKS in python/tests/test_jupyter_structure.py.
  6. Link it from examples/jupyter/README.md and this guide page.
  7. Run pytest --nbval-lax examples/jupyter/ locally before committing.
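A notebook file is plain JSON, so the conventions above can be checked mechanically. An illustrative approximation of such a guard, not the actual contents of python/tests/test_jupyter_structure.py:

```python
def check_notebook_structure(nb):
    """Return a list of convention violations for a parsed .ipynb dict."""
    problems = []
    cells = nb.get("cells", [])
    if not cells:
        return ["notebook has no cells"]
    first, last = cells[0], cells[-1]
    banner = "".join(first.get("source", []))
    if first.get("cell_type") != "markdown" or not (
        "ECMWF" in banner or "Apache" in banner
    ):
        problems.append("first cell must be a markdown license banner")
    if last.get("cell_type") != "markdown":
        problems.append("last cell must be a markdown pointer")
    for cell in cells:
        # Output hygiene: committed notebooks must have empty outputs.
        if cell.get("cell_type") == "code" and cell.get("outputs"):
            problems.append("code cell has committed outputs")
    return problems
```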

Encoding Data

This page covers the encode() function and EncodeOptions in detail.

Function Signature

#![allow(unused)]
fn main() {
pub fn encode(
    global_metadata: &GlobalMetadata,
    descriptors: &[(&DataObjectDescriptor, &[u8])],
    options: &EncodeOptions,
) -> Result<Vec<u8>>
}
  • global_metadata — reference to message-level metadata (version, base entries, _extra_ fields)
  • descriptors — a slice of (descriptor, data) pairs, one per object
  • options — controls hash algorithm and compression backend selection (the emit_preceders field is reserved for future buffered-mode support; preceders are currently only emitted via StreamingEncoder::write_preceder)

Returns a complete, self-contained message as a Vec<u8>.

EncodeOptions

#![allow(unused)]
fn main() {
pub struct EncodeOptions {
    /// Hash algorithm to use. None disables hashing entirely.
    pub hash_algorithm: Option<HashAlgorithm>,
    /// Reserved — buffered `encode()` rejects `true`. Use
    /// `StreamingEncoder::write_preceder()` instead.
    pub emit_preceders: bool,
    /// Which backend to use for szip / zstd when both FFI and pure-Rust
    /// implementations are compiled in.
    pub compression_backend: CompressionBackend,
}

impl Default for EncodeOptions {
    fn default() -> Self {
        Self {
            hash_algorithm: Some(HashAlgorithm::Xxh3),
            emit_preceders: false,
            compression_backend: CompressionBackend::default(),
        }
    }
}
}

The default applies xxh3 hashing to every object payload. Use None to skip hashing:

#![allow(unused)]
fn main() {
let options = EncodeOptions {
    hash_algorithm: None,
    ..Default::default()
};
}

What Encode Does

For each object, in order:

  1. Validate — checks that each pair has a descriptor and corresponding data
  2. Run the encoding pipeline — applies encoding, filter, compression from the object’s DataObjectDescriptor
  3. Hash — if hash_algorithm is set, computes and stores the hash in the descriptor
  4. Serialize CBOR — encodes the GlobalMetadata and all DataObjectDescriptors to canonical CBOR
  5. Frame — assembles preamble, header frames (metadata/index/hash), data object frames, and postamble

Encoding with Simple Packing

To use simple_packing, you need to compute the quantization parameters first, then put them in the DataObjectDescriptor:

#![allow(unused)]
fn main() {
use tensogram_encodings::simple_packing;
use tensogram::{ByteOrder, DataObjectDescriptor, Dtype};
use std::collections::BTreeMap;
use ciborium::Value;

// Your original values as f64 (simple_packing always works on f64).
// source_data might be a temperature grid, pressure field, intensity
// image, or any other bounded-range scalar field.
let values: Vec<f64> = source_data.iter().map(|&x| x as f64).collect();

// Compute quantization parameters for 16 bits per value
let params = simple_packing::compute_params(&values, 16, 0)?;

// Put the parameters into the descriptor
let mut packing_params = BTreeMap::new();
packing_params.insert("reference_value".into(),
    Value::Float(params.reference_value));
packing_params.insert("binary_scale_factor".into(),
    Value::Integer((params.binary_scale_factor as i64).into()));
packing_params.insert("decimal_scale_factor".into(),
    Value::Integer((params.decimal_scale_factor as i64).into()));
packing_params.insert("bits_per_value".into(),
    Value::Integer((params.bits_per_value as i64).into()));

let desc = DataObjectDescriptor {
    obj_type: "ntensor".to_string(),
    ndim: 2,
    shape: vec![100, 200],
    strides: vec![200, 1],
    dtype: Dtype::Float64,
    byte_order: ByteOrder::Big,
    encoding: "simple_packing".to_string(),
    filter: "none".to_string(),
    compression: "none".to_string(),
    params: packing_params,
    hash: None,
};
}

Then encode as normal, passing the original raw bytes (as f64 bytes):

#![allow(unused)]
fn main() {
let raw: Vec<u8> = values.iter().flat_map(|v| v.to_ne_bytes()).collect();

let global = GlobalMetadata { version: 2, ..Default::default() };
let message = encode(&global, &[(&desc, &raw)], &EncodeOptions::default())?;
}

The encoder applies simple_packing internally. The payload stored in the message is the packed bits, not the original f64 bytes.
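For intuition about what those parameters mean: GRIB-style simple packing maps each value Y to an integer X such that Y ≈ (R + X * 2**E) / 10**D, where R is the reference value, E the binary scale factor, and D the decimal scale factor. A pure-Python sketch of the round-trip; the library's compute_params may choose the parameters differently:

```python
def pack(values, bits_per_value, decimal_scale=0):
    """Quantise floats to bits_per_value-bit ints: Y ~ (R + X*2**E) / 10**D."""
    scaled = [v * 10**decimal_scale for v in values]
    reference = min(scaled)
    spread = max(scaled) - reference
    max_int = (1 << bits_per_value) - 1
    # Choose the binary scale so the spread just fits in bits_per_value.
    e = 0
    if spread > 0:
        while spread / 2**e > max_int:
            e += 1
        while e > -64 and spread / 2 ** (e - 1) <= max_int:
            e -= 1
    packed = [round((s - reference) / 2**e) for s in scaled]
    params = {
        "reference_value": reference,
        "binary_scale_factor": e,
        "decimal_scale_factor": decimal_scale,
        "bits_per_value": bits_per_value,
    }
    return params, packed

def unpack(params, packed):
    r = params["reference_value"]
    e = params["binary_scale_factor"]
    d = params["decimal_scale_factor"]
    return [(r + x * 2**e) / 10**d for x in packed]
```

With 16 bits per value, a field spanning ~18 K is reconstructed to within about 2**E, i.e. well under a thousandth of a kelvin.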

Encoding Multiple Objects

Pass multiple (descriptor, data) pairs:

#![allow(unused)]
fn main() {
let global = GlobalMetadata { version: 2, ..Default::default() };

let message = encode(
    &global,
    &[(&spectrum_desc, &spectrum_data), (&mask_desc, &land_mask_data)],
    &EncodeOptions::default(),
)?;
}

Each descriptor independently specifies its own encoding, compression, dtype, and byte order. The encoder processes each pair in sequence.

Error Conditions

Error        Cause
Encoding     NaN in data when using simple_packing
Encoding     bits_per_value out of range (0–64)
Compression  Compressor-specific error (invalid params, unsupported dtype)
Metadata     CBOR serialization failed

Pre-Encoded Data API (Advanced)

When to use this API

The encode_pre_encoded API is for advanced callers whose data is already encoded by an external pipeline (e.g., a GPU kernel that emits packed bytes, or a streaming receiver passing payloads through). It bypasses Tensogram’s internal encoding pipeline and uses the supplied bytes verbatim.

Do NOT use this API for ordinary encoding. Use encode() instead.

⚠️ The bit-vs-byte trap

WARNING: When using compression="szip", the szip_block_offsets parameter contains bit offsets, not byte offsets. The first offset must be 0 and every offset must satisfy offset <= encoded_bytes_len * 8. This matches the libaec/szip wire format. See cbor-metadata.md for the format reference.

Getting this wrong is the #1 caller mistake. Tensogram validates the offsets structurally (monotonicity, bounds) but cannot detect a byte-instead-of-bit mistake until decode_range fails.

API surface

Rust

#![allow(unused)]
fn main() {
pub fn encode_pre_encoded(
    metadata: &GlobalMetadata,
    descriptors_and_data: &[(&DataObjectDescriptor, &[u8])],
    options: &EncodeOptions,
) -> Result<Vec<u8>, TensogramError>
}

Python

import tensogram

msg: bytes = tensogram.encode_pre_encoded(
    global_meta_dict={"version": 2},
    descriptors_and_data=[(descriptor_dict, raw_bytes)],
    hash="xxh3",
)

C

tgm_error tgm_encode_pre_encoded(
    const char *metadata_json,
    const uint8_t *const *data_ptrs,
    const size_t *data_lens,
    size_t num_objects,
    const char *hash_algo,
    tgm_bytes_t *out
);

C++

std::vector<std::uint8_t> tensogram::encode_pre_encoded(
    const std::string& metadata_json,
    const std::vector<std::pair<const std::uint8_t*, std::size_t>>& objects,
    const encode_options& opts = {}
);

Hash semantics

The library always recomputes the hash of the pre-encoded bytes using the algorithm specified in EncodeOptions.hash_algorithm (default xxh3). Any hash the caller stored on the descriptor is silently overwritten. This guarantees the wire format invariant descriptor.hash == hash_algo(bytes) always holds.
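The overwrite rule is small but important. An illustrative sketch; Python's standard library has no xxh3, so blake2b stands in purely for demonstration:

```python
import hashlib

def finalize_descriptor(descriptor, payload):
    """Enforce the invariant descriptor.hash == hash_algo(bytes):
    whatever the caller stored is recomputed, never trusted."""
    out = dict(descriptor)
    out["hash"] = hashlib.blake2b(payload, digest_size=8).hexdigest()
    return out

desc = {"encoding": "simple_packing", "hash": "caller-supplied-garbage"}
final = finalize_descriptor(desc, b"\x01\x02\x03")
```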

Provenance semantics

The encoded message is byte-format-indistinguishable from one produced by encode(). The decoder cannot tell which API produced it. The provenance fields _reserved_.encoder.name, _reserved_.time, and _reserved_.uuid are populated identically.

Self-consistency checks

Before encoding, the library validates:

  1. Caller has not set EncodeOptions.emit_preceders (rejected).
  2. Caller has not put _reserved_ in their metadata (rejected).
  3. Each descriptor passes the standard validate_object checks.
  4. If compression="szip" and szip_block_offsets is supplied:
    • It’s a CBOR Array of u64.
    • First offset is 0.
    • Strictly monotonically increasing.
    • All bit offsets <= bytes_len * 8.
  5. If szip_block_offsets is supplied but compression != "szip", rejected.

These are structural checks only. The library does NOT trial-decode the bytes to verify they actually decode correctly.
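The offset checks in step 4 amount to a few comparisons. A sketch; note that offsets mistakenly given in bytes still pass these structural checks, which is exactly the bit-vs-byte trap described above:

```python
def validate_szip_block_offsets(offsets, encoded_bytes_len):
    """Structural validation of szip block offsets (offsets are BITS)."""
    if not offsets:
        raise ValueError("szip_block_offsets must not be empty")
    if offsets[0] != 0:
        raise ValueError("first offset must be 0")
    for prev, cur in zip(offsets, offsets[1:]):
        if cur <= prev:
            raise ValueError("offsets must be strictly increasing")
    bit_bound = encoded_bytes_len * 8
    # Offsets are increasing, so checking the last one bounds them all.
    if offsets[-1] > bit_bound:
        raise ValueError(f"offset {offsets[-1]} exceeds {bit_bound} bit bound")
```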

Limitation: encoding="none" size check

When encoding="none", the validate_object check enforces payload_len == shape_product * dtype_byte_width. This means you cannot pass compression-only payloads (e.g., zstd-compressed raw bytes) with encoding="none" because the compressed size will not match the expected raw size. Wrap such payloads in at least simple_packing or another encoding.
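The check is plain arithmetic over the descriptor. A sketch, with an illustrative (incomplete) dtype-width table:

```python
from math import prod

DTYPE_WIDTH = {"float32": 4, "float64": 8, "int16": 2, "uint8": 1}

def check_none_encoding_size(shape, dtype, payload_len):
    """encoding="none" requires payload_len == prod(shape) * dtype width."""
    expected = prod(shape) * DTYPE_WIDTH[dtype]
    if payload_len != expected:
        raise ValueError(
            f"data_len {payload_len} does not match expected {expected} bytes"
        )
```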

Worked example: simple_packing + szip with decode_range

#![allow(unused)]
fn main() {
use tensogram::{
    encode_pre_encoded, DataObjectDescriptor, EncodeOptions,
    GlobalMetadata, ByteOrder, Dtype,
};
use std::collections::BTreeMap;
use ciborium::Value;

// Pre-encoded bytes from a GPU kernel + szip block offsets in BITS
let pre_encoded_bytes: Vec<u8> = /* from GPU */;
let szip_offsets_bits: Vec<u64> = vec![0, 8192, 16384, /* ... */];

let mut params: BTreeMap<String, ciborium::Value> = BTreeMap::new();
params.insert("bits_per_value".into(), Value::Integer(24u64.into()));
params.insert("reference_value".into(), Value::Float(0.0));
params.insert("binary_scale_factor".into(), Value::Integer((-10i64).into()));
params.insert("decimal_scale_factor".into(), Value::Integer(0i64.into()));
params.insert("szip_rsi".into(), Value::Integer(128i64.into()));
params.insert("szip_block_size".into(), Value::Integer(16i64.into()));
params.insert("szip_flags".into(), Value::Integer(8i64.into()));
params.insert("szip_block_offsets".into(),
    Value::Array(szip_offsets_bits.into_iter()
        .map(|o| Value::Integer(o.into()))
        .collect()));

let desc = DataObjectDescriptor {
    obj_type: "ntensor".into(),
    ndim: 2,
    shape: vec![1024, 1024],
    strides: vec![1024, 1],
    dtype: Dtype::Float32,
    byte_order: ByteOrder::Big,
    encoding: "simple_packing".into(),
    filter: "none".into(),
    compression: "szip".into(),
    params,
    hash: None,
};

let msg = encode_pre_encoded(
    &GlobalMetadata::default(),
    &[(&desc, &pre_encoded_bytes)],
    &EncodeOptions::default(),
)?;

// decode_range works because szip_block_offsets is present.
}

How it works

flowchart TD
    subgraph pre["encode_pre_encoded path"]
        A[Caller bytes] --> B[validate_object]
        B --> C[validate_szip_block_offsets]
        C --> D[Recompute hash]
    end

    subgraph normal["encode path"]
        G[Caller bytes] --> H[Run encoding pipeline]
        H --> D
    end

    D --> E[Wrap in CBOR framing]
    E --> F[Wire message]

The pre-encoded path skips the pipeline entirely. The wire format is identical.

Byte order

When using encoding="none", the caller’s bytes are stored verbatim — the library does NOT validate or flip byte order on encode. The bytes must be in the byte order declared in the descriptor’s byte_order field.

For example, if byte_order="big" and encoding="none", the caller must provide big-endian bytes.

On decode, the library automatically converts to native byte order by default (native_byte_order=true). Callers can use from_ne_bytes() or data_as<T>() directly without worrying about which byte order was used on the wire. Set native_byte_order=false to get the raw wire-order bytes.

Streaming API

StreamingEncoder::write_object_pre_encoded() is the streaming counterpart of encode_pre_encoded(). It writes a single pre-encoded object to the stream. It can be interleaved freely with write_object() (normal encode) calls.

Rust

#![allow(unused)]
fn main() {
let mut enc = StreamingEncoder::new(output, &metadata, &options)?;
enc.write_object_pre_encoded(&descriptor, &pre_encoded_bytes)?;
enc.finish()?;
}

Python

enc = tensogram.StreamingEncoder({"version": 2})
enc.write_object_pre_encoded(descriptor_dict, raw_bytes)
msg = enc.finish()

C++

tensogram::streaming_encoder enc(path, metadata_json);
enc.write_object_pre_encoded(descriptor_json, data_ptr, data_len);
enc.finish();

Error reference

encode_pre_encoded can raise the following errors:

Error condition                                  Message contains
obj_type is empty                                "obj_type must not be empty"
ndim doesn’t match shape.len()                   "ndim … does not match shape.len()"
strides.len() doesn’t match shape.len()          "strides.len() … does not match shape.len()"
encoding="none" and data size wrong              "data_len … does not match expected … bytes"
emit_preceders=true in buffered mode             "emit_preceders is not supported"
Caller set _reserved_ in metadata                "_reserved_"
szip_block_offsets not starting at 0             "first offset must be 0"
szip_block_offsets not strictly increasing       "strictly increasing"
szip_block_offsets exceeds bit bound             "exceeds … bit bound"
szip_block_offsets with non-szip compression     "szip_block_offsets provided but compression"
Unknown encoding string                          "encoding"
Unknown dtype                                    "unknown dtype"

Strides convention

The library treats strides as opaque metadata — it only validates that strides.len() == shape.len(). The convention differs between language bindings:

  • Rust tests use element strides (e.g., [1] for 1D, [5, 1] for shape [4, 5])
  • C++ tests use byte strides (e.g., [4] for float32, [12, 4] for shape [2, 3] float32)

Both conventions work correctly since the library does not interpret stride values.
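The two conventions differ only by the item-size factor. A row-major sketch reproducing the examples above:

```python
def element_strides(shape):
    """Row-major element strides, e.g. [5, 1] for shape [4, 5]."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides

def byte_strides(shape, itemsize):
    """Row-major byte strides, e.g. [12, 4] for shape [2, 3] float32."""
    return [s * itemsize for s in element_strides(shape)]
```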

Decoding Data

Tensogram provides four decode functions for different use cases. Choose the one that does the least work for your situation — they are all zero-copy on the metadata path.

The DecodedObject Type

Before diving in, it helps to know the common return type:

#![allow(unused)]
fn main() {
type DecodedObject = (DataObjectDescriptor, Vec<u8>);
}

A DecodedObject is a tuple of the object’s descriptor (shape, dtype, encoding info, etc.) and the decoded raw bytes. You will see this pattern throughout the decode API.

Four Decode Functions

decode — full message

#![allow(unused)]
fn main() {
pub fn decode(
    message: &[u8],
    options: &DecodeOptions,
) -> Result<(GlobalMetadata, Vec<(DataObjectDescriptor, Vec<u8>)>)>
}

Decodes all objects. Returns the global metadata and a vector of DecodedObject tuples — one per object, with raw bytes in the logical dtype after de-quantization.

#![allow(unused)]
fn main() {
let (meta, objects) = decode(&message, &DecodeOptions::default())?;

// Each element is (DataObjectDescriptor, Vec<u8>)
let (ref desc, ref data) = objects[0];
println!("shape: {:?}, dtype: {}, bytes: {}", desc.shape, desc.dtype, data.len());
}

decode_metadata — metadata only

#![allow(unused)]
fn main() {
pub fn decode_metadata(message: &[u8]) -> Result<GlobalMetadata>
}

Reads only the CBOR section. Does not touch any payload bytes. Use this for filtering and listing.

#![allow(unused)]
fn main() {
let meta = decode_metadata(&message)?;
println!("version: {}", meta.version);
}

decode_object — single object by index

#![allow(unused)]
fn main() {
pub fn decode_object(
    message: &[u8],
    index: usize,
    options: &DecodeOptions,
) -> Result<(GlobalMetadata, DataObjectDescriptor, Vec<u8>)>
}

Decodes one object without reading the others. Uses the binary header’s offset table to seek directly to the right payload. O(1) seek regardless of how many objects the message contains.

Returns the global metadata, the object’s descriptor, and the decoded bytes as a three-element tuple.

#![allow(unused)]
fn main() {
// Decode only the second object (index 1)
let (meta, descriptor, payload) = decode_object(&message, 1, &DecodeOptions::default())?;
println!("shape: {:?}, dtype: {}", descriptor.shape, descriptor.dtype);
}

Edge case: If index >= num_objects, returns TensogramError::Object("index out of range").

decode_range — partial sub-tensor

#![allow(unused)]
fn main() {
pub fn decode_range(
    message: &[u8],
    object_index: usize,
    ranges: &[(u64, u64)],  // (offset, count) in flattened element order
    options: &DecodeOptions,
) -> Result<(DataObjectDescriptor, Vec<Vec<u8>>)>
}

Decodes one or more contiguous slices of elements from an object. Each (offset, count) pair in ranges selects a span of elements along the flattened dimension; the function returns one byte vector per range by default. This split-result design avoids an unnecessary copy when the caller needs the ranges individually (e.g. to feed separate array slices).

Rust — split results (default)

#![allow(unused)]
fn main() {
// Two separate ranges from object 0
let (desc, parts) = decode_range(
    &message, 0,
    &[(100, 50), (300, 25)],
    &DecodeOptions::default(),
)?;
assert_eq!(parts.len(), 2);           // one Vec<u8> per range
println!("first  range bytes: {}", parts[0].len());
println!("second range bytes: {}", parts[1].len());
}

Rust — joined result

If you prefer a single contiguous buffer, flatten the results:

#![allow(unused)]
fn main() {
let joined: Vec<u8> = parts.into_iter().flatten().collect();
}

Python — split results (default, join=False)

import tensogram

parts = tensogram.decode_range(buf, object_index=0, ranges=[(100, 50), (300, 25)])
# parts is a list of numpy arrays, one per range
print(len(parts))        # 2
print(parts[0].shape)    # (50,)

Python — joined result (join=True)

arr = tensogram.decode_range(buf, object_index=0, ranges=[(100, 50), (300, 25)], join=True)
# arr is a single flat numpy array with all ranges concatenated
print(arr.shape)          # (75,)

N-dimensional slicing: The xarray backend maps N-dimensional slice notation (e.g. ds["temperature"].sel(lat=slice(10, 20), lon=slice(30, 40))) into the (offset, count) pairs that decode_range expects, so you rarely need to compute flattened offsets by hand when working through xarray.
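To see what that mapping involves, consider a 2-D row-major tensor: a rectangular selection becomes one contiguous (offset, count) pair per selected row. A hypothetical sketch, not the xarray backend's actual code:

```python
def slice_to_ranges(shape, rows, cols):
    """Map a 2-D sub-rectangle (half-open (start, stop) per axis) onto
    flattened (offset, count) pairs for a row-major layout."""
    r0, r1 = rows
    c0, c1 = cols
    ncols = shape[1]
    return [(r * ncols + c0, c1 - c0) for r in range(r0, r1)]

# Rows 10-11, columns 30-39 of a 100x200 tensor: two ranges of 10 elements.
ranges = slice_to_ranges((100, 200), (10, 12), (30, 40))
```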

Pre-encoded messages: Messages produced via encode_pre_encoded only support decode_range if the caller provided the necessary bit-precise szip_block_offsets (see Pre-encoded Payloads).

Edge case: decode_range works with all encoding+compression combinations that support random access: uncompressed data, simple_packing (bit extraction), szip (RSI block seeking), blosc2 (chunk access), and zfp fixed-rate mode. It returns an error for the shuffle filter (byte rearrangement breaks contiguous sample ranges) and for stream compressors (zstd, lz4, sz3) that don’t support partial decode.

DecodeOptions

#![allow(unused)]
fn main() {
pub struct DecodeOptions {
    /// If true, verify the hash of each decoded payload.
    pub verify_hash: bool,
    /// When true (the default), decoded payloads are converted to the
    /// caller's native byte order. Set to false to receive bytes in the
    /// message's declared wire byte order.
    pub native_byte_order: bool,
    /// Which backend to use for szip / zstd when both FFI and pure-Rust
    /// implementations are compiled in.
    pub compression_backend: CompressionBackend,
}

impl Default for DecodeOptions {
    fn default() -> Self {
        Self {
            verify_hash: false,
            native_byte_order: true,
            compression_backend: CompressionBackend::default(),
        }
    }
}
}

Native byte order (default)

By default, all decoded data is returned in the caller’s native byte order — the library handles any necessary byte-swapping automatically. You never need to check byte_order or call .byteswap():

#![allow(unused)]
fn main() {
let (_, objects) = decode(&message, &DecodeOptions::default())?;
let floats: Vec<f32> = objects[0].1
    .chunks_exact(4)
    .map(|c| f32::from_ne_bytes(c.try_into().unwrap()))
    .collect();
}

In Python, numpy arrays are always directly usable:

_, objects = tensogram.decode(msg)
arr = objects[0][1]   # numpy array — values are correct, no byteswap needed

This applies to all decode functions (decode, decode_object, decode_range), all encodings (none, simple_packing), all compression codecs, and all language bindings (Rust, Python, C, C++).
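What the conversion means at the byte level can be illustrated with the standard struct module, independent of the library:

```python
import struct

def payload_to_f32(payload, wire_big_endian):
    """Conceptually what native_byte_order=True does: interpret the
    wire-order bytes so the caller gets correct native values."""
    fmt = (">" if wire_big_endian else "<") + f"{len(payload) // 4}f"
    return list(struct.unpack(fmt, payload))

# A big-endian wire payload of three float32 values:
wire = struct.pack(">3f", 1.0, 2.0, 3.0)
values = payload_to_f32(wire, wire_big_endian=True)
```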

Wire byte order (opt-in)

Set native_byte_order: false to receive the raw bytes in the message’s declared wire byte order. This is useful for zero-copy forwarding or when you need the exact on-wire representation:

#![allow(unused)]
fn main() {
let opts = DecodeOptions { native_byte_order: false, ..Default::default() };
let (_, objects) = decode(&message, &opts)?;
// objects[0].1 is in the descriptor's declared byte_order (e.g. big-endian)
}

Hash verification

Hash verification is opt-in. Enable it when data integrity is critical:

#![allow(unused)]
fn main() {
let options = DecodeOptions { verify_hash: true, ..Default::default() };
let result = decode(&message, &options);
// Returns Err(TensogramError::HashMismatch { expected, actual }) if corrupted
}

Edge case: If the descriptor has no hash (i.e. the message was encoded with hash_algorithm: None), verify_hash: true silently skips verification for that object. No error is returned.

Working with the Decoded Bytes

Decoded bytes are in native byte order (with the default DecodeOptions). Cast them as native:

#![allow(unused)]
fn main() {
// float32 object → use from_ne_bytes
let floats: Vec<f32> = data
    .chunks_exact(4)
    .map(|c| f32::from_ne_bytes(c.try_into().unwrap()))
    .collect();
}

For simple_packing decoded data, the output is always f64 bytes (8 bytes per element), regardless of the original dtype stored in the descriptor:

#![allow(unused)]
fn main() {
// simple_packing always decodes to f64, in native byte order
let values: Vec<f64> = data
    .chunks_exact(8)
    .map(|c| f64::from_ne_bytes(c.try_into().unwrap()))
    .collect();
}

Scanning for Messages First

If you’re working with a buffer that might contain multiple messages (e.g. a .tgm file loaded into memory), scan it first to get message boundaries:

#![allow(unused)]
fn main() {
let offsets = scan(&big_buffer); // Vec<(usize, usize)> = (start, length)

for (start, len) in offsets {
    let msg = &big_buffer[start..start + len];
    let meta = decode_metadata(msg)?;
    println!("version: {}", meta.version);
}
}

The scan function is tolerant of corruption — it skips invalid regions and continues looking for the next valid TENSOGRM marker.
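The tolerant-scan idea can be sketched as a marker search. Illustrative only; the real scan additionally validates framing before reporting a (start, length) pair:

```python
MARKER = b"TENSOGRM"

def scan_marker_candidates(buffer):
    """Find every candidate message start by its magic marker,
    skipping any junk bytes in between."""
    starts = []
    pos = buffer.find(MARKER)
    while pos != -1:
        starts.append(pos)
        pos = buffer.find(MARKER, pos + 1)
    return starts
```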

NaN / Inf handling

By default the Tensogram encoder rejects any NaN or ±Inf in float / complex payloads. The encode call fails with TensogramError::Encoding (C FFI: TgmError::Encoding; Python: EncodingError; TypeScript: EncodingError; C++: tensogram::encoding_error) and names the element index, dtype, and a hint that points at the opt-in flags described below.

This chapter walks through the three policies available on encode:

  1. Reject (default) — any non-finite input fails the call. Use this when your pipeline guarantees finite values and any NaN / Inf is a bug you want to surface loudly.
  2. Allow NaN — NaN values are substituted with 0.0 on the wire and their positions are recorded in a compressed bitmask stored alongside the payload. Decode restores canonical NaN at those positions by default.
  3. Allow ±Inf — same as allow_nan but for +∞ and −∞ together (the flag covers both signs; two per-sign bitmasks are written when both kinds appear in the payload).

The mask companion is formally called the NTensorFrame — wire-format type 9, defined in plans/BITMASK_FRAME.md and the wire-format reference.

When to use which policy

Situation                                                Flag to set
Finite data only, want hard failure on contamination     default (both off)
NetCDF _FillValue → NaN, Zarr missing data, sensor gaps  allow_nan=true
Propagating numerical overflow as ±Inf                   allow_inf=true
Mixed missing-value / overflow data                      both true

Don’t pre-process to a sentinel value when allow_nan / allow_inf does the job — the bitmask is designed to compress aggressively (hybrid Roaring containers by default) and keeps the missing-data semantics visible to the decoder. Sentinel values throw that information away.

Cross-language opt-in

Rust

#![allow(unused)]
fn main() {
use tensogram::{encode, EncodeOptions, GlobalMetadata, DataObjectDescriptor};

let options = EncodeOptions {
    allow_nan: true,
    allow_inf: true,
    ..Default::default()
};
let msg = encode(&meta, &[(&desc, payload_bytes)], &options)?;
}

Python

import numpy as np
import tensogram

data = np.array([1.0, np.nan, 3.0], dtype=np.float64)
msg = tensogram.encode(
    {"version": 2},
    [(desc, data)],
    allow_nan=True,
)
decoded = tensogram.decode(msg)
# decoded.objects[0].data() → [1.0, nan, 3.0]

TypeScript

import { encode, decode } from '@ecmwf/tensogram';

const msg = encode(
    { version: 2 },
    [{ descriptor, data: new Float64Array([1, NaN, 3]) }],
    { allowNan: true },
);
const decoded = decode(msg);

C++

tensogram::encode_options opts;
opts.allow_nan = true;
auto msg = tensogram::encode(metadata_json, objects, opts);

CLI

$ tensogram --allow-nan reshuffle -o out.tgm input.tgm
$ TENSOGRAM_ALLOW_NAN=1 tensogram convert-netcdf data.nc -o data.tgm

Decode-side reconstruction

By default every decode path restores the canonical quiet-NaN / ±Inf bit pattern at every masked position. Opt out (e.g. to inspect the on-disk zero-substituted representation) by passing restore_non_finite=false:

# Get the 0.0-substituted payload without the NaN bits.
raw = tensogram.decode(msg, restore_non_finite=False)
# raw.objects[0].data() → [1.0, 0.0, 3.0]

The advanced decode_with_masks API (Rust + Python) returns both the zero-substituted payload AND the raw decompressed per-kind Vec<bool> masks, so callers can build custom missing-value representations without materialising canonical NaN bytes.

Lossy reconstruction — read this carefully

The masked encode path does not preserve the original NaN payload bits. On decode every masked NaN is restored with the canonical quiet-NaN pattern:

  • f32::NAN bits = 0x7FC00000
  • f64::NAN bits = 0x7FF8000000000000
  • Float16 / bfloat16 use their dtype-native quiet-NaN patterns
  • Complex64 / complex128 restore the canonical pattern to both real and imag components

Signalling NaNs, custom payload bits, and mixed real / imag kinds for complex dtypes are therefore flattened to the canonical form through a mask round-trip. If you need bit-exact NaN preservation, pre-encode your payload and use encode_pre_encoded to bypass the substitute-and-mask stage entirely. See plans/BITMASK_FRAME.md §7.1 for the full design rationale.
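As a quick sanity check, the canonical bit patterns listed above can be verified with nothing but the Python standard library; this sketch is independent of Tensogram itself:

```python
import math
import struct

# Canonical quiet-NaN bit patterns quoted above.
F32_QNAN_BITS = 0x7FC00000
F64_QNAN_BITS = 0x7FF8000000000000

# Round-tripping the bits through the IEEE-754 encodings yields NaN...
f32 = struct.unpack("<f", struct.pack("<I", F32_QNAN_BITS))[0]
f64 = struct.unpack("<d", struct.pack("<Q", F64_QNAN_BITS))[0]
assert math.isnan(f32) and math.isnan(f64)

# ...and a NaN with a custom payload has different bits, which is exactly
# the information a mask round-trip flattens to the canonical pattern.
payload_nan_bits = F64_QNAN_BITS | 0xDEAD
assert math.isnan(struct.unpack("<d", struct.pack("<Q", payload_nan_bits))[0])
assert payload_nan_bits != F64_QNAN_BITS
```

Both values compare as NaN, yet only one survives a mask round-trip bit-exactly.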

Mask compression methods

Six methods are available per-kind:

| Method | Best for | Feature |
|---|---|---|
| roaring (default) | any mask shape | pure Rust, works on WASM |
| rle | highly clustered masks (land / sea, swath gaps) | pure Rust |
| blosc2 | dense dtype-aligned masks | blosc2 feature |
| zstd | generic good-ratio | zstd feature |
| lz4 | decode-speed priority | lz4 feature |
| none | tiny masks (auto-fallback) | always available |

Small masks (uncompressed bit-packed byte count ≤ 128 by default) automatically fall back to none regardless of the requested method — compressing a few bytes costs more than it saves. Set small_mask_threshold_bytes = 0 to disable the auto-fallback.
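The threshold arithmetic is simple: a bit-packed mask needs one bit per element, so the default 128-byte cut-off covers masks of up to 1024 elements. A sketch (the helper name is ours, not the library's):

```python
def bitpacked_mask_bytes(n_elements: int) -> int:
    """Uncompressed size of a one-bit-per-element mask."""
    return (n_elements + 7) // 8

DEFAULT_THRESHOLD = 128  # the small_mask_threshold_bytes default

# Masks of up to 1024 elements fall back to the `none` method by default.
assert bitpacked_mask_bytes(1024) == DEFAULT_THRESHOLD
assert bitpacked_mask_bytes(1025) == 129  # first size eligible for compression
```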

Set per-kind methods via the matching options:

msg = tensogram.encode(
    meta, [(desc, data)],
    allow_nan=True, allow_inf=True,
    nan_mask_method='rle',
    pos_inf_mask_method='roaring',
    neg_inf_mask_method='roaring',
    small_mask_threshold_bytes=0,
)

Validation

tensogram validate --full cross-checks every NaN / ±Inf in the decoded payload against the frame’s mask companion: masked positions are expected and pass; any NaN / Inf at a non-masked position is reported as NanDetected / InfDetected (see the validator reference).

Files without a mask companion keep the pre-0.17 semantics — any non-finite value in the decoded output is an error.

Migration from pre-0.17

Prior to 0.17, the opt-in reject_nan / reject_inf flags enabled a pipeline-independent NaN / Inf check; by default, non-finite values passed through. These flags are removed in 0.17 (breaking change). Rejection is now always on by default; opt in to masked substitution with the replacement flags:

| Pre-0.17 | 0.17+ |
|---|---|
| reject_nan=False (default, pass-through) | allow_nan=True (substitute + mask) |
| reject_nan=True (opt-in reject) | default (always reject) |
| reject_inf=False / True | same split, allow_inf |

See CHANGELOG.md for the full breaking-change list and upgrade notes.

Working with Files

The TensogramFile struct provides a high-level API for reading and writing .tgm files. It handles lazy scanning, buffered appending, and random access by message index.

Creating a File

#![allow(unused)]
fn main() {
use tensogram::{TensogramFile, EncodeOptions};

let mut file = TensogramFile::create("forecast.tgm")?;
}

This creates (or truncates) the file. No data is written yet.

Appending Messages

#![allow(unused)]
fn main() {
use std::collections::BTreeMap;
use tensogram::{
    GlobalMetadata, DataObjectDescriptor, ByteOrder, Dtype, EncodeOptions,
};

let global = GlobalMetadata { version: 2, ..Default::default() };

let desc = DataObjectDescriptor {
    obj_type: "ntensor".to_string(),
    ndim: 2,
    shape: vec![100, 200],
    strides: vec![200, 1],
    dtype: Dtype::Float32,
    byte_order: ByteOrder::Big,
    encoding: "none".to_string(),
    filter: "none".to_string(),
    compression: "none".to_string(),
    params: BTreeMap::new(),
    hash: None,
};

file.append(&global, &[(&desc, &data)], &EncodeOptions::default())?;
}

Each append encodes one message and appends it to the end of the file. You can call it as many times as you like — each message is independent and self-describing.

Typical pattern for writing a multi-message file (one message per parameter, run, subject, sample, experiment — whatever your pipeline produces):

#![allow(unused)]
fn main() {
let mut file = TensogramFile::create("output.tgm")?;

for key in ["2t", "10u", "10v", "msl"] {
    let (global, desc, data) = produce_field(key);
    file.append(&global, &[(&desc, &data)], &EncodeOptions::default())?;
}
}

Opening and Counting Messages

#![allow(unused)]
fn main() {
let mut file = TensogramFile::open("forecast.tgm")?;

// Streaming scan happens here (lazily, on first access)
let count = file.message_count()?;
println!("{} messages in file", count);
}

The first access triggers a streaming scan that reads preamble-sized chunks and seeks forward, so it never loads the entire file into memory. After that, every read_message call is a seek + read — no further scanning.
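To make the scan concrete, here is a toy version of the seek-forward loop. It assumes a preamble layout of an 8-byte TENSOGRM magic followed by a little-endian u64 total_length; the real field offsets within the 24-byte preamble are defined by the wire-format reference, so treat these offsets as illustrative only:

```python
import io
import struct

MAGIC = b"TENSOGRM"
PREAMBLE_LEN = 24  # per the docs; field layout beyond magic + length is assumed

def scan_stream(f):
    """Collect (start, total_length) by reading preambles and seeking forward."""
    offsets = []
    pos = 0
    while True:
        f.seek(pos)
        preamble = f.read(PREAMBLE_LEN)
        if len(preamble) < PREAMBLE_LEN or preamble[:8] != MAGIC:
            break
        (total_length,) = struct.unpack_from("<Q", preamble, 8)
        offsets.append((pos, total_length))
        pos += total_length  # skip the payload without reading it
    return offsets

def fake_message(payload: bytes) -> bytes:
    total = PREAMBLE_LEN + len(payload)
    return MAGIC + struct.pack("<Q", total) + b"\x00" * 8 + payload

buf = fake_message(b"a" * 100) + fake_message(b"b" * 50)
assert [start for start, _ in scan_stream(io.BytesIO(buf))] == [0, 124]
```

The key property is the one the text describes: only preamble-sized reads, never the payloads, so memory use stays constant regardless of file size.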

Reading Messages

#![allow(unused)]
fn main() {
use tensogram::{decode, DecodeOptions};

// Read raw bytes of message 3
let raw_bytes = file.read_message(3)?;

// Decode message 3
let (meta, objects) = decode(&raw_bytes, &DecodeOptions::default())?;

// Each element is (DataObjectDescriptor, Vec<u8>)
let (ref desc, ref data) = objects[0];
println!("shape: {:?}, dtype: {}", desc.shape, desc.dtype);
}

Both are O(1) after the initial scan: they seek to the stored offset and read length bytes.

Iterating Over All Messages

#![allow(unused)]
fn main() {
let mut file = TensogramFile::open("forecast.tgm")?;

for raw in file.iter()? {
    let raw = raw?;
    let meta = tensogram::decode_metadata(&raw)?;
    println!("version: {}", meta.version);
}
}

Memory note: For files with many large messages, prefer iterating by index with read_message(i) inside a loop to process one at a time.

Random Access by Index

One of Tensogram’s design goals is O(1) object access. After scanning, any message is reachable in constant time. Within a message, any object is reachable in constant time via the binary header’s offset table:

flowchart TD
    A["file.read_message(42)"]
    B["Message bytes"]
    C["Binary header"]
    D["Seek to payload 2"]
    E["Decode only object 2"]

    A -- "seek + read" --> B
    B --> C
    C -- "lookup offset for object 2" --> D
    D --> E

    style A fill:#388e3c,stroke:#2e7d32,color:#fff
    style E fill:#1565c0,stroke:#0d47a1,color:#fff

File Layout Diagram

forecast.tgm
├── [message 0] — TENSOGRM ... 39277777
├── [message 1] — TENSOGRM ... 39277777
├── [message 2] — TENSOGRM ... 39277777
│   ├── Preamble (24B)
│   ├── Header Metadata Frame (CBOR GlobalMetadata)
│   ├── Header Index Frame (CBOR offsets)
│   ├── Data Object Frame 0 (payload + CBOR descriptor)
│   └── Data Object Frame 1 (payload + CBOR descriptor)
│   └── Postamble (16B)
└── ...

No file-level header, no file-level index. All indexing is per-message, built in-memory at scan time.

Remote Access (optional)

Enable the remote feature to open .tgm files on S3, GCS, Azure, or HTTP with selective range-based reads:

[dependencies]
tensogram = { path = "...", features = ["remote"] }
#![allow(unused)]
fn main() {
use tensogram::{TensogramFile, DecodeOptions};

let mut file = TensogramFile::open_source("s3://bucket/forecast.tgm")?;

// Fetch only the second object from message 0 — no full download
let (meta, desc, data) = file.decode_object(0, 1, &DecodeOptions::default())?;
}

Supports header-indexed and footer-indexed files (read-only) from Rust, Python, xarray, and zarr. See the Remote Access guide for storage options, request budgets, and limitations.

Memory-Mapped I/O (optional)

Enable the mmap feature to use memory-mapped file access:

[dependencies]
tensogram = { path = "...", features = ["mmap"] }
#![allow(unused)]
fn main() {
let mut file = TensogramFile::open_mmap("forecast.tgm")?;

// Scan happens during open_mmap — no lazy scan needed
let count = file.message_count()?;

// Reads from the memory-mapped region (no additional seek)
let raw = file.read_message(0)?;
}

This is useful for large files where you want to avoid per-message seek + read overhead. The file is mapped read-only. All existing decode functions work unchanged.

Async I/O (optional)

Enable the async feature for tokio-based non-blocking file operations:

[dependencies]
tensogram = { path = "...", features = ["async"] }
#![allow(unused)]
fn main() {
let mut file = TensogramFile::open_async("forecast.tgm").await?;

// Read a message without blocking the async runtime
let raw = file.read_message_async(0).await?;

// Decode also runs on a blocking thread (safe for FFI codecs)
let (meta, objects) = file.decode_message_async(0, &opts).await?;
}

All CPU-intensive work (scanning, decoding, FFI calls to compression libraries) runs via tokio::task::spawn_blocking, so it won’t block the async runtime.

Edge Cases

Appending to an Existing File

TensogramFile::create truncates. To append to an existing file, use standard file I/O:

#![allow(unused)]
fn main() {
use std::io::Write;
let mut f = std::fs::OpenOptions::new().append(true).open("forecast.tgm")?;

let global = GlobalMetadata { version: 2, ..Default::default() };
let message = encode(&global, &[(&desc, &data)], &EncodeOptions::default())?;
f.write_all(&message)?;
}

Or open the file with TensogramFile::open and use append() — the append method always writes at the end regardless of how the file was opened.

Corrupted Messages

The scanner skips corrupted messages and continues. A message is considered corrupted if:

  • The total_length field points to a location where 39277777 is not present
  • The header is truncated

The scanner recovers by advancing one byte and searching for the next TENSOGRM.
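In buffer terms, that recovery step amounts to a one-byte advance plus a forward search for the magic. A sketch, not the library's actual scanner:

```python
MAGIC = b"TENSOGRM"

def resync(buf: bytes, bad_pos: int) -> int:
    """Find the next candidate message start after a corrupted one.

    Returns -1 when no further TENSOGRM marker exists.
    """
    return buf.find(MAGIC, bad_pos + 1)

buf = b"TENSOGRMgarbage....TENSOGRMmore"
assert resync(buf, 0) == 19   # next marker after the corrupted message
assert resync(buf, 19) == -1  # no further markers
```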

Empty Files

message_count() returns 0 for an empty file. read_message(0) returns an error.

Remote Access

Enable the remote feature to open .tgm files on HTTP, S3, GCS, or Azure without downloading the whole file. Individual objects are fetched via targeted range requests.

[dependencies]
tensogram = { path = "...", features = ["remote"] }

Opening a Remote File

#![allow(unused)]
fn main() {
use tensogram::TensogramFile;

// Auto-detect: local path or remote URL
let mut file = TensogramFile::open_source("https://example.com/data.tgm")?;

// S3
let mut file = TensogramFile::open_source("s3://bucket/data.tgm")?;
}

open_source inspects the URL scheme and routes to the remote backend for s3://, s3a://, gs://, az://, azure://, http://, https://. Everything else is treated as a local path.

The Rust open() method is unchanged and always opens a local file. In Python, TensogramFile.open() auto-detects remote URLs.

You can also check whether a string is a remote URL without opening:

#![allow(unused)]
fn main() {
use tensogram::is_remote_url;

assert!(is_remote_url("s3://bucket/file.tgm"));
assert!(!is_remote_url("/local/path/file.tgm"));
}
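The routing decision is a plain scheme check. A Python equivalent of what open_source is described as doing (our own sketch, not the binding's implementation):

```python
from urllib.parse import urlparse

# The scheme list documented for open_source.
REMOTE_SCHEMES = {"s3", "s3a", "gs", "az", "azure", "http", "https"}

def is_remote_url(source: str) -> bool:
    # Anything without a recognised remote scheme is treated as a local path.
    return urlparse(source).scheme in REMOTE_SCHEMES

assert is_remote_url("s3://bucket/file.tgm")
assert is_remote_url("https://example.com/data.tgm")
assert not is_remote_url("/local/path/file.tgm")
assert not is_remote_url("relative/path.tgm")
```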

Storage Options (Credentials, Region, etc.)

Pass an explicit options map for fine-grained control:

#![allow(unused)]
fn main() {
use std::collections::BTreeMap;
use tensogram::TensogramFile;

let mut opts = BTreeMap::new();
opts.insert("aws_access_key_id".to_string(), "AKIA...".to_string());
opts.insert("aws_secret_access_key".to_string(), "...".to_string());
opts.insert("region".to_string(), "eu-west-1".to_string());

let mut file = TensogramFile::open_remote("s3://bucket/data.tgm", &opts)?;
}

When no options are passed, credentials are read from the environment (e.g. AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_DEFAULT_REGION, GOOGLE_APPLICATION_CREDENTIALS).

Python Usage

import tensogram

# Auto-detect remote URL
with tensogram.TensogramFile.open("s3://bucket/data.tgm") as f:
    meta = f.file_decode_metadata(0)
    result = f.file_decode_object(0, 0)
    data = result["data"]  # numpy array

# With explicit storage options
with tensogram.TensogramFile.open_remote(
    "s3://bucket/data.tgm",
    {"region": "eu-west-1"}
) as f:
    print(f.source())   # "s3://bucket/data.tgm"
    print(f.is_remote()) # True

xarray Usage

import xarray as xr

ds = xr.open_dataset(
    "s3://bucket/data.tgm",
    engine="tensogram",
    storage_options={"region": "eu-west-1"},
)

Supported Schemes

| Scheme | Backend | Notes |
|---|---|---|
| http://, https:// | HTTP | allow_http is set automatically for http:// |
| s3://, s3a:// | Amazon S3 | Env-based or explicit credentials |
| gs:// | Google Cloud Storage | Service account or env |
| az://, azure:// | Azure Blob Storage | MSI or env |

All backends are provided by the object_store crate.

Object-Level Access

Three methods provide selective access without downloading full messages:

#![allow(unused)]
fn main() {
use tensogram::DecodeOptions;

// Metadata only — triggers layout discovery on first call, then cached
let meta = file.decode_metadata(0)?;

// Descriptors — reads only the descriptor data needed for each object
let (meta, descriptors) = file.decode_descriptors(0)?;

// Single object by index — fetches only the target object frame
let (meta, desc, data) = file.decode_object(0, 2, &DecodeOptions::default())?;
}

These methods also work on local files, where they read the full message and decode the requested parts.

Request Budget

Header-indexed files (buffered writes)

| Phase | Operation | HTTP Requests |
|---|---|---|
| Open | open_source / open_remote | 1 HEAD + 1 GET (first preamble only, 24 B) |
| Next message | first data access to message i | 1 GET (preamble + layout combined) |
| Cached | decode_metadata(i) again | 0 (served from cache) |
| Object read | decode_object(i, j) | 1 GET per object (if layout already cached) |
| Descriptors | decode_descriptors(i) | 1–3 GETs per object (descriptor-only reads for large frames) |
| Message count | message_count() | 1 GET per undiscovered message (24 B each, preamble only) |

Footer-indexed files

| Phase | Operation | HTTP Requests |
|---|---|---|
| Open | open_source / open_remote | 1 HEAD + 1 GET (first preamble only, 24 B) |
| Next message | first data access to message i | 1 GET (preamble) + 1 GET (suffix) |
| Cached | decode_metadata(i) again | 0 (served from cache) |
| Object read | decode_object(i, j) | 1 GET per object (if layout already cached) |
| Descriptors | decode_descriptors(i) | 1–3 GETs per object |
| Message count | message_count() | 1 GET per undiscovered message (24 B each) |

Streaming files (total_length=0)

| Phase | Operation | HTTP Requests |
|---|---|---|
| Open | open_source / open_remote | 1 HEAD + 1 GET (preamble) + 1 GET (END_MAGIC check) |
| First access | decode_metadata(0) | 2 GETs (postamble + footer region) |
| Object read | decode_object(0, j) | 1 GET per object |
| Message count | message_count() | 0 (streaming is always the last message) |

Layout discovery is combined with message scanning for both header-indexed and footer-indexed messages — the library reads the preamble and layout in one GET (header-indexed) or two GETs (footer-indexed suffix read). message_count() uses a lean scan path (24 bytes per preamble). Streaming messages (total_length=0) must be the last message in a multi-message file.

How It Works (Header-Indexed Example)

sequenceDiagram
    participant App
    participant TensogramFile
    participant ObjectStore

    App->>TensogramFile: open_source("s3://bucket/file.tgm")
    TensogramFile->>ObjectStore: HEAD (get file size)
    TensogramFile->>ObjectStore: GET range 0..24 (preamble)
    Note right of TensogramFile: Discover message offsets

    App->>TensogramFile: decode_object(0, 2)
    TensogramFile->>ObjectStore: GET range 24..N (header chunk, up to 256KB)
    Note right of TensogramFile: First access: parse metadata + index, cache layout
    TensogramFile->>ObjectStore: GET range offset..offset+len (object frame 2)
    TensogramFile-->>App: (metadata, descriptor, decoded_bytes)

Checking if a File is Remote

#![allow(unused)]
fn main() {
use tensogram::TensogramFile;

let file = TensogramFile::open_source("s3://bucket/data.tgm")?;
assert!(file.is_remote());
println!("source: {}", file.source()); // "s3://bucket/data.tgm"
}

source() returns the original URL for remote files and the file path for local files.

Error Handling

Remote access can return different TensogramError variants depending on the failure:

| Error condition | Error type | When it happens |
|---|---|---|
| Invalid URL | Remote | open_source / open_remote with a malformed URL |
| Connection failure | Remote | Network unreachable, DNS failure, timeout |
| File not found | Remote | HTTP 404, S3 NoSuchKey |
| No valid messages | Remote | File contains no parseable messages |
| Unsupported layout | Remote | Message lacks both header-index and footer-index flags |
| Object index out of range | Object | decode_object(i, j) where j >= object_count |

All errors are returned as Result. The library avoids panics.

Shared Runtime

Remote I/O uses a process-wide shared tokio runtime (multi-thread, 2 workers) created on first use. All RemoteBackend instances share the same runtime, so TCP connection pools and DNS caches are reused across calls.

The sync bridge adapts to the calling context:

  • Not in a tokio runtime (Python, CLI): the shared runtime’s handle drives the future directly — no extra thread creation.
  • Inside a multi-thread tokio runtime (#[tokio::test], server handler): block_in_place tells tokio to spawn a replacement worker so the blocked thread doesn’t cause runtime starvation.
  • Inside a current-thread tokio runtime: falls back to a scoped thread, since block_in_place is not supported on single-threaded runtimes.

Async API

The async feature enables async methods for decode, read, and metadata extraction. These work for both local and remote files:

#![allow(unused)]
fn main() {
use tensogram::{TensogramFile, DecodeOptions};

// Async decode methods (feature = "async")
let meta = file.decode_metadata_async(0).await?;
let (meta, descs) = file.decode_descriptors_async(0).await?;
let (meta, desc, data) = file.decode_object_async(0, 0, &DecodeOptions::default()).await?;
let msg = file.read_message_async(0).await?;
}

When both remote and async features are enabled, async open methods are also available:

#![allow(unused)]
fn main() {
// Async open (auto-detects local vs remote) — requires remote + async
let mut file = TensogramFile::open_source_async("s3://bucket/data.tgm").await?;

// Async open with explicit storage options
let mut file = TensogramFile::open_remote_async(
    "s3://bucket/data.tgm",
    &opts,
).await?;
}

For remote backends, async methods directly await object store operations, bypassing the sync bridge entirely. For local backends, they use spawn_blocking for file I/O.

[dependencies]
tensogram = { path = "...", features = ["remote", "async"] }

Range Reads

TensogramFile::decode_range() supports partial object decoding for both local and remote files. It takes an object index and a list of (offset, count) element ranges, returning only the requested elements without decoding the entire object.

For remote files, it fetches the full object frame (via indexed access) then runs the range decode pipeline on the raw payload. This is most beneficial with szip-compressed objects that have szip_block_offsets, where only the compressed blocks covering the requested range are decompressed.

#![allow(unused)]
fn main() {
// Rust: decode elements 100..200 from object 0
let ranges = vec![(100, 100)];
let (desc, parts) = file.decode_range(0, 0, &ranges, &DecodeOptions::default())?;
}
# Python: decode elements 100..200 from object 0
arr = file.file_decode_range(0, 0, [(100, 100)], join=True)

The xarray backend uses file_decode_range automatically when slicing remote arrays that support partial decode (uncompressed or szip-compressed objects without shuffle filters).
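For an uncompressed payload, the element-range arithmetic behind decode_range reduces to offset × itemsize slicing. A NumPy sketch under that assumption (compressed paths add block-level decompression on top):

```python
import numpy as np

def decode_ranges(payload: bytes, dtype, ranges):
    """Extract (offset, count) element ranges from an uncompressed payload."""
    itemsize = np.dtype(dtype).itemsize
    return [
        np.frombuffer(payload, dtype=dtype, count=count, offset=offset * itemsize)
        for offset, count in ranges
    ]

# Decode elements 100..200 from a 1000-element float32 payload,
# mirroring the decode_range call shown above.
payload = np.arange(1000, dtype=np.float32).tobytes()
(part,) = decode_ranges(payload, np.float32, [(100, 100)])
assert part[0] == 100.0 and part[-1] == 199.0 and len(part) == 100
```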

Descriptor-Only Reads

decode_descriptors() fetches only the CBOR descriptor from each data object frame, not the full payload. For large objects (hundreds of MB), this avoids downloading the entire frame just to extract a few hundred bytes of metadata.

For frames smaller than 64 KB, the full frame is read in a single request (fewer round-trips). For larger frames, the library reads only the frame header (16 bytes), footer (12 bytes), and the CBOR descriptor region.
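The size-based decision can be sketched as a read planner. The 16 B, 12 B, and 64 KB figures come from the text above; the planner itself is illustrative, not the library's code:

```python
SMALL_FRAME_LIMIT = 64 * 1024  # frames below this are fetched whole
HEADER_LEN = 16
FOOTER_LEN = 12

def plan_descriptor_reads(frame_offset: int, frame_len: int):
    """Return the (offset, length) byte ranges to request for one frame."""
    if frame_len < SMALL_FRAME_LIMIT:
        return [(frame_offset, frame_len)]  # one round-trip beats three
    return [
        (frame_offset, HEADER_LEN),                          # frame header
        (frame_offset + frame_len - FOOTER_LEN, FOOTER_LEN), # frame footer
        # ...followed by a targeted read of the CBOR descriptor region,
        # whose location is derived from the header/footer fields.
    ]

assert plan_descriptor_reads(0, 1024) == [(0, 1024)]
big = 10 * 1024 * 1024
assert plan_descriptor_reads(0, big) == [(0, 16), (big - 12, 12)]
```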

Limitations

  • Streaming messages must be last. In multi-message files, streaming-encoded messages (total_length=0) must be the last message. The remote scanner assumes the streaming message extends to the end of the file.
  • Optimistic scan for buffered messages. Remote message scanning validates preamble magic and total_length plausibility but does not verify end-of-message markers for buffered messages. Streaming messages (total_length=0) do validate the END_MAGIC at EOF.
  • Read-only. Remote writes are not supported.
  • Header probe size. Layout discovery reads a single chunk of up to 256 KB from the header region. If the metadata or index frame does not fit in this chunk, decode_metadata() will error (it does not retry with a larger read).
  • HTTP server requirements. The remote HTTP server must support HEAD requests (for file size) and Range request headers (for partial reads).
  • read_message() and decode_message() download the full message even for remote files. Use decode_metadata(), decode_descriptors(), or decode_object() for selective access.
  • Zarr remote reads are lazy per-chunk. The zarr store fetches only metadata at open time; individual chunks are decoded on first access. Local files still use eager decode for lower latency.
  • Sequential async access. Async methods take &mut self, so a single file handle cannot serve concurrent async reads. Open separate handles for parallelism.

Iterators

Tensogram provides lazy iterator APIs for traversing messages and objects without loading everything into memory at once.

Hierarchy

graph TD
    F[File / Buffer] -->|messages| M1[Message 1]
    F -->|messages| M2[Message 2]
    F -->|messages| M3[Message N]
    M1 -->|objects| O1["(DataObjectDescriptor, Vec&lt;u8&gt;)"]
    M1 -->|objects| O2["(DataObjectDescriptor, Vec&lt;u8&gt;)"]
    O1 -->|access| D1["descriptor + data"]
    O2 -->|access| D2["descriptor + data"]

Rust API

Buffer message iterator

Iterate over messages in a &[u8] byte buffer. Zero-copy: yields slices pointing into the original buffer.

#![allow(unused)]
fn main() {
use tensogram::{messages, decode, DecodeOptions};

let buf: Vec<u8> = std::fs::read("multi.tgm")?;

for msg_bytes in messages(&buf) {
    let (meta, objects) = decode(msg_bytes, &DecodeOptions::default())?;
    println!("version={} objects={}", meta.version, objects.len());
}
}

The iterator calls scan() once on construction, then yields &[u8] slices in sequence. Garbage between valid messages is silently skipped.

MessageIter implements ExactSizeIterator, so .len() returns the remaining count at any point.

Object iterator

Iterate over the decoded objects (tensors) inside a single message. Each item is a (DataObjectDescriptor, Vec<u8>) tuple:

#![allow(unused)]
fn main() {
use tensogram::{objects, DecodeOptions};

for result in objects(&msg_bytes, DecodeOptions::default())? {
    let (descriptor, data) = result?;
    println!("shape={:?} dtype={} encoding={} bytes={}",
             descriptor.shape, descriptor.dtype, descriptor.encoding, data.len());
}
}

Each object is decoded through the full pipeline on demand — objects you don’t consume are never decoded.

For metadata-only access (no payload decode), use objects_metadata. This returns DataObjectDescriptors without decoding any payloads:

#![allow(unused)]
fn main() {
use tensogram::objects_metadata;

for desc in objects_metadata(&msg_bytes)? {
    println!("shape={:?} dtype={} byte_order={}", desc.shape, desc.dtype, desc.byte_order);
}
}

File iterator

Iterate over messages stored in a .tgm file with seek-based lazy I/O:

#![allow(unused)]
fn main() {
use tensogram::{TensogramFile, objects, DecodeOptions};

let mut file = TensogramFile::open("forecast.tgm")?;
for raw in file.iter()? {
    let raw = raw?;
    // Nested: iterate objects within this message
    for result in objects(&raw, DecodeOptions::default())? {
        let (desc, data) = result?;
        println!("{:?} {} {} bytes", desc.shape, desc.dtype, data.len());
    }
}
}

file.iter() scans the file once (if not already scanned), then returns a FileMessageIter that reads each message via seek + read. The iterator does not borrow the TensogramFile — it owns an open file handle and a copy of the message offsets.

C / C++ API

The C FFI uses an opaque-handle + next() pattern. Each iterator returns TGM_OK while items remain, and TGM_END_OF_ITER as an end sentinel.

Buffer iterator

tgm_buffer_iter_t *iter;
tgm_buffer_iter_create(buf, buf_len, &iter);

const uint8_t *msg_ptr;
size_t msg_len;
while (tgm_buffer_iter_next(iter, &msg_ptr, &msg_len) == TGM_OK) {
    // msg_ptr borrows from the original buffer
    tgm_message_t *msg;
    tgm_decode(msg_ptr, msg_len, 0, &msg);
    // ... use msg ...
    tgm_message_free(msg);
}
tgm_buffer_iter_free(iter);

Lifetime: the buffer must remain valid until tgm_buffer_iter_free.

File iterator

tgm_file_t *file;
tgm_file_open("data.tgm", &file);

tgm_file_iter_t *iter;
tgm_file_iter_create(file, &iter);

tgm_bytes_t raw;
while (tgm_file_iter_next(iter, &raw) == TGM_OK) {
    // raw.data is owned — free with tgm_bytes_free
    tgm_message_t *msg;
    tgm_decode(raw.data, raw.len, 0, &msg);
    // ... use msg ...
    tgm_message_free(msg);
    tgm_bytes_free(raw);
}
tgm_file_iter_free(iter);
tgm_file_close(file);

Object iterator

tgm_object_iter_t *iter;
tgm_object_iter_create(msg_ptr, msg_len, 0, &iter);

tgm_message_t *obj;
while (tgm_object_iter_next(iter, &obj) == TGM_OK) {
    uint64_t ndim = tgm_object_ndim(obj, 0);
    const uint64_t *shape = tgm_object_shape(obj, 0);
    // ... use shape, data ...
    tgm_message_free(obj);
}
tgm_object_iter_free(iter);

C++ API

The C++ wrapper (include/tensogram.hpp) provides RAII iterator classes that manage the underlying C handles automatically.

Buffer iterator

#include <tensogram.hpp>

auto buf = /* read file into std::vector<uint8_t> */;
tensogram::buffer_iterator iter(buf.data(), buf.size());

const std::uint8_t* msg_ptr;
std::size_t msg_len;
while (iter.next(msg_ptr, msg_len)) {
    auto msg = tensogram::decode(msg_ptr, msg_len);
    std::printf("version=%llu objects=%zu\n", msg.version(), msg.num_objects());
}

File iterator

auto f = tensogram::file::open("forecast.tgm");
tensogram::file_iterator iter(f);

std::vector<std::uint8_t> raw;
while (iter.next(raw)) {
    auto msg = tensogram::decode(raw.data(), raw.size());
    std::printf("objects=%zu\n", msg.num_objects());
}

Object iterator

tensogram::object_iterator iter(msg_ptr, msg_len);
tensogram::message obj = tensogram::decode(msg_ptr, msg_len); // placeholder for next()
while (iter.next(obj)) {
    auto o = obj.object(0);
    auto shape = o.shape();
    std::printf("dtype=%s shape=[%llu, %llu]\n",
                o.dtype_string().c_str(), shape[0], shape[1]);
}

Range-based for on message

auto msg = tensogram::decode(buf, len);
for (const auto& obj : msg) {
    std::printf("dtype=%s bytes=%zu\n",
                obj.dtype_string().c_str(), obj.data_size());
}

Python API

TensogramFile supports iteration, indexing, and slicing:

import tensogram

# Iterate all messages
with tensogram.TensogramFile.open("forecast.tgm") as f:
    for meta, objects in f:
        for desc, arr in objects:
            print(f"  shape={arr.shape}  dtype={desc.dtype}")

# Index and slice
with tensogram.TensogramFile.open("forecast.tgm") as f:
    meta, objects = f[0]        # first message
    meta, objects = f[-1]       # last message
    subset = f[10:20]           # range of messages
    every_5th = f[::5]          # strided access

# Buffer iteration
buf = open("data.tgm", "rb").read()
for meta, objects in tensogram.iter_messages(buf):
    desc, arr = objects[0]
    print(f"  shape={arr.shape}")

decode(), decode_message(), file iteration, and iter_messages() return Message namedtuples with .metadata and .objects fields. Tuple unpacking (meta, objects = msg) also works. TensogramFile supports len(f) and context manager (with).

Thread safety: iterators own independent file handles and buffer copies — no shared mutable state. Safe under free-threaded Python (PEP 703, no GIL).

Edge cases

| Scenario | Behavior |
|---|---|
| Empty buffer / file | Iterator yields zero items |
| Garbage between messages | Silently skipped by scanner |
| Truncated message at end | Skipped (not yielded) |
| Zero-object message | objects() returns empty iterator |
| I/O error during file iteration | FileMessageIter::next() yields Err(...) |

Python API

Tensogram provides native Python bindings via PyO3. All tensor data crosses the boundary as NumPy arrays.

Installation

# From PyPI (once published)
pip install tensogram

# From source
pip install maturin numpy
cd python/bindings && maturin develop

Quick Start

import numpy as np
import tensogram

# Encode a 2D temperature field
temps = np.random.randn(100, 200).astype(np.float32) + 273.15
meta = {"version": 2}
desc = {"type": "ntensor", "shape": [100, 200], "dtype": "float32"}

msg = tensogram.encode(meta, [(desc, temps)])

# Decode it back
meta, objects = tensogram.decode(msg)
desc, array = objects[0]
print(array.shape)  # (100, 200)

Encoding

Basic encoding

tensogram.encode() takes metadata, a list of (descriptor, array) pairs, and returns wire-format bytes:

msg = tensogram.encode(
    {"version": 2},
    [({"type": "ntensor", "shape": [3], "dtype": "float32"}, np.array([1, 2, 3], dtype=np.float32))],
    hash="xxh3",  # default; use None to skip hashing
)

Descriptor keys

Every object in a message is described by a dict. The three required keys define what the tensor looks like; the optional keys control how it is stored on the wire.

| Key | Required | Default | Description |
|---|---|---|---|
| "type" | yes | | Object type, e.g. "ntensor" |
| "shape" | yes | | Tensor dimensions, e.g. [100, 200] |
| "dtype" | yes | | Data type name (see Data Types) |
| "strides" | no | row-major | Element strides; computed automatically if omitted |
| "byte_order" | no | native | "little" or "big"; defaults to host byte order |
| "encoding" | no | "none" | Encoding stage — see below |
| "filter" | no | "none" | Filter stage — see below |
| "compression" | no | "none" | Compression stage — see below |

Any additional keys (e.g. "reference_value", "bits_per_value") are stored in the descriptor’s .params dict and passed through to the encoding pipeline.

The encoding pipeline

Each object passes through a three-stage pipeline before it is stored. You control each stage via descriptor keys:

raw bytes → encoding → filter → compression → wire payload

Encoding transforms the data representation:

| Value | What it does | Use case |
|---|---|---|
| "none" | Pass-through (default) | Exact values, integer data |
| "simple_packing" | Quantize floats to packed integers | Bounded-range scalar fields (GRIB-compatible) |

Filter rearranges bytes to improve compressibility:

| Value | What it does | Use case |
|---|---|---|
| "none" | Pass-through (default) | Most cases |
| "shuffle" | Byte-transpose by element width (requires "shuffle_element_size") | Improves lz4/zstd ratio on typed data |

Compression reduces the payload size:

| Value | Random access | Type | Use case |
|---|---|---|---|
| "none" | yes | | No compression |
| "zstd" | no | lossless | General-purpose, best ratio/speed tradeoff |
| "lz4" | no | lossless | Fastest decompression |
| "szip" | yes (RSI blocks) | lossless | Integer/packed data (CCSDS 121.0-B-3) |
| "blosc2" | yes (chunks) | lossless | Large tensors, multi-codec |
| "zfp" | yes (fixed-rate) | lossy | Floating-point arrays |
| "sz3" | no | lossy | Error-bounded scientific data |

Compression parameters are passed as extra descriptor keys. For example, zstd level:

desc = {
    "type": "ntensor", "shape": [1000], "dtype": "float32",
    "compression": "zstd", "zstd_level": 9,
}

For the full list of compressor parameters, see Compression.

Common pipeline combinations

# Lossless, fast decompression
desc = {"type": "ntensor", "shape": shape, "dtype": "float32",
        "compression": "lz4"}

# Lossless, best ratio (shuffle_element_size must match dtype byte width)
desc = {"type": "ntensor", "shape": shape, "dtype": "float32",
        "filter": "shuffle", "shuffle_element_size": 4, "compression": "zstd", "zstd_level": 12}

# Quantise a bounded-range float field to 16-bit packed ints, then compress
# (the same pipeline GRIB 2 uses for simple_packing + CCSDS).
# compute_packing_params expects a flat float64 array
values = data.astype(np.float64).ravel()
params = tensogram.compute_packing_params(values, bits_per_value=16, decimal_scale_factor=0)
desc = {"type": "ntensor", "shape": shape, "dtype": "float64",
        "encoding": "simple_packing", "compression": "zstd", **params}

# Lossy float compression with error bound (zfp operates on float64)
desc = {"type": "ntensor", "shape": shape, "dtype": "float64",
        "compression": "zfp", "zfp_mode": "fixed_accuracy", "zfp_tolerance": 0.01}

Invalid combinations: Some pipeline combinations are rejected at encode time — e.g. zfp + shuffle (ZFP operates on typed floats, not byte-shuffled data) or simple_packing + sz3 (both are encoding stages). See Compression — Invalid Combinations.
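The quantisation behind simple_packing follows the GRIB convention y = (R + x * 2**E) / 10**D, with reference value R, binary scale factor E, and decimal scale factor D. A round-trip sketch of that idea (the real compute_packing_params also picks negative binary scale factors to use the full bit range; this is not the library's implementation):

```python
import numpy as np

def pack(values, bits=16, decimal_scale=0):
    """GRIB-style simple packing sketch: store x = (y * 10**D - R) / 2**E
    as a `bits`-wide unsigned integer."""
    scaled = np.asarray(values, dtype=np.float64) * 10.0 ** decimal_scale
    ref = scaled.min()                      # reference value R
    e = 0                                   # binary scale factor E
    while (scaled.max() - ref) / 2.0 ** e > (1 << bits) - 1:
        e += 1                              # coarsen until the span fits
    packed = np.round((scaled - ref) / 2.0 ** e).astype(np.uint64)
    return packed, ref, e, decimal_scale

def unpack(packed, ref, e, decimal_scale):
    return (ref + packed.astype(np.float64) * 2.0 ** e) / 10.0 ** decimal_scale

vals = np.array([273.15, 280.0, 265.5])
p, r, e, d = pack(vals, bits=16, decimal_scale=2)
# With D=2 the values become integers after scaling, so the
# round-trip is exact here.
print(unpack(p, r, e, d))
```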

Multiple objects per message

A single message can contain multiple tensors, each with its own descriptor:

spectrum = np.random.randn(256).astype(np.float64)
mask = np.array([1, 0, 1, 1, 0], dtype=np.uint8)

msg = tensogram.encode(
    {"version": 2},
    [
        ({"type": "ntensor", "shape": [256], "dtype": "float64", "compression": "zstd"}, spectrum),
        ({"type": "ntensor", "shape": [5], "dtype": "uint8"}, mask),
    ],
)

Pre-encoded data

If you already have compressed/packed payloads (e.g. from another system), use tensogram.encode_pre_encoded() with the same interface. The library skips the encoding pipeline and writes the bytes as-is:

msg = tensogram.encode_pre_encoded(meta, [(desc, pre_compressed_bytes)])

See Pre-Encoded Data API for details.

Decoding

Full decode

meta, objects = tensogram.decode(msg)

Returns a Message namedtuple with .metadata and .objects. Tuple unpacking works directly.

By default, decoded arrays are in the caller’s native byte order — the library handles byte-swapping automatically. Pass native_byte_order=False to receive the raw wire byte order instead:

meta, objects = tensogram.decode(msg, native_byte_order=False)

Metadata

meta is a Metadata object:

meta.version     # int — always 2
meta.base        # list[dict] — per-object metadata (one entry per object)
meta.extra       # dict — message-level annotations (_extra_ in CBOR)
meta.reserved    # dict — library internals (_reserved_ in CBOR, read-only)
meta["key"]      # dict-style access (checks base entries, then extra)

To read metadata without decoding any payloads:

meta = tensogram.decode_metadata(msg)

To read metadata and descriptors (no payload decode):

meta, descriptors = tensogram.decode_descriptors(msg)
for desc in descriptors:
    print(desc.shape, desc.dtype, desc.compression)

Selective decode

Decode a single object without touching the others — O(1) seek via the binary header’s offset table:

meta, desc, array = tensogram.decode_object(msg, index=2)

Decode a sub-range of elements from one object (for compressors that support random access):

# Elements 100-149 and 300-324 from object 0
parts = tensogram.decode_range(msg, object_index=0, ranges=[(100, 50), (300, 25)])
# parts is a list of numpy arrays, one per range

# Or join into a single contiguous array
joined = tensogram.decode_range(msg, object_index=0, ranges=[(100, 50), (300, 25)], join=True)
# joined is a single flat numpy array of shape (75,)

decode_range works with uncompressed data, simple_packing, szip, blosc2, and zfp fixed-rate mode. It returns an error for stream compressors (zstd, lz4, sz3) and for the shuffle filter. See Decoding Data for details.
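For uncompressed data the range semantics are equivalent to slicing the flattened tensor, which can be sketched in plain NumPy (illustrative only; decode_range does this without decoding the rest of the object):

```python
import numpy as np

data = np.arange(1000, dtype=np.float32)   # stand-in for a decoded object
ranges = [(100, 50), (300, 25)]            # (offset, count) in flat element order

# Equivalent of decode_range(..., ranges=ranges):
parts = [data.ravel()[off:off + n] for off, n in ranges]

# Equivalent of join=True:
joined = np.concatenate(parts)
print(joined.shape)  # (75,)
```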

Scanning and iteration

To find message boundaries in a buffer without decoding:

offsets = tensogram.scan(buf)  # list of (offset, length) pairs

To iterate messages in a multi-message buffer:

for meta, objects in tensogram.iter_messages(buf):
    print(meta.version, len(objects))

Hash verification

meta, objects = tensogram.decode(msg, verify_hash=True)

Raises RuntimeError if any object’s payload hash doesn’t match. If the message was encoded without a hash (hash=None), verification is silently skipped.

File API

Writing

with tensogram.TensogramFile.create("forecast.tgm") as f:
    for step in range(24):
        data = model.run(step)
        desc = {"type": "ntensor", "shape": list(data.shape), "dtype": "float32",
                "compression": "zstd"}
        f.append({"version": 2, "base": [{"step": step}]}, [(desc, data)])

Each append encodes one message and writes it to the end of the file. Messages are independent and self-describing.

Reading

with tensogram.TensogramFile.open("forecast.tgm") as f:
    print(len(f))                    # message count

    meta, objects = f[0]             # index (supports negative indices)
    subset = f[1:10:2]              # slice → list[Message]

    for meta, objects in f:          # iterate all messages
        for desc, array in objects:
            print(desc.shape, array.dtype)

    raw = f.read_message(0)          # raw bytes for forwarding/caching

The first access triggers a streaming scan that records message offsets. After that, every read is an O(1) seek.

Streaming encoder

For building a message one object at a time in memory:

enc = tensogram.StreamingEncoder({"version": 2}, hash="xxh3")
for desc, data in objects:
    enc.write_object(desc, data)
msg = enc.finish()  # returns complete message as bytes

For pre-encoded payloads, use enc.write_object_pre_encoded(desc, raw_bytes).

Async API

AsyncTensogramFile provides the same operations as TensogramFile but as asyncio coroutines. A single handle supports truly concurrent operations with no per-handle mutex; internal caches are thread-safe.

Opening and decoding

import asyncio
import tensogram

async def main():
    f = await tensogram.AsyncTensogramFile.open("forecast.tgm")

    meta, objects = await f.decode_message(0)
    result = await f.file_decode_object(0, 0)
    print(result["data"].shape)

asyncio.run(main())

For remote files with credentials:

    f = await tensogram.AsyncTensogramFile.open_remote(
        "s3://bucket/data.tgm", {"region": "eu-west-1"}
    )

Concurrent decoding with asyncio.gather

Multiple decode calls run concurrently on a single handle:

    results = await asyncio.gather(
        f.file_decode_object(0, 0),
        f.file_decode_object(1, 0),
        f.file_decode_object(2, 0),
    )

Batch decoding from many messages at once

When you need the same data from many messages (for example, reading how a value at one grid point changes over 300 time steps), individual requests are slow because each one is a separate HTTP round-trip.

file_decode_range_batch collects the requested element ranges across messages and fetches the underlying data in a batched HTTP call. file_decode_object_batch does the same for full frames:

    indices = list(range(300))
    row, col, grid = 100, 200, 528
    offset = row * grid + col

    values = await f.file_decode_range_batch(indices, 0, [(offset, 1)], join=True)

    frames = await f.file_decode_object_batch(indices, 0)

For even more speed, split the work into chunks and run them concurrently:

    chunks = [indices[i::16] for i in range(16)]
    batch_results = await asyncio.gather(
        *[f.file_decode_range_batch(chunk, 0, [(offset, 1)], join=True)
          for chunk in chunks]
    )

The sync TensogramFile also has file_decode_range_batch and file_decode_object_batch with the same signatures. Both batch methods require a remote backend; calling them on a local file raises OSError.

Layout prefetching

Before running many concurrent decodes on a remote file, prefetch the internal layout metadata to avoid repeated discovery requests:

    count = await f.message_count()
    await f.prefetch_layouts(list(range(count)))

Context manager and iteration

    async with await tensogram.AsyncTensogramFile.open("data.tgm") as f:
        await f.message_count()   # required before async for or len(f)
        async for meta, objects in f:
            print(objects[0][1].shape)

Async iteration works on remote files (sync iteration does not). await f.message_count() must be called once before using async for or len(f), to discover the message count without blocking the event loop.

Other methods

    count = await f.message_count()
    raw = await f.read_message(0)
    all_raw = await f.messages()
    print(f.is_remote(), f.source())

Note: len(f) requires a prior await f.message_count() call. Without it, len(f) raises RuntimeError.

When to use async vs sync

| Scenario | Recommendation |
|---|---|
| Script, CLI, or notebook | TensogramFile (sync) |
| Inside an asyncio event loop | AsyncTensogramFile |
| xarray or zarr | Sync (those frameworks are synchronous) |
| Many concurrent remote reads | asyncio.gather on one AsyncTensogramFile |
| Same data from many messages | file_decode_range_batch or file_decode_object_batch |

Validation

Two functions check whether messages and files are well-formed without consuming the data. See also the CLI reference.

report = tensogram.validate(msg)
file_report = tensogram.validate_file("data.tgm")

Levels

| Level | Checks | hash_verified |
|---|---|---|
| "quick" | Structure only: magic bytes, frame layout, lengths | always False |
| "default" | + metadata (CBOR) + integrity (hash verification, decompression) | True only if hash succeeds and no errors |
| "checksum" | Hash verification only, structural warnings suppressed | True only if hash succeeds and no errors |
| "full" | + fidelity (full decode, decoded-size check, NaN/Inf scan) | True only if hash succeeds and no errors |

# Full validation with canonical CBOR key-order checking
report = tensogram.validate(msg, level="full", check_canonical=True)

Return values

validate() returns:

{
    "issues": [
        {
            "code": "hash_mismatch",   # stable snake_case string
            "level": "integrity",      # which validation level found it
            "severity": "error",       # "error" or "warning"
            "description": "...",      # human-readable message
            "object_index": 0,         # optional — which object
            "byte_offset": 1234,       # optional — position in buffer
        }
    ],
    "object_count": 1,
    "hash_verified": False,
}

validate_file() returns file-level issues plus per-message reports:

{
    "file_issues": [
        {"byte_offset": 100, "length": 19, "description": "trailing bytes after last message"}
    ],
    "messages": [
        {"issues": [], "object_count": 1, "hash_verified": True}
    ],
}

Interpreting results

report = tensogram.validate(msg)
if not report["issues"]:
    print(f"OK — {report['object_count']} objects, hash verified")
else:
    for issue in report["issues"]:
        print(f"[{issue['severity']}] {issue['code']}: {issue['description']}")

GRIB / NetCDF conversion

Three PyO3-bound helpers wrap tensogram-grib and tensogram-netcdf. They are always callable — when the Python wheel was built without the corresponding Cargo feature, each raises RuntimeError with a pointer to rebuild instructions.

You can probe availability at runtime:

import tensogram

if tensogram.__has_grib__:
    msgs = tensogram.convert_grib("forecast.grib2")

if tensogram.__has_netcdf__:
    msgs = tensogram.convert_netcdf("data.nc")

convert_grib(path, **options) -> list[bytes]

Convert a GRIB file (however many messages it contains) to Tensogram wire format. Returns one bytes object per output Tensogram message; join them or write them sequentially to produce a .tgm file.

msgs = tensogram.convert_grib(
    "forecast.grib2",
    grouping="merge_all",      # "merge_all" | "one_to_one"
    preserve_all_keys=False,   # lift every ecCodes namespace into base[i]["grib"]
    encoding="simple_packing", # "none" | "simple_packing"
    bits=16,                   # None -> defaults to 16; ignored for encoding="none"
    filter="none",             # "none" | "shuffle"
    compression="szip",        # "none" | "zstd" | "lz4" | "blosc2" | "szip"
    compression_level=None,    # applies to zstd / blosc2 (None = codec default)
    threads=0,                 # 0 = sequential; honours TENSOGRAM_THREADS env var
    hash="xxh3",               # "xxh3" | None
    # NaN / Inf handling — see docs/src/guide/nan-inf-handling.md
    allow_nan=False,           # False (default) rejects any NaN input
    allow_inf=False,           # False (default) rejects any ±Inf input
)
with open("forecast.tgm", "wb") as fh:
    for msg in msgs:
        fh.write(msg)

Pipeline defaults and edge cases:

  • bits=None with encoding="simple_packing" defaults to 16 bits.
  • bits outside 1..=64 falls back to encoding="none" and emits a warning to stderr rather than raising. Validate your inputs before calling if fail-fast is important.
  • Unknown compression / encoding names raise ValueError with the list of valid choices in the message.
  • Unknown grouping / split_by / hash values raise ValueError.
  • Missing input paths raise FileNotFoundError.
  • Building the wheel without the grib / netcdf feature causes the corresponding function to raise RuntimeError at call time with rebuild instructions.
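If fail-fast matters for the bits fallback above, a small guard along these lines (a hypothetical helper, not part of the Tensogram API) rejects out-of-range bit widths before the importer sees them:

```python
def check_bits(bits):
    """Raise instead of relying on the importer's stderr warning and
    silent fallback to encoding="none" for out-of-range bit widths."""
    if bits is not None and not 1 <= bits <= 64:
        raise ValueError(f"bits must be in 1..=64 or None, got {bits}")
    return bits

check_bits(16)    # ok
check_bits(None)  # ok — importer defaults to 16 for simple_packing
```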

Requires libeccodes at the OS level and the wheel built with --features grib (maturin develop --features grib). Official PyPI wheels do not currently include the grib feature — see Jupyter Notebook Walk-through.

convert_grib_buffer(buf, **options) -> list[bytes]

In-memory variant of convert_grib. Accepts any Python bytes-like object (bytes, bytearray, memoryview, numpy.uint8[:]). Useful when the GRIB bytes come from a byte-range HTTP fetch, a cache, or any other in-memory source — no filesystem staging needed.

import requests

# Byte-range download of a single GRIB message from data.ecmwf.int.
resp = requests.get(
    "https://data.ecmwf.int/forecasts/.../...grib2",
    headers={"Range": "bytes=74573515-75234113"},
)
msgs = tensogram.convert_grib_buffer(
    resp.content,
    encoding="simple_packing",
    bits=16,
    compression="szip",
    # See [NaN / Inf Handling](nan-inf-handling.md) for the
    # `allow_nan` / `allow_inf` opt-in if your data contains
    # non-finite values.
)

convert_grib and convert_grib_buffer produce bit-identical decoded payloads for the same input. The encoded bytes may differ — each call stamps a fresh timestamp and UUID into _reserved_.

convert_netcdf(path, **options) -> list[bytes]

Convert a NetCDF-3 or NetCDF-4 file to Tensogram. Packed variables (scale_factor / add_offset) are automatically unpacked to float64.

msgs = tensogram.convert_netcdf(
    "data.nc",
    split_by="file",           # "file" | "variable" | "record"
    cf=False,                  # lift 16 CF attributes into base[i]["cf"]
    encoding="none",
    bits=None,
    filter="none",
    compression="zstd",
    compression_level=3,
    threads=0,
    hash="xxh3",
    # NaN / Inf handling — see docs/src/guide/nan-inf-handling.md
    allow_nan=False,           # False (default) rejects any NaN input
    allow_inf=False,           # False (default) rejects any ±Inf input
)

Note on NaN and --encoding simple_packing. Since 0.17 the importer hard-fails on NaN or Inf in a variable targeted for simple_packing (previous behaviour: stderr warning + fallback to encoding="none"). If your NetCDF has _FillValue / missing_value fields unpacked to NaN, either stick with the default encoding="none" or pre-process the values. See the NetCDF Importer error-handling reference for the full contract.

Requires libnetcdf + libhdf5 at the OS level and the wheel built with --features netcdf.

Error Handling

| Exception | When |
|---|---|
| FileNotFoundError | convert_grib(path) / convert_netcdf(path) called with a non-existent path (subclass of OSError). |
| OSError | Other file I/O failures (permission denied, disk error, etc.). |
| ValueError | Invalid parameters; unknown dtype; NaN in simple packing; unknown validation level; invalid grouping / split_by / hash; unknown codec / bit width in the conversion pipeline; empty/non-GRIB input buffer; split_by="record" on a NetCDF without an unlimited dimension. |
| RuntimeError | Hash mismatch during decode(..., verify_hash=True); calling convert_grib / convert_grib_buffer / convert_netcdf on a wheel built without the feature; internal ecCodes / libnetcdf C-library failures that cannot be classified as caller-input errors. |
| KeyError | Missing metadata key via meta["key"]. |

Supported dtypes

| Category | Types |
|---|---|
| Floating point | float16, bfloat16, float32, float64 |
| Complex | complex64, complex128 |
| Signed integer | int8, int16, int32, int64 |
| Unsigned integer | uint8, uint16, uint32, uint64 |
| Special | bitmask |

bfloat16 is returned as ml_dtypes.bfloat16 when ml_dtypes is installed; otherwise it falls back to np.uint16.
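If ml_dtypes is unavailable and you need real values out of the np.uint16 fallback, widening the bit patterns yourself is straightforward, since bfloat16 is the top 16 bits of an IEEE float32. A sketch (not part of the Tensogram API):

```python
import numpy as np

def bf16_bits_to_float32(bits: np.ndarray) -> np.ndarray:
    """Widen raw bfloat16 bit patterns (stored as uint16) to float32.
    bfloat16 keeps the float32 sign, exponent, and top 7 mantissa bits,
    so shifting into the high half of a uint32 reconstructs the value."""
    return (bits.astype(np.uint32) << 16).view(np.float32)

bits = np.array([0x3F80, 0x4000], dtype=np.uint16)  # 1.0, 2.0 in bfloat16
print(bf16_bits_to_float32(bits))  # [1. 2.]
```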

See Data Types for byte widths and wire-format details.

Examples

See examples/python/ for complete working examples:

| Example | Topic |
|---|---|
| 01_encode_decode.py | Basic round-trip |
| 02_mars_metadata.py | Per-object metadata (ECMWF MARS vocabulary example) |
| 02b_generic_metadata.py | Per-object metadata using a generic application namespace |
| 03_simple_packing.py | Simple-packing encoding |
| 04_multi_object.py | Multi-object messages, selective decode |
| 05_file_api.py | Multi-message .tgm files |
| 06_hash_and_errors.py | Hash verification and error handling |
| 07_iterators.py | File iteration, indexing, slicing |
| 08_xarray_integration.py | Opening .tgm as xarray Datasets |
| 08_zarr_backend.py | Reading/writing through Zarr v3 |
| 09_dask_distributed.py | Dask distributed computing over 4-D tensors |
| 09_streaming_consumer.py | Streaming consumer pattern |
| 11_encode_pre_encoded.py | Pre-encoded data API |
| 12_convert_netcdf.py | NetCDF → Tensogram import via the Python API |
| 13_validate.py | Message and file validation |
| 15_async_operations.py | Async open, decode, and asyncio.gather |
| 17_convert_grib.py | GRIB → Tensogram import (file + in-memory buffer) |

For narrative walk-throughs with plots and explanations, see also examples/jupyter/*.ipynb — five journey notebooks covering quickstart/MARS, encoding pipeline fidelity, GRIB conversion, NetCDF conversion with xarray, and validation with multi-threaded encoding.

C++ API

Tensogram provides a header-only C++17 wrapper at cpp/include/tensogram.hpp. It delegates all work to the C FFI and adds RAII handle management, typed exceptions, and idiomatic C++ patterns.

Requirements

  • C++17 compiler (GCC 7+, Clang 5+, MSVC 19.14+)
  • Rust static library built via cargo build --release
  • CMake 3.16+ (recommended)

Build

cargo build --release
cmake -S cpp -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j

Quick Start

#include <tensogram.hpp>

// Encode
std::string meta_json = R"({"version": 2, "descriptors": [...]})";
std::vector<float> data(100 * 200, 0.0f);
auto encoded = tensogram::encode(
    meta_json,
    {{reinterpret_cast<const uint8_t*>(data.data()), data.size() * sizeof(float)}});

// Decode
auto msg = tensogram::decode(encoded.data(), encoded.size());
auto obj = msg.object(0);
const float* values = obj.data_as<float>();

RAII Classes

| Class | Wraps | Cleanup |
|---|---|---|
| message | tgm_message_t | tgm_message_free |
| metadata | tgm_metadata_t | tgm_metadata_free |
| file | tgm_file_t | tgm_file_close |
| buffer_iterator | tgm_buffer_iter_t | tgm_buffer_iter_free |
| file_iterator | tgm_file_iter_t | tgm_file_iter_free |
| object_iterator | tgm_object_iter_t | tgm_object_iter_free |
| streaming_encoder | tgm_streaming_encoder_t | tgm_streaming_encoder_free |

All classes are move-only (copy deleted). Handles are released automatically when the object goes out of scope.

Error Handling

C error codes are mapped to a typed exception hierarchy:

try {
    auto msg = tensogram::decode(buf, len);
} catch (const tensogram::framing_error& e) {
    // Invalid message framing
} catch (const tensogram::hash_mismatch_error& e) {
    // Payload integrity check failed
} catch (const tensogram::error& e) {
    // Any Tensogram error (base class)
    std::cerr << e.what() << " (code=" << e.code() << ")\n";
}

Validation

Two free functions validate messages and files, returning JSON strings:

// Validate a single message buffer (default level)
auto report = tensogram::validate(buf, len);

// Full validation with canonical CBOR check
auto full_report = tensogram::validate(buf, len, "full", /*check_canonical=*/true);

// Validate a .tgm file
auto file_report = tensogram::validate_file("data.tgm");
auto file_full   = tensogram::validate_file("data.tgm", "full");

Validation levels: "quick", "default", "checksum", "full".

The returned JSON contains issues, object_count, and hash_verified for single messages, or file_issues and messages for files. Parse with your preferred JSON library.

An invalid level string or a missing file throws tensogram::invalid_arg_error or tensogram::io_error respectively. Validation issues (corrupted data, hash mismatches) are reported in the JSON — they do not throw.

Iterators

See Iterators for buffer, file, and object iterator usage.

Examples

See examples/cpp/ for complete working examples covering encode/decode, metadata, file API, simple packing, and iterators.

TypeScript API

Tensogram ships @ecmwf/tensogram, a TypeScript package that wraps the WebAssembly build with typed, idiomatic helpers. Use it in any modern browser or Node ≥ 20.

Status: Scope B is complete. Typed encode / decode / scan, dtype dispatch, metadata helpers, progressive streaming decode, and the TensogramFile file / URL helper are all available. Scope C follow-ups (validate wrapper, encodePreEncoded, first-class float16 / bfloat16 / complex* types, npm publish pipeline) are tracked in plans/TYPESCRIPT_WRAPPER.md.

Installation

The package is not yet published to npm. Build it locally:

# First, build the WebAssembly blob from the Rust source
cd typescript
npm install
npm run build:wasm   # runs wasm-pack build -t web -d typescript/wasm
npm run build        # runs wasm-pack + tsc

Or use the top-level Makefile:

make ts-build        # build WASM + tsc
make ts-test         # vitest
make ts-typecheck    # strict tsc --noEmit on src + tests

Quick start

import {
  init, encode, decode,
  type DataObjectDescriptor,
  type GlobalMetadata,
} from '@ecmwf/tensogram';

// One-time WASM initialisation (idempotent)
await init();

// ── Encode ────────────────────────────────────────────────────────────
const temps = new Float32Array(100 * 200);
for (let i = 0; i < temps.length; i++) temps[i] = 273.15 + i / 100;

const meta: GlobalMetadata = { version: 2 };
const descriptor: DataObjectDescriptor = {
  type: 'ntensor',
  ndim: 2,
  shape: [100, 200],
  strides: [200, 1],
  dtype: 'float32',
  byte_order: 'little',
  encoding: 'none',
  filter: 'none',
  compression: 'none',
};

const msg: Uint8Array = encode(meta, [{ descriptor, data: temps }]);

// ── Decode ────────────────────────────────────────────────────────────
const { metadata, objects } = decode(msg);
const arr = objects[0].data();  // Float32Array (inferred from dtype)
console.log(arr.length);        // 20000

API surface

init(opts?)

Loads and instantiates the WASM blob. Must be awaited before any other function is called. Safe to call multiple times — subsequent calls reuse the same instance.

await init();                                              // defaults
await init({ wasmInput: new URL('...', import.meta.url) });  // custom location

encode(metadata, objects, opts?)

| Parameter | Type | Description |
|---|---|---|
| metadata | GlobalMetadata | Wire-format metadata; version: 2 is required |
| objects | Array<{ descriptor, data }> | Each data is a TypedArray or Uint8Array |
| opts.hash | 'xxh3' \| false | Hash algorithm. Default 'xxh3'. Pass false to disable. |

Returns: Uint8Array containing the complete wire-format message.

decode(buf, opts?)

| Parameter | Type | Description |
|---|---|---|
| buf | Uint8Array | Raw message bytes |
| opts.verifyHash | boolean | Default false. If true, throws HashMismatchError on corruption. |

Returns: { metadata: GlobalMetadata, objects: DecodedObject[], close() }.

decodeMetadata(buf)

Returns only the metadata; does not touch any payload bytes.

decodeObject(buf, index, opts?)

O(1) seek to object index, decoding only that object.

scan(buf)

Returns Array<{ offset: number; length: number }> for each Tensogram message found in a (potentially multi-message) buffer. Garbage between messages is silently skipped.

DecodedObject / DecodedFrame

interface DecodedObject {
  readonly descriptor: DataObjectDescriptor;
  /** Copy into the JS heap.  Safe across WASM memory growth. */
  data(): TypedArray;
  /** Zero-copy view.  Invalidated if WASM memory grows. */
  dataView(): TypedArray;
  readonly byteLength: number;
}

interface DecodedFrame extends /* structurally */ DecodedObject {
  /** The matching `base[i]` entry from the containing message. */
  readonly baseEntry: BaseEntry | null;
  close(): void;
}

The returned array type is picked from descriptor.dtype:

| dtype | Returned TypedArray |
|---|---|
| float32 | Float32Array |
| float64 | Float64Array |
| int8 | Int8Array |
| int16 | Int16Array |
| int32 | Int32Array |
| int64 | BigInt64Array |
| uint8 | Uint8Array |
| uint16 | Uint16Array |
| uint32 | Uint32Array |
| uint64 | BigUint64Array |
| float16 / bfloat16 | Uint16Array (no native half-precision in JS) |
| complex64 | Float32Array (interleaved real, imag) |
| complex128 | Float64Array (interleaved real, imag) |
| bitmask | Uint8Array (packed bits) |

getMetaKey(meta, path)

Dot-path lookup matching the Rust / Python / CLI first-match-across-base semantics: searches base[0], base[1], …, skipping the _reserved_ key in each, then falls back to _extra_.

getMetaKey(meta, 'mars.param')      // 'base[0].mars.param' first match
getMetaKey(meta, '_extra_.source')  // explicit _extra_ prefix

Returns undefined if the key is missing (never throws).

computeCommon(meta)

Mirror of tensogram::compute_common. Returns a Record<string, CborValue> of keys that are present with identical values in every entry of meta.base. Useful for display and merge operations.

Error classes

All errors thrown from this package are instances of the abstract TensogramError class. Eight concrete subclasses mirror the Rust TensogramError variants; two more, InvalidArgumentError and StreamingLimitError, are TS-layer additions:

import {
  TensogramError,
  FramingError,
  MetadataError,
  EncodingError,
  CompressionError,
  ObjectError,
  IoError,
  RemoteError,
  HashMismatchError,
  InvalidArgumentError,
  StreamingLimitError,
} from '@ecmwf/tensogram';

try {
  decode(corruptBuffer);
} catch (err) {
  if (err instanceof FramingError) {
    console.error('bad wire format:', err.message);
  } else if (err instanceof HashMismatchError) {
    console.error('integrity failure:', err.expected, err.actual);
  } else {
    throw err;
  }
}

Memory model

  • Safe-copy by default. object.data() / frame.data() always allocate a new TypedArray on the JS heap. It remains valid even after the underlying DecodedMessage / DecodedFrame is freed or WASM memory grows.
  • Zero-copy opt-in. object.dataView() / frame.dataView() return a view directly into WASM linear memory. It is invalidated the next time any WASM call grows linear memory — which can happen on the next encode() / decode(). Read the view immediately or copy it.
  • Explicit cleanup. DecodedMessage, DecodedFrame, and TensogramFile all expose .close() to release WASM-side memory. A FinalizationRegistry also calls .free() on the underlying WASM handle when the wrapper is garbage-collected, but explicit .close() is strongly recommended for deterministic cleanup.

Streaming decode

Use decodeStream(readable, opts?) to progressively decode a ReadableStream<Uint8Array>. Works against any stream source — fetch().body, a Node Readable.toWeb(), a Blob.stream(), or a hand-rolled ReadableStream.

import { decodeStream } from '@ecmwf/tensogram';

const res = await fetch('/data.tgm');
for await (const frame of decodeStream(res.body!)) {
  render(frame.descriptor.shape, frame.data());
  frame.close();
}

Options:

| Option | Type | Description |
|---|---|---|
| signal | AbortSignal | Cancels the iteration. The underlying reader is cancelled and the decoder is freed cleanly. |
| maxBufferBytes | number | Max size of the internal staging buffer. Default: 256 MiB. Exceeding this throws StreamingLimitError. |
| onError | (err: StreamDecodeError) => void | Called whenever a corrupt message is skipped. The iterator does not throw on skips — it keeps going. |

Key behaviours:

  • Chunk-boundary tolerant. A message can be split across any number of chunks. The decoder accumulates until a complete message is seen, then emits every object as a separate frame.
  • Corruption resilient. A single bad message is skipped; the iterator keeps going with subsequent messages. Pass onError to observe the skips.
  • Early break is safe. Breaking out of the for await loop runs the generator’s finally block, which releases the stream reader and frees the decoder.
  • AbortSignal cancels cleanly. Firing the signal cancels the underlying reader; the generator throws whatever error the signal carries.

File API

TensogramFile gives you random-access reads over a .tgm file, whether it lives on the local file system, behind an HTTPS URL, or already in memory.

import { TensogramFile } from '@ecmwf/tensogram';

// Node: from the local file system
const file = await TensogramFile.open('/data/input.tgm');

// Browser or Node: over HTTPS
const file = await TensogramFile.fromUrl('https://example.com/input.tgm');

// Any runtime: from pre-loaded bytes
const file = TensogramFile.fromBytes(uint8ArrayFromSomewhere);

All three factories produce an identical object:

interface TensogramFile extends AsyncIterable<DecodedMessage> {
  readonly messageCount: number;
  readonly byteLength: number;
  readonly source: 'local' | 'remote' | 'buffer';

  message(index: number, opts?: DecodeOptions): Promise<DecodedMessage>;
  messageMetadata(index: number): Promise<GlobalMetadata>;
  rawMessage(index: number): Uint8Array;

  [Symbol.asyncIterator](): AsyncIterator<DecodedMessage>;
  close(): void;
}

Usage:

const file = await TensogramFile.open('/data/input.tgm');
try {
  console.log(`${file.messageCount} messages, ${file.byteLength} bytes`);

  // Random access
  const first = await file.message(0);
  console.log(first.objects[0].descriptor.shape);
  first.close();

  // Async iteration
  for await (const msg of file) {
    // ...
    msg.close();
  }
} finally {
  file.close();
}

TensogramFile.open(path, opts?) (Node only)

Loads the file via node:fs/promises. The node:fs/promises import is dynamic so browser bundlers can tree-shake this code path.

| Option | Type | Description |
|---|---|---|
| signal | AbortSignal | Cancels the initial read. |

TensogramFile.fromUrl(url, opts?) (any fetch-capable runtime)

Downloads the file over HTTPS using the ambient globalThis.fetch.

| Option | Type | Description |
|---|---|---|
| fetch | typeof fetch | Override the fetch implementation (useful for tests and for browsers with a polyfill). |
| headers | HeadersInit | Extra request headers (auth, etc.). |
| signal | AbortSignal | Cancels the download. |

TensogramFile.fromBytes(bytes)

Wraps an already-loaded Uint8Array. The buffer is defensively copied, so later mutation of the caller’s buffer is invisible to the TensogramFile.

Range-based lazy access

Since Scope C, TensogramFile.fromUrl automatically probes the server for HTTP Range support. When the HEAD response advertises Accept-Ranges: bytes and a finite Content-Length, the file switches to a lazy backend:

  • The initial open issues a small HEAD + one 24-byte Range read per message preamble to build the boundary index. No payload data is downloaded.
  • rawMessage(i) / message(i) fetch just the requested message’s bytes via a Range: bytes=offset-(offset+length-1) GET.
  • A small LRU caches recently-fetched message bytes so repeat reads are free.

When the server omits Accept-Ranges, returns non-200 on HEAD, or the file uses streaming-mode messages (total_length=0 — the writer did not know the final length up front), the open falls back to a single eager GET. Behaviour is indistinguishable to callers except in memory use and timing.

Browser callers using fromUrl directly need CORS to expose the Accept-Ranges, Content-Range, and Content-Length headers.

Append (Node local file system)

TensogramFile#append(meta, objects, opts?) encodes the new message in-memory, appends it to the on-disk file, refreshes the position index, and makes the new message reachable via message(i) on the same handle. Only supported when the file was opened via TensogramFile.open(path); fromBytes- and fromUrl-backed files throw InvalidArgumentError, matching the contract in the other language bindings.

const file = await TensogramFile.open('/data/forecast.tgm');
try {
  await file.append({ version: 2 }, [{ descriptor, data }]);
  console.log(`now has ${file.messageCount} messages`);
} finally {
  file.close();
}

Scope-C API additions

Scope C brought the TypeScript wrapper to full API parity with Rust / Python / FFI / C++. The surface additions are:

| Function / class | What it does |
|---|---|
| decodeRange(buf, objIndex, ranges, opts?) | Partial sub-tensor decode. ranges is an array of [offset, count] pairs in element units; each returned parts[i] is a dtype-typed view. Option join: true concatenates every range into a single view. |
| computeHash(bytes, algo?) | Standalone xxh3 hash — matches the digest stamped by encode() on the same bytes. |
| simplePackingComputeParams(values, bits, decScale?) | GRIB-style simple-packing parameter computation. Return shape uses snake-case keys so the result spreads directly into a descriptor. |
| validate(buf, opts?) | Report-only validation (never throws on bad input). Modes: quick, default, checksum, full. |
| validateBuffer(buf, opts?) | Multi-message buffer: reports file-level gaps / trailing garbage plus per-message reports. |
| validateFile(path, opts?) | Node-only helper: reads the file via node:fs/promises then delegates to validateBuffer. |
| encodePreEncoded(meta, objects, opts?) | Wrap already-encoded bytes verbatim into a wire-format message. The library still validates descriptor structure and stamps a fresh hash. |
| StreamingEncoder | Frame-at-a-time construction. Two modes: buffered (default, finish() returns the complete Uint8Array) or streaming via opts.onBytes callback (bytes flow through the callback as they’re produced; finish() returns an empty Uint8Array). |
| TensogramFile#append | Append a new message to a file opened via TensogramFile.open(path). Node-only. |
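The parameter math behind simple packing can be sketched compactly. The following is an illustrative Python sketch of GRIB-style simple packing (pick a reference value and a binary scale factor so the scaled integers fit in the requested bit width) — it is not the wrapper's implementation, and the snake-case key names shown are assumptions:

```python
import math

def simple_packing_params(values, bits, dec_scale=0):
    """Sketch: packed = round((v * 10**D - R) / 2**E) must fit in `bits` bits."""
    scaled = [v * 10 ** dec_scale for v in values]
    ref = min(scaled)                       # reference value R
    spread = max(scaled) - ref
    if spread == 0:
        bin_scale = 0                       # constant field: everything packs to 0
    else:
        # Smallest E with spread / 2**E <= 2**bits - 1 (E may be negative)
        bin_scale = math.ceil(math.log2(spread / (2 ** bits - 1)))
    return {
        "reference_value": ref,             # hypothetical key names
        "binary_scale_factor": bin_scale,
        "decimal_scale_factor": dec_scale,
    }

simple_packing_params([0.0, 131070.0], bits=16)
# → {'reference_value': 0.0, 'binary_scale_factor': 1, 'decimal_scale_factor': 0}
```

Real implementations add edge-case handling (NaNs, precision limits on the binary scale) that this sketch omits.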

Streaming StreamingEncoder (no full-message buffering)

For browser uploads, WebSocket pushes, or any sink that needs bytes as soon as they are produced, pass an onBytes callback to the StreamingEncoder constructor:

const enc = new StreamingEncoder({ version: 2 }, {
  onBytes: (chunk) => uploadSocket.send(chunk),   // e.g. WebSocket.send
});
enc.writeObject(descriptor, new Float32Array([1, 2, 3]));
enc.finish();    // flushes footer; returns empty Uint8Array in streaming mode
enc.close();

Semantics:

  • The callback is invoked during construction (preamble + header metadata frame), during each writeObject / writeObjectPreEncoded (one data-object frame’s bytes, potentially across multiple invocations), and during finish() (footer frames + postamble).
  • Concatenating every chunk the callback sees (in order) yields a message byte-for-byte identical to what buffered mode would return. Tested via round-trip with decode().
  • The callback must be synchronous; Promise return values are silently discarded because the Rust/WASM writer contract is synchronous. Buffer internally first if you need async work.
  • Each chunk is JS-owned and fresh per invocation. Copy (new Uint8Array(chunk) or chunk.slice()) if you need to keep it past the next writeObject — the underlying ArrayBuffer is invalidated when WASM memory grows.
  • If the callback throws, the exception surfaces as an IoError on the next writeObject / finish. The encoder state is undefined after an error — call close() and start over.
  • enc.streaming (getter) reports whether an onBytes sink was supplied — useful for code that needs to branch on mode.

Parity note: the Rust core StreamingEncoder<W: Write> has always supported arbitrary sinks; the WASM/TS surface now exposes this capability to JS code. Python / FFI / C++ bindings remain buffered-only; extending them would follow the same JsCallbackWriter pattern with a language-specific sink abstraction and is tracked in plans/TYPESCRIPT_WRAPPER.md.

First-class half-precision and complex dtypes

Scope C also upgraded the dtype dispatch in typedArrayFor. obj.data() now returns a first-class view for dtypes JS does not have a native TypedArray for:

| Dtype | data() return type |
|---|---|
| float16 | Float16Array (native when available) or Float16Polyfill (TC39-accurate) |
| bfloat16 | Bfloat16Array — 1-8-7 layout, truncating-with-round-to-nearest-even narrowing |
| complex64 / complex128 | ComplexArray — .real(i), .imag(i), .get(i) → {re, im}, iteration |

All three classes expose .bits / .data for zero-copy access to the underlying raw storage if you need it.

const m = decode(buf);
const f16 = m.objects[0].data();           // Float16Array or polyfill
const asFloat32 = f16.toFloat32Array();    // widened copy
const bits = f16.bits;                      // raw binary16

const cplx = m.objects[1].data() as ComplexArray;
for (let i = 0; i < cplx.length; i++) {
  console.log(cplx.real(i), cplx.imag(i));
}

The polyfill is used automatically when the host runtime does not ship globalThis.Float16Array. hasNativeFloat16Array() and getFloat16ArrayCtor() expose the detection machinery for callers that want direct control.

Breaking change from Scope B: Before Scope C, obj.data() on float16 / bfloat16 returned a raw Uint16Array of bits, and complex dtypes returned an interleaved Float32Array / Float64Array. Consumers that relied on that shape can reach the same bytes via .bits (for f16/bf16) or .data (for complex).

The low-level bit-conversion helpers (halfBitsToFloat, floatToHalfBits, bfloat16BitsToFloat, floatToBfloat16Bits) and the isComplexDtype type-guard are internal and are not re-exported from @ecmwf/tensogram. Callers that need bit-level manipulation should grab the raw storage from a view’s .bits / .data accessor and do the conversion themselves, or import directly from @ecmwf/tensogram/float16, …/bfloat16, …/complex with the understanding that these module paths are not part of the stable API.

Examples

See examples/typescript/ in the repository for runnable scripts:

  • 01_encode_decode.ts — basic round-trip
  • 02_mars_metadata.ts — per-object metadata using the MARS vocabulary
  • 02b_generic_metadata.ts — per-object metadata using a generic application namespace
  • 03_multi_object.ts — multiple dtypes in one message
  • 04_decode_range.ts — partial sub-tensor decode
  • 05_streaming_fetch.ts — progressive decode over a ReadableStream
  • 06_file_api.ts — TensogramFile over Node fs, fetch, and in-memory bytes
  • 07_hash_and_errors.ts — hash verification and typed errors
  • 08_validate.ts — validate(buf) + validateFile(path)
  • 11_encode_pre_encoded.ts — wrap already-encoded bytes
  • 12_streaming_encoder.ts — frame-at-a-time encoder with pre-encoded objects
  • 13_range_access.ts — lazy TensogramFile.fromUrl over HTTP Range
  • 14_streaming_callback.ts — StreamingEncoder with onBytes callback sink

Run them with:

cd examples/typescript
npm install
npx tsx 01_encode_decode.ts     # or any other file

Design notes

See plans/TYPESCRIPT_WRAPPER.md for the full design document covering architecture, phases, test strategy, memory model, and open follow-ups.

Cross-language parity

This TypeScript package decodes the same golden .tgm files used by the Rust, Python, and C++ test suites. The committed files at rust/tensogram/tests/golden/*.tgm are decoded by each language’s test runner; any drift in wire-format semantics fails all four suites.

Specifically, typescript/tests/golden.test.ts decodes:

  • simple_f32.tgm — single-object Float32 round-trip
  • multi_object.tgm — mixed-dtype message (f32 / i64 / u8)
  • mars_metadata.tgm — MARS keys under base[0].mars
  • multi_message.tgm — two concatenated messages (via scan())
  • hash_xxh3.tgm — verifyHash success + tamper detection

typescript/tests/property.test.ts and the Scope-C dtype suites add fast-check property tests pinning:

  • mapTensogramError never throws for any finite-string input and always returns a TensogramError subclass;
  • encode → decode is bit-exact for random Float32 shapes across random application metadata;
  • decode on random byte input either succeeds with a structurally valid message or throws a typed TensogramError — never panics;
  • float32 → float16 → float32 round-trip stays within half-precision ulp for any random value in a reasonable magnitude band;
  • float32 → bfloat16 → float32 round-trip stays within bfloat16 ulp;
  • complex64 encode → decode preserves real(i) / imag(i) byte-for-byte across random shapes and values.

The CI typescript job rebuilds and runs every TS test on every PR.

Tensoscope

Tensoscope is an interactive web viewer for .tgm files. It runs entirely in the browser — no server-side component — by decoding data via the @ecmwf/tensogram WebAssembly package.

Quick start

Build the WASM package first, then start the dev server:

cd typescript && make ts-build
cd tensoscope && npm install && npm run dev

Open http://localhost:5173 in your browser, then drag-and-drop a .tgm file onto the page or paste a URL into the file open dialog.

Loading a file

Two modes are supported:

  • Local file — drag the .tgm file onto the drop zone, or click Open file.
  • Remote URL — paste an HTTP/HTTPS URL. The file is fetched in full before scanning. (HTTP Range support for lazy loading is planned.)

Once loaded, Tensoscope scans all messages and builds a field index without decoding any payloads.

Field browser

The left sidebar lists every decodable field in the file. Each entry shows:

  • Variable name (resolved from mars.param, name, or param metadata keys)
  • Shape and dtype

Click a field to decode it and render it on the map.

Map view

Fields with two spatial dimensions (latitude × longitude) are rendered as a coloured overlay on an interactive map. Regridding from the unstructured source grid onto the display pixel grid runs in a web worker so the UI stays responsive while large arrays are processed.

Projections

Switch between flat (Mercator, powered by MapLibre GL JS) and globe (3D sphere, powered by CesiumJS with OpenStreetMap base tiles) using the projection picker in the bottom-left of the map. Camera position is preserved when switching between the two renderers.

Render modes

A Heatmap / Contours toggle in the top-left of the map switches between two rendering styles:

  • Heatmap — smooth continuous gradient from the active colour scale. Pixel colours are interpolated linearly across the data range.
  • Contours — filled colour bands (like matplotlib.contourf). The data range is divided into N discrete bands where N is the number of colour steps in the active palette (default 10 for continuous palettes; stop count for custom palettes). Each band is rendered with a single solid colour.

Colour scale

The colour bar at the bottom of the map shows the current field range. Use the colour scale controls to:

  • Change the colour map (perceptually uniform maps from d3-scale-chromatic)
  • Lock or reset the min/max range

Animation

For files with a time or step dimension, the step slider appears below the map. Use play/pause to animate through steps at a fixed frame rate.

Docker deployment

cd tensoscope
make build          # build the container image
make run            # serve at http://localhost:8000
BASE_PATH=/scope make run   # serve under a subpath

The image uses nginx and accepts a BASE_PATH environment variable for subpath deployments behind a reverse proxy.

Known limitations

  • Only lat/lon grids are currently regridded; polar stereographic and other projections are not yet handled.
  • 3D fields (pressure levels) cannot yet be sliced via the level selector (the UI component exists but is not yet wired up).
  • HTTP Range-based lazy loading is not yet implemented; the full file is fetched before any field can be displayed.

xarray Integration

The tensogram-xarray package provides a read-only xarray backend engine for .tgm files. Once installed, you can open tensogram data with:

import xarray as xr
ds = xr.open_dataset("data.tgm", engine="tensogram")

This chapter explains the conversion philosophy and the mapping rules, then walks through progressively complex examples so you know exactly what to expect – and what to provide – when loading tensogram data into xarray.


Philosophy: Why Mapping is Needed

Tensogram and xarray have fundamentally different data models:

| Concept | Tensogram | xarray |
|---|---|---|
| Dimensions | Unnamed, positional (shape = [512, 512]) | Named ("x", "y", "latitude", "time") |
| Coordinates | Not built-in; application metadata | Arrays of values labelling each dimension |
| Variables | Data objects, indexed by position | Named DataArrays inside a Dataset |
| Attributes | CBOR maps at message and per-object level | Key-value dicts on Dataset and DataArray |

Tensogram is vocabulary-agnostic by design. The library never interprets metadata keys – it does not know what "mars.param", "bids.subject", or "product.name" means. xarray, on the other hand, requires named dimensions and coordinate arrays to enable its powerful label-based indexing and alignment.

The tensogram-xarray backend bridges this gap. It applies a set of rules to translate tensogram structure into xarray structure, and lets you override those rules when the defaults are not enough.

flowchart LR
    A["Tensogram Message"] --> B["tensogram-xarray"]
    B --> C["xr.Dataset"]
    D["User Mapping<br/>(optional)"] -.-> B
    E["Coordinate<br/>Auto-Detection"] -.-> B

The Mapping Pipeline

When you call xr.open_dataset("file.tgm", engine="tensogram"):

  1. Read metadata – only the CBOR metadata is parsed (no payload decode).
  2. Detect coordinates – data objects whose name or param matches a known coordinate name (latitude, longitude, time, …) become coordinate arrays.
  3. Name dimensions – if you provided dim_names, those are used. Otherwise, axes matching a detected coordinate use that coordinate’s name; remaining axes become dim_0, dim_1, …
  4. Name variables – if you provided variable_key, the value at that metadata path becomes the variable name. Otherwise object_0, object_1, …
  5. Wrap data lazily – each tensor is backed by a BackendArray that decodes on demand. No payload bytes are read until you access .values.

Example 1: Simplest Case – Single Object, No Metadata

Creating the file:

import numpy as np
import tensogram

data = np.arange(60, dtype=np.float32).reshape(6, 10)
meta = {"version": 2}
desc = {"type": "ntensor", "shape": [6, 10], "dtype": "float32",
        "byte_order": "little", "encoding": "none",
        "filter": "none", "compression": "none"}

with tensogram.TensogramFile.create("simple.tgm") as f:
    f.append(meta, [(desc, data)])

Opening in xarray:

>>> import xarray as xr
>>> ds = xr.open_dataset("simple.tgm", engine="tensogram")
>>> ds
<xarray.Dataset>
Dimensions:   (dim_0: 6, dim_1: 10)
Dimensions without coordinates: dim_0, dim_1
Data variables:
    object_0  (dim_0, dim_1) float32 ...
Attributes:
    tensogram_version:  2

The data object became a variable named object_0. Dimensions are auto-generated as dim_0, dim_1. No coordinates – tensogram has no information to generate them.

Adding dimension names:

>>> ds = xr.open_dataset("simple.tgm", engine="tensogram",
...                      dim_names=["latitude", "longitude"])
>>> ds["object_0"].dims
('latitude', 'longitude')

Example 2: Single Object with Coordinate Objects

When coordinate arrays are stored as separate data objects in the same message, the backend auto-detects them by name.

Creating the file:

lat = np.linspace(-90, 90, 5, dtype=np.float64)
lon = np.linspace(0, 360, 8, endpoint=False, dtype=np.float64)
temp = np.random.default_rng(42).random((5, 8)).astype(np.float32)

meta = {"version": 2, "base": [
    {"name": "latitude"},
    {"name": "longitude"},
    {"name": "temperature"},
]}

with tensogram.TensogramFile.create("with_coords.tgm") as f:
    f.append(meta, [
        ({"type": "ntensor", "shape": [5], "dtype": "float64", ...}, lat),
        ({"type": "ntensor", "shape": [8], "dtype": "float64", ...}, lon),
        ({"type": "ntensor", "shape": [5, 8], "dtype": "float32", ...}, temp),
    ])

Opening in xarray:

>>> ds = xr.open_dataset("with_coords.tgm", engine="tensogram")
>>> ds
<xarray.Dataset>
Dimensions:      (latitude: 5, longitude: 8)
Coordinates:
  * latitude     (latitude) float64 -90.0 -45.0 0.0 45.0 90.0
  * longitude    (longitude) float64 0.0 45.0 90.0 135.0 180.0 225.0 270.0 315.0
Data variables:
    temperature  (latitude, longitude) float32 ...
Attributes:
    tensogram_version:  2

How it works:

  • Objects with name: "latitude" and name: "longitude" match known coordinate names (case-insensitive).
  • They become coordinate arrays on the Dataset.
  • The temperature object’s shape (5, 8) matches the sizes of latitude (5) and longitude (8), so its dimensions are automatically resolved to ("latitude", "longitude").

Known Coordinate Names

The following names are recognized (case-insensitive):

| Name | Canonical dimension |
|---|---|
| lat, latitude | latitude |
| lon, longitude | longitude |
| x | x |
| y | y |
| time | time |
| level | level |
| pressure | pressure |
| height | height |
| depth | depth |
| frequency | frequency |
| step | step |

If no matching coordinate objects are found and no dim_names are provided, dimensions remain generic (dim_0, dim_1, …).
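This detection amounts to a case-insensitive alias lookup against the object's name (or param) metadata. A hypothetical Python sketch of the rule — not the backend's actual code, which also handles coordinate conflicts:

```python
COORD_ALIASES = {
    "lat": "latitude", "latitude": "latitude",
    "lon": "longitude", "longitude": "longitude",
    "x": "x", "y": "y",
    "time": "time", "level": "level", "pressure": "pressure",
    "height": "height", "depth": "depth",
    "frequency": "frequency", "step": "step",
}

def canonical_dim(obj_meta):
    # Check "name" first, then "param", case-insensitively.
    for key in ("name", "param"):
        value = obj_meta.get(key)
        if isinstance(value, str) and value.lower() in COORD_ALIASES:
            return COORD_ALIASES[value.lower()]
    return None

canonical_dim({"name": "Lat"})           # → 'latitude'
canonical_dim({"name": "temperature"})   # → None (not a coordinate object)
```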


Example 3: Multi-Object with variable_key

When a message contains multiple data objects, each with per-object metadata identifying the parameter, you can use variable_key to name the variables.

Creating the file:

t2m = np.ones((3, 4), dtype=np.float32) * 273.15
u10 = np.ones((3, 4), dtype=np.float32) * 5.0

meta = {"version": 2,
    "base": [
        {"mars": {"class": "od", "date": "20260401", "type": "fc", "param": "2t", "levtype": "sfc"}},
        {"mars": {"class": "od", "date": "20260401", "type": "fc", "param": "10u", "levtype": "sfc"}},
    ],
}

with tensogram.TensogramFile.create("mars.tgm") as f:
    f.append(meta, [
        ({"type": "ntensor", "shape": [3, 4], "dtype": "float32", ...}, t2m),
        ({"type": "ntensor", "shape": [3, 4], "dtype": "float32", ...}, u10),
    ])

Without variable_key:

>>> ds = xr.open_dataset("mars.tgm", engine="tensogram")
>>> list(ds.data_vars)
['object_0', 'object_1']

With variable_key:

>>> ds = xr.open_dataset("mars.tgm", engine="tensogram",
...                      variable_key="mars.param")
>>> list(ds.data_vars)
['2t', '10u']
>>> ds.attrs
{'tensogram_version': 2}

The variable_key supports dotted paths: "mars.param" navigates into the nested mars dict within each object’s metadata.
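A minimal sketch of that dotted-path lookup (a hypothetical helper, not the backend's actual function):

```python
def resolve_dotted(metadata, dotted):
    """Walk nested dicts one path segment at a time; None if a segment is missing."""
    node = metadata
    for part in dotted.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node

obj_meta = {"mars": {"class": "od", "param": "2t"}}
resolve_dotted(obj_meta, "mars.param")   # → '2t'
resolve_dotted(obj_meta, "mars.expver")  # → None
```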


Example 4: Multi-Message File with Auto-Merge

When a .tgm file contains many messages (one object each) that differ only in metadata, open_datasets() can stack them along outer dimensions.

Creating the file:

import tensogram_xarray

rng = np.random.default_rng(99)
with tensogram.TensogramFile.create("multi.tgm") as f:
    for param in ["2t", "10u"]:
        for date in ["20260401", "20260402"]:
            data = rng.random((3, 4), dtype=np.float32)
            meta = {"version": 2,
                    "base": [{"mars": {"param": param, "date": date}}]}
            desc = {"type": "ntensor", "shape": [3, 4], "dtype": "float32",
                    "byte_order": "little", "encoding": "none",
                    "filter": "none", "compression": "none"}
            f.append(meta, [(desc, data)])

Opening with open_datasets():

>>> datasets = tensogram_xarray.open_datasets(
...     "multi.tgm", variable_key="mars.param"
... )
>>> len(datasets)
1
>>> ds = datasets[0]
>>> list(ds.data_vars)
['2t', '10u']

What happened:

  1. The scanner read metadata from all 4 messages (no payload decode).
  2. Objects were grouped by structure: all have shape (3, 4) and float32.
  3. variable_key="mars.param" split by parameter: 2t (2 objects) and 10u (2 objects).
  4. Within each sub-group, mars.date varies across ["20260401", "20260402"], so it became an outer dimension.
  5. Each variable has shape (2, 3, 4) with a mars.date coordinate.

Example 5: Heterogeneous File – Auto-Split

When a file contains objects of different shapes or dtypes, they cannot be merged into a single Dataset. open_datasets() automatically splits them into compatible groups.

Creating the file:

with tensogram.TensogramFile.create("hetero.tgm") as f:
    # Message 0: 2D float32 temperature field
    f.append({"version": 2, "base": [{"name": "temp"}]},
             [({"type": "ntensor", "shape": [3, 4], "dtype": "float32", ...},
               np.ones((3, 4), dtype=np.float32))])

    # Message 1: 2D float32 wind field (same shape -- compatible)
    f.append({"version": 2, "base": [{"name": "wind"}]},
             [({"type": "ntensor", "shape": [3, 4], "dtype": "float32", ...},
               np.ones((3, 4), dtype=np.float32) * 2)])

    # Message 2: 1D int32 counts (different shape AND dtype -- incompatible)
    f.append({"version": 2, "base": [{"name": "counts"}]},
             [({"type": "ntensor", "shape": [5], "dtype": "int32", ...},
               np.array([1, 2, 3, 4, 5], dtype=np.int32))])

Opening:

>>> datasets = tensogram_xarray.open_datasets("hetero.tgm")
>>> len(datasets)
2
>>> datasets[0]  # The (3, 4) float32 group
<xarray.Dataset>
Dimensions:  (dim_0: 3, dim_1: 4)
Data variables:
    temp     (dim_0, dim_1) float32 ...
    wind     (dim_0, dim_1) float32 ...

>>> datasets[1]  # The (5,) int32 group
<xarray.Dataset>
Dimensions:  (dim_0: 5)
Data variables:
    counts   (dim_0) int32 ...

Objects that share (shape, dtype) are grouped together. Incompatible objects go to separate Datasets.
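The grouping rule can be sketched in a few lines. This is an illustrative Python sketch under the stated (shape, dtype) criterion; the real backend tracks more structure than this:

```python
from collections import defaultdict

def group_by_structure(descriptors):
    """Group object indices by (shape, dtype); one group per compatible set."""
    groups = defaultdict(list)
    for i, d in enumerate(descriptors):
        groups[(tuple(d["shape"]), d["dtype"])].append(i)
    return list(groups.values())

group_by_structure([
    {"shape": [3, 4], "dtype": "float32"},   # temp
    {"shape": [3, 4], "dtype": "float32"},   # wind
    {"shape": [5], "dtype": "int32"},        # counts
])  # → [[0, 1], [2]]
```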


Example 6: Providing Full User Mapping

For complete control, pass all mapping parameters:

ds = xr.open_dataset(
    "forecast.tgm",
    engine="tensogram",
    dim_names=["latitude", "longitude"],
    variable_key="mars.param",
    message_index=0,         # which message in a multi-message file
    verify_hash=True,        # verify xxh3 integrity on decode
)

| Parameter | Type | Effect |
|---|---|---|
| dim_names | list[str] | Names for the innermost tensor axes (positional) |
| variable_key | str | Dotted path in per-object metadata for variable naming |
| message_index | int | Which message to open (default 0) |
| merge_objects | bool | If True, calls open_datasets() and returns the first result |
| verify_hash | bool | Verify xxh3 hashes during decode |
| drop_variables | list[str] | Variables to exclude from the Dataset |
| range_threshold | float | Fraction of total elements below which partial reads are used (default 0.5) |

For multi-message files, use tensogram_xarray.open_datasets() directly:

import tensogram_xarray

datasets = tensogram_xarray.open_datasets(
    "forecast.tgm",
    dim_names=["latitude", "longitude"],
    variable_key="mars.param",
    verify_hash=True,
)

Example 7: Lazy Loading with Dask

Data is always loaded lazily. Opening a file only reads metadata – the tensor payloads are decoded on first access. This enables working with larger-than-memory files via dask.

# Open with dask chunking
ds = xr.open_dataset("large.tgm", engine="tensogram", chunks={})
print(ds["object_0"])
# <xarray.DataArray 'object_0' (dim_0: 10000, dim_1: 10000)>
# dask.array<...>

# Compute a mean without loading the full array
mean = ds["object_0"].mean().compute()

See also: Dask Integration for a complete walkthrough with distributed computation, performance tuning, and a runnable 4-D tensor example.

When Partial Reads Are Used

The backend inspects each data object’s encoding pipeline to determine whether partial reads via decode_range(join=False) are available:

| Compression | Filter | Partial Read? | Mechanism |
|---|---|---|---|
| none | none | Yes | Direct byte offset |
| szip | none | Yes | RSI block offset seeking |
| blosc2 | none | Yes | Independent chunk decompression |
| zfp (fixed_rate) | none | Yes | Fixed-size blocks, computable offsets |
| zfp (fixed_precision) | none | No | Variable-size blocks |
| zfp (fixed_accuracy) | none | No | Variable-size blocks |
| zstd | none | No | Stream compressor |
| lz4 | none | No | Stream compressor |
| sz3 | none | No | Stream compressor |
| Any | shuffle | No | Byte rearrangement breaks contiguous ranges |

When partial reads are available, slicing a lazy array decodes only the requested region:

ds = xr.open_dataset("szip_data.tgm", engine="tensogram")
# Only the bytes for rows 100-110 are decompressed:
subset = ds["object_0"][100:110, :].values

When partial reads are not available (stream compressors or shuffle filter), the full object is decoded and then sliced in memory. This is transparent to the user – the API is identical.

N-Dimensional Slice Mapping

When you slice a lazy xarray variable backed by tensogram, the backend must convert an N-dimensional slice into flat element ranges that decode_range() understands. Here is how the decomposition works:

  1. Find the split point – scan the slice dimensions from innermost to outermost and find the first (innermost) dimension whose slice does not cover the full axis. All dimensions inner to this point are contiguous in memory and form a single block per outer-index combination.

  2. Compute the contiguous block size – multiply the lengths of all dimensions inner to (and including) the split dimension’s slice width. This gives the number of elements in each flat range.

  3. Generate one range per outer-index combination – iterate over the Cartesian product of sliced indices in all dimensions outer to the split point. Each combination produces one (offset, count) pair.

  4. Merge adjacent ranges – if two consecutive ranges abut in the flat layout (i.e. offset_i + count_i == offset_{i+1}), they are merged into a single wider range to reduce I/O calls.

Concrete example: an array of shape (100, 200) sliced as [10:20, 50:100]:

  • The innermost dimension (axis 1) has slice 50:100 (width 50), which does not cover the full axis (200), so it is the split point.
  • Contiguous block size = 50 elements (just the inner slice width).
  • Outer indices: axis 0 slice 10:20 gives indices [10, 11, ..., 19] – 10 combinations.
  • This produces 10 flat ranges, each of 50 elements: (10*200+50, 50), (11*200+50, 50), …, (19*200+50, 50).
  • None are adjacent (gap of 150 elements between each), so no merging occurs.

If the slice were [10:20, :] instead (full inner axis), the split point moves to axis 0 and the 10 individual ranges of 200 elements each are adjacent in memory – they merge into a single range (10*200, 10*200).

flowchart TD
    A["N-D slice<br/>arr[10:20, 50:100]"] --> B["Find split point<br/>axis 1 (not full)"]
    B --> C["Block size = 50"]
    C --> D["Outer indices:<br/>axis 0 → [10..19]"]
    D --> E["10 ranges of 50 elements"]
    E --> F["Merge adjacent?<br/>No — gap of 150"]
    F --> G["decode_range()<br/>10 × (offset, 50)"]
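The four-step decomposition can be written as a small helper. The following is an illustrative Python sketch — assuming row-major layout and unit-step slices, not the backend's actual code:

```python
from itertools import product

def slices_to_ranges(shape, slices):
    """Decompose an N-D slice into flat (offset, count) element ranges,
    merging ranges that abut in the flattened row-major layout."""
    # Element stride of each axis in the flat layout
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]

    # 1. Split point: innermost axis whose slice does not cover the full axis
    split = 0
    for axis in range(len(shape) - 1, -1, -1):
        s = slices[axis]
        if (s.start or 0) != 0 or (s.stop or shape[axis]) != shape[axis]:
            split = axis
            break

    # 2. Contiguous block size: split axis's slice width times its stride
    s = slices[split]
    width = (s.stop or shape[split]) - (s.start or 0)
    block = width * strides[split]
    base = (s.start or 0) * strides[split]

    # 3. One range per combination of outer indices ...
    outer = [range(sl.start or 0, sl.stop or shape[ax])
             for ax, sl in enumerate(slices[:split])]
    ranges = []
    for combo in product(*outer):
        offset = base + sum(i * strides[ax] for ax, i in enumerate(combo))
        # 4. ... merged with the previous range when they abut
        if ranges and ranges[-1][0] + ranges[-1][1] == offset:
            ranges[-1] = (ranges[-1][0], ranges[-1][1] + block)
        else:
            ranges.append((offset, block))
    return ranges
```

Running it on the worked example above, `slices_to_ranges((100, 200), [slice(10, 20), slice(50, 100)])` yields the ten 50-element ranges, while the full-inner-axis variant collapses to the single merged range.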

Range Threshold Heuristic

Even when partial reads are technically available, reading many small ranges can be slower than decoding the entire array – especially for compressed data where decompression has fixed overhead per block.

The backend uses a ratio-based heuristic controlled by the range_threshold parameter (default 0.5):

Rule: partial reads are used only when the total number of requested elements is less than range_threshold × total_elements.

With the default of 0.5, if you request more than 50% of the array, the backend falls back to a full decode and slices in memory. Lower values make the backend more aggressive about using partial reads; higher values make it prefer full decodes.
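The decision itself is a one-line comparison; sketched here with hypothetical names:

```python
def use_partial_reads(requested_elements, total_elements, range_threshold=0.5):
    """Partial reads only when the request covers a small enough fraction."""
    return requested_elements < range_threshold * total_elements

use_partial_reads(10 * 50, 100 * 200)    # True: 500 of 20000 elements (2.5%)
use_partial_reads(60 * 200, 100 * 200)   # False: 60% exceeds the default 0.5
```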

# More aggressive partial reads (use when each range is cheap, e.g. uncompressed)
ds = xr.open_dataset("file.tgm", engine="tensogram", range_threshold=0.3)

# Almost always full decode (use when decode overhead is very low)
ds = xr.open_dataset("file.tgm", engine="tensogram", range_threshold=0.9)

Installation

uv venv .venv && source .venv/bin/activate   # if not already in a virtualenv
uv pip install tensogram-xarray

This pulls in tensogram and xarray as dependencies. The xarray backend is registered automatically via entry points – no extra configuration needed.

>>> import xarray as xr
>>> "tensogram" in xr.backends.list_engines()
True

For dask support:

source .venv/bin/activate   # if not already in the virtualenv
uv pip install "tensogram-xarray[dask]"

Error Handling

The backend reports errors with enough context for diagnosis. Common error scenarios and their messages:

| Scenario | Error type | Message includes |
|---|---|---|
| File not found | OSError | File path (from OS) |
| Negative message_index | ValueError | "message_index must be >= 0, got -1" |
| message_index out of range | ValueError | Index and file message count |
| dim_names length mismatch | ValueError | Actual vs expected count |
| Unsupported dtype | TypeError | "unsupported tensogram dtype 'foo'" |
| decode_range failure | Falls back to decode_object | Warning logged at DEBUG level with file, message, object, and cause |
| Incomplete hypercube in merge | ValueError | Which coordinate combination is missing |
| Silent data loss in merge | WARNING log | Variable name and count of dropped objects |
| Hash verification failure | ValueError | Object index and expected/actual hash |
| Conflicting coordinate objects | ValueError | Dimension name and mismatch details |

Hash Verification and Partial Reads

When verify_hash=True is passed, xxh3 hash verification is performed on full object reads (decode_object) only. Partial reads via decode_range() intentionally skip hash verification because:

  • Partial reads decode only a subset of the payload, so the full-object hash cannot be validated.
  • The purpose of partial reads is to minimise I/O; verifying the hash would require reading the entire payload, defeating the optimisation.

This means that for lazily-loaded arrays, hash verification happens when a slice triggers a full-object decode (i.e. when the requested fraction exceeds range_threshold), but not when partial decode_range() is used.

Logging

The backend uses Python’s standard logging module. To see partial-read fallback diagnostics:

import logging
logging.getLogger("tensogram_xarray").setLevel(logging.DEBUG)

To see merge data-loss warnings (enabled by default at WARNING level):

import logging
logging.basicConfig(level=logging.WARNING)

Dask Integration

Tensogram supports Dask natively through its xarray backend. When you open a .tgm file with chunks={}, xarray wraps every tensor variable in a dask.array.Array. No data is read from disk until you call .compute() or .values.

import xarray as xr
ds = xr.open_dataset("forecast.tgm", engine="tensogram", chunks={})
# ds["temperature"].data is now a dask.array -- zero I/O so far
mean = ds["temperature"].mean().compute()  # data decoded here

This chapter explains how the integration works, walks through a complete example with distributed computation, and covers the performance knobs you can tune.


How It Works

The tensogram xarray backend implements BackendArray, xarray’s lazy-loading protocol. When dask requests a chunk, the backend:

  1. Opens the .tgm file and reads the raw message bytes.
  2. For small slices on compressors that support random access (none, szip, blosc2, zfp fixed-rate): maps the N-D slice to flat byte ranges and decodes only those ranges via decode_range().
  3. For large slices or stream compressors: falls back to full decode_object() and slices in memory.
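The slice-to-byte-range mapping in step 2 can be sketched for a C-contiguous (row-major) tensor. This is an illustration of the idea, not the backend's actual code; one byte range is emitted per contiguous run along the innermost axis:

```python
from itertools import product

def slice_to_byte_ranges(shape, itemsize, starts, stops):
    """Map an N-D slice of a row-major tensor to flat (start, end) byte
    ranges. starts/stops are half-open bounds per dimension. Sketch only."""
    # element stride of each axis in row-major order
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    run = stops[-1] - starts[-1]  # contiguous elements per innermost run
    ranges = []
    outer = [range(a, b) for a, b in zip(starts[:-1], stops[:-1])]
    for idx in product(*outer):
        offset = sum(i * s for i, s in zip(idx, strides[:-1])) + starts[-1]
        ranges.append((offset * itemsize, (offset + run) * itemsize))
    return ranges

# Rows 9..27 of a (36, 72) float32 field: 18 ranges of 72 * 4 = 288 bytes each
ranges = slice_to_byte_ranges((36, 72), 4, (9, 0), (27, 72))
```

Full-row slices like this produce ranges that are adjacent on disk; a real implementation would typically coalesce them into a single read.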

The BackendArray stores only the file path (no open handles), making it pickle-safe for dask multiprocessing and distributed execution.
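The pickle-safety pattern can be illustrated with a minimal stand-in class (hypothetical, not the real TensogramBackendArray): keep only picklable state such as the file path, drop the lock from the pickle payload, and recreate it on the worker:

```python
import pickle
import threading

class LazyArraySketch:
    """Sketch of the pickle-safe lazy-array pattern described above."""
    def __init__(self, path):
        self.path = path                   # plain string: picklable
        self._lock = threading.Lock()      # not picklable: excluded below

    def __getstate__(self):
        state = self.__dict__.copy()
        del state["_lock"]                 # drop the unpicklable lock
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._lock = threading.Lock()      # recreated on deserialisation

arr = LazyArraySketch("forecast.tgm")
clone = pickle.loads(pickle.dumps(arr))    # survives the round trip
```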

flowchart LR
    A["xr.open_dataset<br/>chunks={}"] --> B["BackendArray<br/>(lazy, pickle-safe)"]
    B --> C["dask.array.Array"]
    C -->|".compute()"| D["decode_range()<br/>or decode_object()"]
    D --> E["numpy.ndarray"]

Chunking Strategies

| chunks value | Behaviour |
|---|---|
| {} | Automatic: one chunk per tensor object (most common) |
| {"latitude": 100} | Split along latitude every 100 elements |
| {"latitude": 100, "longitude": 200} | Split along both axes |

For tensogram files, chunks={} is usually the right choice because each data object is already a self-contained tensor. Finer chunking adds overhead from repeated file opens.


Complete Example: Distributed Statistics over 4-D Tensors

This walkthrough corresponds to examples/python/09_dask_distributed.py. It creates 4 .tgm files representing a 4-D temperature field (time x level x latitude x longitude), then computes statistics entirely through dask’s lazy execution.

Step 1: Create the Data Files

Each file contains 10 data objects (one per pressure level) plus latitude and longitude coordinate arrays:

import numpy as np
import tensogram

def _desc(shape, dtype="float32", **extra):
    return {
        "type": "ntensor", "shape": list(shape), "dtype": dtype,
        "byte_order": "little", "encoding": "none",
        "filter": "none", "compression": "none", **extra,
    }

LEVEL_VALUES = [1000, 925, 850, 700, 500, 400, 300, 200, 100, 50]
NLAT, NLON = 36, 72

with tensogram.TensogramFile.create("temperature_20260401.tgm") as f:
    lat = np.linspace(-87.5, 87.5, NLAT, dtype=np.float64)
    lon = np.linspace(0, 355, NLON, dtype=np.float64)

    objects = [
        (_desc([NLAT], dtype="float64", name="latitude"), lat),
        (_desc([NLON], dtype="float64", name="longitude"), lon),
    ]
    rng = np.random.default_rng(42)  # one generator, so each level differs
    for level_hpa in LEVEL_VALUES:
        field = rng.random((NLAT, NLON)).astype(np.float32)
        desc = _desc([NLAT, NLON], name=f"temperature_{level_hpa}hPa")
        objects.append((desc, field))

    f.append({"version": 2}, objects)

Step 2: Open with Dask Lazy Loading

The critical parameters are engine="tensogram" and chunks={}:

import xarray as xr
import tensogram_xarray  # registers the engine

ds = xr.open_dataset(
    "temperature_20260401.tgm",
    engine="tensogram",
    variable_key="name",  # name variables from descriptor "name" field
    chunks={},             # enable dask lazy loading
)

At this point:

  • No tensor data has been decoded. Only CBOR metadata was read.
  • Each variable is a dask.array.Array:
>>> type(ds["temperature_1000hPa"].data)
<class 'dask.array.core.Array'>

>>> ds["temperature_1000hPa"].shape
(36, 72)

>>> ds["temperature_1000hPa"].chunks
((36,), (72,))

Step 3: Build a 4-D Tensor from Multiple Files

Stack variables across levels within each file, then stack files across time:

import dask
import dask.array as da

# Open all 4 files
paths = ["temperature_20260401.tgm", "temperature_20260402.tgm",
         "temperature_20260403.tgm", "temperature_20260404.tgm"]

datasets = [
    xr.open_dataset(p, engine="tensogram", variable_key="name", chunks={})
    for p in paths
]

# Stack levels within each file, then stack across time
# Build in LEVEL_VALUES order (not alphabetical) so axis matches labels
temp_vars = [f"temperature_{lev}hPa" for lev in LEVEL_VALUES]

all_timesteps = []
for ds in datasets:
    level_arrays = [ds[v].data for v in temp_vars]
    all_timesteps.append(da.stack(level_arrays, axis=0))

full_4d = da.stack(all_timesteps, axis=0)
# Shape: (4, 10, 36, 72) -- (time, level, lat, lon)
# Still lazy -- zero I/O

Step 4: Compute Statistics with Dask

Schedule multiple computations, then execute them in a single dask.compute() call:

# Schedule (lazy -- no computation yet)
global_mean = full_4d.mean()
global_std  = full_4d.std()
global_min  = full_4d.min()
global_max  = full_4d.max()

# Execute all at once (data decoded from .tgm files here)
mean_val, std_val, min_val, max_val = dask.compute(
    global_mean, global_std, global_min, global_max
)

print(f"Mean: {mean_val:.2f} K")
print(f"Std:  {std_val:.2f} K")
print(f"Min:  {min_val:.2f} K")
print(f"Max:  {max_val:.2f} K")

Step 5: Selective Lazy Loading

Only the data you touch is decoded. Slicing the 4-D array triggers decoding of just the relevant chunks:

# Single point: backend uses decode_range() for the tiny slice
# (1 element out of 2592 = 0.04%, well below the 50% threshold)
point = full_4d[0, 0, 18, 0].compute()

# One pressure level across all times: touches 4 backing arrays
level_400 = full_4d[:, 5, :, :].mean().compute()

# Equatorial band: partial range decode for the selected rows
equatorial = full_4d[0, 0, 9:27, :].mean().compute()

Performance Tuning

The range_threshold Parameter

When dask requests a slice, the backend decides between partial decode (decode_range()) and full decode (decode_object()) based on the fraction of requested elements:

Rule: partial reads are used when requested_elements / total_elements <= range_threshold

| range_threshold | Behaviour |
|---|---|
| 0.3 | Aggressive partial reads (good for uncompressed data) |
| 0.5 (default) | Balanced: partial below 50%, full above |
| 0.9 | Almost always full decode (good for fast decompressors) |

# More aggressive partial reads
ds = xr.open_dataset("file.tgm", engine="tensogram",
                     chunks={}, range_threshold=0.3)

# Almost always full decode
ds = xr.open_dataset("file.tgm", engine="tensogram",
                     chunks={}, range_threshold=0.9)

Which Compressors Support Partial Reads?

| Compression | Partial Read? | Notes |
|---|---|---|
| none | Yes | Direct byte offset |
| szip | Yes | RSI block seeking |
| blosc2 | Yes | Independent chunk decompression |
| zfp (fixed_rate) | Yes | Fixed-size blocks |
| zfp (other modes) | No | Variable-size blocks |
| zstd | No | Stream compressor |
| lz4 | No | Stream compressor |
| sz3 | No | Stream compressor |

The shuffle filter also disables partial reads (byte rearrangement breaks contiguous ranges). The fallback is always transparent: the full object is decoded and sliced in memory.

Dask Scheduler Choice

Tensogram’s backend is thread-safe (uses a threading.Lock per array). All three dask schedulers work:

# Synchronous (debugging)
dask.config.set(scheduler="synchronous")

# Threaded (default, good for I/O-bound work)
dask.config.set(scheduler="threads")

# Multiprocessing (BackendArray is pickle-safe)
dask.config.set(scheduler="processes")

For large-scale work, dask.distributed also works because the BackendArray stores only the file path (no unpicklable state).


Thread Safety

The TensogramBackendArray uses a per-array threading.Lock to serialise file I/O. This means:

  • Multiple dask tasks can read different variables concurrently.
  • Reads to the same variable are serialised (no concurrent file opens for the same array).
  • The lock is excluded from pickle state and recreated on deserialise.

Installation

For dask support, install the optional dependency:

uv venv .venv && source .venv/bin/activate   # if not already in a virtualenv
uv pip install "tensogram-xarray[dask]"

This pulls in dask[array] alongside tensogram and xarray.


Debugging

Enable debug logging to see when partial reads are used vs full decodes:

import logging
logging.getLogger("tensogram_xarray").setLevel(logging.DEBUG)

You will see messages like:

DEBUG:tensogram_xarray.array:decode_range failed for forecast.tgm msg=0 obj=2,
    falling back to full decode: RangeNotSupported

This is expected for stream compressors and is not an error.


Error Handling

When Errors Are Raised

| When | What | Error type |
|---|---|---|
| open_dataset() | File not found | OSError with file path |
| open_dataset() | message_index negative | ValueError with index |
| open_dataset() | message_index out of range | ValueError with index and count |
| open_dataset() | dim_names length mismatch | ValueError with actual vs expected |
| open_dataset() | Unsupported dtype | TypeError with dtype name |
| .compute() | Decode failure | ValueError or RuntimeError from tensogram |
| .compute() | Hash mismatch (with verify_hash=True) | ValueError with object index |
| .compute() | File moved/deleted after open | OSError from OS |

Key design point: errors in metadata (file not found, bad index, wrong dim_names) surface immediately at open_dataset() time. Errors in data decoding surface at .compute() time because payloads are lazy-loaded.

Partial Read Fallback

When decode_range() fails (e.g. unsupported compressor for partial reads), the backend catches the error and falls back to full decode_object():

except (ValueError, RuntimeError, OSError) as exc:
    logger.debug("decode_range failed ... falling back to full decode: %s", exc)

This fallback is transparent — the user gets correct data regardless. Enable DEBUG logging to see when fallbacks occur.

Dask Worker Errors

File paths are automatically resolved to absolute paths when the dataset is opened. This prevents “file not found” errors when dask sends work to processes with a different working directory.

If a dask worker encounters a decode error, it propagates through dask’s error handling. The traceback will show the tensogram error with file path, message index, and object index for diagnosis.


Edge Cases

Ambiguous Dimension Matching

When coordinate arrays have the same size (e.g. both latitude and longitude have 360 elements), the backend cannot distinguish them by shape alone. The first match gets the coordinate name; the second falls back to a generic dim_N.

Workaround: pass explicit dim_names to disambiguate:

ds = xr.open_dataset("file.tgm", engine="tensogram",
                     dim_names=["latitude", "longitude"], chunks={})

Stacking Files with Different Variables

When stacking multiple .tgm files into a single dask array, verify that every dataset contains the expected variables before stacking:

temp_vars = [f"temperature_{lev}hPa" for lev in LEVEL_VALUES]
for i, ds in enumerate(datasets):
    missing = [v for v in temp_vars if v not in ds.data_vars]
    if missing:
        raise KeyError(f"Dataset {i} missing: {missing}")

Otherwise da.stack() will fail with a confusing KeyError from a deep dask callback.

Zero-Object Messages

A .tgm file containing only metadata frames (no data objects) returns an empty xr.Dataset with no variables. This is valid and does not raise an error.

Scalar (0-D) Tensors

Data objects with shape=() (zero dimensions) are supported. They become scalar xr.Variable objects in the dataset.

Hash Verification with Partial Reads

When verify_hash=True is set, hash verification only runs on full object reads (via decode_object()). Partial reads via decode_range() skip verification because only a subset of the payload is decoded. This means:

  • Large slices (above range_threshold) trigger full decode with hash verification.
  • Small slices use decode_range() without hash verification.

This is by design. If you need guaranteed hash verification on every access, set range_threshold=0.0 to force full decodes.

Zarr v3 Backend

The tensogram-zarr package implements a Zarr v3 Store backed by .tgm files. This lets you read and write Tensogram data through the standard Zarr Python API.

Installation

uv venv .venv && source .venv/bin/activate   # if not already in a virtualenv
uv pip install tensogram-zarr

Requires zarr >= 3.0, tensogram, and numpy.

Reading a .tgm file through Zarr

import zarr
from tensogram_zarr import TensogramStore

# Open existing .tgm file as a read-only Zarr store
store = TensogramStore.open_tgm("data.tgm")
root = zarr.open_group(store=store, mode="r")

# Browse available arrays
for name, arr in root.members():
    print(f"{name}: shape={arr.shape}, dtype={arr.dtype}")

# Read an array (decoded eagerly at store open, served from memory)
temperature = root["2t"][:]
print(temperature.shape, temperature.mean())

# Access group-level metadata (from GlobalMetadata _extra_)
# The example below shows a MARS namespace; the attributes dict reflects
# whatever namespaces the producer put in the message's GlobalMetadata.
print(root.attrs["mars"])  # {'class': 'od', 'type': 'fc', ...}

How the mapping works

Each .tgm message maps to a Zarr group:

zarr.json                     # root group ← GlobalMetadata
temperature/zarr.json         # array metadata ← DataObjectDescriptor
temperature/c/0/0             # chunk data ← decoded object payload
pressure/zarr.json            # another array
pressure/c/0/0                # its chunk data
graph LR
    TGM[".tgm file"] --> GM["GlobalMetadata"]
    TGM --> OBJ1["Object 0: temperature"]
    TGM --> OBJ2["Object 1: pressure"]
    
    GM --> GZJ["zarr.json (group)"]
    OBJ1 --> AZJ1["temperature/zarr.json"]
    OBJ1 --> CHK1["temperature/c/0/0"]
    OBJ2 --> AZJ2["pressure/zarr.json"]
    OBJ2 --> CHK2["pressure/c/0/0"]

Key design decisions:

  • Each TGM data object becomes one Zarr array with a single chunk (chunk shape = array shape)
  • Variable names are resolved from metadata via a default lookup path (name, mars.param, param, mars.shortName, shortName), or a custom dot-path you supply
  • TGM encoding metadata is preserved in Zarr array attributes under _tensogram_* keys
  • Duplicate variable names get a numeric suffix (field, field_1)

Variable naming

By default, the store tries these metadata paths to name arrays:

  1. name
  2. mars.param
  3. param
  4. mars.shortName
  5. shortName
  6. Falls back to object_<index>
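The lookup order above can be sketched as a small dot-path resolver (an assumed helper for illustration, not the tensogram-zarr internals):

```python
def resolve_variable_name(metadata, index, variable_key=None):
    """Try each dot-path in order against nested metadata dicts; the first
    path that resolves wins, else fall back to object_<index>. Sketch only."""
    paths = [variable_key] if variable_key else [
        "name", "mars.param", "param", "mars.shortName", "shortName",
    ]
    for path in paths:
        node = metadata
        for part in path.split("."):
            if not isinstance(node, dict) or part not in node:
                node = None
                break
            node = node[part]
        if node is not None:
            return str(node)
    return f"object_{index}"

resolve_variable_name({"mars": {"param": "2t"}}, 0)   # "2t"
resolve_variable_name({}, 3)                          # "object_3"
```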

You can override with any dot-path, including non-MARS vocabularies:

# Weather pipeline using MARS
store = TensogramStore.open_tgm("weather.tgm", variable_key="mars.param")

# Neuroimaging pipeline using BIDS
store = TensogramStore.open_tgm("scans.tgm", variable_key="bids.task")

# Custom vocabulary
store = TensogramStore.open_tgm("data.tgm", variable_key="product.name")

Multi-message files

By default the store reads message 0. Select a different message with message_index:

store = TensogramStore.open_tgm("multi.tgm", message_index=2)

Writing a .tgm file through Zarr

import numpy as np
import zarr
from tensogram_zarr import TensogramStore

store = TensogramStore("output.tgm", mode="w")
root = zarr.open_group(store=store, mode="w")

# Create arrays — data is buffered in memory
root.create_array("temperature", data=np.random.rand(100, 200).astype(np.float32))
root.create_array("pressure", data=np.array([1000, 925, 850, 700], dtype=np.float64))

# Close flushes to .tgm
store.close()

The write path assembles all arrays into a single TGM message when the store is closed.

Context manager

with TensogramStore("data.tgm", mode="r") as store:
    root = zarr.open_group(store=store, mode="r")
    data = root["temperature"][:]
# Store automatically closed

Supported data types

| Tensogram dtype | Zarr data_type | NumPy dtype |
|---|---|---|
| float16 | float16 | float16 |
| float32 | float32 | float32 |
| float64 | float64 | float64 |
| int8 | int8 | int8 |
| int16 | int16 | int16 |
| int32 | int32 | int32 |
| int64 | int64 | int64 |
| uint8 | uint8 | uint8 |
| uint16 | uint16 | uint16 |
| uint32 | uint32 | uint32 |
| uint64 | uint64 | uint64 |
| complex64 | complex64 | complex64 |
| complex128 | complex128 | complex128 |
| bitmask | uint8 | uint8 |

Byte range support

The store supports Zarr’s ByteRequest types for efficient partial reads:

  • RangeByteRequest(start, end) — read a byte range
  • OffsetByteRequest(offset) — read from offset to end
  • SuffixByteRequest(suffix) — read last N bytes

Comparison with tensogram-xarray

| Feature | tensogram-zarr | tensogram-xarray |
|---|---|---|
| API level | Low-level (Zarr Store) | High-level (xarray engine) |
| Dimensions | Generic (dim_0, dim_1) | Named (lat, lon, time) |
| Coordinates | Not interpreted | Auto-detected from metadata |
| Multi-message | One message per store | Auto-merge into hypercubes |
| Write support | Yes | No |
| Data loading | Eager (all at open) | Lazy (on-demand decode_range) |

Use tensogram-zarr when you need direct Zarr API access or write support. Use tensogram-xarray when you want automatic coordinate detection and multi-message merging.

Edge cases and limitations

Variable name sanitization

If a metadata value used as a variable name contains / or \, those characters are replaced with _ to prevent spurious directory nesting in the virtual key space. Empty names become _.

mars.param = "temperature/surface"  →  variable name "temperature_surface"
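The documented rule is small enough to sketch directly (an assumed helper, not the store's actual code):

```python
def sanitise_name(name):
    """Replace path separators with underscores so variable names cannot
    create spurious nesting in the virtual key space; empty names become
    a single underscore. Sketch of the documented rule."""
    cleaned = name.replace("/", "_").replace("\\", "_")
    return cleaned or "_"

sanitise_name("temperature/surface")   # "temperature_surface"
sanitise_name("")                      # "_"
```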

Duplicate variable names

When multiple objects resolve to the same name, suffixes are appended: field, field_1, field_2, etc.
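A minimal sketch of the suffixing rule (assumed helper; the real store may handle further collisions differently):

```python
def deduplicate(names):
    """First occurrence keeps its name; later occurrences get _1, _2, ...
    Sketch of the documented behaviour."""
    seen = {}
    out = []
    for name in names:
        n = seen.get(name, 0)
        out.append(name if n == 0 else f"{name}_{n}")
        seen[name] = n + 1
    return out

deduplicate(["field", "field", "field"])   # ["field", "field_1", "field_2"]
```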

Zero-object messages

A message with no data objects is valid (metadata-only). The store produces a root group with attributes but no arrays.

Single chunk per array

Each TGM data object maps to a Zarr array with chunk_shape == array_shape (one chunk). There is no sub-chunking; partial reads within the array are handled by Zarr’s byte-range support against the single chunk. If a Zarr writer attempts to store multiple chunks for the same variable, a ValueError is raised — TensogramStore does not silently drop extra chunks.

Out-of-range message index

If message_index exceeds the number of messages in the file, an IndexError is raised. Negative indices are rejected with ValueError.

bfloat16 dtype

bfloat16 maps to Zarr data type "bfloat16" but is stored as raw 2-byte values (<V2 numpy dtype) since numpy has no native bfloat16 type. Use ml_dtypes.bfloat16 for interpretation.

Byte order handling

The read path normalises all chunk data to little-endian (matching the Zarr bytes codec default). The write path respects byte_order from the Zarr codecs metadata — if a big-endian bytes codec is specified, the data is byte-swapped before encoding to TGM.

JSON serialization (RFC 8259)

serialize_zarr_json() converts non-finite float values to their Zarr v3 string sentinels ("NaN", "Infinity", "-Infinity") so the output is valid RFC 8259 JSON.
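The sentinel substitution can be sketched as follows (a hypothetical helper illustrating the documented behaviour, not serialize_zarr_json() itself):

```python
import json
import math

def to_zarr_json_value(x):
    """Map non-finite floats to the Zarr v3 string sentinels so the result
    serialises as strict RFC 8259 JSON. Sketch of the documented rule."""
    if isinstance(x, float) and not math.isfinite(x):
        if math.isnan(x):
            return "NaN"
        return "Infinity" if x > 0 else "-Infinity"
    return x

doc = {"fill_value": to_zarr_json_value(float("nan"))}
# allow_nan=False enforces strict RFC 8259; it would raise without the sentinel
text = json.dumps(doc, allow_nan=False)
```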

Write path byte-count validation

When flushing to .tgm, the store validates that chunk byte count matches product(shape) * dtype_size. A mismatch raises ValueError with the expected and actual counts.
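The check amounts to a one-line invariant; a sketch (assumed helper with an abbreviated dtype table, not the store's code):

```python
from math import prod

DTYPE_SIZES = {"float32": 4, "float64": 8, "int16": 2, "uint8": 1}  # excerpt

def check_chunk_bytes(name, shape, dtype, nbytes):
    """Validate that a chunk's byte count equals product(shape) * dtype_size.
    Sketch of the documented write-path validation."""
    expected = prod(shape) * DTYPE_SIZES[dtype]
    if nbytes != expected:
        raise ValueError(
            f"{name}: expected {expected} bytes for shape {shape} "
            f"and dtype {dtype}, got {nbytes}"
        )

check_chunk_bytes("temperature", (100, 200), "float32", 80000)  # ok
```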

close() exception safety

If _flush_to_tgm() fails during close(), the store is still marked as closed (_is_open = False). The exception propagates normally — partial writes do not corrupt the file since TGM messages are written atomically.

When used as a context manager and an exception is already in flight, flush errors are logged at WARNING level instead of replacing the original exception.

Error handling

All errors surface with enough context for debugging:

| Scenario | Exception | Message includes |
|---|---|---|
| File not found / unreadable | OSError | File path |
| Invalid TGM message | ValueError | File path + message index |
| Object decode failure | ValueError | File path + message index + object index + variable name |
| Out-of-range message index | IndexError | Requested index + available count |
| Negative message index | ValueError | The invalid index value |
| Invalid mode | ValueError | The invalid mode string |
| Empty path | ValueError | The value passed |
| Chunk byte-count mismatch | ValueError | Variable name + expected vs actual byte count |
| Unsupported dtype on write | ValueError | Variable name + dtype |
| Invalid JSON in zarr.json | ValueError | Byte count + hex preview |
| Unknown ByteRequest type | TypeError | The type name |
| Array without chunk data | WARNING log | Variable name (array skipped) |
| No arrays to flush | WARNING log | File path |

Errors from the underlying Rust tensogram library are wrapped with Python-level context so users see which file, message, and variable caused the problem.

anemoi-inference Integration

The tensogram-anemoi package provides a plug-and-play output for anemoi-inference, the ECMWF framework for running AI-based weather forecast models. Once installed, anemoi-inference automatically discovers the plugin via Python entry points — no code changes to anemoi-inference are required.

Installation

pip install tensogram-anemoi

Or from source:

pip install -e python/tensogram-anemoi/

Usage

In an anemoi-inference run config, specify tensogram as the output:

output:
  tensogram:
    path: forecast.tgm

All forecast steps are written to a single .tgm file as they are produced. Remote destinations (S3, GCS, Azure, …) are supported via fsspec:

output:
  tensogram:
    path: s3://my-bucket/forecast.tgm
    storage_options:
      key: ...
      secret: ...

Configuration options

All options after path must be supplied as keyword arguments.

| Option | Type | Default | Description |
|---|---|---|---|
| path | str | (required) | Destination file path or remote URL |
| encoding | str | "none" | "none" or "simple_packing" |
| bits | int | None | Bits per value (required when encoding="simple_packing") |
| compression | str | "zstd" | "none", "zstd", "lz4", "szip", "blosc2" |
| dtype | str | "float32" | Field array dtype: "float32" or "float64" |
| storage_options | dict | {} | Forwarded to fsspec for remote paths |
| stack_pressure_levels | bool | False | Stack pressure-level fields into 2-D objects |
| variables | list[str] | None | Restrict output to a subset of variables |
| output_frequency | int | None | Write every N steps |
| write_initial_state | bool | None | Whether to write step 0 |

Pressure-level stacking

When stack_pressure_levels=True, all fields sharing the same GRIB param are merged into a single 2-D object of shape (n_grid, n_levels), sorted by level ascending. The "mars" namespace carries "levelist": [500, 850, ...] instead of a scalar "level" (following standard MARS convention). Non-pressure-level fields are always written as individual 1-D objects.

output:
  tensogram:
    path: forecast.tgm
    stack_pressure_levels: true
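The stacking behaviour can be sketched in plain Python (a hypothetical helper using nested lists in place of numpy arrays; the real output writes (n_grid, n_levels) arrays):

```python
from collections import defaultdict

def stack_pressure_levels(fields):
    """Group (param, level, flat_values) triples by param, sort each group
    by level ascending, and emit a levelist plus column-stacked data of
    shape (n_grid, n_levels). Sketch of the documented behaviour."""
    groups = defaultdict(list)
    for param, level, values in fields:
        groups[param].append((level, values))
    stacked = {}
    for param, members in groups.items():
        members.sort(key=lambda m: m[0])          # sort by level ascending
        levelist = [lev for lev, _ in members]
        n_grid = len(members[0][1])
        # one row per grid point, one column per level
        data = [[col[1][i] for col in members] for i in range(n_grid)]
        stacked[param] = {"levelist": levelist, "data": data}
    return stacked

# Two levels of "t" on a 2-point grid, given out of order:
result = stack_pressure_levels([("t", 850, [1, 2]), ("t", 500, [3, 4])])
```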

Simple packing

For compact storage, use simple_packing with a bits value:

output:
  tensogram:
    path: forecast.tgm
    encoding: simple_packing
    bits: 16
    compression: zstd

Coordinate arrays (lat/lon) are never lossy-encoded; only field arrays are packed.


Metadata reference

Each .tgm file produced by tensogram-anemoi contains one message per forecast step. This section documents exactly what is stored in each message and how to read it with the raw tensogram Python API.

Opening a file

import tensogram

tgm = tensogram.TensogramFile.open("forecast.tgm")
print(len(tgm), "steps")

meta, objects = tgm[0]   # first step

meta is the decoded message metadata. objects is a list of (descriptor, array) pairs, one entry per object in the message.

Object layout

Every message has the following fixed layout:

| Index | base[i]["name"] | Content |
|---|---|---|
| 0 | "grid_latitude" | Latitude coordinates, float64, shape (n_grid,) |
| 1 | "grid_longitude" | Longitude coordinates, float64, shape (n_grid,) |
| 2 … N | variable name or param name | Field data |

meta, objects = tgm[0]

lat_desc, lat_arr = objects[0]   # latitudes
lon_desc, lon_arr = objects[1]   # longitudes
fld_desc, fld_arr = objects[2]   # first field

The coordinate names "grid_latitude" and "grid_longitude" are intentionally distinct from the standard "latitude" / "longitude" names so that all objects in a message share a single flat grid dimension rather than each coordinate spawning its own dimension.

base[i] — per-object metadata

Each object has a corresponding entry in meta.base:

for i, entry in enumerate(meta.base):
    print(i, entry)

Every entry contains:

| Key | Type | Present on | Description |
|---|---|---|---|
| "name" | str | all objects | Variable or coordinate name |
| "anemoi" | dict | all objects | anemoi-specific metadata (see below) |
| "mars" | dict | field objects only | MARS metadata (see below) |

"anemoi" namespace

| Key | Type | Present on | Description |
|---|---|---|---|
| "variable" | str | all objects | Internal anemoi-inference variable name |

For coordinates, "variable" is "latitude" or "longitude" (the canonical name, not the "grid_*" name stored in "name"):

assert meta.base[0]["name"] == "grid_latitude"
assert meta.base[0]["anemoi"]["variable"] == "latitude"

assert meta.base[1]["name"] == "grid_longitude"
assert meta.base[1]["anemoi"]["variable"] == "longitude"

For fields, "variable" is the internal anemoi-inference name (e.g. "t500" for 500 hPa temperature, "2t" for 2 m temperature):

assert meta.base[2]["anemoi"]["variable"] == "2t"

"mars" namespace

Coordinate objects carry no "mars" key. Every field object carries a "mars" dict combining keys from the anemoi-inference checkpoint with the temporal keys derived from the forecast state:

Temporal keys (present on every field object):

| Key | Type | Description | Example |
|---|---|---|---|
| "date" | str | Analysis/base date (YYYYMMDD) | "20240101" |
| "time" | str | Analysis/base time (HHMM) | "0000" |
| "step" | int or float | Forecast lead time in hours | 6, 1.5 |

Checkpoint keys (present when available in the model checkpoint):

| Key | Type | Description | Example |
|---|---|---|---|
| "param" | str | GRIB parameter short name | "2t", "t", "u" |
| "levtype" | str | Level type | "sfc", "pl", "ml" |
| "level" | int | Pressure level (unstacked fields only) | 500 |
| "levelist" | list[int] | Pressure levels (stacked fields only) | [500, 850, 1000] |

Reading field metadata:

meta, objects = tgm[0]

# Surface field (e.g. 2 m temperature)
entry = meta.base[2]
print(entry["name"])                    # "2t"
print(entry["anemoi"]["variable"])      # "2t"
print(entry["mars"]["param"])           # "2t"
print(entry["mars"]["date"])            # "20240101"
print(entry["mars"]["time"])            # "0000"
print(entry["mars"]["step"])            # 6

# Pressure-level field (unstacked)
entry = meta.base[3]
print(entry["mars"]["param"])           # "t"
print(entry["mars"]["levtype"])         # "pl"
print(entry["mars"]["level"])           # 500

With stack_pressure_levels=True, the pressure-level group has "levelist" instead of "level", and the array is 2-D:

entry = meta.base[2]                    # stacked t group
print(entry["mars"]["levelist"])        # [500, 850, 1000]
print(entry["mars"]["param"])           # "t"

desc, arr = objects[2]
print(arr.shape)                        # (n_grid, 3)  — columns sorted by level

meta.extra — message-level metadata

meta.extra carries metadata that applies to the whole message rather than individual objects.

"dim_names" — axis-size hints

dim_names = meta.extra["dim_names"]
# e.g. {"21600": "values"}
# or   {"21600": "values", "3": "level"}  (with stack_pressure_levels=True)

dim_names maps the string representation of an axis length to a semantic name. It exists to allow downstream tools to assign meaningful axis names without requiring any anemoi-specific knowledge. The grid axis is always labelled "values"; when pressure-level stacking is enabled, each unique level-axis size is labelled "level".
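A downstream consumer can apply the hints with a couple of lines (a sketch, assuming a generic dim_N fallback for unlabelled axes):

```python
def label_axes(shape, dim_names):
    """Assign semantic axis names from dim_names hints; keys are the string
    form of the axis length. Sketch of how a consumer might use the hints."""
    return [dim_names.get(str(n), f"dim_{i}") for i, n in enumerate(shape)]

label_axes((21600, 3), {"21600": "values", "3": "level"})  # ["values", "level"]
```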

Object descriptors

Each (descriptor, array) pair returned by objects[i] gives low-level encoding detail:

desc, arr = objects[2]

print(desc.dtype)        # "float32" or "float64"
print(desc.shape)        # [n_grid] for flat, [n_grid, n_levels] for stacked
print(desc.encoding)     # "none" or "simple_packing"
print(desc.compression)  # "zstd", "lz4", etc.

Coordinate arrays are always float64 regardless of the dtype setting. Field arrays use the configured dtype ("float32" by default), promoted to float64 automatically when encoding="simple_packing".

Full inspection example

import tensogram

tgm = tensogram.TensogramFile.open("forecast.tgm")

for step_idx, (meta, objects) in enumerate(tgm):
    print(f"\n--- step {step_idx} ---")

    # Dimension hints
    print("dim_names:", meta.extra.get("dim_names", {}))

    for i, entry in enumerate(meta.base):
        desc, arr = objects[i]
        anemoi = entry.get("anemoi", {})
        mars = entry.get("mars", {})

        print(
            f"  [{i}] name={entry['name']!r:20s}"
            f"  variable={anemoi.get('variable')!r:10s}"
            f"  shape={arr.shape}"
            f"  dtype={desc.dtype}"
            + (f"  step={mars.get('step')}" if mars else "")
        )

Example output for a single step with surface fields and stacked pressure levels:

--- step 0 ---
dim_names: {'21600': 'values', '3': 'level'}
  [0] name='grid_latitude'    variable='latitude'   shape=(21600,)  dtype=float64
  [1] name='grid_longitude'   variable='longitude'  shape=(21600,)  dtype=float64
  [2] name='2t'               variable='2t'         shape=(21600,)  dtype=float32  step=6
  [3] name='t'                variable='t'          shape=(21600, 3)  dtype=float32  step=6
  [4] name='u'                variable='u'          shape=(21600, 3)  dtype=float32  step=6

Free-Threaded Python

Tensogram supports free-threaded Python (CPython 3.13t / 3.14t), which removes the Global Interpreter Lock (GIL) and allows true multi-threaded parallelism from Python.

What This Means

On standard CPython, the GIL serializes access to the interpreter — only one thread runs Python code at a time. Tensogram already releases the GIL during Rust computation (py.detach()), which helps, but the GIL is still re-acquired for numpy array construction and Python object creation.

On free-threaded CPython (3.13t / 3.14t), there is no GIL at all. Multiple threads can call tensogram.encode() and tensogram.decode() in true parallel. Use the included benchmark (rust/benchmarks/python/bench_threading.py) to measure scaling on your hardware.

Building for Free-Threaded Python

Install a free-threaded Python build:

# uv (recommended)
uv python install cpython-3.14+freethreaded

# Or via pyenv
pyenv install 3.14t

Build tensogram:

uv venv .venv --python python3.14t
source .venv/bin/activate
uv pip install maturin "numpy>=2.1"
cd python/bindings && maturin develop --release

Verify the GIL is disabled:

import sys
print(sys._is_gil_enabled())  # False

Thread-Safe API

All tensogram read operations are safe to call from multiple threads simultaneously:

import threading
import numpy as np
import tensogram

data = np.random.randn(1_000_000).astype(np.float32)
meta = {"version": 2, "base": [{}]}
desc = {"type": "ntensor", "shape": [1_000_000], "dtype": "float32"}
msg = tensogram.encode(meta, [(desc, data)])

def decode_worker():
    for _ in range(100):
        result = tensogram.decode(msg)

threads = [threading.Thread(target=decode_worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

Each thread can independently:

  • Encode and decode messages
  • Scan buffers
  • Validate messages and files
  • Read from TensogramFile instances (same handle or separate handles)
  • Use StreamingEncoder (separate instances per thread)

TensogramFile Thread Safety

All read methods on TensogramFile (decode_message, read_message, decode_metadata, decode_descriptors, decode_object, decode_range, __getitem__, __len__, __iter__) use &self and support concurrent access from multiple threads on the same handle:

f = tensogram.TensogramFile.open("data.tgm")

def worker(thread_id):
    # Multiple threads can read from the same handle concurrently
    msg = f.decode_message(thread_id % len(f))

threads = [threading.Thread(target=worker, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

Only append() requires exclusive access — calling it while other threads are reading will raise RuntimeError (PyO3 runtime borrow check).

Benchmark Results

Measured on Linux x86_64 (20 cores), NumPy 2.4.4, release build. Same-version paired comparisons to isolate the GIL effect.

All scaling below comes from Python-level threading (threading.Thread). Each call into Rust is single-threaded — there is no rayon or internal parallelism within a single encode/decode. The speedups reflect multiple Python threads entering Rust concurrently via py.detach(). A future Rust-level parallel pipeline would multiply on top of these numbers.

Headline: Decode Throughput (1M float32, no codec)

| Threads | 3.13 (GIL) | 3.13t (free) | 3.14 (GIL) | 3.14t (free) |
|---|---|---|---|---|
| 1 | 416 op/s | 391 op/s | 408 op/s | 396 op/s |
| 2 | 432 (1.04x) | 775 (1.98x) | 432 (1.06x) | 776 (1.96x) |
| 4 | 427 (1.03x) | 1,356 (3.47x) | 425 (1.04x) | 1,352 (3.41x) |
| 8 | 309 (0.74x) | 1,507 (3.85x) | 293 (0.72x) | 1,841 (4.65x) |

Headline: Encode Throughput (1M float32, no codec)

| Threads | 3.13 (GIL) | 3.13t (free) | 3.14 (GIL) | 3.14t (free) |
|---|---|---|---|---|
| 1 | 608 op/s | 572 op/s | 504 op/s | 595 op/s |
| 2 | 761 (1.25x) | 709 (1.24x) | 664 (1.32x) | 702 (1.18x) |
| 4 | 659 (1.08x) | 726 (1.27x) | 468 (0.93x) | 725 (1.22x) |
| 8 | 520 (0.86x) | 706 (1.23x) | 351 (0.70x) | 717 (1.20x) |

Small Messages (16K float32, no codec)

| Threads | 3.13 (GIL) | 3.13t (free) | 3.14 (GIL) | 3.14t (free) |
|---|---|---|---|---|
| 1 | 20,765 op/s | 17,085 op/s | 20,174 op/s | 12,951 op/s |
| 2 | 23,689 (1.14x) | 35,642 (2.09x) | 23,093 (1.14x) | 35,176 (2.72x) |
| 4 | 22,629 (1.09x) | 36,483 (2.14x) | 22,839 (1.13x) | 61,583 (4.75x) |
| 8 | 23,664 (1.14x) | 79,539 (4.66x) | 22,487 (1.11x) | 73,549 (5.68x) |
| 16 | 23,418 (1.13x) | 93,627 (5.48x) | 23,369 (1.16x) | 168,786 (13.03x) |

Other Operations (1M float32)

Scan (message boundary detection — ~0.2µs/call, GIL overhead dominates):

Threads | 3.14 (GIL)      | 3.14t (free)
1       | 312,930 op/s    | 79,431 op/s
2       | 421,701 (1.35x) | 266,103 (3.35x)
4       | 629,505 (2.01x) | 811,096 (10.21x)
8       | 522,940 (1.67x) | 389,106 (4.90x)
16      | 516,342 (1.65x) | 1,231,777 (15.51x)

Validate (full message validation — CPU-bound, scales well on both):

Threads | 3.14 (GIL)     | 3.14t (free)
1       | 5,457 op/s     | 4,347 op/s
2       | 10,860 (1.99x) | 9,440 (2.17x)
4       | 20,249 (3.71x) | 18,752 (4.31x)
8       | 39,766 (7.29x) | 23,048 (5.30x)
16      | 48,560 (8.90x) | 45,455 (10.46x)

Decode-range (sub-array extraction, 2x1K slices from 1M):

Threads | 3.14 (GIL)      | 3.14t (free)
1       | 66,488 op/s     | 40,265 op/s
2       | 111,544 (1.68x) | 98,319 (2.44x)
4       | 103,191 (1.55x) | 167,786 (4.17x)
8       | 104,752 (1.58x) | 325,101 (8.07x)
16      | 103,236 (1.55x) | 475,755 (11.82x)

Iter-messages (3 messages, 100K f32 each):

Threads | 3.14 (GIL)    | 3.14t (free)
1       | 1,214 op/s    | 1,195 op/s
2       | 1,291 (1.06x) | 2,327 (1.95x)
4       | 1,211 (1.00x) | 4,548 (3.81x)
8       | 1,194 (0.98x) | 5,589 (4.68x)
16      | 1,106 (0.91x) | 4,432 (3.71x)

Key Takeaways

Methodology: 5 runs per configuration, median reported. 200–500 warmup iterations for fast operations.

  • Validate scales near-linearly on both GIL and free-threaded — 8.9x (GIL) and 10.5x (free-threaded) at 16 threads. This is the most CPU-bound operation and benefits fully from py.detach() regardless of GIL.
  • Free-threaded decode scales to 4.7x at 8 threads for the headline workload (1M f32, no codec). GIL-enabled stays near 1.0x because numpy array construction dominates and serializes under the GIL.
  • GIL-enabled decode-range plateaus at ~1.7x: py.detach() allows some overlap between threads, but the lightweight result construction can’t overlap further. Free-threaded reaches 11.8x at 16 threads.
  • Scan shows dramatic free-threaded scaling — free-threaded reaches 15.5x at 16 threads. GIL-enabled scales to 2.0x at 4 threads but drops back at higher thread counts due to contention.
  • Small messages (16K) reach 13.0x at 16 threads on free-threaded (3.14t) vs 1.2x on GIL-enabled.
  • iter_messages scales to 4.7x at 8 threads on free-threaded, then drops due to contention. GIL-enabled stays flat (~1.0x).
  • Single-thread trade-off — free-threaded single-thread performance varies by workload: decode is within ~5% of GIL-enabled (396 vs 408 op/s on 3.14), encode varies by version (3.14t is 18% faster than 3.14, while 3.13t is 6% slower than 3.13). Validate is ~20% slower (4,347 vs 5,457 op/s) and scan ~4x slower due to reference counting overhead on returned Python objects — both recover by 2 threads.

These numbers are machine-specific. Run the benchmark on your hardware:

python rust/benchmarks/python/bench_threading.py              # full suite
python rust/benchmarks/python/bench_threading.py --headline   # quick comparison
python rust/benchmarks/python/bench_threading.py --quick      # CI smoke test

Reference Comparison: Tensogram (Python) vs ecCodes (C)

This section measures Tensogram’s Python throughput against ecCodes’ native C performance on the same pipeline — 10 million float64 values (80 MiB), 24-bit simple packing + szip compression — as a concrete reference point. The pipeline is common in operational weather forecasting and is representative of scientific-quantisation workloads more broadly.

What we measured

Both sides are measured end-to-end: from a float64 array to serialized compressed bytes (encode), and back to a float64 array (decode). Both include metadata serialization, framing, and integrity overhead — not just the raw packing step.

ecCodes (C, single-threaded): The Rust benchmark (rust/benchmarks/src/bin/grib_comparison.rs) calls ecCodes’ C library directly via FFI. Encode: allocate a GRIB handle, configure the grid (10M regular lat/lon), set packing type to CCSDS at 24 bits, write the values array, serialize to GRIB bytes. Decode: load the GRIB message from bytes, extract the values array. No Python involved. Median of 10 iterations, 3 warmup.

Tensogram (Python, multi-threaded): The same 10M float64 values, same 24-bit quantization, same szip compression. Encode: pass a numpy array + CBOR metadata dict to tensogram.encode(), which crosses the PyO3 boundary, quantizes, compresses, frames, computes the integrity hash, and returns Python bytes. Decode: pass bytes to tensogram.decode(), which deframes, decompresses, dequantizes, and returns a numpy array. Each Python thread makes independent encode/decode calls. The GIL is released during the Rust computation.

Why scaling depends on the codec

Threading helps most when the Rust computation (compression, quantization) is the dominant cost. With simple packing + szip, each encode/decode spends ~170 ms in Rust and ~20 ms in Python/numpy — so ~89% of the time runs with the GIL released and threads scale well. Without compression, the Rust work is trivial (~1 ms) and the Python overhead limits parallelism.
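Amdahl’s law turns that split into a rough scaling ceiling. A quick sketch, using the ~170 ms / ~20 ms split from the paragraph above:

```python
def amdahl_speedup(parallel_fraction: float, n_threads: int) -> float:
    """Ideal speedup when only part of each call can overlap (Amdahl's law)."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_threads)

# ~170 ms of each call runs in Rust with the GIL released, ~20 ms stays in Python:
p = 170 / (170 + 20)
for n in (1, 2, 4, 8):
    # threads=8 gives a ceiling of ~4.6x, in line with the measured decode scaling
    print(f"{n} threads: {amdahl_speedup(p, n):.2f}x ceiling")
```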

The tables above measure uncompressed data to isolate the threading mechanism. The results below use the production pipeline (24-bit packing + szip) and show what real workloads achieve.

Results

ecCodes CCSDS (Rust FFI, single-threaded): 870 MB/s encode, 531 MB/s decode.

Tensogram from Python (free-threaded 3.14t, 5-run median, 10M float64 24-bit packing+szip):

Decode:

Threads | Throughput | vs ecCodes C
1       | 446 MB/s   | 0.84x
2       | 858 MB/s   | 1.62x
4       | 1,596 MB/s | 3.01x
8       | 2,602 MB/s | 4.90x

Encode:

Threads | Throughput | vs ecCodes C
1       | 435 MB/s   | 0.50x
2       | 833 MB/s   | 0.96x
4       | 1,516 MB/s | 1.74x
8       | 2,353 MB/s | 2.71x

Single-threaded Tensogram from Python is slower than ecCodes from C (the PyO3 boundary costs ~10-15% on decode, ~50% on encode due to numpy data extraction for 80 MiB). But at 2 threads, decode already surpasses ecCodes. At 4 threads, both encode and decode exceed ecCodes. At 8 threads, decode reaches 4.9x ecCodes throughput — from Python.

Requirements

  • Python >= 3.13t for free-threaded mode (3.12/3.13 GIL-enabled also works)
  • NumPy >= 2.1 (free-threaded support)
  • maturin >= 1.8 (free-threaded wheel building)

Known Limitations

Inherent:

  • Shared mutable numpy arrays across threads can cause data races (same as any Python threading)
  • xarray and zarr backends have their own threading models (dask, zarr locking)

By design:

  • TensogramFile read methods (decode_message, read_message, __getitem__, etc.) support concurrent access from multiple threads on the same handle. Only append() requires exclusive access.
  • bytes inputs to decode/scan/validate are zero-copy across the GIL release. bytearray inputs are copied once internally by PyO3.
  • iter_messages / PyBufferIter own a full buffer copy (the buffer must outlive iteration).

Multi-Threaded Coding Pipeline

Since v0.13.0 Tensogram exposes a caller-controlled thread budget that spreads encoding and decoding work across a scoped pool of workers. The feature is off by default — existing code paths produce byte-identical output to previous releases until the caller opts in.

The threads option

All four bindings expose a threads: u32 option on encode and decode entry points:

Rust:

#![allow(unused)]
fn main() {
use tensogram::{encode, decode, EncodeOptions, DecodeOptions};

// Encode with a 4-thread pool:
let msg = encode(&meta, &descriptors, &EncodeOptions {
    threads: 4,
    ..Default::default()
})?;

// Decode with an 8-thread pool:
let (meta, objs) = decode(&msg, &DecodeOptions {
    threads: 8,
    ..Default::default()
})?;
}

Python:

import tensogram

msg = tensogram.encode(meta, descriptors, threads=4)
decoded = tensogram.decode(msg, threads=8)

C++:

tensogram::encode_options enc{};
enc.threads = 4;
auto bytes = tensogram::encode(meta_json, objects, enc);

tensogram::decode_options dec{};
dec.threads = 8;
auto msg = tensogram::decode(buf, len, dec);

C FFI:

tgm_encode(meta_json, data_ptrs, data_lens, num_objects,
           "xxh3", /* threads= */ 4, &out);
tgm_decode(buf, len, /* verify_hash */ 0, /* native_byte_order */ 1,
           /* threads= */ 8, &msg);

CLI:

tensogram --threads 8 merge -o merged.tgm a.tgm b.tgm
TENSOGRAM_THREADS=4 tensogram split -o 'part_[index].tgm' input.tgm

Value semantics

threads     | Behaviour
0 (default) | Sequential, single-threaded. Falls back to the TENSOGRAM_THREADS env var if set and non-zero.
1           | Build a scoped 1-worker rayon pool. Useful for testing — everything flows through the parallel code paths but runs deterministically.
N ≥ 2       | Build a scoped N-worker rayon pool for the duration of the call. The pool is dropped when the call returns.

Cross-language parity

Every language binding exposes the same threads option on every encode/decode entry point that does CPU work. Metadata-only commands (scan, describe, list) never accept it because they never decode payloads.

Library entry points (supported in all four bindings — Rust, Python, C FFI, C++ wrapper — unless noted otherwise):

  • encode / encode_pre_encoded (CLI: — via subcommand)
  • decode / decode_object / decode_range (CLI: — via subcommand)
  • TensogramFile::append
  • TensogramFile::decode_message
  • TensogramFile::decode_range
  • Batch decode (object/range): not exposed in the C FFI
  • AsyncTensogramFile::* (async feature, trait)
  • StreamingEncoder::new

CLI commands:

  • tensogram merge: ✅ (--threads)
  • tensogram split: ✅
  • tensogram reshuffle: ✅
  • tensogram convert-grib / convert-netcdf: ✅
  • tensogram validate: ⚠ (flag accepted but not plumbed — IDEAS)
  • tensogram copy / merge: ✅
  • TENSOGRAM_THREADS env var fallback: honoured at every layer

Legend: ✅ = full support, ⚠ = flag accepted but currently a no-op (tracked in IDEAS), — = not applicable at this layer.

Threshold behaviour

For very small payloads the pool-build cost (~10–100 µs) outweighs any parallelism gain. The library transparently skips the pool when the total payload bytes are below a threshold (default 64 KiB). The threshold is tunable:

#![allow(unused)]
fn main() {
EncodeOptions {
    threads: 8,
    parallel_threshold_bytes: Some(0),       // always parallel
    // parallel_threshold_bytes: Some(usize::MAX), // never parallel
    ..Default::default()
}
}

Axis-A vs axis-B dispatch

The threads budget is spent along one of two axes:

  • Axis A — across objects. When a message carries multiple data objects and none of them uses an axis-B-friendly codec, rayon par_iter() runs the encode/decode pipeline for each object on a worker in parallel. Output order is preserved exactly.

  • Axis B — inside one codec. When any stage is axis-B-friendly (simple_packing encoding, shuffle filter, blosc2 or zstd compression), the budget flows into the codec’s internal parallelism:

    Stage                        | How it uses the budget
    simple_packing encode/decode | Chunked par_iter with byte-aligned chunk sizes — output bytes remain identical.
    shuffle / unshuffle          | Parallelise the outer byte_idx loop (shuffle) or output-chunk scatter (unshuffle).
    blosc2                       | CParams::nthreads / DParams::nthreads — decompress path stays single-threaded in v0.13.0.
    zstd FFI                     | NbWorkers libzstd parameter on compress; decompress is inherently sequential.

Policy

Tensogram messages tend to carry a small number of very large objects, so the library prefers axis B when any codec can use it:

Object count | Any object axis-B friendly? | Behaviour
1            | —                           | Axis B (codec gets the full budget).
N ≥ 2        | yes                         | Axis B on each object sequentially. Avoids N × N thread over-subscription.
N ≥ 2        | no                          | Axis A (par_iter across objects), each codec single-threaded.

This decision happens once per encode/decode call based on the descriptors. Nothing is configurable beyond threads and parallel_threshold_bytes — the policy is deterministic.
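The policy above can be read as a small decision function. The sketch below is illustrative, not the library’s actual code:

```python
def dispatch_axis(num_objects: int, any_axis_b_friendly: bool) -> str:
    """Where the thread budget goes, per the documented policy (illustrative)."""
    if num_objects == 1:
        return "axis-B"             # single object: the codec gets the full budget
    if any_axis_b_friendly:
        return "axis-B-sequential"  # codec parallelism per object, objects in order
    return "axis-A"                 # par_iter across objects, codecs single-threaded
```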

Determinism contract

v0.13.0 makes two different promises depending on which codecs you use.

Transparent codecs — byte-identical across thread counts

These stages produce the same encoded bytes regardless of threads:

  • encoding = "none"
  • encoding = "simple_packing" (at any bits-per-value)
  • filter = "none"
  • filter = "shuffle"
  • compression ∈ {none, lz4, szip, zfp, sz3}

Encoded payload bytes are byte-identical for threads ∈ {0, 1, 2, 4, 8, 16, ...}. This is exercised by the rust/tensogram/tests/threads_determinism.rs integration suite.

Opaque codecs — lossless round-trip, may differ

compression ∈ {blosc2, zstd} hand off work to third-party C libraries. When their internal thread pool is asked to run in parallel, blocks land in the output frame in worker completion order. The compressed bytes may therefore differ from the sequential path — but every variant round-trips losslessly:

  • Encode with threads=8, decode with threads=0 → same decoded values as a pure sequential round-trip.
  • Golden files (produced with threads=0) are still byte-for-byte stable across releases because the default path is unchanged.
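A round-trip check along these lines makes the transparent-codec promise easy to verify in your own pipeline. The helper name is ours; it works with any callable that takes a thread count and returns bytes (e.g. lambda n: tensogram.encode(meta, descriptors, threads=n)):

```python
def assert_determinism(encode, thread_counts=(0, 1, 2, 4, 8)):
    """Check that encode(n) returns byte-identical output for every thread count.

    `encode` is any callable mapping a thread count to bytes.
    Raises AssertionError on the first mismatch; returns the baseline bytes.
    """
    baseline = encode(thread_counts[0])
    for n in thread_counts[1:]:
        out = encode(n)
        if out != baseline:
            raise AssertionError(f"output differs at threads={n}")
    return baseline
```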

Why this matters

Determinism across thread counts is the core property that lets Tensogram users turn threads on in production without worrying about cache keys, deduplication hashes, or reproducible builds breaking. The invariant is tested at every layer — Rust, Python, C FFI, C++ wrapper — with a sweep over {0, 1, 2, 4, 8}.

Interaction with integrity hashing

The xxh3-64 integrity hash attached to every data object (EncodeOptions.hash_algorithm = Some(Xxh3), on by default) is a pure function of the final encoded bytes. Hashing runs in the calling thread after any intra-codec parallelism has joined; each object owns its own Xxh3Default hasher on the stack and the hasher is never shared across threads.

As a consequence the hash follows the same contract as the encoded bytes:

Codec class | Encoded bytes across thread counts | Hash across thread counts
Transparent | Byte-identical                     | Byte-identical
Opaque      | May reorder compressed blocks      | May differ per-run

For opaque codecs the hash is still internally consistent: descriptor.hash == xxh3_64(encoded_payload) always holds for the bytes that were actually written — it just may not match a hash computed at a different thread count. verify_hash on decode always succeeds regardless of the threads value used at encode time.

Since the hash is folded into the codec output in lockstep (see “Hash-while-encoding” in plans/DONE.md), turning on threads has no additional hash-computation cost beyond what threading already does to the encoded bytes themselves.

Environment variable override

TENSOGRAM_THREADS is consulted only when the caller-provided threads is 0. This matches the existing TENSOGRAM_COMPRESSION_BACKEND pattern:

# One-shot invocation — every library call inherits the budget.
TENSOGRAM_THREADS=4 python my_pipeline.py

# Explicit option still wins.
tensogram.encode(meta, descs, threads=0)   # sequential (env honoured)
tensogram.encode(meta, descs, threads=1)   # single-threaded (env ignored)
tensogram.encode(meta, descs, threads=16)  # 16 workers (env ignored)

The env var is parsed once per process (OnceLock), so changing it mid-run has no effect.

Interaction with free-threaded Python

threads is orthogonal to Python threading. For CPython 3.13+ built with --disable-gil, you can combine:

  • Python threads — run multiple Tensogram calls concurrently.
  • Tensogram threads — each call uses rayon internally.

The PyO3 bindings always release the GIL around encode/decode, so the two dimensions compose cleanly. Be careful about total thread count: N Python threads × M Tensogram threads creates N×M workers. The safest starting point is one dimension at a time.
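One simple way to respect the N×M budget is to derive the per-call threads value from the Python thread count. The helper below is an illustrative heuristic, not a library API:

```python
import os

def tensogram_threads_per_call(python_threads: int) -> int:
    """Split the CPU budget so N Python threads x M Tensogram threads <= cores."""
    cores = os.cpu_count() or 1
    return max(1, cores // python_threads)
```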

Benchmarks and tuning

The threads-scaling benchmark measures encode/decode throughput for 7 representative codec combinations across a sweep of thread counts:

cargo build --release -p tensogram-benchmarks
./target/release/threads-scaling \
    --num-points 16000000 \
    --iterations 5 \
    --warmup 2 \
    --threads 0,1,2,4,8,16

Output columns (per case × thread count):

  • enc (ms), dec (ms) — median wall time over iterations.
  • enc MB/s, dec MB/s — throughput based on the original byte size.
  • ratio — compressed size as a percentage of original.
  • size (MiB) — compressed size.
  • enc x, dec x — speedup relative to the threads=0 baseline.

See the Benchmark Results page for numbers on a reference machine.

Tuning recommendations

  1. Start with threads=0. The default is deterministic, well tested, and fast for small-to-medium payloads.
  2. Turn it on globally via env. TENSOGRAM_THREADS=$(nproc) is a reasonable starting point for CPU-bound data-movement pipelines. Leave the in-process tensogram calls as threads=0 unless you need finer control per call.
  3. Measure before tuning. On small payloads the threshold keeps you safe, but the sweet spot for large tensors varies by codec. For simple_packing + szip, 2–4 threads already reaches diminishing returns; for blosc2 it can scale further.
  4. Do not stack Python threads × Tensogram threads unless you know the total fits your CPU budget. Over-subscription destroys throughput.

Benchmarks

Tensogram ships with a benchmark suite that measures all encoding and compression combinations on synthetic data. It produces tabular comparisons of speed, compressed size, and decode fidelity. The benchmarks can be re-run at any time to measure the effect of changes.

Codec Matrix Benchmark

Tests all valid encoder × compressor × bit-width combinations on 16 million synthetic float64 values.

Quick start

cargo run --release -p tensogram-benchmarks --bin codec-matrix

Override parameters with CLI flags:

cargo run --release -p tensogram-benchmarks --bin codec-matrix -- \
    --num-points 16000000 \
    --iterations 10 \
    --warmup 3 \
    --seed 42

Flag         | Default    | Description
--num-points | 16 000 000 | Number of float64 values to encode
--iterations | 10         | Timed iterations per combination (median reported)
--warmup     | 3          | Warm-up iterations (discarded)
--seed       | 42         | PRNG seed for deterministic data generation

Combinations measured

Group                    | Description                                                                                | Count
Baseline                 | No encoding, no compression                                                                | 1
Lossless compressors     | Raw floats compressed with zstd, LZ4, Blosc2, or szip                                      | 4
SimplePacking + lossless | Quantized to 16, 24, or 32 bits, then compressed with each of the above (or no compressor) | 15
Lossy codecs             | ZFP (fixed rate 16/24/32) and SZ3 (absolute error 0.01)                                    | 4
Total                    |                                                                                            | 24

For actual results, see Benchmark Results.

How to read the results

The results page splits each benchmark into a performance table (timing, throughput, compressed size) and a fidelity table (error norms for lossy codecs).

Column         | Meaning                                                                                                         | Better is
Method         | Encoder + compressor. E.g. “24-bit + szip” means values are quantized to 24 bits then compressed with szip. [REF] marks the baseline. |
Enc / Dec (ms) | Median encode / decode time.                                                                                    | Lower
Enc / Dec MB/s | Throughput: uncompressed size ÷ median time.                                                                    | Higher
Ratio          | Compressed size as percentage of original. 25% = compressed to ¼. Above 100% means the codec expanded the data. | Lower
Size (MiB)     | Compressed output size.                                                                                         |
Linf           | Max absolute error (worst single value).                                                                        | Smaller
L1             | Mean absolute error (average drift).                                                                            | Smaller
L2             | Root mean square error (penalizes outliers).                                                                    | Smaller
For lossless codecs all three error norms are zero. Errors are absolute, in the same units as the input data.

Quick rules of thumb:

  • If you need exact data back, use one of the lossless codecs.
  • If you can tolerate some loss, compare Ratio vs error norms for your use case.
  • Throughput (MB/s) is the most useful speed metric — it accounts for data size and lets you compare across different payload sizes.

Reference Comparison: ecCodes GRIB Encoding

Scientific codecs are easiest to understand alongside an established reference. ecCodes is a widely-deployed GRIB encoder used throughout operational weather forecasting. This benchmark compares Tensogram’s 24-bit SimplePacking + szip pipeline against ecCodes’ built-in packing methods on 10 million float64 values. Both sides are timed symmetrically: encoding measures the full path from a float64 array to compressed bytes, and decoding measures the reverse.

Requirements

  • ecCodes C library installed (brew install eccodes on macOS, apt install libeccodes-dev on Debian/Ubuntu)
  • Build with --features eccodes

Quick start

cargo run --release -p tensogram-benchmarks --bin grib-comparison --features eccodes
cargo run --release -p tensogram-benchmarks --bin grib-comparison --features eccodes -- \
    --num-points 10000000 \
    --iterations 10 \
    --warmup 3 \
    --seed 42

Methods compared

Method                    | Description
ecCodes CCSDS (reference) | CCSDS packing via ecCodes — a widely-deployed operational reference
ecCodes simple packing    | Basic fixed-bit-width packing without entropy coding
Tensogram 24-bit + szip   | Tensogram’s SimplePacking at 24 bits followed by szip entropy coding

For actual results, see Benchmark Results.

Benchmark pipeline flow

flowchart TD
    G[Generate synthetic field] --> W[Warm-up iterations]
    W --> T[Timed iterations]
    T --> E[Encode]
    E --> D[Decode]
    D --> T
    T --> F[Fidelity check]
    F --> R[Print report]

    style G fill:#388e3c,stroke:#2e7d32,color:#fff
    style T fill:#1565c0,stroke:#0d47a1,color:#fff
    style F fill:#c62828,stroke:#b71c1c,color:#fff

Each timed iteration runs a full encode → decode cycle. After all iterations complete, the last decoded output is compared against the original to produce the fidelity metrics.

Things to know

Compression expansion

Some compressors (especially LZ4 on raw 64-bit floats) may produce output larger than the input (Ratio > 100%). This is normal — high-entropy data can’t always be compressed. The baseline row is a raw copy and always shows 100%.

Szip alignment

The codec matrix may round num_points up by 1–3 values for szip block alignment. This only matters for very small inputs.

Small data sizes

With --num-points 1, timing is dominated by per-call overhead rather than compression throughput. Use ≥ 10 000 points for meaningful comparisons.

GRIB grid shape

For prime num_points, the GRIB benchmark creates a 1 × N grid (not a realistic near-square grid). Use composite sizes for representative results (e.g. --num-points 10000000).

Reproducibility

The data generator is deterministic for a given --seed, so repeated runs on the same machine produce comparable timing. Compression ratios, sizes, and fidelity are reproducible across machines. Timing and throughput are not.

Error handling

If a single codec fails, the benchmark logs the error and continues with the remaining combinations. The summary line reports how many succeeded and failed. The CLI exits with code 1 if any combination failed.

Running in CI

For fast CI validation, pass --num-points 10000 --iterations 1 --warmup 1:

cargo run -p tensogram-benchmarks --bin codec-matrix -- \
    --num-points 10000 --iterations 1 --warmup 1

The smoke test suite (cargo test -p tensogram-benchmarks) uses 500–1000 points and completes in under 5 seconds.

Benchmark Results

This page is a snapshot of benchmark results recorded on a specific machine. For methodology, flags, and how to re-run, see Benchmarks.

Note: Timing and throughput are machine-specific. Compression ratios, sizes, and fidelity metrics are determined by the codec and are reproducible.

Run metadata

Field             | Value
Date              | 2026-04-16
Tensogram version | 0.13.0
CPU               | Apple M4, 10 cores / 10 threads
OS                | macOS 26.3 (Darwin 25.3.0)
Rust              | rustc 1.94.1
ecCodes           | 2.46.0
Methodology       | 10 timed iterations, 3 warmup, median reported

Codec Matrix

16 million float64 values (122 MiB). The test data is a synthetic smooth scientific-like field with values in the range 250–310 (a profile that also matches real temperature grids and other bounded-range physical measurements).

How fidelity is measured

After each encode→decode round-trip, the decoded values are compared to the original. Three error norms are reported, all absolute in the same units as the input:

  • Linf — the largest error for any single value. Answers: “what is the worst case?”
  • L1 — the average error across all values. Answers: “how far off are values on average?”
  • L2 (RMSE) — root mean square error. Like L1 but penalizes large outliers more heavily. Answers: “how large are the typical errors, weighted toward the worst ones?”

For lossless codecs all three are zero.
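The three norms are straightforward to compute with numpy. This helper is illustrative, not part of the benchmark harness:

```python
import numpy as np

def error_norms(original: np.ndarray, decoded: np.ndarray) -> dict:
    """Absolute error norms, in the same units as the input."""
    err = np.abs(decoded - original)
    return {
        "Linf": float(err.max()),                 # worst single value
        "L1": float(err.mean()),                  # average drift
        "L2": float(np.sqrt(np.mean(err ** 2))),  # RMSE, penalises outliers
    }
```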

Lossless compressors on raw floats

No encoding step — raw 64-bit floats compressed directly. Decoded values are bit-identical to the original.

Method               | Enc (ms) | Dec (ms) | Enc MB/s | Dec MB/s | Ratio  | Size (MiB)
no compression [REF] | 3.7      | 3.7      | 32818    | 33226    | 100.0% | 122.1
zstd level 3         | 128.5    | 114.5    | 950      | 1066     | 90.3%  | 110.2
LZ4                  | 8.5      | 7.4      | 14328    | 16535    | 100.4% | 122.6
Blosc2               | 51.9     | 26.6     | 2350     | 4584     | 75.2%  | 91.8
szip                 | 69.7     | 206.8    | 1753     | 590      | 100.9% | 123.2

Raw 64-bit floats have high entropy, so most lossless compressors cannot reduce their size. LZ4 and szip slightly expand the data. Blosc2 is the exception — its byte-shuffle step exposes compressible patterns (75%).

SimplePacking (quantization) + lossless compressors

Values are quantized to N bits, then compressed. Fidelity depends only on the bit width, not on the compressor — see the fidelity table below.

Method          | Enc (ms) | Dec (ms) | Enc MB/s | Dec MB/s | Ratio | Size (MiB)
16-bit only     | 17.3     | 15.1     | 7039     | 8078     | 25.0% | 30.5
16-bit + zstd   | 54.2     | 36.2     | 2254     | 3375     | 24.4% | 29.7
16-bit + LZ4    | 19.7     | 22.2     | 6204     | 5493     | 25.1% | 30.6
16-bit + Blosc2 | 115.2    | 31.5     | 1060     | 3873     | 20.3% | 24.8
16-bit + szip   | 53.9     | 99.3     | 2263     | 1229     | 14.6% | 17.8
24-bit only     | 19.2     | 17.1     | 6347     | 7135     | 37.5% | 45.8
24-bit + zstd   | 67.3     | 41.1     | 1813     | 2969     | 37.2% | 45.4
24-bit + LZ4    | 31.5     | 23.5     | 3871     | 5188     | 37.6% | 46.0
24-bit + Blosc2 | 124.9    | 40.0     | 978      | 3052     | 32.8% | 40.0
24-bit + szip   | 63.3     | 133.5    | 1928     | 914      | 27.2% | 33.2
32-bit only     | 21.2     | 25.3     | 5771     | 4825     | 50.0% | 61.0
32-bit + zstd   | 97.8     | 37.0     | 1248     | 3299     | 49.8% | 60.8
32-bit + LZ4    | 37.1     | 45.1     | 3287     | 2706     | 50.2% | 61.3
32-bit + Blosc2 | 141.0    | 38.3     | 866      | 3183     | 45.3% | 55.3
32-bit + szip   | 69.8     | 157.4    | 1748     | 775      | 39.7% | 48.4

Fidelity by bit width

Bit width | Linf (max abs) | L1 (mean abs) | L2 (RMSE)
16 bits   | 4.9 × 10⁻⁴     | 2.4 × 10⁻⁴    | 2.8 × 10⁻⁴
24 bits   | 1.9 × 10⁻⁶     | 9.5 × 10⁻⁷    | 1.1 × 10⁻⁶
32 bits   | 7.5 × 10⁻⁹     | 3.7 × 10⁻⁹    | 4.3 × 10⁻⁹

For context: with input values around 280, a Linf of 1.9 × 10⁻⁶ means the worst-case relative error at 24 bits is roughly 7 parts per billion.

Lossy floating-point compressors

These operate directly on raw f64 bytes without quantization.

Method       | Enc (ms) | Dec (ms) | Enc MB/s | Dec MB/s | Ratio | Size (MiB)
ZFP rate 16  | 220.1    | 304.2    | 555      | 401      | 25.0% | 30.5
ZFP rate 24  | 248.0    | 468.5    | 492      | 261      | 37.5% | 45.8
ZFP rate 32  | 288.0    | 581.0    | 424      | 210      | 50.0% | 61.0
SZ3 abs 0.01 | 131.4    | 141.0    | 929      | 865      | 6.5%  | 7.9

Fidelity by lossy codec

Method       | Linf (max abs) | L1 (mean abs) | L2 (RMSE)
ZFP rate 16  | 1.3 × 10⁻²     | 1.6 × 10⁻³    | 2.0 × 10⁻³
ZFP rate 24  | 5.6 × 10⁻⁵     | 6.1 × 10⁻⁶    | 7.9 × 10⁻⁶
ZFP rate 32  | 1.9 × 10⁻⁷     | 2.4 × 10⁻⁸    | 3.1 × 10⁻⁸
SZ3 abs 0.01 | 1.0 × 10⁻²     | 5.0 × 10⁻³    | 5.8 × 10⁻³
SZ3 abs 0.011.0 × 10⁻²5.0 × 10⁻³5.8 × 10⁻³

Notable observations

  • 16-bit + szip achieves the best compression ratio (14.6%) among the SimplePacking combinations.
  • SZ3 achieves the smallest output overall (6.5%) with a max error of 0.01. If your application tolerates that error bound, this gives the best compression in this benchmark.
  • In this benchmark, higher ZFP rates gave proportionally smaller errors. ZFP fixed-rate modes always hit their target ratio exactly (25% / 37.5% / 50%).

Reference Comparison: ecCodes GRIB Encoding

GRIB is a binary format widely used in operational weather forecasting, and ecCodes (from ECMWF) is a common implementation. Comparing against it gives a concrete, reproducible reference point for Tensogram’s quantisation + entropy-coding pipeline.

This benchmark runs Tensogram’s 24-bit SimplePacking + szip and ecCodes’ built-in packing methods on the same input. Both sides are timed end-to-end: from a float64 array to serialised compressed bytes (encode), and back (decode).

10 million float64 values (76 MiB), 24-bit packing. Different dataset size from the codec matrix above.

Method                  | Enc (ms) | Dec (ms) | Enc MB/s | Dec MB/s | Ratio | Size (MiB)
ecCodes CCSDS [REF]     | 47.9     | 84.8     | 1594     | 900      | 27.2% | 20.8
ecCodes simple packing  | 32.6     | 7.9      | 2339     | 9660     | 37.5% | 28.6
Tensogram 24-bit + szip | 43.7     | 80.4     | 1745     | 950      | 27.4% | 20.9

All three methods produce identical fidelity: Linf = 1.9 × 10⁻⁶, L1 = 9.5 × 10⁻⁷, L2 = 1.1 × 10⁻⁶.

Notable observations

  • Tensogram and ecCodes CCSDS achieve nearly identical compression (27.4% vs 27.2%) and identical fidelity at 24 bits.
  • Tensogram encode is now slightly faster than ecCodes CCSDS (43.7 vs 47.9 ms) on this machine; decode is comparable (80.4 vs 84.8 ms).
  • ecCodes simple packing decodes fastest (7.9 ms) but produces a larger file (37.5% vs 27%).

Threading Scaling

The v0.13.0 multi-threaded coding pipeline lets callers spend a threads budget on encode/decode work. Results here show the effect of sweeping threads ∈ {0, 1, 2, 4, 8} on 16M f64 values (122 MiB) for seven representative codec combinations. threads=0 is the sequential baseline; speedups are measured against it.

Reminder: Transparent codecs (no codec, simple_packing, szip, lz4, zfp, sz3, shuffle) produce byte-identical encoded payloads across thread counts. Opaque codecs (blosc2, zstd with nb_workers > 0) may produce different compressed bytes while always round-tripping losslessly.

Lossless (no encoding)

Method           | Metric   | threads=0 | threads=1 | threads=2 | threads=4 | threads=8
none+none        | enc MB/s | 32818     | 35929     | 36801     | 35173     | 35520
none+none        | speedup  | 1.00x     | 1.09x     | 1.12x     | 1.07x     | 1.08x
none+lz4         | enc MB/s | 7733      | 3619      | 3559      | 2029      | 2513
none+lz4         | speedup  | 1.00x     | 0.47x     | 0.46x     | 0.26x     | 0.32x
none+zstd(3)     | enc MB/s | 942       | 1163      | 2075      | 2259      | 1839
none+zstd(3)     | speedup  | 1.00x     | 1.23x     | 2.20x     | 2.40x     | 1.95x
none+blosc2(lz4) | enc MB/s | 3150      | 3140      | 5030      | 7458      | 8906
none+blosc2(lz4) | speedup  | 1.00x     | 1.00x     | 1.60x     | 2.37x     | 2.83x

SimplePacking + compression

Method             | Metric      | threads=0 | threads=1 | threads=2 | threads=4 | threads=8
sp(16)+none        | enc MB/s    | 12964     | 13268     | 15584     | 15643     | 14612
sp(16)+none        | enc speedup | 1.00x     | 1.02x     | 1.20x     | 1.21x     | 1.13x
sp(16)+none        | dec speedup | 1.00x     | 1.14x     | 2.37x     | 2.34x     | 2.18x
sp(24)+szip        | enc MB/s    | 2273      | 2263      | 2351      | 2389      | 2427
sp(24)+szip        | speedup     | 1.00x     | 1.00x     | 1.03x     | 1.05x     | 1.07x
sp(24)+blosc2(lz4) | enc MB/s    | 2371      | 2350      | 3965      | 5554      | 6388
sp(24)+blosc2(lz4) | enc speedup | 1.00x     | 0.99x     | 1.67x     | 2.34x     | 2.69x

Notable observations

  • Memory-bound baselines (none+none, none+lz4) do not scale. The parallel dispatch overhead outweighs any gain when the work per task is already at memory bandwidth. none+lz4 actually regresses — leave threads=0 for lz4-only workloads.
  • blosc2 scales best. Encoding with blosc2+lz4 reaches 2.8× on 8 threads; the sp(24)+blosc2 combination reaches 2.7× on encode and 1.3× on decode.
  • zstd scales ~2.4× on encode at 4 threads via libzstd’s NbWorkers. Beyond 4 threads the benefit plateaus on this CPU.
  • simple_packing decode is 2.3× faster at 2+ threads — the internal chunk-parallel scatter saturates memory bandwidth quickly.
  • szip is single-threaded. The marginal gains shown for sp(24)+szip come from parallelising the simple_packing stage only; szip itself runs sequentially in v0.13.0.

The raw numbers above were produced by the threads-scaling binary in rust/benchmarks. Re-run locally with:

cargo build --release -p tensogram-benchmarks
./target/release/threads-scaling \
    --num-points 16000000 \
    --iterations 5 \
    --warmup 2 \
    --threads 0,1,2,4,8

Simple Packing

Simple packing is a lossy quantisation technique derived from GRIB’s simple-packing method. It quantises a range of floating-point values into N-bit integers, dramatically reducing payload size at the cost of precision.

A 16-bit simple_packing payload is 8× smaller than the equivalent float64 and 4× smaller than float32, with precision loss typically below instrument noise for most bounded-range scientific measurements (temperatures, voltages, pressures, intensity counts).

How It Works

Given a set of float64 values V[i]:

  1. Find the minimum value R (the reference value).
  2. Scale all values relative to R: Y[i] = (V[i] - R) × 10^D × 2^-E
  3. Round Y[i] to the nearest integer and pack it into B bits (MSB first).

The parameters D (decimal scale factor), E (binary scale factor), and B (bits per value) are chosen automatically by compute_params().

flowchart TD
    A["Input: V = [250.0, 251.3, 252.7]"]
    B["Find reference value
    R = min(V) = 250.0"]
    C["Scale relative to R
    [0, 1.3, 2.7] × 10^D × 2^−E"]
    D["Round to integers
    [0, 17369, 36044]"]
    E["Pack as 16-bit MSB
    00 00 43 99 8C 8C"]

    A --> B --> C --> D --> E

    style A fill:#388e3c,stroke:#2e7d32,color:#fff
    style E fill:#1565c0,stroke:#0d47a1,color:#fff
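The quantise/reconstruct round trip above can be sketched with numpy. This is illustrative only: the real codec derives D, E, and B via compute_params() and packs the integers MSB-first into B-bit fields.

```python
import numpy as np

def pack(values: np.ndarray, d: int, e: int) -> tuple[float, np.ndarray]:
    """Quantise values relative to the field minimum (steps 1-3 above)."""
    r = float(values.min())                       # reference value R
    scaled = (values - r) * 10.0**d * 2.0**-e     # Y[i] = (V[i] - R) x 10^D x 2^-E
    return r, np.rint(scaled).astype(np.uint64)   # integers, ready to pack into B bits

def unpack(r: float, packed: np.ndarray, d: int, e: int) -> np.ndarray:
    """Invert the scaling to reconstruct approximate float64 values."""
    return r + packed.astype(np.float64) * 2.0**e / 10.0**d

r, q = pack(np.array([250.0, 251.3, 252.7]), d=2, e=0)
roundtrip = unpack(r, q, d=2, e=0)  # within half a quantisation step of the input
```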

Limitations and Edge Cases

NaN and ±Infinity are Rejected

compute_params() and encode() return an error if the data contains any NaN or ±Infinity values. Simple packing has no representation for non-finite numbers (unlike IEEE 754 floats), and feeding Inf through the range / scale-factor derivation would produce an i32::MAX-saturated binary_scale_factor that silently decodes to NaN everywhere. Both are errors at the codec entry:

  • NaN → PackingError::NanValue(index)
  • +Inf / -Inf → PackingError::InfiniteValue(index)

Remove or replace non-finite values before encoding. If you want to preserve them, switch to encoding="none" and opt in to the NaN / Inf bitmask companion via allow_nan=true / allow_inf=true — see NaN / Inf Handling for the full semantics. Simple packing cannot represent non-finite values at all, so the mask companion is only available on the pass-through encoding path.

#![allow(unused)]
fn main() {
// Both rejected:
let with_nan = vec![1.0_f64, 2.0, f64::NAN, 4.0];
let with_inf = vec![1.0_f64, 2.0, f64::INFINITY, 4.0];
assert!(compute_params(&with_nan, 16, 0).is_err());
assert!(compute_params(&with_inf, 16, 0).is_err());
}

Params Safety Net

Beyond input-value validation, encode() also checks the SimplePackingParams it receives:

  • reference_value must be finite (NaN / ±Inf → error).
  • |binary_scale_factor| ≤ 256. The threshold catches the i32::MAX-saturation fingerprint from feeding Inf through compute_params indirectly; real-world data (|bsf| ≤ 60) fits comfortably. The constant MAX_REASONABLE_BINARY_SCALE = 256 is exported from tensogram_encodings::simple_packing.

This closes the standalone-API footgun where a caller constructs or mutates SimplePackingParams directly rather than deriving them from compute_params. Both failures surface as PackingError::InvalidParams { field, reason } with a clear message naming the offending field.

Constant Fields

If all values are identical (range = 0), compute_params() succeeds and stores everything in the reference value. All packed integers are 0. Decoding reconstructs the constant correctly.
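A standalone sketch of why the constant case round-trips exactly (illustrative code, not the crate's implementation):

```rust
// Constant field: range = 0, so every offset from the reference rounds to 0
// and the constant is carried entirely by the reference value.
fn pack_constant(values: &[f64]) -> (f64, Vec<u64>) {
    let reference = values.iter().cloned().fold(f64::INFINITY, f64::min);
    let packed = values.iter().map(|v| (v - reference).round() as u64).collect();
    (reference, packed)
}

fn main() {
    let values = [287.5_f64; 8];
    let (reference, packed) = pack_constant(&values);
    assert!(packed.iter().all(|&p| p == 0)); // all packed integers are 0
    // Decoding reconstructs reference + 0 for every element — exact.
    let decoded: Vec<f64> = packed.iter().map(|&p| reference + p as f64).collect();
    assert_eq!(decoded, values);
}
```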

bits_per_value Range

Valid range: 0 to 64. More than 64 bits is rejected. Zero bits is accepted — compute_params stores the first value as the reference value (not the minimum) and encode produces an empty byte buffer. Decode reconstructs the reference value for every element, so this is only lossless for constant fields. Typical range for scientific floating-point data is 8–24 bits.

| bits_per_value | Packed values | Precision vs float64 |
|---|---|---|
| 8 | 256 levels | Coarse (rough categories) |
| 16 | 65,536 levels | Good for temperature, wind |
| 24 | 16,777,216 levels | Near-float32 precision |
| 32 | ~4 billion levels | Near-float64 for most ranges |

API

compute_params

#![allow(unused)]
fn main() {
pub fn compute_params(
    values: &[f64],
    bits_per_value: u32,
    decimal_scale_factor: i32,
) -> Result<SimplePackingParams, PackingError>
}

Computes the optimal packing parameters for the given data. Call this once before encoding.

#![allow(unused)]
fn main() {
let values: Vec<f64> = (0..1000).map(|i| 250.0 + i as f64 * 0.01).collect();
let params = compute_params(&values, 16, 0)?;

println!("reference_value: {}", params.reference_value);
println!("binary_scale_factor: {}", params.binary_scale_factor);
println!("bits_per_value: {}", params.bits_per_value);
}

encode

#![allow(unused)]
fn main() {
pub fn encode(
    values: &[f64],
    params: &SimplePackingParams,
) -> Result<Vec<u8>, PackingError>
}

Encodes f64 values to a packed byte buffer using the given parameters.

decode

#![allow(unused)]
fn main() {
pub fn decode(
    packed: &[u8],
    num_values: usize,
    params: &SimplePackingParams,
) -> Result<Vec<f64>, PackingError>
}

Decodes a packed buffer back to f64 values. The num_values parameter is required because the byte length alone is not enough to determine the element count (bits per value may not divide evenly into bytes).
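The ambiguity is easy to see with a small sketch: when the bit width does not divide 8, two different element counts can round up to the same byte length.

```rust
// Packed byte length for n values at b bits each, rounded up to whole bytes.
fn packed_len_bytes(num_values: usize, bits_per_value: usize) -> usize {
    (num_values * bits_per_value + 7) / 8
}

fn main() {
    // At 3 bits per value, 1 value and 2 values both occupy a single byte,
    // so the byte length alone cannot recover the element count.
    assert_eq!(packed_len_bytes(1, 3), 1);
    assert_eq!(packed_len_bytes(2, 3), 1);
    assert_eq!(packed_len_bytes(1000, 16), 2000);
}
```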

Precision Example

Consider a bounded-range scalar field spanning 90 units (e.g. a temperature field 220–310 K, a pressure field 950–1040 hPa, or any analogous bounded scientific quantity):

| bits_per_value | Step size | Max error |
|---|---|---|
| 8 | 0.353 units | ±0.18 units |
| 12 | 0.022 units | ±0.011 units |
| 16 | 0.00137 units | ±0.00069 units |

At 16 bits, the error is smaller than most practical sensor precisions. The same analysis applies to any physical quantity with a bounded dynamic range.
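The table's numbers follow directly from the step-size formula range / (2^B − 1); a quick standalone sketch to recompute them:

```rust
// Quantisation step and worst-case rounding error for a B-bit packing of a
// bounded range (recomputing the table above).
fn step(range: f64, bits: u32) -> f64 {
    range / ((1u64 << bits) - 1) as f64
}

fn main() {
    let range = 90.0;
    for bits in [8u32, 12, 16] {
        let s = step(range, bits);
        // Max error is half the step (round-to-nearest).
        println!("{bits:>2} bits: step {:.5}, max error ±{:.5}", s, s / 2.0);
    }
}
```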

Full Integration Example

#![allow(unused)]
fn main() {
use tensogram::{encode, decode, GlobalMetadata, DataObjectDescriptor,
                     ByteOrder, Dtype, EncodeOptions, DecodeOptions};
use tensogram_encodings::simple_packing;
use ciborium::Value;
use std::collections::BTreeMap;

// Source data: 1000 temperature values
let values: Vec<f64> = (0..1000).map(|i| 273.0 + i as f64 * 0.05).collect();
let raw: Vec<u8> = values.iter().flat_map(|v| v.to_be_bytes()).collect(); // big-endian, matching ByteOrder::Big below

// Compute packing parameters
let params = simple_packing::compute_params(&values, 16, 0).unwrap();

// Build descriptor with packing params
let mut p = BTreeMap::new();
p.insert("reference_value".into(), Value::Float(params.reference_value));
p.insert("binary_scale_factor".into(),
    Value::Integer((params.binary_scale_factor as i64).into()));
p.insert("decimal_scale_factor".into(),
    Value::Integer((params.decimal_scale_factor as i64).into()));
p.insert("bits_per_value".into(),
    Value::Integer((params.bits_per_value as i64).into()));

let desc = DataObjectDescriptor {
    obj_type: "ntensor".into(),
    ndim: 1,
    shape: vec![1000],
    strides: vec![1],
    dtype: Dtype::Float64,
    byte_order: ByteOrder::Big,
    encoding: "simple_packing".into(),
    filter: "none".into(),
    compression: "none".into(),
    params: p,
    hash: None,
};

let global = GlobalMetadata { version: 2, ..Default::default() };

let msg = encode(&global, &[(&desc, &raw)], &EncodeOptions::default()).unwrap();
println!("Packed size: {} bytes (was {} bytes)", msg.len(), raw.len());

let (_, objects) = decode(&msg, &DecodeOptions::default()).unwrap();
let decoded: Vec<f64> = objects[0].1.chunks_exact(8)
    .map(|c| f64::from_be_bytes(c.try_into().unwrap()))
    .collect();

// Check precision
for (orig, dec) in values.iter().zip(decoded.iter()) {
    assert!((orig - dec).abs() < 0.001);
}
}

Byte Shuffle Filter

The shuffle filter rearranges the bytes of a multi-byte array to improve compression. It is the same algorithm used by HDF5 and NetCDF4.

Why Shuffle Helps

For float32 data, each value occupies 4 bytes. The bytes within a float are not independent — nearby values tend to share their most-significant bytes (exponent + high mantissa) while the least-significant bytes are more random.

Without shuffle, the bytes are interleaved:

[B0 B1 B2 B3][B0 B1 B2 B3][B0 B1 B2 B3]...

A compressor sees B0 B1 B2 B3 B0 B1 B2 B3 B0 B1 B2 B3 ... — not very compressible because the predictable (B0, B1) bytes are mixed with the random (B3) bytes.

After shuffle, all byte-0s come first, then all byte-1s, etc.:

[B0 B0 B0 ...][B1 B1 B1 ...][B2 B2 B2 ...][B3 B3 B3 ...]

Now the B0 run and B1 run are highly compressible (long runs of similar values). The B3 run is still noisy, but it’s isolated. Overall compression improves significantly.
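The transform itself is just a transpose of an n × element_size byte matrix. A minimal standalone sketch of the byte movement (not the crate's implementation):

```rust
// Byte shuffle sketch: byte k of element i moves to position k*n + i,
// grouping all byte-0s first, then all byte-1s, and so on.
fn shuffle_sketch(data: &[u8], element_size: usize) -> Vec<u8> {
    assert_eq!(data.len() % element_size, 0, "buffer must be whole elements");
    let n = data.len() / element_size;
    let mut out = vec![0u8; data.len()];
    for i in 0..n {
        for k in 0..element_size {
            out[k * n + i] = data[i * element_size + k];
        }
    }
    out
}

fn main() {
    // Two 4-byte elements [A0 A1 A2 A3][B0 B1 B2 B3]
    let data: [u8; 8] = [0xA0, 0xA1, 0xA2, 0xA3, 0xB0, 0xB1, 0xB2, 0xB3];
    // become [A0 B0][A1 B1][A2 B2][A3 B3]
    let expected: [u8; 8] = [0xA0, 0xB0, 0xA1, 0xB1, 0xA2, 0xB2, 0xA3, 0xB3];
    assert_eq!(shuffle_sketch(&data, 4), expected);
}
```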

API

shuffle

#![allow(unused)]
fn main() {
pub fn shuffle(data: &[u8], element_size: usize) -> Result<Vec<u8>, ShuffleError>
}

Rearranges bytes. element_size is the byte width of each element (e.g. 4 for float32, 8 for float64).

#![allow(unused)]
fn main() {
let floats: Vec<f32> = vec![1.0, 2.0, 3.0, 4.0];
let raw: Vec<u8> = floats.iter().flat_map(|f| f.to_ne_bytes()).collect();
let shuffled = shuffle(&raw, 4)?;
// shuffled is ready for compression
}

unshuffle

#![allow(unused)]
fn main() {
pub fn unshuffle(data: &[u8], element_size: usize) -> Result<Vec<u8>, ShuffleError>
}

Reverses the shuffle. Applied automatically by the decode pipeline.

Using Shuffle in a Message

Set filter: "shuffle" in the DataObjectDescriptor and provide shuffle_element_size:

#![allow(unused)]
fn main() {
use ciborium::Value;

let mut params = BTreeMap::new();
params.insert(
    "shuffle_element_size".to_string(),
    Value::Integer(4.into()), // 4 bytes per float32
);

let desc = DataObjectDescriptor {
    obj_type: "ntensor".to_string(),
    ndim: 1,
    shape: vec![100],
    strides: vec![1],
    dtype: Dtype::Float32,
    byte_order: ByteOrder::Big,
    encoding: "none".to_string(),
    filter: "shuffle".to_string(),
    compression: "none".to_string(),
    params,
    hash: None,
};
}

Edge Cases

Element Size Must Divide the Buffer

The shuffle operation requires data.len() % element_size == 0. If this is not true, the function returns Err(ShuffleError::Misaligned). Ensure your data buffer is a whole number of elements.

Shuffle Alone Does Not Compress

Shuffle rearranges bytes but does not reduce the total byte count. It only helps when followed by a compression stage (e.g. szip, zstd, lz4, blosc2). Set compression in the descriptor to apply compression after the shuffle step.

Combining with simple_packing

When using both encoding: "simple_packing" and filter: "shuffle", the pipeline applies them in order: encode first, then shuffle. The simple_packing output is 1-byte-per-packed-chunk (MSB-first bits), so shuffle_element_size should be 1 in this case (no benefit from shuffling already-packed data). In practice, the combination is unusual — either use simple_packing alone (when quantising float values) or shuffle alone (before a lossless compressor).

Compression

Compression is the third stage of the encoding pipeline. It reduces the total byte count of the already-encoded and filtered payload.

Supported Compressors

| Compressor | Type | Random access | Notes |
|---|---|---|---|
| `none` | Pass-through | Yes (trivial) | No compression |
| `szip` | Lossless | Yes (RSI blocks) | CCSDS 121.0-B-3 via libaec. Best for integer/packed data |
| `zstd` | Lossless | No | Zstandard. Excellent ratio/speed tradeoff |
| `lz4` | Lossless | No | Fastest decompression. Good for real-time pipelines |
| `blosc2` | Lossless | Yes (chunks) | Multi-codec meta-compressor with chunk-level access |
| `zfp` | Lossy | Yes (fixed-rate) | Purpose-built for floating-point arrays |
| `sz3` | Lossy | No | Error-bounded lossy compression for scientific data |

The Compressor Trait

All compressors implement a common interface with three operations:

#![allow(unused)]
fn main() {
pub trait Compressor {
    fn compress(&self, data: &[u8]) -> Result<CompressResult, CompressionError>;
    fn decompress(&self, data: &[u8], expected_size: usize) -> Result<Vec<u8>, CompressionError>;
    fn decompress_range(
        &self,
        data: &[u8],
        block_offsets: &[u64],
        byte_pos: usize,
        byte_size: usize,
    ) -> Result<Vec<u8>, CompressionError>;
}
}

decompress_range enables partial decode without decompressing the entire payload. Compressors that don’t support it return CompressionError::RangeNotSupported.

Lossless Compressors

Szip (libaec)

Szip implements CCSDS 121.0-B-3, a lossless compressor designed for scientific data. It works on integer data and exploits the block structure of packed values.

Random access: Szip records RSI (Reference Sample Interval) block boundaries during encoding. These offsets are stored in metadata as szip_block_offsets, enabling seek-to-block partial decode via decompress_range. When using encode_pre_encoded, the caller must provide these bit-precise block offsets themselves to enable random access (see Pre-encoded Payloads).

| Parameter | Type | Description |
|---|---|---|
| `szip_rsi` | uint | Reference sample interval (samples per RSI block) |
| `szip_block_size` | uint | Block size (typically 8 or 16) |
| `szip_flags` | uint | AEC encoding flags (e.g., AEC_DATA_PREPROCESS) |
| `szip_block_offsets` | array of uint | Bit offsets of RSI block boundaries (computed during encoding) |

Important: libaec encodes integers only. For floating-point data, use either:

  • simple_packingszip (lossy quantization to integers, then compress)
  • shuffleszip (byte rearrangement, then compress as uint8)

Zstd (Zstandard)

General-purpose lossless compression with excellent ratio/speed tradeoff. Widely used and well-optimized.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `zstd_level` | int | 3 | Compression level (1-22). Higher = better ratio, slower |

No random access — decode_range is not supported with zstd.

LZ4

Fastest decompression of any compressor in the library. Slightly lower compression ratio than Zstd, but 3-5x faster to decompress.

No configurable parameters. No random access.

Blosc2

A meta-compressor that splits data into independently-compressed chunks, then stores them in a frame. Supports multiple internal codecs.

Random access: Because each chunk is independent, Blosc2 can decompress only the chunks covering the requested byte range. decompress_range works by mapping byte offsets to chunk indices.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `blosc2_codec` | string | `"lz4"` | Internal codec: blosclz, lz4, lz4hc, zlib, zstd |
| `blosc2_clevel` | int | 5 | Compression level (0-9) |
| `blosc2_typesize` | uint | (auto) | Element byte width for shuffle optimization |

blosc2_typesize is automatically computed from the preceding pipeline stage: dtype byte width for unencoded data, 1 for shuffled bytes, or packed byte width for simple_packing output.

Lossy Compressors

ZFP

Purpose-built compression for floating-point arrays. ZFP compresses data in blocks of 4 elements (1D) and supports three modes:

| Mode | Parameter | Description |
|---|---|---|
| `fixed_rate` | `zfp_rate` (float) | Fixed bits per value. Enables O(1) random access |
| `fixed_precision` | `zfp_precision` (uint) | Fixed number of uncompressed bit planes |
| `fixed_accuracy` | `zfp_tolerance` (float) | Maximum absolute error bound |

Random access: In fixed-rate mode, every block compresses to exactly the same number of bits. This means the byte offset of any block is computable from its index, enabling decompress_range without stored block offsets.

| Parameter | Type | Description |
|---|---|---|
| `zfp_mode` | string | One of `"fixed_rate"`, `"fixed_precision"`, `"fixed_accuracy"` |
| `zfp_rate` | float | Bits per value (only for fixed_rate) |
| `zfp_precision` | uint | Bit planes to keep (only for fixed_precision) |
| `zfp_tolerance` | float | Max absolute error (only for fixed_accuracy) |
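The fixed-rate random-access property is pure arithmetic: every 1-D block of 4 values compresses to the same number of bits, so no seek table is needed. A sketch (illustrative only, with the rate taken as a whole number of bits; this is not the zfp codec itself):

```rust
// In fixed-rate mode a 1-D zfp block holds 4 values at a fixed bit budget,
// so the bit offset of any block is computable from its index alone.
const VALUES_PER_BLOCK: u64 = 4;

fn block_offset_bits(block_index: u64, rate_bits_per_value: u64) -> u64 {
    block_index * VALUES_PER_BLOCK * rate_bits_per_value
}

fn main() {
    // At 8 bits per value, block 10 starts at bit 320, i.e. byte 40.
    assert_eq!(block_offset_bits(10, 8), 320);
    assert_eq!(block_offset_bits(10, 8) / 8, 40);
}
```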

Important: ZFP operates directly on floating-point data. Use encoding: "none" and filter: "none" — ZFP replaces both encoding and compression.

SZ3

Error-bounded lossy compression for scientific data. SZ3 uses prediction-based methods (interpolation, Lorenzo, regression) to achieve high compression ratios within strict error bounds.

| Parameter | Type | Description |
|---|---|---|
| `sz3_error_bound_mode` | string | One of `"abs"`, `"rel"`, `"psnr"` |
| `sz3_error_bound` | float | Error bound value (meaning depends on mode) |

Error bound modes:

  • abs — Absolute error: |original - decompressed| <= bound for every element
  • rel — Relative error: |original - decompressed| / value_range <= bound
  • psnr — Peak signal-to-noise ratio lower bound
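What the abs mode guarantees can be written down as a check (a sketch with made-up numbers, not SZ3 itself):

```rust
// "abs" error-bound mode: every decompressed element must lie within `bound`
// of its original value.
fn within_abs_bound(original: &[f64], decompressed: &[f64], bound: f64) -> bool {
    original
        .iter()
        .zip(decompressed)
        .all(|(o, d)| (o - d).abs() <= bound)
}

fn main() {
    let original = [1.00, 2.00, 3.00];
    let lossy = [1.04, 1.97, 3.02]; // pretend decompressed output
    assert!(within_abs_bound(&original, &lossy, 0.05));
    assert!(!within_abs_bound(&original, &lossy, 0.01));
}
```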

No random access — decode_range is not supported with SZ3.

Important: Like ZFP, SZ3 operates on floating-point data. Use encoding: "none" and filter: "none".

Choosing a Compressor

flowchart TD
    A{"Data type?"}
    A -->|"Integer / packed"| B{"Need random access?"}
    A -->|"Float, lossy OK"| C{"Need random access?"}
    A -->|"Float, lossless"| D{"Speed priority?"}

    B -->|Yes| E["szip"]
    B -->|No| F{"Speed or ratio?"}
    F -->|Speed| G["lz4"]
    F -->|Ratio| H["zstd"]

    C -->|Yes| I["zfp (fixed_rate)"]
    C -->|No| J{"Error bound type?"}
    J -->|"Bits/precision"| K["zfp"]
    J -->|"Absolute/relative"| L["sz3"]

    D -->|"Fastest decompress"| M["lz4"]
    D -->|"Best ratio"| N["blosc2 or zstd"]
    D -->|"Need random access"| O["blosc2"]

    style E fill:#388e3c,stroke:#2e7d32,color:#fff
    style I fill:#388e3c,stroke:#2e7d32,color:#fff
    style O fill:#388e3c,stroke:#2e7d32,color:#fff

| Use case | Recommended | Why |
|---|---|---|
| Quantised floats with partial-access support | `simple_packing` + `szip` | RSI-block random access; interoperable with GRIB 2 CCSDS packing |
| Real-time streaming | `lz4` | Fastest decompression, low latency |
| Archival storage | `zstd` (level 9-15) | Best lossless ratio |
| ML model weights | `blosc2` | Chunk random access, good for large tensors |
| Float fields, lossy OK | `zfp` (fixed_rate) | Best lossy ratio with random access |
| Error-bounded science | `sz3` (abs) | Guaranteed error bounds per element |
| Exact integers | `none` or `lz4` | No information loss |

Invalid Combinations

Some pipeline combinations are rejected at configuration time:

| Combination | Rejected? | Reason |
|---|---|---|
| `zfp` + shuffle | Yes | ZFP operates on typed floats; shuffle rearranges bytes |
| `zfp` + simple_packing | Yes | ZFP is itself the encoding for floats |
| `sz3` + shuffle | Yes | SZ3 operates on typed data |
| `sz3` + simple_packing | Yes | SZ3 is itself a lossy encoding for floats |
| shuffle + `decode_range` | Yes | Byte rearrangement breaks contiguous sample ranges |
| `zstd`/`lz4`/`sz3` + `decode_range` | Yes | Stream compressors don't support partial decode |

tensogram info

Displays a summary of a Tensogram file: number of messages, total file size, and format version.

Usage

tensogram info [FILES]...

Options

| Option | Description |
|---|---|
| `-h, --help` | Print help |

Example

$ tensogram info forecast.tgm
Messages : 48
File size: 1.2 GB
Version  : 1

What it Shows

| Field | Description |
|---|---|
| Messages | Total number of valid messages found by scanning the file |
| File size | Raw byte count of the file on disk |
| Version | Format version from the first message's metadata |

Notes

  • The scan counts only valid messages (those with a matching TENSOGRM header and 39277777 terminator). Corrupted regions are skipped.
  • If the file is empty, Messages: 0 is shown.
  • Version is read from the first message. If messages have different versions, only the first is shown.

tensogram ls

Lists messages in a Tensogram file, showing metadata in tabular or JSON format.

Usage

tensogram ls [OPTIONS] [FILES]...

Options

| Option | Description |
|---|---|
| `-w <WHERE_CLAUSE>` | Where-clause filter (e.g., `mars.param=2t/10u`) |
| `-p <KEYS>` | Comma-separated keys to display |
| `-j` | JSON output |
| `-h, --help` | Print help |

Examples

# List all messages with default columns
tensogram ls forecast.tgm

# Only temperature fields
tensogram ls forecast.tgm -w "mars.param=2t"

# Temperature or wind
tensogram ls forecast.tgm -w "mars.param=2t/10u/10v"

# Exclude ensemble members
tensogram ls forecast.tgm -w "mars.type!=em"

# Show only date and step columns
tensogram ls forecast.tgm -p "mars.date,mars.step"

# JSON output (one object per line, good for jq)
tensogram ls forecast.tgm -j | jq '.["mars.param"]'

Where Clause Syntax

The -w flag accepts a single expression:

key=value           # exact match
key=v1/v2/v3        # OR — matches any of v1, v2, v3
key!=value          # not equal
key!=v1/v2          # not any of v1, v2

Key format: namespace.field for namespaced keys (e.g. mars.param) or just field for top-level keys (e.g. version).

Missing key: For key=value, a missing key is treated as non-matching. For key!=value, a missing key passes the filter.
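The semantics above can be condensed into a small predicate (a hypothetical helper for illustration, not the CLI's actual parser):

```rust
// Where-clause matching: `key=v1/v2` is an OR over the slash-separated values,
// `!=` negates, and a missing key fails `=` but passes `!=`.
fn matches_clause(value: Option<&str>, clause_values: &str, negated: bool) -> bool {
    let hit = match value {
        Some(v) => clause_values.split('/').any(|c| c == v),
        None => false, // a missing key never matches the listed values
    };
    if negated { !hit } else { hit }
}

fn main() {
    assert!(matches_clause(Some("2t"), "2t/10u/10v", false)); // mars.param=2t/10u/10v
    assert!(!matches_clause(Some("msl"), "2t/10u", false));
    assert!(matches_clause(None, "em", true));   // missing key passes key!=em
    assert!(!matches_clause(None, "2t", false)); // missing key fails key=2t
}
```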

Only one -w expression can be specified per command. To apply multiple filters, pipe commands:

tensogram ls forecast.tgm -w "mars.type=fc" | grep "2t"

Pick Keys

The -p flag selects which metadata columns to display. Keys use the same dot-notation as -w:

tensogram ls forecast.tgm -p "mars.date,mars.step,mars.param"

Without -p, all available metadata keys are shown.

Default Table Output

mars.date   mars.step  mars.param  mars.type  shape
20260401    0          2t          fc         [721, 1440]
20260401    0          10u         fc         [721, 1440]
20260401    0          10v         fc         [721, 1440]
20260401    6          2t          fc         [721, 1440]
...

JSON Output

With -j, each matching message is printed as a JSON object on its own line:

{"mars.date": "20260401", "mars.step": "0", "mars.param": "2t", "shape": "[721, 1440]"}
{"mars.date": "20260401", "mars.step": "0", "mars.param": "10u", "shape": "[721, 1440]"}

This is compatible with jq, grep, and any tool that processes newline-delimited JSON.

tensogram dump

Prints the full contents of every message in a Tensogram file — metadata keys and optionally the raw data values.

Usage

tensogram dump [OPTIONS] [FILES]...

Options

| Option | Description |
|---|---|
| `-w <WHERE_CLAUSE>` | Filter messages (e.g. `mars.param=2t`, same syntax as `ls`) |
| `-p <KEYS>` | Comma-separated keys to display |
| `-j` | JSON output |
| `-h, --help` | Print help |

Example

$ tensogram dump forecast.tgm
─── Message 0 ───
version    : 1
mars.class : od
mars.type  : fc
mars.date  : 20260401
mars.step  : 0

  Object 0
  type     : ntensor
  ndim     : 2
  shape    : [721, 1440]
  strides  : [1440, 1]
  dtype    : float32
  mars.param: 2t
  encoding : none
  filter   : none
  compression: none
  hash     : xxh3:a3f0123456789abc

─── Message 1 ───
...

Filtering

Use -w to limit the dump to specific messages:

# Dump only wave spectra
tensogram dump forecast.tgm -w "mars.param=wave_spectra"

JSON Output

With -j, each message is a JSON object:

{
  "message": 0,
  "metadata": {
    "version": 2,
    "base": [
      {
        "mars": {"class": "od", "type": "fc", "date": "20260401", "step": 0, "param": "2t"},
        "_reserved_": {"tensor": {"ndim": 2, "shape": [721, 1440], "strides": [1440, 1], "dtype": "float32"}}
      }
    ]
  },
  "objects": [
    {"type": "ntensor", "ndim": 2, "shape": [721, 1440], "dtype": "float32",
     "encoding": "none", "hash": {"type": "xxh3", "value": "a3f0..."}}
  ]
}

When to Use dump vs ls

  • Use ls for a quick overview of many messages (one line per message)
  • Use dump when you need to see all keys for a specific message, or check encoding parameters

tensogram get

Extracts a single metadata value from messages in a file. Returns an error if the key is missing.

Usage

tensogram get [OPTIONS] -p <KEYS> [FILES]...

Options

| Option | Description |
|---|---|
| `-w <WHERE_CLAUSE>` | Filter messages (e.g. `mars.param=2t`, same syntax as `ls`) |
| `-p <KEYS>` | Comma-separated keys to extract (required) |
| `-h, --help` | Print help |

Examples

# Get the mars.param value from all messages
tensogram get -p mars.param forecast.tgm

# Get the date from messages where param is 2t
tensogram get -p mars.date -w "mars.param=2t" forecast.tgm

# Get the shape of object 0
tensogram get -p shape forecast.tgm

Strict Key Lookup

Unlike ls which shows a blank for missing keys, get exits with a non-zero status if any matching message does not have the requested key:

$ tensogram get -p mars.nonexistent forecast.tgm
Error: key not found: mars.nonexistent

This makes get safe to use in shell scripts where missing data should fail fast.

Multi-Object Messages

For messages with multiple objects, get returns the first matching value found. Lookup checks top-level metadata first and then scans objects in order until it finds a match.

tensogram set

Modifies metadata keys in messages and writes the result to a new file. Matching messages are decoded, their metadata is updated, and they are re-encoded with the original payload bytes and pipeline settings.

Usage

tensogram set [OPTIONS] -s <SET_VALUES> <INPUT> <OUTPUT>

Options

| Option | Description |
|---|---|
| `-s <SET_VALUES>` | Key=value pairs to set (comma-separated) |
| `-w <WHERE_CLAUSE>` | Only modify messages matching this filter (e.g. `mars.param=2t`) |
| `-h, --help` | Print help |

Examples

# Change mars.date to 20260402 in all messages
tensogram set -s mars.date=20260402 input.tgm output.tgm

# Set multiple keys at once
tensogram set -s mars.date=20260402,mars.step=12 input.tgm output.tgm

# Only modify temperature fields
tensogram set -s mars.class=rd -w "mars.param=2t" input.tgm output.tgm

Key=Value Syntax

Multiple mutations can be specified as a comma-separated list:

tensogram set -s key1=val1,key2=val2,key3=val3 in.tgm out.tgm

Keys use dot-notation: mars.param sets the param field inside the mars namespace. A top-level key like experiment sets a top-level metadata field.

Object-level metadata can be updated with objects.<index>.<path>:

# Add object-specific metadata to the first object
tensogram set -s objects.0.processing.version=2 input.tgm output.tgm

Structural/Integrity Keys

The following keys cannot be modified because they describe the physical structure of the payload. Changing them would make the metadata inconsistent with the actual bytes on disk:

| Key | Reason |
|---|---|
| `shape` | Tensor dimensions |
| `strides` | Memory layout |
| `dtype` | Element type |
| `ndim` | Number of dimensions |
| `type` | Object type |
| `encoding` | Encoding algorithm |
| `filter` | Filter algorithm |
| `compression` | Compression algorithm |
| `hash` | Payload integrity hash |
| `szip_rsi` | Szip compression block parameter |
| `szip_block_size` | Szip compression block parameter |
| `szip_flags` | Szip compression flags |
| `szip_block_offsets` | Szip block seek table |
| `reference_value` | Simple packing quantization parameter |
| `binary_scale_factor` | Simple packing quantization parameter |
| `decimal_scale_factor` | Simple packing quantization parameter |
| `bits_per_value` | Simple packing quantization parameter |
| `shuffle_element_size` | Shuffle filter parameter |

Attempting to modify any of these returns an error before any output is written.

Pass-Through for Non-Matching Messages

Messages that do not match the -w filter are copied verbatim to the output file. Their bytes are not re-encoded or re-hashed.

Note: Messages that are modified are re-encoded after the metadata mutation. Because the decoded payload bytes are unchanged, set preserves the original payload hash instead of recomputing it.

Workflow

flowchart TD
    A[Read message] --> B{Matches -w?}
    B -- No --> C[Write raw bytes to output]
    B -- Yes --> D[Decode metadata]
    D --> E[Apply mutations]
    E --> F[Re-encode message\npreserve payload hash]
    F --> G[Write to output]
    C --> H[Next message]
    G --> H

tensogram copy

Copies messages from one file to one or more output files. The output filename can include placeholders that expand to metadata values, allowing a single file to be split by parameter, date, step, or any other key.

Usage

tensogram copy [OPTIONS] <INPUT> <OUTPUT>

Options

| Option | Description |
|---|---|
| `-w <WHERE_CLAUSE>` | Only copy messages that match this filter |
| `-h, --help` | Print help |

Basic Copy

# Copy all messages from one file to another
tensogram copy input.tgm output.tgm

Filename Placeholders

Wrap any metadata key in square brackets to expand it in the output filename:

# One file per parameter
tensogram copy forecast.tgm "by_param/[mars.param].tgm"
# Produces: by_param/2t.tgm, by_param/10u.tgm, by_param/msl.tgm, ...

# One file per date+step combination
tensogram copy forecast.tgm "archive/[mars.date]_[mars.step].tgm"
# Produces: archive/20260401_0.tgm, archive/20260401_6.tgm, ...

# Split by type and param
tensogram copy forecast.tgm "split/[mars.type]/[mars.param].tgm"
# Produces: split/fc/2t.tgm, split/an/2t.tgm, etc.

Multiple messages with the same expanded filename are appended to the same output file. This is how you split-then-concatenate: a 1000-message file with 4 unique mars.param values produces 4 output files with ~250 messages each.
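The expansion rule, including the `unknown` fallback for missing keys, can be sketched as a small standalone function (a hypothetical helper for illustration, not the CLI's code):

```rust
use std::collections::BTreeMap;

// Placeholder expansion sketch: each [key] is replaced by its metadata value;
// placeholders whose key is missing expand to "unknown".
fn expand(template: &str, meta: &BTreeMap<&str, &str>) -> String {
    let mut out = template.to_string();
    for (&key, &value) in meta {
        out = out.replace(&format!("[{key}]"), value);
    }
    while let Some(start) = out.find('[') {
        match out[start..].find(']') {
            Some(rel) => out.replace_range(start..=start + rel, "unknown"),
            None => break, // unbalanced bracket: leave as-is
        }
    }
    out
}

fn main() {
    let meta = BTreeMap::from([("mars.param", "2t"), ("mars.step", "6")]);
    assert_eq!(expand("by_param/[mars.param].tgm", &meta), "by_param/2t.tgm");
    assert_eq!(
        expand("archive/[mars.date]_[mars.step].tgm", &meta),
        "archive/unknown_6.tgm"
    );
}
```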

Filtering During Copy

Combine -w with placeholders for targeted extraction:

# Copy only forecasts, split by step
tensogram copy forecast.tgm "steps/[mars.step].tgm" -w "mars.type=fc"

Edge Cases

Missing Placeholder Key

If a message does not have the key referenced by a placeholder, that placeholder expands to unknown:

# If mars.param is missing, the message is written to by_param/unknown.tgm
tensogram copy forecast.tgm "by_param/[mars.param].tgm"

Output Directory

The output directory must exist before running copy. The command does not create directories. Use mkdir -p beforehand:

mkdir -p by_param
tensogram copy forecast.tgm "by_param/[mars.param].tgm"

Overwriting

If the expanded output filename already exists before the copy starts, it is truncated once and matching messages are then appended in order. This means running copy twice will duplicate messages. To avoid this, delete or rename existing outputs first.

Placeholder Syntax Conflicts

If a metadata value contains /, \, or other characters that are invalid in filenames on your OS, the resulting filename will be invalid. Choose placeholder keys whose values are filesystem-safe (e.g. dates, step numbers, short codes).

tensogram merge

Merge messages from one or more files into a single message.

Usage

tensogram merge [OPTIONS] --output <OUTPUT> [INPUTS]...

Options

| Option | Description |
|---|---|
| `-o, --output <OUTPUT>` | Output file |
| `-s, --strategy <STRATEGY>` | Merge strategy for conflicting metadata keys: `first` — first value wins, `last` — last value wins, `error` — fail on conflict [default: first] |
| `-h, --help` | Print help |

Description

All data objects from all input messages are collected into a single Tensogram message. Global metadata is merged according to --strategy: first (default) keeps the first value, last keeps the last, and error fails on conflict.

Examples

# Merge two files into one
tensogram merge file1.tgm file2.tgm -o merged.tgm

# Merge all messages in a single multi-message file
tensogram merge multi.tgm -o single.tgm

tensogram split

Split multi-object messages into separate single-object files.

Usage

tensogram split --output <OUTPUT> <INPUT>

Options

| Option | Description |
|---|---|
| `-o, --output <OUTPUT>` | Output template (use `[index]` for numbering) |
| `-h, --help` | Print help |

Description

Each data object from each message in the input file becomes its own Tensogram message, inheriting the global metadata.

Output files are named using the template:

  • Use [index] for zero-padded numbering: split_[index].tgmsplit_0000.tgm, split_0001.tgm, …
  • Without [index]: the index is appended before the extension: out.tgmout_0000.tgm, out_0001.tgm, …
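The two naming rules can be sketched as follows (a hypothetical helper mirroring the description, not the CLI's code):

```rust
// Output naming for `tensogram split`: substitute a zero-padded index for
// [index], or insert it before the extension when no placeholder is given.
fn split_name(template: &str, index: usize) -> String {
    let idx = format!("{index:04}");
    if template.contains("[index]") {
        template.replace("[index]", &idx)
    } else if let Some(dot) = template.rfind('.') {
        format!("{}_{}{}", &template[..dot], idx, &template[dot..])
    } else {
        format!("{template}_{idx}")
    }
}

fn main() {
    assert_eq!(split_name("split_[index].tgm", 0), "split_0000.tgm");
    assert_eq!(split_name("out.tgm", 1), "out_0001.tgm");
}
```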

Examples

# Split with index template
tensogram split multi_object.tgm -o 'field_[index].tgm'

# Split with auto-numbered names
tensogram split multi_object.tgm -o output.tgm

tensogram reshuffle

Reshuffle frames: move footer frames to header position.

Usage

tensogram reshuffle --output <OUTPUT> <INPUT>

Options

| Option | Description |
|---|---|
| `-o, --output <OUTPUT>` | Output file |
| `-h, --help` | Print help |

Description

Converts streaming-mode messages (footer-based index and hash frames) into random-access-mode messages (header-based index and hash frames).

This is a decode → re-encode operation. The data is not modified; only the frame layout changes so that index and hash information appears before the data objects, enabling efficient random access.

Examples

tensogram reshuffle streamed.tgm -o random_access.tgm

tensogram validate

Check whether .tgm files are well-formed and intact. Analogous to grib_check or h5check.

Usage

tensogram validate [OPTIONS] <FILES>...

Validation Levels

The command runs up to four validation levels:

| Level | Name | What it checks |
|---|---|---|
| 1 | Structure | Magic bytes, frame headers, ENDF markers, total_length, postamble, frame ordering, preceder legality, preamble flags vs observed frames |
| 2 | Metadata | CBOR parses correctly, required keys present (`_reserved_.tensor`, dtype, shape, strides), encoding/filter/compression types recognized, object count consistency, shape/strides/ndim consistency |
| 3 | Integrity | xxh3 hash in descriptor/hash-frame matches recomputed hash, compressed payloads decompress without error |
| 4 | Fidelity | Full decode succeeds, decoded size matches shape/dtype, NaN/Inf in float arrays are errors |

Modes

| Mode | Levels | Description |
|---|---|---|
| default | 1–3 | Structure + metadata + integrity |
| quick | 1 | Structure only, no payloads |
| checksum | 3 | Hash verification only (structural errors still reported, no decompression) |
| full | 1–4 | All levels including fidelity (NaN/Inf check) |

Level selectors (--quick, --checksum, --full) are mutually exclusive. --canonical is independent and can be combined with any level selector.

All flags

| Flag | Description |
|---|---|
| `--quick` | Quick mode: structure only (level 1) |
| `--checksum` | Checksum only: hash verification (structural errors still reported, but metadata/decompression/fidelity checks skipped) |
| `--full` | Full mode: all levels including fidelity (levels 1-4) |
| `--canonical` | Check RFC 8949 canonical CBOR key ordering (combinable with any level) |
| `--json` | Machine-parseable JSON output |
| `-h, --help` | Print help |

Output

Human-readable (default)

file.tgm: OK (3 messages, 47 objects, hash verified)

On failure:

bad.tgm: FAILED — message 2, object 5: hash mismatch (expected a3f7..., got 91c2...)
bad.tgm: FAILED (1 error, 1 message, 3 objects)

JSON (--json)

[
  {
    "file": "file.tgm",
    "status": "ok",
    "messages": 1,
    "objects": 3,
    "hash_verified": true,
    "file_issues": [],
    "message_reports": [
      {
        "issues": [],
        "object_count": 3,
        "hash_verified": true
      }
    ]
  }
]

On failure, issues within message_reports[i].issues contain (note: object_index is 0-based in JSON; absent fields are omitted, not null):

{
  "code": "hash_mismatch",
  "level": "integrity",
  "severity": "error",
  "object_index": 4,
  "description": "hash mismatch (expected a3f7..., got 91c2...)"
}

Issue codes are stable snake_case strings (e.g. hash_mismatch, invalid_magic, buffer_too_short) suitable for machine parsing.
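Since the codes are stable, a CI step can parse the JSON report directly. A minimal sketch using the report shape documented above (the sample payload is inlined here for illustration):

```python
import json

# Sample --json output in the documented shape (inlined for illustration).
report_text = """
[{"file": "bad.tgm", "status": "failed", "messages": 1, "objects": 3,
  "hash_verified": false, "file_issues": [],
  "message_reports": [{"issues": [{"code": "hash_mismatch",
    "level": "integrity", "severity": "error", "object_index": 4,
    "description": "hash mismatch"}], "object_count": 3,
    "hash_verified": false}]}]
"""

errors = []
for entry in json.loads(report_text):
    for msg_idx, msg in enumerate(entry["message_reports"]):
        for issue in msg["issues"]:
            if issue["severity"] == "error":
                # object_index is 0-based and may be absent entirely
                errors.append((entry["file"], msg_idx,
                               issue.get("object_index"), issue["code"]))

print(errors)  # [('bad.tgm', 0, 4, 'hash_mismatch')]
```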

Exit Code

  • 0 — all files pass validation
  • 1 — one or more files have errors or file-level issues

Batch Mode

tensogram validate data/*.tgm

Validates all files. Reports per-file. Exits 1 if any file fails.

File-level Checks

When validating a file with multiple messages, the command also detects:

  • Unrecognized bytes between messages (garbage or padding)
  • Truncated messages at end of file
  • Trailing bytes after the last valid message

These are reported as file-level issues and cause validation to fail (exit code 1).

Library API

The same validation is available programmatically:

#![allow(unused)]
fn main() -> Result<(), Box<dyn std::error::Error>> {
use std::path::Path;
use tensogram::{validate_message, validate_file, ValidateOptions};

// Validate a single message buffer
let report = validate_message(&bytes, &ValidateOptions::default());
assert!(report.is_ok());

// Validate a file
let file_report = validate_file(Path::new("data.tgm"), &ValidateOptions::default())?;
println!("{} messages, {} objects", file_report.messages.len(), file_report.total_objects());
Ok(())
}

Examples

# Default validation (levels 1-3)
tensogram validate measurements.tgm

# Quick structural check
tensogram validate --quick *.tgm

# Verify checksums only
tensogram validate --checksum archive/*.tgm

# Full validation including NaN/Inf detection (levels 1-4)
tensogram validate --full output.tgm

# Full validation with canonical CBOR check
tensogram validate --full --canonical output.tgm

# Check canonical CBOR encoding
tensogram validate --canonical output.tgm

# JSON output for CI pipelines
tensogram validate --json data/*.tgm

GRIB Import

Tensogram provides tensogram-grib, a dedicated crate for importing GRIB (GRIdded Binary) messages into Tensogram format. GRIB is widely used in operational weather forecasting; this importer lets you bring existing GRIB data into Tensogram pipelines while preserving the full MARS namespace metadata. Conversion is one-way: GRIB → Tensogram.

System Requirement

The ecCodes C library must be installed:

brew install eccodes       # macOS
apt install libeccodes-dev # Debian/Ubuntu

Building

The tensogram-grib crate is excluded from the default workspace build to avoid requiring ecCodes on machines that do not need GRIB import.

# Build the library
cd rust/tensogram-grib && cargo build

# Build CLI with GRIB support
cargo build -p tensogram-cli --features grib

Conversion Modes

Merge All (default)

All GRIB messages are combined into a single Tensogram message with N data objects. ALL MARS keys for each GRIB message are placed into the corresponding base[i] entry independently — there is no common/varying partitioning in the output.

tensogram convert-grib forecast.grib -o forecast.tgm

One-to-One (split)

Each GRIB message becomes a separate Tensogram message with one data object. All MARS keys go into base[0].

tensogram convert-grib forecast.grib -o forecast.tgm --split

Rust API

#![allow(unused)]
fn main() -> Result<(), Box<dyn std::error::Error>> {
use std::path::Path;
use tensogram_grib::{convert_grib_file, ConvertOptions, Grouping};

let options = ConvertOptions {
    grouping: Grouping::MergeAll,
    ..Default::default()
};

let messages = convert_grib_file(Path::new("forecast.grib"), &options)?;
// messages is Vec<Vec<u8>> — each element is a complete Tensogram wire-format message
Ok(())
}

Data Mapping

Source (GRIB) | Target (Tensogram)
Grid values (values key) | Data object payload (float64, little-endian)
Grid dimensions (Ni, Nj) | DataObjectDescriptor.shape as [Nj, Ni]
Reduced Gaussian grids (Ni=0) | Shape [numberOfPoints] (1D)
MARS keys (all, per message) | GlobalMetadata.base[i]["mars"] (each entry independent)
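The shape rule in the mapping above, including the reduced-Gaussian special case, can be sketched as a standalone function (illustrative only; this is not the importer's actual code):

```python
def tensogram_shape(ni, nj, number_of_points):
    """Derive the descriptor shape per the mapping table (sketch)."""
    if ni == 0:
        # Reduced Gaussian grid: row lengths vary, so store a 1-D tensor
        return [number_of_points]
    return [nj, ni]  # row-major [Nj, Ni]

print(tensogram_shape(360, 181, 65160))  # [181, 360]
print(tensogram_shape(0, 0, 88838))      # [88838]
```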

Scope

Only GRIB → Tensogram import is supported. Tensogram → GRIB is out of scope because Tensogram’s N-tensor data model is a superset of GRIB’s 2-D-field model; a faithful down-conversion is often impossible.

See also

  • NetCDF Import — sister importer for NetCDF files; shares the --encoding/--bits/--filter/--compression pipeline flags with convert-grib.
  • Vocabularies — other application vocabularies that can coexist with MARS in the same message.

MARS Key Mapping

The importer reads the following MARS namespace keys from each GRIB message using ecCodes’ read_key_dynamic API.

Keys Extracted

Identification

GRIB Key | Description | Example
class | MARS class | "od" (operational)
type | Data type | "an" (analysis), "fc" (forecast)
stream | Data stream | "oper", "enfo"
expver | Experiment version | "0001"

Parameter

GRIB Key | Description | Example
param | Parameter ID | "2t" (2m temperature)
shortName | Short name | "2t"
name | Full name | "2 metre temperature"
paramId | Numeric ID | 167
discipline | WMO discipline | 0
parameterCategory | WMO category | 0
parameterNumber | WMO number | 0

Vertical

GRIB Key | Description | Example
level | Level value | 500
typeOfLevel | Level type | "isobaricInhPa"
levtype | MARS level type | "pl" (pressure level)

Temporal

GRIB Key | Description | Example
date / dataDate | Reference date | 20260404
time / dataTime | Reference time | 1200
stepRange / step | Forecast step | "0", "6", "0-6"
stepUnits | Step units | 1 (hours)

Spatial

GRIB Key | Description | Example
gridType | Grid type | "regular_ll"
Ni, Nj | Grid dimensions | 360, 181
numberOfPoints | Total grid points | 65160
latitudeOfFirstGridPointInDegrees | First latitude | 90.0
longitudeOfFirstGridPointInDegrees | First longitude | 0.0
latitudeOfLastGridPointInDegrees | Last latitude | -90.0
longitudeOfLastGridPointInDegrees | Last longitude | 359.0
iDirectionIncrementInDegrees | Longitude step | 1.0
jDirectionIncrementInDegrees | Latitude step | 1.0

Other

GRIB Key | Description | Example
bitsPerValue | Packing precision | 16
packingType | GRIB packing | "grid_simple"
centre | Originating centre | "ecmf"
subCentre | Sub-centre | 0
generatingProcessIdentifier | Process ID | 148

Storage in Tensogram

Given N GRIB messages in merge-all mode:

  1. Extract all MARS keys from each message using read_key_dynamic
  2. Store ALL keys for each GRIB message in the corresponding base[i]["mars"] entry independently
  3. There is no common/varying partitioning in the output — each base[i] entry is self-contained
graph TD
    A[N GRIB messages] --> B[Extract MARS keys from each]
    B --> C["Store in base[i] independently"]
    C --> D["base[0]: all keys from GRIB msg 0"]
    C --> E["base[1]: all keys from GRIB msg 1"]
    C --> F["base[N-1]: all keys from GRIB msg N-1"]

If you need to extract commonalities after decoding (e.g. for display), use the compute_common() utility.
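A minimal equivalent of that common/varying split could look like the following sketch. It assumes (but does not guarantee) that the real compute_common() has similar semantics:

```python
def compute_common(bases):
    """Split per-object metadata dicts into a common part and
    per-entry varying parts (sketch of the utility named above)."""
    if not bases:
        return {}, []
    common = {k: v for k, v in bases[0].items()
              if all(b.get(k) == v for b in bases[1:])}
    varying = [{k: v for k, v in b.items() if k not in common}
               for b in bases]
    return common, varying

bases = [
    {"class": "od", "param": "2t", "step": "0"},
    {"class": "od", "param": "2t", "step": "6"},
]
common, varying = compute_common(bases)
print(common)   # {'class': 'od', 'param': '2t'}
print(varying)  # [{'step': '0'}, {'step': '6'}]
```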

Sentinel Handling

ecCodes uses sentinel values for missing keys:

  • String: "MISSING" or "not_found" → skipped
  • Integer: 2147483647 or -2147483647 → skipped
  • Float: NaN or Inf → skipped
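The skip rules can be expressed as a small predicate. This is an illustrative sketch of the documented behaviour, not the importer's actual code:

```python
import math

STRING_SENTINELS = {"MISSING", "not_found"}
INT_SENTINELS = {2147483647, -2147483647}

def is_sentinel(value):
    """True if ecCodes returned a missing-key sentinel (per the list above)."""
    if isinstance(value, str):
        return value in STRING_SENTINELS
    if isinstance(value, bool):
        return False
    if isinstance(value, int):
        return value in INT_SENTINELS
    if isinstance(value, float):
        return math.isnan(value) or math.isinf(value)
    return False

keys = {"param": "2t", "level": 2147483647,
        "step": float("nan"), "centre": "ecmf"}
kept = {k: v for k, v in keys.items() if not is_sentinel(v)}
print(kept)  # {'param': '2t', 'centre': 'ecmf'}
```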

NetCDF Import

Tensogram ships tensogram-netcdf, a dedicated crate for importing NetCDF (both Classic and NetCDF-4) files into Tensogram messages. NetCDF is widely used in climate, ocean, atmospheric, and Earth-observation science, but the importer treats any NetCDF file the same way — the mapping is structural, not domain-specific.

The crate is exposed through the CLI as tensogram convert-netcdf and through a thin Rust library API. Conversion is one-way: NetCDF → Tensogram. There is no Tensogram → NetCDF writer.

System requirement

The NetCDF C library must be installed on your system:

brew install netcdf            # macOS
apt install libnetcdf-dev      # Debian/Ubuntu

The crate transitively pulls in HDF5 (used internally by NetCDF-4 files), so on Debian-family distros you will also need libhdf5-dev.

Building

The tensogram-netcdf crate is excluded from the default workspace build to avoid forcing libnetcdf on every contributor. Build it explicitly:

# Library
cargo build --manifest-path rust/tensogram-netcdf/Cargo.toml

# CLI with NetCDF support
cargo build -p tensogram-cli --features netcdf

The binary then exposes the new subcommand:

tensogram convert-netcdf --help

Quick example

# Convert one file
tensogram convert-netcdf input.nc -o output.tgm

# Convert multiple files into a single output
tensogram convert-netcdf jan.nc feb.nc mar.nc -o q1.tgm

# Stream to stdout (useful for piping)
tensogram convert-netcdf input.nc | tensogram info /dev/stdin

Command-line options

Flag | Default | Description
-o, --output PATH | stdout | Where to write the Tensogram file.
--split-by MODE | file | Grouping mode: file, variable, or record. See Splitting modes.
--cf | off | Extract the CF attribute allow-list into base[i]["cf"]. See CF metadata mapping.
--encoding ENC | none | none or simple_packing.
--bits N | auto (16) | Bits per value for simple_packing (1–64).
--filter FILTER | none | none or shuffle.
--compression CODEC | none | none, zstd, lz4, blosc2, or szip.
--compression-level N | codec default | Level for zstd (1–22) and blosc2 (0–9).

The --encoding/--bits/--filter/--compression/--compression-level flags are the same set used by tensogram convert-grib. Both importers share a PipelineArgs struct so the two commands stay symmetric.

How variables become objects

Each numeric NetCDF variable in the root group is mapped 1:1 to a Tensogram data object. The variable’s name is stored under base[i]["name"], the dtype and shape come from the NetCDF type and dimension list, and the raw bytes become the object payload (always little-endian).

Dtype matrix

NetCDF type | Tensogram Dtype
byte | Int8
ubyte | Uint8
short | Int16
ushort | Uint16
int | Int32
uint | Uint32
int64 | Int64
uint64 | Uint64
float | Float32
double | Float64
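The matrix is a straight lookup; a sketch that also models the skip-with-warning behaviour for unsupported types (illustrative only, not the importer's code):

```python
# The dtype matrix above, as a lookup table.
NETCDF_TO_TENSOGRAM = {
    "byte": "Int8", "ubyte": "Uint8",
    "short": "Int16", "ushort": "Uint16",
    "int": "Int32", "uint": "Uint32",
    "int64": "Int64", "uint64": "Uint64",
    "float": "Float32", "double": "Float64",
}

def map_dtype(nc_type):
    dtype = NETCDF_TO_TENSOGRAM.get(nc_type)
    if dtype is None:
        # char, string, compound, vlen, enum, opaque: no tensor representation
        raise ValueError(f"skipping variable: {nc_type} is not supported")
    return dtype

print(map_dtype("double"))  # Float64
```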

char and string variables, as well as the NetCDF-4 enhanced types (compound, vlen, enum, opaque), are skipped with a warning. They have no clean tensor representation.

Scalar variables

A NetCDF scalar (zero dimensions) becomes an object with ndim = 0, shape = [], and a single value in the payload.

Packed data

Variables with scale_factor or add_offset attributes are unpacked during conversion: the raw integer values are read, multiplied by scale_factor, add_offset is added, and the result is stored as Float64 regardless of the on-disk dtype. This matches the convention used by xarray and most netCDF tooling.

The fill value (_FillValue or missing_value) is replaced with NaN in the unpacked output. The original sentinel is preserved under base[i]["netcdf"]["_FillValue"] so consumers can recover it.
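The unpacking rule can be sketched as follows (illustrative packing parameters; not the importer's actual code):

```python
import math

def unpack(raw, scale_factor=1.0, add_offset=0.0, fill_value=None):
    """Unpack per the rule above: raw * scale_factor + add_offset,
    with the fill-value sentinel replaced by NaN."""
    out = []
    for v in raw:
        if fill_value is not None and v == fill_value:
            out.append(math.nan)
        else:
            out.append(v * scale_factor + add_offset)
    return out

# e.g. an int16-packed temperature with a -32768 fill sentinel
values = unpack([0, 2, -32768], scale_factor=0.5, add_offset=273.0,
                fill_value=-32768)
print(values)  # [273.0, 274.0, nan]
```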

Time coordinates

Time coordinate variables are stored as numeric values (typically Float64) exactly as they appear in the file — Tensogram does not convert them to calendar dates. The CF units string ("days since 1970-01-01") and calendar ("gregorian", "noleap", etc.) are preserved under base[i]["netcdf"] so a consumer can decode them on demand.
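Decoding such a value on demand is straightforward for the Gregorian case. A sketch (non-standard calendars like "noleap" or "360_day" need a CF-aware library such as cftime):

```python
from datetime import datetime, timedelta

def decode_time(value, units):
    """Decode a CF '<unit> since <epoch>' time value (Gregorian only)."""
    unit, _, epoch = units.partition(" since ")
    base = datetime.fromisoformat(epoch)
    seconds = {"seconds": 1, "minutes": 60, "hours": 3600, "days": 86400}[unit]
    return base + timedelta(seconds=value * seconds)

print(decode_time(19358.0, "days since 1970-01-01"))  # 2023-01-01 00:00:00
```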

NetCDF-4 groups

Tensogram extracts only the root group of a NetCDF-4 file. If sub-groups are detected the importer prints a warning to stderr and continues with the root variables. Sub-group support is intentionally out of scope for v1 — most operational datasets keep their data variables at the root anyway.

Splitting modes

The --split-by flag controls how variables are grouped into Tensogram messages.

--split-by=file (default)

All variables from one input file are bundled into a single Tensogram message containing N data objects. This is the most compact representation and is the right choice when you want to keep a NetCDF file as a single logical unit.

tensogram convert-netcdf forecast.nc -o forecast.tgm
# 1 message with N objects

--split-by=variable

Each variable becomes its own one-object Tensogram message. Useful when downstream consumers want to fetch individual variables without decoding the whole file.

tensogram convert-netcdf forecast.nc -o forecast.tgm --split-by variable
# N messages with 1 object each

--split-by=record

Splits along the unlimited (record) dimension. Each step along the unlimited dimension produces a separate message. The unlimited dimension is detected automatically; passing this mode against a file without one is a hard error (NoUnlimitedDimension).

Variables that don’t depend on the unlimited dimension (e.g. a static mask variable) are still included in every output message — that way each record is fully self-describing.

tensogram convert-netcdf timeseries.nc -o timeseries.tgm --split-by record
# 1 message per record
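The grouping logic described above — one message per record, with static variables duplicated into each — can be sketched as follows (hypothetical data layout; not the importer's code):

```python
def split_by_record(variables, unlimited_dim):
    """One message per step along the unlimited dimension; variables
    that don't depend on it are repeated in every message."""
    record_vars = {n: v for n, v in variables.items()
                   if unlimited_dim in v["dims"]}
    static_vars = {n: v for n, v in variables.items()
                   if unlimited_dim not in v["dims"]}
    if not record_vars:
        raise ValueError("NoUnlimitedDimension")
    n_records = len(next(iter(record_vars.values()))["data"])
    messages = []
    for t in range(n_records):
        objects = {n: v["data"][t] for n, v in record_vars.items()}
        objects.update({n: v["data"] for n, v in static_vars.items()})
        messages.append(objects)
    return messages

variables = {
    "t2m":  {"dims": ("time", "y", "x"), "data": [[1, 2], [3, 4]]},  # 2 records
    "mask": {"dims": ("y", "x"),         "data": [0, 1]},            # static
}
msgs = split_by_record(variables, "time")
print(len(msgs), msgs[0])  # 2 {'t2m': [1, 2], 'mask': [0, 1]}
```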

Encoding pipeline flags

The pipeline flags are applied per data object before encoding into the wire format. They use the same names and semantics as convert-grib:

Stage | Flag | Notes
Encoding | --encoding simple_packing --bits N | Lossy quantization. Float64 only — non-f64 variables in the same file are skipped (with a warning) and pass through unencoded so mixed files convert cleanly.
Filter | --filter shuffle | Byte-shuffle filter, sets shuffle_element_size to the post-encoding byte width.
Compression | --compression zstd --compression-level 3 | zstd_level defaults to 3.
Compression | --compression lz4 | No params.
Compression | --compression blosc2 --compression-level 9 | Uses blosc2_codec=lz4 by default.
Compression | --compression szip | Sets szip_rsi=128, szip_block_size=16, szip_flags=8. Requires preceding simple_packing or shuffle because libaec szip caps at 32 bits per sample (raw f64 is 64 bits).

Variables that contain NaN or ±Inf (typically from unpacked _FillValue / missing_value substitution or degenerate arithmetic upstream) cannot be represented by simple_packing — the algorithm’s range / scale-factor derivation has no slot for non-finite values.

The importer hard-fails when --encoding simple_packing is requested on data containing NaN or Inf. The error names the offending variable and suggests recovery options:

error: simple_packing failed for forecast_temperature: NaN value
encountered at index 42. The variable contains NaN or Inf which
cannot be represented by simple_packing. Pre-process the data or
choose a different encoding (e.g. encoding="none").

Recovery options, in order of effort:

  1. Drop the --encoding simple_packing flag AND pass --allow-nan. The default pipeline (encoding="none") combined with the NaN bitmask companion frame round-trips NaN values losslessly. See NaN / Inf Handling.
  2. Substitute non-finite values with an in-band sentinel before conversion if you need simple_packing throughout.
  3. Split the conversion with --split-by variable and re-run per-variable, using --encoding simple_packing only for the variables you know are NaN-free.

Prior behaviour (pre-0.17). The importer used to soft-downgrade NaN-bearing variables to encoding="none" with a stderr warning. That silently hid data-quality problems from automated pipelines; 0.17 surfaces them as hard errors and pairs the fix with the --allow-nan bitmask opt-in (preferred over pre-processing). The non-f64-payload branch (a structural mismatch rather than a data-quality problem) keeps its stderr-warning + fallback behaviour unchanged.

# Pack temperature to 24-bit + zstd
tensogram convert-netcdf --encoding simple_packing --bits 24 \
  --compression zstd --compression-level 3 \
  era5_t2m.nc -o era5_t2m.tgm

# Shuffle + szip on a multi-variable file
tensogram convert-netcdf --filter shuffle --compression szip \
  forecast.nc -o forecast.tgm

CF metadata mapping

NetCDF attributes are always extracted into a netcdf sub-map under each base entry:

base[0]:
  name: "temperature"
  netcdf:
    units: "K"
    long_name: "Air Temperature"
    standard_name: "air_temperature"
    _FillValue: -32768
    add_offset: 273.15
    scale_factor: 0.01
    _global:
      Conventions: "CF-1.10"
      title: "..."
      institution: "..."

When --cf is set, an additional cf sub-map is added containing only the 16 CF allow-list attributes. This duplicate copy makes CF-aware tooling cheaper because it can ignore the verbose netcdf map and rely on a stable, standardised key set.
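Conceptually, the cf map is just the netcdf map filtered through the allow-list. A sketch with a subset of the list (the real constant has 16 entries and lives in rust/tensogram-netcdf/src/metadata.rs):

```python
# Subset of the CF allow-list, for illustration only.
CF_ATTRIBUTES = {"standard_name", "long_name", "units", "calendar",
                 "cell_methods", "coordinates", "axis", "positive"}

def cf_view(netcdf_attrs):
    """Filtered slice of the verbose netcdf map (sketch)."""
    return {k: v for k, v in netcdf_attrs.items()
            if k in CF_ATTRIBUTES and k != "_global"}

attrs = {"units": "K", "long_name": "Air Temperature",
         "standard_name": "air_temperature", "_FillValue": -32768,
         "_global": {"Conventions": "CF-1.10"}}
print(cf_view(attrs))
# {'units': 'K', 'long_name': 'Air Temperature', 'standard_name': 'air_temperature'}
```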

Limitations

  • No NetCDF writer. Conversion is one-way only.
  • No string or char variables. They are skipped with a warning.
  • No NetCDF-4 enhanced types (compound, vlen, enum, opaque).
  • Root group only. Sub-groups are skipped with a warning.
  • No tensogram-python bindings. The Python ecosystem talks to convert-netcdf through subprocess. The library API is Rust-only in v1.
  • simple_packing is f64-only. Mixed-dtype files convert cleanly but only f64 variables get packed.

Library API

If you’d rather call the importer directly from Rust:

#![allow(unused)]
fn main() -> Result<(), Box<dyn std::error::Error>> {
use std::path::Path;
use tensogram_netcdf::{convert_netcdf_file, ConvertOptions, DataPipeline, SplitBy};

let options = ConvertOptions {
    split_by: SplitBy::Variable,
    cf: true,
    pipeline: DataPipeline {
        encoding: "simple_packing".to_string(),
        bits: Some(24),
        compression: "zstd".to_string(),
        compression_level: Some(3),
        ..Default::default()
    },
    ..Default::default()
};

let messages = convert_netcdf_file(Path::new("forecast.nc"), &options)?;
// messages: Vec<Vec<u8>> — each element is a complete wire-format message
Ok(())
}

Note: DataPipeline is defined in tensogram::pipeline and re-exported from both tensogram_netcdf and tensogram_grib. The underlying apply_pipeline helper is the same for both importers, guaranteeing that convert-grib and convert-netcdf produce byte-identical descriptor fields for equivalent flag combinations.

See also

NetCDF CF Metadata Mapping

When tensogram convert-netcdf --cf is set, the importer walks each NetCDF variable and lifts a fixed set of 16 CF Conventions v1.10 attributes into a cf sub-map under the corresponding base[i] entry. The attributes are also still present in the verbose netcdf map alongside every other variable attribute — the cf map is a curated, schema-stable view that CF-aware tooling can rely on.

The allow-list lives in rust/tensogram-netcdf/src/metadata.rs as the constant CF_ATTRIBUTES. If you change the list, update this page to match.

Attributes lifted by --cf

CF Attribute | Tensogram Key | Notes
standard_name | base[i]["cf"]["standard_name"] | CF standard name from the CF Standard Name Table, e.g. "air_temperature", "eastward_wind".
long_name | base[i]["cf"]["long_name"] | Free-form descriptive label, e.g. "2 metre temperature".
units | base[i]["cf"]["units"] | UDUNITS-compliant string, e.g. "K", "m s-1", "days since 1970-01-01".
calendar | base[i]["cf"]["calendar"] | Calendar for time coordinate variables, e.g. "gregorian", "noleap", "360_day".
cell_methods | base[i]["cf"]["cell_methods"] | Aggregation description, e.g. "time: mean", "area: sum".
coordinates | base[i]["cf"]["coordinates"] | Space-separated list of auxiliary coordinate variable names, e.g. "lon lat".
axis | base[i]["cf"]["axis"] | Dimension role flag: "X", "Y", "Z", or "T".
positive | base[i]["cf"]["positive"] | Direction of vertical coordinate: "up" (altitude) or "down" (depth/pressure).
valid_min | base[i]["cf"]["valid_min"] | Minimum valid value for QA/range checks.
valid_max | base[i]["cf"]["valid_max"] | Maximum valid value for QA/range checks.
valid_range | base[i]["cf"]["valid_range"] | Two-element array [min, max] — alternative to valid_min/valid_max.
bounds | base[i]["cf"]["bounds"] | Name of an associated cell-bounds variable (irregular grids).
grid_mapping | base[i]["cf"]["grid_mapping"] | Name of an associated coordinate reference system variable.
ancillary_variables | base[i]["cf"]["ancillary_variables"] | Space-separated list of related ancillary variable names (uncertainty, QA flags, etc.).
flag_values | base[i]["cf"]["flag_values"] | Array of integer flag values for categorical variables.
flag_meanings | base[i]["cf"]["flag_meanings"] | Space-separated list of meanings, paired with flag_values.

That’s 16 attributes — the full CF allow-list as of v0.7.0.

Storage layout

For a CF-compliant temperature variable, the --cf flag produces:

base[0]:
  name: "temperature"
  netcdf:
    units: "K"
    long_name: "2 metre temperature"
    standard_name: "air_temperature"
    _FillValue: -32768
    add_offset: 273.15
    scale_factor: 0.01
    cell_methods: "time: mean"
    _global:
      Conventions: "CF-1.10"
      title: "ERA5 reanalysis"
  cf:
    units: "K"
    long_name: "2 metre temperature"
    standard_name: "air_temperature"
    cell_methods: "time: mean"

The netcdf map is a verbatim dump of every variable attribute (the _global sub-map carries the file-level attributes). The cf map is a filtered slice containing only the allow-listed keys, in the order they appear on the variable.

What is not extracted

The allow-list is intentionally narrow. The following CF concepts are out of scope for v0.7.0 — they are accessible via the verbose netcdf map but not surfaced under cf:

  • Grid mapping variable contents — only the grid_mapping reference is lifted, not the projection parameters of the referenced variable.
  • Coordinate variable contents — coordinate variables are converted to their own data objects, not inlined into other variables’ metadata.
  • Bounds variable contents — only the bounds reference is lifted.
  • Cell measures — cell_measures is not in the allow-list.
  • Climatology bounds — climatology is not lifted.
  • Geometry containers — CF 1.8+ geometries are out of scope.
  • Labels and string-valued auxiliary coordinates — not in the allow-list.
  • Compound coordinates / compress — ragged-array support is out of scope.

If you need these, read the raw NetCDF metadata from base[i]["netcdf"] instead — every original attribute is preserved there, byte-for-byte.

Why a curated allow-list?

Two reasons:

  1. Schema stability. Downstream tooling (xarray engines, dashboards, indexers) wants to rely on a small, fixed key set without having to inspect every NetCDF file’s variable-attribute zoo. The cf map gives them that contract.
  2. Interop friendliness. The 16 allow-listed attributes are the ones that show up in essentially every CF-compliant climate or weather dataset. They are the lingua franca that makes CF data interoperable.

If you have a strong case for adding an attribute, file an issue on the GitHub project and we’ll evaluate it.

Error Handling

Tensogram uses typed errors across all language bindings. Every fallible operation returns a Result (Rust), raises an exception (Python / C++ / TypeScript), or returns an error code (C). No library code panics.

Error Categories

Category | Trigger | Rust | Python | C++ | TypeScript | C Code
Framing | Invalid magic bytes, truncated message, bad terminator | TensogramError::Framing | ValueError | framing_error | FramingError | TGM_ERROR_FRAMING (1)
Metadata | CBOR parse failure, missing required field, schema violation | TensogramError::Metadata | ValueError | metadata_error | MetadataError | TGM_ERROR_METADATA (2)
Encoding | Encoding pipeline failure (e.g. NaN in simple_packing) | TensogramError::Encoding | ValueError | encoding_error | EncodingError | TGM_ERROR_ENCODING (3)
Compression | Decompression failure, unknown codec | TensogramError::Compression | ValueError | compression_error | CompressionError | TGM_ERROR_COMPRESSION (4)
Object | Invalid descriptor, object index out of range | TensogramError::Object | ValueError | object_error | ObjectError | TGM_ERROR_OBJECT (5)
I/O | File not found, permission denied, disk full | TensogramError::Io | OSError | io_error | IoError | TGM_ERROR_IO (6)
Hash Mismatch | Payload integrity check fails on verify_hash=True | TensogramError::HashMismatch | RuntimeError | hash_mismatch_error | HashMismatchError | TGM_ERROR_HASH_MISMATCH (7)
Invalid Arg | NULL pointer or invalid argument at the API boundary | — | ValueError | invalid_arg_error | InvalidArgumentError | TGM_ERROR_INVALID_ARG (8)
Remote | S3 / GCS / Azure / HTTP(S) object-store failure | TensogramError::Remote | OSError | remote_error | RemoteError | TGM_ERROR_REMOTE (10)
Streaming Limit | decodeStream internal buffer exceeded the configured maximum | — | — | — | StreamingLimitError | —

Notes on the TypeScript column:

  • All TypeScript errors extend the abstract TensogramError base class, so a single catch (err) { if (err instanceof TensogramError) … } handles every library-raised error.
  • HashMismatchError in TypeScript additionally carries parsed expected and actual hex digests when the underlying Rust message is in the canonical "hash mismatch: expected X, got Y" form.
  • StreamingLimitError is TS-specific and is raised only from decodeStream when the internal buffer would grow past maxBufferBytes (default 256 MiB).
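The digest extraction mentioned in the second bullet amounts to matching the canonical message form. A sketch of equivalent parsing (the regex is an assumption based on the message shape quoted above):

```python
import re

# Canonical Rust-side message form: "hash mismatch: expected X, got Y"
PATTERN = re.compile(r"hash mismatch: expected ([0-9a-f]+), got ([0-9a-f]+)")

def parse_hash_mismatch(message):
    """Return (expected, actual) hex digests, or None if the message
    is not in the canonical form."""
    m = PATTERN.search(message)
    return (m.group(1), m.group(2)) if m else None

print(parse_hash_mismatch("hash mismatch: expected a3f7, got 91c2"))
# ('a3f7', '91c2')
```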

Error Paths by Operation

Encoding

Input data + metadata dict
  │
  ├─ Missing 'version' ──────────► Metadata error
  ├─ Missing 'type'/'shape'/'dtype' ► Metadata error
  ├─ Unknown dtype string ────────► Metadata error
  ├─ Unknown byte_order ──────────► Metadata error
  ├─ Data size ≠ shape × dtype ───► Metadata error
  ├─ Shape product overflow ──────► Metadata error
  ├─ NaN in simple_packing ───────► Encoding error
  ├─ Inf reference_value ─────────► Metadata error
  ├─ Client wrote _reserved_ ─────► Metadata error (message or base[i])
  ├─ base.len() > descriptors ────► Metadata error (extra entries would be lost)
  ├─ emit_preceders in buffered ──► Encoding error (use StreamingEncoder)
  ├─ Param out of range (i32/u32) ► Metadata error (zstd_level, szip_rsi, etc.)
  ├─ Unknown compression codec ───► Encoding error
  ├─ Compression codec failure ───► Compression error
  └─ File I/O failure ────────────► I/O error

Decoding

Raw bytes
  │
  ├─ No magic bytes / truncated ──► Framing error
  ├─ Bad frame type codes ────────► Framing error
  ├─ Frame total_length overflow ─► Framing error
  ├─ Frame ordering violation ────► Framing error (header→data→footer)
  ├─ cbor_offset out of range ────► Framing error
  ├─ CBOR parse failure ──────────► Metadata error
  ├─ Preceder base ≠ 1 entry ─────► Metadata error
  ├─ Dangling preceder (no obj) ──► Framing error
  ├─ Consecutive preceders ────────► Framing error
  ├─ base.len() > object count ───► Metadata error
  ├─ Object index out of range ───► Object error
  ├─ Shape product overflow ──────► Metadata error
  ├─ Decompression failure ───────► Compression error
  ├─ Decoding pipeline failure ───► Encoding error
  └─ Hash verification mismatch ──► HashMismatch error

File Operations

TensogramFile.open(path)
  │
  ├─ File not found ──────────────► I/O error
  ├─ Permission denied ───────────► I/O error
  └─ Invalid file content ────────► Framing error

TensogramFile.decode_message(index)
  │
  ├─ Index out of range ──────────► Object error / IndexError
  └─ Corrupt message at offset ───► Framing error

Streaming Encoder

StreamingEncoder
  │
  ├─ write_preceder(_reserved_) ──► Metadata error
  ├─ write_preceder twice ─────────► Framing error (no intervening write_object)
  ├─ finish() with pending prec ──► Framing error (dangling preceder)
  ├─ write_object invalid shape ──► Metadata error
  ├─ Encoding pipeline failure ───► Encoding error
  ├─ Variable-length hash algo ───► Framing error (see below)
  └─ I/O write failure ───────────► I/O error

The streaming path writes the frame header before the payload has been hashed, so it needs to know the final CBOR descriptor length up front. This works only when the configured HashAlgorithm produces a digest whose hex representation has a fixed length — currently only Xxh3 (always 16 hex chars). If a future hash algorithm with variable-length output is used, StreamingEncoder::write_object returns TensogramError::Framing before writing any bytes, so the caller’s sink is never corrupted. Use the buffered encode() API for such algorithms.
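The fixed-length requirement boils down to a lookup the encoder can check up front. A sketch of that rule (the table and function are hypothetical; only the Xxh3 entry is grounded in the text above):

```python
# Hex-digest widths for fixed-length hash algorithms (hypothetical table;
# per the text above, only Xxh3 qualifies today, at 16 hex chars).
FIXED_HEX_LEN = {"xxh3": 16}

def descriptor_hash_len(algorithm):
    """Digest width the streaming encoder can reserve in the descriptor."""
    try:
        return FIXED_HEX_LEN[algorithm]
    except KeyError:
        raise ValueError(
            f"{algorithm}: variable-length digest; use the buffered encode() API"
        ) from None

print(descriptor_hash_len("xxh3"))  # 16
```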

CLI Operations

set command
  │
  ├─ Immutable key (shape, dtype) ► Error (cannot modify structural key)
  ├─ _reserved_ namespace ────────► Error (library-managed)
  └─ Invalid object index ────────► Error (out of range)

merge command
  │
  ├─ No input files ──────────────► Error
  ├─ Invalid strategy name ───────► Error
  └─ Conflicting keys (error mode) ► Error (use first/last to resolve)

split command
  │
  └─ Single-object: pass through; multi-object: split per-object base metadata

Importer Operations (convert-grib / convert-netcdf)

Both importer crates (tensogram-grib, tensogram-netcdf) use typed error enums and never panic on invalid or exotic input. Anything the importer can’t represent cleanly is either surfaced as a typed error or skipped with a warning: … line on stderr so the operator can see what was dropped.

tensogram-netcdf errors (rust/tensogram-netcdf/src/error.rs)
  │
  ├─ NetcdfError::Netcdf(netcdf::Error)
  │     Low-level failure from libnetcdf — file missing, permission
  │     denied, format error, truncated file, HDF5 error.
  │
  ├─ NetcdfError::NoVariables
  │     Input file has zero supported numeric variables after skipping
  │     char/string/compound/vlen. Empty files also hit this.
  │
  ├─ NetcdfError::NoUnlimitedDimension { file }
  │     --split-by=record requested but the file has no unlimited
  │     dimension. Contains the file path for diagnostics.
  │
  ├─ NetcdfError::UnsupportedType { name, reason }
  │     Variable has a type we can't represent (e.g. compound,
  │     enum, opaque, vlen). Currently only the char / string
  │     variants hit this path — the other complex types are
  │     downgraded to a stderr warning and skipped because they
  │     frequently coexist with valid numeric variables.
  │
  ├─ NetcdfError::InvalidData(String)
  │     Catch-all for:
  │       - low-level read errors on a specific variable
  │       - unknown --encoding / --filter / --compression names
  │       - simple_packing compute_params failures on edge-case data
  │       - extract_variable_record invariant violations (should be
  │         unreachable; if it fires the importer is buggy)
  │
  ├─ NetcdfError::Encode(String)
  │     tensogram rejected the pipeline. Common cause:
  │     szip on raw f64 (bits_per_sample=64 exceeds libaec's
  │     32-bit cap). Fix: add --filter shuffle or --encoding
  │     simple_packing first.
  │
  └─ NetcdfError::Io(std::io::Error)
        Reserved for future use — the current importer reads
        through libnetcdf and writes through the CLI wrapper, so
        stdlib I/O errors don't currently reach this variant.

Soft warnings (stderr, exit 0):

warning: {file}: sub-groups found; only root-group variables are converted
warning: skipping variable '{name}': Char variables are not supported
warning: skipping variable '{name}': complex type Compound(_) is not supported
warning: skipping simple_packing for variable '{name}' (not a float64 payload)
warning: variable '{name}': failed to read attribute '{attr}': {cause}
warning: failed to read global attribute '{name}': {cause}

Note: NaN/Inf in a variable that targets simple_packing now hard-fails the conversion (see NetCDF Importer — simple_packing on Mixed-dtype Files below). The previous “warning: skipping simple_packing … NaN value encountered” line no longer fires; that case is an error rather than a warning.

The last two lines above are rare — they only fire on corrupt attribute values or unsupported upstream AttributeValue variants — but they surface instead of dropping data silently so operators can trace unexpected missing metadata.

tensogram-grib errors (rust/tensogram-grib/src/error.rs)
  │
  ├─ GribError::Eccodes(String) — ecCodes C library error
  ├─ GribError::NoMessages — empty GRIB file
  ├─ GribError::MissingKey — required ecCodes/MARS namespace key absent
  ├─ GribError::InvalidShape — grid dimension mismatch
  └─ GribError::Encode — tensogram encode failure

Language-Specific Patterns

Rust

#![allow(unused)]
fn main() {
use tensogram::{decode, DecodeOptions, TensogramError};

match decode(&buffer, &DecodeOptions::default()) {
    Ok((meta, objects)) => { /* use data */ }
    Err(TensogramError::Framing(msg)) => eprintln!("bad format: {msg}"),
    Err(TensogramError::HashMismatch { expected, actual }) =>
        eprintln!("integrity: {expected} ≠ {actual}"),
    Err(e) => eprintln!("error: {e}"),
}
}

Python

import tensogram

# Decode errors
try:
    msg = tensogram.decode(buf, verify_hash=True)
except ValueError as e:
    # Framing, Metadata, Encoding, Compression, Object errors
    print(f"decode failed: {e}")
except RuntimeError as e:
    # Hash verification mismatch
    print(f"integrity error: {e}")
except OSError as e:
    # File I/O and Remote (S3/GCS/Azure/HTTP) errors
    print(f"I/O error: {e}")

# File errors
try:
    f = tensogram.TensogramFile.open("missing.tgm")
except OSError:
    print("file not found")

# Index errors
with tensogram.TensogramFile.open("data.tgm") as f:
    try:
        msg = f[999]
    except IndexError:
        print("message index out of range")

# Packing errors
try:
    tensogram.compute_packing_params(nan_array, 16, 0)
except ValueError as e:
    print(f"NaN rejected: {e}")

C++

#include <tensogram.hpp>

try {
    auto msg = tensogram::decode(buf, len);
} catch (const tensogram::framing_error& e) {
    // Invalid message structure
    std::cerr << "framing: " << e.what() << " (code " << e.code() << ")\n";
} catch (const tensogram::hash_mismatch_error& e) {
    // Payload integrity failure
    std::cerr << "hash: " << e.what() << "\n";
} catch (const tensogram::error& e) {
    // Any Tensogram error (base class)
    std::cerr << "error: " << e.what() << "\n";
}

C

#include "tensogram.h"

tgm_message* msg = tgm_decode(buf, len, 0);
if (!msg) {
    tgm_error code = tgm_last_error_code();
    const char* message = tgm_last_error();
    fprintf(stderr, "%s (%d): %s\n",
            tgm_error_string(code), code, message);
}

Note: tgm_last_error() returns a thread-local string valid until the next FFI call on the same thread. Copy it if you need to keep it.

TypeScript

Every error thrown by @ecmwf/tensogram is an instance of the abstract TensogramError base class. The concrete subclasses match the Rust variants one-to-one, plus a TS-specific InvalidArgumentError and StreamingLimitError.

import {
  decode,
  TensogramError,
  FramingError,
  HashMismatchError,
  ObjectError,
  StreamingLimitError,
} from '@ecmwf/tensogram';

try {
  const { metadata, objects } = decode(buf, { verifyHash: true });
  // ...
} catch (err) {
  if (err instanceof HashMismatchError) {
    // Structured fields are parsed from the Rust-side message.
    console.error('integrity failure:', err.expected, err.actual);
  } else if (err instanceof FramingError) {
    console.error('bad wire format:', err.message);
  } else if (err instanceof ObjectError) {
    console.error('object index error:', err.message);
  } else if (err instanceof TensogramError) {
    console.error('tensogram error:', err.name, err.message);
  } else {
    throw err;
  }
}

All concrete classes expose:

  • err.rawMessage — the untruncated string from the WASM / Rust side, including any error-variant prefix ("framing error: ...").
  • err.message — the human-readable message with the prefix stripped.
  • err.name — stable string name ("FramingError", etc.).

HashMismatchError additionally exposes parsed expected and actual hex digests when the underlying message follows the canonical "hash mismatch: expected X, got Y" form.

Streaming decode does not throw on a single corrupt message — the iterator skips and continues. Register an onError callback to observe the skips:

import { decodeStream, StreamingLimitError } from '@ecmwf/tensogram';

try {
  for await (const frame of decodeStream(res.body!, {
    maxBufferBytes: 64 * 1024 * 1024,
    onError: ({ message, skippedCount }) => {
      console.warn(`skipped corrupt message (#${skippedCount}): ${message}`);
    },
  })) {
    render(frame.descriptor.shape, frame.data());
    frame.close();
  }
} catch (err) {
  if (err instanceof StreamingLimitError) {
    // Stream exceeded maxBufferBytes; configure a larger limit or split.
  } else {
    throw err;
  }
}

Note: decodeStream does throw for infrastructure-level failures (buffer limit exceeded, AbortSignal fired, non-ReadableStream input). Only per-message corruption is routed through onError.

Common Error Scenarios

Garbage or Truncated Input

Any non-Tensogram bytes passed to decode() produce a Framing error. The decoder looks for the 8-byte magic TENSOGRM and a matching terminator.

Hash Mismatch After Corruption

v3 note. Frame-level integrity moved from the decoder to the validator. verify_hash=True (Python DecodeOptions) or TGM_DECODE_VERIFY_HASH (C) is retained for source compatibility but is a no-op on the decode path in v3.

To detect corruption in a v3 message, run the message through tensogram validate --checksum (CLI), validate_message (Rust), tgm_validate (C), or the equivalent Python / TypeScript helpers. The validator:

  1. Walks every frame and recomputes the xxh3-64 of its body (payload + masks + CBOR; cbor_offset, the hash slot, and ENDF are excluded — see plans/WIRE_FORMAT.md §2.4).
  2. Compares the recomputed digest to the inline hash slot at frame_end − 12. A mismatch emits a HashMismatch validation issue carrying the expected and actual hex values plus the frame offset.
  3. When both a HeaderHash and a FooterHash aggregate frame are present, cross-checks them against each other and against the inline slots. Disagreement also surfaces as a HashMismatch.
  4. An UnknownHashAlgorithm warning fires when the aggregate HashFrame.algorithm is not "xxh3" — the inline slots are still verified (they’re authoritative); only the aggregate’s algorithm identifier is advisory.

Messages encoded with hash_algorithm=None clear the HASHES_PRESENT preamble flag and leave every inline slot at 0x00…00. On such messages, validate --checksum emits NoHashAvailable at warning level and cannot detect corruption beyond structural errors — re-encode with hash_algorithm = Some(Xxh3) to enable integrity checking.

Object Index Out of Range

Accessing decode_object(buf, index=N) where N ≥ number of objects produces an Object error (Rust/C/C++) or ValueError (Python). File indexing file[N] raises IndexError for out-of-range N.

NaN / Inf in Simple Packing

compute_packing_params() rejects both NaN and ±Inf values with a ValueError that includes the index of the first offending sample. simple_packing’s scale-factor derivation has no meaningful value for non-finite input — rejecting them up front prevents the silent corruption path where an i32::MAX-saturated binary_scale_factor decodes to NaN everywhere.

0.17+ extends this contract to every pipeline: encoding="none" (and every compressor) rejects NaN / ±Inf input by default. The NaN / Inf Handling guide covers the allow_nan / allow_inf opt-in that substitutes non-finite values with 0.0 and records their positions in a bitmask companion section.

File Not Found / Permission Denied

TensogramFile.open() raises OSError (Python), io_error (C++), or returns TGM_ERROR_IO (C) for any file system failure.

NetCDF Importer — --split-by=record on Files Without Unlimited Dim

tensogram convert-netcdf --split-by record foo.nc where foo.nc has no unlimited dimension hard-errors with NetcdfError::NoUnlimitedDimension { file } (exit code 1). The error message includes the path so the caller can identify which file in a multi-input batch triggered it.

NetCDF Importer — simple_packing on Mixed-dtype Files

--encoding simple_packing is f64-only by design. Mixed files (a typical CF temperature file has f32 lat/lon coordinates alongside f64 data) are handled gracefully: non-f64 variables emit a stderr warning and pass through with encoding="none", and the conversion overall succeeds.

NaN or Inf in a targeted f64 variable is now a hard error (0.17+). The importer fails with NetcdfError::InvalidData("simple_packing failed for {var}: ...") and a recovery hint, rather than silently downgrading the variable to encoding="none". Pre-0.17 soft-downgrade hid data-quality problems; the new behaviour surfaces them at conversion time. Callers relying on the old fallback should either pick a non-simple_packing encoding up front, opt into the NaN / Inf bitmask companion via --allow-nan / --allow-inf (see NaN / Inf Handling), pre-process NaN / Inf out of the data, or use --split-by variable and choose per-variable encodings.

NetCDF Importer — Unknown Codec Name

--encoding foo, --filter bar, --compression baz all hard-error with NetcdfError::InvalidData listing the expected values. The pre-validation fires inside apply_pipeline so the error surfaces immediately, before any data is read from disk.

NetCDF Importer — szip on Raw f64

libaec szip caps at 32 bits per sample, but raw f64 gives bits_per_sample = 64, so --compression szip on unencoded f64 produces a low-level aec_encode_init failed error from tensogram wrapped in NetcdfError::Encode. Fix:

  • Combine with --encoding simple_packing --bits N (N ≤ 32), or
  • Combine with --filter shuffle (which makes the element size 8 bits).
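The shuffle fix works because a byte shuffle transposes the buffer so that byte k of every f64 is grouped together, making the compressor's effective element size one byte. A minimal illustrative reimplementation of the idea (not the tensogram filter itself):

```python
import struct

def shuffle(data: bytes, elem_size: int) -> bytes:
    """Byte-shuffle: group byte k of every element together (k = 0..elem_size-1)."""
    n = len(data) // elem_size
    out = bytearray(len(data))
    for k in range(elem_size):
        for i in range(n):
            out[k * n + i] = data[i * elem_size + k]
    return bytes(out)

def unshuffle(data: bytes, elem_size: int) -> bytes:
    """Inverse transpose of shuffle()."""
    n = len(data) // elem_size
    out = bytearray(len(data))
    for k in range(elem_size):
        for i in range(n):
            out[i * elem_size + k] = data[k * n + i]
    return bytes(out)

raw = struct.pack("<4d", 1.0, 2.0, 3.0, 4.0)   # four f64 values, 8 bytes each
assert unshuffle(shuffle(raw, 8), 8) == raw     # the transpose round-trips losslessly
```

Because the shuffled streams are homogeneous byte planes, they also tend to compress better on smooth scientific data.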

Unknown Hash Algorithm (Forward Compatibility)

When the decoder encounters a hash algorithm string it doesn’t recognize (e.g. a future "sha256" hash), it logs a warning via tracing::warn! and skips verification rather than failing. This ensures forward compatibility: older decoders can still read messages produced by newer encoders that use new hash algorithms.

No-Panic Guarantee

All Rust library code in tensogram, tensogram-encodings, and tensogram-ffi is free from panic!(), unwrap(), expect(), todo!(), and unimplemented!() in non-test code paths. The library guarantees:

  • All fallible operations return Result<T, TensogramError>.
  • Integer arithmetic uses checked operations (checked_mul, try_from) to prevent overflow and truncation.
  • u64 → usize conversions use usize::try_from() to prevent truncation on 32-bit platforms.
  • Array indexing is guarded by prior bounds checks.
  • FFI boundary code returns error codes instead of panicking, and uses unwrap_or_default() only for CString::new() (interior null fallback).
  • The scan functions (scan, scan_file) tolerate truncation of total_length as usize because the subsequent bounds check catches it.
  • The hash-while-encoding pipeline (PipelineConfig.compute_hash = true plus the streaming encoder’s inline-hash path) verifies its CBOR-length invariant before writing any bytes and surfaces a TensogramError::Framing if a variable-length hash algorithm is ever configured — the caller’s sink is never left in a partial-write state on that specific failure mode. Internal debug assertions guard against non-deterministic CBOR serialisation during development.

Edge Cases

A collection of non-obvious situations and how the library handles them.

Corrupted Messages

What happens: The scanner (scan()) searches for TENSOGRM magic bytes and validates the postamble (last 8 bytes should be 39277777). If total_length is set, the scanner checks for the end magic at the expected position.

Recovery: If a message fails validation, the scanner skips one byte and resumes searching. A single corrupted message in a multi-message file does not prevent reading the others.

#![allow(unused)]
fn main() {
let offsets = scan(&file_bytes);
// offsets only contains valid (start, length) pairs
// Corrupted regions are silently skipped
}

Edge case within edge case: If a random byte sequence inside a valid payload happens to match TENSOGRM, the scanner might try to parse a “message” starting mid-payload. The postamble cross-check catches this: the false start’s postamble won’t contain the expected 39277777 end magic.
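The skip-one-byte resync loop can be sketched as follows. The frame layout here is deliberately simplified (magic, then an 8-byte big-endian total length, then payload, ending in the end magic); the real preamble and postamble layouts differ, so treat this purely as an illustration of the resync strategy:

```python
MAGIC = b"TENSOGRM"
END_MAGIC = b"39277777"  # end-magic byte string, as described above (illustrative)

def scan(buf: bytes):
    """Return (offset, length) for every plausible message; on a false start,
    advance one byte and resume searching."""
    offsets, pos = [], 0
    while True:
        start = buf.find(MAGIC, pos)
        if start < 0:
            return offsets
        header = buf[start + 8 : start + 16]
        if len(header) == 8:
            total = int.from_bytes(header, "big")
            end = start + total
            # Cross-check: a real message ends with the end magic at the
            # position implied by total_length.
            if 16 <= total and end <= len(buf) and buf[end - 8 : end] == END_MAGIC:
                offsets.append((start, total))
                pos = end
                continue
        pos = start + 1  # magic matched mid-payload: resync byte-wise

msg = MAGIC + (24).to_bytes(8, "big") + END_MAGIC
assert scan(b"junk" + msg + b"xx" + msg) == [(4, 24), (30, 24)]
```

A payload that happens to contain `TENSOGRM` fails the postamble cross-check and costs only a one-byte restart.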

NaN in Simple Packing

Simple packing cannot represent NaN. The quantization formula maps the range [min, max] onto integers, and NaN has no defined place in this range.

What happens: compute_params() returns PackingError::NanValue(index) if any value is NaN. The encode() function also rejects NaN inputs before packing.

Solution: Replace NaN values with a sentinel (e.g. the minimum representable value, or a separate bitmask object) before encoding.
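The bitmask-companion approach can be sketched with hand-rolled helpers (not the library API): record NaN positions in a bitmask, substitute a finite sentinel before encoding, and restore after decode.

```python
import math

def split_nans(values, sentinel=0.0):
    """Replace NaNs with a finite sentinel; record positions in an LSB-first bitmask."""
    mask = bytearray((len(values) + 7) // 8)
    clean = []
    for i, v in enumerate(values):
        if math.isnan(v):
            mask[i // 8] |= 1 << (i % 8)
            clean.append(sentinel)
        else:
            clean.append(v)
    return clean, bytes(mask)

def restore_nans(values, mask):
    """Put NaN back at every position flagged in the bitmask."""
    return [math.nan if mask[i // 8] >> (i % 8) & 1 else v
            for i, v in enumerate(values)]

clean, mask = split_nans([1.0, math.nan, 3.0])
assert clean == [1.0, 0.0, 3.0]
restored = restore_nans(clean, mask)
assert math.isnan(restored[1]) and restored[0] == 1.0
```

This is the same shape of mechanism the allow_nan / allow_inf opt-in uses: a clean data object plus a bitmask companion carrying the non-finite positions.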

Inf in Simple Packing — Silent Corruption

Subtle gotcha — simple_packing’s compute_params scans for NaN but not for Inf. Passing [1.0, +Inf, 3.0]:

  • range = max - min = +Inf, which produces binary_scale_factor = i32::MAX (saturating cast from Inf as i32).
  • Encoding yields all-zero packed integers.
  • Decoding reconstructs NaN at every position (because Inf × 0 = NaN in IEEE 754).

Net effect: every decoded value silently becomes NaN.

Mitigation: turn on strict-finite encoding (see docs). It catches Inf upstream of the simple_packing encoder and fails with a clean EncodingError before the corruption path runs.

Also: extract_simple_packing_params catches a non-finite reference_value in the descriptor, so callers going through the high-level encode() API are protected when the computed reference happens to be ±Inf (e.g. data like [1.0, -Inf]). But for data like [1.0, +Inf, 3.0] the reference is 1.0 (finite) and only binary_scale_factor overflows — that’s not caught without the strict flag.
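The failure arithmetic above can be reproduced with plain floats, with math.inf standing in for the saturated 2^binary_scale_factor:

```python
import math

reference_value = 1.0      # finite, so the reference_value check passes
packed = 0                 # the all-zero packed integers
scale = math.inf           # 2.0 ** binary_scale_factor after saturation

decoded = reference_value + packed * scale   # 0 * inf = NaN in IEEE 754
assert math.isnan(decoded)                   # every decoded value is silently NaN
```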

Decode Range on Compressed Data

decode_range() supports partial range decode for compressors that have random access capability: szip (via RSI block offsets), blosc2 (via chunk-based access), and zfp fixed-rate mode. Stream compressors (zstd, lz4, sz3) return CompressionError::RangeNotSupported.

Workaround for stream compressors: Decode the full object with decode_object() and slice the result in memory.

Bitmask Byte Width

Dtype::Bitmask returns 0 from byte_width(). This is a sentinel, not a real byte width.

Why: A bitmask of N elements occupies ceil(N / 8) bytes. The library cannot infer N from the byte width alone, so the “element size” concept doesn’t apply. Callers that need the payload size must compute it from the element count.

#![allow(unused)]
fn main() {
let num_elements: u64 = descriptor.shape.iter().product();
let payload_bytes = if descriptor.dtype == Dtype::Bitmask {
    let n = usize::try_from(num_elements)?;
    (n + 7) / 8
} else {
    let n = usize::try_from(num_elements)?;
    n * descriptor.dtype.byte_width()
};
}

verify_hash on Messages Without Hashes

If a message was encoded with hash_algorithm: None (no hash), and you decode it with verify_hash: true, the decoder silently skips hash verification for that object. No error is returned.

Rationale: The absence of a hash is not an error. The decoder cannot verify what was never stored. If you need to enforce that all messages have hashes, check descriptor.hash.is_some() after decoding.

Constant-Value Fields with simple_packing

If all values in a field are identical (range = 0), compute_params() sets binary_scale_factor such that all packed integers are 0, and the full value is recovered from reference_value alone. This is correct and handled without special cases.

Very Short Buffers

Passing a buffer shorter than the preamble size (24 bytes) to any decode function returns TensogramError::Framing("buffer too short ..."). No panic.

Object Index Out of Range

decode_object(&message, 99, &options) when the message has fewer than 100 objects returns TensogramError::Object("object index N out of range").

Empty Files

TensogramFile::message_count() returns 0. read_message(0) returns an error.

CBOR Key Ordering

The library uses canonical CBOR key ordering (RFC 8949 §4.2). If you construct a GlobalMetadata struct with keys in one order and then check the CBOR bytes, the bytes may not match your insertion order. This is intentional and correct — it ensures deterministic output.

If you need to compare metadata across languages or implementations, always compare the decoded values, not the raw CBOR bytes from different encoders.

You can verify that any CBOR output is canonical using the verify_canonical_cbor() utility:

#![allow(unused)]
fn main() {
use tensogram::verify_canonical_cbor;

let cbor_bytes = /* ... */;
verify_canonical_cbor(&cbor_bytes)?; // Returns Ok(()) if canonical, Err if not
}
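For text-string keys, the RFC 8949 deterministic order compares the encoded key bytes, which for short UTF-8 strings amounts to length-first, then bytewise order. A sketch of that comparison (illustrative; the library applies it internally when serialising metadata):

```python
def canonical_key_order(keys):
    """Approximate RFC 8949 §4.2 deterministic map-key order for short UTF-8
    text keys: shorter encoded keys sort first; equal lengths compare bytewise."""
    return sorted(keys, key=lambda k: (len(k.encode()), k.encode()))

# Insertion order does not survive canonical encoding:
assert canonical_key_order(["stream", "date", "expver", "class"]) == \
       ["date", "class", "expver", "stream"]
```

This is why byte-comparing CBOR from two encoders only works if both are canonical; comparing decoded values is always safe.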

Frame Ordering Violations

The decoder validates that frames appear in the expected order: header frames first, then data object frames, then footer frames. A message with frames out of order (e.g. a header metadata frame appearing after a data object frame) is rejected with TensogramError::Framing.

This catches malformed or tampered messages. Valid messages produced by the encoder always have correct ordering.

Streaming Mode (total_length = 0)

When encoding for a non-seekable output (e.g. TCP socket), the preamble’s total_length is set to 0. In this mode:

  • Header index and header hash frames are omitted (the encoder doesn’t know the data object count or offsets upfront).
  • The footer must contain at least the metadata frame.
  • The first_footer_offset in the postamble points to the first footer frame.

Decoders that encounter total_length = 0 should read from the postamble backward to find the footer frames, then use the footer index (if present) for random access to data objects.

The postamble’s first_footer_offset field always points to a valid position:

  • If footer frames exist: it points to the start of the first footer frame.
  • If no footer frames exist: it points to the start of the postamble itself.

This invariant means decoders can always seek to first_footer_offset and determine whether they’ve landed on a footer frame or the postamble.

Inter-Frame Padding

The encoder may insert padding bytes between frames for memory alignment (e.g. 64-bit alignment). Padding appears between the ENDF marker of one frame and the FR marker of the next. Decoders should scan for the FR marker rather than assuming frames are contiguous.
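A decoder-side skip can be sketched as: after consuming a frame's ENDF marker, advance past any padding until the next FR marker. The marker strings follow the text above; the exact byte values are illustrative.

```python
def next_frame_offset(buf, pos):
    """From `pos` (just past an ENDF marker), find the next FR frame marker,
    skipping any alignment padding; return None when no further frame exists."""
    idx = buf.find(b"FR", pos)
    return idx if idx >= 0 else None

buf = b"...ENDF" + b"\x00" * 5 + b"FRnext-frame"
assert next_frame_offset(buf, 7) == 12   # 5 padding bytes skipped
```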

Zero-Element Tensors

Shapes containing a zero-sized dimension are valid: shape: [0], shape: [3, 0, 5]. This matches numpy and PyTorch semantics, where zero-element tensors are legitimate objects (e.g. an empty batch). The encoded payload for a zero-element tensor is zero bytes.

Scalar Tensors

shape: [] (empty shape, ndim: 0) represents a scalar tensor containing exactly one element. The payload size equals dtype.byte_width() bytes.
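Both rules fall out of the empty product: the element count is the product over the shape, and the product of an empty sequence is 1. A sketch, with byte widths assumed for illustration:

```python
import math

BYTE_WIDTH = {"float32": 4, "float64": 8, "int64": 8}  # illustrative subset

def payload_bytes(shape, dtype):
    """Payload size = element count (product over shape) x dtype byte width."""
    return math.prod(shape) * BYTE_WIDTH[dtype]

assert payload_bytes([], "float64") == 8        # scalar: math.prod([]) == 1
assert payload_bytes([3, 0, 5], "float32") == 0  # zero-element tensor: empty payload
assert payload_bytes([10, 20], "float32") == 800
```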

Metadata-Only Messages

A message with zero data objects is valid. This can be used to transmit metadata without any tensor data (e.g. coordination signals, timestamps, provenance records). Both encode() with an empty descriptors slice and StreamingEncoder with no write_object() calls produce valid messages.

Mixed Dtypes in One Message

Multiple data objects in the same message may have different dtypes. For example, a Float32 tensor paired with a Bitmask object used as a missing-data mask. Each object’s pipeline (encoding, filter, compression) is configured independently.

Bitmask with Encoding/Compression

Bitmask data is internally packed into uint8 bytes. Any encoding or compression pipeline that supports uint8 should work with bitmask data. The total bit count N must be stored separately (in the shape), since it cannot be recovered from the byte count: several values of N map to the same ceil(N / 8) bytes.
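A sketch of the pack/unpack pair makes the ambiguity concrete (bit order here is LSB-first within each byte, an assumption for illustration):

```python
def pack_bits(bits):
    """Pack booleans into ceil(N / 8) bytes, LSB-first within each byte."""
    out = bytearray((len(bits) + 7) // 8)
    for i, b in enumerate(bits):
        if b:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)

def unpack_bits(data, n):
    """`n` must come from the shape: 3 bits and 8 bits both pack into 1 byte."""
    return [bool(data[i // 8] >> (i % 8) & 1) for i in range(n)]

bits = [True, False, True]
packed = pack_bits(bits)
assert len(packed) == 1                  # ceil(3 / 8) = 1 byte
assert unpack_bits(packed, 3) == bits    # without n, trailing pad bits are ambiguous
```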

Strides Validation

Strides are validated for length: strides.len() must match shape.len(). Non-contiguous strides (e.g. shape: [4, 4], strides: [8, 1]) are accepted — they indicate a view into a larger array and are semantically valid.

Version Constraints

  • version: 0 and version: 1 are deprecated and must be rejected by the decoder.
  • version: 2 is the current version.
  • Higher versions (3+) are reserved for future use; current decoders reject them until those versions are defined.

NaN/Infinity in Simple Packing Parameters

If reference_value is NaN or Infinity, encoding fails immediately with a clear error. This value is used in the quantization formula and would produce corrupt output. (binary_scale_factor and decimal_scale_factor are integers and cannot be NaN/Infinity.)

Duplicate CBOR Keys

Duplicate keys at the same level in a CBOR map are never accepted. The library uses canonical CBOR (RFC 8949 §4.2) which inherently rejects duplicate keys. Same-name keys at different nesting levels are acceptable: base[0]["foo"] and _extra_["foo"] are distinct keys.

Unknown Hash Algorithm on Decode

If a message contains a hash with an algorithm the decoder doesn’t recognize (e.g. "sha256" when only xxh3 is implemented), verify_hash: true issues a warning and skips verification rather than returning an error. This ensures forward compatibility when new hash algorithms are added.

decode_range with Empty Ranges

Calling decode_range() with an empty ranges slice (&[]) returns (descriptor, vec![]) — the parts vector is empty. This is not an error.

Preceder Metadata Error Paths

The decoder validates PrecederMetadata frames strictly:

  • Consecutive preceders without a DataObject → Framing: “PrecederMetadata must be followed by a DataObject frame, got {type}”
  • Dangling preceder (no DataObject follows) → Framing: “dangling PrecederMetadata: no DataObject frame followed”
  • Base has 0 or 2+ entries → Metadata: “PrecederMetadata base must have exactly 1 entry, got {n}”
  • More metadata base entries than data objects → Metadata: “metadata base has {n} entries but message contains {m} objects”

On the encoder side:

  • StreamingEncoder::write_preceder() errors if called twice without an intervening write_object().
  • StreamingEncoder::finish() errors if a preceder was written without a following write_object().
  • encode() (buffered mode) errors if emit_preceders: true — use StreamingEncoder::write_preceder() instead.

File Concatenation

Tensogram is a message format, not a file format. Multiple .tgm files can be concatenated:

cat 1.tgm 2.tgm > all.tgm

The resulting file is valid. scan() and TensogramFile will find all messages from both source files.

xarray Layer Edge Cases

meta.base Out-of-Range

If a message has more data objects than meta.base entries (e.g. 3 objects but base has only 1 entry), the xarray layer logs a warning and treats the missing base entries as empty dicts. The objects are still decoded — they just have no per-object metadata attributes.

This can happen when a message is encoded with an incomplete base array, or when objects are appended to a message without updating base. The warning helps diagnose silent metadata loss:

WARNING: meta.base has 1 entries but object index 2 requested;
         per-object metadata will be empty for this object

Empty or Missing base Attribute

A message with base: [] or no base key at all is valid. All objects get empty per-object metadata and are named object_0, object_1, etc. The _reserved_ key (auto-populated by the encoder in each base entry) is always filtered out — it never appears in user-facing variable attributes.

Variable Naming with Dot Paths

When variable_key="mars.param" is used, the resolve_variable_name() function traverses the nested dict path. If any segment is missing, the function falls back to the generic object_<index> name. The obj_index used is the object’s position in the message (not its position among data variables), so a file with objects 0 (coord), 1 (data), 2 (data) would produce names like "object_1" and "object_2" for the data variables.
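The fallback logic can be sketched as follows (a hypothetical helper mirroring the behaviour described for resolve_variable_name):

```python
def resolve_variable_name(per_object_meta, variable_key, obj_index):
    """Walk a dotted key path through nested dicts; fall back to object_<index>
    if any segment is missing."""
    node = per_object_meta
    for segment in variable_key.split("."):
        if not isinstance(node, dict) or segment not in node:
            return f"object_{obj_index}"
        node = node[segment]
    return str(node)

meta = {"mars": {"param": "2t"}}
assert resolve_variable_name(meta, "mars.param", 1) == "2t"
assert resolve_variable_name(meta, "mars.levtype", 2) == "object_2"  # missing segment
```

Note that obj_index is the object's position in the whole message, which is why a coordinate object at index 0 still shifts the data variables to object_1 and object_2.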

Coordinate Name Case Insensitivity

Coordinate detection (detect_coords) is case-insensitive: "LATITUDE", "Lat", and "latitude" all match the known coordinate name "latitude". The canonical dimension name is always lowercase (e.g. "latitude", not "LATITUDE").

Ambiguous Dimension Size Matching

When two coordinate arrays have the same size (e.g. latitude with 5 points and depth with 5 points), the dimension resolution assigns the first matching coord to the first axis that matches the size, and the second to the next axis. If the data variable is 2D [5, 5], one axis gets "latitude" and the other gets "depth". When no coord has the matching size, the axis gets a generic "dim_N" name.

Multi-Message Merge with Different Keys

When open_datasets() merges multiple messages, objects whose base entries have different key sets are handled as follows:

  • Keys present in all objects with identical values become Dataset attributes (constant).
  • Keys present in all objects with varying values become outer dimensions (if they form a hypercube) or separate variables.
  • Keys present in some objects but not others are treated as varying with None for missing entries.

_reserved_ Filtering Consistency

The _reserved_ key is filtered at every access point:

  • TensogramDataStore._get_per_object_meta() (store.py)
  • _base_entry_from_meta() (scanner.py)
  • _filter_reserved() (zarr store.py)

This ensures the encoder’s auto-populated tensor info (ndim, shape, strides, dtype) never leaks into user-facing metadata.

Zarr Layer Edge Cases

Group Attributes from meta.extra

Group-level attributes in the root zarr.json come from meta.extra (message-level annotations). If meta.extra is empty or absent, the group zarr.json only contains internal attributes (_tensogram_version, _tensogram_variables).

Per-Array Attributes from meta.base[i]

Per-array attributes come from meta.base[i] with the _reserved_ key filtered out. Descriptor encoding params are stored under _tensogram_params to avoid namespace collisions.

Variable Name Resolution — No Extra Fallback

Variable names are resolved exclusively from per_object_meta (from meta.base[i]). The common_meta (from meta.extra) is not searched for variable naming. This prevents all objects in a message from sharing the same name when a name key exists only at the message level.

This is consistent across both xarray and zarr layers.

Zarr Metadata Key Collision

If a base entry has keys like "zarr", "chunks", or "shape", they go into the Zarr array’s attributes dict — not the top-level metadata. There is no collision with Zarr’s own shape, chunk_grid, etc. fields.

Write Path: _reserved_ Filtering

When writing through TensogramStore, user-set array attributes are written into base[i] entries. The _reserved_ key is explicitly filtered from these entries to prevent collision with the encoder’s auto-populated _reserved_.tensor info.

Write Path: Group Attributes

Group attributes set via Zarr become unknown top-level keys in GlobalMetadata, which the encoder preserves as _extra_. On re-read, they appear in meta.extra. Internal keys (starting with _tensogram_) and reserved structural keys (version, base, _extra_, _reserved_) are excluded.

Empty TGM File

A .tgm file with zero messages produces a root group zarr.json with no arrays. A message with zero data objects produces a root group with the message’s extra metadata but no arrays.

Variable Name Deduplication

When multiple objects resolve to the same name, suffixes _1, _2, etc. are appended. For example, three objects named "x" become "x", "x_1", "x_2".

Variable Name Sanitization

Slashes and backslashes in resolved variable names are replaced with underscores to prevent spurious directory nesting in the Zarr virtual key space. Empty names are replaced with "_".
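Taken together, the deduplication and sanitization rules behave like this sketch (hypothetical helpers; collisions with pre-existing suffixed names are ignored for brevity):

```python
def sanitize(name):
    """Replace path separators with underscores; empty names become '_'."""
    name = name.replace("/", "_").replace("\\", "_")
    return name or "_"

def dedupe(names):
    """Append _1, _2, ... to repeated names, first occurrence kept as-is."""
    seen, out = {}, []
    for n in names:
        if n in seen:
            seen[n] += 1
            out.append(f"{n}_{seen[n]}")
        else:
            seen[n] = 0
            out.append(n)
    return out

assert dedupe(["x", "x", "x"]) == ["x", "x_1", "x_2"]
assert sanitize("surface/2t") == "surface_2t"
assert sanitize("") == "_"
```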

GRIB Importer Edge Cases

This section covers behaviour specific to the tensogram-grib importer and the tensogram convert-grib CLI — these notes apply when you are bringing GRIB data into Tensogram, not to Tensogram itself.

Single GRIB to base[0] Has ALL MARS Keys

In OneToOne mode, each GRIB message becomes one Tensogram message. All MARS namespace keys (plus gridType as "grid") go into base[0]["mars"]. When --all-keys is enabled, non-MARS namespace keys (geography, time, vertical, parameter, statistics) go into base[0]["grib"].

MergeAll with N Fields

In MergeAll mode, N GRIB fields become one Tensogram message with N data objects. Each base[i] holds ALL metadata for that object independently — there is no common/varying partitioning at encode time. This means metadata keys are duplicated across base entries.

Performance note: With 1000 GRIB fields, this means 1000 copies of common keys (class, type, stream, expver, date, time, etc.). This is by design — the wire format prioritizes simplicity and independent object access over byte savings. Use tensogram::compute_common() at display/merge time to extract shared keys.
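The common/varying split that compute_common performs at display or merge time can be sketched like this (a simplification; the real tensogram::compute_common operates on CBOR values):

```python
def compute_common(entries):
    """Split per-object metadata dicts into the keys shared (with equal values)
    by every entry and the per-entry remainder."""
    if not entries:
        return {}, []
    common = {k: v for k, v in entries[0].items()
              if all(e.get(k, object()) == v for e in entries[1:])}
    varying = [{k: v for k, v in e.items() if k not in common} for e in entries]
    return common, varying

entries = [{"class": "od", "step": 0}, {"class": "od", "step": 6}]
common, varying = compute_common(entries)
assert common == {"class": "od"}
assert varying == [{"step": 0}, {"step": 6}]
```

The duplicated keys on the wire thus cost bytes but no information: the shared part is always recoverable.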

Different Grid Types in MergeAll

GRIB fields with different grid types (e.g. regular_ll and reduced_gg) can be merged into the same Tensogram message. Each base[i]["mars"]["grid"] independently records its grid type. Downstream consumers (xarray, zarr) must handle the structural differences (e.g. different shapes).

GRIB Shape from Ni/Nj

The shape is derived from ecCodes Ni and Nj keys (row-major: [Nj, Ni]). If either is zero or missing (e.g. reduced Gaussian grids), the shape falls back to [numberOfPoints] (1-D).

Empty params in DataObjectDescriptor

GRIB-converted data objects have empty desc.params — all metadata lives in base[i]["mars"] and base[i]["grib"], not in the per-object descriptor. This is by design: the descriptor carries only what’s needed to decode the payload (shape, dtype, encoding pipeline).

Metadata Model Edge Cases (base / reserved / extra)

The v2 metadata model has three sections: base (per-object), _reserved_ (library internals), and _extra_ (client annotations). These create several non-obvious edge cases.

_reserved_ is Protected

Client code must not set _reserved_ in any context:

  • Python: tensogram.encode({"version": 2, "_reserved_": {...}}) raises ValueError.
  • Python: encode({"version": 2, "base": [{"_reserved_": {...}}]}) raises ValueError.
  • FFI: JSON with "base": [{"_reserved_": {...}}] returns TgmError::Metadata.
  • CLI: set -s _reserved_.tensor.ndim=5 returns an error.

The encoder auto-populates _reserved_.tensor in each base entry (ndim, shape, strides, dtype) and _reserved_ at the message level (encoder, time, uuid).

Metadata Lookup Semantics (base first-match)

All lookup functions (__getitem__ in Python, tgm_metadata_get_string in FFI, lookup_key in CLI) use first-match semantics:

  1. Search base[0], then base[1], …, skipping the _reserved_ key within each entry.
  2. If not found in any base entry, search _extra_.
  3. If not found → None (FFI/CLI) or KeyError (Python).

Implication: If base[0] has product.name="temperature" and base[1] has product.name="pressure", lookups return "temperature" (the first match). This is message-level lookup, not per-object. The same applies to any namespace (MARS, BIDS, DICOM, etc.).
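The three-step search can be sketched as follows (a simplification of the lookup functions named above, with dotted paths resolved segment by segment and the _reserved_ path blocked):

```python
def lookup_key(meta, path):
    """First-match lookup: base[0], base[1], ... (skipping _reserved_), then _extra_."""
    def walk(node, segments):
        for seg in segments:
            if not isinstance(node, dict) or seg == "_reserved_" or seg not in node:
                return None
            node = node[seg]
        return node

    segments = path.split(".") if path else []
    if not segments:
        return None                      # empty key finds no match
    for entry in meta.get("base", []):
        found = walk(entry, segments)
        if found is not None:
            return found
    return walk(meta.get("_extra_", {}), segments)

meta = {"base": [{"product": {"name": "temperature"}},
                 {"product": {"name": "pressure"}}],
        "_extra_": {"origin": "ecmwf"}}
assert lookup_key(meta, "product.name") == "temperature"  # first base entry wins
assert lookup_key(meta, "origin") == "ecmwf"              # falls through to _extra_
assert lookup_key(meta, "") is None
```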

_reserved_ is Hidden from Dict Access

  • meta["_reserved_"]KeyError (Python). The key is skipped during base entry iteration.
  • "_reserved_" in metaFalse.
  • tgm_metadata_get_string(meta, "_reserved_.tensor")NULL (FFI). The path is blocked.
  • To read _reserved_ data, use meta.reserved (Python) or read the base entry directly via meta.base[i]["_reserved_"].

Explicit _extra_ / extra Prefix

The CLI and FFI support explicit _extra_.key or extra.key prefixes to target the _extra_ map directly, bypassing the base search:

# CLI: write to _extra_ map
tensogram set -s "extra.custom=value" input.tgm output.tgm
tensogram set -s "_extra_.custom=value" input.tgm output.tgm

# CLI: read from _extra_ map
tensogram get -p "_extra_.custom" input.tgm

Without the prefix, set writes to all base entries. With the prefix, it writes to _extra_ specifically.

Empty Key String

An empty key "" returns None (FFI/CLI) or raises KeyError (Python). This is not an error — it simply finds no match.

base vs Descriptor Count

The base array length should match the number of data objects. The encoder auto-extends base entries (adding _reserved_.tensor) for each object. If the user provides fewer base entries than objects, the encoder creates entries for the missing ones. If the user provides more base entries than objects, the encoder returns an error.

tgm_metadata_num_objects (FFI)

tgm_metadata_num_objects() returns base.len(), which is the number of per-object metadata entries. After encoding, this matches the actual data object count because the encoder populates one base entry per object.

set Command on Zero-Object Messages

The CLI set command redirects mutations to _extra_ when the message has zero data objects. This is because base entries must align 1:1 with descriptors, and a zero-object message has no descriptors.

Both _extra_ and extra in Python Dict

When both "_extra_" and "extra" are present in a Python metadata dict, _extra_ takes precedence (it’s the wire-format name). The "extra" key is treated as a convenience alias and only used if "_extra_" is absent.

Filter Matching with Multi-Object Messages

CLI where-clause filters (-w mars.param=2t) match at the message level. If base[0] has mars.param=2t and base[1] has mars.param=msl, the filter matches "2t" (first base entry match). To filter by per-object values, split the message first.

Split Preserves Per-Object Metadata

When splitting a multi-object message, the CLI split command assigns each object its own base entry from the original message. The _reserved_ key is stripped from each entry (the encoder regenerates it). Extra metadata is copied to all split messages.

Merge Concatenates Base Arrays

When merging messages, the CLI merge command concatenates all base arrays. The merge strategy (first/last/error) only applies to _extra_ key conflicts. The _reserved_ section is cleared and regenerated by the encoder.

Deeply Nested Paths

Dot-notation paths support arbitrary nesting depth: grib.geography.Ni, a.b.c.d.e. The recursive resolver walks through CBOR Map values at each level. If a non-Map value is encountered before the path is fully resolved, the lookup returns None.
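
The resolver's behaviour can be modelled with an illustrative Python sketch (not the library's code), using plain dicts as stand-ins for CBOR maps:

```python
def resolve_path(root, path):
    """Walk a dot-notation path through nested maps.

    Returns None if a key is missing or a non-map value is hit
    before the path is fully resolved, as described above.
    """
    node = root
    for part in path.split("."):
        if not isinstance(node, dict):
            return None          # non-Map value before path exhausted
        node = node.get(part)
        if node is None:
            return None
    return node

meta = {"grib": {"geography": {"Ni": 720}}, "leaf": 1}
print(resolve_path(meta, "grib.geography.Ni"))  # → 720
print(resolve_path(meta, "leaf.too.deep"))      # → None
```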

JSON Output Structure

CLI dump -j and ls -j output uses the wire-format structure:

{
  "version": 2,
  "base": [{"mars": {"param": "2t"}, "_reserved_": {"tensor": {"ndim": 1}}}],
  "extra": {"custom": "value"}
}

The _reserved_ keys within base entries are included in JSON output for transparency.


Metadata Refactor: Detailed Edge Cases

The following edge cases were identified during systematic review of the Rust core crate (tensogram) after the metadata refactor.

base Array Count Validation

| Scenario | Behaviour |
|---|---|
| base.len() < descriptors.len() | Auto-extended with empty entries. _reserved_.tensor is inserted in each. |
| base.len() == descriptors.len() | Normal path. Pre-existing application keys preserved. |
| base.len() > descriptors.len() | Error: “metadata base has N entries but only M descriptors provided; extra base entries would be discarded”. |

Rationale: Silently truncating excess base entries would lose user data. Auto-extending is safe because the library adds _reserved_.tensor to each new entry.
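
A minimal Python sketch of this rule, using plain dicts in place of the real metadata types (the function name and structures are illustrative only):

```python
def align_base(base, num_objects):
    """Align per-object metadata entries with data objects.

    Too few entries: auto-extend with fresh ones (the library would
    also populate _reserved_.tensor in each new entry).
    Too many entries: error, so no user-provided data is dropped.
    """
    if len(base) > num_objects:
        raise ValueError(
            f"metadata base has {len(base)} entries but only "
            f"{num_objects} descriptors provided; "
            f"extra base entries would be discarded")
    while len(base) < num_objects:
        base.append({"_reserved_": {"tensor": {}}})
    return base

print(len(align_base([{"mars": {"param": "2t"}}], 2)))  # → 2
```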

_reserved_.tensor After Encode

After encoding, each base[i]["_reserved_"]["tensor"] always contains exactly four keys:

| Key | Value | Example |
|---|---|---|
| ndim | CBOR integer | 0 for scalar, 2 for matrix |
| shape | CBOR array of integers | [] for scalar, [10, 20] for matrix |
| strides | CBOR array of integers | [] for scalar, [20, 1] for matrix |
| dtype | CBOR text | "float32", "int64", etc. |

For scalar tensors (ndim: 0), shape and strides are empty arrays [].

Preceder _reserved_ Protection

Encoder side: StreamingEncoder::write_preceder() rejects any metadata map containing a _reserved_ key. Error: “client code must not write ‘reserved’ in preceder metadata”.

Decoder side: When the decoder encounters a _reserved_ key in a preceder’s base[0], it strips the key rather than rejecting the message. This is permissive — the data may come from a non-standard producer. The encoder-populated _reserved_.tensor from the footer metadata is preserved.

Merge order in finish(): Footer metadata is populated first (_reserved_.tensor), then preceder payloads are merged on top. Since the decoder strips _reserved_ from preceders, there is no risk of preceder _reserved_ clobbering the encoder’s _reserved_.tensor.

Backward Compatibility with Old CBOR Keys

| Old key | Behaviour on decode |
|---|---|
| "common" (v2 pre-refactor) | Silently ignored (unknown CBOR key). |
| "payload" (v2 pre-refactor) | Silently ignored. |
| "reserved" (old name) | Silently ignored — only "_reserved_" is recognized. |
| Both "reserved" and "_reserved_" | Only "_reserved_" is captured; "reserved" is ignored. |

GlobalMetadata does not use #[serde(deny_unknown_fields)], so serde drops unrecognized keys.

compute_common() Key Selection

compute_common() only examines keys from the first base entry as candidates for common keys. Keys present in later entries but absent from the first entry are never promoted to common.

Example: if entry 0 has keys {a, b} and entry 1 has {b, c}, only b is a candidate (and becomes common if values match). Key c appears only in entry 1’s remaining set.
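
The selection rule can be sketched in Python (illustrative only; the real implementation operates on CBOR values with cbor_values_equal()):

```python
def compute_common(entries):
    """Promote keys shared, with equal values, by every entry.

    Candidates come only from the first entry, as described above.
    Returns (common, remaining_per_entry).
    """
    if not entries:
        return {}, []
    common = {k: v for k, v in entries[0].items()
              if all(k in e and e[k] == v for e in entries[1:])}
    remaining = [{k: v for k, v in e.items() if k not in common}
                 for e in entries]
    return common, remaining

common, rest = compute_common([{"a": 1, "b": 2}, {"b": 2, "c": 3}])
print(common)  # → {'b': 2}
print(rest)    # → [{'a': 1}, {'c': 3}]
```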

compute_common() NaN Handling

CBOR Float(NaN) values with identical bit patterns are treated as equal by cbor_values_equal(), using f64::to_bits() comparison. This means NaN values are classified as common when all entries share the same NaN bit pattern. Standard CBOR equality (PartialEq) would fail because NaN != NaN.
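
The bit-pattern comparison can be reproduced in Python with struct (a sketch of the idea, not the library's cbor_values_equal()):

```python
import struct

def f64_bits(x):
    """Raw 64-bit pattern of a double, like Rust's f64::to_bits()."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def floats_equal(a, b):
    """Bitwise equality: NaNs with identical payloads compare equal."""
    return f64_bits(a) == f64_bits(b)

nan = float("nan")
print(nan == nan)               # → False (IEEE 754 / PartialEq)
print(floats_equal(nan, nan))   # → True (same bit pattern)
```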

compute_common() CBOR Map Ordering

cbor_values_equal() compares CBOR maps positionally (entry-by-entry). Two maps with the same keys and values in different order are NOT equal. This is correct because canonical CBOR encoding ensures all maps are always sorted — different-order maps can only arise from non-canonical input.

Shape Product Overflow

All shape-product computations use checked_mul to detect overflow. This applies to encode(), decode(), ObjectIter::next(), and decode_range(). If the product overflows u64, a TensogramError::Metadata("shape product overflow") is returned. No silent wraparound.
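
The checked product can be sketched in Python, where integers are unbounded and the u64 limit must be enforced explicitly (illustrative, not the library's code):

```python
U64_MAX = 2**64 - 1

def checked_shape_product(shape):
    """Multiply dimension sizes, failing loudly on u64 overflow --
    the moral equivalent of chaining checked_mul."""
    product = 1
    for dim in shape:
        product *= dim
        if product > U64_MAX:
            raise OverflowError("shape product overflow")
    return product

print(checked_shape_product([10, 20]))  # → 200
# checked_shape_product([2**40, 2**40]) would raise OverflowError
```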

_extra_ Scope Independence

_extra_ is message-level, while base[i] entries are per-object. Keys with the same name can exist in both:

#![allow(unused)]
fn main() {
meta.base[0].insert("mars".into(), ...);  // per-object
meta.extra.insert("mars".into(), ...);     // message-level
// Both preserved after encode/decode round-trip
}

Empty _extra_ in CBOR

An empty _extra_ map is omitted from CBOR output via skip_serializing_if = "BTreeMap::is_empty". On decode, a missing _extra_ key is deserialized as an empty BTreeMap. Round-trips correctly.

Deeply Nested _reserved_ in base Entries

Only the top-level _reserved_ key in base[i] is rejected by the encoder. Deeply nested _reserved_ keys (like {"foo": {"_reserved_": ...}}) are allowed and preserved. The encoder only checks entry.contains_key("_reserved_").

CLI set on Zero-Object Messages

When tensogram set modifies a zero-object message, keys that would normally go into base are redirected to _extra_ instead (since base entries must align 1:1 with data objects, and there are none).


Error Handling Reference

This section documents all error types, how they propagate across languages, and what messages users can expect.

TensogramError Variants (Rust)

The core library defines seven error variants in TensogramError:

| Variant | When it occurs | Example message |
|---|---|---|
| Framing(String) | Invalid wire format — magic bytes, postamble, frame ordering | "buffer too short (12 bytes, need >= 24)" |
| Metadata(String) | Metadata validation failures — version, base count, CBOR parse | "metadata base has 3 entries but only 2 descriptors provided" |
| Encoding(String) | Encoding pipeline errors — simple_packing NaN, bit-width | "NaN value at index 42" |
| Compression(String) | Compression/decompression failures — codec errors, range access | "RangeNotSupported: zstd does not support partial decode" |
| Object(String) | Per-object errors — index out of range, shape overflow | "object index 99 out of range (num_objects=2)" |
| Io(io::Error) | File system errors — open, read, write, seek | "data.tgm: No such file or directory" |
| HashMismatch { expected, actual } | Integrity check failure | "hash mismatch: expected=abc123, actual=def456" |

Python Exception Mapping

The Python bindings convert TensogramError to Python exceptions:

| Rust variant | Python exception | Prefix in message |
|---|---|---|
| Framing | ValueError | FramingError: |
| Metadata | ValueError | MetadataError: |
| Encoding | ValueError | EncodingError: |
| Compression | ValueError | CompressionError: |
| Object | ValueError | ObjectError: |
| Io | IOError | (raw io message) |
| HashMismatch | RuntimeError | HashMismatch: |

Additional Python-side exceptions:

| Function | Exception | Condition |
|---|---|---|
| encode() | ValueError | Missing version key, _reserved_ in dict, unknown dtype |
| decode() | ValueError | Corrupted buffer, invalid CBOR |
| Metadata.__getitem__() | KeyError | Key not found in base or extra |
| Metadata.__getitem__("_reserved_") | KeyError | _reserved_ is always hidden from dict access |
| TensogramFile.__getitem__() | IndexError | Message index out of range |
| TensogramFile.__getitem__() | TypeError | Non-integer, non-slice index |
| compute_packing_params() | ValueError | NaN in input array |
| encode(hash="sha256") | ValueError | "unknown hash: sha256" |

Example: handling errors in Python:

import tensogram

# File not found
try:
    with tensogram.TensogramFile.open("missing.tgm") as f:
        pass
except IOError as e:
    print(f"File error: {e}")
    # → "File error: file not found: missing.tgm"

# Corrupted buffer
try:
    tensogram.decode(b"garbage")
except ValueError as e:
    print(f"Decode error: {e}")
    # → "Decode error: FramingError: buffer too short ..."

# Hash verification failure
try:
    meta, objects = tensogram.decode(buf, verify_hash=True)
except RuntimeError as e:
    print(f"Integrity error: {e}")
    # → "Integrity error: HashMismatch: expected=..., actual=..."

# Missing metadata key
meta, objects = tensogram.decode(buf)
try:
    val = meta["nonexistent"]
except KeyError:
    print("Key not found")

# Index out of range
with tensogram.TensogramFile.open("data.tgm") as f:
    try:
        msg = f[999]
    except IndexError as e:
        print(f"Index error: {e}")
        # → "message index 999 out of range for file with 2 messages"

CLI Error Handling

All CLI commands:

  • Print errors to stderr with error: prefix
  • Show the full error chain (nested causes)
  • Exit with code 1 on any error
  • Exit with code 0 on success

Common CLI error scenarios:

# File not found
$ tensogram ls nonexistent.tgm
error: file not found: nonexistent.tgm

# Invalid where clause
$ tensogram ls -w "bad-clause" data.tgm
error: invalid where clause: invalid where-clause: bad-clause (expected key=value or key!=value)

# Missing key in strict get
$ tensogram get -p "nonexistent" data.tgm
error: key not found: nonexistent

# Protected namespace
$ tensogram set -s "_reserved_.tensor.ndim=5" input.tgm output.tgm
error: cannot modify '_reserved_' — this namespace is managed by the library

# Immutable descriptor key
$ tensogram set -s "shape=broken" input.tgm output.tgm
error: cannot modify immutable key: shape

# Merge conflict with error strategy
$ tensogram merge --strategy error a.tgm b.tgm -o merged.tgm
error: conflicting values for key 'param' (use --strategy first or last to resolve)

# Invalid merge strategy
$ tensogram merge --strategy unknown a.tgm b.tgm -o merged.tgm
error: unknown merge strategy 'unknown': expected first, last, or error

# Corrupt file (framing error via file.read_message)
$ tensogram dump corrupt.tgm
error: framing error: buffer too short ...

xarray Backend Error Handling

| Scenario | Behaviour |
|---|---|
| File not found | IOError from tensogram.TensogramFile.open() |
| Corrupt file | ValueError from tensogram.decode_descriptors() |
| message_index out of range | ValueError from TensogramFile.read_message() |
| message_index < 0 | ValueError("message_index must be >= 0, got -1") |
| meta.base shorter than objects | Warning logged; missing entries treated as empty dicts |
| Unsupported dtype | TypeError("unsupported tensogram dtype ...") |
| dim_names count mismatch | ValueError("dim_names has N entries but tensor has M dimensions") |
| decode_range failure | Warning logged; falls back to full decode_object() |
| File with zero messages + merge_objects=True | Returns empty xr.Dataset() |

Zarr Store Error Handling

| Scenario | Behaviour |
|---|---|
| File not found | OSError("failed to open TGM file ...") wrapping the original error |
| Corrupt message | ValueError("failed to decode message ...") wrapping the original error |
| Failed object decode | ValueError("failed to decode object N ...") wrapping the original error |
| message_index out of range | IndexError("message_index N out of range (file has M message(s))") |
| message_index < 0 | ValueError("message_index must be >= 0, got -1") |
| Invalid mode | ValueError("invalid mode 'x'; expected 'r', 'w', or 'a'") |
| Empty path | ValueError("path must be a non-empty string, got ''") |
| Store already open | ValueError("store is already open") |
| Write to read-only store | Raises from Zarr base class |
| Flush failure during exception | Warning logged; original exception preserved |
| Unsupported dtype on write | ValueError("unsupported dtype for variable ...") |
| Chunk size mismatch on write | ValueError("chunk data for 'var': expected N bytes ... got M") |
| Multiple chunks per variable | ValueError("variable 'var' has N chunk keys; TensogramStore only supports single-chunk arrays") |
| Unsupported ByteRequest type | TypeError("unsupported ByteRequest type: ...") |
| Zero messages in file | Root group zarr.json with empty attributes; no arrays |

IO Error Path Context

All file I/O errors include the file path in the error message. This applies to:

  • TensogramFile::open() — "file not found: /path/to/file.tgm"
  • TensogramFile::create() — "cannot create /path/to/file.tgm: Permission denied"
  • Internal re-opens (scan, read, append) — "/path/to/file.tgm: No such file or directory"

This ensures that when errors propagate through multiple layers (e.g. Rust → Python → xarray), the original file path is always visible in the error message.

Internals

This page explains implementation decisions that are not obvious from the public API. Useful if you’re contributing to the library or implementing a compatible reader in another language.

Deterministic CBOR Canonicalization

The library encodes all CBOR structures (global metadata, data object descriptors, index frames, hash frames) using a three-step process:

  1. Serialize the struct to a ciborium::Value tree using serde.
  2. Recursively sort all map keys by their CBOR byte encoding.
  3. Write the sorted Value tree to bytes.

Standard serde serialization into ciborium does not guarantee key order (it depends on the HashMap/BTreeMap iteration order of the struct). Even though the library uses BTreeMap throughout (which gives alphabetical iteration order for string keys), relying on that would be fragile. The explicit canonicalization step ensures the output matches RFC 8949 §4.2 regardless of how the keys were stored.

GlobalMetadata / DataObjectDescriptor struct
    ↓ serde serialization
ciborium::Value::Map (arbitrary key order)
    ↓ canonicalize() — sort all maps recursively by CBOR-encoded key bytes
ciborium::Value::Map (canonical order)
    ↓ write to bytes
CBOR bytes (deterministic)

Note: canonicalize() returns Result<()> and propagates errors rather than panicking.
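
A Python sketch of the sorting step, restricted to short text-string keys and using dicts as stand-ins for ciborium maps (illustrative only; the real canonicalize() handles all key types and returns a Result):

```python
def cbor_text_key(s):
    """CBOR encoding of a short text-string key (length < 24):
    one header byte (major type 3 | length) then UTF-8 bytes."""
    data = s.encode("utf-8")
    assert len(data) < 24, "sketch handles short keys only"
    return bytes([0x60 | len(data)]) + data

def canonicalize(value):
    """Recursively sort all map keys by their CBOR byte encoding,
    per RFC 8949 section 4.2."""
    if isinstance(value, dict):
        return {k: canonicalize(value[k])
                for k in sorted(value, key=cbor_text_key)}
    if isinstance(value, list):
        return [canonicalize(v) for v in value]
    return value

m = canonicalize({"bb": 1, "a": {"z": 2, "y": 3}})
print(list(m))       # → ['a', 'bb']  (shorter key encodes first)
print(list(m["a"]))  # → ['y', 'z']
```

Because the length is part of the encoded key's first byte, shorter keys sort before longer ones that share a prefix, which is exactly the bytewise ordering RFC 8949 §4.2 requires.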

BTreeMap Throughout

The extra (serialized as _extra_), reserved (serialized as _reserved_), and base entry fields in GlobalMetadata, as well as the params field in DataObjectDescriptor, are BTreeMap<String, ciborium::Value>. This:

  • Gives alphabetical iteration order for string keys (which matches CBOR canonical order for short strings).
  • Avoids the non-determinism of HashMap.
  • Makes it easy to read and write keys without worrying about order.

Frame-Based Wire Format (v2)

The v2 wire format uses a frame-based structure instead of the v1 monolithic binary header.

Preamble (24 bytes)

MAGIC "TENSOGRM" (8) + version u16 (2) + flags u16 (2) + reserved u32 (4) + total_length u64 (8)

The preamble flags indicate which optional frames are present (header/footer metadata, index, hashes). total_length = 0 signals streaming mode.
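
A Python sketch of parsing the preamble fields listed above. The field widths come from the text; little-endian byte order is an assumption of this sketch, since the text does not state the byte order:

```python
import struct

def parse_preamble(buf):
    """Parse the 24-byte preamble: MAGIC (8) + version u16 (2) +
    flags u16 (2) + reserved u32 (4) + total_length u64 (8).
    Byte order here is assumed little-endian."""
    if len(buf) < 24:
        raise ValueError(f"buffer too short ({len(buf)} bytes, need >= 24)")
    magic, version, flags, _reserved, total_length = struct.unpack_from(
        "<8sHHIQ", buf)
    if magic != b"TENSOGRM":
        raise ValueError("bad magic")
    # total_length == 0 signals streaming mode
    return version, flags, total_length, total_length == 0

buf = struct.pack("<8sHHIQ", b"TENSOGRM", 2, 0b101, 0, 0)
print(parse_preamble(buf))  # → (2, 5, 0, True)
```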

Frame Header (16 bytes)

Every frame (metadata, index, hash, data object) starts with:

"FR" (2) + frame_type u16 (2) + version u16 (2) + flags u16 (2) + total_length u64 (8)

And ends with "ENDF" (4 bytes). Frame versions are independent of message version.

Data Object Frame Layout

Each data object is a self-contained frame:

Frame header (16B) + [CBOR descriptor] + payload bytes + [CBOR descriptor] + cbor_offset u64 (8B) + "ENDF" (4B)

The cbor_offset is the byte offset from the frame start to the CBOR descriptor. A flag bit controls whether the CBOR descriptor appears before or after the payload (default: after, since encoding parameters like hash are only known after encoding completes).

Postamble (16 bytes)

first_footer_offset u64 (8) + END_MAGIC "39277777" (8)

first_footer_offset is never zero. It points to the first footer frame, or to the postamble itself when no footer frames are present.

Two-Pass Index Construction

When encoding a non-streaming message, the index frame contains byte offsets of each data object. But the index frame’s own size affects those offsets (circular dependency). The encoder solves this with a two-pass approach:

  1. First pass: compute index CBOR with placeholder offsets to determine the index frame size.
  2. Second pass: compute final offsets using the known index frame size, re-encode the index CBOR.

If the re-encoded CBOR changes size (edge case), the encoder returns an error rather than silently producing incorrect offsets.

Encoder Structure

The encode_message() function delegates to five focused helpers:

  • build_hash_frame_cbor() — collects hashes from objects and serializes the HashFrame
  • build_index_frame() — runs the two-pass index construction described above
  • compute_object_offsets() — calculates byte offsets with 8-byte alignment
  • compute_message_flags() — sets preamble flags from optional frame presence
  • assemble_message() — writes preamble, frames, and postamble into the final buffer

simple_packing Bit Layout

Values are packed MSB-first (most significant bit first), following the same bit layout as the GRIB 2 simple_packing specification so that quantised payloads are interoperable with existing GRIB tooling:

Element 0: bits [0 .. B-1]
Element 1: bits [B .. 2B-1]
Element 2: bits [2B .. 3B-1]
...

The last byte is zero-padded on the right if N × B is not a multiple of 8.

The decode formula is:

V[i] = R + (packed[i] × 2^E) / 10^D

Where:

  • R = reference_value (minimum of original data)
  • E = binary_scale_factor
  • D = decimal_scale_factor
  • packed[i] = the integer read from the packed bits
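
Both the MSB-first extraction and the decode formula can be sketched in Python; the packed bytes and parameter values below are made-up example inputs, not taken from the spec:

```python
def unpack_msb_first(data, nbits, count):
    """Read `count` integers of `nbits` bits each, MSB-first."""
    values = []
    for i in range(count):
        v = 0
        for bit in range(nbits):
            pos = i * nbits + bit           # absolute bit position
            v = (v << 1) | ((data[pos // 8] >> (7 - pos % 8)) & 1)
        values.append(v)
    return values

def simple_unpack(data, nbits, count, R, E, D):
    """V[i] = R + (packed[i] * 2**E) / 10**D, per the formula above."""
    return [R + (p * 2.0 ** E) / 10.0 ** D
            for p in unpack_msb_first(data, nbits, count)]

# Three 5-bit values 10, 3, 31 pack MSB-first into two bytes:
# 01010 00011 11111 + one pad bit -> 0b01010000, 0b11111110
print(unpack_msb_first(bytes([0x50, 0xFE]), 5, 3))          # → [10, 3, 31]
print(simple_unpack(bytes([0x50, 0xFE]), 5, 3, 0.0, 1, 0))  # → [20.0, 6.0, 62.0]
```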

Lazy File Scanning

TensogramFile::open() does not read the file. The first call that needs the message list (e.g. message_count(), read_message()) triggers a streaming scan using scan_file(). The scanner reads only preamble-sized chunks and seeks forward, so it never loads the entire file into memory. After that, the list of (offset, length) pairs is cached in memory for the lifetime of the TensogramFile object.

// No I/O here
let mut file = TensogramFile::open("huge.tgm")?;

// Streaming scan happens here (once) — reads preamble chunks, seeks forward
let count = file.message_count()?;

// O(1) seek + read
let msg = file.read_message(999)?;

Error Hierarchy

TensogramError
├── Framing     — invalid magic, truncated preamble, bad frame markers, missing postamble
├── Metadata    — CBOR serialization/deserialization failure
├── Encoding    — invalid encoding params, NaN in simple_packing
├── Compression — compressor error (szip, zstd, lz4, blosc2, zfp, sz3)
├── Object      — index out of range
├── Io          — filesystem errors (wraps std::io::Error)
└── HashMismatch { expected, actual } — payload integrity failure

All public functions return Result<T> where the error is TensogramError. The Io variant wraps std::io::Error via the From impl, so ? on any std::io::Result produces a TensogramError::Io automatically.

Memory-Mapped I/O (mmap feature)

The mmap feature gate enables memory-mapped file access via memmap2. When you open a file with TensogramFile::open_mmap(), the file is mapped into virtual memory and the existing scan() function runs directly on the mapped buffer. Subsequent read_message() calls return copies from the mapped region without additional seeks.

// Requires: cargo build --features mmap
let mut file = TensogramFile::open_mmap("huge.tgm")?;
let count = file.message_count()?; // already scanned during open_mmap
let msg = file.read_message(42)?;  // copies from mmap, no seek

The regular open() path still works without the feature and uses streaming seek-based scanning.

Async I/O (async feature)

The async feature gate adds tokio-based async variants: open_async(), read_message_async(), and decode_message_async(). All CPU-intensive work (scanning, decoding, FFI calls to libaec/zfp/blosc2) runs via spawn_blocking to avoid blocking the async runtime.

// Requires: cargo build --features async
let mut file = TensogramFile::open_async("forecast.tgm").await?;
let (meta, objects) = file.decode_message_async(0, &opts).await?;

Frame Ordering Validation

The decoder enforces that frames appear in the expected order within a message: header frames first, then data object frames, then footer frames. A DecodePhase state machine tracks the current phase and returns TensogramError::Framing if a frame type appears out of order.

This catches malformed messages where, for example, a header metadata frame appears after a data object frame.

Canonical CBOR Verification

The library provides verify_canonical_cbor() to check that a CBOR byte slice is in RFC 8949 §4.2.1 canonical form. This is used internally by tests to verify that all CBOR output (metadata, descriptors, index frames, hash frames) is deterministic. It can also be used by external tools that need to validate Tensogram CBOR output against the spec.