Introduction
Tensogram is a binary message format for N-dimensional scientific tensors — the kind of data that appears in weather and climate forecasting, Earth observation, medical and microscopy imaging, genomics, particle physics, materials simulation, and machine-learning pipelines. It carries its own metadata, supports arbitrary tensor dimensions, and is fast to encode and decode.
What Tensogram gives you
- Self-describing messages. Every message carries the metadata needed to decode it — shape, dtype, encoding pipeline, application annotations — using CBOR. No external schema required.
- Any number of dimensions. A single message can carry multiple tensors, each with its own shape, dtype, and encoding. A 3-D spectrum, a 2-D field, and a 4-D ensemble tensor can coexist in one message.
- Vocabulary-agnostic. The library never interprets metadata keys. Application layers (MARS at ECMWF, CF in climate, BIDS in neuroimaging, your in-house taxonomy) own key names.
- Transport and file in one format. The same bytes that traverse a socket can be appended to a .tgm file; both support O(1) random access to any object.
- Interop with existing formats. Importers for GRIB and NetCDF let you bring existing data into Tensogram pipelines without a lossy re-modelling step.
- Partial range decode. Extract sub-tensor slices without decoding the whole object — useful for remote data at scale.
Tensogram is developed and maintained by ECMWF and is used in operational weather-forecasting workloads, but nothing in the format is weather-specific. The design targets the N-tensor-at-scale problem common to many scientific domains.
Crate Layout
Four primary Rust crates make up the default workspace build:
```
tensogram/
├── rust/
│   ├── tensogram            ← encode, decode, framing, file API,
│   │                          validation, remote object store
│   ├── tensogram-encodings  ← simple_packing, shuffle, compression
│   ├── tensogram-cli        ← `tensogram` command-line tool
│   └── tensogram-ffi        ← C FFI layer for C/C++ callers
├── python/
│   └── bindings/            ← Python bindings (PyO3 / maturin)
├── cpp/
│   └── include/             ← C++ wrapper header + C header
```
On top of those, the repository ships several opt-in crates — the
tensogram-grib / tensogram-netcdf importers (exposed as the
convert-grib / convert-netcdf CLI subcommands), the tensogram-wasm
WebAssembly bindings, and the pure-Rust tensogram-szip /
tensogram-sz3 / tensogram-sz3-sys compression crates — together
with the separate Python packages tensogram-xarray (xarray backend)
and tensogram-zarr (Zarr v3 store backend), and a tensogram-benchmarks
crate. See plans/ARCHITECTURE.md
for the full crate list and build recipes.
Most users interact with tensogram and the CLI. The encodings
crate is used internally by the core but is also importable directly
if you need to call the encoding functions outside of a full message.
Installation
Rust:

```sh
cargo add tensogram
```

Python:

```sh
pip install tensogram        # core
pip install tensogram[all]   # with xarray + zarr backends
```

CLI:

```sh
cargo install tensogram-cli
```
See the Quick Start for feature flags, optional dependencies, and detailed setup.
Quick Example
```rust
#![allow(unused)]
fn main() {
use std::collections::BTreeMap;
use tensogram::{
    encode, decode, GlobalMetadata, DataObjectDescriptor,
    ByteOrder, Dtype, EncodeOptions, DecodeOptions,
};

// Describe what you're storing: a 100×200 grid of f32 values
let desc = DataObjectDescriptor {
    obj_type: "ntensor".to_string(),
    ndim: 2,
    shape: vec![100, 200],
    strides: vec![200, 1],
    dtype: Dtype::Float32,
    byte_order: ByteOrder::Big,
    encoding: "none".to_string(),
    filter: "none".to_string(),
    compression: "none".to_string(),
    params: BTreeMap::new(),
    hash: None,
};

let global_meta = GlobalMetadata {
    version: 2,
    ..Default::default()
};

// Your raw bytes (100 × 200 × 4 bytes = 80,000 bytes)
let data = vec![0u8; 100 * 200 * 4];

// Encode into a self-contained message
let message = encode(&global_meta, &[(&desc, &data)], &EncodeOptions::default()).unwrap();

// Decode it back
let (meta, objects) = decode(&message, &DecodeOptions::default()).unwrap();
assert_eq!(objects[0].0.shape, vec![100, 200]);
assert_eq!(objects[0].1, data);
}
```
The message bytes can be written to a file, sent over a socket, or stored in a database. The receiver does not need any external schema — everything is self-describing.
What is a Message?
A Tensogram message is a single, self-contained binary blob. It carries:
- A Preamble – fixed-size header with magic bytes, version, flags, and total length
- Optional header frames – metadata, index, and hash frames for fast random access
- One or more data object frames – each containing a CBOR descriptor and the actual tensor bytes
- Optional footer frames – metadata, index, and hash frames (used in streaming mode)
- A Postamble – footer offset and terminator magic
Every message begins with the ASCII string TENSOGRM and ends with 39277777. This makes it trivial to find message boundaries even in a file containing hundreds of concatenated messages.
Structure at a Glance
```mermaid
block-beta
  columns 1
  A["PREAMBLE (24 bytes)\nTENSOGRM · version · flags · total_length"]
  B["Header Metadata Frame (optional)\nCBOR GlobalMetadata"]
  C["Header Index Frame (optional)\nobject count + offsets"]
  D["Header Hash Frame (optional)\nobject count + hash type + hashes"]
  E["Data Object Frame 0\nCBOR descriptor + payload bytes"]
  F["Data Object Frame 1 (if present)\nCBOR descriptor + payload bytes"]
  G["... (more data object frames)"]
  H["Footer Hash / Index / Metadata Frames (optional)"]
  I["POSTAMBLE (16 bytes)\nfirst_footer_offset · 39277777"]
```
Frame-Based Design
The v2 wire format is entirely frame-based. Every piece of data between the Preamble and Postamble is wrapped in a frame. Each frame starts with a 4-byte marker (FR + a uint16 frame type), a version, flags, and a length field. This uniform structure means a decoder can skip any frame it does not understand by jumping over its declared length.
Frame types:
| Type ID | Name | Location |
|---|---|---|
| 1 | Header Metadata Frame | Header |
| 2 | Header Index Frame | Header |
| 3 | Header Hash Frame | Header |
| 4 | Data Object Frame | Body |
| 5 | Footer Hash Frame | Footer |
| 6 | Footer Index Frame | Footer |
| 7 | Footer Metadata Frame | Footer |
| 8 | Preceder Metadata Frame | Body (before a Data Object) |
Padding between frames is allowed (from ENDF to the next FR marker) for 64-bit memory alignment.
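The skip-unknown-frames property is easy to see in code. The sketch below assumes the 16-byte header layout described on the wire-format page (2-byte ASCII `FR` marker, uint16 type, uint16 version, uint16 flags, uint64 length, all big-endian) and assumes the length is counted from the frame's first byte; it is an illustration, not a reference decoder:

```rust
/// Minimal frame-walking sketch. ASSUMED layout: [b"FR"][type u16]
/// [version u16][flags u16][length u64], all big-endian, with `length`
/// counted from the frame's first byte. Returns (type, length) pairs.
fn walk_frames(buf: &[u8]) -> Vec<(u16, u64)> {
    let mut frames = Vec::new();
    let mut pos = 0usize;
    while pos + 16 <= buf.len() {
        // Skip any inter-frame padding until the next "FR" marker.
        if &buf[pos..pos + 2] != b"FR" {
            pos += 1;
            continue;
        }
        let ftype = u16::from_be_bytes([buf[pos + 2], buf[pos + 3]]);
        let len = u64::from_be_bytes(buf[pos + 8..pos + 16].try_into().unwrap());
        if len < 16 {
            break; // malformed: a frame is at least its own header
        }
        frames.push((ftype, len));
        // A decoder that does not understand `ftype` simply jumps over it.
        pos += len as usize;
    }
    frames
}
```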
Why Header Frames?
When a message is encoded in a single buffer (the common case), the index and hash frames are placed in the header, right after the Preamble. A decoder reads the Preamble, then the metadata frame, then the index frame, and can immediately seek to any data object by offset. That is O(1) random access, which matters when a message carries many large tensors.
Streaming Support
When encoding in streaming mode, the producer may not know in advance how many data objects the message will contain. In this case:
- `total_length` in the Preamble is set to 0 (unknown)
- Index and hash frames are written in the footer instead of the header
- The Postamble’s `first_footer_offset` field points back to where the footer frames begin
A decoder reading a streamed message seeks to the end, reads the Postamble, then jumps to the footer frames to find the index. Both paths (header index and footer index) give O(1) access to any object.
Data Object Frames
Each data object is self-contained in its own frame. The frame carries:
- A CBOR descriptor (`DataObjectDescriptor`) describing the tensor shape, dtype, encoding pipeline, and optional hash
- The binary payload (the actual encoded tensor bytes)
The CBOR descriptor can appear before or after the payload within the frame. By default it is placed after the payload, since some encoding parameters (like hash values) are only known after the payload has been written. A flag in the frame header indicates the position.
Messages vs Files
A .tgm file is just a sequence of messages written one after another:
```
[message 1][message 2][message 3]...
```
There is no file-level index or header. The TensogramFile API scans the file once (lazily, on first access) and builds an in-memory list of (offset, length) pairs for each message. After that, reading any message is a seek + read – no scan needed.
To find message boundaries in a file:
- Scan for `TENSOGRM` magic (8 bytes)
- If `total_length` is non-zero, use it to advance to the next message
- Otherwise, walk frames using their length fields until the next magic or EOF
Self-Description
Every message carries all the information needed to decode it:
- The dtype of every object (float32, int16, etc.)
- The shape and strides (dimensions and memory layout)
- The full encoding pipeline applied to the payload (encoding, filter, compression)
- The byte order of each object’s data
- Any application-level metadata (MARS keys, units, timestamps, etc.)
This means a decoder never needs an external schema. You can receive a Tensogram message on a new machine, years after it was encoded, and decode it correctly.
Edge Case: Zero-Object Messages
A message with no data object frames is valid. It contains only the Preamble, a metadata frame, and the Postamble. This is useful for sending pure metadata (e.g. a control message or an acknowledgement with provenance information) without any tensor payload.
```rust
#![allow(unused)]
fn main() {
let metadata = GlobalMetadata {
    version: 2,
    ..Default::default()
};
let msg = encode(&metadata, &[], &EncodeOptions::default()).unwrap();
}
```
Metadata
Metadata in Tensogram is stored as CBOR – Concise Binary Object Representation (RFC 8949). Think of it as a compact, binary version of JSON. It supports the same types (strings, integers, floats, booleans, arrays, maps), but is smaller and faster to parse.
Metadata Locations
In v2, metadata lives in two distinct places:
| Level | Where it lives | What it contains |
|---|---|---|
| Global | Header or footer metadata frame | GlobalMetadata: version + base (per-object metadata array) + _reserved_ (library internals) + _extra_ (client annotations) |
| Per-object | Each data object frame’s CBOR descriptor | DataObjectDescriptor: tensor shape, encoding pipeline, hash, plus params for encoding parameters |
Each data object carries its own descriptor inline within its frame.
GlobalMetadata
The global metadata frame contains a GlobalMetadata struct with three named sections:
```rust
#![allow(unused)]
fn main() {
GlobalMetadata {
    version: 2,
    base: Vec::new(),          // one BTreeMap per data object (independent entries)
    reserved: BTreeMap::new(), // library internals (_reserved_ in CBOR)
    extra: BTreeMap::new(),    // client-writable catch-all (_extra_ in CBOR)
}
}
```
In CBOR, this looks like (using ECMWF MARS keys as one concrete example vocabulary):
```json
{
  "version": 2,
  "base": [
    {
      "mars": {
        "class": "od", "type": "fc",
        "date": "20260401", "time": "1200", "param": "2t"
      }
    }
  ],
  "_extra_": {
    "source": "ifs-cycle49r2"
  }
}
```
The same mechanism works for any application vocabulary. A neuroimaging pipeline might use a BIDS namespace:
```json
{
  "version": 2,
  "base": [{
    "bids": { "subject": "sub-01", "session": "ses-01",
              "task": "rest", "run": 1 }
  }]
}
```
A materials-simulation pipeline might use a custom namespace:
```json
{
  "version": 2,
  "base": [{
    "material": { "composition": "Fe3O4", "lattice": "cubic", "T_K": 300.0 }
  }]
}
```
The library does not know or care which vocabulary is used — it simply stores, serialises, and returns the keys you supply.
The version field is required (u16). The base array holds per-object metadata. _extra_ is a free-form catch-all – you can add any key using any CBOR value type. The library does not interpret or validate these keys. Your application layer assigns meaning.
Per-Object Metadata in base
The base section is a CBOR array of maps — one entry per data object. Each entry holds ALL structured metadata for that object independently. Entries are self-contained — there is no tracking of which keys are common across objects.
The encoder auto-populates _reserved_.tensor (with ndim, shape, strides, dtype) in each entry when you call encode() or StreamingEncoder::finish(). Application keys are preserved:
```json
{
  "base": [
    {
      "mars": { "class": "od", "type": "fc", "param": "2t", "levtype": "sfc" },
      "_reserved_": {
        "tensor": { "ndim": 2, "shape": [721, 1440], "strides": [1440, 1], "dtype": "float64" }
      }
    },
    {
      "mars": { "class": "od", "type": "fc", "param": "10u", "levtype": "sfc" },
      "_reserved_": {
        "tensor": { "ndim": 2, "shape": [721, 1440], "strides": [1440, 1], "dtype": "float64" }
      }
    }
  ]
}
```
This lets readers discover the shape, type, and per-object metadata of every object by reading only the global metadata frame — without opening each data object frame.
No common/varying split: Every `base[i]` entry is self-contained. MARS keys shared across all objects (e.g. `class`, `type`) are simply repeated in each entry. If you need to extract commonalities (e.g. for display or merges), use the `compute_common()` utility after decoding.
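The common-key extraction idea behind `compute_common()` can be sketched over plain string maps. The real utility operates on CBOR values; this standalone version is illustrative only:

```rust
use std::collections::BTreeMap;

/// Sketch: keep only the (key, value) pairs that appear identically
/// in every per-object entry. Illustrative — not the library's API.
fn common_keys(entries: &[BTreeMap<String, String>]) -> BTreeMap<String, String> {
    let mut iter = entries.iter();
    let mut common = match iter.next() {
        Some(first) => first.clone(),
        None => return BTreeMap::new(),
    };
    for entry in iter {
        // Drop any pair that is missing or differs in this entry.
        common.retain(|k, v| entry.get(k) == Some(v));
    }
    common
}
```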
DataObjectDescriptor
The params field of each DataObjectDescriptor is a BTreeMap<String, ciborium::Value> for encoding parameters only (e.g. reference_value, bits_per_value). These are flattened into the CBOR descriptor alongside the fixed tensor fields.
For example, a data object’s CBOR descriptor might look like:
```json
{
  "type": "ntensor",
  "ndim": 2,
  "shape": [721, 1440],
  "strides": [1440, 1],
  "dtype": "float32",
  "byte_order": "big",
  "encoding": "simple_packing",
  "filter": "none",
  "compression": "szip",
  "reference_value": 230.5,
  "bits_per_value": 16,
  "hash": { "type": "xxh3", "value": "a1b2c3d4e5f6..." }
}
```
Here, reference_value and bits_per_value live in the params map. Application metadata such as MARS keys belongs in base[i]["mars"] in the global metadata.
Namespaced Keys
Convention: application-layer keys are grouped under a namespace key, so
that multiple vocabularies can coexist in the same message. For example, ECMWF’s
MARS vocabulary lives under "mars":
```json
{
  "version": 2,
  "base": [
    {
      "mars": {
        "class": "od", "type": "fc",
        "param": "2t", "date": "20260401", "step": 6
      }
    }
  ]
}
```
Other pipelines use other namespaces — "cf" for CF conventions, "bids" for
neuroimaging, "dicom" for medical imaging, or anything your application
defines. This convention applies at both levels — global metadata and
per-object params.
Filtering with the CLI
The -w flag on ls, dump, get, and copy uses dot-notation to filter
messages on any namespace. The examples below use the MARS vocabulary, but the
same syntax works with any application namespace (e.g. bids.subject,
dicom.Modality, product.name):
```sh
# Only messages where mars.param equals "2t" or "10u"
tensogram ls data.tgm -w "mars.param=2t/10u"

# Exclude messages where mars.class equals "od"
tensogram ls data.tgm -w "mars.class!=od"
```
The / character separates OR values. Key lookup searches base[i] entries first (skipping _reserved_, first match across entries), then _extra_ for backwards compatibility.
Preceder Metadata Frames
In streaming mode, per-object metadata is normally only available in the footer metadata frame (written after all objects). A Preceder Metadata Frame (frame type 8) allows producers to send per-object metadata before the data object, without waiting for the footer.
A preceder carries a GlobalMetadata CBOR with a single-entry base array for the next data object:
```json
{
  "version": 2,
  "base": [{"product": {"name": "temperature"}, "units": "K"}]
}
```
Merge rule: On decode, preceder keys override footer base[i] keys on conflict. Structural keys auto-populated by the encoder (in _reserved_.tensor: ndim, shape, strides, dtype) are preserved from the footer when absent from the preceder. The consumer sees a unified GlobalMetadata.base — the preceder/footer distinction is transparent.
Use StreamingEncoder::write_preceder() before write_object() to emit a preceder frame. Preceders are optional per-object: some objects may have them, others may not.
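The merge rule can be sketched as a plain map merge. Illustrative only — the real merge operates on the CBOR metadata maps and additionally preserves the `_reserved_.tensor` structural keys from the footer:

```rust
use std::collections::BTreeMap;

/// Sketch of the documented rule: preceder keys override footer keys
/// on conflict; footer keys absent from the preceder are kept.
fn merge_preceder(
    footer: &BTreeMap<String, String>,
    preceder: &BTreeMap<String, String>,
) -> BTreeMap<String, String> {
    let mut merged = footer.clone();
    for (k, v) in preceder {
        merged.insert(k.clone(), v.clone()); // preceder wins on conflict
    }
    merged
}
```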
Value Type Rules
Keys must be text strings. Values must be JSON-compatible CBOR types: string, integer, float, boolean, null, array, or map. Byte strings, CBOR tags, undefined, and half-precision floats are not allowed. See Metadata Value Types for the full rules and rationale.
Deterministic Encoding
When Tensogram encodes metadata to CBOR, it sorts all map keys by their CBOR byte representation (RFC 8949 Section 4.2 canonical form). This guarantees that the same metadata always produces the same bytes, regardless of the order you inserted keys in your application code. This matters for hashing and reproducibility.
Edge case: Nested maps are also sorted recursively. Even metadata stored inside a CBOR map value (like the `"mars"` namespace) gets canonical ordering.
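For short text keys the canonical order is easy to reproduce by hand: the encoded key is a length-carrying head byte followed by the UTF-8 bytes, so sorting encodings bytewise is equivalent to sorting by (length, content). A standalone sketch, assuming text-string keys only:

```rust
/// Sketch of RFC 8949 deterministic map-key ordering for text keys:
/// sort by the bytes of the encoded key, which for text strings works
/// out to length-first, then bytewise content comparison.
fn canonical_key_order(mut keys: Vec<&str>) -> Vec<&str> {
    keys.sort_by(|a, b| {
        a.len()
            .cmp(&b.len())
            .then_with(|| a.as_bytes().cmp(b.as_bytes()))
    });
    keys
}
```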
Objects and Dtypes
An object is one N-dimensional tensor inside a message. A message can carry multiple objects. In v2, each object is fully described by a single struct:
- A `DataObjectDescriptor` carrying tensor metadata, encoding pipeline, and integrity hash – all in one place
- The actual binary payload within the object’s frame
There is no separate “payload descriptor” array. The descriptor travels with the data inside the same frame.
DataObjectDescriptor
```rust
#![allow(unused)]
fn main() {
DataObjectDescriptor {
    // ── Tensor metadata ──
    obj_type: "ntensor".into(),        // always "ntensor" for now
    ndim: 2,                           // number of dimensions
    shape: vec![100, 200],             // size of each dimension
    strides: vec![200, 1],             // elements to skip per dimension step
    dtype: Dtype::Float32,             // element type

    // ── Encoding pipeline ──
    byte_order: ByteOrder::Big,        // big or little endian
    encoding: "simple_packing".into(), // or "none"
    filter: "shuffle".into(),          // or "none"
    compression: "szip".into(),        // or "none", "zstd", "lz4", etc.

    // ── Flexible parameters (encoding only) ──
    params: BTreeMap::from([           // BTreeMap<String, ciborium::Value>
        ("reference_value".into(), ciborium::Value::Float(230.5)),
        ("bits_per_value".into(), ciborium::Value::Integer(16.into())),
    ]),

    // ── Integrity ──
    hash: Some(HashDescriptor {
        hash_type: "xxh3".into(),
        value: "a1b2c3d4e5f6...".into(),
    }),
}
}
```
The params map is flattened into the CBOR alongside the fixed fields, so the on-wire CBOR is a single flat map. This keeps things simple for decoders – no nested “encoding” or “tensor” sub-objects to navigate.
Each data object has its own descriptor, so different objects in the same message can use different encodings, byte orders, and hash algorithms.
Strides
Strides tell you how to navigate the memory layout. For a C-contiguous (row-major) array of shape [100, 200]:
- Advancing along axis 0 (rows) skips 200 elements
- Advancing along axis 1 (columns) skips 1 element
So strides = [200, 1]. For a Fortran-contiguous (column-major) array the strides would be reversed: [1, 100].
To compute C-contiguous strides from shape:
```rust
#![allow(unused)]
fn main() {
fn compute_strides(shape: &[u64]) -> Vec<u64> {
    let mut strides = vec![1u64; shape.len()];
    // saturating_sub avoids an underflow panic for an empty shape
    for i in (0..shape.len().saturating_sub(1)).rev() {
        strides[i] = strides[i + 1] * shape[i + 1];
    }
    strides
}

// shape [100, 200] → strides [200, 1]
// shape [4, 5, 6]  → strides [30, 6, 1]
}
```
Supported Data Types
| Name | Size | Description |
|---|---|---|
| `float16` | 2 bytes | IEEE 754 half-precision float |
| `bfloat16` | 2 bytes | Brain float (truncated float32) |
| `float32` | 4 bytes | IEEE 754 single-precision float |
| `float64` | 8 bytes | IEEE 754 double-precision float |
| `complex64` | 8 bytes | Two float32 (real + imag) |
| `complex128` | 16 bytes | Two float64 (real + imag) |
| `int8` | 1 byte | Signed integer |
| `int16` | 2 bytes | Signed integer |
| `int32` | 4 bytes | Signed integer |
| `int64` | 8 bytes | Signed integer |
| `uint8` | 1 byte | Unsigned integer |
| `uint16` | 2 bytes | Unsigned integer |
| `uint32` | 4 bytes | Unsigned integer |
| `uint64` | 8 bytes | Unsigned integer |
| `bitmask` | < 1 byte | Packed bits (sub-byte; size depends on element count) |
Edge case: `bitmask` returns `0` from `byte_width()`. Callers that need the actual byte count must compute it from the element count: `(num_elements + 7) / 8`.
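A helper for that computation is one line, but worth pinning down the round-up behaviour:

```rust
/// Byte count for a bitmask payload: one bit per element, rounded up
/// to the next whole byte, per the formula (num_elements + 7) / 8.
fn bitmask_bytes(num_elements: u64) -> u64 {
    (num_elements + 7) / 8
}
```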
Multiple Objects in One Message
A message can carry several related tensors. Two concrete examples:
- A wave-spectrum message with the spectrum itself as a 3-tensor and a land/sea mask as a 2-tensor.
- A medical-imaging message with a 4-D time-series volume, a 3-D segmentation mask, and a 1-D array of acquisition timestamps.
```mermaid
block-beta
  columns 3
  A["Object 0\nSpectrum\nf32 · 721×1440×30\nencoding: simple_packing"]:2
  B["Object 1\nLand mask\nuint8 · 721×1440\nencoding: none"]:1
```
All objects live in the same message. Each object has its own
DataObjectDescriptor embedded in its frame and its own entry in
GlobalMetadata.base holding per-object application metadata. Different
objects can use completely different encoding pipelines.
Edge case: The number of `DataObjectDescriptor` entries and the data slices passed to `encode()` must be equal. The encoder returns an error if they do not match.
The Encoding Pipeline
Every object payload passes through a three-stage pipeline on the way in (encoding) and out (decoding). The stages always run in the same order:
```mermaid
flowchart TD
  subgraph Encode["Encode Path"]
    direction TB
    A["Raw bytes"]
    B["Stage 1 — Encoding
(lossy quantization)"]
    C["Stage 2 — Filter
(byte shuffle)"]
    D["Stage 3 — Compression
(szip / zstd / lz4 / blosc2 / zfp / sz3)"]
    A --> B --> C --> D
  end
  S[("Stored bytes")]
  subgraph Decode["Decode Path"]
    direction TB
    F["Stage 3 — Decompress"]
    G["Stage 2 — Unshuffle"]
    H["Stage 1 — Dequantize"]
    I["Raw bytes"]
    F --> G --> H --> I
  end
  D --> S --> F
  style A fill:#e8f5e9,stroke:#388e3c
  style S fill:#fff3e0,stroke:#f57c00,stroke-width:2px
  style I fill:#e8f5e9,stroke:#388e3c
  style Encode fill:#e3f2fd,stroke:#1565c0,color:#1565c0
  style Decode fill:#fce4ec,stroke:#c62828,color:#c62828
```
Each stage is independently configurable per object via fields in the DataObjectDescriptor. Set a stage to "none" to skip it. For callers with already-encoded payloads, a pipeline-bypass option exists via encode_pre_encoded (see Pre-encoded Payloads).
Stage 1: Encoding
Encoding transforms values to reduce the number of bits needed to represent them. The only supported encoding right now is simple_packing — a lossy quantisation that maps a bounded range of floating-point values onto N-bit integers. The bit layout matches GRIB 2 simple_packing so quantised payloads are interoperable with existing GRIB tooling.
| Value | Meaning |
|---|---|
| `"none"` | Pass through unchanged |
| `"simple_packing"` | Lossy quantization (see Simple Packing) |
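To convey the quantisation idea, here is a heavily simplified quantise/dequantise sketch: it maps floats in [min, max] onto n-bit integers relative to a reference value. The real GRIB 2 layout additionally involves binary and decimal scale factors and a packed bit stream, so this sketch is not wire-compatible:

```rust
/// Simplified quantisation sketch (NOT the GRIB 2 bit layout).
/// Returns (reference_value, scale, n-bit integer codes).
/// Assumes non-empty input and bits < 32.
fn quantize(values: &[f32], bits: u32) -> (f32, f32, Vec<u32>) {
    let min = values.iter().cloned().fold(f32::INFINITY, f32::min);
    let max = values.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let levels = (1u32 << bits) - 1;
    let scale = if max > min { (max - min) / levels as f32 } else { 1.0 };
    let packed = values.iter().map(|v| ((v - min) / scale).round() as u32).collect();
    (min, scale, packed)
}

/// Inverse: reconstruct approximate floats from the integer codes.
fn dequantize(reference: f32, scale: f32, packed: &[u32]) -> Vec<f32> {
    packed.iter().map(|&n| reference + n as f32 * scale).collect()
}
```

The round-trip is lossy: each value is reconstructed to within half a quantisation step (scale / 2).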
Stage 2: Filter
Filters rearrange bytes to improve compression ratios. The shuffle filter reorders bytes by their significance level (all most-significant bytes first, then all second-most-significant bytes, etc.), which makes float data much more compressible because nearby values have similar high bytes.
| Value | Meaning |
|---|---|
| `"none"` | Pass through unchanged |
| `"shuffle"` | Byte-level shuffle (see Byte Shuffle Filter) |
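The transform itself is straightforward: for k-byte elements, gather byte 0 of every element, then byte 1, and so on (for big-endian data, byte 0 is the most significant byte, matching the description above). A standalone sketch, assuming the buffer length is a multiple of the element size:

```rust
/// Byte-shuffle sketch: regroup bytes by position within the element.
fn shuffle(data: &[u8], elem_size: usize) -> Vec<u8> {
    let n = data.len() / elem_size;
    let mut out = vec![0u8; data.len()];
    for i in 0..n {
        for b in 0..elem_size {
            out[b * n + i] = data[i * elem_size + b];
        }
    }
    out
}

/// Inverse transform: scatter bytes back into element order.
fn unshuffle(data: &[u8], elem_size: usize) -> Vec<u8> {
    let n = data.len() / elem_size;
    let mut out = vec![0u8; data.len()];
    for i in 0..n {
        for b in 0..elem_size {
            out[i * elem_size + b] = data[b * n + i];
        }
    }
    out
}
```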
Stage 3: Compression
Compression reduces the total byte count. Seven compressors are implemented:
| Value | Type | Random Access | Notes |
|---|---|---|---|
| `"none"` | Pass-through | Yes | No compression |
| `"szip"` | Lossless | Yes | CCSDS 121.0-B-3 via libaec |
| `"zstd"` | Lossless | No | Excellent ratio/speed tradeoff |
| `"lz4"` | Lossless | No | Fastest decompression |
| `"blosc2"` | Lossless | Yes | Multi-codec, chunk-level access |
| `"zfp"` | Lossy | Yes (fixed-rate) | Floating-point arrays |
| `"sz3"` | Lossy | No | Error-bounded scientific data |
See Compression for full details on each compressor, including parameters and random access support.
Note: ZFP and SZ3 operate directly on typed floating-point data. Use them with `encoding: "none"` and `filter: "none"` – they replace both encoding and compression.
Typical Combinations
| Use case | encoding | filter | compression |
|---|---|---|---|
| Exact integers (e.g. a mask) | none | none | none |
| Lossy bounded-range floats | simple_packing | none | szip |
| Best lossless (floats) | none | shuffle | szip or blosc2 |
| GRIB 2 CCSDS-interoperable | simple_packing | none | szip |
| Real-time streaming | none | none | lz4 |
| Archival storage | none | shuffle | zstd |
| ML model weights | none | none | blosc2 |
| Lossy float w/ random access | none | none | zfp (fixed_rate) |
| Error-bounded science | none | none | sz3 |
How It Looks in Code
The entire pipeline is configured through the DataObjectDescriptor:
```rust
#![allow(unused)]
fn main() {
use ciborium::Value;

DataObjectDescriptor {
    obj_type: "ntensor".into(),
    ndim: 2,
    shape: vec![721, 1440],
    strides: vec![1440, 1],
    dtype: Dtype::Float32,
    byte_order: ByteOrder::Big,
    encoding: "simple_packing".into(),
    filter: "none".into(),
    compression: "szip".into(),
    params: BTreeMap::from([
        ("reference_value".into(), Value::Float(230.5)),
        ("bits_per_value".into(), Value::Integer(16.into())),
    ]),
    hash: None, // set automatically during encoding
}
}
```
}
All encoding parameters (reference_value, bits_per_value, szip_block_offsets, etc.) go into the params map. The encoder populates additional params during encoding (like block offsets for szip), and the decoder reads them back.
Integrity Hashing
After all three stages, the stored bytes can be hashed. The hash is stored in the DataObjectDescriptor’s hash field alongside the encoded bytes. On decode, if verify_hash: true is set, the hash is recomputed and compared.
| Algorithm | Hash length | Notes |
|---|---|---|
| `xxh3` | 16 hex chars (64-bit) | Default. Fast, non-cryptographic |
Edge case: The hash covers the stored bytes (after encoding + filter + compression), not the original raw bytes. This means a hash mismatch always indicates storage or transmission corruption, not a quantization difference from lossy encoding.
Wire Format (v3)
This page describes the exact byte layout of a Tensogram v3
message — the format shipped in 0.17.0. You need this if you are
implementing a reader in another language, debugging a corrupted
file, or just want to understand what is happening under the hood.
For the normative specification, see
plans/WIRE_FORMAT.md.
All integer fields are big-endian (network byte order).
Overview
A Tensogram message is built from three sections: a header (preamble + optional frames), one or more data object frames, and a footer (optional frames + postamble).
```
┌────────────────────────────────────────────────────────────────────┐
│ PREAMBLE                magic, version, flags, length      (24 B)  │
├────────────────────────────────────────────────────────────────────┤
│ HEADER METADATA FRAME   CBOR global metadata           (optional)  │
├────────────────────────────────────────────────────────────────────┤
│ HEADER INDEX FRAME      CBOR object offsets            (optional)  │
├────────────────────────────────────────────────────────────────────┤
│ HEADER HASH FRAME       CBOR object hashes             (optional)  │
├────────────────────────────────────────────────────────────────────┤
│ PRECEDER METADATA FRAME per-object metadata            (optional)  │
│ DATA OBJECT FRAME 0     header + payload + descriptor              │
│ PRECEDER METADATA FRAME per-object metadata            (optional)  │
│ DATA OBJECT FRAME 1     ...                                        │
│ DATA OBJECT FRAME 2     (no preceder)                              │
│ ...                     (any number of objects)                    │
├────────────────────────────────────────────────────────────────────┤
│ FOOTER HASH FRAME       CBOR object hashes             (optional)  │
├────────────────────────────────────────────────────────────────────┤
│ FOOTER INDEX FRAME      CBOR object offsets            (optional)  │
├────────────────────────────────────────────────────────────────────┤
│ FOOTER METADATA FRAME   CBOR global metadata           (optional)  │
├────────────────────────────────────────────────────────────────────┤
│ POSTAMBLE               first_footer_offset, total_length,         │
│                         end_magic                          (24 B)  │
└────────────────────────────────────────────────────────────────────┘
```
At least one metadata frame (header or footer) must be present — messages cannot exist without metadata. Index and hash frames are optional but highly encouraged. By default, the encoder places them in the header when writing to a buffer, or in the footer when streaming.
Frame ordering: The decoder enforces that frames appear in order: header frames, then data object frames, then footer frames. A header frame appearing after a data object frame, or a data object frame appearing after a footer frame, is rejected as malformed.
Preamble (24 bytes)
The preamble is the fixed-size start of every message.
```
Offset  Size   Field
──────  ────── ─────────────────────────────────
0       8      Magic: "TENSOGRM" (ASCII)
8       2      Version (uint16 BE) — must be 3 in v3
10      2      Flags (uint16 BE)
12      4      Reserved (uint32 BE) — set to zero
16      8      Total length (uint64 BE)
```
Total length is the byte count of the entire message from the first byte of the preamble to the last byte of the postamble. A value of zero means the encoder is in streaming mode — the total length was not known when the preamble was written.
Version compatibility. v3 decoders reject any preamble whose
version field is not exactly 3. Older v1/v2 messages must be
re-encoded.
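A reader for the preamble is a few lines; the field offsets below are taken directly from the table above (error handling and the reserved-field check are minimal):

```rust
/// Parse the fixed 24-byte preamble described above.
/// Returns (version, flags, total_length), or None if the buffer is
/// too short or the magic does not match.
fn parse_preamble(buf: &[u8]) -> Option<(u16, u16, u64)> {
    if buf.len() < 24 || &buf[0..8] != b"TENSOGRM" {
        return None;
    }
    let version = u16::from_be_bytes([buf[8], buf[9]]);
    let flags = u16::from_be_bytes([buf[10], buf[11]]);
    // bytes 12..16 are the reserved uint32 (must be zero on write)
    let total_length = u64::from_be_bytes(buf[16..24].try_into().ok()?);
    Some((version, flags, total_length))
}
```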
Preamble flags
The flags field is a bitmask indicating which optional frames are present and, new in v3, whether inline per-frame hash slots are populated:
| Bit | Flag | Meaning |
|---|---|---|
| 0 | HEADER_METADATA | A HeaderMetadata frame is present. |
| 1 | FOOTER_METADATA | A FooterMetadata frame is present. |
| 2 | HEADER_INDEX | A HeaderIndex frame is present. |
| 3 | FOOTER_INDEX | A FooterIndex frame is present. |
| 4 | HEADER_HASHES | A HeaderHash aggregate frame is present. |
| 5 | FOOTER_HASHES | A FooterHash aggregate frame is present. |
| 6 | PRECEDER_METADATA | At least one PrecederMetadata frame is present. |
| 7 | HASHES_PRESENT | Every frame’s inline hash slot is populated with a non-zero xxh3-64 digest (new in v3). |
Unused flag bits must be set to zero.
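Testing these bits is a simple mask check; the constants below mirror the bit positions in the table (a subset shown for illustration):

```rust
// Preamble flag bit positions, per the table above (illustrative subset).
const HEADER_METADATA: u16 = 1 << 0;
const HEADER_INDEX: u16 = 1 << 2;
const HASHES_PRESENT: u16 = 1 << 7;

/// True when the given flag bit is set in the preamble flags field.
fn has_flag(flags: u16, flag: u16) -> bool {
    flags & flag != 0
}
```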
Frames
Every frame (header, footer, and data object) shares a common
16-byte frame header and ends with a type-specific footer whose
last 12 bytes are always [hash u64][ENDF 4] (new in v3).
Frame header (16 bytes)
```
Offset  Size   Field
──────  ────── ─────────────────────────────────
0       2      Start marker: "FR" (ASCII)
2       2      Frame type (uint16 BE)
4       2      Frame version (uint16 BE)
6       2      Reserved flags (uint16 BE)
8       8      Frame length — offset to end of frame (uint64 BE)
```
Frame versions are independent from the message version and from each other.
Frame common footer (12 bytes)
Every frame ends with this fixed-size tail:
```
Offset (from frame end)  Size   Field
───────────────────────  ────── ─────────────────────────────────
-12                      8      hash (uint64 BE) — xxh3-64 digest of the
                                frame body, or 0x0000000000000000 when
                                HASHES_PRESENT = 0
-4                       4      End marker: "ENDF" (ASCII)
```
Data-object frames (type 9) have a larger 20-byte footer that
adds an 8-byte cbor_offset field before the common tail.
Frame types
| Type | Name | Contents |
|---|---|---|
| 1 | Header Metadata | CBOR global metadata map |
| 2 | Header Index | CBOR index of data object offsets |
| 3 | Header Hash | CBOR aggregate of per-object hashes |
| 4 | (reserved) | Occupied by the obsolete v2 NTensorFrame; any v3 decoder errors on read |
| 5 | Footer Hash | CBOR aggregate of per-object hashes |
| 6 | Footer Index | CBOR index of data object offsets |
| 7 | Footer Metadata | CBOR global metadata map |
| 8 | Preceder Metadata | Per-object CBOR metadata (see below) |
| 9 | NTensorFrame | Descriptor + payload + optional NaN / Inf bitmask companion sections (see NaN / Inf Handling) |
The body phase of a v3 message carries one or more
data-object frames. In v3 only NTensorFrame (type 9) is
defined; future types can slot in at fresh unused numbers without
bumping the wire version.
Padding between frames
It is valid to have padding bytes between a frame’s ENDF marker
and the next frame’s FR marker. This allows encoders to align
frame starts to 8-byte (64-bit) boundaries for memory-mapped
access.
Data Object Frames
A data object frame wraps one tensor’s payload together with its
CBOR descriptor. v3 defines exactly one concrete data-object
type, NTensorFrame (type 9). The descriptor can go either
before or after the payload — flag bit 0 in the frame
header controls this. The default is after, because when
encoding the descriptor is sometimes only fully known once the
payload has been written (e.g. after computing a hash or
determining compressed size).
NTensorFrame (type 9) — v3 canonical layout
┌──────────────────────────────────────────────────────────────┐
│ FRAME HEADER "FR" + type(9) + ver + flags + len (16 B)│
├──────────────────────────────────────────────────────────────┤
│ DATA PAYLOAD raw or compressed bytes, NaN/Inf │
│ positions substituted with 0.0 │
├──────────────────────────────────────────────────────────────┤
│ mask_nan blob OPTIONAL — compressed NaN position mask │
├──────────────────────────────────────────────────────────────┤
│ mask_inf+ blob OPTIONAL — compressed +Inf position mask │
├──────────────────────────────────────────────────────────────┤
│ mask_inf- blob OPTIONAL — compressed -Inf position mask │
├──────────────────────────────────────────────────────────────┤
│ CBOR DESCRIPTOR carries a top-level "masks" sub-map │
│ when any mask is present (see below) │
├──────────────────────────────────────────────────────────────┤
│ cbor_offset (uint64 BE, 8 B) │
│ hash (uint64 BE, 8 B) xxh3-64 of body │
│ "ENDF" (4 B) │
└──────────────────────────────────────────────────────────────┘
The data-object footer is 20 bytes: [cbor_offset u64] [hash u64] [ENDF 4]. The cbor_offset field points at the CBOR
descriptor’s start relative to the frame’s first byte. The inline
hash slot carries the xxh3-64 of the frame body (everything
between the 16-byte header and this 20-byte footer) when the
message’s HASHES_PRESENT preamble flag is set; otherwise it is
0x0000000000000000.
Hash scope includes payload + masks + CBOR. It does NOT include
the header, the cbor_offset field, the hash slot itself, or
ENDF.
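Putting the layout together, a type-9 frame splits into a 16-byte header, the hashed body, and the 20-byte footer fields. A std-only sketch (`split_ntensor_frame` is an invented helper name, not the library API):

```rust
/// Split a type-9 data-object frame into (body, cbor_offset, stored_hash).
/// The hash covers only `body` (payload + masks + CBOR descriptor); the
/// 16-byte header and the 20-byte footer are excluded from its scope.
fn split_ntensor_frame(frame: &[u8]) -> Option<(&[u8], u64, u64)> {
    const HEADER: usize = 16;
    const FOOTER: usize = 20;
    if frame.len() < HEADER + FOOTER || &frame[frame.len() - 4..] != b"ENDF" {
        return None;
    }
    let f = frame.len() - FOOTER;
    let cbor_offset = u64::from_be_bytes(frame[f..f + 8].try_into().unwrap());
    let hash = u64::from_be_bytes(frame[f + 8..f + 16].try_into().unwrap());
    Some((&frame[HEADER..f], cbor_offset, hash))
}

fn main() {
    let mut fr = vec![0u8; 16];                 // frame header placeholder
    fr.extend_from_slice(&[9, 9, 9]);           // body: payload + masks + CBOR
    fr.extend_from_slice(&16u64.to_be_bytes()); // cbor_offset
    fr.extend_from_slice(&0u64.to_be_bytes());  // hash slot (HASHES_PRESENT = 0)
    fr.extend_from_slice(b"ENDF");
    let (body, cbor_offset, hash) = split_ntensor_frame(&fr).unwrap();
    assert_eq!((body, cbor_offset, hash), (&[9u8, 9, 9][..], 16, 0));
}
```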
The CBOR descriptor fully describes the data object: its type, shape, strides, data type, byte order, encoding pipeline, and optional per-object metadata. See the CBOR Metadata page for the schema.
See NaN / Inf Handling for the mask encode / decode semantics and the documented lossy-reconstruction caveat.
Preceder Metadata Frame
A Preceder Metadata Frame (type 8) optionally appears immediately before a Data Object Frame. It carries per-object metadata for the following data object, using the same GlobalMetadata CBOR format but with a single-entry base array.
Use case: Streaming producers that do not know ahead of time when the message will end can emit per-object metadata early via preceders, rather than waiting for the footer.
Ordering rules:
- Must appear in the data objects phase (after headers, before footers).
- Must be followed by exactly one Data Object Frame.
- Two consecutive preceders without an intervening DataObject are invalid.
- A dangling preceder (not followed by a DataObject) is invalid.
- Preceders are optional per-object.
CBOR structure:
{
"version": 2,
"base": [{"mars": {"param": "2t"}, "units": "K"}]
}
Merge on decode: Preceder keys override footer base[i] keys on conflict. Footer-only keys (e.g., auto-populated _reserved_.tensor with ndim, shape, strides, dtype) are preserved. The consumer sees a unified GlobalMetadata.base — the preceder/footer distinction is transparent.
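The merge rule amounts to a map union in which the preceder side wins. A sketch with String values standing in for CBOR values (this is not the library's actual signature):

```rust
use std::collections::BTreeMap;

/// Merge per-object metadata on decode: preceder keys override footer
/// keys on conflict; footer-only keys are preserved.
fn merge_object_metadata(
    footer: &BTreeMap<String, String>,
    preceder: &BTreeMap<String, String>,
) -> BTreeMap<String, String> {
    let mut merged = footer.clone();
    for (k, v) in preceder {
        merged.insert(k.clone(), v.clone()); // preceder wins on conflict
    }
    merged
}

fn main() {
    let footer = BTreeMap::from([
        ("param".to_string(), "2t".to_string()),
        ("dtype".to_string(), "float32".to_string()), // footer-only, preserved
    ]);
    let preceder = BTreeMap::from([("param".to_string(), "10u".to_string())]);
    let merged = merge_object_metadata(&footer, &preceder);
    assert_eq!(merged["param"], "10u");
    assert_eq!(merged["dtype"], "float32");
}
```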
Postamble (16 bytes)
The postamble sits at the very end of every message.
Offset Size Field
────── ────── ─────────────────────────────────
0 8 first_footer_offset (uint64 BE)
8 8 End magic: "39277777" (ASCII)
first_footer_offset is the byte offset (from the start of the message) to the first footer frame. This is never zero:
- If footer frames exist, it points to the start of the first one (e.g., the Footer Hash Frame).
- If no footer frames exist, it points to the postamble itself.
This guarantee means a reader can always distinguish “no footer frames” from “footer at offset 0” without ambiguity.
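A reader can implement this check from the last 16 bytes alone. A std-only sketch (`parse_postamble` is an invented helper name):

```rust
/// Parse the 16-byte postamble at the end of a message. Returns
/// (first_footer_offset, has_footer_frames), or None if the end magic
/// does not match.
fn parse_postamble(message: &[u8]) -> Option<(u64, bool)> {
    if message.len() < 16 || &message[message.len() - 8..] != b"39277777" {
        return None;
    }
    let p = message.len() - 16;
    let off = u64::from_be_bytes(message[p..p + 8].try_into().unwrap());
    // When the offset points at the postamble itself, no footer frames exist.
    Some((off, off as usize != p))
}

fn main() {
    let mut msg = vec![0u8; 24]; // preamble placeholder
    let p = msg.len() as u64;    // postamble starts at offset 24
    msg.extend_from_slice(&p.to_be_bytes()); // points at the postamble: no footers
    msg.extend_from_slice(b"39277777");
    assert_eq!(parse_postamble(&msg), Some((24, false)));
}
```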
The end magic 39277777 was chosen because it is unlikely to appear naturally in floating-point or integer data, making it useful as a corruption boundary detector.
Random Access Patterns
With a header index (most common)
When a message was written in non-streaming mode, the index is in the header. This is the fastest path — no seeking to the end required.
1. Read preamble (24 B) → check flags
2. Read header metadata frame → global context
3. Read header index frame → offsets[], lengths[]
4. Seek to offsets[N], read data object frame → decode
With a footer index only (streaming mode)
When a message was written in streaming mode, the encoder did not know the object count or offsets up front. The index lives in the footer.
1. Seek to end − 16, read postamble → first_footer_offset
2. Seek to first_footer_offset, scan footer frames → find index
3. Read footer index frame → offsets[], lengths[]
4. Seek to offsets[N], read data object frame → decode
Both paths give O(1) access to any data object by index. The
object count is derived from offsets.len().
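Given the decoded index arrays, object access reduces to a bounds-checked slice. A sketch, assuming the whole message is in memory:

```rust
/// O(1) access to data object `n` via the index arrays. Returns None on
/// a malformed index (mismatched array lengths) or an out-of-range slice.
fn object_frame<'a>(
    msg: &'a [u8],
    offsets: &[u64],
    lengths: &[u64],
    n: usize,
) -> Option<&'a [u8]> {
    if offsets.len() != lengths.len() || n >= offsets.len() {
        return None;
    }
    let start = offsets[n] as usize;
    let end = start.checked_add(lengths[n] as usize)?;
    msg.get(start..end)
}

fn main() {
    let msg: Vec<u8> = (0u8..10).collect();
    // Two fake "frames" at offsets 2 and 5, lengths 3 and 4.
    assert_eq!(object_frame(&msg, &[2, 5], &[3, 4], 0), Some(&msg[2..5]));
    assert_eq!(object_frame(&msg, &[2, 5], &[3], 0), None); // mismatched index
}
```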
Scanning a Multi-Message File
Multiple messages can be concatenated into a single .tgm file. To find message boundaries:
- Scan forward for the `TENSOGRM` magic (8 bytes).
- Read `total_length` from the preamble.
  - If `total_length` is non-zero, advance by that many bytes to reach the next message.
  - If `total_length` is zero (streaming mode), use the header index frame length if present.
- If neither total length nor header index is available, walk frame-by-frame — each frame header contains a length field — until the next `TENSOGRM` magic or EOF.
- Verify the `39277777` end magic at the expected position to confirm message integrity.
flowchart TD
A[Start of file] --> B{Find TENSOGRM?}
B -- No --> Z[End of scan]
B -- Yes --> C[Read total_length at +16]
C --> D{total_length > 0?}
D -- Yes --> E[Advance to offset + total_length]
D -- No --> F[Walk frame-by-frame to next magic]
E --> G[Verify 39277777 end magic]
F --> G
G -- Valid --> H[Record message]
H --> B
G -- Invalid --> I[Skip 1 byte, resume scan]
I --> B
If the end magic does not match, the message is likely corrupt. The scanner skips one byte and resumes searching — this is the corruption recovery path.
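The scan loop can be sketched in a few lines. This simplified version assumes non-streaming messages (a non-zero total_length at preamble offset +16, as in the flowchart) and omits the frame-walking fallback:

```rust
/// Scan a buffer for concatenated Tensogram messages: find the magic,
/// hop by total_length, verify the end magic, and on mismatch skip one
/// byte and resume (the corruption-recovery path).
fn scan_messages(buf: &[u8]) -> Vec<(usize, usize)> {
    let mut found = Vec::new();
    let mut i = 0;
    while i + 24 <= buf.len() {
        if &buf[i..i + 8] != b"TENSOGRM" {
            i += 1; // not a message start: keep searching
            continue;
        }
        let total = u64::from_be_bytes(buf[i + 16..i + 24].try_into().unwrap()) as usize;
        let end = match i.checked_add(total) {
            Some(e) => e,
            None => { i += 1; continue; }
        };
        // Minimum message = 24 B preamble + 16 B postamble = 40 B.
        if total >= 40 && end <= buf.len() && &buf[end - 8..end] == b"39277777" {
            found.push((i, total)); // record (offset, length)
            i = end;                // hop to the next candidate message
        } else {
            i += 1;                 // corrupt or streaming: skip one byte, resume
        }
    }
    found
}

fn main() {
    // One minimal fake message: magic, 8 placeholder bytes, total_length,
    // 8 body bytes, end magic (40 bytes total).
    let mut msg = Vec::new();
    msg.extend_from_slice(b"TENSOGRM");
    msg.extend_from_slice(&[0u8; 8]);
    msg.extend_from_slice(&40u64.to_be_bytes());
    msg.extend_from_slice(&[0u8; 8]);
    msg.extend_from_slice(b"39277777");
    assert_eq!(scan_messages(&msg), vec![(0, 40)]);
}
```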
A Note on CBOR
Frames that contain CBOR data (metadata, index, hash) use length-prefixed CBOR encoding — there are no explicit start/end markers within the CBOR stream itself. The CBOR decoder reads each item’s head byte (plus any extended-length bytes) to determine the data type and item count, then consumes exactly that much data. The frame boundaries (FR…ENDF) provide the outer containment.
All CBOR maps use deterministic encoding with canonical key ordering (RFC 8949 section 4.2). See CBOR Metadata for details.
CBOR Metadata Schema
Tensogram uses CBOR (Concise Binary Object Representation) for all structured metadata. There are four kinds of CBOR structures in a message, each living in its own frame:
- GlobalMetadata — in header or footer metadata frames
- DataObjectDescriptor — inside each data object frame
- IndexFrame — in header or footer index frames
- HashFrame — in header or footer hash frames
All CBOR maps use deterministic encoding with canonical key ordering per RFC 8949 section 4.2. Keys are sorted by the byte representation of their CBOR-encoded key, applied recursively to nested maps. This means the same metadata always produces the same bytes — important if you hash messages or compare them by digest.
GlobalMetadata
The global metadata frame contains a single CBOR map. The only required key is version; everything else is optional.
| Key | Type | Required | Description |
|---|---|---|---|
| version | uint | Yes | Format version. Currently 2 |
| base | array of maps | No | Per-object metadata — one entry per data object; each entry holds ALL metadata for that object independently |
| _reserved_ | map | No | Library internals (provenance: encoder, time, uuid). Client code MUST NOT write to this. |
| _extra_ | map | No | Client-writable catch-all for ad-hoc message-level annotations |
| any unknown key | any | No | Silently ignored on decode (forward compatibility) |
Each data object is self-describing via its own per-frame descriptor (see below). The base array provides per-object metadata at the message level so readers can discover object metadata from the global frame alone, without opening each data object frame.
The base Array
The base array is one entry per data object. Each entry is a CBOR map holding ALL structured metadata for that object. The encoder auto-populates _reserved_.tensor (containing ndim, shape, strides, dtype) in each entry. Application keys (e.g. "mars") are preserved:
{
"base": [
{
"mars": { "class": "od", "stream": "oper", "param": "2t", "date": "20260404" },
"_reserved_": {
"tensor": { "ndim": 2, "shape": [721, 1440], "strides": [1440, 1], "dtype": "float32" }
}
},
{
"mars": { "class": "od", "stream": "oper", "param": "10u", "date": "20260404" },
"_reserved_": {
"tensor": { "ndim": 2, "shape": [721, 1440], "strides": [1440, 1], "dtype": "float32" }
}
}
]
}
Each entry corresponds to one data object in order. Entries are independent — there is no tracking of which keys are common across objects. If you need to extract commonalities (e.g. for display or merge operations), use the compute_common() utility in software after decoding.
Key difference from earlier versions: There is no common/payload split. Every base[i] entry is self-contained. MARS keys that are shared across all objects (e.g. class, stream, date) are simply repeated in each entry.
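The compute_common() utility mentioned above plausibly reduces to a key-by-key intersection across entries. A sketch with String values standing in for CBOR values; the real utility's signature may differ:

```rust
use std::collections::BTreeMap;

/// Keys whose values are identical across every base entry.
fn compute_common(entries: &[BTreeMap<String, String>]) -> BTreeMap<String, String> {
    let mut common = match entries.first() {
        Some(first) => first.clone(),
        None => return BTreeMap::new(),
    };
    for entry in &entries[1..] {
        // Keep only keys that appear in this entry with an equal value.
        common.retain(|k, v| entry.get(k).map(|x| x == v).unwrap_or(false));
    }
    common
}

fn main() {
    let a = BTreeMap::from([
        ("class".to_string(), "od".to_string()),
        ("param".to_string(), "2t".to_string()),
    ]);
    let b = BTreeMap::from([
        ("class".to_string(), "od".to_string()),
        ("param".to_string(), "10u".to_string()),
    ]);
    let common = compute_common(&[a, b]);
    assert_eq!(common.len(), 1);       // only "class" is shared with an equal value
    assert_eq!(common["class"], "od");
}
```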
The _reserved_ Section
The _reserved_ section at the message level holds library-managed provenance information. Client code can read these values but must not write to _reserved_ — the encoder validates this and rejects messages where client code has written to it.
{
"_reserved_": {
"encoder": { "name": "tensogram", "version": "0.1.0" },
"time": "2026-04-06T12:00:00Z",
"uuid": "550e8400-e29b-41d4-a716-446655440000"
}
}
Note: _reserved_.encoder.version is set to the library’s crate version at compile time via env!("CARGO_PKG_VERSION") — the value above reflects the tensogram version in use.
Within each base[i] entry, the encoder also auto-populates _reserved_.tensor:
{
"_reserved_": {
"tensor": {
"ndim": 2,
"shape": [721, 1440],
"strides": [1440, 1],
"dtype": "float32"
}
}
}
The _extra_ Section
The _extra_ section is a client-writable catch-all for ad-hoc message-level annotations:
{
"_extra_": {
"source": "ifs-cycle49r2",
"experiment_tag": "alpha-run-003"
}
}
Example GlobalMetadata
A complete example with two data objects (temperature and wind fields):
{
"version": 2,
"base": [
{
"mars": {
"class": "od", "stream": "oper", "expver": "0001",
"date": "20260404", "time": "0000", "step": "0",
"levtype": "sfc", "grid": "regular_ll", "param": "2t"
},
"_reserved_": {
"tensor": { "ndim": 2, "shape": [721, 1440], "strides": [1440, 1], "dtype": "float32" }
}
},
{
"mars": {
"class": "od", "stream": "oper", "expver": "0001",
"date": "20260404", "time": "0000", "step": "0",
"levtype": "sfc", "grid": "regular_ll", "param": "10u"
},
"_reserved_": {
"tensor": { "ndim": 2, "shape": [721, 1440], "strides": [1440, 1], "dtype": "float32" }
}
}
],
"_reserved_": {
"encoder": { "name": "tensogram", "version": "0.6.0" },
"time": "2026-04-06T12:00:00Z",
"uuid": "550e8400-e29b-41d4-a716-446655440000"
},
"_extra_": {
"source": "ifs-cycle49r2"
}
}
Each base[i] entry is fully self-contained. The only key that varies between the two entries above is param. All other MARS keys are repeated — this is by design. Commonalities can be computed in software via compute_common() when needed.
Optional: Full GRIB Namespace Keys
When the GRIB importer runs with preserve_all_keys (CLI: --all-keys), all non-mars ecCodes namespace keys are stored under a "grib" sub-object within each base[i] entry:
{
"base": [
{
"mars": { "class": "od", "grid": "regular_ll", "param": "2t", "..." : "..." },
"grib": {
"geography": { "Ni": 1440, "Nj": 721, "gridType": "regular_ll" },
"time": { "dataDate": 20260404, "dataTime": 0 },
"ls": { "edition": 2, "centre": "ecmf", "packingType": "grid_ccsds" },
"parameter": { "paramId": 167, "shortName": "2t", "units": "K" },
"statistics": { "max": 311.03, "min": 212.84, "avg": 277.6 }
},
"_reserved_": {
"tensor": { "ndim": 2, "shape": [721, 1440], "strides": [1440, 1], "dtype": "float64" }
}
}
]
}
The namespaces captured are: ls, geography, time, vertical, parameter, statistics. Keys may overlap between namespaces (e.g. gridType appears in both ls and geography); each namespace stores its own copy. Empty namespaces are omitted.
DataObjectDescriptor
Each data object frame contains its own CBOR descriptor. This descriptor fully describes how to decode the payload — its type, shape, encoding pipeline, and optional per-object metadata. It lives inside the data object frame (not in a central metadata block).
| Key | Type | Required | Description |
|---|---|---|---|
| type | text | Yes | Object type, e.g. "ntensor" (Rust field: obj_type) |
| ndim | uint | Yes | Number of dimensions |
| shape | array of uint | Yes | Size of each dimension |
| strides | array of uint | Yes | Element stride per dimension |
| dtype | text | Yes | Data type string (see Data Types) |
| byte_order | text | Yes | "big" or "little" |
| encoding | text | Yes | "none" or "simple_packing" |
| filter | text | Yes | "none" or "shuffle" |
| compression | text | Yes | "none", "szip", "zstd", "lz4", "blosc2", "zfp", or "sz3" |
| hash | map | No | Integrity hash of the payload (see below) |
| masks | map | No | NaN / Inf bitmask companion descriptors (see below) |
| encoding params | various | Conditional | Required when encoding != "none" |
| filter params | various | Conditional | Required when filter != "none" |
| compression params | various | Conditional | Required when compression != "none" |
| any other key | any | No | Per-object encoding parameters |
Example: Temperature Field Descriptor
Here is what a descriptor might look like for a global temperature field at 0.25-degree resolution, compressed with zstd:
{
"type": "ntensor",
"ndim": 2,
"shape": [721, 1440],
"strides": [1440, 1],
"dtype": "float32",
"byte_order": "little",
"encoding": "simple_packing",
"reference_value": 193.72,
"binary_scale_factor": -16,
"decimal_scale_factor": 0,
"bits_per_value": 16,
"filter": "none",
"compression": "zstd",
"zstd_level": 3,
"hash": {
"type": "xxh3",
"value": "a1b2c3d4e5f60718"
}
}
The params field in DataObjectDescriptor is for encoding parameters only (e.g. reference_value, bits_per_value). MARS keys and other application metadata are stored in the global metadata base[i]["mars"].
Encoding Parameters (simple_packing)
| Key | Type | Description |
|---|---|---|
reference_value | float | Minimum value in the original data |
binary_scale_factor | int | Power-of-2 scaling factor |
decimal_scale_factor | int | Power-of-10 scaling factor |
bits_per_value | uint | Number of bits per packed value (1-64) |
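These four parameters combine as in the GRIB simple-packing convention they mirror: value = (reference_value + packed × 2^binary_scale_factor) / 10^decimal_scale_factor. A sketch of the per-value reconstruction:

```rust
/// Reconstruct one value from its packed integer, following the GRIB
/// simple-packing convention:
///   value = (reference_value + packed * 2^binary_scale_factor)
///           / 10^decimal_scale_factor
fn unpack(
    packed: u64,
    reference_value: f64,
    binary_scale_factor: i32,
    decimal_scale_factor: i32,
) -> f64 {
    (reference_value + packed as f64 * 2f64.powi(binary_scale_factor))
        / 10f64.powi(decimal_scale_factor)
}

fn main() {
    // packed == 0 always reconstructs the reference value (the minimum).
    assert!((unpack(0, 193.72, -16, 0) - 193.72).abs() < 1e-9);
    // With binary_scale_factor = -16, each packed step adds 2^-16.
    assert!((unpack(1, 0.0, -16, 0) - 1.0 / 65536.0).abs() < 1e-12);
}
```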
Filter Parameters (shuffle)
| Key | Type | Description |
|---|---|---|
shuffle_element_size | uint | Byte width of each element (e.g., 4 for float32) |
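The shuffle filter itself is a simple byte transposition: byte 0 of every element first, then byte 1, and so on, which tends to group similar bytes and improve compressibility. A sketch for illustration (the library's implementation may differ):

```rust
/// Byte shuffle: transpose an element-major byte layout into a
/// byte-plane-major layout. `element_size` is shuffle_element_size
/// (e.g. 4 for float32). Decoding applies the inverse transposition.
fn shuffle(data: &[u8], element_size: usize) -> Vec<u8> {
    let n = data.len() / element_size;
    let mut out = vec![0u8; data.len()];
    for e in 0..n {
        for b in 0..element_size {
            out[b * n + e] = data[e * element_size + b];
        }
    }
    out
}

fn main() {
    // Three 2-byte elements: [1,2], [3,4], [5,6]
    // -> all first bytes, then all second bytes.
    assert_eq!(shuffle(&[1, 2, 3, 4, 5, 6], 2), vec![1, 3, 5, 2, 4, 6]);
}
```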
Compression Parameters
szip:
| Key | Type | Description |
|---|---|---|
szip_rsi | uint | Reference sample interval |
szip_block_size | uint | Block size (typically 8 or 16) |
szip_flags | uint | AEC encoding flags |
szip_block_offsets | array of uint | Bit offsets of RSI block boundaries (computed by the library or provided via encode_pre_encoded, see Pre-encoded Payloads) |
zstd:
| Key | Type | Default | Description |
|---|---|---|---|
zstd_level | int | 3 | Compression level (1-22) |
lz4: No additional parameters required.
blosc2:
| Key | Type | Default | Description |
|---|---|---|---|
blosc2_codec | text | "lz4" | Internal codec: blosclz, lz4, lz4hc, zlib, zstd |
blosc2_clevel | int | 5 | Compression level (0-9) |
blosc2_typesize | uint | (auto) | Element byte width for shuffle optimization |
zfp:
| Key | Type | Description |
|---|---|---|
zfp_mode | text | "fixed_rate", "fixed_precision", or "fixed_accuracy" |
zfp_rate | float | Bits per value (only for fixed_rate) |
zfp_precision | uint | Bit planes to keep (only for fixed_precision) |
zfp_tolerance | float | Max absolute error (only for fixed_accuracy) |
sz3:
| Key | Type | Description |
|---|---|---|
sz3_error_bound_mode | text | "abs", "rel", or "psnr" |
sz3_error_bound | float | Error bound value |
Hash Descriptor
The optional hash field records an integrity digest of the raw payload bytes.
| Key | Type | Description |
|---|---|---|
type | text | "xxh3" |
value | text | Hex-encoded digest |
NaN / Inf mask companion (masks)
When the object was encoded with allow_nan=true and/or
allow_inf=true AND the payload actually contained at least one
matching non-finite value, the descriptor carries a masks
sub-map. Each kind (nan, inf+, inf-) is independently
optional — only the kinds that appeared are present.
{
... standard DataObjectDescriptor fields ...,
"masks": {
"nan": {
"method": "roaring",
"offset": 40,
"length": 12
},
"inf+": {
"method": "rle",
"offset": 52,
"length": 3
}
}
}
Each entry:
| Key | Type | Description |
|---|---|---|
method | text | "none", "rle", "roaring", "blosc2", "zstd", or "lz4" — compression method actually used (may differ from the requested method due to the small-mask auto-fallback) |
offset | uint | Byte offset of the mask blob from the start of the payload region (= first byte after the 16-byte frame header) |
length | uint | Byte length of the mask blob on disk |
params | map | Optional method-specific parameters (e.g. {"level": 3} for zstd, {"codec": "lz4", "level": 5} for blosc2) |
Canonical key order for masks is the byte-lex sort inf+ < inf- < nan.
The encoder writes mask blobs between the payload and the CBOR
descriptor in the same canonical order. See
NaN / Inf Handling for the encode
/ decode semantics.
IndexFrame
Index frames (header or footer) contain a CBOR map that lets readers jump directly to any data object without scanning.
| Key | Type | Description |
|---|---|---|
offsets | array of uint | Byte offset of each data object frame from message start |
lengths | array of uint | Byte length of each data object frame |
Object count is derived from offsets.len(); lengths.len() must
equal offsets.len() or the decoder emits a MetadataError.
Example IndexFrame
{
"offsets": [256, 1048832, 2097408],
"lengths": [1048576, 1048576, 524288]
}
The offsets array gives O(1) random access to any object — seek
to offsets[i] and read lengths[i] bytes.
HashFrame
Hash frames (header or footer) mirror the per-object inline hash slots of each data-object frame’s footer (see wire-format.md §2.4), so readers can inspect the aggregate without walking every frame.
| Key | Type | Description |
|---|---|---|
algorithm | text | Hash algorithm name. "xxh3" is the only value a v3 encoder emits. |
hashes | array of text | Hex-encoded digest for each object, in emission order. |
Object count is derived from hashes.len(). An unknown
algorithm value triggers an UnknownHashAlgorithm warning at
validate time; the inline slots remain the authoritative check.
Example HashFrame
{
"algorithm": "xxh3",
"hashes": [
"a1b2c3d4e5f60718",
"b2c3d4e5f6071829",
"c3d4e5f60718293a"
]
}
Canonical Encoding
All CBOR maps are encoded with keys sorted by the byte representation of their CBOR-encoded key (RFC 8949 section 4.2). This sorting is applied recursively — nested maps are also sorted.
For short string keys (the common case), this is equivalent to sorting by the key string itself. For long keys or non-string keys, the CBOR byte encoding determines the order.
Why does this matter? If you hash an entire message or compare messages by digest, deterministic encoding ensures that logically identical messages produce identical bytes even if the keys were inserted in different order during construction.
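In Rust, a BTreeMap gives this determinism for free: iteration order depends only on the keys, never on insertion order. For the short ASCII keys used throughout these examples, that sorted order coincides with the canonical order described above:

```rust
use std::collections::BTreeMap;

/// The order a map's keys would be emitted in: BTreeMap iterates keys in
/// sorted order regardless of how they were inserted.
fn emission_order(pairs: &[(&str, &str)]) -> Vec<String> {
    let map: BTreeMap<_, _> = pairs.iter().cloned().collect();
    map.keys().map(|k| k.to_string()).collect()
}

fn main() {
    // Two logically identical maps built in different insertion orders
    // produce the same key sequence, hence the same encoded bytes.
    let a = emission_order(&[("stream", "oper"), ("class", "od")]);
    let b = emission_order(&[("class", "od"), ("stream", "oper")]);
    assert_eq!(a, b);
    assert_eq!(a, ["class", "stream"]);
}
```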
Metadata Value Types
All Tensogram metadata — whether in GlobalMetadata, the base / _reserved_ / _extra_ sections, or per-object params — is stored as CBOR. This page describes which value types are valid, which are forbidden, and why.
Allowed Types
Use only the subset of CBOR types that have direct JSON equivalents:
| CBOR type | Rust / Python equivalent | Example |
|---|---|---|
| text string | String / str | "imaging", "2026-01-12" |
| integer | i64 / int | 850, -1, 0 |
| float | f64 / float | 3.14, -273.15 |
| boolean | bool / bool | true, false |
| null | None / None | (absence of a value) |
| array | Vec<Value> / list | [1440, 721], ["t2", "flair"] |
| map | BTreeMap<String, Value> / dict | {"device": "mri", "sequence": "t2_flair"} |
Map keys must be text strings. Nested arrays and maps are allowed and encoded recursively.
Forbidden Types
The following CBOR types are not allowed in Tensogram metadata:
| Type | Reason |
|---|---|
| byte strings | Opaque blobs break cross-language interoperability; use base64 text instead |
| CBOR tags | Tags (#6.<n>) are not parsed by most CBOR libraries and can change value semantics |
| undefined | Only valid in streaming CBOR; never appears in map values |
| half-precision floats (f16) | Not supported by many JSON bridges; use f64 |
| non-string map keys | Integer or binary keys are non-canonical and not searchable |
The base Section
The base section of GlobalMetadata is a CBOR array of maps — one entry
per data object. Each entry holds ALL structured metadata for that object
independently. The encoder auto-populates _reserved_.tensor (with ndim, shape,
strides, dtype) in each entry when you call encode() or
StreamingEncoder::finish(). Any other keys the application placed in a base
entry before encoding (e.g. a per-object vocabulary namespace) are preserved.
The example below uses the MARS vocabulary; any application namespace works the
same way:
{
"version": 2,
"base": [
{
"mars": { "class": "od", "type": "fc", "grid": "O1280", "param": "2t", "levtype": "sfc" },
"_reserved_": {
"tensor": { "ndim": 2, "shape": [721, 1440], "strides": [1440, 1], "dtype": "float64" }
}
},
{
"mars": { "class": "od", "type": "fc", "grid": "O1280", "param": "lnsp", "levtype": "ml" },
"_reserved_": {
"tensor": { "ndim": 1, "shape": [137], "strides": [1], "dtype": "float64" }
}
}
]
}
Each entry is fully self-contained — all keys for that object appear in its
entry. There is no separate “common” section for shared keys. If you need to
extract commonalities (e.g. for display), use the compute_common() utility in
software after decoding.
Note:
basedescribes the collection of objects at the message level. Individual tensor encoding details (encoding pipeline, hash) remain in each object’s ownDataObjectDescriptor. TheDataObjectDescriptor.paramsfield is reserved for encoding parameters only — it does not carry application metadata.
Practical Guidance
- Prefer integers for numeric identifiers (paramId, date, run_id).
- Use text strings for classification codes even if they happen to be numeric-looking — consistency with your chosen vocabulary is more important than type optimisation.
- Use nested maps for namespaced keys (e.g., "mars": {...}, "bids": {...}, "dicom": {...}).
- Keep individual values small. Avoid storing large arrays (e.g., grid coordinates) in metadata — they belong in data objects.
See Also
- CBOR Metadata Schema — field-level reference for all CBOR structures
- Metadata Concepts — how global and per-object metadata relate
- Vocabularies — example application-layer vocabularies used with Tensogram
- GRIB MARS Key Mapping — how GRIB keys are mapped during import
Data Types
The dtype field in an object descriptor names the element type of the tensor. It is stored as a lowercase text string in CBOR.
Type Table
| CBOR string | Rust variant | Bytes per element | Notes |
|---|---|---|---|
float16 | Dtype::Float16 | 2 | IEEE 754 half-precision |
bfloat16 | Dtype::Bfloat16 | 2 | Brain float — same exponent range as float32, less mantissa precision |
float32 | Dtype::Float32 | 4 | IEEE 754 single-precision |
float64 | Dtype::Float64 | 8 | IEEE 754 double-precision |
complex64 | Dtype::Complex64 | 8 | Pair of float32 (real, imaginary) |
complex128 | Dtype::Complex128 | 16 | Pair of float64 (real, imaginary) |
int8 | Dtype::Int8 | 1 | Signed |
int16 | Dtype::Int16 | 2 | Signed |
int32 | Dtype::Int32 | 4 | Signed |
int64 | Dtype::Int64 | 8 | Signed |
uint8 | Dtype::Uint8 | 1 | Unsigned |
uint16 | Dtype::Uint16 | 2 | Unsigned |
uint32 | Dtype::Uint32 | 4 | Unsigned |
uint64 | Dtype::Uint64 | 8 | Unsigned |
bitmask | Dtype::Bitmask | 0* | Packed bits |
*bitmask returns 0 from byte_width() — see the edge case note below.
Byte Order
The byte_order field in the payload descriptor specifies whether multi-byte elements are stored in big-endian ("big") or little-endian ("little") order. This applies to the stored payload bytes after encoding.
Single-byte types (int8, uint8, bitmask) are unaffected by byte order.
Bitmask Edge Case
Dtype::Bitmask is for packing boolean or categorical data sub-byte. The payload size is ceil(num_elements / 8) bytes. The byte_width() method returns 0 as a sentinel; callers that need the actual payload size must compute it:
#![allow(unused)]
fn main() {
let payload_bytes = if dtype == Dtype::Bitmask {
(num_elements + 7) / 8
} else {
num_elements * dtype.byte_width()
};
}
Choosing a dtype
| Situation | Recommended dtype |
|---|---|
| Temperature, wind speed, pressure (weather) | float32 |
| High-precision scientific analysis | float64 |
| ML model weights | bfloat16 or float16 |
| Integer indices, counts | int32 or int64 |
| Land-sea masks, validity flags | uint8 or bitmask |
| Complex wave spectra | complex64 |
Quick Start
This page walks you through encoding and decoding a real tensor — a 2D temperature field — in a few dozen lines of Rust.
Installation
Rust
cargo add tensogram
Or add it to your Cargo.toml manually:
[dependencies]
tensogram = "0.15"
Optional features:
| Feature | What it adds |
|---|---|
mmap | Zero-copy memory-mapped file reads |
async | Async I/O via tokio |
remote | Read from S3, GCS, Azure Blob, or HTTP |
szip-pure | Pure-Rust szip (no C dependency) |
zstd-pure | Pure-Rust zstd (no C dependency) |
All compression codecs (szip, zstd, lz4, blosc2, zfp, sz3) and multi-threading are enabled by default.
cargo add tensogram --features mmap,async,remote
Python
pip install tensogram
With xarray and Zarr backends:
pip install tensogram[all] # everything
pip install tensogram[xarray] # xarray backend only
pip install tensogram[zarr] # Zarr v3 store only
CLI
cargo install tensogram-cli
Encode a 2-D Float Field
This example encodes a 100×200 float32 grid — representative of many
scientific 2-D fields (temperature, pressure, intensity, density, …).
use std::collections::BTreeMap;
use tensogram::{
encode, decode, GlobalMetadata, DataObjectDescriptor,
ByteOrder, Dtype, EncodeOptions, DecodeOptions,
};
fn main() {
// 1. Make some synthetic data: 100×200 float32 grid
// In production, this would come from your model output, sensor,
// or upstream pipeline.
let shape = vec![100u64, 200];
let strides = vec![200u64, 1]; // C-contiguous (row-major)
let num_elements = 100 * 200;
let data: Vec<u8> = (0..num_elements)
.flat_map(|i| (273.15f32 + (i as f32 / 100.0)).to_be_bytes())
.collect();
// 2. Describe the tensor
let global = GlobalMetadata {
version: 2,
..Default::default()
};
let desc = DataObjectDescriptor {
obj_type: "ntensor".to_string(),
ndim: 2,
shape,
strides,
dtype: Dtype::Float32,
byte_order: ByteOrder::Big,
encoding: "none".to_string(),
filter: "none".to_string(),
compression: "none".to_string(),
params: BTreeMap::new(),
hash: None, // hash is added automatically by EncodeOptions::default()
};
// 3. Encode — produces a self-contained message
let message = encode(&global, &[(&desc, &data)], &EncodeOptions::default()).unwrap();
println!("Encoded {} bytes", message.len());
// 4. Decode it back
let (meta, objects) = decode(&message, &DecodeOptions::default()).unwrap();
println!(
"Decoded: {} objects, shape {:?}, dtype {}",
objects.len(),
objects[0].0.shape,
objects[0].0.dtype,
);
assert_eq!(objects[0].1, data);
}
Add Application Metadata
Real messages need application-layer metadata so downstream tools know what
the data represents. Per-object metadata goes into the base array — one
entry per data object — and is organised under a namespace key so that
multiple vocabularies can coexist.
The example below uses ECMWF’s MARS vocabulary for concreteness. The same
mechanism works with any vocabulary: CF conventions ("cf"), BIDS
("bids"), DICOM ("dicom"), or your own ("product", "experiment",
"device", …).
#![allow(unused)]
fn main() {
use ciborium::Value;
// Build a "mars" namespace for the object — one concrete vocabulary example.
// You can just as easily use "bids", "dicom", "product", or any custom name.
let mars_map = vec![
(Value::Text("class".into()), Value::Text("od".into())),
(Value::Text("date".into()), Value::Text("20260401".into())),
(Value::Text("step".into()), Value::Integer(6.into())),
(Value::Text("type".into()), Value::Text("fc".into())),
(Value::Text("param".into()), Value::Text("2t".into())),
];
let mut entry = BTreeMap::new();
entry.insert("mars".to_string(), Value::Map(mars_map));
let global = GlobalMetadata {
version: 2,
base: vec![entry], // one entry per data object
..Default::default()
};
let desc = DataObjectDescriptor {
obj_type: "ntensor".to_string(),
ndim: 2,
shape: vec![100, 200],
strides: vec![200, 1],
dtype: Dtype::Float32,
byte_order: ByteOrder::Big,
encoding: "none".to_string(),
filter: "none".to_string(),
compression: "none".to_string(),
params: BTreeMap::new(),
hash: None,
};
}
What’s Next?
- Use simple_packing to reduce payload size by 4-8x
- Use the File API to append many messages to a
.tgmfile - Use the CLI to inspect files without writing any code
Vocabularies
Tensogram is vocabulary-agnostic: the library never interprets metadata keys. The same message can carry any combination of application-defined namespaces alongside the auto-populated library-reserved keys. This page collects example vocabularies that have been (or could naturally be) used with Tensogram, so you can pick a convention that matches your domain — or invent your own.
How metadata is structured
A Tensogram message’s per-object metadata lives in base[i], a
BTreeMap<String, ciborium::Value>. By convention, each application
vocabulary sits under its own top-level namespace key so that multiple
vocabularies can coexist without collision:
{
"version": 2,
"base": [{
"mars": { "class": "od", "param": "2t" },
"cf": { "standard_name": "air_temperature", "units": "K" },
"custom": { "experiment": "run-042" }
}]
}
All three namespaces above are valid, visible to tooling, and survive round-trip. The library never reads or validates their contents.
Example vocabularies
MARS (ECMWF, weather forecasting)
Used internally at ECMWF and by downstream consumers of ECMWF’s MARS archive. Keys describe the operational provenance of a forecast field: class, stream, type, parameter, level, date/time, step, etc.
{
"mars": {
"class": "od", "stream": "oper", "type": "fc",
"date": "20260401", "time": "1200", "step": 6,
"param": "2t", "levtype": "sfc"
}
}
The GRIB importer (tensogram convert-grib) automatically populates this
namespace from GRIB MARS keys. See
MARS Key Mapping for the full key list.
CF Conventions (climate, ocean, atmospheric)
CF Conventions are the standard attribute
vocabulary for climate and forecast data in NetCDF. The NetCDF importer
(tensogram convert-netcdf --cf) lifts the CF allow-list into a "cf"
sub-map. See NetCDF CF Metadata Mapping.
{
"cf": {
"standard_name": "air_temperature",
"long_name": "2 metre temperature",
"units": "K",
"cell_methods": "time: mean"
}
}
BIDS (neuroimaging)
The Brain Imaging Data Structure organises neuroimaging datasets with entity-level metadata. A natural fit for Tensogram messages carrying fMRI, dMRI, or EEG tensors.
{
"bids": {
"subject": "sub-01", "session": "ses-01",
"task": "rest", "run": 1, "acq": "hires"
}
}
DICOM (medical imaging)
DICOM tags are the standard descriptors
for medical imaging studies. They can be mapped into a "dicom" namespace
for use with Tensogram messages carrying imaging volumes, time-series, or
segmentation masks.
{
"dicom": {
"Modality": "MR", "SeriesDescription": "T2_FLAIR",
"SliceThickness": 1.0, "RepetitionTime": 8000
}
}
Zarr attributes (generic)
Zarr v3 attribute maps are generic key-value stores. When using the Zarr
backend (tensogram-zarr), group-level and array-level attributes are
surfaced through _extra_ and per-array descriptor params.
Custom namespaces
For any domain that does not have an established vocabulary, or when a pipeline wants to carry bespoke fields alongside a standard namespace, invent your own:
{
"experiment": {
"id": "run-042",
"operator": "alice",
"hypothesis": "beam stability",
"started_at": "2026-04-18T10:30:00Z"
}
}
Suggested conventions for custom namespaces:
- Use a short, lowercase namespace key ("product", "instrument", "run", "experiment", "device").
- Group related fields under a single namespace rather than scattering them at the top level of base[i].
- Prefer ISO 8601 timestamps, SI units in units fields, and UTF-8 text for identifiers.
- Document your namespace schema somewhere versioned (a README, a JSON schema, a wiki page) so downstream consumers can interpret it consistently.
Multiple vocabularies in one message
You can freely mix vocabularies in the same base[i] entry — the library
preserves all of them:
{
"base": [{
"mars": { "param": "2t", "levtype": "sfc" },
"cf": { "standard_name": "air_temperature", "units": "K" },
"provenance": { "pipeline_id": "pp-17", "stage": "post-process" }
}]
}
This lets one team’s producers emit messages that are simultaneously interpretable by MARS-expecting tools, CF-aware tooling, and an internal provenance tracker.
Looking up keys
The dotted-path helpers exposed by each binding vary. The CLI, the C FFI
(tgm_metadata_get_string / _get_int / _get_float), the C++ wrapper
(metadata::get_string / get_int / get_float), and the TypeScript
package (getMetaKey) all accept a full dotted path. The Rust crate and
the Python package do not expose a dotted-path helper at this time; use
direct nested access instead.
TypeScript — dotted path
import { getMetaKey } from '@ecmwf/tensogram';
const param = getMetaKey(meta, 'mars.param');
const subject = getMetaKey(meta, 'bids.subject');
CLI — dotted path
# Filter messages on a namespaced key
tensogram ls data.tgm -w "mars.param=2t/10u"
tensogram ls data.tgm -w "bids.subject=sub-01"
# Print specific keys
tensogram get -p "cf.standard_name,cf.units" data.tgm
Python — dict-style nested access
# Metadata.__getitem__ does a top-level search across base[i] (skipping
# _reserved_) and falls back to the message-level _extra_ map. The returned
# value is a plain Python dict, so the next lookup is standard dict access.
param = meta["mars"]["param"]
subject = meta["bids"]["subject"]
# meta.base[i], meta.reserved, and meta.extra are also available directly
# if you want the raw per-object / reserved / extra dicts.
first_base = meta.base[0]
Rust — pattern-match on ciborium::Value
#![allow(unused)]
fn main() {
use ciborium::Value;
use tensogram::GlobalMetadata;
// `meta.base` is `Vec<BTreeMap<String, Value>>`. Find the namespace on
// the first-matching base entry, then pull a text field from the nested
// map. Falls back to `meta.extra` for message-level annotations.
fn get_text<'a>(meta: &'a GlobalMetadata,
namespace: &str, field: &str) -> Option<&'a str> {
let pull = |map: &'a [(Value, Value)]| -> Option<&'a str> {
map.iter().find_map(|(k, v)| match (k, v) {
(Value::Text(k), Value::Text(v)) if k == field => Some(v.as_str()),
_ => None,
})
};
for entry in &meta.base {
if let Some(Value::Map(items)) = entry.get(namespace)
&& let Some(val) = pull(items)
{
return Some(val);
}
}
if let Some(Value::Map(items)) = meta.extra.get(namespace) {
return pull(items);
}
None
}
let param = get_text(&meta, "mars", "param");
}
Tensogram keeps the Rust surface small on purpose. If your pipeline needs dotted-path lookup in Rust, wrap the snippet above in a helper of your own, or call out to the CLI.
Lookup semantics (all bindings that support dotted paths)
First match across base[0], base[1], … (skipping _reserved_ within
each entry), then fall back to the message-level _extra_ map. An
explicit _extra_.key (or extra.key) prefix bypasses the base search.
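The Rust and Python bindings do not ship a dotted-path helper, but the semantics above are easy to replicate over the plain dicts the Python binding returns. The sketch below is illustrative; get_meta_key is a hypothetical name, not part of the package:

```python
def get_meta_key(meta: dict, path: str):
    """Hypothetical helper replicating the documented lookup order:
    first match across base[0], base[1], ... (skipping _reserved_),
    then the message-level _extra_ map; an explicit '_extra_.' prefix
    bypasses the base search."""
    head, _, rest = path.partition(".")

    def dig(node, dotted):
        # Walk the remaining dotted segments through nested dicts.
        for seg in dotted.split(".") if dotted else []:
            if not isinstance(node, dict) or seg not in node:
                return None
            node = node[seg]
        return node

    if head in ("_extra_", "extra"):
        return dig(meta.get("_extra_", {}), rest)
    for entry in meta.get("base", []):
        if head != "_reserved_" and head in entry:
            return dig(entry[head], rest)
    return dig(meta.get("_extra_", {}).get(head), rest)

meta = {
    "base": [{"mars": {"param": "2t"}, "_reserved_": {"uuid": "x"}}],
    "_extra_": {"note": "demo"},
}
assert get_meta_key(meta, "mars.param") == "2t"
assert get_meta_key(meta, "_extra_.note") == "demo"
```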
See also
- Metadata concepts — how the base, _reserved_, and _extra_ sections fit together.
- CBOR Metadata Schema — field-level reference.
- Metadata Value Types — which CBOR types are allowed inside metadata.
- GRIB MARS Key Mapping — what the GRIB importer produces.
- NetCDF CF Metadata Mapping — what the NetCDF importer produces with --cf.
Jupyter Notebook Walk-through
The examples/jupyter/
directory carries a curated set of narrative notebooks that introduce
Tensogram interactively, with live visualisations. Unlike the flat
.py examples under examples/python/
— which are minimal reference snippets for copy-paste — the notebooks
are for learning.
Every notebook is executed end-to-end on every PR by the
notebooks CI job, so they cannot rot.
The five journeys
| # | Notebook | What you will learn |
|---|---|---|
| 1 | 01_quickstart_and_mars.ipynb | Encode & decode a 2D tensor, visualise it, attach MARS metadata, walk the base / _reserved_ / _extra_ layout. |
| 2 | 02_encoding_and_fidelity.ipynb | Sweep every encoding × filter × compression combination and plot ratio vs time vs fidelity. |
| 3 | 03_from_grib_to_tensogram.ipynb | Convert a real ECMWF opendata GRIB2 file with the new Python API (tensogram.convert_grib + tensogram.convert_grib_buffer). |
| 4 | 04_from_netcdf_to_tensogram.ipynb | Build a small CF-compliant NetCDF in-process, convert it with tensogram.convert_netcdf, and open the result as an xarray Dataset via engine="tensogram". |
| 5 | 05_validation_and_parallelism.ipynb | Run the four validation levels, inject corruption, sweep threads=0…N and plot the speedup. |
Running the notebooks locally
Option 1 — uv pip install (recommended)
# Build the Python bindings with GRIB and NetCDF support.
# Requires libeccodes + libnetcdf installed at the OS level.
uv venv .venv --python 3.13
source .venv/bin/activate
uv pip install maturin
cd python/bindings
maturin develop --features grib,netcdf
cd ../..
# Install notebook-only dependencies + the xarray backend.
uv pip install -e examples/jupyter
# Launch JupyterLab.
jupyter lab examples/jupyter/
Option 2 — conda env create
conda env create -f examples/jupyter/environment.yml
conda activate tensogram-jupyter
jupyter lab examples/jupyter/
Option 3 — Binder / Colab
Launch badges in the notebook directory’s
README.md
— zero local install.
OS-level dependencies
Notebooks 03 (GRIB) and 04 (NetCDF) need C libraries installed at the operating system level. They are not Python packages.
| Library | Needed by | macOS (Homebrew) | Debian / Ubuntu |
|---|---|---|---|
| libeccodes | notebook 03 | brew install eccodes | apt install libeccodes-dev |
| libnetcdf + libhdf5 | notebook 04 | brew install netcdf hdf5 | apt install libnetcdf-dev libhdf5-dev |
The official PyPI wheels (pip install tensogram) do not ship
GRIB / NetCDF support: the manylinux_2_28 base image lacks the C
libraries. If you try to call tensogram.convert_grib(...) on a
wheel without the feature, you get a clean
RuntimeError("tensogram was built without GRIB support...") that
points you at this page.
To enable the feature, rebuild from source:
git clone https://github.com/ecmwf/tensogram
cd tensogram/python/bindings
maturin develop --features grib,netcdf
Running the notebooks in CI
The repository runs the notebooks end-to-end on every PR via a
dedicated notebooks job. The gate is:
pytest --nbval-lax examples/jupyter/ -v
--nbval-lax executes every cell in every notebook and fails the
build on any exception. Cell outputs are not compared — we commit
the notebooks with empty outputs (enforced by the
python/tests/test_jupyter_structure.py guard).
Output hygiene
Committed notebooks must have empty cell outputs. Install the
nbstripout pre-commit hook once:
uv pip install nbstripout
nbstripout --install
With the hook installed, git commit automatically strips outputs.
Adding a new notebook
- Copy an existing .ipynb as a template.
- First cell must be a markdown license banner mentioning “ECMWF” or “Apache”.
- Last cell must be a “Where to go next” markdown pointer.
- If you import matplotlib, call matplotlib.use("Agg") before the first import matplotlib.pyplot.
- Update EXPECTED_NOTEBOOKS in python/tests/test_jupyter_structure.py.
- Link it from examples/jupyter/README.md and this guide page.
- Run pytest --nbval-lax examples/jupyter/ locally before committing.
Encoding Data
This page covers the encode() function and EncodeOptions in detail.
Function Signature
#![allow(unused)]
fn main() {
pub fn encode(
global_metadata: &GlobalMetadata,
descriptors: &[(&DataObjectDescriptor, &[u8])],
options: &EncodeOptions,
) -> Result<Vec<u8>>
}
- global_metadata — reference to message-level metadata (version, base entries, _extra_ fields)
- descriptors — a slice of (descriptor, data) pairs, one per object
- options — controls hash algorithm and compression backend selection (the emit_preceders field is reserved for future buffered-mode support; preceders are currently only emitted via StreamingEncoder::write_preceder)
Returns a complete, self-contained message as a Vec<u8>.
EncodeOptions
#![allow(unused)]
fn main() {
pub struct EncodeOptions {
/// Hash algorithm to use. None disables hashing entirely.
pub hash_algorithm: Option<HashAlgorithm>,
/// Reserved — buffered `encode()` rejects `true`. Use
/// `StreamingEncoder::write_preceder()` instead.
pub emit_preceders: bool,
/// Which backend to use for szip / zstd when both FFI and pure-Rust
/// implementations are compiled in.
pub compression_backend: CompressionBackend,
}
impl Default for EncodeOptions {
fn default() -> Self {
Self {
hash_algorithm: Some(HashAlgorithm::Xxh3),
emit_preceders: false,
compression_backend: CompressionBackend::default(),
}
}
}
}
The default applies xxh3 hashing to every object payload. Use None to skip hashing:
#![allow(unused)]
fn main() {
let options = EncodeOptions {
hash_algorithm: None,
..Default::default()
};
}
What Encode Does
For each object, in order:
- Validate — checks that each pair has a descriptor and corresponding data
- Run the encoding pipeline — applies encoding, filter, compression from the object’s DataObjectDescriptor
- Hash — if hash_algorithm is set, computes and stores the hash in the descriptor
- Serialize CBOR — encodes the GlobalMetadata and all DataObjectDescriptors to canonical CBOR
- Frame — assembles preamble, header frames (metadata/index/hash), data object frames, and postamble
Encoding with Simple Packing
To use simple_packing, you need to compute the quantization parameters first, then put them in the DataObjectDescriptor:
#![allow(unused)]
fn main() {
use tensogram_encodings::simple_packing;
use tensogram::{ByteOrder, DataObjectDescriptor, Dtype};
use std::collections::BTreeMap;
use ciborium::Value;
// Your original values as f64 (simple_packing always works on f64).
// source_data might be a temperature grid, pressure field, intensity
// image, or any other bounded-range scalar field.
let values: Vec<f64> = source_data.iter().map(|&x| x as f64).collect();
// Compute quantization parameters for 16 bits per value
let params = simple_packing::compute_params(&values, 16, 0)?;
// Put the parameters into the descriptor
let mut packing_params = BTreeMap::new();
packing_params.insert("reference_value".into(),
Value::Float(params.reference_value));
packing_params.insert("binary_scale_factor".into(),
Value::Integer((params.binary_scale_factor as i64).into()));
packing_params.insert("decimal_scale_factor".into(),
Value::Integer((params.decimal_scale_factor as i64).into()));
packing_params.insert("bits_per_value".into(),
Value::Integer((params.bits_per_value as i64).into()));
let desc = DataObjectDescriptor {
obj_type: "ntensor".to_string(),
ndim: 2,
shape: vec![100, 200],
strides: vec![200, 1],
dtype: Dtype::Float64,
byte_order: ByteOrder::Big,
encoding: "simple_packing".to_string(),
filter: "none".to_string(),
compression: "none".to_string(),
params: packing_params,
hash: None,
};
}
Then encode as normal, passing the original raw bytes (as f64 bytes):
#![allow(unused)]
fn main() {
let raw: Vec<u8> = values.iter().flat_map(|v| v.to_ne_bytes()).collect();
let global = GlobalMetadata { version: 2, ..Default::default() };
let message = encode(&global, &[(&desc, &raw)], &EncodeOptions::default())?;
}
The encoder applies simple_packing internally. The payload stored in the message is the packed bits, not the original f64 bytes.
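To see what the four parameters mean numerically, here is a minimal pure-Python sketch of GRIB-style simple packing, assuming the relation Y = (R + X * 2^E) / 10^D that the parameter names suggest. This is an illustration only, not the crate’s implementation:

```python
def pack(values, ref, E, D):
    # X = round((Y * 10**D - R) / 2**E): the integers stored on the wire
    return [round((y * 10**D - ref) / 2**E) for y in values]

def unpack(packed, ref, E, D):
    # Y = (R + X * 2**E) / 10**D: reconstruction on decode
    return [(ref + x * 2**E) / 10**D for x in packed]

values = [273.15, 274.2, 280.0]    # e.g. temperatures in K
ref, E, D = min(values), -4, 0     # reference_value, binary/decimal scale factors
packed = pack(values, ref, E, D)
restored = unpack(packed, ref, E, D)
# With bits_per_value = 16, every packed integer must fit in 16 bits.
assert all(0 <= x < 2**16 for x in packed)
# Quantization error is bounded by half a quantum: 2**E / (2 * 10**D).
assert all(abs(a - b) <= 2**E / 2 for a, b in zip(values, restored))
```

The reference value pins the smallest input to packed integer 0; the binary scale factor sets the quantization step.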
Encoding Multiple Objects
Pass multiple (descriptor, data) pairs:
#![allow(unused)]
fn main() {
let global = GlobalMetadata { version: 2, ..Default::default() };
let message = encode(
&global,
&[(&spectrum_desc, &spectrum_data), (&mask_desc, &land_mask_data)],
&EncodeOptions::default(),
)?;
}
Each descriptor independently specifies its own encoding, compression, dtype, and byte order. The encoder processes each pair in sequence.
Error Conditions
| Error | Cause |
|---|---|
| Encoding | NaN in data when using simple_packing |
| Encoding | bits_per_value out of range (0–64) |
| Compression | Compressor-specific error (invalid params, unsupported dtype) |
| Metadata | CBOR serialization failed |
Pre-Encoded Data API (Advanced)
When to use this API
The encode_pre_encoded API is for advanced callers whose data is already
encoded by an external pipeline (e.g., a GPU kernel that emits packed bytes,
or a streaming receiver passing payloads through). It bypasses Tensogram’s
internal encoding pipeline and uses the supplied bytes verbatim.
Do NOT use this API for ordinary encoding. Use encode() instead.
⚠️ The bit-vs-byte trap
WARNING: When using compression="szip", the szip_block_offsets parameter contains bit offsets, not byte offsets. The first offset must be 0 and every offset must satisfy offset <= encoded_bytes_len * 8. This matches the libaec/szip wire format. See cbor-metadata.md for the format reference.

Getting this wrong is the #1 caller mistake. Tensogram validates the offsets structurally (monotonicity, bounds) but cannot detect a byte-instead-of-bit mistake until decode_range fails.
API surface
Rust
#![allow(unused)]
fn main() {
pub fn encode_pre_encoded(
metadata: &GlobalMetadata,
descriptors_and_data: &[(&DataObjectDescriptor, &[u8])],
options: &EncodeOptions,
) -> Result<Vec<u8>, TensogramError>
}
Python
import tensogram
msg: bytes = tensogram.encode_pre_encoded(
global_meta_dict={"version": 2},
descriptors_and_data=[(descriptor_dict, raw_bytes)],
hash="xxh3",
)
C
tgm_error tgm_encode_pre_encoded(
const char *metadata_json,
const uint8_t *const *data_ptrs,
const size_t *data_lens,
size_t num_objects,
const char *hash_algo,
tgm_bytes_t *out
);
C++
std::vector<std::uint8_t> tensogram::encode_pre_encoded(
const std::string& metadata_json,
const std::vector<std::pair<const std::uint8_t*, std::size_t>>& objects,
const encode_options& opts = {}
);
Hash semantics
The library always recomputes the hash of the pre-encoded bytes using
the algorithm specified in EncodeOptions.hash_algorithm (default xxh3). Any hash
the caller stored on the descriptor is silently overwritten. This guarantees
the wire format invariant descriptor.hash == hash_algo(bytes) always holds.
Provenance semantics
The encoded message is byte-format-indistinguishable from one produced by
encode(). The decoder cannot tell which API produced it. The provenance
fields _reserved_.encoder.name, _reserved_.time, and _reserved_.uuid
are populated identically.
Self-consistency checks
Before encoding, the library validates:
- Caller has not set EncodeOptions.emit_preceders (rejected).
- Caller has not put _reserved_ in their metadata (rejected).
- Each descriptor passes the standard validate_object checks.
- If compression="szip" and szip_block_offsets is supplied:
  - It’s a CBOR Array of u64.
  - First offset is 0.
  - Strictly monotonically increasing.
  - All bit offsets <= bytes_len * 8.
- If szip_block_offsets is supplied but compression != "szip", rejected.
These are structural checks only. The library does NOT trial-decode the bytes to verify they actually decode correctly.
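The offset checks amount to a few lines of arithmetic. The following pure-Python mirror of the documented rules is illustrative, not the crate’s code:

```python
def validate_szip_block_offsets(offsets, encoded_bytes_len):
    """Structural checks only: first offset 0, strictly increasing,
    every BIT offset within encoded_bytes_len * 8."""
    if not offsets or offsets[0] != 0:
        raise ValueError("first offset must be 0")
    if any(b <= a for a, b in zip(offsets, offsets[1:])):
        raise ValueError("offsets must be strictly increasing")
    if any(o > encoded_bytes_len * 8 for o in offsets):
        raise ValueError("offset exceeds bit bound")

validate_szip_block_offsets([0, 8192, 16384], 4096)  # ok: 16384 <= 32768 bits
# Note: offsets mistakenly given in BYTES for a large payload still pass
# these checks -- that is exactly the bit-vs-byte trap described above.
```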
Limitation: encoding=“none” size check
When encoding="none", the validate_object check enforces
payload_len == shape_product * dtype_byte_width. This means you cannot pass
compression-only payloads (e.g., zstd-compressed raw bytes) with
encoding="none" because the compressed size will not match the expected raw
size. Wrap such payloads in at least simple_packing or another encoding.
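The invariant is simple arithmetic. A quick sketch (expected_raw_len is a hypothetical helper, not library API):

```python
from math import prod

def expected_raw_len(shape, dtype_byte_width):
    # encoding="none" invariant: payload_len == shape_product * dtype_byte_width
    return prod(shape) * dtype_byte_width

# A 100 x 200 float64 tensor must arrive as exactly 160 000 raw bytes.
assert expected_raw_len([100, 200], 8) == 160_000
# A zstd-compressed payload is almost certainly a different length, so it
# fails this check -- hence the requirement to declare a real encoding.
```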
Worked example: simple_packing + szip with decode_range
#![allow(unused)]
fn main() {
use tensogram::{
encode_pre_encoded, DataObjectDescriptor, EncodeOptions,
GlobalMetadata, ByteOrder, Dtype,
};
use std::collections::BTreeMap;
use ciborium::Value;
// Pre-encoded bytes from a GPU kernel + szip block offsets in BITS
let pre_encoded_bytes: Vec<u8> = /* from GPU */;
let szip_offsets_bits: Vec<u64> = vec![0, 8192, 16384, /* ... */];
let mut params: BTreeMap<String, ciborium::Value> = BTreeMap::new();
params.insert("bits_per_value".into(), Value::Integer(24u64.into()));
params.insert("reference_value".into(), Value::Float(0.0));
params.insert("binary_scale_factor".into(), Value::Integer((-10i64).into()));
params.insert("decimal_scale_factor".into(), Value::Integer(0i64.into()));
params.insert("szip_rsi".into(), Value::Integer(128i64.into()));
params.insert("szip_block_size".into(), Value::Integer(16i64.into()));
params.insert("szip_flags".into(), Value::Integer(8i64.into()));
params.insert("szip_block_offsets".into(),
Value::Array(szip_offsets_bits.into_iter()
.map(|o| Value::Integer(o.into()))
.collect()));
let desc = DataObjectDescriptor {
obj_type: "ntensor".into(),
ndim: 2,
shape: vec![1024, 1024],
strides: vec![1024, 1],
dtype: Dtype::Float32,
byte_order: ByteOrder::Big,
encoding: "simple_packing".into(),
filter: "none".into(),
compression: "szip".into(),
params,
hash: None,
};
let msg = encode_pre_encoded(
&GlobalMetadata::default(),
&[(&desc, &pre_encoded_bytes)],
&EncodeOptions::default(),
)?;
// decode_range works because szip_block_offsets is present.
}
How it works
flowchart TD
subgraph pre["encode_pre_encoded path"]
A[Caller bytes] --> B[validate_object]
B --> C[validate_szip_block_offsets]
C --> D[Recompute hash]
end
subgraph normal["encode path"]
G[Caller bytes] --> H[Run encoding pipeline]
H --> D
end
D --> E[Wrap in CBOR framing]
E --> F[Wire message]
The pre-encoded path skips the pipeline entirely. The wire format is identical.
Byte order
When using encoding="none", the caller’s bytes are stored verbatim — the
library does NOT validate or flip byte order on encode. The bytes must be in
the byte order declared in the descriptor’s byte_order field.
For example, if byte_order="big" and encoding="none", the caller must
provide big-endian bytes.
On decode, the library automatically converts to native byte order by
default (native_byte_order=true). Callers can use from_ne_bytes() or
data_as<T>() directly without worrying about which byte order was used on
the wire. Set native_byte_order=false to get the raw wire-order bytes.
Streaming API
StreamingEncoder::write_object_pre_encoded() is the streaming counterpart of
encode_pre_encoded(). It writes a single pre-encoded object to the stream.
It can be interleaved freely with write_object() (normal encode) calls.
Rust
#![allow(unused)]
fn main() {
let mut enc = StreamingEncoder::new(output, &metadata, &options)?;
enc.write_object_pre_encoded(&descriptor, &pre_encoded_bytes)?;
enc.finish()?;
}
Python
enc = tensogram.StreamingEncoder({"version": 2})
enc.write_object_pre_encoded(descriptor_dict, raw_bytes)
msg = enc.finish()
C++
tensogram::streaming_encoder enc(path, metadata_json);
enc.write_object_pre_encoded(descriptor_json, data_ptr, data_len);
enc.finish();
Error reference
encode_pre_encoded can raise the following errors:
| Error condition | Message contains |
|---|---|
| obj_type is empty | "obj_type must not be empty" |
| ndim doesn’t match shape.len() | "ndim … does not match shape.len()" |
| strides.len() doesn’t match shape.len() | "strides.len() … does not match shape.len()" |
| encoding="none" and data size wrong | "data_len … does not match expected … bytes" |
| emit_preceders=true in buffered mode | "emit_preceders is not supported" |
| Caller set _reserved_ in metadata | "_reserved_" |
| szip_block_offsets not starting at 0 | "first offset must be 0" |
| szip_block_offsets not strictly increasing | "strictly increasing" |
| szip_block_offsets exceeds bit bound | "exceeds … bit bound" |
| szip_block_offsets with non-szip compression | "szip_block_offsets provided but compression" |
| Unknown encoding string | "encoding" |
| Unknown dtype | "unknown dtype" |
Strides convention
The library treats strides as opaque metadata — it only validates that
strides.len() == shape.len(). The convention differs between language bindings:
- Rust tests use element strides (e.g., [1] for 1D, [5, 1] for shape [4, 5])
- C++ tests use byte strides (e.g., [4] for float32, [12, 4] for shape [2, 3] float32)
Both conventions work correctly since the library does not interpret stride values.
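Either convention follows directly from the shape. A sketch with hypothetical helper names, matching the examples above:

```python
from math import prod

def element_strides(shape):
    # Row-major element strides: stride[i] = product of the trailing dims.
    return [prod(shape[i + 1:]) for i in range(len(shape))]

def byte_strides(shape, itemsize):
    # Byte strides scale element strides by the dtype width.
    return [s * itemsize for s in element_strides(shape)]

assert element_strides([4, 5]) == [5, 1]    # Rust-style, shape [4, 5]
assert byte_strides([2, 3], 4) == [12, 4]   # C++-style, float32 shape [2, 3]
```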
Cross-references
- Encoding — the normal encode() API
- Decoding — decode_range requirements for partial reads
- Compression — szip details
- CBOR metadata — wire format reference
Decoding Data
Tensogram provides four decode functions for different use cases. Choose the one that does the least work for your situation — they are all zero-copy on the metadata path.
The DecodedObject Type
Before diving in, it helps to know the common return type:
#![allow(unused)]
fn main() {
type DecodedObject = (DataObjectDescriptor, Vec<u8>);
}
A DecodedObject is a tuple of the object’s descriptor (shape, dtype, encoding info, etc.) and the decoded raw bytes. You will see this pattern throughout the decode API.
Four Decode Functions
decode — full message
#![allow(unused)]
fn main() {
pub fn decode(
message: &[u8],
options: &DecodeOptions,
) -> Result<(GlobalMetadata, Vec<(DataObjectDescriptor, Vec<u8>)>)>
}
Decodes all objects. Returns the global metadata and a vector of DecodedObject tuples — one per object, with raw bytes in the logical dtype after de-quantization.
#![allow(unused)]
fn main() {
let (meta, objects) = decode(&message, &DecodeOptions::default())?;
// Each element is (DataObjectDescriptor, Vec<u8>)
let (ref desc, ref data) = objects[0];
println!("shape: {:?}, dtype: {}, bytes: {}", desc.shape, desc.dtype, data.len());
}
decode_metadata — metadata only
#![allow(unused)]
fn main() {
pub fn decode_metadata(message: &[u8]) -> Result<GlobalMetadata>
}
Reads only the CBOR section. Does not touch any payload bytes. Use this for filtering and listing.
#![allow(unused)]
fn main() {
let meta = decode_metadata(&message)?;
println!("version: {}", meta.version);
}
decode_object — single object by index
#![allow(unused)]
fn main() {
pub fn decode_object(
message: &[u8],
index: usize,
options: &DecodeOptions,
) -> Result<(GlobalMetadata, DataObjectDescriptor, Vec<u8>)>
}
Decodes one object without reading the others. Uses the binary header’s offset table to seek directly to the right payload. O(1) seek regardless of how many objects the message contains.
Returns the global metadata, the object’s descriptor, and the decoded bytes as a three-element tuple.
#![allow(unused)]
fn main() {
// Decode only the second object (index 1)
let (meta, descriptor, payload) = decode_object(&message, 1, &DecodeOptions::default())?;
println!("shape: {:?}, dtype: {}", descriptor.shape, descriptor.dtype);
}
Edge case: If index >= num_objects, returns TensogramError::Object("index out of range").
decode_range — partial sub-tensor
#![allow(unused)]
fn main() {
pub fn decode_range(
message: &[u8],
object_index: usize,
ranges: &[(u64, u64)], // (offset, count) in flattened element order
options: &DecodeOptions,
) -> Result<(DataObjectDescriptor, Vec<Vec<u8>>)>
}
Decodes one or more contiguous slices of elements from an object. Each (offset, count) pair in ranges selects a span of elements along the flattened dimension; the function returns one byte vector per range by default. This split-result design avoids an unnecessary copy when the caller needs the ranges individually (e.g. to feed separate array slices).
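When computing flattened ranges by hand, a row-major 2-D sub-block maps to one (offset, count) span per selected row. A minimal sketch (block_to_ranges is a hypothetical helper, not library API):

```python
def block_to_ranges(shape, row_slice, col_slice):
    """Convert a row-major 2-D sub-block into the flattened
    (offset, count) pairs that decode_range expects: one contiguous
    span of elements per selected row."""
    nrows, ncols = shape
    r0, r1 = row_slice
    c0, c1 = col_slice
    return [(r * ncols + c0, c1 - c0) for r in range(r0, r1)]

# Rows 2..4, columns 10..15 of a 100 x 200 tensor: two 5-element spans.
assert block_to_ranges((100, 200), (2, 4), (10, 15)) == [(410, 5), (610, 5)]
```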
Rust — split results (default)
#![allow(unused)]
fn main() {
// Two separate ranges from object 0
let (desc, parts) = decode_range(
&message, 0,
&[(100, 50), (300, 25)],
&DecodeOptions::default(),
)?;
assert_eq!(parts.len(), 2); // one Vec<u8> per range
println!("first range bytes: {}", parts[0].len());
println!("second range bytes: {}", parts[1].len());
}
Rust — joined result
If you prefer a single contiguous buffer, flatten the results:
#![allow(unused)]
fn main() {
let joined: Vec<u8> = parts.into_iter().flatten().collect();
}
Python — split results (default, join=False)
import tensogram
parts = tensogram.decode_range(buf, object_index=0, ranges=[(100, 50), (300, 25)])
# parts is a list of numpy arrays, one per range
print(len(parts)) # 2
print(parts[0].shape) # (50,)
Python — joined result (join=True)
arr = tensogram.decode_range(buf, object_index=0, ranges=[(100, 50), (300, 25)], join=True)
# arr is a single flat numpy array with all ranges concatenated
print(arr.shape) # (75,)
N-dimensional slicing: The xarray backend maps N-dimensional slice notation (e.g. ds["temperature"].sel(lat=slice(10, 20), lon=slice(30, 40))) into the (offset, count) pairs that decode_range expects, so you rarely need to compute flattened offsets by hand when working through xarray.
Pre-encoded messages: Messages produced via encode_pre_encoded only support decode_range if the caller provided the necessary bit-precise szip_block_offsets (see Pre-encoded Payloads).
Edge case: decode_range works with all encoding+compression combinations that support random access: uncompressed data, simple_packing (bit extraction), szip (RSI block seeking), blosc2 (chunk access), and zfp fixed-rate mode. It returns an error for the shuffle filter (byte rearrangement breaks contiguous sample ranges) and for stream compressors (zstd, lz4, sz3) that don’t support partial decode.
DecodeOptions
#![allow(unused)]
fn main() {
pub struct DecodeOptions {
/// If true, verify the hash of each decoded payload.
pub verify_hash: bool,
/// When true (the default), decoded payloads are converted to the
/// caller's native byte order. Set to false to receive bytes in the
/// message's declared wire byte order.
pub native_byte_order: bool,
/// Which backend to use for szip / zstd when both FFI and pure-Rust
/// implementations are compiled in.
pub compression_backend: CompressionBackend,
}
impl Default for DecodeOptions {
fn default() -> Self {
Self {
verify_hash: false,
native_byte_order: true,
compression_backend: CompressionBackend::default(),
}
}
}
}
Native byte order (default)
By default, all decoded data is returned in the caller’s native byte order — the library handles any necessary byte-swapping automatically. You never need to check byte_order or call .byteswap():
#![allow(unused)]
fn main() {
let (_, objects) = decode(&message, &DecodeOptions::default())?;
let floats: Vec<f32> = objects[0].1
.chunks_exact(4)
.map(|c| f32::from_ne_bytes(c.try_into().unwrap()))
.collect();
}
In Python, numpy arrays are always directly usable:
_, objects = tensogram.decode(msg)
arr = objects[0][1] # numpy array — values are correct, no byteswap needed
This applies to all decode functions (decode, decode_object, decode_range), all encodings (none, simple_packing), all compression codecs, and all language bindings (Rust, Python, C, C++).
Wire byte order (opt-in)
Set native_byte_order: false to receive the raw bytes in the message’s declared wire byte order. This is useful for zero-copy forwarding or when you need the exact on-wire representation:
#![allow(unused)]
fn main() {
let opts = DecodeOptions { native_byte_order: false, ..Default::default() };
let (_, objects) = decode(&message, &opts)?;
// objects[0].1 is in the descriptor's declared byte_order (e.g. big-endian)
}
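The difference is easy to demonstrate with Python’s struct module: the same four wire bytes yield a sensible value only when read with the declared byte order. This standalone sketch is independent of the library:

```python
import struct

# Pack 273.15 as a big-endian float32, as it would sit on the wire
# for a descriptor declaring byte_order="big".
wire = struct.pack(">f", 273.15)

(be_value,) = struct.unpack(">f", wire)  # honour the declared byte order
(le_value,) = struct.unpack("<f", wire)  # a naive little-endian read

assert abs(be_value - 273.15) < 1e-3     # correct interpretation
assert abs(le_value - 273.15) > 1.0      # garbage without the swap
```

With the default native_byte_order=true the library performs this swap for you; the opt-out exists precisely so forwarding code can skip it.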
Hash verification
Hash verification is opt-in. Enable it when data integrity is critical:
#![allow(unused)]
fn main() {
let options = DecodeOptions { verify_hash: true, ..Default::default() };
let result = decode(&message, &options);
// Returns Err(TensogramError::HashMismatch { expected, actual }) if corrupted
}
Edge case: If the descriptor has no hash (i.e. the message was encoded with hash_algorithm: None), verify_hash: true silently skips verification for that object. No error is returned.
Working with the Decoded Bytes
Decoded bytes are in native byte order (with the default DecodeOptions). Cast them as native:
#![allow(unused)]
fn main() {
// float32 object → use from_ne_bytes
let floats: Vec<f32> = data
.chunks_exact(4)
.map(|c| f32::from_ne_bytes(c.try_into().unwrap()))
.collect();
}
For simple_packing decoded data, the output is always f64 bytes (8 bytes per element), regardless of the original dtype stored in the descriptor:
#![allow(unused)]
fn main() {
// simple_packing always decodes to f64, in native byte order
let values: Vec<f64> = data
.chunks_exact(8)
.map(|c| f64::from_ne_bytes(c.try_into().unwrap()))
.collect();
}
Scanning for Messages First
If you’re working with a buffer that might contain multiple messages (e.g. a .tgm file loaded into memory), scan it first to get message boundaries:
#![allow(unused)]
fn main() {
let offsets = scan(&big_buffer); // Vec<(usize, usize)> = (start, length)
for (start, len) in offsets {
let msg = &big_buffer[start..start + len];
let meta = decode_metadata(msg)?;
println!("version: {}", meta.version);
}
}
The scan function is tolerant of corruption — it skips invalid regions and continues looking for the next valid TENSOGRM marker.
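The corruption-tolerant behaviour boils down to resynchronising on the marker. The sketch below only locates marker positions; the real scan() additionally validates framing and returns (start, length) pairs:

```python
def scan_markers(buffer: bytes, marker: bytes = b"TENSOGRM"):
    """Illustrative resync loop: collect every candidate message start
    by searching for the preamble marker, skipping any junk between
    occurrences instead of aborting on the first bad region."""
    starts, pos = [], buffer.find(marker)
    while pos != -1:
        starts.append(pos)
        pos = buffer.find(marker, pos + 1)
    return starts

buf = b"garbage" + b"TENSOGRM....msg1" + b"\x00junk\x00" + b"TENSOGRM..msg2"
assert scan_markers(buf) == [7, 29]
```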
NaN / Inf handling
By default the Tensogram encoder rejects any NaN or ±Inf in
float / complex payloads. The encode call fails with
TensogramError::Encoding (C FFI: TgmError::Encoding; Python:
EncodingError; TypeScript: EncodingError; C++: tensogram::encoding_error)
and names the element index, dtype, and a hint that points at the
opt-in flags described below.
This chapter walks through the three policies available on encode:
- Reject (default) — any non-finite input fails the call. Use this when your pipeline guarantees finite values and any NaN / Inf is a bug you want to surface loudly.
- Allow NaN — NaN values are substituted with 0.0 on the wire and their positions are recorded in a compressed bitmask stored alongside the payload. Decode restores canonical NaN at those positions by default.
- Allow ±Inf — same as allow_nan but for +∞ and −∞ together (the flag covers both signs; two per-sign bitmasks are written when both kinds appear in the payload).
The mask companion is formally called the NTensorFrame —
wire-format type 9, defined in
plans/BITMASK_FRAME.md
and the wire-format reference.
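Conceptually, the substitute-and-mask round trip looks like the sketch below, where plain per-element bool lists stand in for the compressed Roaring bitmask actually written to the frame:

```python
import math

def mask_and_substitute(values):
    # Encode side: record NaN positions, write 0.0 on the wire.
    mask = [math.isnan(v) for v in values]
    wire = [0.0 if m else v for m, v in zip(mask, values)]
    return wire, mask

def restore(wire, mask):
    # Decode side: put canonical NaN back at every masked position.
    return [math.nan if m else v for m, v in zip(mask, wire)]

wire, mask = mask_and_substitute([1.0, math.nan, 3.0])
assert wire == [1.0, 0.0, 3.0]            # zero-substituted payload
out = restore(wire, mask)
assert math.isnan(out[1]) and out[0] == 1.0 and out[2] == 3.0
```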
When to use which policy
| Situation | Flag to set |
|---|---|
| Finite data only, want hard failure on contamination | default (both off) |
| NetCDF _FillValue → NaN, Zarr missing data, sensor gaps | allow_nan=true |
| Propagating numerical overflow as ±Inf | allow_inf=true |
| Mixed missing-value / overflow data | both true |
Don’t pre-process to a sentinel value when allow_nan /
allow_inf does the job — the bitmask is designed to compress
aggressively (hybrid Roaring containers by default) and keeps the
missing-data semantics visible to the decoder. Sentinel values
throw that information away.
Cross-language opt-in
Rust
#![allow(unused)]
fn main() {
use tensogram::{encode, EncodeOptions, GlobalMetadata, DataObjectDescriptor};
let options = EncodeOptions {
allow_nan: true,
allow_inf: true,
..Default::default()
};
let msg = encode(&meta, &[(&desc, payload_bytes)], &options)?;
}
Python
import numpy as np
import tensogram
data = np.array([1.0, np.nan, 3.0], dtype=np.float64)
msg = tensogram.encode(
{"version": 2},
[(desc, data)],
allow_nan=True,
)
decoded = tensogram.decode(msg)
# decoded.objects[0].data() → [1.0, nan, 3.0]
TypeScript
import { encode, decode } from '@ecmwf/tensogram';
const msg = encode(
{ version: 2 },
[{ descriptor, data: new Float64Array([1, NaN, 3]) }],
{ allowNan: true },
);
const decoded = decode(msg);
C++
tensogram::encode_options opts;
opts.allow_nan = true;
auto msg = tensogram::encode(metadata_json, objects, opts);
CLI
$ tensogram --allow-nan reshuffle -o out.tgm input.tgm
$ TENSOGRAM_ALLOW_NAN=1 tensogram convert-netcdf data.nc -o data.tgm
Decode-side reconstruction
By default every decode path restores the canonical quiet-NaN / ±Inf
bit pattern at every masked position. Opt out (e.g. to inspect
the on-disk zero-substituted representation) by passing
restore_non_finite=false:
# Get the 0.0-substituted payload without the NaN bits.
raw = tensogram.decode(msg, restore_non_finite=False)
# raw.objects[0].data() → [1.0, 0.0, 3.0]
The advanced decode_with_masks API (Rust + Python) returns both
the zero-substituted payload AND the raw decompressed
per-kind Vec<bool> masks, so callers can build custom
missing-value representations without materialising canonical NaN
bytes.
Lossy reconstruction — read this carefully
The masked encode path does not preserve the original NaN payload bits. On decode every masked NaN is restored with the canonical quiet-NaN pattern:
- `f32::NAN` bits = `0x7FC00000`
- `f64::NAN` bits = `0x7FF8000000000000`
- Float16 / bfloat16 use their dtype-native quiet-NaN patterns
- Complex64 / complex128 restore the canonical pattern to both real and imag components
Signalling NaNs, custom payload bits, and mixed real / imag
kinds for complex dtypes are therefore flattened to the canonical
form through a mask round-trip. If you need bit-exact NaN
preservation, pre-encode your payload and use
encode_pre_encoded to bypass the substitute-and-mask stage
entirely. See plans/BITMASK_FRAME.md §7.1
for the full design rationale.
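You can observe the canonical bit patterns directly with the standard `struct` module (on CPython with mainstream IEEE-754 hardware, `float("nan")` is the canonical quiet NaN):

```python
import struct

def f32_bits(x: float) -> int:
    # Reinterpret a value as its 32-bit IEEE-754 bit pattern.
    return struct.unpack("<I", struct.pack("<f", x))[0]

def f64_bits(x: float) -> int:
    # Reinterpret a value as its 64-bit IEEE-754 bit pattern.
    return struct.unpack("<Q", struct.pack("<d", x))[0]

# The canonical quiet-NaN patterns named above:
assert f32_bits(float("nan")) == 0x7FC00000
assert f64_bits(float("nan")) == 0x7FF8000000000000
```

A signalling NaN or a NaN with custom payload bits has a different bit pattern on input, but comes back as one of the two values above after a mask round-trip — which is exactly the lossiness the warning describes.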
Mask compression methods
Six methods are available per-kind:
| Method | Best for | Feature |
|---|---|---|
| `roaring` (default) | any mask shape | pure Rust, works on WASM |
| `rle` | highly clustered masks (land / sea, swath gaps) | pure Rust |
| `blosc2` | dense dtype-aligned masks | `blosc2` feature |
| `zstd` | generic good-ratio | `zstd` feature |
| `lz4` | decode-speed priority | `lz4` feature |
| `none` | tiny masks (auto-fallback) | always available |
Small masks (uncompressed bit-packed byte count ≤ 128 by default)
automatically fall back to none regardless of the requested
method — compressing a few bytes costs more than it saves. Set
small_mask_threshold_bytes = 0 to disable the auto-fallback.
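To see why `rle` wins on clustered masks, consider a toy run-length encoder (a sketch of the idea, not the library's wire encoding):

```python
def rle_encode(mask):
    # Collapse a boolean mask into (value, run_length) pairs.
    runs = []
    for bit in mask:
        if runs and runs[-1][0] == bit:
            runs[-1][1] += 1
        else:
            runs.append([bit, 1])
    return [tuple(r) for r in runs]

def rle_decode(runs):
    out = []
    for bit, n in runs:
        out.extend([bit] * n)
    return out

# A land/sea-style mask with long runs collapses to three pairs:
mask = [True] * 1000 + [False] * 24 + [True] * 6
```

A 1030-element mask with three runs encodes to three pairs; a mask that alternates every element would instead expand, which is why the method is selectable per-kind.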
Set per-kind methods via the matching options:
msg = tensogram.encode(
meta, [(desc, data)],
allow_nan=True, allow_inf=True,
nan_mask_method='rle',
pos_inf_mask_method='roaring',
neg_inf_mask_method='roaring',
small_mask_threshold_bytes=0,
)
Validation
tensogram validate --full cross-checks every NaN / ±Inf in the
decoded payload against the frame’s mask companion: masked
positions are expected and pass; any NaN / Inf at a non-masked
position is reported as NanDetected / InfDetected
(see the validator reference).
Files without a mask companion keep the pre-0.17 semantics — any non-finite value in the decoded output is an error.
Migration from pre-0.17
Prior to 0.17 the reject_nan / reject_inf opt-in flags upgraded
the NaN check to be pipeline-independent. These flags are
removed in 0.17 (breaking change). Rejection is now always on by
default; opt in to masked substitution with the replacement flags:
| Pre-0.17 | 0.17+ |
|---|---|
| `reject_nan=False` (default, pass-through) | `allow_nan=True` (substitute + mask) |
| `reject_nan=True` (opt-in reject) | default (always reject) |
| `reject_inf=False` / `True` | same split, `allow_inf` |
See CHANGELOG.md for the full breaking-change list and upgrade notes.
Working with Files
The TensogramFile struct provides a high-level API for reading and writing .tgm files. It handles lazy scanning, buffered appending, and random access by message index.
Creating a File
#![allow(unused)]
fn main() {
use tensogram::{TensogramFile, EncodeOptions};
let mut file = TensogramFile::create("forecast.tgm")?;
}
This creates (or truncates) the file. No data is written yet.
Appending Messages
#![allow(unused)]
fn main() {
use std::collections::BTreeMap;
use tensogram::{
GlobalMetadata, DataObjectDescriptor, ByteOrder, Dtype, EncodeOptions,
};
let global = GlobalMetadata { version: 2, ..Default::default() };
let desc = DataObjectDescriptor {
obj_type: "ntensor".to_string(),
ndim: 2,
shape: vec![100, 200],
strides: vec![200, 1],
dtype: Dtype::Float32,
byte_order: ByteOrder::Big,
encoding: "none".to_string(),
filter: "none".to_string(),
compression: "none".to_string(),
params: BTreeMap::new(),
hash: None,
};
file.append(&global, &[(&desc, &data)], &EncodeOptions::default())?;
}
Each append encodes one message and appends it to the end of the file. You can call it as many times as you like — each message is independent and self-describing.
Typical pattern for writing a multi-message file (one message per parameter, run, subject, sample, experiment — whatever your pipeline produces):
#![allow(unused)]
fn main() {
let mut file = TensogramFile::create("output.tgm")?;
for key in ["2t", "10u", "10v", "msl"] {
let (global, desc, data) = produce_field(key);
file.append(&global, &[(&desc, &data)], &EncodeOptions::default())?;
}
}
Opening and Counting Messages
#![allow(unused)]
fn main() {
let mut file = TensogramFile::open("forecast.tgm")?;
// Streaming scan happens here (lazily, on first access)
let count = file.message_count()?;
println!("{} messages in file", count);
}
The first access triggers a streaming scan that reads preamble-sized chunks and seeks forward, so it never loads the entire file into memory. After that, every read_message call is a seek + read — no further scanning.
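The scan loop can be sketched in pure Python. This is illustrative only: it assumes, for simplicity, a preamble that begins with the 8-byte `TENSOGRM` magic followed by a little-endian u64 `total_length` — the real 24-byte preamble layout is defined by the wire-format reference.

```python
import io
import struct

MAGIC = b"TENSOGRM"

def scan_offsets(stream):
    # Streaming scan sketch: read each preamble, record (offset, length),
    # then seek past the payload. Only preamble-sized reads are issued,
    # so the file is never loaded into memory.
    offsets = []
    pos = 0
    while True:
        stream.seek(pos)
        head = stream.read(16)
        if len(head) < 16 or head[:8] != MAGIC:
            break
        total_length = struct.unpack("<Q", head[8:16])[0]
        offsets.append((pos, total_length))
        pos += total_length
    return offsets

assert scan_offsets(io.BytesIO(b"")) == []  # empty file: no messages
```

After this scan, `read_message(i)` only needs the stored `(offset, length)` pair: one seek plus one read.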
Reading Messages
#![allow(unused)]
fn main() {
use tensogram::{decode, DecodeOptions};
// Read raw bytes of message 3
let raw_bytes = file.read_message(3)?;
// Decode message 3
let (meta, objects) = decode(&raw_bytes, &DecodeOptions::default())?;
// Each element is (DataObjectDescriptor, Vec<u8>)
let (ref desc, ref data) = objects[0];
println!("shape: {:?}, dtype: {}", desc.shape, desc.dtype);
}
Both are O(1) after the initial scan: they seek to the stored offset and read length bytes.
Iterating Over All Messages
#![allow(unused)]
fn main() {
let mut file = TensogramFile::open("forecast.tgm")?;
for raw in file.iter()? {
let raw = raw?;
let meta = tensogram::decode_metadata(&raw)?;
println!("version: {}", meta.version);
}
}
Memory note: For files with many large messages, prefer iterating by index with `read_message(i)` inside a loop to process one at a time.
Random Access by Index
One of Tensogram’s design goals is O(1) object access. After scanning, any message is reachable in constant time. Within a message, any object is reachable in constant time via the binary header’s offset table:
flowchart TD
A["file.read_message(42)"]
B["Message bytes"]
C["Binary header"]
D["Seek to payload 2"]
E["Decode only object 2"]
A -- "seek + read" --> B
B --> C
C -- "lookup offset for object 2" --> D
D --> E
style A fill:#388e3c,stroke:#2e7d32,color:#fff
style E fill:#1565c0,stroke:#0d47a1,color:#fff
File Layout Diagram
forecast.tgm
├── [message 0] — TENSOGRM ... 39277777
├── [message 1] — TENSOGRM ... 39277777
├── [message 2] — TENSOGRM ... 39277777
│   ├── Preamble (24B)
│   ├── Header Metadata Frame (CBOR GlobalMetadata)
│   ├── Header Index Frame (CBOR offsets)
│   ├── Data Object Frame 0 (payload + CBOR descriptor)
│   ├── Data Object Frame 1 (payload + CBOR descriptor)
│   └── Postamble (16B)
└── ...
No file-level header, no file-level index. All indexing is per-message, built in-memory at scan time.
Remote Access (optional)
Enable the remote feature to open .tgm files on S3, GCS, Azure, or HTTP with selective range-based reads:
[dependencies]
tensogram = { path = "...", features = ["remote"] }
#![allow(unused)]
fn main() {
use tensogram::{TensogramFile, DecodeOptions};
let mut file = TensogramFile::open_source("s3://bucket/forecast.tgm")?;
// Fetch only the second object from message 0 — no full download
let (meta, desc, data) = file.decode_object(0, 1, &DecodeOptions::default())?;
}
Supports header-indexed and footer-indexed files (read-only) from Rust, Python, xarray, and zarr. See the Remote Access guide for storage options, request budgets, and limitations.
Memory-Mapped I/O (optional)
Enable the mmap feature to use memory-mapped file access:
[dependencies]
tensogram = { path = "...", features = ["mmap"] }
#![allow(unused)]
fn main() {
let mut file = TensogramFile::open_mmap("forecast.tgm")?;
// Scan happens during open_mmap — no lazy scan needed
let count = file.message_count()?;
// Reads from the memory-mapped region (no additional seek)
let raw = file.read_message(0)?;
}
This is useful for large files where you want to avoid per-message seek + read overhead. The file is mapped read-only. All existing decode functions work unchanged.
Async I/O (optional)
Enable the async feature for tokio-based non-blocking file operations:
[dependencies]
tensogram = { path = "...", features = ["async"] }
#![allow(unused)]
fn main() {
let mut file = TensogramFile::open_async("forecast.tgm").await?;
// Read a message without blocking the async runtime
let raw = file.read_message_async(0).await?;
// Decode also runs on a blocking thread (safe for FFI codecs)
let (meta, objects) = file.decode_message_async(0, &opts).await?;
}
All CPU-intensive work (scanning, decoding, FFI calls to compression libraries) runs via tokio::task::spawn_blocking, so it won’t block the async runtime.
Edge Cases
Appending to an Existing File
TensogramFile::create truncates. To append to an existing file, use standard file I/O:
#![allow(unused)]
fn main() {
use std::io::Write;
let mut f = std::fs::OpenOptions::new().append(true).open("forecast.tgm")?;
let global = GlobalMetadata { version: 2, ..Default::default() };
let message = encode(&global, &[(&desc, &data)], &EncodeOptions::default())?;
f.write_all(&message)?;
}
Or open the file with TensogramFile::open and use append() — the append method always writes at the end regardless of how the file was opened.
Corrupted Messages
The scanner skips corrupted messages and continues. A message is considered corrupted if:
- The `total_length` field points to a location where `39277777` is not present
- The header is truncated

The scanner recovers by advancing one byte and searching for the next `TENSOGRM`.
Empty Files
message_count() returns 0 for an empty file. read_message(0) returns an error.
Remote Access
Enable the remote feature to open .tgm files on HTTP, S3, GCS, or Azure without downloading the whole file. Individual objects are fetched via targeted range requests.
[dependencies]
tensogram = { path = "...", features = ["remote"] }
Opening a Remote File
#![allow(unused)]
fn main() {
use tensogram::TensogramFile;
// Auto-detect: local path or remote URL
let mut file = TensogramFile::open_source("https://example.com/data.tgm")?;
// S3
let mut file = TensogramFile::open_source("s3://bucket/data.tgm")?;
}
open_source inspects the URL scheme and routes to the remote backend for s3://, s3a://, gs://, az://, azure://, http://, https://. Everything else is treated as a local path.
The Rust open() method is unchanged and always opens a local file. In Python, TensogramFile.open() auto-detects remote URLs.
You can also check whether a string is a remote URL without opening:
#![allow(unused)]
fn main() {
use tensogram::is_remote_url;
assert!(is_remote_url("s3://bucket/file.tgm"));
assert!(!is_remote_url("/local/path/file.tgm"));
}
Storage Options (Credentials, Region, etc.)
Pass an explicit options map for fine-grained control:
#![allow(unused)]
fn main() {
use std::collections::BTreeMap;
use tensogram::TensogramFile;
let mut opts = BTreeMap::new();
opts.insert("aws_access_key_id".to_string(), "AKIA...".to_string());
opts.insert("aws_secret_access_key".to_string(), "...".to_string());
opts.insert("region".to_string(), "eu-west-1".to_string());
let mut file = TensogramFile::open_remote("s3://bucket/data.tgm", &opts)?;
}
When no options are passed, credentials are read from the environment (e.g. AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_DEFAULT_REGION, GOOGLE_APPLICATION_CREDENTIALS).
Python Usage
import tensogram
# Auto-detect remote URL
with tensogram.TensogramFile.open("s3://bucket/data.tgm") as f:
meta = f.file_decode_metadata(0)
result = f.file_decode_object(0, 0)
data = result["data"] # numpy array
# With explicit storage options
with tensogram.TensogramFile.open_remote(
"s3://bucket/data.tgm",
{"region": "eu-west-1"}
) as f:
print(f.source()) # "s3://bucket/data.tgm"
print(f.is_remote()) # True
xarray Usage
import xarray as xr
ds = xr.open_dataset(
"s3://bucket/data.tgm",
engine="tensogram",
storage_options={"region": "eu-west-1"},
)
Supported Schemes
| Scheme | Backend | Notes |
|---|---|---|
| `http://`, `https://` | HTTP | `allow_http` is set automatically for `http://` |
| `s3://`, `s3a://` | Amazon S3 | Env-based or explicit credentials |
| `gs://` | Google Cloud Storage | Service account or env |
| `az://`, `azure://` | Azure Blob Storage | MSI or env |
All backends are provided by the object_store crate.
Object-Level Access
Three methods provide selective access without downloading full messages:
#![allow(unused)]
fn main() {
use tensogram::DecodeOptions;
// Metadata only — triggers layout discovery on first call, then cached
let meta = file.decode_metadata(0)?;
// Descriptors — reads only the descriptor data needed for each object
let (meta, descriptors) = file.decode_descriptors(0)?;
// Single object by index — fetches only the target object frame
let (meta, desc, data) = file.decode_object(0, 2, &DecodeOptions::default())?;
}
These methods also work on local files, where they read the full message and decode the requested parts.
Request Budget
Header-indexed files (buffered writes)
| Phase | Operation | HTTP Requests |
|---|---|---|
| Open | open_source / open_remote | 1 HEAD + 1 GET (first preamble only, 24 B) |
| Next message | first data access to message i | 1 GET (preamble + layout combined) |
| Cached | decode_metadata(i) again | 0 (served from cache) |
| Object read | decode_object(i, j) | 1 GET per object (if layout already cached) |
| Descriptors | decode_descriptors(i) | 1–3 GETs per object (descriptor-only reads for large frames) |
| Message count | message_count() | 1 GET per undiscovered message (24 B each, preamble only) |
Footer-indexed files (buffered with known total_length)
| Phase | Operation | HTTP Requests |
|---|---|---|
| Open | open_source / open_remote | 1 HEAD + 1 GET (first preamble only, 24 B) |
| Next message | first data access to message i | 1 GET (preamble) + 1 GET (suffix) |
| Cached | decode_metadata(i) again | 0 (served from cache) |
| Object read | decode_object(i, j) | 1 GET per object (if layout already cached) |
| Descriptors | decode_descriptors(i) | 1–3 GETs per object |
| Message count | message_count() | 1 GET per undiscovered message (24 B each) |
Streaming files (total_length=0)
| Phase | Operation | HTTP Requests |
|---|---|---|
| Open | open_source / open_remote | 1 HEAD + 1 GET (preamble) + 1 GET (END_MAGIC check) |
| First access | decode_metadata(0) | 2 GETs (postamble + footer region) |
| Object read | decode_object(0, j) | 1 GET per object |
| Message count | message_count() | 0 (streaming is always the last message) |
Layout discovery is combined with message scanning for both header-indexed and footer-indexed messages — the library reads the preamble and layout in one GET (header-indexed) or two GETs (footer-indexed suffix read). message_count() uses a lean scan path (24 bytes per preamble). Streaming messages (total_length=0) must be the last message in a multi-message file.
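Every GET in the budget tables above is a byte-range request. A tiny helper (hypothetical, not part of the library) shows how a byte span maps onto the HTTP `Range` header, whose end offset is inclusive per RFC 9110:

```python
def range_header(offset: int, length: int) -> dict:
    # HTTP byte ranges are inclusive at both ends, so a 24-byte
    # preamble read at offset 0 requests bytes 0 through 23.
    return {"Range": f"bytes={offset}-{offset + length - 1}"}

# Preamble read for the first message:
hdr = range_header(0, 24)
```

For example, the "header chunk, up to 256KB" read at offset 24 becomes `bytes=24-262167`.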
How It Works (Header-Indexed Example)
sequenceDiagram
participant App
participant TensogramFile
participant ObjectStore
App->>TensogramFile: open_source("s3://bucket/file.tgm")
TensogramFile->>ObjectStore: HEAD (get file size)
TensogramFile->>ObjectStore: GET range 0..24 (preamble)
Note right of TensogramFile: Discover message offsets
App->>TensogramFile: decode_object(0, 2)
TensogramFile->>ObjectStore: GET range 24..N (header chunk, up to 256KB)
Note right of TensogramFile: First access: parse metadata + index, cache layout
TensogramFile->>ObjectStore: GET range offset..offset+len (object frame 2)
TensogramFile-->>App: (metadata, descriptor, decoded_bytes)
Checking if a File is Remote
#![allow(unused)]
fn main() {
use tensogram::TensogramFile;
let file = TensogramFile::open_source("s3://bucket/data.tgm")?;
assert!(file.is_remote());
println!("source: {}", file.source()); // "s3://bucket/data.tgm"
}
source() returns the original URL for remote files and the file path for local files.
Error Handling
Remote access can return different TensogramError variants depending on the failure:
| Error condition | Error type | When it happens |
|---|---|---|
| Invalid URL | Remote | open_source / open_remote with a malformed URL |
| Connection failure | Remote | Network unreachable, DNS failure, timeout |
| File not found | Remote | HTTP 404, S3 NoSuchKey |
| No valid messages | Remote | File contains no parseable messages |
| Unsupported layout | Remote | Message lacks both header-index and footer-index flags |
| Object index out of range | Object | decode_object(i, j) where j >= object_count |
All errors are returned as Result. The library avoids panics.
Shared Runtime
Remote I/O uses a process-wide shared tokio runtime (multi-thread, 2 workers) created on first use. All RemoteBackend instances share the same runtime, so TCP connection pools and DNS caches are reused across calls.
The sync bridge adapts to the calling context:
- Not in a tokio runtime (Python, CLI): the shared runtime’s handle drives the future directly — no extra thread creation.
- Inside a multi-thread tokio runtime (`#[tokio::test]`, server handler): `block_in_place` tells tokio to spawn a replacement worker so the blocked thread doesn’t cause runtime starvation.
- Inside a current-thread tokio runtime: falls back to a scoped thread, since `block_in_place` is not supported on single-threaded runtimes.
Async API
The async feature enables async methods for decode, read, and metadata extraction. These work for both local and remote files:
#![allow(unused)]
fn main() {
use tensogram::{TensogramFile, DecodeOptions};
// Async decode methods (feature = "async")
let meta = file.decode_metadata_async(0).await?;
let (meta, descs) = file.decode_descriptors_async(0).await?;
let (meta, desc, data) = file.decode_object_async(0, 0, &DecodeOptions::default()).await?;
let msg = file.read_message_async(0).await?;
}
When both remote and async features are enabled, async open methods are also available:
#![allow(unused)]
fn main() {
// Async open (auto-detects local vs remote) — requires remote + async
let mut file = TensogramFile::open_source_async("s3://bucket/data.tgm").await?;
// Async open with explicit storage options
let mut file = TensogramFile::open_remote_async(
"s3://bucket/data.tgm",
&opts,
).await?;
}
For remote backends, async methods directly await object store operations, bypassing the sync bridge entirely. For local backends, they use spawn_blocking for file I/O.
[dependencies]
tensogram = { path = "...", features = ["remote", "async"] }
Range Reads
TensogramFile::decode_range() supports partial object decoding for both local and remote files. It takes an object index and a list of (offset, count) element ranges, returning only the requested elements without decoding the entire object.
For remote files, it fetches the full object frame (via indexed access) then runs the range decode pipeline on the raw payload. This is most beneficial with szip-compressed objects that have szip_block_offsets, where only the compressed blocks covering the requested range are decompressed.
#![allow(unused)]
fn main() {
// Rust: decode elements 100..200 from object 0
let ranges = vec![(100, 100)];
let (desc, parts) = file.decode_range(0, 0, &ranges, &DecodeOptions::default())?;
}
# Python: decode elements 100..200 from object 0
arr = file.file_decode_range(0, 0, [(100, 100)], join=True)
The xarray backend uses file_decode_range automatically when slicing remote arrays that support partial decode (uncompressed or szip-compressed objects without shuffle filters).
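For the uncompressed case, mapping `(offset, count)` element ranges onto byte spans of the payload is plain arithmetic. A sketch (hypothetical helper, not the library API):

```python
def ranges_to_byte_spans(ranges, itemsize):
    # Map (element_offset, element_count) ranges onto (byte_offset,
    # byte_length) spans of a contiguous uncompressed payload — the
    # trivial case of partial decode.
    return [(off * itemsize, cnt * itemsize) for off, cnt in ranges]

# Elements 100..200 of a float32 object live at bytes 400..800:
spans = ranges_to_byte_spans([(100, 100)], itemsize=4)
```

With szip block offsets the same idea applies at block granularity: only the compressed blocks whose decompressed extent overlaps the requested spans need to be fetched and decoded.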
Descriptor-Only Reads
decode_descriptors() fetches only the CBOR descriptor from each data object frame, not the full payload. For large objects (hundreds of MB), this avoids downloading the entire frame just to extract a few hundred bytes of metadata.
For frames smaller than 64 KB, the full frame is read in a single request (fewer round-trips). For larger frames, the library reads only the frame header (16 bytes), footer (12 bytes), and the CBOR descriptor region.
Limitations
- Streaming messages must be last. In multi-message files, streaming-encoded messages (`total_length=0`) must be the last message. The remote scanner assumes the streaming message extends to the end of the file.
- Optimistic scan for buffered messages. Remote message scanning validates preamble magic and `total_length` plausibility but does not verify end-of-message markers for buffered messages. Streaming messages (`total_length=0`) do validate the END_MAGIC at EOF.
- Read-only. Remote writes are not supported.
- Header probe size. Layout discovery reads a single chunk of up to 256 KB from the header region. If the metadata or index frame does not fit in this chunk, `decode_metadata()` will error (it does not retry with a larger read).
- HTTP server requirements. The remote HTTP server must support `HEAD` requests (for file size) and `Range` request headers (for partial reads).
- `read_message()` and `decode_message()` download the full message even for remote files. Use `decode_metadata()`, `decode_descriptors()`, or `decode_object()` for selective access.
- Zarr remote reads are lazy per-chunk. The zarr store fetches only metadata at open time; individual chunks are decoded on first access. Local files still use eager decode for lower latency.
- Sequential async access. Async methods take `&mut self`, so a single file handle cannot serve concurrent async reads. Open separate handles for parallelism.
Iterators
Tensogram provides lazy iterator APIs for traversing messages and objects without loading everything into memory at once.
Hierarchy
graph TD
F[File / Buffer] -->|messages| M1[Message 1]
F -->|messages| M2[Message 2]
F -->|messages| M3[Message N]
M1 -->|objects| O1["(DataObjectDescriptor, Vec<u8>)"]
M1 -->|objects| O2["(DataObjectDescriptor, Vec<u8>)"]
O1 -->|access| D1["descriptor + data"]
O2 -->|access| D2["descriptor + data"]
Rust API
Buffer message iterator
Iterate over messages in a &[u8] byte buffer. Zero-copy: yields slices pointing into the original buffer.
#![allow(unused)]
fn main() {
use tensogram::{messages, decode, DecodeOptions};
let buf: Vec<u8> = std::fs::read("multi.tgm")?;
for msg_bytes in messages(&buf) {
let (meta, objects) = decode(msg_bytes, &DecodeOptions::default())?;
println!("version={} objects={}", meta.version, objects.len());
}
}
The iterator calls scan() once on construction, then yields &[u8] slices in sequence. Garbage between valid messages is silently skipped.
MessageIter implements ExactSizeIterator, so .len() returns the remaining count at any point.
Object iterator
Iterate over the decoded objects (tensors) inside a single message. Each item is a (DataObjectDescriptor, Vec<u8>) tuple:
#![allow(unused)]
fn main() {
use tensogram::{objects, DecodeOptions};
for result in objects(&msg_bytes, DecodeOptions::default())? {
let (descriptor, data) = result?;
println!("shape={:?} dtype={} encoding={} bytes={}",
descriptor.shape, descriptor.dtype, descriptor.encoding, data.len());
}
}
Each object is decoded through the full pipeline on demand — objects you don’t consume are never decoded.
For metadata-only access (no payload decode), use objects_metadata. This returns DataObjectDescriptors without decoding any payloads:
#![allow(unused)]
fn main() {
use tensogram::objects_metadata;
for desc in objects_metadata(&msg_bytes)? {
println!("shape={:?} dtype={} byte_order={}", desc.shape, desc.dtype, desc.byte_order);
}
}
File iterator
Iterate over messages stored in a .tgm file with seek-based lazy I/O:
#![allow(unused)]
fn main() {
use tensogram::{TensogramFile, objects, DecodeOptions};
let mut file = TensogramFile::open("forecast.tgm")?;
for raw in file.iter()? {
let raw = raw?;
// Nested: iterate objects within this message
for result in objects(&raw, DecodeOptions::default())? {
let (desc, data) = result?;
println!("{:?} {} {} bytes", desc.shape, desc.dtype, data.len());
}
}
}
file.iter() scans the file once (if not already scanned), then returns a FileMessageIter that reads each message via seek + read. The iterator does not borrow the TensogramFile — it owns an open file handle and a copy of the message offsets.
C / C++ API
The C FFI uses an opaque-handle + next() pattern. Each iterator returns TGM_OK while items remain, and TGM_END_OF_ITER as an end sentinel.
Buffer iterator
tgm_buffer_iter_t *iter;
tgm_buffer_iter_create(buf, buf_len, &iter);
const uint8_t *msg_ptr;
size_t msg_len;
while (tgm_buffer_iter_next(iter, &msg_ptr, &msg_len) == TGM_OK) {
// msg_ptr borrows from the original buffer
tgm_message_t *msg;
tgm_decode(msg_ptr, msg_len, 0, &msg);
// ... use msg ...
tgm_message_free(msg);
}
tgm_buffer_iter_free(iter);
Lifetime: the buffer must remain valid until `tgm_buffer_iter_free`.
File iterator
tgm_file_t *file;
tgm_file_open("data.tgm", &file);
tgm_file_iter_t *iter;
tgm_file_iter_create(file, &iter);
tgm_bytes_t raw;
while (tgm_file_iter_next(iter, &raw) == TGM_OK) {
// raw.data is owned — free with tgm_bytes_free
tgm_message_t *msg;
tgm_decode(raw.data, raw.len, 0, &msg);
// ... use msg ...
tgm_message_free(msg);
tgm_bytes_free(raw);
}
tgm_file_iter_free(iter);
tgm_file_close(file);
Object iterator
tgm_object_iter_t *iter;
tgm_object_iter_create(msg_ptr, msg_len, 0, &iter);
tgm_message_t *obj;
while (tgm_object_iter_next(iter, &obj) == TGM_OK) {
uint64_t ndim = tgm_object_ndim(obj, 0);
const uint64_t *shape = tgm_object_shape(obj, 0);
// ... use shape, data ...
tgm_message_free(obj);
}
tgm_object_iter_free(iter);
C++ API
The C++ wrapper (include/tensogram.hpp) provides RAII iterator classes that manage the underlying C handles automatically.
Buffer iterator
#include <tensogram.hpp>
auto buf = /* read file into std::vector<uint8_t> */;
tensogram::buffer_iterator iter(buf.data(), buf.size());
const std::uint8_t* msg_ptr;
std::size_t msg_len;
while (iter.next(msg_ptr, msg_len)) {
auto msg = tensogram::decode(msg_ptr, msg_len);
std::printf("version=%llu objects=%zu\n", msg.version(), msg.num_objects());
}
File iterator
auto f = tensogram::file::open("forecast.tgm");
tensogram::file_iterator iter(f);
std::vector<std::uint8_t> raw;
while (iter.next(raw)) {
auto msg = tensogram::decode(raw.data(), raw.size());
std::printf("objects=%zu\n", msg.num_objects());
}
Object iterator
tensogram::object_iterator iter(msg_ptr, msg_len);
tensogram::message obj = tensogram::decode(msg_ptr, msg_len); // placeholder for next()
while (iter.next(obj)) {
auto o = obj.object(0);
auto shape = o.shape();
std::printf("dtype=%s shape=[%llu, %llu]\n",
o.dtype_string().c_str(), shape[0], shape[1]);
}
Range-based for on message
auto msg = tensogram::decode(buf, len);
for (const auto& obj : msg) {
std::printf("dtype=%s bytes=%zu\n",
obj.dtype_string().c_str(), obj.data_size());
}
Python API
TensogramFile supports iteration, indexing, and slicing:
import tensogram
# Iterate all messages
with tensogram.TensogramFile.open("forecast.tgm") as f:
for meta, objects in f:
for desc, arr in objects:
print(f" shape={arr.shape} dtype={desc.dtype}")
# Index and slice
with tensogram.TensogramFile.open("forecast.tgm") as f:
meta, objects = f[0] # first message
meta, objects = f[-1] # last message
subset = f[10:20] # range of messages
every_5th = f[::5] # strided access
# Buffer iteration
buf = open("data.tgm", "rb").read()
for meta, objects in tensogram.iter_messages(buf):
desc, arr = objects[0]
print(f" shape={arr.shape}")
decode(), decode_message(), file iteration, and iter_messages() return Message namedtuples with .metadata and .objects fields.
Tuple unpacking (meta, objects = msg) also works. TensogramFile supports len(f) and context manager (with).
Thread safety: iterators own independent file handles and buffer copies — no shared mutable state. Safe under free-threaded Python (PEP 703, no GIL).
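The dual attribute/tuple behaviour described above is exactly what a `namedtuple` provides. An illustrative re-creation of the return shape (not the binding's actual class):

```python
from collections import namedtuple

# A namedtuple supports both attribute access (.metadata, .objects)
# and positional tuple unpacking from the same object.
Message = namedtuple("Message", ["metadata", "objects"])

msg = Message(metadata={"version": 2}, objects=[("desc", b"\x00\x01")])
meta, objects = msg  # same pattern as `meta, objects = tensogram.decode(buf)`
```

Because a namedtuple is immutable and a plain tuple subclass, it is also safe to share across threads and cheap to destructure.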
Edge cases
| Scenario | Behavior |
|---|---|
| Empty buffer / file | Iterator yields zero items |
| Garbage between messages | Silently skipped by scanner |
| Truncated message at end | Skipped (not yielded) |
| Zero-object message | objects() returns empty iterator |
| I/O error during file iteration | FileMessageIter::next() yields Err(...) |
Python API
Tensogram provides native Python bindings via PyO3. All tensor data crosses the boundary as NumPy arrays.
Installation
# From PyPI (once published)
pip install tensogram
# From source
pip install maturin numpy
cd python/bindings && maturin develop
Quick Start
import numpy as np
import tensogram
# Encode a 2D temperature field
temps = np.random.randn(100, 200).astype(np.float32) + 273.15
meta = {"version": 2}
desc = {"type": "ntensor", "shape": [100, 200], "dtype": "float32"}
msg = tensogram.encode(meta, [(desc, temps)])
# Decode it back
meta, objects = tensogram.decode(msg)
desc, array = objects[0]
print(array.shape) # (100, 200)
Encoding
Basic encoding
tensogram.encode() takes metadata, a list of (descriptor, array) pairs, and returns wire-format bytes:
msg = tensogram.encode(
{"version": 2},
[({"type": "ntensor", "shape": [3], "dtype": "float32"}, np.array([1, 2, 3], dtype=np.float32))],
hash="xxh3", # default; use None to skip hashing
)
Descriptor keys
Every object in a message is described by a dict. The three required keys define what the tensor looks like; the optional keys control how it is stored on the wire.
| Key | Required | Default | Description |
|---|---|---|---|
| `"type"` | yes | — | Object type, e.g. `"ntensor"` |
| `"shape"` | yes | — | Tensor dimensions, e.g. `[100, 200]` |
| `"dtype"` | yes | — | Data type name (see Data Types) |
| `"strides"` | no | row-major | Element strides; computed automatically if omitted |
| `"byte_order"` | no | native | `"little"` or `"big"`; defaults to host byte order |
| `"encoding"` | no | `"none"` | Encoding stage — see below |
| `"filter"` | no | `"none"` | Filter stage — see below |
| `"compression"` | no | `"none"` | Compression stage — see below |
Any additional keys (e.g. "reference_value", "bits_per_value") are stored in the descriptor’s .params dict and passed through to the encoding pipeline.
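The row-major default for `"strides"` is the standard C-order rule: the last axis is contiguous, and each earlier axis steps over the product of all axes after it. A sketch of that computation:

```python
def row_major_strides(shape):
    # Default element strides for C-order (row-major) layout:
    # strides[-1] = 1, strides[i] = strides[i+1] * shape[i+1].
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides

# Matches the Rust example: shape [100, 200] → strides [200, 1].
```

Note these are element strides (counted in elements, not bytes), matching the `strides: vec![200, 1]` in the Rust `DataObjectDescriptor` example earlier.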
The encoding pipeline
Each object passes through a three-stage pipeline before it is stored. You control each stage via descriptor keys:
raw bytes → encoding → filter → compression → wire payload
Encoding transforms the data representation:
| Value | What it does | Use case |
|---|---|---|
| `"none"` | Pass-through (default) | Exact values, integer data |
| `"simple_packing"` | Quantize floats to packed integers | Bounded-range scalar fields (GRIB-compatible) |
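The quantisation idea behind simple packing can be sketched in a few lines. This toy version (not the library's implementation — real GRIB-style packing also supports a decimal scale factor and negative binary scales, and `compute_packing_params` below handles parameter selection) restricts itself to `v ≈ reference + n · 2^E` with a non-negative `E`:

```python
def pack_simple(values, bits=16):
    # Choose the smallest non-negative binary scale E so that every
    # quantised integer n = round((v - ref) / 2**E) fits in `bits` bits.
    ref = min(values)
    max_int = (1 << bits) - 1
    E = 0
    while (max(values) - ref) / (2 ** E) > max_int:
        E += 1
    packed = [round((v - ref) / (2 ** E)) for v in values]
    return ref, E, packed

def unpack_simple(ref, E, packed):
    # Reconstruct: v ≈ ref + n * 2**E (lossy when E > 0).
    return [ref + n * (2 ** E) for n in packed]
```

With 16 bits and a small value range the round-trip is exact; with fewer bits the reconstruction error is bounded by half a quantisation step, `2**E / 2`.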
Filter rearranges bytes to improve compressibility:
| Value | What it does | Use case |
|---|---|---|
| `"none"` | Pass-through (default) | Most cases |
| `"shuffle"` | Byte-transpose by element width (requires `"shuffle_element_size"`) | Improves lz4/zstd ratio on typed data |
Compression reduces the payload size:
| Value | Random access | Type | Use case |
|---|---|---|---|
| `"none"` | yes | — | No compression |
| `"zstd"` | no | lossless | General-purpose, best ratio/speed tradeoff |
| `"lz4"` | no | lossless | Fastest decompression |
| `"szip"` | yes (RSI blocks) | lossless | Integer/packed data (CCSDS 121.0-B-3) |
| `"blosc2"` | yes (chunks) | lossless | Large tensors, multi-codec |
| `"zfp"` | yes (fixed-rate) | lossy | Floating-point arrays |
| `"sz3"` | no | lossy | Error-bounded scientific data |
Compression parameters are passed as extra descriptor keys. For example, zstd level:
desc = {
"type": "ntensor", "shape": [1000], "dtype": "float32",
"compression": "zstd", "zstd_level": 9,
}
For the full list of compressor parameters, see Compression.
Common pipeline combinations
# Lossless, fast decompression
desc = {"type": "ntensor", "shape": shape, "dtype": "float32",
"compression": "lz4"}
# Lossless, best ratio (shuffle_element_size must match dtype byte width)
desc = {"type": "ntensor", "shape": shape, "dtype": "float32",
"filter": "shuffle", "shuffle_element_size": 4, "compression": "zstd", "zstd_level": 12}
# Quantise a bounded-range float field to 16-bit packed ints, then compress
# (the same pipeline GRIB 2 uses for simple_packing + CCSDS).
# compute_packing_params expects a flat float64 array
values = data.astype(np.float64).ravel()
params = tensogram.compute_packing_params(values, bits_per_value=16, decimal_scale_factor=0)
desc = {"type": "ntensor", "shape": shape, "dtype": "float64",
"encoding": "simple_packing", "compression": "zstd", **params}
# Lossy float compression with error bound (zfp operates on float64)
desc = {"type": "ntensor", "shape": shape, "dtype": "float64",
"compression": "zfp", "zfp_mode": "fixed_accuracy", "zfp_tolerance": 0.01}
Invalid combinations: Some pipeline combinations are rejected at encode time — e.g. `zfp` + `shuffle` (ZFP operates on typed floats, not byte-shuffled data) or `simple_packing` + `sz3` (stacking two lossy stages). See Compression — Invalid Combinations.
Multiple objects per message
A single message can contain multiple tensors, each with its own descriptor:
spectrum = np.random.randn(256).astype(np.float64)
mask = np.array([1, 0, 1, 1, 0], dtype=np.uint8)
msg = tensogram.encode(
{"version": 2},
[
({"type": "ntensor", "shape": [256], "dtype": "float64", "compression": "zstd"}, spectrum),
({"type": "ntensor", "shape": [5], "dtype": "uint8"}, mask),
],
)
Pre-encoded data
If you already have compressed/packed payloads (e.g. from another system), use tensogram.encode_pre_encoded() with the same interface. The library skips the encoding pipeline and writes the bytes as-is:
msg = tensogram.encode_pre_encoded(meta, [(desc, pre_compressed_bytes)])
See Pre-Encoded Data API for details.
Decoding
Full decode
meta, objects = tensogram.decode(msg)
Returns a Message namedtuple with .metadata and .objects. Tuple unpacking works directly.
By default, decoded arrays are in the caller’s native byte order — the library handles byte-swapping automatically. Pass native_byte_order=False to receive the raw wire byte order instead:
meta, objects = tensogram.decode(msg, native_byte_order=False)
Metadata
meta is a Metadata object:
meta.version # int — always 2
meta.base # list[dict] — per-object metadata (one entry per object)
meta.extra # dict — message-level annotations (_extra_ in CBOR)
meta.reserved # dict — library internals (_reserved_ in CBOR, read-only)
meta["key"] # dict-style access (checks base entries, then extra)
To read metadata without decoding any payloads:
meta = tensogram.decode_metadata(msg)
To read metadata and descriptors (no payload decode):
meta, descriptors = tensogram.decode_descriptors(msg)
for desc in descriptors:
print(desc.shape, desc.dtype, desc.compression)
Selective decode
Decode a single object without touching the others — O(1) seek via the binary header’s offset table:
meta, desc, array = tensogram.decode_object(msg, index=2)
Decode a sub-range of elements from one object (for compressors that support random access):
# Elements 100-149 and 300-324 from object 0
parts = tensogram.decode_range(msg, object_index=0, ranges=[(100, 50), (300, 25)])
# parts is a list of numpy arrays, one per range
# Or join into a single contiguous array
joined = tensogram.decode_range(msg, object_index=0, ranges=[(100, 50), (300, 25)], join=True)
# joined is a single flat numpy array of shape (75,)
`decode_range` works with uncompressed data, `simple_packing`, `szip`, `blosc2`, and `zfp` fixed-rate mode. It returns an error for stream compressors (`zstd`, `lz4`, `sz3`) and for the `shuffle` filter. See Decoding Data for details.
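The `(offset, count)` pairs are in element units, not bytes. The selection and `join` semantics can be sketched over a plain Python list standing in for the decoded flat tensor (the real call seeks into the compressed payload instead of decoding everything):

```python
def decode_range_sketch(flat, ranges, join=False):
    """Select (offset, count) element ranges from a flat sequence."""
    parts = [flat[off:off + cnt] for off, cnt in ranges]
    if join:
        merged = []
        for p in parts:
            merged.extend(p)           # concatenate ranges in request order
        return merged
    return parts

flat = list(range(1000))               # pretend this is decoded object 0
parts = decode_range_sketch(flat, [(100, 50), (300, 25)])
assert [len(p) for p in parts] == [50, 25]
joined = decode_range_sketch(flat, [(100, 50), (300, 25)], join=True)
assert len(joined) == 75 and joined[0] == 100 and joined[-1] == 324
```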
Scanning and iteration
To find message boundaries in a buffer without decoding:
offsets = tensogram.scan(buf) # list of (offset, length) pairs
To iterate messages in a multi-message buffer:
for meta, objects in tensogram.iter_messages(buf):
print(meta.version, len(objects))
Hash verification
meta, objects = tensogram.decode(msg, verify_hash=True)
Raises RuntimeError if any object’s payload hash doesn’t match. If the message was encoded without a hash (hash=None), verification is silently skipped.
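The verify-then-raise contract can be emulated with the standard library: compute a digest over the payload at encode time, store it next to the bytes, and compare on decode. A sketch using `hashlib.blake2b` as a stand-in for xxh3 (which is not in the stdlib); function names are illustrative:

```python
import hashlib

def encode_with_hash(payload: bytes):
    return payload, hashlib.blake2b(payload, digest_size=8).digest()

def decode_with_verify(payload: bytes, stored_hash):
    if stored_hash is not None:        # hash=None at encode time: skip silently
        actual = hashlib.blake2b(payload, digest_size=8).digest()
        if actual != stored_hash:
            raise RuntimeError("payload hash mismatch")
    return payload

payload, digest = encode_with_hash(b"tensor bytes")
assert decode_with_verify(payload, digest) == payload
try:
    decode_with_verify(b"corrupted!", digest)
except RuntimeError:
    pass                               # corruption detected, as expected
else:
    raise AssertionError("corruption not detected")
```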
File API
Writing
with tensogram.TensogramFile.create("forecast.tgm") as f:
for step in range(24):
data = model.run(step)
desc = {"type": "ntensor", "shape": list(data.shape), "dtype": "float32",
"compression": "zstd"}
f.append({"version": 2, "base": [{"step": step}]}, [(desc, data)])
Each append encodes one message and writes it to the end of the file. Messages are independent and self-describing.
Reading
with tensogram.TensogramFile.open("forecast.tgm") as f:
print(len(f)) # message count
meta, objects = f[0] # index (supports negative indices)
subset = f[1:10:2] # slice → list[Message]
for meta, objects in f: # iterate all messages
for desc, array in objects:
print(desc.shape, array.dtype)
raw = f.read_message(0) # raw bytes for forwarding/caching
The first access triggers a streaming scan that records message offsets. After that, every read is an O(1) seek.
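The scan-once, seek-after pattern can be sketched with simple length-prefixed frames. The frame layout here is invented for illustration (the real wire format is richer), but the mechanics are the same: one linear pass builds an offset table, and every later read is a direct slice:

```python
import struct

def build_index(buf: bytes):
    """One linear pass records (offset, length) of every frame."""
    offsets, pos = [], 0
    while pos < len(buf):
        (length,) = struct.unpack_from("<I", buf, pos)  # 4-byte length prefix
        offsets.append((pos, 4 + length))
        pos += 4 + length
    return offsets

def read_frame(buf: bytes, index, i):
    off, ln = index[i]                 # O(1) lookup, no rescan
    return buf[off + 4 : off + ln]

frames = [b"alpha", b"bb", b"gamma-payload"]
buf = b"".join(struct.pack("<I", len(f)) + f for f in frames)
idx = build_index(buf)
assert len(idx) == 3
assert read_frame(buf, idx, 1) == b"bb"
assert read_frame(buf, idx, 2) == b"gamma-payload"
```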
Streaming encoder
For building a message one object at a time in memory:
enc = tensogram.StreamingEncoder({"version": 2}, hash="xxh3")
for desc, data in objects:
enc.write_object(desc, data)
msg = enc.finish() # returns complete message as bytes
For pre-encoded payloads, use enc.write_object_pre_encoded(desc, raw_bytes).
Async API
AsyncTensogramFile provides the same operations as TensogramFile but as
asyncio coroutines. A single handle supports truly concurrent operations
with no per-handle mutex; internal caches are thread-safe.
Opening and decoding
import asyncio
import tensogram
async def main():
f = await tensogram.AsyncTensogramFile.open("forecast.tgm")
meta, objects = await f.decode_message(0)
result = await f.file_decode_object(0, 0)
print(result["data"].shape)
asyncio.run(main())
For remote files with credentials:
f = await tensogram.AsyncTensogramFile.open_remote(
"s3://bucket/data.tgm", {"region": "eu-west-1"}
)
Concurrent decoding with asyncio.gather
Multiple decode calls run concurrently on a single handle:
results = await asyncio.gather(
f.file_decode_object(0, 0),
f.file_decode_object(1, 0),
f.file_decode_object(2, 0),
)
Batch decoding from many messages at once
When you need the same data from many messages, for example reading how a value at one grid point changes over 300 time steps, individual requests are slow because each one is a separate HTTP round-trip.
file_decode_range_batch collects the requested element ranges across
messages and fetches the underlying data in a batched HTTP call.
file_decode_object_batch does the same for full frames:
indices = list(range(300))
row, col, grid = 100, 200, 528
offset = row * grid + col
values = await f.file_decode_range_batch(indices, 0, [(offset, 1)], join=True)
frames = await f.file_decode_object_batch(indices, 0)
For even more speed, split the work into chunks and run them concurrently:
chunks = [indices[i::16] for i in range(16)]
batch_results = await asyncio.gather(
*[f.file_decode_range_batch(chunk, 0, [(offset, 1)], join=True)
for chunk in chunks]
)
The sync TensogramFile also has file_decode_range_batch and
file_decode_object_batch with the same signatures. Both batch methods
require a remote backend; calling them on a local file raises OSError.
Layout prefetching
Before running many concurrent decodes on a remote file, prefetch the internal layout metadata to avoid repeated discovery requests:
count = await f.message_count()
await f.prefetch_layouts(list(range(count)))
Context manager and iteration
async with await tensogram.AsyncTensogramFile.open("data.tgm") as f:
await f.message_count() # required before async for or len(f)
async for meta, objects in f:
print(objects[0][1].shape)
Async iteration works on remote files (sync iteration does not).
await f.message_count() must be called once before using async for
or len(f), to discover the message count without blocking the event loop.
Other methods
count = await f.message_count()
raw = await f.read_message(0)
all_raw = await f.messages()
print(f.is_remote(), f.source())
Note:
`len(f)` requires a prior `await f.message_count()` call. Without it, `len(f)` raises `RuntimeError`.
When to use async vs sync
| Scenario | Recommendation |
|---|---|
| Script, CLI, or notebook | TensogramFile (sync) |
| Inside an asyncio event loop | AsyncTensogramFile |
| xarray or zarr | Sync (those frameworks are synchronous) |
| Many concurrent remote reads | asyncio.gather on one AsyncTensogramFile |
| Same data from many messages | file_decode_range_batch or file_decode_object_batch |
Validation
Two functions check whether messages and files are well-formed without consuming the data. See also the CLI reference.
report = tensogram.validate(msg)
file_report = tensogram.validate_file("data.tgm")
Levels
| Level | Checks | hash_verified |
|---|---|---|
"quick" | Structure only: magic bytes, frame layout, lengths | always False |
"default" | + metadata (CBOR) + integrity (hash verification, decompression) | True only if hash succeeds and no errors |
"checksum" | Hash verification only, structural warnings suppressed | True only if hash succeeds and no errors |
"full" | + fidelity (full decode, decoded-size check, NaN/Inf scan) | True only if hash succeeds and no errors |
# Full validation with canonical CBOR key-order checking
report = tensogram.validate(msg, level="full", check_canonical=True)
Return values
validate() returns:
{
"issues": [
{
"code": "hash_mismatch", # stable snake_case string
"level": "integrity", # which validation level found it
"severity": "error", # "error" or "warning"
"description": "...", # human-readable message
"object_index": 0, # optional — which object
"byte_offset": 1234, # optional — position in buffer
}
],
"object_count": 1,
"hash_verified": False,
}
validate_file() returns file-level issues plus per-message reports:
{
"file_issues": [
{"byte_offset": 100, "length": 19, "description": "trailing bytes after last message"}
],
"messages": [
{"issues": [], "object_count": 1, "hash_verified": True}
],
}
Interpreting results
report = tensogram.validate(msg)
if not report["issues"]:
print(f"OK — {report['object_count']} objects, hash verified")
else:
for issue in report["issues"]:
print(f"[{issue['severity']}] {issue['code']}: {issue['description']}")
GRIB / NetCDF conversion
Three PyO3-bound helpers wrap tensogram-grib
and tensogram-netcdf.
They are always callable — when the Python wheel was built without
the corresponding Cargo feature, each raises RuntimeError with a
pointer to rebuild instructions.
You can probe availability at runtime:
import tensogram
if tensogram.__has_grib__:
msgs = tensogram.convert_grib("forecast.grib2")
if tensogram.__has_netcdf__:
msgs = tensogram.convert_netcdf("data.nc")
convert_grib(path, **options) -> list[bytes]
Convert a GRIB file (as many messages as it contains) to Tensogram
wire format. Returns one `bytes` object per output Tensogram message —
write them sequentially (or join them) to produce a `.tgm` file.
msgs = tensogram.convert_grib(
"forecast.grib2",
grouping="merge_all", # "merge_all" | "one_to_one"
preserve_all_keys=False, # lift every ecCodes namespace into base[i]["grib"]
encoding="simple_packing", # "none" | "simple_packing"
bits=16, # None -> defaults to 16; ignored for encoding="none"
filter="none", # "none" | "shuffle"
compression="szip", # "none" | "zstd" | "lz4" | "blosc2" | "szip"
compression_level=None, # applies to zstd / blosc2 (None = codec default)
threads=0, # 0 = sequential; honours TENSOGRAM_THREADS env var
hash="xxh3", # "xxh3" | None
# NaN / Inf handling — see docs/src/guide/nan-inf-handling.md
allow_nan=False, # False (default) rejects any NaN input
allow_inf=False, # False (default) rejects any ±Inf input
)
with open("forecast.tgm", "wb") as fh:
for msg in msgs:
fh.write(msg)
Pipeline defaults and edge cases:
- `bits=None` with `encoding="simple_packing"` defaults to 16 bits. `bits` outside `1..=64` silently falls back to `encoding="none"` and emits a warning to stderr. Validate your inputs before calling if fail-fast is important.
- Unknown `compression` / `encoding` names raise `ValueError` with the list of valid choices in the message.
- Unknown `grouping` / `split_by` / `hash` values raise `ValueError`.
- Missing input paths raise `FileNotFoundError`.
- Building the wheel without the `grib` / `netcdf` feature causes the corresponding function to raise `RuntimeError` at call time with rebuild instructions.
Requires libeccodes at the OS level and the wheel built with
--features grib (maturin develop --features grib). Official PyPI
wheels do not currently include the grib feature — see
Jupyter Notebook Walk-through.
convert_grib_buffer(buf, **options) -> list[bytes]
In-memory variant of convert_grib. Accepts any Python bytes-like
object (bytes, bytearray, memoryview, numpy.uint8[:]).
Useful when the GRIB bytes come from a byte-range HTTP fetch, a
cache, or any other in-memory source — no filesystem staging needed.
import requests
# Byte-range download of a single GRIB message from data.ecmwf.int.
resp = requests.get(
"https://data.ecmwf.int/forecasts/.../...grib2",
headers={"Range": "bytes=74573515-75234113"},
)
msgs = tensogram.convert_grib_buffer(
resp.content,
encoding="simple_packing",
bits=16,
compression="szip",
# See [NaN / Inf Handling](nan-inf-handling.md) for the
# `allow_nan` / `allow_inf` opt-in if your data contains
# non-finite values.
)
convert_grib and convert_grib_buffer produce bit-identical
decoded payloads for the same input. The encoded bytes may
differ — each call stamps a fresh timestamp and UUID into
_reserved_.
convert_netcdf(path, **options) -> list[bytes]
Convert a NetCDF-3 or NetCDF-4 file to Tensogram. Packed variables
(scale_factor / add_offset) are automatically unpacked to
float64.
msgs = tensogram.convert_netcdf(
"data.nc",
split_by="file", # "file" | "variable" | "record"
cf=False, # lift 16 CF attributes into base[i]["cf"]
encoding="none",
bits=None,
filter="none",
compression="zstd",
compression_level=3,
threads=0,
hash="xxh3",
# NaN / Inf handling — see docs/src/guide/nan-inf-handling.md
allow_nan=False, # False (default) rejects any NaN input
allow_inf=False, # False (default) rejects any ±Inf input
)
Note on NaN and `--encoding simple_packing`. Since 0.17 the importer hard-fails on NaN or Inf in a variable targeted for `simple_packing` (previous behaviour: stderr warning + fallback to `encoding="none"`). If your NetCDF has `_FillValue` / `missing_value` fields unpacked to NaN, either stick with the default `encoding="none"` or pre-process the values. See the NetCDF Importer error-handling reference for the full contract.
Requires libnetcdf + libhdf5 at the OS level and the wheel built
with --features netcdf.
Error Handling
| Exception | When |
|---|---|
FileNotFoundError | convert_grib(path) / convert_netcdf(path) called with a non-existent path (subclass of OSError). |
OSError | Other file I/O failures (permission denied, disk error, etc.). |
ValueError | Invalid parameters; unknown dtype; NaN in simple packing; unknown validation level; invalid grouping / split_by / hash; unknown codec / bit width in the conversion pipeline; empty/non-GRIB input buffer; split_by="record" on a NetCDF without an unlimited dimension. |
RuntimeError | Hash mismatch during decode(..., verify_hash=True); calling convert_grib / convert_grib_buffer / convert_netcdf on a wheel built without the feature; internal ecCodes / libnetcdf C-library failures that cannot be classified as caller-input errors. |
KeyError | Missing metadata key via meta["key"]. |
Supported dtypes
| Category | Types |
|---|---|
| Floating point | float16, bfloat16, float32, float64 |
| Complex | complex64, complex128 |
| Signed integer | int8, int16, int32, int64 |
| Unsigned integer | uint8, uint16, uint32, uint64 |
| Special | bitmask |
bfloat16 is returned as ml_dtypes.bfloat16 when ml_dtypes is installed; otherwise it falls back to np.uint16.
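If you receive the `np.uint16` fallback, each value holds the raw bfloat16 bit pattern. Since bfloat16 is simply a float32 with the low 16 mantissa bits dropped, you can widen it yourself with the stdlib; a sketch (when `ml_dtypes` is installed the library hands you real bfloat16 values and none of this is needed):

```python
import struct

def bf16_bits_to_float(bits: int) -> float:
    """bfloat16 is float32 with the low 16 mantissa bits dropped."""
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]

def float_to_bf16_bits(value: float) -> int:
    (u,) = struct.unpack("<I", struct.pack("<f", value))
    return u >> 16          # truncate (real encoders usually round)

assert bf16_bits_to_float(0x3F80) == 1.0    # bfloat16 pattern for 1.0
assert bf16_bits_to_float(0xC000) == -2.0   # bfloat16 pattern for -2.0
bits = float_to_bf16_bits(3.140625)         # exactly representable in bfloat16
assert bf16_bits_to_float(bits) == 3.140625
```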
See Data Types for byte widths and wire-format details.
Examples
See examples/python/ for complete working examples:
| Example | Topic |
|---|---|
01_encode_decode.py | Basic round-trip |
02_mars_metadata.py | Per-object metadata (ECMWF MARS vocabulary example) |
02b_generic_metadata.py | Per-object metadata using a generic application namespace |
03_simple_packing.py | Simple-packing encoding |
04_multi_object.py | Multi-object messages, selective decode |
05_file_api.py | Multi-message .tgm files |
06_hash_and_errors.py | Hash verification and error handling |
07_iterators.py | File iteration, indexing, slicing |
08_xarray_integration.py | Opening .tgm as xarray Datasets |
08_zarr_backend.py | Reading/writing through Zarr v3 |
09_dask_distributed.py | Dask distributed computing over 4-D tensors |
09_streaming_consumer.py | Streaming consumer pattern |
11_encode_pre_encoded.py | Pre-encoded data API |
12_convert_netcdf.py | NetCDF → Tensogram import via the Python API |
13_validate.py | Message and file validation |
15_async_operations.py | Async open, decode, and asyncio.gather |
17_convert_grib.py | GRIB → Tensogram import (file + in-memory buffer) |
For narrative walk-throughs with plots and explanations, see also
examples/jupyter/*.ipynb — five journey notebooks covering
quickstart/MARS, encoding pipeline fidelity, GRIB conversion, NetCDF
conversion with xarray, and validation with multi-threaded encoding.
C++ API
Tensogram provides a header-only C++17 wrapper at cpp/include/tensogram.hpp. It delegates all work to the C FFI and adds RAII handle management, typed exceptions, and idiomatic C++ patterns.
Requirements
- C++17 compiler (GCC 7+, Clang 5+, MSVC 19.14+)
- Rust static library built via `cargo build --release`
- CMake 3.16+ (recommended)
Build
cargo build --release
cmake -S cpp -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
Quick Start
#include <tensogram.hpp>
// Encode
std::string meta_json = R"({"version": 2, "descriptors": [...]})";
std::vector<float> data(100 * 200, 0.0f);
auto encoded = tensogram::encode(
meta_json,
{{reinterpret_cast<const uint8_t*>(data.data()), data.size() * sizeof(float)}});
// Decode
auto msg = tensogram::decode(encoded.data(), encoded.size());
auto obj = msg.object(0);
const float* values = obj.data_as<float>();
RAII Classes
| Class | Wraps | Cleanup |
|---|---|---|
message | tgm_message_t | tgm_message_free |
metadata | tgm_metadata_t | tgm_metadata_free |
file | tgm_file_t | tgm_file_close |
buffer_iterator | tgm_buffer_iter_t | tgm_buffer_iter_free |
file_iterator | tgm_file_iter_t | tgm_file_iter_free |
object_iterator | tgm_object_iter_t | tgm_object_iter_free |
streaming_encoder | tgm_streaming_encoder_t | tgm_streaming_encoder_free |
All classes are move-only (copy deleted). Handles are released automatically when the object goes out of scope.
Error Handling
C error codes are mapped to a typed exception hierarchy:
try {
auto msg = tensogram::decode(buf, len);
} catch (const tensogram::framing_error& e) {
// Invalid message framing
} catch (const tensogram::hash_mismatch_error& e) {
// Payload integrity check failed
} catch (const tensogram::error& e) {
// Any Tensogram error (base class)
std::cerr << e.what() << " (code=" << e.code() << ")\n";
}
Validation
Two free functions validate messages and files, returning JSON strings:
// Validate a single message buffer (default level)
auto report = tensogram::validate(buf, len);
// Full validation with canonical CBOR check
auto full_report = tensogram::validate(buf, len, "full", /*check_canonical=*/true);
// Validate a .tgm file
auto file_report = tensogram::validate_file("data.tgm");
auto file_full = tensogram::validate_file("data.tgm", "full");
Validation levels: "quick", "default", "checksum", "full".
The returned JSON contains issues, object_count, and hash_verified for single messages, or file_issues and messages for files. Parse with your preferred JSON library.
An invalid level string or a missing file throws tensogram::invalid_arg_error or tensogram::io_error respectively. Validation issues (corrupted data, hash mismatches) are reported in the JSON — they do not throw.
Iterators
See Iterators for buffer, file, and object iterator usage.
Examples
See examples/cpp/ for complete working examples covering encode/decode, metadata, file API, simple packing, and iterators.
TypeScript API
Tensogram ships @ecmwf/tensogram, a TypeScript package that wraps the
WebAssembly build with typed, idiomatic helpers. Use it in any modern
browser or Node ≥ 20.
Status: Scopes B and C are complete. Typed encode / decode / scan, dtype dispatch, metadata helpers, progressive streaming decode, the `TensogramFile` file / URL helper, `validate`, and `encodePreEncoded` are all available. Remaining follow-ups (first-class `float16` / `bfloat16` / `complex*` types, npm publish pipeline) are tracked in `plans/TYPESCRIPT_WRAPPER.md`.
Installation
The package is not yet published to npm. Build it locally:
# First, build the WebAssembly blob from the Rust source
cd typescript
npm install
npm run build:wasm # runs wasm-pack build -t web -d typescript/wasm
npm run build # runs wasm-pack + tsc
Or use the top-level Makefile:
make ts-build # build WASM + tsc
make ts-test # vitest
make ts-typecheck # strict tsc --noEmit on src + tests
Quick start
import {
init, encode, decode,
type DataObjectDescriptor,
type GlobalMetadata,
} from '@ecmwf/tensogram';
// One-time WASM initialisation (idempotent)
await init();
// ── Encode ────────────────────────────────────────────────────────────
const temps = new Float32Array(100 * 200);
for (let i = 0; i < temps.length; i++) temps[i] = 273.15 + i / 100;
const meta: GlobalMetadata = { version: 2 };
const descriptor: DataObjectDescriptor = {
type: 'ntensor',
ndim: 2,
shape: [100, 200],
strides: [200, 1],
dtype: 'float32',
byte_order: 'little',
encoding: 'none',
filter: 'none',
compression: 'none',
};
const msg: Uint8Array = encode(meta, [{ descriptor, data: temps }]);
// ── Decode ────────────────────────────────────────────────────────────
const { metadata, objects } = decode(msg);
const arr = objects[0].data(); // Float32Array (inferred from dtype)
console.log(arr.length); // 20000
API surface
init(opts?)
Loads and instantiates the WASM blob. Must be awaited before any other function is called. Safe to call multiple times — subsequent calls reuse the same instance.
await init(); // defaults
await init({ wasmInput: new URL('...', import.meta.url) }); // custom location
encode(metadata, objects, opts?)
| Parameter | Type | Description |
|---|---|---|
metadata | GlobalMetadata | Wire-format metadata; version: 2 is required |
objects | Array<{ descriptor, data }> | Each data is a TypedArray or Uint8Array |
opts.hash | 'xxh3' | false | Hash algorithm. Default 'xxh3'. Pass false to disable. |
Returns: Uint8Array containing the complete wire-format message.
decode(buf, opts?)
| Parameter | Type | Description |
|---|---|---|
buf | Uint8Array | Raw message bytes |
opts.verifyHash | boolean | Default false. If true, throws HashMismatchError on corruption. |
Returns: { metadata: GlobalMetadata, objects: DecodedObject[], close() }.
decodeMetadata(buf)
Returns only the metadata; does not touch any payload bytes.
decodeObject(buf, index, opts?)
O(1) seek to object index, decoding only that object.
scan(buf)
Returns Array<{ offset: number; length: number }> for each
Tensogram message found in a (potentially multi-message) buffer.
Garbage between messages is silently skipped.
DecodedObject / DecodedFrame
interface DecodedObject {
readonly descriptor: DataObjectDescriptor;
/** Copy into the JS heap. Safe across WASM memory growth. */
data(): TypedArray;
/** Zero-copy view. Invalidated if WASM memory grows. */
dataView(): TypedArray;
readonly byteLength: number;
}
interface DecodedFrame extends /* structurally */ DecodedObject {
/** The matching `base[i]` entry from the containing message. */
readonly baseEntry: BaseEntry | null;
close(): void;
}
The returned array type is picked from descriptor.dtype:
dtype | Returned TypedArray |
|---|---|
float32 | Float32Array |
float64 | Float64Array |
int8 | Int8Array |
int16 | Int16Array |
int32 | Int32Array |
int64 | BigInt64Array |
uint8 | Uint8Array |
uint16 | Uint16Array |
uint32 | Uint32Array |
uint64 | BigUint64Array |
float16 / bfloat16 | Uint16Array (no native half-precision in JS) |
complex64 | Float32Array (interleaved real, imag) |
complex128 | Float64Array (interleaved real, imag) |
bitmask | Uint8Array (packed bits) |
getMetaKey(meta, path)
Dot-path lookup matching the Rust / Python / CLI first-match-across-base
semantics: searches base[0], base[1], …, skipping the _reserved_
key in each, then falls back to _extra_.
getMetaKey(meta, 'mars.param') // 'base[0].mars.param' first match
getMetaKey(meta, '_extra_.source') // explicit _extra_ prefix
Returns undefined if the key is missing (never throws).
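The lookup order can be sketched in a few lines of plain TypeScript. This is an illustrative re-implementation, not the package source:

```typescript
type Meta = { base?: Record<string, unknown>[]; _extra_?: Record<string, unknown> };

// Walk a dot path like "mars.param" through nested records.
function walk(obj: unknown, path: string[]): unknown {
  return path.reduce<unknown>(
    (cur, key) =>
      cur && typeof cur === "object" ? (cur as Record<string, unknown>)[key] : undefined,
    obj,
  );
}

function getMetaKeySketch(meta: Meta, dotPath: string): unknown {
  const path = dotPath.split(".");
  if (path[0] === "_extra_") return walk(meta._extra_, path.slice(1));
  for (const entry of meta.base ?? []) {
    const { _reserved_, ...rest } = entry;      // skip _reserved_ in each entry
    const hit = walk(rest, path);
    if (hit !== undefined) return hit;          // first match across base[i]
  }
  return walk(meta._extra_, path);              // then fall back to _extra_
}

const meta: Meta = {
  base: [{ _reserved_: { internal: 1 } }, { mars: { param: "2t" } }],
  _extra_: { source: "demo" },
};
console.assert(getMetaKeySketch(meta, "mars.param") === "2t");
console.assert(getMetaKeySketch(meta, "_extra_.source") === "demo");
console.assert(getMetaKeySketch(meta, "missing.key") === undefined);
```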
computeCommon(meta)
Mirror of tensogram::compute_common. Returns a
Record<string, CborValue> of keys that are present with identical
values in every entry of meta.base. Useful for display and merge
operations.
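A sketch of the intersection semantics — keep only keys present with an identical value in every `base` entry. Illustrative only (it uses `JSON.stringify` for deep equality, which is sensitive to key order; the real implementation compares CBOR values):

```typescript
type Entry = Record<string, unknown>;

function computeCommonSketch(base: Entry[]): Entry {
  if (base.length === 0) return {};
  // Crude deep equality via serialisation (sufficient for a sketch).
  const same = (a: unknown, b: unknown) => JSON.stringify(a) === JSON.stringify(b);
  const common: Entry = {};
  for (const [key, value] of Object.entries(base[0])) {
    if (base.every((entry) => key in entry && same(entry[key], value))) {
      common[key] = value;   // present with an identical value everywhere
    }
  }
  return common;
}

const base = [
  { stream: "oper", step: 0, levtype: "sfc" },
  { stream: "oper", step: 6, levtype: "sfc" },
];
const common = computeCommonSketch(base);
console.assert(common.stream === "oper" && common.levtype === "sfc");
console.assert(!("step" in common));       // differs between entries -> dropped
```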
Error classes
All errors thrown from this package are instances of the abstract
TensogramError class. Eight concrete subclasses match the Rust
TensogramError variants plus the TS-layer InvalidArgumentError
and StreamingLimitError:
import {
TensogramError,
FramingError,
MetadataError,
EncodingError,
CompressionError,
ObjectError,
IoError,
RemoteError,
HashMismatchError,
InvalidArgumentError,
StreamingLimitError,
} from '@ecmwf/tensogram';
try {
decode(corruptBuffer);
} catch (err) {
if (err instanceof FramingError) {
console.error('bad wire format:', err.message);
} else if (err instanceof HashMismatchError) {
console.error('integrity failure:', err.expected, err.actual);
} else {
throw err;
}
}
Memory model
- Safe-copy by default. `object.data()` / `frame.data()` always allocate a new `TypedArray` on the JS heap. It remains valid even after the underlying `DecodedMessage` / `DecodedFrame` is freed or WASM memory grows.
- Zero-copy opt-in. `object.dataView()` / `frame.dataView()` return a view directly into WASM linear memory. It is invalidated the next time any WASM call grows linear memory — which can happen on the next `encode()` / `decode()`. Read the view immediately or copy it.
- Explicit cleanup. `DecodedMessage`, `DecodedFrame`, and `TensogramFile` all expose `.close()` to release WASM-side memory. A `FinalizationRegistry` also calls `.free()` on the underlying WASM handle when the wrapper is garbage-collected, but explicit `.close()` is strongly recommended for deterministic cleanup.
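The copy-vs-view distinction mirrors plain typed-array semantics: a sliced copy survives changes to the source buffer, a subarray view does not. A plain-JS illustration, no WASM needed:

```typescript
const backing = new Float32Array([1, 2, 3, 4]);

const copy = backing.slice(0, 2);     // like data(): a new allocation
const view = backing.subarray(0, 2);  // like dataView(): aliases the buffer

backing[0] = 99;                      // "memory changed underneath us"
console.assert(copy[0] === 1);        // the copy is unaffected
console.assert(view[0] === 99);       // the view observes the change
```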
Streaming decode
Use decodeStream(readable, opts?) to progressively decode a
ReadableStream<Uint8Array>. Works against any stream source —
fetch().body, a Node Readable.toWeb(), a Blob.stream(), or a
hand-rolled ReadableStream.
import { decodeStream } from '@ecmwf/tensogram';
const res = await fetch('/data.tgm');
for await (const frame of decodeStream(res.body!)) {
render(frame.descriptor.shape, frame.data());
frame.close();
}
Options:
| Option | Type | Description |
|---|---|---|
signal | AbortSignal | Cancels the iteration. The underlying reader is cancelled and the decoder is freed cleanly. |
maxBufferBytes | number | Max size of the internal staging buffer. Default: 256 MiB. Exceeding this throws StreamingLimitError. |
onError | (err: StreamDecodeError) => void | Called whenever a corrupt message is skipped. The iterator does not throw on skips — it keeps going. |
Key behaviours:
- Chunk-boundary tolerant. A message can be split across any number of chunks. The decoder accumulates until a complete message is seen, then emits every object as a separate frame.
- Corruption resilient. A single bad message is skipped; the iterator keeps going with subsequent messages. Pass `onError` to observe the skips.
- Early break is safe. Breaking out of the `for await` loop runs the generator’s `finally` block, which releases the stream reader and frees the decoder.
- AbortSignal cancels cleanly. Firing the signal cancels the underlying reader; the generator throws whatever error the signal carries.
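Chunk-boundary tolerance is the interesting mechanic: bytes accumulate in a staging buffer until one whole message is available, then it is emitted and the tail kept for the next chunk. A toy accumulator over 4-byte length-prefixed frames (the real decoder parses the Tensogram preamble instead; this frame layout is invented for illustration):

```typescript
class FrameAccumulator {
  private buf = new Uint8Array(0);

  push(chunk: Uint8Array): Uint8Array[] {
    // Append the chunk to the staging buffer.
    const next = new Uint8Array(this.buf.length + chunk.length);
    next.set(this.buf);
    next.set(chunk, this.buf.length);
    this.buf = next;

    const frames: Uint8Array[] = [];
    // Emit every complete [u32 length][payload] frame; keep the tail.
    while (this.buf.length >= 4) {
      const len = new DataView(this.buf.buffer, this.buf.byteOffset).getUint32(0, true);
      if (this.buf.length < 4 + len) break;   // incomplete — wait for more bytes
      frames.push(this.buf.slice(4, 4 + len));
      this.buf = this.buf.slice(4 + len);
    }
    return frames;
  }
}

// A 5-byte frame split across three chunks still decodes once complete.
const acc = new FrameAccumulator();
const msg = new Uint8Array([5, 0, 0, 0, 10, 20, 30, 40, 50]);
console.assert(acc.push(msg.slice(0, 3)).length === 0);
console.assert(acc.push(msg.slice(3, 6)).length === 0);
const out = acc.push(msg.slice(6));
console.assert(out.length === 1 && out[0].length === 5 && out[0][4] === 50);
```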
File API
TensogramFile gives you random-access reads over a .tgm file,
whether it lives on the local file system, behind an HTTPS URL, or
already in memory.
import { TensogramFile } from '@ecmwf/tensogram';
// Node: from the local file system
const file = await TensogramFile.open('/data/input.tgm');
// Browser or Node: over HTTPS
const file = await TensogramFile.fromUrl('https://example.com/input.tgm');
// Any runtime: from pre-loaded bytes
const file = TensogramFile.fromBytes(uint8ArrayFromSomewhere);
All three factories produce an identical object:
interface TensogramFile extends AsyncIterable<DecodedMessage> {
readonly messageCount: number;
readonly byteLength: number;
readonly source: 'local' | 'remote' | 'buffer';
message(index: number, opts?: DecodeOptions): Promise<DecodedMessage>;
messageMetadata(index: number): Promise<GlobalMetadata>;
rawMessage(index: number): Uint8Array;
[Symbol.asyncIterator](): AsyncIterator<DecodedMessage>;
close(): void;
}
Usage:
const file = await TensogramFile.open('/data/input.tgm');
try {
console.log(`${file.messageCount} messages, ${file.byteLength} bytes`);
// Random access
const first = await file.message(0);
console.log(first.objects[0].descriptor.shape);
first.close();
// Async iteration
for await (const msg of file) {
// ...
msg.close();
}
} finally {
file.close();
}
TensogramFile.open(path, opts?) (Node only)
Loads the file via node:fs/promises. The node:fs/promises import
is dynamic so browser bundlers can tree-shake this code path.
| Option | Type | Description |
|---|---|---|
signal | AbortSignal | Cancels the initial read. |
TensogramFile.fromUrl(url, opts?) (any fetch-capable runtime)
Downloads the file over HTTPS using the ambient globalThis.fetch.
| Option | Type | Description |
|---|---|---|
fetch | typeof fetch | Override the fetch implementation (useful for tests and for browsers with a polyfill). |
headers | HeadersInit | Extra request headers (auth, etc.). |
signal | AbortSignal | Cancels the download. |
TensogramFile.fromBytes(bytes)
Wraps an already-loaded Uint8Array. The buffer is defensively
copied, so later mutation of the caller’s buffer is invisible to
the TensogramFile.
Range-based lazy access
Since Scope C, TensogramFile.fromUrl automatically probes the server
for HTTP Range support. When the HEAD response advertises
Accept-Ranges: bytes and a finite Content-Length, the file
switches to a lazy backend:
- The initial open issues a small `HEAD` plus one 24-byte Range read per message preamble to build the boundary index. No payload data is downloaded.
- `rawMessage(i)` / `message(i)` fetch just the requested message’s bytes via a `Range: bytes=offset-(offset+length-1)` GET.
- A small LRU caches recently-fetched message bytes so repeat reads are free.
When the server omits Accept-Ranges, returns non-200 on HEAD, or
the file uses streaming-mode messages (total_length=0 — the writer
did not know the final length up front), the open falls back to a
single eager GET. Behaviour is indistinguishable to callers except in
memory use and timing.
Browser callers using fromUrl directly need CORS to expose the
Accept-Ranges, Content-Range, and Content-Length headers.
Append (Node local file system)
TensogramFile#append(meta, objects, opts?) encodes the new message
in-memory, appends it to the on-disk file, refreshes the position
index, and makes the new message reachable via message(i) on the
same handle. Only supported when the file was opened via
TensogramFile.open(path) — fromBytes- and fromUrl-backed files
throw InvalidArgumentError, matching the contract in the other
language bindings.
const file = await TensogramFile.open('/data/forecast.tgm');
try {
await file.append({ version: 2 }, [{ descriptor, data }]);
console.log(`now has ${file.messageCount} messages`);
} finally {
file.close();
}
Scope-C API additions
Scope C brought the TypeScript wrapper to full API parity with Rust / Python / FFI / C++. The surface additions are:
| Function / class | What it does |
|---|---|
| decodeRange(buf, objIndex, ranges, opts?) | Partial sub-tensor decode. ranges is an array of [offset, count] pairs in element units; each returned parts[i] is a dtype-typed view. Option join: true concatenates every range into a single view. |
| computeHash(bytes, algo?) | Standalone xxh3 hash — matches the digest stamped by encode() on the same bytes. |
| simplePackingComputeParams(values, bits, decScale?) | GRIB-style simple-packing parameter computation. Return shape uses snake-case keys so the result spreads directly into a descriptor. |
| validate(buf, opts?) | Report-only validation (never throws on bad input). Modes: quick, default, checksum, full. |
| validateBuffer(buf, opts?) | Multi-message buffer: reports file-level gaps / trailing garbage plus per-message reports. |
| validateFile(path, opts?) | Node-only helper: reads the file via node:fs/promises then delegates to validateBuffer. |
| encodePreEncoded(meta, objects, opts?) | Wrap already-encoded bytes verbatim into a wire-format message. The library still validates descriptor structure and stamps a fresh hash. |
| StreamingEncoder | Frame-at-a-time construction. Two modes: buffered (default, finish() returns the complete Uint8Array) or streaming via an opts.onBytes callback (bytes flow through the callback as they’re produced; finish() returns an empty Uint8Array). |
| TensogramFile#append | Append a new message to a file opened via TensogramFile.open(path). Node-only. |
StreamingEncoder in streaming mode (no full-message buffering)
For browser uploads, WebSocket pushes, or any sink that needs bytes as
soon as they are produced, pass an onBytes callback to the
StreamingEncoder constructor:
const enc = new StreamingEncoder({ version: 2 }, {
onBytes: (chunk) => uploadSocket.send(chunk), // e.g. WebSocket.send
});
enc.writeObject(descriptor, new Float32Array([1, 2, 3]));
enc.finish(); // flushes footer; returns empty Uint8Array in streaming mode
enc.close();
Semantics:
- The callback is invoked during construction (preamble + header metadata frame), during each writeObject / writeObjectPreEncoded (one data-object frame’s bytes, potentially across multiple invocations), and during finish() (footer frames + postamble).
- Concatenating every chunk the callback sees (in order) yields a message byte-for-byte identical to what buffered mode would return. Tested via round-trip with decode().
- The callback must be synchronous — Promise return values are silently discarded because the Rust/WASM writer contract is synchronous. Buffer internally first if you need async work.
- Each chunk is JS-owned and fresh per invocation. Copy (new Uint8Array(chunk) or chunk.slice()) if you need to keep it past the next writeObject — the underlying ArrayBuffer is invalidated when WASM memory grows.
- If the callback throws, the exception surfaces as an IoError on the next writeObject / finish. The encoder state is undefined after an error — call close() and start over.
- enc.streaming (getter) reports whether an onBytes sink was supplied — useful for code that needs to branch on mode.
Parity note: the Rust core StreamingEncoder<W: Write> has always
supported arbitrary sinks; the WASM/TS surface now exposes this
capability to JS code. Python / FFI / C++ bindings remain
buffered-only; extending them would follow the same JsCallbackWriter
pattern with a language-specific sink abstraction and is tracked in
plans/TYPESCRIPT_WRAPPER.md.
First-class half-precision and complex dtypes
Scope C also upgraded the dtype dispatch in typedArrayFor.
obj.data() now returns a first-class view for dtypes JS does not
have a native TypedArray for:
| Dtype | data() return type |
|---|---|
| float16 | Float16Array (native when available) or Float16Polyfill (TC39-accurate) |
| bfloat16 | Bfloat16Array — 1-8-7 layout, round-to-nearest-even narrowing |
| complex64 / complex128 | ComplexArray — .real(i), .imag(i), .get(i) → {re, im}, iteration |
All three classes expose .bits / .data for zero-copy access to the
underlying raw storage if you need it.
const m = decode(buf);
const f16 = m.objects[0].data(); // Float16Array or polyfill
const asFloat32 = f16.toFloat32Array(); // widened copy
const bits = f16.bits; // raw binary16
const cplx = m.objects[1].data() as ComplexArray;
for (let i = 0; i < cplx.length; i++) {
console.log(cplx.real(i), cplx.imag(i));
}
The polyfill is used automatically when the host runtime does not
ship globalThis.Float16Array. hasNativeFloat16Array() and
getFloat16ArrayCtor() expose the detection machinery for callers
that want direct control.
Breaking change from Scope B: before Scope C, obj.data() on float16 / bfloat16 returned a raw Uint16Array of bits, and complex dtypes returned an interleaved Float32Array / Float64Array. Consumers that relied on that shape can reach the same bytes via .bits (for f16/bf16) or .data (for complex).
The low-level bit-conversion helpers (halfBitsToFloat,
floatToHalfBits, bfloat16BitsToFloat, floatToBfloat16Bits) and
the isComplexDtype type-guard are internal and are not re-exported
from @ecmwf/tensogram. Callers that need bit-level manipulation
should grab the raw storage from a view’s .bits / .data accessor
and do the conversion themselves, or import directly from
@ecmwf/tensogram/float16, …/bfloat16, …/complex with the
understanding that these module paths are not part of the stable API.
Examples
See examples/typescript/ in the repository for runnable scripts:
- 01_encode_decode.ts — basic round-trip
- 02_mars_metadata.ts — per-object metadata using the MARS vocabulary
- 02b_generic_metadata.ts — per-object metadata using a generic application namespace
- 03_multi_object.ts — multiple dtypes in one message
- 04_decode_range.ts — partial sub-tensor decode
- 05_streaming_fetch.ts — progressive decode over a ReadableStream
- 06_file_api.ts — TensogramFile over Node fs, fetch, and in-memory bytes
- 07_hash_and_errors.ts — hash verification and typed errors
- 08_validate.ts — validate(buf) + validateFile(path)
- 11_encode_pre_encoded.ts — wrap already-encoded bytes
- 12_streaming_encoder.ts — frame-at-a-time encoder with preceders
- 13_range_access.ts — lazy TensogramFile.fromUrl over HTTP Range
- 14_streaming_callback.ts — StreamingEncoder with onBytes callback sink
Run them with:
cd examples/typescript
npm install
npx tsx 01_encode_decode.ts # or any other file
Design notes
See plans/TYPESCRIPT_WRAPPER.md for the full design document covering
architecture, phases, test strategy, memory model, and open follow-ups.
Cross-language parity
This TypeScript package decodes the same golden .tgm files used
by the Rust, Python, and C++ test suites. The committed files at
rust/tensogram/tests/golden/*.tgm are decoded by each language’s
test runner; any drift in wire-format semantics fails all four suites.
Specifically, typescript/tests/golden.test.ts decodes:
- simple_f32.tgm — single-object Float32 round-trip
- multi_object.tgm — mixed-dtype message (f32 / i64 / u8)
- mars_metadata.tgm — MARS keys under base[0].mars
- multi_message.tgm — two concatenated messages (via scan())
- hash_xxh3.tgm — verifyHash success + tamper detection
typescript/tests/property.test.ts and the Scope-C dtype suites add
fast-check property tests pinning:
- mapTensogramError never throws for any finite-string input and always returns a TensogramError subclass;
- encode → decode is bit-exact for random Float32 shapes across random application metadata;
- decode on random byte input either succeeds with a structurally valid message or throws a typed TensogramError — never panics;
- float32 → float16 → float32 round-trip stays within half-precision ulp for any random value in a reasonable magnitude band;
- float32 → bfloat16 → float32 round-trip stays within bfloat16 ulp;
- complex64 encode → decode preserves real(i) / imag(i) byte-for-byte across random shapes and values.
The CI typescript job rebuilds and runs every TS test on every PR.
Tensoscope
Tensoscope is an interactive web viewer for .tgm files. It runs entirely in the
browser — no server-side component — by decoding data via the @ecmwf/tensogram
WebAssembly package.
Quick start
Build the WASM package first, then start the dev server:
cd typescript && make ts-build
cd tensoscope && npm install && npm run dev
Open http://localhost:5173 in your browser, then drag-and-drop a .tgm file onto
the page or paste a URL into the file open dialog.
Loading a file
Two modes are supported:
- Local file — drag the .tgm file onto the drop zone, or click Open file.
- Remote URL — paste an HTTP/HTTPS URL. The file is fetched in full before scanning. (HTTP Range support for lazy loading is planned.)
Once loaded, Tensoscope scans all messages and builds a field index without decoding any payloads.
Field browser
The left sidebar lists every decodable field in the file. Each entry shows:
- Variable name (resolved from mars.param, name, or param metadata keys)
- Shape and dtype
Click a field to decode it and render it on the map.
Map view
Fields with two spatial dimensions (latitude × longitude) are rendered as a coloured overlay on an interactive map. Regridding from the unstructured source grid onto the display pixel grid runs in a web worker so the UI stays responsive while large arrays are processed.
Projections
Switch between flat (Mercator, powered by MapLibre GL JS) and globe (3D sphere, powered by CesiumJS with OpenStreetMap base tiles) using the projection picker in the bottom-left of the map. Camera position is preserved when switching between the two renderers.
Render modes
A Heatmap / Contours toggle in the top-left of the map switches between two rendering styles:
- Heatmap — smooth continuous gradient from the active colour scale. Pixel colours are interpolated linearly across the data range.
- Contours — filled colour bands (like matplotlib.contourf). The data range is divided into N discrete bands, where N is the number of colour steps in the active palette (default 10 for continuous palettes; the stop count for custom palettes). Each band is rendered with a single solid colour.
Colour scale
The colour bar at the bottom of the map shows the current field range. Use the colour scale controls to:
- Change the colour map (perceptually uniform maps from d3-scale-chromatic)
- Lock or reset the min/max range
Animation
For files with a time or step dimension, the step slider appears below the map. Use play/pause to animate through steps at a fixed frame rate.
Docker deployment
cd tensoscope
make build # build the container image
make run # serve at http://localhost:8000
BASE_PATH=/scope make run # serve under a subpath
The image uses nginx and accepts a BASE_PATH environment variable for subpath
deployments behind a reverse proxy.
Known limitations
- Only lat/lon grids are currently regridded; polar stereographic and other projections are not yet handled.
- 3D fields (pressure levels) cannot yet be sliced via the level selector (the UI component exists but is not yet wired up).
- HTTP Range-based lazy loading is not yet implemented; the full file is fetched before any field can be displayed.
xarray Integration
The tensogram-xarray package provides a read-only xarray backend engine
for .tgm files. Once installed, you can open tensogram data with:
import xarray as xr
ds = xr.open_dataset("data.tgm", engine="tensogram")
This chapter explains the conversion philosophy, the mapping rules, and walks through progressively complex examples so you know exactly what to expect – and what to provide – when loading tensogram data into xarray.
Philosophy: Why Mapping is Needed
Tensogram and xarray have fundamentally different data models:
| Concept | Tensogram | xarray |
|---|---|---|
| Dimensions | Unnamed, positional (shape = [512, 512]) | Named ("x", "y", "latitude", "time") |
| Coordinates | Not built-in; application metadata | Arrays of values labelling each dimension |
| Variables | Data objects, indexed by position | Named DataArrays inside a Dataset |
| Attributes | CBOR maps at message and per-object level | Key-value dicts on Dataset and DataArray |
Tensogram is vocabulary-agnostic by design. The library never interprets
metadata keys – it does not know what "mars.param", "bids.subject", or
"product.name" means. xarray, on the other hand, requires named dimensions
and coordinate arrays to enable its powerful label-based indexing and
alignment.
The tensogram-xarray backend bridges this gap. It applies a set of rules
to translate tensogram structure into xarray structure, and lets you override
those rules when the defaults are not enough.
flowchart LR
A["Tensogram Message"] --> B["tensogram-xarray"]
B --> C["xr.Dataset"]
D["User Mapping<br/>(optional)"] -.-> B
E["Coordinate<br/>Auto-Detection"] -.-> B
The Mapping Pipeline
When you call xr.open_dataset("file.tgm", engine="tensogram"):
- Read metadata – only the CBOR metadata is parsed (no payload decode).
- Detect coordinates – data objects whose name or param matches a known coordinate name (latitude, longitude, time, …) become coordinate arrays.
- Name dimensions – if you provided dim_names, those are used. Otherwise, axes matching a detected coordinate use that coordinate’s name; remaining axes become dim_0, dim_1, …
- Name variables – if you provided variable_key, the value at that metadata path becomes the variable name. Otherwise object_0, object_1, …
- Wrap data lazily – each tensor is backed by a BackendArray that decodes on demand. No payload bytes are read until you access .values.
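The dimension-naming step can be pictured in a few lines. This is an illustrative simplification — the helper name resolve_dims and the coord_sizes mapping are hypothetical, not the backend's API — and it ignores the ambiguity of two detected coordinates with the same length:

```python
def resolve_dims(shape, coord_sizes, dim_names=None):
    """Sketch of the dimension-naming rule.

    coord_sizes maps a detected coordinate name to its length,
    e.g. {"latitude": 5, "longitude": 8}. Hypothetical helper,
    not the tensogram-xarray implementation.
    """
    if dim_names is not None:
        if len(dim_names) != len(shape):
            raise ValueError(f"expected {len(shape)} dim names, got {len(dim_names)}")
        return list(dim_names)
    # match each axis length against the detected coordinate lengths
    by_size = {size: name for name, size in coord_sizes.items()}
    return [by_size.get(size, f"dim_{i}") for i, size in enumerate(shape)]

assert resolve_dims([5, 8], {"latitude": 5, "longitude": 8}) == ["latitude", "longitude"]
assert resolve_dims([6, 10], {}) == ["dim_0", "dim_1"]
```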
Example 1: Simplest Case – Single Object, No Metadata
Creating the file:
import numpy as np
import tensogram
data = np.arange(60, dtype=np.float32).reshape(6, 10)
meta = {"version": 2}
desc = {"type": "ntensor", "shape": [6, 10], "dtype": "float32",
"byte_order": "little", "encoding": "none",
"filter": "none", "compression": "none"}
with tensogram.TensogramFile.create("simple.tgm") as f:
f.append(meta, [(desc, data)])
Opening in xarray:
>>> import xarray as xr
>>> ds = xr.open_dataset("simple.tgm", engine="tensogram")
>>> ds
<xarray.Dataset>
Dimensions: (dim_0: 6, dim_1: 10)
Dimensions without coordinates: dim_0, dim_1
Data variables:
object_0 (dim_0, dim_1) float32 ...
Attributes:
tensogram_version: 2
The data object became a variable named object_0. Dimensions are
auto-generated as dim_0, dim_1. No coordinates – tensogram has
no information to generate them.
Adding dimension names:
>>> ds = xr.open_dataset("simple.tgm", engine="tensogram",
... dim_names=["latitude", "longitude"])
>>> ds["object_0"].dims
('latitude', 'longitude')
Example 2: Single Object with Coordinate Objects
When coordinate arrays are stored as separate data objects in the same message, the backend auto-detects them by name.
Creating the file:
lat = np.linspace(-90, 90, 5, dtype=np.float64)
lon = np.linspace(0, 360, 8, endpoint=False, dtype=np.float64)
temp = np.random.default_rng(42).random((5, 8)).astype(np.float32)
meta = {"version": 2, "base": [
{"name": "latitude"},
{"name": "longitude"},
{"name": "temperature"},
]}
with tensogram.TensogramFile.create("with_coords.tgm") as f:
f.append(meta, [
({"type": "ntensor", "shape": [5], "dtype": "float64", ...}, lat),
({"type": "ntensor", "shape": [8], "dtype": "float64", ...}, lon),
({"type": "ntensor", "shape": [5, 8], "dtype": "float32", ...}, temp),
])
Opening in xarray:
>>> ds = xr.open_dataset("with_coords.tgm", engine="tensogram")
>>> ds
<xarray.Dataset>
Dimensions: (latitude: 5, longitude: 8)
Coordinates:
* latitude (latitude) float64 -90.0 -45.0 0.0 45.0 90.0
* longitude (longitude) float64 0.0 45.0 90.0 135.0 180.0 225.0 270.0 315.0
Data variables:
temperature (latitude, longitude) float32 ...
Attributes:
tensogram_version: 2
How it works:
- Objects with name: "latitude" and name: "longitude" match known coordinate names (case-insensitive).
- They become coordinate arrays on the Dataset.
- The temperature object’s shape (5, 8) matches the sizes of latitude (5) and longitude (8), so its dimensions are automatically resolved to ("latitude", "longitude").
Known Coordinate Names
The following names are recognized (case-insensitive):
| Name | Canonical dimension |
|---|---|
| lat, latitude | latitude |
| lon, longitude | longitude |
| x | x |
| y | y |
| time | time |
| level | level |
| pressure | pressure |
| height | height |
| depth | depth |
| frequency | frequency |
| step | step |
If no matching coordinate objects are found and no dim_names are provided,
dimensions remain generic (dim_0, dim_1, …).
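The case-insensitive lookup amounts to a small alias table. A sketch under assumed names (_COORD_ALIASES and canonical_coord are illustrative — the package does not export its internal table):

```python
# Hypothetical alias table mirroring the documented coordinate names.
_COORD_ALIASES = {
    "lat": "latitude", "latitude": "latitude",
    "lon": "longitude", "longitude": "longitude",
    "x": "x", "y": "y", "time": "time", "level": "level",
    "pressure": "pressure", "height": "height", "depth": "depth",
    "frequency": "frequency", "step": "step",
}

def canonical_coord(name: str):
    """Map an object's name/param to a canonical dimension, or None."""
    return _COORD_ALIASES.get(name.lower())

assert canonical_coord("Lat") == "latitude"
assert canonical_coord("temperature") is None
```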
Example 3: Multi-Object with variable_key
When a message contains multiple data objects, each with per-object metadata
identifying the parameter, you can use variable_key to name the variables.
Creating the file:
t2m = np.ones((3, 4), dtype=np.float32) * 273.15
u10 = np.ones((3, 4), dtype=np.float32) * 5.0
meta = {"version": 2,
"base": [
{"mars": {"class": "od", "date": "20260401", "type": "fc", "param": "2t", "levtype": "sfc"}},
{"mars": {"class": "od", "date": "20260401", "type": "fc", "param": "10u", "levtype": "sfc"}},
],
}
with tensogram.TensogramFile.create("mars.tgm") as f:
f.append(meta, [
({"type": "ntensor", "shape": [3, 4], "dtype": "float32", ...}, t2m),
({"type": "ntensor", "shape": [3, 4], "dtype": "float32", ...}, u10),
])
Without variable_key:
>>> ds = xr.open_dataset("mars.tgm", engine="tensogram")
>>> list(ds.data_vars)
['object_0', 'object_1']
With variable_key:
>>> ds = xr.open_dataset("mars.tgm", engine="tensogram",
... variable_key="mars.param")
>>> list(ds.data_vars)
['2t', '10u']
>>> ds.attrs
{'tensogram_version': 2}
The variable_key supports dotted paths: "mars.param" navigates into
the nested mars dict within each object’s metadata.
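The dotted-path navigation can be sketched as a simple nested-dict walk; this helper (lookup_dotted is a hypothetical name, and the fallback behaviour for missing keys may differ from the backend's):

```python
def lookup_dotted(meta: dict, dotted_path: str):
    """Resolve a dotted path like "mars.param" in per-object metadata."""
    node = meta
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node

assert lookup_dotted({"mars": {"param": "2t"}}, "mars.param") == "2t"
assert lookup_dotted({"name": "temp"}, "mars.param") is None
```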
Example 4: Multi-Message File with Auto-Merge
When a .tgm file contains many messages (one object each) that differ
only in metadata, open_datasets() can stack them along outer dimensions.
Creating the file:
import tensogram_xarray
rng = np.random.default_rng(99)
with tensogram.TensogramFile.create("multi.tgm") as f:
for param in ["2t", "10u"]:
for date in ["20260401", "20260402"]:
data = rng.random((3, 4), dtype=np.float32)
meta = {"version": 2,
"base": [{"mars": {"param": param, "date": date}}]}
desc = {"type": "ntensor", "shape": [3, 4], "dtype": "float32",
"byte_order": "little", "encoding": "none",
"filter": "none", "compression": "none"}
f.append(meta, [(desc, data)])
Opening with open_datasets():
>>> datasets = tensogram_xarray.open_datasets(
... "multi.tgm", variable_key="mars.param"
... )
>>> len(datasets)
1
>>> ds = datasets[0]
>>> list(ds.data_vars)
['2t', '10u']
What happened:
- The scanner read metadata from all 4 messages (no payload decode).
- Objects were grouped by structure: all have shape (3, 4) and dtype float32.
- variable_key="mars.param" split them by parameter: 2t (2 objects) and 10u (2 objects).
- Within each sub-group, mars.date varies across ["20260401", "20260402"], so it became an outer dimension.
- Each variable has shape (2, 3, 4) with a mars.date coordinate.
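The outer-dimension detection within a sub-group amounts to finding metadata keys whose values vary across objects. A simplified sketch (top-level keys only; the real merge logic also checks that the value combinations form a complete hypercube, and varying_keys is a hypothetical name):

```python
def varying_keys(metas):
    """Return metadata keys whose values differ across objects.

    Each such key is a candidate outer dimension for the merge.
    Illustrative sketch, not the backend's merge implementation.
    """
    keys = set().union(*metas)
    return sorted(k for k in keys if len({m.get(k) for m in metas}) > 1)

# the "2t" sub-group from the example: only the date varies
assert varying_keys([
    {"param": "2t", "date": "20260401"},
    {"param": "2t", "date": "20260402"},
]) == ["date"]
```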
Example 5: Heterogeneous File – Auto-Split
When a file contains objects of different shapes or dtypes, they cannot
be merged into a single Dataset. open_datasets() automatically splits
them into compatible groups.
Creating the file:
with tensogram.TensogramFile.create("hetero.tgm") as f:
# Message 0: 2D float32 temperature field
f.append({"version": 2, "base": [{"name": "temp"}]},
[({"type": "ntensor", "shape": [3, 4], "dtype": "float32", ...},
np.ones((3, 4), dtype=np.float32))])
# Message 1: 2D float32 wind field (same shape -- compatible)
f.append({"version": 2, "base": [{"name": "wind"}]},
[({"type": "ntensor", "shape": [3, 4], "dtype": "float32", ...},
np.ones((3, 4), dtype=np.float32) * 2)])
# Message 2: 1D int32 counts (different shape AND dtype -- incompatible)
f.append({"version": 2, "base": [{"name": "counts"}]},
[({"type": "ntensor", "shape": [5], "dtype": "int32", ...},
np.array([1, 2, 3, 4, 5], dtype=np.int32))])
Opening:
>>> datasets = tensogram_xarray.open_datasets("hetero.tgm")
>>> len(datasets)
2
>>> datasets[0] # The (3, 4) float32 group
<xarray.Dataset>
Dimensions: (dim_0: 3, dim_1: 4)
Data variables:
temp (dim_0, dim_1) float32 ...
wind (dim_0, dim_1) float32 ...
>>> datasets[1] # The (5,) int32 group
<xarray.Dataset>
Dimensions: (dim_0: 5)
Data variables:
counts (dim_0) int32 ...
Objects that share (shape, dtype) are grouped together. Incompatible
objects go to separate Datasets.
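The grouping rule can be sketched as a bucketing by (shape, dtype); the function name group_objects and the descriptor shape here are illustrative, not the package's API:

```python
from collections import defaultdict

def group_objects(objects):
    """Group object descriptors into merge-compatible sets by (shape, dtype)."""
    groups = defaultdict(list)
    for obj in objects:
        groups[(tuple(obj["shape"]), obj["dtype"])].append(obj)
    return list(groups.values())

objs = [
    {"name": "temp",   "shape": [3, 4], "dtype": "float32"},
    {"name": "wind",   "shape": [3, 4], "dtype": "float32"},
    {"name": "counts", "shape": [5],    "dtype": "int32"},
]
groups = group_objects(objs)
assert len(groups) == 2  # the (3,4) float32 pair, and the (5,) int32 object
assert [o["name"] for o in groups[0]] == ["temp", "wind"]
```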
Example 6: Providing Full User Mapping
For complete control, pass all mapping parameters:
ds = xr.open_dataset(
"forecast.tgm",
engine="tensogram",
dim_names=["latitude", "longitude"],
variable_key="mars.param",
message_index=0, # which message in a multi-message file
verify_hash=True, # verify xxh3 integrity on decode
)
| Parameter | Type | Effect |
|---|---|---|
| dim_names | list[str] | Names for the innermost tensor axes (positional) |
| variable_key | str | Dotted path in per-object metadata for variable naming |
| message_index | int | Which message to open (default 0) |
| merge_objects | bool | If True, calls open_datasets() and returns first result |
| verify_hash | bool | Verify xxh3 hashes during decode |
| drop_variables | list[str] | Variables to exclude from the Dataset |
| range_threshold | float | Fraction of total elements below which partial reads are used (default 0.5) |
For multi-message files, use tensogram_xarray.open_datasets() directly:
import tensogram_xarray
datasets = tensogram_xarray.open_datasets(
"forecast.tgm",
dim_names=["latitude", "longitude"],
variable_key="mars.param",
verify_hash=True,
)
Example 7: Lazy Loading with Dask
Data is always loaded lazily. Opening a file only reads metadata – the tensor payloads are decoded on first access. This enables working with larger-than-memory files via dask.
# Open with dask chunking
ds = xr.open_dataset("large.tgm", engine="tensogram", chunks={})
print(ds["object_0"])
# <xarray.DataArray 'object_0' (dim_0: 10000, dim_1: 10000)>
# dask.array<...>
# Compute a mean without loading the full array
mean = ds["object_0"].mean().compute()
See also: Dask Integration for a complete walkthrough with distributed computation, performance tuning, and a runnable 4-D tensor example.
When Partial Reads Are Used
The backend inspects each data object’s encoding pipeline to determine
whether partial reads via decode_range(join=False) are available:
| Compression | Filter | Partial Read? | Mechanism |
|---|---|---|---|
| none | none | Yes | Direct byte offset |
| szip | none | Yes | RSI block offset seeking |
| blosc2 | none | Yes | Independent chunk decompression |
| zfp (fixed_rate) | none | Yes | Fixed-size blocks, computable offsets |
| zfp (fixed_precision) | none | No | Variable-size blocks |
| zfp (fixed_accuracy) | none | No | Variable-size blocks |
| zstd | none | No | Stream compressor |
| lz4 | none | No | Stream compressor |
| sz3 | none | No | Stream compressor |
| Any | shuffle | No | Byte rearrangement breaks contiguous ranges |
When partial reads are available, slicing a lazy array decodes only the requested region:
ds = xr.open_dataset("szip_data.tgm", engine="tensogram")
# Only the bytes for rows 100-110 are decompressed:
subset = ds["object_0"][100:110, :].values
When partial reads are not available (stream compressors or shuffle filter), the full object is decoded and then sliced in memory. This is transparent to the user – the API is identical.
N-Dimensional Slice Mapping
When you slice a lazy xarray variable backed by tensogram, the backend must
convert an N-dimensional slice into flat element ranges that decode_range()
understands. Here is how the decomposition works:
- Find the split point – scan the slice dimensions from innermost to outermost and take the first (innermost) dimension whose slice does not cover the full axis. All dimensions inner to this point are fully covered, so they are contiguous in memory and form a single block per outer-index combination.
- Compute the contiguous block size – multiply the split dimension’s slice width by the full lengths of all dimensions inner to it. This gives the number of elements in each flat range.
- Generate one range per outer-index combination – iterate over the Cartesian product of sliced indices in all dimensions outer to the split point. Each combination produces one (offset, count) pair.
- Merge adjacent ranges – if two consecutive ranges abut in the flat layout (i.e. offset_i + count_i == offset_{i+1}), they are merged into a single wider range to reduce I/O calls.
Concrete example: an array of shape (100, 200) sliced as [10:20, 50:100]:
- The innermost dimension (axis 1) has slice 50:100 (width 50), which does not cover the full axis (length 200), so it is the split point.
- Contiguous block size = 50 elements (just the inner slice width; no dimensions lie inner to axis 1).
- Outer indices: the axis-0 slice 10:20 gives indices [10, 11, ..., 19] – 10 combinations.
- This produces 10 flat ranges of 50 elements each: (10*200+50, 50), (11*200+50, 50), …, (19*200+50, 50).
- None are adjacent (a gap of 150 elements separates each pair), so no merging occurs.
If the slice were [10:20, :] instead (full inner axis), the split point
moves to axis 0 and the 10 individual ranges of 200 elements each are
adjacent in memory – they merge into a single range (10*200, 10*200).
flowchart TD
A["N-D slice<br/>arr[10:20, 50:100]"] --> B["Find split point<br/>axis 1 (not full)"]
B --> C["Block size = 50"]
C --> D["Outer indices:<br/>axis 0 → [10..19]"]
D --> E["10 ranges of 50 elements"]
E --> F["Merge adjacent?<br/>No — gap of 150"]
F --> G["decode_range()<br/>10 × (offset, 50)"]
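The four steps can be condensed into a short reference sketch. This is illustrative Python (the function name slice_ranges is hypothetical, not the backend's code), assuming step-1 slices already clamped to the array bounds:

```python
from itertools import product

def slice_ranges(shape, slices):
    """Map an N-D slice to flat (offset, count) element ranges.

    Assumes row-major layout and step-1 slices clamped to the bounds.
    """
    n = len(shape)
    strides = [1] * n                      # row-major element strides
    for ax in range(n - 2, -1, -1):
        strides[ax] = strides[ax + 1] * shape[ax + 1]
    # 1. split point: innermost axis whose slice is not the full axis
    split = 0
    for ax in range(n - 1, -1, -1):
        if slices[ax].stop - slices[ax].start != shape[ax]:
            split = ax
            break
    # 2. contiguous block: split-axis slice width x all full inner axes
    block = (slices[split].stop - slices[split].start) * strides[split]
    # 3. one range per combination of outer indices
    ranges = []
    for combo in product(*(range(s.start, s.stop) for s in slices[:split])):
        offset = sum(i * strides[ax] for ax, i in enumerate(combo))
        ranges.append((offset + slices[split].start * strides[split], block))
    # 4. merge ranges that abut in the flat layout
    merged = [ranges[0]]
    for off, cnt in ranges[1:]:
        prev_off, prev_cnt = merged[-1]
        if prev_off + prev_cnt == off:
            merged[-1] = (prev_off, prev_cnt + cnt)
        else:
            merged.append((off, cnt))
    return merged

# The worked example: 10 ranges of 50 elements, none mergeable
assert slice_ranges((100, 200), (slice(10, 20), slice(50, 100)))[0] == (2050, 50)
# Full inner axis: the 10 ranges of 200 merge into one
assert slice_ranges((100, 200), (slice(10, 20), slice(0, 200))) == [(2000, 2000)]
```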
Range Threshold Heuristic
Even when partial reads are technically available, reading many small ranges can be slower than decoding the entire array – especially for compressed data where decompression has fixed overhead per block.
The backend uses a ratio-based heuristic controlled by the
range_threshold parameter (default 0.5):
Rule: partial reads are used only when the total number of requested elements is less than
range_threshold × total_elements.
With the default of 0.5, if you request more than 50% of the array, the
backend falls back to a full decode and slices in memory. Lower values
make the backend more aggressive about using partial reads; higher values
make it prefer full decodes.
# More aggressive partial reads (use when each range is cheap, e.g. uncompressed)
ds = xr.open_dataset("file.tgm", engine="tensogram", range_threshold=0.3)
# Almost always full decode (use when decode overhead is very low)
ds = xr.open_dataset("file.tgm", engine="tensogram", range_threshold=0.9)
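The rule itself is a one-line comparison; a sketch (use_partial_reads is an illustrative name, not an exported function):

```python
def use_partial_reads(requested_elements: int, total_elements: int,
                      range_threshold: float = 0.5) -> bool:
    # Partial reads only pay off when the request is a small fraction
    # of the object; otherwise a single full decode is cheaper.
    return requested_elements < range_threshold * total_elements

# a 10-row slice of a 100x200 field: 2000 of 20000 elements -> partial
assert use_partial_reads(10 * 200, 100 * 200)
# 60% of the array -> full decode, then slice in memory
assert not use_partial_reads(12_000, 20_000)
```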
Installation
uv venv .venv && source .venv/bin/activate # if not already in a virtualenv
uv pip install tensogram-xarray
This pulls in tensogram and xarray as dependencies. The xarray backend
is registered automatically via entry points – no extra configuration needed.
>>> import xarray as xr
>>> "tensogram" in xr.backends.list_engines()
True
For dask support:
source .venv/bin/activate # if not already in the virtualenv
uv pip install "tensogram-xarray[dask]"
Error Handling
The backend reports errors with enough context for diagnosis. Common error scenarios and their messages:
| Scenario | Error type | Message includes |
|---|---|---|
| File not found | OSError | File path (from OS) |
| Negative message_index | ValueError | "message_index must be >= 0, got -1" |
| message_index out of range | ValueError | Index and file message count |
| dim_names length mismatch | ValueError | Actual vs expected count |
| Unsupported dtype | TypeError | "unsupported tensogram dtype 'foo'" |
| decode_range failure | Falls back to decode_object | Warning logged at DEBUG level with file, message, object, and cause |
| Incomplete hypercube in merge | ValueError | Which coordinate combination is missing |
| Silent data loss in merge | WARNING log | Variable name and count of dropped objects |
| Hash verification failure | ValueError | Object index and expected/actual hash |
| Conflicting coordinate objects | ValueError | Dimension name and mismatch details |
Hash Verification and Partial Reads
When verify_hash=True is passed, xxh3 hash verification is performed on
full object reads (decode_object) only. Partial reads via
decode_range() intentionally skip hash verification because:
- Partial reads decode only a subset of the payload, so the full-object hash cannot be validated.
- The purpose of partial reads is to minimise I/O; verifying the hash would require reading the entire payload, defeating the optimisation.
This means that for lazily-loaded arrays, hash verification happens when
a slice triggers a full-object decode (i.e. when the requested fraction
exceeds range_threshold), but not when partial decode_range() is used.
Logging
The backend uses Python’s standard logging module. To see partial-read
fallback diagnostics:
import logging
logging.getLogger("tensogram_xarray").setLevel(logging.DEBUG)
To see merge data-loss warnings (enabled by default at WARNING level):
import logging
logging.basicConfig(level=logging.WARNING)
Dask Integration
Tensogram supports Dask natively through its xarray
backend. When you open a .tgm file with chunks={}, xarray wraps every
tensor variable in a dask.array.Array. No data is read from disk until
you call .compute() or .values.
import xarray as xr
ds = xr.open_dataset("forecast.tgm", engine="tensogram", chunks={})
# ds["temperature"].data is now a dask.array -- zero I/O so far
mean = ds["temperature"].mean().compute() # data decoded here
This chapter explains how the integration works, walks through a complete example with distributed computation, and covers the performance knobs you can tune.
How It Works
The tensogram xarray backend implements BackendArray, xarray’s lazy-loading
protocol. When dask requests a chunk, the backend:
- Opens the .tgm file and reads the raw message bytes.
- For small slices on compressors that support random access (none, szip, blosc2, zfp fixed-rate): maps the N-D slice to flat byte ranges and decodes only those ranges via decode_range().
- For large slices or stream compressors: falls back to a full decode_object() and slices in memory.
The BackendArray stores only the file path (no open handles), making it
pickle-safe for dask multiprocessing and distributed execution.
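Why path-only state matters can be shown with a toy stand-in (the class name and fields here are hypothetical, not the backend's BackendArray): an object holding only a path and indices round-trips through pickle, whereas one holding an open file handle would not.

```python
import pickle

class LazyTensorStub:
    """Toy stand-in for a lazy backend array.

    Holding only a path and indices -- never an open file handle --
    keeps instances trivially picklable, which is what lets dask
    ship them to worker processes.
    """
    def __init__(self, path, message_index, object_index):
        self.path = path
        self.message_index = message_index
        self.object_index = object_index

clone = pickle.loads(pickle.dumps(LazyTensorStub("/data/fc.tgm", 0, 2)))
assert (clone.path, clone.message_index, clone.object_index) == ("/data/fc.tgm", 0, 2)
```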
flowchart LR
A["xr.open_dataset<br/>chunks={}"] --> B["BackendArray<br/>(lazy, pickle-safe)"]
B --> C["dask.array.Array"]
C -->|".compute()"| D["decode_range()<br/>or decode_object()"]
D --> E["numpy.ndarray"]
Chunking Strategies
| chunks value | Behaviour |
|---|---|
| {} | Automatic: one chunk per tensor object (most common) |
| {"latitude": 100} | Split along latitude every 100 elements |
| {"latitude": 100, "longitude": 200} | Split along both axes |
For tensogram files, chunks={} is usually the right choice because
each data object is already a self-contained tensor. Finer chunking
adds overhead from repeated file opens.
Complete Example: Distributed Statistics over 4-D Tensors
This walkthrough corresponds to examples/python/09_dask_distributed.py.
It creates 4 .tgm files representing a 4-D temperature field
(time x level x latitude x longitude), then computes statistics
entirely through dask’s lazy execution.
Step 1: Create the Data Files
Each file contains 10 data objects (one per pressure level) plus latitude and longitude coordinate arrays:
import numpy as np
import tensogram
def _desc(shape, dtype="float32", **extra):
return {
"type": "ntensor", "shape": list(shape), "dtype": dtype,
"byte_order": "little", "encoding": "none",
"filter": "none", "compression": "none", **extra,
}
LEVEL_VALUES = [1000, 925, 850, 700, 500, 400, 300, 200, 100, 50]
NLAT, NLON = 36, 72
with tensogram.TensogramFile.create("temperature_20260401.tgm") as f:
lat = np.linspace(-87.5, 87.5, NLAT, dtype=np.float64)
lon = np.linspace(0, 355, NLON, dtype=np.float64)
objects = [
(_desc([NLAT], dtype="float64", name="latitude"), lat),
(_desc([NLON], dtype="float64", name="longitude"), lon),
]
rng = np.random.default_rng(42)
for level_hpa in LEVEL_VALUES:
field = rng.random((NLAT, NLON)).astype(np.float32)
desc = _desc([NLAT, NLON], name=f"temperature_{level_hpa}hPa")
objects.append((desc, field))
f.append({"version": 2}, objects)
Step 2: Open with Dask Lazy Loading
The critical parameters are engine="tensogram" and chunks={}:
import xarray as xr
import tensogram_xarray # registers the engine
ds = xr.open_dataset(
"temperature_20260401.tgm",
engine="tensogram",
variable_key="name", # name variables from descriptor "name" field
chunks={}, # enable dask lazy loading
)
At this point:
- No tensor data has been decoded. Only CBOR metadata was read.
- Each variable is a
dask.array.Array:
>>> type(ds["temperature_1000hPa"].data)
<class 'dask.array.core.Array'>
>>> ds["temperature_1000hPa"].shape
(36, 72)
>>> ds["temperature_1000hPa"].chunks
((36,), (72,))
Step 3: Build a 4-D Tensor from Multiple Files
Stack variables across levels within each file, then stack files across time:
import dask
import dask.array as da
# Open all 4 files
paths = ["temperature_20260401.tgm", "temperature_20260402.tgm",
"temperature_20260403.tgm", "temperature_20260404.tgm"]
datasets = [
xr.open_dataset(p, engine="tensogram", variable_key="name", chunks={})
for p in paths
]
# Stack levels within each file, then stack across time
# Build in LEVEL_VALUES order (not alphabetical) so axis matches labels
temp_vars = [f"temperature_{lev}hPa" for lev in LEVEL_VALUES]
all_timesteps = []
for ds in datasets:
    level_arrays = [ds[v].data for v in temp_vars]
    all_timesteps.append(da.stack(level_arrays, axis=0))
full_4d = da.stack(all_timesteps, axis=0)
# Shape: (4, 10, 36, 72) -- (time, level, lat, lon)
# Still lazy -- zero I/O
Step 4: Compute Statistics with Dask
Schedule multiple computations, then execute them in a single
dask.compute() call:
# Schedule (lazy -- no computation yet)
global_mean = full_4d.mean()
global_std = full_4d.std()
global_min = full_4d.min()
global_max = full_4d.max()
# Execute all at once (data decoded from .tgm files here)
mean_val, std_val, min_val, max_val = dask.compute(
    global_mean, global_std, global_min, global_max
)
print(f"Mean: {mean_val:.2f} K")
print(f"Std: {std_val:.2f} K")
print(f"Min: {min_val:.2f} K")
print(f"Max: {max_val:.2f} K")
Step 5: Selective Lazy Loading
Only the data you touch is decoded. Slicing the 4-D array triggers decoding of just the relevant chunks:
# Single point: backend uses decode_range() for the tiny slice
# (1 element out of 2592 = 0.04%, well below the 50% threshold)
point = full_4d[0, 0, 18, 0].compute()
# One pressure level across all times: touches 4 backing arrays
level_400 = full_4d[:, 5, :, :].mean().compute()
# Equatorial band: partial range decode for the selected rows
equatorial = full_4d[0, 0, 9:27, :].mean().compute()
Performance Tuning
The range_threshold Parameter
When dask requests a slice, the backend decides between partial decode
(decode_range()) and full decode (decode_object()) based on the
fraction of requested elements:
Rule: partial reads are used when
requested_elements / total_elements <= range_threshold
| range_threshold | Behaviour |
|---|---|
| 0.3 | Aggressive partial reads (good for uncompressed data) |
| 0.5 (default) | Balanced: partial below 50%, full above |
| 0.9 | Almost always full decode (good for fast decompressors) |
# More aggressive partial reads
ds = xr.open_dataset("file.tgm", engine="tensogram",
                     chunks={}, range_threshold=0.3)
# Almost always full decode
ds = xr.open_dataset("file.tgm", engine="tensogram",
                     chunks={}, range_threshold=0.9)
Which Compressors Support Partial Reads?
| Compression | Partial Read? | Notes |
|---|---|---|
| none | Yes | Direct byte offset |
| szip | Yes | RSI block seeking |
| blosc2 | Yes | Independent chunk decompression |
| zfp (fixed_rate) | Yes | Fixed-size blocks |
| zfp (other modes) | No | Variable-size blocks |
| zstd | No | Stream compressor |
| lz4 | No | Stream compressor |
| sz3 | No | Stream compressor |
The shuffle filter also disables partial reads (byte rearrangement
breaks contiguous ranges). The fallback is always transparent: the
full object is decoded and sliced in memory.
Dask Scheduler Choice
Tensogram’s backend is thread-safe (uses a threading.Lock per array).
All three dask schedulers work:
# Synchronous (debugging)
dask.config.set(scheduler="synchronous")
# Threaded (default, good for I/O-bound work)
dask.config.set(scheduler="threads")
# Multiprocessing (BackendArray is pickle-safe)
dask.config.set(scheduler="processes")
For large-scale work, dask.distributed also works because the
BackendArray stores only the file path (no unpicklable state).
Thread Safety
The TensogramBackendArray uses a per-array threading.Lock to
serialise file I/O. This means:
- Multiple dask tasks can read different variables concurrently.
- Reads to the same variable are serialised (no concurrent file opens for the same array).
- The lock is excluded from pickle state and recreated on deserialise.
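The lock-handling pattern can be sketched in plain Python. LockedArray below is a hypothetical stand-in for TensogramBackendArray, showing how a non-picklable lock is dropped from pickle state and recreated on deserialisation:

```python
import pickle
import threading

class LockedArray:
    """Illustrative stand-in, not the actual TensogramBackendArray."""

    def __init__(self, path):
        self.path = path
        self._lock = threading.Lock()

    def __getstate__(self):
        state = self.__dict__.copy()
        del state["_lock"]          # locks are not picklable
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._lock = threading.Lock()  # fresh lock in the new process

arr = LockedArray("data.tgm")
clone = pickle.loads(pickle.dumps(arr))
assert clone.path == "data.tgm" and clone._lock is not arr._lock
```

Because only the file path travels through pickle, the same object works under the threaded, multiprocessing, and distributed schedulers.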
Installation
For dask support, install the optional dependency:
uv venv .venv && source .venv/bin/activate # if not already in a virtualenv
uv pip install "tensogram-xarray[dask]"
This pulls in dask[array] alongside tensogram and xarray.
Debugging
Enable debug logging to see when partial reads are used vs full decodes:
import logging
logging.getLogger("tensogram_xarray").setLevel(logging.DEBUG)
You will see messages like:
DEBUG:tensogram_xarray.array:decode_range failed for forecast.tgm msg=0 obj=2,
falling back to full decode: RangeNotSupported
This is expected for stream compressors and is not an error.
Error Handling
When Errors Are Raised
| When | What | Error type |
|---|---|---|
| open_dataset() | File not found | OSError with file path |
| open_dataset() | message_index negative | ValueError with index |
| open_dataset() | message_index out of range | ValueError with index and count |
| open_dataset() | dim_names length mismatch | ValueError with actual vs expected |
| open_dataset() | Unsupported dtype | TypeError with dtype name |
| .compute() | Decode failure | ValueError or RuntimeError from tensogram |
| .compute() | Hash mismatch (with verify_hash=True) | ValueError with object index |
| .compute() | File moved/deleted after open | OSError from OS |
Key design point: errors in metadata (file not found, bad index, wrong
dim_names) surface immediately at open_dataset() time. Errors in data
decoding surface at .compute() time because payloads are lazy-loaded.
Partial Read Fallback
When decode_range() fails (e.g. unsupported compressor for partial reads),
the backend catches the error and falls back to full decode_object():
except (ValueError, RuntimeError, OSError) as exc:
    logger.debug("decode_range failed ... falling back to full decode: %s", exc)
This fallback is transparent — the user gets correct data regardless. Enable
DEBUG logging to see when fallbacks occur.
Dask Worker Errors
File paths are automatically resolved to absolute paths when the dataset is opened. This prevents “file not found” errors when dask sends work to processes with a different working directory.
If a dask worker encounters a decode error, it propagates through dask’s error handling. The traceback will show the tensogram error with file path, message index, and object index for diagnosis.
Edge Cases
Ambiguous Dimension Matching
When coordinate arrays have the same size (e.g. both latitude and
longitude have 360 elements), the backend cannot distinguish them by
shape alone. The first match gets the coordinate name; the second falls
back to a generic dim_N.
Workaround: pass explicit dim_names to disambiguate:
ds = xr.open_dataset("file.tgm", engine="tensogram",
dim_names=["latitude", "longitude"], chunks={})
Stacking Files with Different Variables
When stacking multiple .tgm files into a single dask array, verify
that every dataset contains the expected variables before stacking:
temp_vars = [f"temperature_{lev}hPa" for lev in LEVEL_VALUES]
for i, ds in enumerate(datasets):
    missing = [v for v in temp_vars if v not in ds.data_vars]
    if missing:
        raise KeyError(f"Dataset {i} missing: {missing}")
Otherwise da.stack() will fail with a confusing KeyError from
a deep dask callback.
Zero-Object Messages
A .tgm file containing only metadata frames (no data objects) returns
an empty xr.Dataset with no variables. This is valid and does not
raise an error.
Scalar (0-D) Tensors
Data objects with shape=() (zero dimensions) are supported. They
become scalar xr.Variable objects in the dataset.
Hash Verification with Partial Reads
When verify_hash=True is set, hash verification only runs on full
object reads (via decode_object()). Partial reads via
decode_range() skip verification because only a subset of the payload
is decoded. This means:
- Large slices (above range_threshold) trigger full decode with hash verification.
- Small slices use decode_range() without hash verification.
This is by design. If you need guaranteed hash verification on every
access, set range_threshold=0.0 to force full decodes.
Zarr v3 Backend
The tensogram-zarr package implements a Zarr v3 Store backed by .tgm files. This lets you read and write Tensogram data through the standard Zarr Python API.
Installation
uv venv .venv && source .venv/bin/activate # if not already in a virtualenv
uv pip install tensogram-zarr
Requires zarr >= 3.0, tensogram, and numpy.
Reading a .tgm file through Zarr
import zarr
from tensogram_zarr import TensogramStore
# Open existing .tgm file as a read-only Zarr store
store = TensogramStore.open_tgm("data.tgm")
root = zarr.open_group(store=store, mode="r")
# Browse available arrays
for name, arr in root.members():
    print(f"{name}: shape={arr.shape}, dtype={arr.dtype}")
# Read an array (decoded eagerly at store open, served from memory)
temperature = root["2t"][:]
print(temperature.shape, temperature.mean())
# Access group-level metadata (from GlobalMetadata _extra_)
# The example below shows a MARS namespace; the attributes dict reflects
# whatever namespaces the producer put in the message's GlobalMetadata.
print(root.attrs["mars"]) # {'class': 'od', 'type': 'fc', ...}
How the mapping works
Each .tgm message maps to a Zarr group:
zarr.json # root group ← GlobalMetadata
temperature/zarr.json # array metadata ← DataObjectDescriptor
temperature/c/0/0 # chunk data ← decoded object payload
pressure/zarr.json # another array
pressure/c/0/0 # its chunk data
graph LR
TGM[".tgm file"] --> GM["GlobalMetadata"]
TGM --> OBJ1["Object 0: temperature"]
TGM --> OBJ2["Object 1: pressure"]
GM --> GZJ["zarr.json (group)"]
OBJ1 --> AZJ1["temperature/zarr.json"]
OBJ1 --> CHK1["temperature/c/0/0"]
OBJ2 --> AZJ2["pressure/zarr.json"]
OBJ2 --> CHK2["pressure/c/0/0"]
Key design decisions:
- Each TGM data object becomes one Zarr array with a single chunk (chunk shape = array shape)
- Variable names are resolved from metadata via a default lookup path (name, mars.param, param, mars.shortName, shortName), or a custom dot-path you supply
- TGM encoding metadata is preserved in Zarr array attributes under _tensogram_* keys
- Duplicate variable names get a numeric suffix (field, field_1)
Variable naming
By default, the store tries these metadata paths, in order, to name arrays:
1. name
2. mars.param
3. param
4. mars.shortName
5. shortName
6. Falls back to object_<index>
You can override with any dot-path, including non-MARS vocabularies:
# Weather pipeline using MARS
store = TensogramStore.open_tgm("weather.tgm", variable_key="mars.param")
# Neuroimaging pipeline using BIDS
store = TensogramStore.open_tgm("scans.tgm", variable_key="bids.task")
# Custom vocabulary
store = TensogramStore.open_tgm("data.tgm", variable_key="product.name")
Multi-message files
By default the store reads message 0. Select a different message with message_index:
store = TensogramStore.open_tgm("multi.tgm", message_index=2)
Writing a .tgm file through Zarr
import numpy as np
import zarr
from tensogram_zarr import TensogramStore
store = TensogramStore("output.tgm", mode="w")
root = zarr.open_group(store=store, mode="w")
# Create arrays — data is buffered in memory
root.create_array("temperature", data=np.random.rand(100, 200).astype(np.float32))
root.create_array("pressure", data=np.array([1000, 925, 850, 700], dtype=np.float64))
# Close flushes to .tgm
store.close()
The write path assembles all arrays into a single TGM message when the store is closed.
Context manager
with TensogramStore("data.tgm", mode="r") as store:
    root = zarr.open_group(store=store, mode="r")
    data = root["temperature"][:]
# Store automatically closed
Supported data types
| Tensogram dtype | Zarr data_type | NumPy dtype |
|---|---|---|
| float16 | float16 | float16 |
| float32 | float32 | float32 |
| float64 | float64 | float64 |
| int8 | int8 | int8 |
| int16 | int16 | int16 |
| int32 | int32 | int32 |
| int64 | int64 | int64 |
| uint8 | uint8 | uint8 |
| uint16 | uint16 | uint16 |
| uint32 | uint32 | uint32 |
| uint64 | uint64 | uint64 |
| complex64 | complex64 | complex64 |
| complex128 | complex128 | complex128 |
| bitmask | uint8 | uint8 |
Byte range support
The store supports Zarr’s ByteRequest types for efficient partial reads:
- RangeByteRequest(start, end) — read a byte range
- OffsetByteRequest(offset) — read from offset to end
- SuffixByteRequest(suffix) — read the last N bytes
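The semantics of the three request types can be illustrated against a plain bytes buffer. The dataclasses below are local stand-ins for Zarr's types, defined here so the sketch is self-contained:

```python
from dataclasses import dataclass

# Local stand-ins for zarr's ByteRequest variants (illustration only).
@dataclass
class RangeByteRequest:
    start: int
    end: int

@dataclass
class OffsetByteRequest:
    offset: int

@dataclass
class SuffixByteRequest:
    suffix: int

def apply_byte_request(chunk: bytes, req) -> bytes:
    """Serve a byte request from a chunk's raw bytes."""
    if isinstance(req, RangeByteRequest):
        return chunk[req.start:req.end]
    if isinstance(req, OffsetByteRequest):
        return chunk[req.offset:]
    if isinstance(req, SuffixByteRequest):
        return chunk[-req.suffix:]
    raise TypeError(type(req).__name__)  # mirrors the store's TypeError

payload = b"0123456789"
print(apply_byte_request(payload, RangeByteRequest(2, 5)))  # b"234"
print(apply_byte_request(payload, OffsetByteRequest(7)))    # b"789"
print(apply_byte_request(payload, SuffixByteRequest(3)))    # b"789"
```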
Comparison with tensogram-xarray
| Feature | tensogram-zarr | tensogram-xarray |
|---|---|---|
| API level | Low-level (Zarr Store) | High-level (xarray engine) |
| Dimensions | Generic (dim_0, dim_1) | Named (lat, lon, time) |
| Coordinates | Not interpreted | Auto-detected from metadata |
| Multi-message | One message per store | Auto-merge into hypercubes |
| Write support | Yes | No |
| Data loading | Eager (all at open) | Lazy (on-demand decode_range) |
Use tensogram-zarr when you need direct Zarr API access or write support. Use tensogram-xarray when you want automatic coordinate detection and multi-message merging.
Edge cases and limitations
Variable name sanitization
If a metadata value used as a variable name contains / or \, those characters are replaced with _ to prevent spurious directory nesting in the virtual key space. Empty names become _.
mars.param = "temperature/surface" → variable name "temperature_surface"
Duplicate variable names
When multiple objects resolve to the same name, suffixes are appended: field, field_1, field_2, etc.
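Both naming rules — separator sanitization and duplicate suffixing — are easy to sketch in plain Python (illustrative helpers, not the store's actual functions):

```python
def sanitize(name: str) -> str:
    """Replace path separators with "_" and map empty names to "_"."""
    name = name.replace("/", "_").replace("\\", "_")
    return name or "_"

def dedupe(names):
    """Append _1, _2, ... to repeated names."""
    seen, out = {}, []
    for n in names:
        if n in seen:
            seen[n] += 1
            out.append(f"{n}_{seen[n]}")
        else:
            seen[n] = 0
            out.append(n)
    return out

print(sanitize("temperature/surface"))       # temperature_surface
print(dedupe(["field", "field", "field"]))   # ['field', 'field_1', 'field_2']
```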
Zero-object messages
A message with no data objects is valid (metadata-only). The store produces a root group with attributes but no arrays.
Single chunk per array
Each TGM data object maps to a Zarr array with chunk_shape == array_shape (one chunk). There is no sub-chunking; partial reads within the array are handled by Zarr’s byte-range support against the single chunk. If a Zarr writer attempts to store multiple chunks for the same variable, a ValueError is raised — TensogramStore does not silently drop extra chunks.
Out-of-range message index
If message_index exceeds the number of messages in the file, an IndexError is raised. Negative indices are rejected with ValueError.
bfloat16 dtype
bfloat16 maps to Zarr data type "bfloat16" but is stored as raw 2-byte values (<V2 numpy dtype) since numpy has no native bfloat16 type. Use ml_dtypes.bfloat16 for interpretation.
Byte order handling
The read path normalises all chunk data to little-endian (matching the Zarr bytes codec default). The write path respects byte_order from the Zarr codecs metadata — if a big-endian bytes codec is specified, the data is byte-swapped before encoding to TGM.
JSON serialization (RFC 8259)
serialize_zarr_json() converts non-finite float values to their Zarr v3 string sentinels ("NaN", "Infinity", "-Infinity") so the output is valid RFC 8259 JSON.
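A minimal sketch of that conversion rule, assuming a simple per-value mapping (not the actual serialize_zarr_json implementation):

```python
import json
import math

def jsonify_float(x: float):
    """Map non-finite floats to the Zarr v3 string sentinels so the
    result stays valid RFC 8259 JSON."""
    if math.isnan(x):
        return "NaN"
    if math.isinf(x):
        return "Infinity" if x > 0 else "-Infinity"
    return x

values = [1.5, float("nan"), float("inf"), float("-inf")]
print(json.dumps([jsonify_float(v) for v in values]))
# [1.5, "NaN", "Infinity", "-Infinity"]
```

Note that plain json.dumps would emit bare NaN/Infinity tokens, which RFC 8259 forbids; the sentinel strings avoid that.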
Write path byte-count validation
When flushing to .tgm, the store validates that chunk byte count matches product(shape) * dtype_size. A mismatch raises ValueError with the expected and actual counts.
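A sketch of the check, using a hypothetical validate_chunk helper:

```python
import math
import numpy as np

def validate_chunk(name: str, shape, dtype: str, chunk: bytes):
    """Flush-time check: chunk bytes must equal product(shape) *
    dtype_size, else ValueError reporting both counts."""
    expected = math.prod(shape) * np.dtype(dtype).itemsize
    if len(chunk) != expected:
        raise ValueError(f"{name}: expected {expected} bytes, got {len(chunk)}")

data = np.zeros((100, 200), dtype=np.float32)
validate_chunk("temperature", data.shape, "float32", data.tobytes())  # ok
try:
    validate_chunk("temperature", data.shape, "float32", data.tobytes()[:-4])
except ValueError as exc:
    print(exc)  # temperature: expected 80000 bytes, got 79996
```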
close() exception safety
If _flush_to_tgm() fails during close(), the store is still marked as closed (_is_open = False). The exception propagates normally — partial writes do not corrupt the file since TGM messages are written atomically.
When used as a context manager and an exception is already in flight, flush errors are logged at WARNING level instead of replacing the original exception.
Error handling
All errors surface with enough context for debugging:
| Scenario | Exception | Message includes |
|---|---|---|
| File not found / unreadable | OSError | File path |
| Invalid TGM message | ValueError | File path + message index |
| Object decode failure | ValueError | File path + message index + object index + variable name |
| Out-of-range message index | IndexError | Requested index + available count |
| Negative message index | ValueError | The invalid index value |
| Invalid mode | ValueError | The invalid mode string |
| Empty path | ValueError | The value passed |
| Chunk byte-count mismatch | ValueError | Variable name + expected vs actual byte count |
| Unsupported dtype on write | ValueError | Variable name + dtype |
| Invalid JSON in zarr.json | ValueError | Byte count + hex preview |
| Unknown ByteRequest type | TypeError | The type name |
| Array without chunk data | WARNING log | Variable name (array skipped) |
| No arrays to flush | WARNING log | File path |
Errors from the underlying Rust tensogram library are wrapped with Python-level context so users see which file, message, and variable caused the problem.
anemoi-inference Integration
The tensogram-anemoi package provides a plug-and-play output for
anemoi-inference, the ECMWF framework
for running AI-based weather forecast models. Once installed, anemoi-inference
automatically discovers the plugin via Python entry points — no code changes to
anemoi-inference are required.
Installation
pip install tensogram-anemoi
Or from source:
pip install -e python/tensogram-anemoi/
Usage
In an anemoi-inference run config, specify tensogram as the output:
output:
tensogram:
path: forecast.tgm
All forecast steps are written to a single .tgm file as they are produced.
Remote destinations (S3, GCS, Azure, …) are supported via fsspec:
output:
tensogram:
path: s3://my-bucket/forecast.tgm
storage_options:
key: ...
secret: ...
Configuration options
All options after path must be supplied as keyword arguments.
| Option | Type | Default | Description |
|---|---|---|---|
| path | str | — | Destination file path or remote URL |
| encoding | str | "none" | "none" or "simple_packing" |
| bits | int | None | Bits per value (required when encoding="simple_packing") |
| compression | str | "zstd" | "none", "zstd", "lz4", "szip", "blosc2" |
| dtype | str | "float32" | Field array dtype: "float32" or "float64" |
| storage_options | dict | {} | Forwarded to fsspec for remote paths |
| stack_pressure_levels | bool | False | Stack pressure-level fields into 2-D objects |
| variables | list[str] | None | Restrict output to a subset of variables |
| output_frequency | int | None | Write every N steps |
| write_initial_state | bool | None | Whether to write step 0 |
Pressure-level stacking
When stack_pressure_levels=True, all fields sharing the same GRIB param
are merged into a single 2-D object of shape (n_grid, n_levels), sorted by
level ascending. The "mars" namespace carries "levelist": [500, 850, ...]
instead of a scalar "level" (following standard MARS convention).
Non-pressure-level fields are always written as individual 1-D objects.
output:
tensogram:
path: forecast.tgm
stack_pressure_levels: true
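The grouping can be sketched with plain numpy (synthetic fields below; this is an illustration of the layout, not the plugin's actual code):

```python
import numpy as np

n_grid = 8
fields = {  # (param, level) -> 1-D field, as produced per forecast step
    ("t", 850): np.full(n_grid, 850.0),
    ("t", 500): np.full(n_grid, 500.0),
    ("t", 1000): np.full(n_grid, 1000.0),
}

# Same param, levels sorted ascending -> one (n_grid, n_levels) object
levels = sorted(lev for (_, lev) in fields)            # [500, 850, 1000]
stacked = np.stack([fields[("t", lev)] for lev in levels], axis=1)
print(stacked.shape)        # (8, 3) -- (n_grid, n_levels)
print(levels)               # goes into mars["levelist"]
```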
Simple packing
For compact storage, use simple_packing with a bits value:
output:
tensogram:
path: forecast.tgm
encoding: simple_packing
bits: 16
compression: zstd
Coordinate arrays (lat/lon) are never lossy-encoded; only field arrays are packed.
Metadata reference
Each .tgm file produced by tensogram-anemoi contains one message per forecast
step. This section documents exactly what is stored in each message and how to
read it with the raw tensogram Python API.
Opening a file
import tensogram
tgm = tensogram.TensogramFile.open("forecast.tgm")
print(len(tgm), "steps")
meta, objects = tgm[0] # first step
meta is the decoded message metadata. objects is a list of
(descriptor, array) pairs, one entry per object in the message.
Object layout
Every message has the following fixed layout:
| Index | base[i]["name"] | Content |
|---|---|---|
| 0 | "grid_latitude" | Latitude coordinates, float64, shape (n_grid,) |
| 1 | "grid_longitude" | Longitude coordinates, float64, shape (n_grid,) |
| 2 … N | variable name or param name | Field data |
meta, objects = tgm[0]
lat_desc, lat_arr = objects[0] # latitudes
lon_desc, lon_arr = objects[1] # longitudes
fld_desc, fld_arr = objects[2] # first field
The coordinate names "grid_latitude" and "grid_longitude" are intentionally
distinct from the standard "latitude" / "longitude" names so that all objects
in a message share a single flat grid dimension rather than each coordinate
spawning its own dimension.
base[i] — per-object metadata
Each object has a corresponding entry in meta.base:
for i, entry in enumerate(meta.base):
    print(i, entry)
Every entry contains:
| Key | Type | Present on | Description |
|---|---|---|---|
"name" | str | all objects | Variable or coordinate name |
"anemoi" | dict | all objects | anemoi-specific metadata (see below) |
"mars" | dict | field objects only | MARS metadata (see below) |
"anemoi" namespace
| Key | Type | Present on | Description |
|---|---|---|---|
"variable" | str | all objects | Internal anemoi-inference variable name |
For coordinates, "variable" is "latitude" or "longitude" (the canonical
name, not the "grid_*" name stored in "name"):
assert meta.base[0]["name"] == "grid_latitude"
assert meta.base[0]["anemoi"]["variable"] == "latitude"
assert meta.base[1]["name"] == "grid_longitude"
assert meta.base[1]["anemoi"]["variable"] == "longitude"
For fields, "variable" is the internal anemoi-inference name (e.g. "t500"
for 500 hPa temperature, "2t" for 2 m temperature):
assert meta.base[2]["anemoi"]["variable"] == "2t"
"mars" namespace
Coordinate objects carry no "mars" key. Every field object carries a "mars"
dict combining keys from the anemoi-inference checkpoint with the temporal keys
derived from the forecast state:
Temporal keys (present on every field object):
| Key | Type | Description | Example |
|---|---|---|---|
"date" | str | Analysis/base date (YYYYMMDD) | "20240101" |
"time" | str | Analysis/base time (HHMM) | "0000" |
"step" | int or float | Forecast lead time in hours | 6, 1.5 |
Checkpoint keys (present when available in the model checkpoint):
| Key | Type | Description | Example |
|---|---|---|---|
"param" | str | GRIB parameter short name | "2t", "t", "u" |
"levtype" | str | Level type | "sfc", "pl", "ml" |
"level" | int | Pressure level (unstacked fields only) | 500 |
"levelist" | list[int] | Pressure levels (stacked fields only) | [500, 850, 1000] |
Reading field metadata:
meta, objects = tgm[0]
# Surface field (e.g. 2 m temperature)
entry = meta.base[2]
print(entry["name"]) # "2t"
print(entry["anemoi"]["variable"]) # "2t"
print(entry["mars"]["param"]) # "2t"
print(entry["mars"]["date"]) # "20240101"
print(entry["mars"]["time"]) # "0000"
print(entry["mars"]["step"]) # 6
# Pressure-level field (unstacked)
entry = meta.base[3]
print(entry["mars"]["param"]) # "t"
print(entry["mars"]["levtype"]) # "pl"
print(entry["mars"]["level"]) # 500
With stack_pressure_levels=True, the pressure-level group has "levelist"
instead of "level", and the array is 2-D:
entry = meta.base[2] # stacked t group
print(entry["mars"]["levelist"]) # [500, 850, 1000]
print(entry["mars"]["param"]) # "t"
desc, arr = objects[2]
print(arr.shape) # (n_grid, 3) — columns sorted by level
meta.extra — message-level metadata
meta.extra carries metadata that applies to the whole message rather than
individual objects.
"dim_names" — axis-size hints
dim_names = meta.extra["dim_names"]
# e.g. {"21600": "values"}
# or {"21600": "values", "3": "level"} (with stack_pressure_levels=True)
dim_names maps the string representation of an axis length to a semantic
name. It exists to allow downstream tools to assign meaningful axis names
without requiring any anemoi-specific knowledge. The grid axis is always
labelled "values"; when pressure-level stacking is enabled, each unique
level-axis size is labelled "level".
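A downstream tool might apply dim_names like this (illustrative helper, not part of tensogram):

```python
def axis_names(shape, dim_names: dict) -> list:
    """Map each axis length through dim_names (keys are the string form
    of the length), falling back to a generic dim_<i>."""
    return [dim_names.get(str(n), f"dim_{i}") for i, n in enumerate(shape)]

hints = {"21600": "values", "3": "level"}
print(axis_names((21600,), hints))      # ['values']
print(axis_names((21600, 3), hints))    # ['values', 'level']
```

Because the mapping is keyed by size, it works for any object in the message without per-object bookkeeping.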
Object descriptors
Each (descriptor, array) pair returned by objects[i] gives low-level
encoding detail:
desc, arr = objects[2]
print(desc.dtype) # "float32" or "float64"
print(desc.shape) # [n_grid] for flat, [n_grid, n_levels] for stacked
print(desc.encoding) # "none" or "simple_packing"
print(desc.compression) # "zstd", "lz4", etc.
Coordinate arrays are always float64 regardless of the dtype setting.
Field arrays use the configured dtype ("float32" by default), promoted to
float64 automatically when encoding="simple_packing".
Full inspection example
import tensogram
tgm = tensogram.TensogramFile.open("forecast.tgm")
for step_idx, (meta, objects) in enumerate(tgm):
    print(f"\n--- step {step_idx} ---")
    # Dimension hints
    print("dim_names:", meta.extra.get("dim_names", {}))
    for i, entry in enumerate(meta.base):
        desc, arr = objects[i]
        anemoi = entry.get("anemoi", {})
        mars = entry.get("mars", {})
        print(
            f"  [{i}] name={entry['name']!r:20s}"
            f" variable={anemoi.get('variable')!r:10s}"
            f" shape={arr.shape}"
            f" dtype={desc.dtype}"
            + (f" step={mars.get('step')}" if mars else "")
        )
Example output for a single step with surface fields and stacked pressure levels:
--- step 0 ---
dim_names: {'21600': 'values', '3': 'level'}
[0] name='grid_latitude' variable='latitude' shape=(21600,) dtype=float64
[1] name='grid_longitude' variable='longitude' shape=(21600,) dtype=float64
[2] name='2t' variable='2t' shape=(21600,) dtype=float32 step=6
[3] name='t' variable='t' shape=(21600, 3) dtype=float32 step=6
[4] name='u' variable='u' shape=(21600, 3) dtype=float32 step=6
Free-Threaded Python
Tensogram supports free-threaded Python (CPython 3.13t / 3.14t), which removes the Global Interpreter Lock (GIL) and allows true multi-threaded parallelism from Python.
What This Means
On standard CPython, the GIL serializes access to the interpreter — only one thread runs Python code at a time. Tensogram already releases the GIL during Rust computation (py.detach()), which helps, but the GIL is still re-acquired for numpy array construction and Python object creation.
On free-threaded CPython (3.13t / 3.14t), there is no GIL at all. Multiple threads can call tensogram.encode() and tensogram.decode() in true parallel. Use the included benchmark (rust/benchmarks/python/bench_threading.py) to measure scaling on your hardware.
Building for Free-Threaded Python
Install a free-threaded Python build:
# uv (recommended)
uv python install cpython-3.14+freethreaded
# Or via pyenv
pyenv install 3.14t
Build tensogram:
uv venv .venv --python python3.14t
source .venv/bin/activate
uv pip install maturin "numpy>=2.1"
cd python/bindings && maturin develop --release
Verify the GIL is disabled:
import sys
print(sys._is_gil_enabled()) # False
Thread-Safe API
All tensogram read operations are safe to call from multiple threads simultaneously:
import threading
import numpy as np
import tensogram
data = np.random.randn(1_000_000).astype(np.float32)
meta = {"version": 2, "base": [{}]}
desc = {"type": "ntensor", "shape": [1_000_000], "dtype": "float32"}
msg = tensogram.encode(meta, [(desc, data)])
def decode_worker():
    for _ in range(100):
        result = tensogram.decode(msg)

threads = [threading.Thread(target=decode_worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
Each thread can independently:
- Encode and decode messages
- Scan buffers
- Validate messages and files
- Read from TensogramFile instances (same handle or separate handles)
- Use StreamingEncoder (separate instances per thread)
TensogramFile Thread Safety
All read methods on TensogramFile (decode_message, read_message, decode_metadata, decode_descriptors, decode_object, decode_range, __getitem__, __len__, __iter__) use &self and support concurrent access from multiple threads on the same handle:
f = tensogram.TensogramFile.open("data.tgm")
def worker(thread_id):
    # Multiple threads can read from the same handle concurrently
    msg = f.decode_message(thread_id % len(f))

threads = [threading.Thread(target=worker, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
Only append() requires exclusive access — calling it while other threads are reading will raise RuntimeError (PyO3 runtime borrow check).
Benchmark Results
Measured on Linux x86_64 (20 cores), NumPy 2.4.4, release build. Same-version paired comparisons to isolate the GIL effect.
All scaling below comes from Python-level threading (threading.Thread). Each call into Rust is single-threaded — there is no rayon or internal parallelism within a single encode/decode. The speedups reflect multiple Python threads entering Rust concurrently via py.detach(). A future Rust-level parallel pipeline would multiply on top of these numbers.
Headline: Decode Throughput (1M float32, no codec)
| Threads | 3.13 (GIL) | 3.13t (free) | 3.14 (GIL) | 3.14t (free) |
|---|---|---|---|---|
| 1 | 416 op/s | 391 op/s | 408 op/s | 396 op/s |
| 2 | 432 (1.04x) | 775 (1.98x) | 432 (1.06x) | 776 (1.96x) |
| 4 | 427 (1.03x) | 1,356 (3.47x) | 425 (1.04x) | 1,352 (3.41x) |
| 8 | 309 (0.74x) | 1,507 (3.85x) | 293 (0.72x) | 1,841 (4.65x) |
Headline: Encode Throughput (1M float32, no codec)
| Threads | 3.13 (GIL) | 3.13t (free) | 3.14 (GIL) | 3.14t (free) |
|---|---|---|---|---|
| 1 | 608 op/s | 572 op/s | 504 op/s | 595 op/s |
| 2 | 761 (1.25x) | 709 (1.24x) | 664 (1.32x) | 702 (1.18x) |
| 4 | 659 (1.08x) | 726 (1.27x) | 468 (0.93x) | 725 (1.22x) |
| 8 | 520 (0.86x) | 706 (1.23x) | 351 (0.70x) | 717 (1.20x) |
Small Messages (16K float32, no codec)
| Threads | 3.13 (GIL) | 3.13t (free) | 3.14 (GIL) | 3.14t (free) |
|---|---|---|---|---|
| 1 | 20,765 op/s | 17,085 op/s | 20,174 op/s | 12,951 op/s |
| 2 | 23,689 (1.14x) | 35,642 (2.09x) | 23,093 (1.14x) | 35,176 (2.72x) |
| 4 | 22,629 (1.09x) | 36,483 (2.14x) | 22,839 (1.13x) | 61,583 (4.75x) |
| 8 | 23,664 (1.14x) | 79,539 (4.66x) | 22,487 (1.11x) | 73,549 (5.68x) |
| 16 | 23,418 (1.13x) | 93,627 (5.48x) | 23,369 (1.16x) | 168,786 (13.03x) |
Other Operations (1M float32)
Scan (message boundary detection — ~0.2µs/call, GIL overhead dominates):
| Threads | 3.14 (GIL) | 3.14t (free) |
|---|---|---|
| 1 | 312,930 op/s | 79,431 op/s |
| 2 | 421,701 (1.35x) | 266,103 (3.35x) |
| 4 | 629,505 (2.01x) | 811,096 (10.21x) |
| 8 | 522,940 (1.67x) | 389,106 (4.90x) |
| 16 | 516,342 (1.65x) | 1,231,777 (15.51x) |
Validate (full message validation — CPU-bound, scales well on both):
| Threads | 3.14 (GIL) | 3.14t (free) |
|---|---|---|
| 1 | 5,457 op/s | 4,347 op/s |
| 2 | 10,860 (1.99x) | 9,440 (2.17x) |
| 4 | 20,249 (3.71x) | 18,752 (4.31x) |
| 8 | 39,766 (7.29x) | 23,048 (5.30x) |
| 16 | 48,560 (8.90x) | 45,455 (10.46x) |
Decode-range (sub-array extraction, 2x1K slices from 1M):
| Threads | 3.14 (GIL) | 3.14t (free) |
|---|---|---|
| 1 | 66,488 op/s | 40,265 op/s |
| 2 | 111,544 (1.68x) | 98,319 (2.44x) |
| 4 | 103,191 (1.55x) | 167,786 (4.17x) |
| 8 | 104,752 (1.58x) | 325,101 (8.07x) |
| 16 | 103,236 (1.55x) | 475,755 (11.82x) |
Iter-messages (3 messages, 100K f32 each):
| Threads | 3.14 (GIL) | 3.14t (free) |
|---|---|---|
| 1 | 1,214 op/s | 1,195 op/s |
| 2 | 1,291 (1.06x) | 2,327 (1.95x) |
| 4 | 1,211 (1.00x) | 4,548 (3.81x) |
| 8 | 1,194 (0.98x) | 5,589 (4.68x) |
| 16 | 1,106 (0.91x) | 4,432 (3.71x) |
Key Takeaways
Methodology: 5 runs per configuration, median reported. 200–500 warmup iterations for fast operations.
- Validate scales near-linearly on both GIL and free-threaded — 8.9x (GIL) and 10.5x (free-threaded) at 16 threads. This is the most CPU-bound operation and benefits fully from py.detach() regardless of GIL.
- Free-threaded decode scales to 4.7x at 8 threads for the headline workload (1M f32, no codec). GIL-enabled stays near 1.0x because numpy array construction dominates and serializes under the GIL.
- GIL-enabled decode-range plateaus at ~1.7x — py.detach() allows 2 threads of overlap but the lightweight result construction can’t overlap further. Free-threaded reaches 11.8x at 16 threads.
- Scan shows dramatic free-threaded scaling — free-threaded reaches 15.5x at 16 threads. GIL-enabled scales to 2.0x at 4 threads but drops back at higher thread counts due to contention.
- Small messages (16K) reach 13.0x at 16 threads on free-threaded (3.14t) vs 1.2x on GIL-enabled.
- iter_messages scales to 4.7x at 8 threads on free-threaded, then drops due to contention. GIL-enabled stays flat (~1.0x).
- Single-thread trade-off — free-threaded single-thread performance varies by workload: decode is within ~5% of GIL-enabled (396 vs 408 op/s on 3.14), encode varies by version (3.14t is 18% faster than 3.14, while 3.13t is 6% slower than 3.13). Validate is ~20% slower (4,347 vs 5,457 op/s) and scan ~4x slower due to reference counting overhead on returned Python objects — both recover by 2 threads.
These numbers are machine-specific. Run the benchmark on your hardware:
python rust/benchmarks/python/bench_threading.py            # full suite
python rust/benchmarks/python/bench_threading.py --headline # quick comparison
python rust/benchmarks/python/bench_threading.py --quick    # CI smoke test
Reference Comparison: Tensogram (Python) vs ecCodes (C)
This section measures Tensogram’s Python throughput against ecCodes’ native C performance on the same pipeline — 10 million float64 values (80 MiB), 24-bit simple packing + szip compression — as a concrete reference point. The pipeline is common in operational weather forecasting and is representative of scientific-quantisation workloads more broadly.
What we measured
Both sides are measured end-to-end: from a float64 array to serialized compressed bytes (encode), and back to a float64 array (decode). Both include metadata serialization, framing, and integrity overhead — not just the raw packing step.
ecCodes (C, single-threaded): The Rust benchmark (rust/benchmarks/src/bin/grib_comparison.rs) calls ecCodes’ C library directly via FFI. Encode: allocate a GRIB handle, configure the grid (10M regular lat/lon), set packing type to CCSDS at 24 bits, write the values array, serialize to GRIB bytes. Decode: load the GRIB message from bytes, extract the values array. No Python involved. Median of 10 iterations, 3 warmup.
Tensogram (Python, multi-threaded): The same 10M float64 values, same 24-bit quantization, same szip compression. Encode: pass a numpy array + CBOR metadata dict to tensogram.encode(), which crosses the PyO3 boundary, quantizes, compresses, frames, computes the integrity hash, and returns Python bytes. Decode: pass bytes to tensogram.decode(), which deframes, decompresses, dequantizes, and returns a numpy array. Each Python thread makes independent encode/decode calls. The GIL is released during the Rust computation.
Why scaling depends on the codec
Threading helps most when the Rust computation (compression, quantization) is the dominant cost. With simple packing + szip, each encode/decode spends ~170 ms in Rust and ~20 ms in Python/numpy — so ~89% of the time runs with the GIL released and threads scale well. Without compression, the Rust work is trivial (~1 ms) and the Python overhead limits parallelism.
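A rough Amdahl's-law reading of those numbers (a sketch using the round figures quoted above, not a measurement) gives an upper bound on the expected scaling: ~170 ms of Rust work can overlap across threads once the GIL is released, while ~20 ms of Python/numpy work stays serialized.

```python
# Amdahl's-law estimate for the pipeline described above: ~170 ms of Rust
# work runs with the GIL released (parallelizable), ~20 ms of Python/numpy
# work stays serialized. This is an idealized upper bound that ignores
# memory bandwidth and pool overhead.
rust_ms, python_ms = 170.0, 20.0
serial = python_ms / (rust_ms + python_ms)   # serialized fraction, ~0.105

def ideal_speedup(n: int) -> float:
    # Serialized part runs once; parallel part divides across n threads.
    return 1.0 / (serial + (1.0 - serial) / n)

for n in (1, 2, 4, 8):
    print(n, round(ideal_speedup(n), 2))
```

At 8 threads this predicts roughly 4.6×, consistent in magnitude with the measured decode scaling reported earlier.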
The tables above measure uncompressed data to isolate the threading mechanism. The results below use the production pipeline (24-bit packing + szip) and show what real workloads achieve.
Results
ecCodes CCSDS (Rust FFI, single-threaded): 870 MB/s encode, 531 MB/s decode.
Tensogram from Python (free-threaded 3.14t, 5-run median, 10M float64 24-bit packing+szip):
Decode:
| Threads | Throughput | vs ecCodes C |
|---|---|---|
| 1 | 446 MB/s | 0.84x |
| 2 | 858 MB/s | 1.62x |
| 4 | 1,596 MB/s | 3.01x |
| 8 | 2,602 MB/s | 4.90x |
Encode:
| Threads | Throughput | vs ecCodes C |
|---|---|---|
| 1 | 435 MB/s | 0.50x |
| 2 | 833 MB/s | 0.96x |
| 4 | 1,516 MB/s | 1.74x |
| 8 | 2,353 MB/s | 2.71x |
Single-threaded Tensogram from Python is slower than ecCodes from C (the PyO3 boundary costs ~10-15% on decode, ~50% on encode due to numpy data extraction for 80 MiB). But at 2 threads, decode already surpasses ecCodes. At 4 threads, both encode and decode exceed ecCodes. At 8 threads, decode reaches 4.9x ecCodes throughput — from Python.
Requirements
- Python >= 3.13t for free-threaded mode (3.12/3.13 GIL-enabled also works)
- NumPy >= 2.1 (free-threaded support)
- maturin >= 1.8 (free-threaded wheel building)
Known Limitations
Inherent:
- Shared mutable numpy arrays across threads can cause data races (same as any Python threading)
- xarray and zarr backends have their own threading models (dask, zarr locking)
By design:
- `TensogramFile` read methods (`decode_message`, `read_message`, `__getitem__`, etc.) support concurrent access from multiple threads on the same handle. Only `append()` requires exclusive access.
- `bytes` inputs to decode/scan/validate are zero-copy across the GIL release. `bytearray` inputs are copied once internally by PyO3.
- `iter_messages` / `PyBufferIter` own a full buffer copy (the buffer must outlive iteration).
Multi-Threaded Coding Pipeline
Since v0.13.0 Tensogram exposes a caller-controlled thread budget that spreads encoding and decoding work across a scoped pool of workers. The feature is off by default — existing code paths produce byte-identical output to previous releases until the caller opts in.
This page covers:
- The `threads` option
- Cross-language parity
- Axis-A vs axis-B dispatch
- Determinism contract
- Environment variable override
- Interaction with free-threaded Python
- Benchmarks and tuning
The threads option
All four bindings expose a `threads: u32` option on encode and decode entry points:
#![allow(unused)]
fn main() {
use tensogram::{encode, decode, EncodeOptions, DecodeOptions};
// Encode with a 4-thread pool:
let msg = encode(&meta, &descriptors, &EncodeOptions {
threads: 4,
..Default::default()
})?;
// Decode with an 8-thread pool:
let (meta, objs) = decode(&msg, &DecodeOptions {
threads: 8,
..Default::default()
})?;
}
import tensogram
msg = tensogram.encode(meta, descriptors, threads=4)
decoded = tensogram.decode(msg, threads=8)
tensogram::encode_options enc{};
enc.threads = 4;
auto bytes = tensogram::encode(meta_json, objects, enc);
tensogram::decode_options dec{};
dec.threads = 8;
auto msg = tensogram::decode(buf, len, dec);
tgm_encode(meta_json, data_ptrs, data_lens, num_objects,
"xxh3", /* threads= */ 4, &out);
tgm_decode(buf, len, /* verify_hash */ 0, /* native_byte_order */ 1,
/* threads= */ 8, &msg);
tensogram --threads 8 merge -o merged.tgm a.tgm b.tgm
TENSOGRAM_THREADS=4 tensogram split -o 'part_[index].tgm' input.tgm
Value semantics
| `threads` | Behaviour |
|---|---|
| `0` (default) | Sequential, single-threaded. Falls back to the `TENSOGRAM_THREADS` env var if set and non-zero. |
| `1` | Build a scoped 1-worker rayon pool. Useful for testing — everything flows through the parallel code paths but runs deterministically. |
| `N ≥ 2` | Build a scoped N-worker rayon pool for the duration of the call. The pool is dropped when the call returns. |
Cross-language parity
Every language binding exposes the same threads option on every
encode/decode entry point that does CPU work. Metadata-only commands
(scan, describe, list) never accept it because they never decode
payloads.
| Entry point | Rust | Python | C FFI | C++ wrapper | CLI |
|---|---|---|---|---|---|
| `encode` / `encode_pre_encoded` | ✅ | ✅ | ✅ | ✅ | — (via subcommand) |
| `decode` / `decode_object` / `decode_range` | ✅ | ✅ | ✅ | ✅ | — (via subcommand) |
| `TensogramFile::append` | ✅ | ✅ | ✅ | ✅ | — |
| `TensogramFile::decode_message` | ✅ | ✅ | ✅ | ✅ | — |
| `TensogramFile::decode_range` | ✅ | ✅ | ✅ | ✅ | — |
| Batch decode (object/range) | ✅ | ✅ | — (not exposed in FFI) | — | — |
| `AsyncTensogramFile::*` | — (async feature, trait) | ✅ | — | — | — |
| `StreamingEncoder::new` | ✅ | ✅ | ✅ | ✅ | — |
| `tensogram merge` | — | — | — | — | ✅ (`--threads`) |
| `tensogram split` | — | — | — | — | ✅ |
| `tensogram reshuffle` | — | — | — | — | ✅ |
| `tensogram convert-grib` / `convert-netcdf` | — | — | — | — | ✅ |
| `tensogram validate` | — | — | — | — | ⚠ (flag accepted but not plumbed — IDEAS) |
| `tensogram copy` / `merge` | — | — | — | — | ✅ |
| `TENSOGRAM_THREADS` env var fallback | ✅ | ✅ | ✅ | ✅ | ✅ |
Legend: ✅ = full support, ⚠ = flag accepted but currently a no-op (tracked in IDEAS), — = not applicable at this layer.
Threshold behaviour
For very small payloads the pool-build cost (~10–100 µs) outweighs any parallelism gain. The library transparently skips the pool when the total payload bytes are below a threshold (default 64 KiB). The threshold is tunable:
#![allow(unused)]
fn main() {
EncodeOptions {
threads: 8,
parallel_threshold_bytes: Some(0), // always parallel
// parallel_threshold_bytes: Some(usize::MAX), // never parallel
..Default::default()
}
}
Axis-A vs axis-B dispatch
The threads budget is spent along one of two axes:
- Axis A — across objects. When a message carries multiple data objects and none of them uses an axis-B-friendly codec, rayon `par_iter()` runs the encode/decode pipeline for each object on a worker in parallel. Output order is preserved exactly.
- Axis B — inside one codec. When any stage is axis-B-friendly (`simple_packing` encoding, `shuffle` filter, `blosc2` or `zstd` compression), the budget flows into the codec's internal parallelism:

| Stage | How it uses the budget |
|---|---|
| `simple_packing` encode/decode | Chunked `par_iter` with byte-aligned chunk sizes — output bytes remain identical. |
| `shuffle` / `unshuffle` | Parallelise the outer `byte_idx` loop (shuffle) or output-chunk scatter (unshuffle). |
| `blosc2` | `CParams::nthreads` / `DParams::nthreads` — decompress path stays single-threaded in v0.13.0. |
| `zstd` FFI | `NbWorkers` libzstd parameter on compress; decompress is inherently sequential. |
Policy
Tensogram messages tend to carry a small number of very large objects, so the library prefers axis B when any codec can use it:
| Object count | Any object axis-B friendly? | Behaviour |
|---|---|---|
| 1 | — | Axis B (codec gets the full budget). |
| N ≥ 2 | yes | Axis B on each object sequentially. Avoids N × N thread over-subscription. |
| N ≥ 2 | no | Axis A (par_iter across objects), each codec single-threaded. |
This decision happens once per encode/decode call based on the
descriptors. Nothing is configurable beyond threads and
parallel_threshold_bytes — the policy is deterministic.
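The decision table above can be condensed into a few lines. The following is an illustrative sketch of the policy as documented, not the library's actual internals; the function name and string labels are invented for clarity:

```python
# Sketch of the documented dispatch policy. `choose_dispatch` and its
# return labels are illustrative names, not part of the Tensogram API.
def choose_dispatch(num_objects: int, any_axis_b_friendly: bool,
                    threads: int, total_bytes: int,
                    threshold: int = 64 * 1024) -> str:
    if threads == 0 or total_bytes < threshold:
        return "sequential"      # default path, or below the payload threshold
    if num_objects == 1 or any_axis_b_friendly:
        return "axis-B"          # codec-internal parallelism, objects in order
    return "axis-A"              # par_iter across objects, codecs sequential
```

The key property is that the outcome depends only on the descriptors and the two options, so the same call always dispatches the same way.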
Determinism contract
v0.13.0 makes two different promises depending on which codecs you use.
Transparent codecs — byte-identical across thread counts
These stages produce the same encoded bytes regardless of
threads:
- `encoding = "none"`
- `encoding = "simple_packing"` (at any bits-per-value)
- `filter = "none"`
- `filter = "shuffle"`
- `compression ∈ {none, lz4, szip, zfp, sz3}`
Encoded payload bytes are bit-identical for `threads ∈ {0, 1, 2, 4, 8, 16, ...}`. This is exercised by the `rust/tensogram/tests/threads_determinism.rs` integration suite.
Opaque codecs — lossless round-trip, may differ
compression ∈ {blosc2, zstd} hand off work to third-party C
libraries. When their internal thread pool is asked to run in
parallel, blocks land in the output frame in worker completion
order. The compressed bytes may therefore differ from the
sequential path — but every variant round-trips losslessly:
- Encode with `threads=8`, decode with `threads=0` → same decoded values as a pure sequential round-trip.
- Golden files (produced with `threads=0`) are still byte-for-byte stable across releases because the default path is unchanged.
Why this matters
Determinism across thread counts is the core property that lets
Tensogram users turn threads on in production without worrying
about cache keys, deduplication hashes, or reproducible builds
breaking. The invariant is tested at every layer — Rust, Python,
C FFI, C++ wrapper — with a sweep over {0, 1, 2, 4, 8}.
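The pattern that makes byte-identical output possible is worth spelling out: each chunk of work gets a fixed output slot up front, so the final bytes never depend on which worker finishes first. Here is a self-contained sketch of that order-preserving structure (illustrative only; the library uses rayon, not Python threads, and a real codec instead of the toy byte transform):

```python
from concurrent.futures import ThreadPoolExecutor

# Order-preserving chunked processing: chunk i always lands in output
# slot i, so the result is identical at any worker count. The byte
# transform here is a toy stand-in for a real per-chunk codec stage.
def process_chunked(data: bytes, workers: int, chunk: int = 4096) -> bytes:
    chunks = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, not completion order.
        parts = pool.map(lambda c: bytes((b * 3) & 0xFF for b in c), chunks)
    return b"".join(parts)
```

Opaque codecs break this property precisely because their internal block layout is decided by completion order inside the third-party library, outside Tensogram's control.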
Interaction with integrity hashing
The xxh3-64 integrity hash attached to every data object
(EncodeOptions.hash_algorithm = Some(Xxh3), on by default) is a
pure function of the final encoded bytes. Hashing runs in the
calling thread after any intra-codec parallelism has joined;
each object owns its own Xxh3Default hasher on the stack and the
hasher is never shared across threads.
As a consequence the hash follows the same contract as the encoded bytes:
| Codec class | Encoded bytes across thread counts | Hash across thread counts |
|---|---|---|
| Transparent | Byte-identical | Byte-identical |
| Opaque | May reorder compressed blocks | May differ per-run |
For opaque codecs the hash is still internally consistent —
descriptor.hash == xxh3_64(encoded_payload) always holds for the
bytes that were actually written — it just may not match a hash
computed at a different thread count. verify_hash on decode
always succeeds regardless of the threads value used at encode
time.
Since the hash is folded into the codec output in lockstep (see
plans/DONE.md → Hash-while-encoding), turning on threads has
no additional hash-computation cost beyond what threading already
does to the encoded bytes themselves.
Environment variable override
TENSOGRAM_THREADS is consulted only when the caller-provided
threads is 0. This matches the existing
TENSOGRAM_COMPRESSION_BACKEND pattern:
# One-shot invocation — every library call inherits the budget.
TENSOGRAM_THREADS=4 python my_pipeline.py
# Explicit option still wins.
tensogram.encode(meta, descs, threads=0) # sequential (env honoured)
tensogram.encode(meta, descs, threads=1) # single-threaded (env ignored)
tensogram.encode(meta, descs, threads=16) # 16 workers (env ignored)
The env var is parsed once per process (OnceLock), so changing it
mid-run has no effect.
Interaction with free-threaded Python
threads is orthogonal to Python threading. For CPython 3.13+ built
with --disable-gil, you can combine:
- Python threads — run multiple Tensogram calls concurrently.
- Tensogram threads — each call uses rayon internally.
The PyO3 bindings always release the GIL around encode/decode, so the two dimensions compose cleanly. Be careful about total thread count: N Python threads × M Tensogram threads creates N×M workers. The safest starting point is one dimension at a time.
Benchmarks and tuning
The threads-scaling benchmark measures encode/decode throughput
for 7 representative codec combinations across a sweep of thread
counts:
cargo build --release -p tensogram-benchmarks
./target/release/threads-scaling \
--num-points 16000000 \
--iterations 5 \
--warmup 2 \
--threads 0,1,2,4,8,16
Output columns (per case × thread count):
- `enc (ms)`, `dec (ms)` — median wall time over `iterations`.
- `enc MB/s`, `dec MB/s` — throughput based on the original byte size.
- `ratio` — compressed size as a percentage of original.
- `size (MiB)` — compressed size.
- `enc x`, `dec x` — speedup relative to the `threads=0` baseline.
See the Benchmark Results page for numbers on a reference machine.
Tuning recommendations
- Start with `threads=0`. The default is deterministic, well tested, and fast for small-to-medium payloads.
- Turn it on globally via env. `TENSOGRAM_THREADS=$(nproc)` is a reasonable starting point for CPU-bound data-movement pipelines. Leave the in-process tensogram calls at `threads=0` unless you need finer control per call.
- Measure before tuning. On small payloads the threshold keeps you safe, but the sweet spot for large tensors varies by codec. For simple_packing + szip, 2–4 threads already reaches diminishing returns; for blosc2 it can scale further.
- Do not stack Python threads × Tensogram threads unless you know the total fits your CPU budget. Over-subscription destroys throughput.
Benchmarks
Tensogram ships with a benchmark suite that measures all encoding and compression combinations on synthetic data. It produces tabular comparisons of speed, compressed size, and decode fidelity. The benchmarks can be re-run at any time to measure the effect of changes.
Codec Matrix Benchmark
Tests all valid encoder × compressor × bit-width combinations on 16 million synthetic float64 values.
Quick start
cargo run --release -p tensogram-benchmarks --bin codec-matrix
Override parameters with CLI flags:
cargo run --release -p tensogram-benchmarks --bin codec-matrix -- \
--num-points 16000000 \
--iterations 10 \
--warmup 3 \
--seed 42
| Flag | Default | Description |
|---|---|---|
| `--num-points` | 16 000 000 | Number of float64 values to encode |
| `--iterations` | 10 | Timed iterations per combination (median reported) |
| `--warmup` | 3 | Warm-up iterations (discarded) |
| `--seed` | 42 | PRNG seed for deterministic data generation |
Combinations measured
| Group | Description | Count |
|---|---|---|
| Baseline | No encoding, no compression | 1 |
| Lossless compressors | Raw floats compressed with zstd, LZ4, Blosc2, or szip | 4 |
| SimplePacking + lossless | Quantized to 16, 24, or 32 bits, then compressed with each of the above (or no compressor) | 15 |
| Lossy codecs | ZFP (fixed rate 16/24/32) and SZ3 (absolute error 0.01) | 4 |
| Total | 24 |
For actual results, see Benchmark Results.
How to read the results
The results page splits each benchmark into a performance table (timing, throughput, compressed size) and a fidelity table (error norms for lossy codecs).
| Column | Meaning | Better is |
|---|---|---|
| Method | Encoder + compressor. E.g. “24-bit + szip” means values are quantized to 24 bits then compressed with szip. [REF] marks the baseline. | — |
| Enc / Dec (ms) | Median encode / decode time. | Lower |
| Enc / Dec MB/s | Throughput: uncompressed size ÷ median time. | Higher |
| Ratio | Compressed size as percentage of original. 25% = compressed to ¼. Above 100% means the codec expanded the data. | Lower |
| Size (MiB) | Compressed output size. | — |
| Linf | Max absolute error (worst single value). | Smaller |
| L1 | Mean absolute error (average drift). | Smaller |
| L2 | Root mean square error (penalizes outliers). | Smaller |
For lossless codecs all three error norms are zero. Errors are absolute, in the same units as the input data.
Quick rules of thumb:
- If you need exact data back, use one of the lossless codecs.
- If you can tolerate some loss, compare Ratio vs error norms for your use case.
- Throughput (MB/s) is the most useful speed metric — it accounts for data size and lets you compare across different payload sizes.
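As a worked example of the derived columns, take the "zstd level 3" row from the Benchmark Results page (122.1 MiB input, 128.5 ms encode, 110.2 MiB output). Note that the throughput columns appear to divide MiB by seconds, matching the reported figures exactly:

```python
# Derived columns from one results row: 122.1 MiB input, 128.5 ms encode,
# 110.2 MiB compressed output (zstd level 3 on raw floats).
size_mib, enc_ms, out_mib = 122.1, 128.5, 110.2

enc_throughput = size_mib / (enc_ms / 1000)   # throughput, ~950 in the table
ratio_pct = out_mib / size_mib * 100          # compression ratio, ~90.3 %
```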
Reference Comparison: ecCodes GRIB Encoding
Scientific codecs are easiest to understand alongside an established reference. ecCodes is a widely-deployed GRIB encoder used throughout operational weather forecasting. This benchmark compares Tensogram’s 24-bit SimplePacking + szip pipeline against ecCodes’ built-in packing methods on 10 million float64 values. Both sides are timed symmetrically: encoding measures the full path from a float64 array to compressed bytes, and decoding measures the reverse.
Requirements
- ecCodes C library installed (`brew install eccodes` on macOS, `apt install libeccodes-dev` on Debian/Ubuntu)
- Build with `--features eccodes`
Quick start
cargo run --release -p tensogram-benchmarks --bin grib-comparison --features eccodes
cargo run --release -p tensogram-benchmarks --bin grib-comparison --features eccodes -- \
--num-points 10000000 \
--iterations 10 \
--warmup 3 \
--seed 42
Methods compared
| Method | Description |
|---|---|
| ecCodes CCSDS (reference) | CCSDS packing via ecCodes — a widely-deployed operational reference |
| ecCodes simple packing | Basic fixed-bit-width packing without entropy coding |
| Tensogram 24-bit + szip | Tensogram’s SimplePacking at 24 bits followed by szip entropy coding |
For actual results, see Benchmark Results.
Benchmark pipeline flow
flowchart TD
G[Generate synthetic field] --> W[Warm-up iterations]
W --> T[Timed iterations]
T --> E[Encode]
E --> D[Decode]
D --> T
T --> F[Fidelity check]
F --> R[Print report]
style G fill:#388e3c,stroke:#2e7d32,color:#fff
style T fill:#1565c0,stroke:#0d47a1,color:#fff
style F fill:#c62828,stroke:#b71c1c,color:#fff
Each timed iteration runs a full encode → decode cycle. After all iterations complete, the last decoded output is compared against the original to produce the fidelity metrics.
Things to know
Compression expansion
Some compressors (especially LZ4 on raw 64-bit floats) may produce output larger than the input (Ratio > 100%). This is normal — high-entropy data can’t always be compressed. The baseline row is a raw copy and always shows 100%.
Szip alignment
The codec matrix may round num_points up by 1–3 values for szip block alignment.
This only matters for very small inputs.
Small data sizes
With --num-points 1, timing is dominated by per-call overhead rather than
compression throughput. Use ≥ 10 000 points for meaningful comparisons.
GRIB grid shape
For prime num_points, the GRIB benchmark creates a 1 × N grid (not a realistic
near-square grid). Use composite sizes for representative results
(e.g. --num-points 10000000).
Reproducibility
The data generator is deterministic for a given --seed, so repeated runs on the
same machine produce comparable timing. Compression ratios, sizes, and fidelity
are reproducible across machines. Timing and throughput are not.
Error handling
If a single codec fails, the benchmark logs the error and continues with the remaining combinations. The summary line reports how many succeeded and failed. The CLI exits with code 1 if any combination failed.
Running in CI
For fast CI validation, pass --num-points 10000 --iterations 1 --warmup 1:
cargo run -p tensogram-benchmarks --bin codec-matrix -- \
--num-points 10000 --iterations 1 --warmup 1
The smoke test suite (cargo test -p tensogram-benchmarks) uses 500–1000 points
and completes in under 5 seconds.
Benchmark Results
This page is a snapshot of benchmark results recorded on a specific machine. For methodology, flags, and how to re-run, see Benchmarks.
Note: Timing and throughput are machine-specific. Compression ratios, sizes, and fidelity metrics are determined by the codec and are reproducible.
Run metadata
| Field | Value |
|---|---|
| Date | 2026-04-16 |
| Tensogram version | 0.13.0 |
| CPU | Apple M4, 10 cores / 10 threads |
| OS | macOS 26.3 (Darwin 25.3.0) |
| Rust | rustc 1.94.1 |
| ecCodes | 2.46.0 |
| Methodology | 10 timed iterations, 3 warmup, median reported |
Codec Matrix
16 million float64 values (122 MiB). The test data is a synthetic smooth scientific-like field with values in the range 250–310 (a profile that also matches real temperature grids and other bounded-range physical measurements).
How fidelity is measured
After each encode→decode round-trip, the decoded values are compared to the original. Three error norms are reported, all absolute in the same units as the input:
- Linf — the largest error for any single value. Answers: “what is the worst case?”
- L1 — the average error across all values. Answers: “how far off are values on average?”
- L2 (RMSE) — root mean square error. Like L1 but penalizes large outliers more heavily. Answers: “how large are the typical errors, weighted toward the worst ones?”
For lossless codecs all three are zero.
Lossless compressors on raw floats
No encoding step — raw 64-bit floats compressed directly. Decoded values are bit-identical to the original.
| Method | Enc (ms) | Dec (ms) | Enc MB/s | Dec MB/s | Ratio | Size (MiB) |
|---|---|---|---|---|---|---|
| no compression [REF] | 3.7 | 3.7 | 32818 | 33226 | 100.0% | 122.1 |
| zstd level 3 | 128.5 | 114.5 | 950 | 1066 | 90.3% | 110.2 |
| LZ4 | 8.5 | 7.4 | 14328 | 16535 | 100.4% | 122.6 |
| Blosc2 | 51.9 | 26.6 | 2350 | 4584 | 75.2% | 91.8 |
| szip | 69.7 | 206.8 | 1753 | 590 | 100.9% | 123.2 |
Raw 64-bit floats have high entropy, so most lossless compressors cannot reduce their size. LZ4 and szip slightly expand the data. Blosc2 is the exception — its byte-shuffle step exposes compressible patterns (75%).
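The byte-shuffle effect can be demonstrated with the standard library alone. The sketch below (illustrative; Blosc2's real shuffle is SIMD-accelerated and block-based) packs a smooth field as float64, then groups byte plane k of every value into a contiguous run, so the near-constant sign/exponent bytes compress well even though the raw stream does not:

```python
import math
import struct
import zlib

# Smooth temperature-like field in the 250-310 range, like the benchmark data.
values = [280.0 + 30.0 * math.sin(i / 50.0) for i in range(20_000)]
raw = struct.pack(f"<{len(values)}d", *values)

# Byte shuffle: plane k holds byte k of every little-endian float64.
# Planes 6-7 (exponent + high mantissa) vary slowly; low planes look random.
shuffled = b"".join(raw[k::8] for k in range(8))

plain_ratio = len(zlib.compress(raw, 6)) / len(raw)
shuffled_ratio = len(zlib.compress(shuffled, 6)) / len(raw)
```

On smooth data the shuffled stream compresses noticeably better than the raw one, which is the same mechanism behind Blosc2's 75% ratio in the table above.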
SimplePacking (quantization) + lossless compressors
Values are quantized to N bits, then compressed. Fidelity depends only on the bit width, not on the compressor — see the fidelity table below.
| Method | Enc (ms) | Dec (ms) | Enc MB/s | Dec MB/s | Ratio | Size (MiB) |
|---|---|---|---|---|---|---|
| 16-bit only | 17.3 | 15.1 | 7039 | 8078 | 25.0% | 30.5 |
| 16-bit + zstd | 54.2 | 36.2 | 2254 | 3375 | 24.4% | 29.7 |
| 16-bit + LZ4 | 19.7 | 22.2 | 6204 | 5493 | 25.1% | 30.6 |
| 16-bit + Blosc2 | 115.2 | 31.5 | 1060 | 3873 | 20.3% | 24.8 |
| 16-bit + szip | 53.9 | 99.3 | 2263 | 1229 | 14.6% | 17.8 |
| 24-bit only | 19.2 | 17.1 | 6347 | 7135 | 37.5% | 45.8 |
| 24-bit + zstd | 67.3 | 41.1 | 1813 | 2969 | 37.2% | 45.4 |
| 24-bit + LZ4 | 31.5 | 23.5 | 3871 | 5188 | 37.6% | 46.0 |
| 24-bit + Blosc2 | 124.9 | 40.0 | 978 | 3052 | 32.8% | 40.0 |
| 24-bit + szip | 63.3 | 133.5 | 1928 | 914 | 27.2% | 33.2 |
| 32-bit only | 21.2 | 25.3 | 5771 | 4825 | 50.0% | 61.0 |
| 32-bit + zstd | 97.8 | 37.0 | 1248 | 3299 | 49.8% | 60.8 |
| 32-bit + LZ4 | 37.1 | 45.1 | 3287 | 2706 | 50.2% | 61.3 |
| 32-bit + Blosc2 | 141.0 | 38.3 | 866 | 3183 | 45.3% | 55.3 |
| 32-bit + szip | 69.8 | 157.4 | 1748 | 775 | 39.7% | 48.4 |
Fidelity by bit width
| Bit width | Linf (max abs) | L1 (mean abs) | L2 (RMSE) |
|---|---|---|---|
| 16 bits | 4.9 × 10⁻⁴ | 2.4 × 10⁻⁴ | 2.8 × 10⁻⁴ |
| 24 bits | 1.9 × 10⁻⁶ | 9.5 × 10⁻⁷ | 1.1 × 10⁻⁶ |
| 32 bits | 7.5 × 10⁻⁹ | 3.7 × 10⁻⁹ | 4.3 × 10⁻⁹ |
For context: with input values around 280, a Linf of 1.9 × 10⁻⁶ means the worst-case relative error at 24 bits is roughly 7 parts per billion.
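The "7 parts per billion" figure is a one-line calculation from the table:

```python
# Worst-case 24-bit error relative to a typical field value of ~280.
linf_24 = 1.9e-6
rel = linf_24 / 280.0   # ~6.8e-9, i.e. roughly 7 parts per billion
```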
Lossy floating-point compressors
These operate directly on raw f64 bytes without quantization.
| Method | Enc (ms) | Dec (ms) | Enc MB/s | Dec MB/s | Ratio | Size (MiB) |
|---|---|---|---|---|---|---|
| ZFP rate 16 | 220.1 | 304.2 | 555 | 401 | 25.0% | 30.5 |
| ZFP rate 24 | 248.0 | 468.5 | 492 | 261 | 37.5% | 45.8 |
| ZFP rate 32 | 288.0 | 581.0 | 424 | 210 | 50.0% | 61.0 |
| SZ3 abs 0.01 | 131.4 | 141.0 | 929 | 865 | 6.5% | 7.9 |
Fidelity by lossy codec
| Method | Linf (max abs) | L1 (mean abs) | L2 (RMSE) |
|---|---|---|---|
| ZFP rate 16 | 1.3 × 10⁻² | 1.6 × 10⁻³ | 2.0 × 10⁻³ |
| ZFP rate 24 | 5.6 × 10⁻⁵ | 6.1 × 10⁻⁶ | 7.9 × 10⁻⁶ |
| ZFP rate 32 | 1.9 × 10⁻⁷ | 2.4 × 10⁻⁸ | 3.1 × 10⁻⁸ |
| SZ3 abs 0.01 | 1.0 × 10⁻² | 5.0 × 10⁻³ | 5.8 × 10⁻³ |
Notable observations
- 16-bit + szip achieves the best compression ratio (14.6%) among the SimplePacking combinations.
- SZ3 achieves the smallest output overall (6.5%) with a max error of 0.01. If your application tolerates that error bound, this gives the best compression in this benchmark.
- In this benchmark, higher ZFP rates gave proportionally smaller errors. ZFP fixed-rate modes always hit their target ratio exactly (25% / 37.5% / 50%).
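The exact ZFP ratios follow directly from the fixed-rate definition: at rate R, each float64 value is stored in R bits, so the ratio is R / 64.

```python
# Fixed-rate ZFP stores `rate` bits per value; input values are 64-bit.
ratios = {rate: rate / 64 * 100 for rate in (16, 24, 32)}
# 16 -> 25.0 %, 24 -> 37.5 %, 32 -> 50.0 %, matching the table exactly.
```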
Reference Comparison: ecCodes GRIB Encoding
GRIB is a binary format widely used in operational weather forecasting, and ecCodes (from ECMWF) is a common implementation. Comparing against it gives a concrete, reproducible reference point for Tensogram’s quantisation + entropy-coding pipeline.
This benchmark runs Tensogram’s 24-bit SimplePacking + szip and ecCodes’ built-in packing methods on the same input. Both sides are timed end-to-end: from a float64 array to serialised compressed bytes (encode), and back (decode).
10 million float64 values (76 MiB), 24-bit packing. Different dataset size from the codec matrix above.
| Method | Enc (ms) | Dec (ms) | Enc MB/s | Dec MB/s | Ratio | Size (MiB) |
|---|---|---|---|---|---|---|
| ecCodes CCSDS [REF] | 47.9 | 84.8 | 1594 | 900 | 27.2% | 20.8 |
| ecCodes simple packing | 32.6 | 7.9 | 2339 | 9660 | 37.5% | 28.6 |
| Tensogram 24-bit + szip | 43.7 | 80.4 | 1745 | 950 | 27.4% | 20.9 |
All three methods produce identical fidelity: Linf = 1.9 × 10⁻⁶, L1 = 9.5 × 10⁻⁷, L2 = 1.1 × 10⁻⁶.
Notable observations
- Tensogram and ecCodes CCSDS achieve nearly identical compression (27.4% vs 27.2%) and identical fidelity at 24 bits.
- Tensogram encode is now slightly faster than ecCodes CCSDS (43.7 vs 47.9 ms) on this machine; decode is comparable (80.4 vs 84.8 ms).
- ecCodes simple packing decodes fastest (7.9 ms) but produces a larger file (37.5% vs 27%).
Threading Scaling
The v0.13.0 multi-threaded coding pipeline lets callers spend a
threads budget on encode/decode work. Results here show the effect
of sweeping threads ∈ {0, 1, 2, 4, 8} on 16M f64 values (122 MiB)
for seven representative codec combinations. threads=0 is the
sequential baseline; speedups are measured against it.
Reminder: Transparent codecs (no codec, simple_packing, szip, lz4, zfp, sz3, shuffle) produce byte-identical encoded payloads across thread counts. Opaque codecs (blosc2, zstd with
nb_workers > 0) may produce different compressed bytes while always round-tripping losslessly.
Lossless (no encoding)
| Method | Metric | threads=0 | threads=1 | threads=2 | threads=4 | threads=8 |
|---|---|---|---|---|---|---|
| none+none | enc MB/s | 32818 | 35929 | 36801 | 35173 | 35520 |
| none+none | speedup | 1.00x | 1.09x | 1.12x | 1.07x | 1.08x |
| none+lz4 | enc MB/s | 7733 | 3619 | 3559 | 2029 | 2513 |
| none+lz4 | speedup | 1.00x | 0.47x | 0.46x | 0.26x | 0.32x |
| none+zstd(3) | enc MB/s | 942 | 1163 | 2075 | 2259 | 1839 |
| none+zstd(3) | speedup | 1.00x | 1.23x | 2.20x | 2.40x | 1.95x |
| none+blosc2(lz4) | enc MB/s | 3150 | 3140 | 5030 | 7458 | 8906 |
| none+blosc2(lz4) | speedup | 1.00x | 1.00x | 1.60x | 2.37x | 2.83x |
SimplePacking + compression
| Method | Metric | threads=0 | threads=1 | threads=2 | threads=4 | threads=8 |
|---|---|---|---|---|---|---|
| sp(16)+none | enc MB/s | 12964 | 13268 | 15584 | 15643 | 14612 |
| sp(16)+none | enc speedup | 1.00x | 1.02x | 1.20x | 1.21x | 1.13x |
| sp(16)+none | dec speedup | 1.00x | 1.14x | 2.37x | 2.34x | 2.18x |
| sp(24)+szip | enc MB/s | 2273 | 2263 | 2351 | 2389 | 2427 |
| sp(24)+szip | speedup | 1.00x | 1.00x | 1.03x | 1.05x | 1.07x |
| sp(24)+blosc2(lz4) | enc MB/s | 2371 | 2350 | 3965 | 5554 | 6388 |
| sp(24)+blosc2(lz4) | enc speedup | 1.00x | 0.99x | 1.67x | 2.34x | 2.69x |
Notable observations
- Memory-bound baselines (none+none, none+lz4) do not scale. The parallel dispatch overhead outweighs any gain when the work per task is already at memory bandwidth. `none+lz4` actually regresses — leave `threads=0` for lz4-only workloads.
- blosc2 scales best. Encoding with blosc2+lz4 reaches 2.8× on 8 threads; the sp(24)+blosc2 combination reaches 2.7× on encode and 1.3× on decode.
- zstd scales ~2.4× on encode at 4 threads via libzstd's `NbWorkers`. Beyond 4 threads the benefit plateaus on this CPU.
- simple_packing decode is 2.3× faster at 2+ threads — the internal chunk-parallel scatter saturates memory bandwidth quickly.
- szip is single-threaded. The marginal gains shown for `sp(24)+szip` come from parallelising the `simple_packing` stage only; szip itself runs sequentially in v0.13.0.
The raw numbers above were produced by the threads-scaling binary
in rust/benchmarks. Re-run locally with:
cargo build --release -p tensogram-benchmarks
./target/release/threads-scaling \
--num-points 16000000 \
--iterations 5 \
--warmup 2 \
--threads 0,1,2,4,8
Simple Packing
Simple packing is a lossy quantisation technique derived from GRIB’s simple-packing method. It quantises a range of floating-point values into N-bit integers, dramatically reducing payload size at the cost of precision.
A 16-bit simple_packing payload is 4× smaller than the equivalent float64 and 2× smaller than float32, with precision loss typically below instrument noise for most bounded-range scientific measurements (temperatures, voltages, pressures, intensity counts).
How It Works
Given a set of float64 values V[i]:
- Find the minimum value `R` (the reference value).
- Scale all values relative to `R`: `Y[i] = (V[i] - R) × 10^D × 2^-E`
- Round `Y[i]` to the nearest integer and pack it into `B` bits (MSB first).
The parameters D (decimal scale factor), E (binary scale factor), and B (bits per value) are chosen automatically by compute_params().
flowchart TD
A["Input: V = [250.0, 251.3, 252.7]"]
B["Find reference value
R = min(V) = 250.0"]
C["Scale relative to R
[0, 1.3, 2.7] × 10^D × 2^−E"]
D["Round to integers
[0, 17369, 36044]"]
E["Pack as 16-bit MSB
00 00 43 99 8C 8C"]
A --> B --> C --> D --> E
style A fill:#388e3c,stroke:#2e7d32,color:#fff
style E fill:#1565c0,stroke:#0d47a1,color:#fff
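The three steps above fit in a few lines of pure Python. This is a minimal reference sketch, not the library's implementation: the real `compute_params()` also optimizes the decimal scale factor and handles the edge cases described below, while here `D` is fixed and `E` is simply chosen so the largest scaled value fits in the bit budget.

```python
import math

def pack(values, bits, D=0):
    """Quantize floats to `bits`-bit integers (simplified simple packing)."""
    R = min(values)                       # reference value
    span = (max(values) - R) * 10**D
    # Choose E so the largest scaled value fits in `bits` bits.
    E = 0 if span == 0 else math.ceil(math.log2(span / (2**bits - 1)))
    ints = [round((v - R) * 10**D / 2**E) for v in values]
    return R, D, E, ints

def unpack(R, D, E, ints):
    """Invert the quantization: V[i] ≈ n * 2^E / 10^D + R."""
    return [n * 2**E / 10**D + R for n in ints]
```

With the flowchart's input `[250.0, 251.3, 252.7]` at 16 bits, the round-trip error is bounded by half a quantization step (`2^(E-1)`); a constant field degenerates to all-zero integers and reconstructs exactly.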
Limitations and Edge Cases
NaN and ±Infinity are Rejected
compute_params() and encode() return an error if the data
contains any NaN or ±Infinity values. Simple packing has no
representation for non-finite numbers (unlike IEEE 754 floats), and
feeding Inf through the range / scale-factor derivation would
produce an i32::MAX-saturated binary_scale_factor that silently
decodes to NaN everywhere. Both are errors at the codec entry:
- NaN → `PackingError::NanValue(index)`
- +Inf / -Inf → `PackingError::InfiniteValue(index)`
Remove or replace non-finite values before encoding. If you want
to preserve them, switch to encoding="none" and opt in to the NaN
/ Inf bitmask companion via allow_nan=true / allow_inf=true —
see NaN / Inf Handling for the full
semantics. Simple packing cannot represent non-finite values at
all, so the mask companion is only available on the pass-through
encoding path.
#![allow(unused)]
fn main() {
use tensogram_encodings::simple_packing::compute_params;
// Both rejected:
let with_nan = vec![1.0_f64, 2.0, f64::NAN, 4.0];
let with_inf = vec![1.0_f64, 2.0, f64::INFINITY, 4.0];
assert!(compute_params(&with_nan, 16, 0).is_err());
assert!(compute_params(&with_inf, 16, 0).is_err());
}
Params Safety Net
Beyond input-value validation, encode() also checks the
SimplePackingParams it receives:
- `reference_value` must be finite (NaN/±Inf → error).
- `|binary_scale_factor| ≤ 256`. The threshold catches the `i32::MAX`-saturation fingerprint from feeding `Inf` through `compute_params` indirectly; real-world data (`|bsf| ≤ 60`) fits comfortably. The constant `MAX_REASONABLE_BINARY_SCALE = 256` is exported from `tensogram_encodings::simple_packing`.
This closes the standalone-API footgun where a caller constructs or
mutates SimplePackingParams directly rather than deriving them
from compute_params. Both failures surface as
PackingError::InvalidParams { field, reason } with a clear message
naming the offending field.
Constant Fields
If all values are identical (range = 0), compute_params() succeeds and stores everything in the reference value. All packed integers are 0. Decoding reconstructs the constant correctly.
bits_per_value Range
Valid range: 0 to 64. More than 64 bits is rejected. Zero bits is accepted — compute_params stores the first value as the reference value (not the minimum) and encode produces an empty byte buffer. Decode reconstructs the reference value for every element, so this is only lossless for constant fields. Typical range for scientific floating-point data is 8–24 bits.
| bits_per_value | Packed values | Precision vs float64 |
|---|---|---|
| 8 | 256 levels | Coarse (rough categories) |
| 16 | 65,536 levels | Good for temperature, wind |
| 24 | 16,777,216 levels | Near-float32 precision |
| 32 | ~4 billion levels | Near-float64 for most ranges |
API
compute_params
#![allow(unused)]
fn main() {
pub fn compute_params(
values: &[f64],
bits_per_value: u32,
decimal_scale_factor: i32,
) -> Result<SimplePackingParams, PackingError>
}
Computes the optimal packing parameters for the given data. Call this once before encoding.
#![allow(unused)]
fn main() {
let values: Vec<f64> = (0..1000).map(|i| 250.0 + i as f64 * 0.01).collect();
let params = compute_params(&values, 16, 0)?;
println!("reference_value: {}", params.reference_value);
println!("binary_scale_factor: {}", params.binary_scale_factor);
println!("bits_per_value: {}", params.bits_per_value);
}
encode
#![allow(unused)]
fn main() {
pub fn encode(
values: &[f64],
params: &SimplePackingParams,
) -> Result<Vec<u8>, PackingError>
}
Encodes f64 values to a packed byte buffer using the given parameters.
decode
#![allow(unused)]
fn main() {
pub fn decode(
packed: &[u8],
num_values: usize,
params: &SimplePackingParams,
) -> Result<Vec<f64>, PackingError>
}
Decodes a packed buffer back to f64 values. The num_values parameter is required because the byte length alone is not enough to determine the element count (bits per value may not divide evenly into bytes).
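The ambiguity is easy to see with the length arithmetic (`packed_len` is a hypothetical helper for illustration, not part of the API):

```rust
// Packed byte length is ceil(num_values * bits_per_value / 8).
// The mapping is not invertible, so decode() must be told num_values.
fn packed_len(num_values: usize, bits_per_value: usize) -> usize {
    (num_values * bits_per_value + 7) / 8
}

fn main() {
    // 7 and 8 values at 5 bits each both pack into 5 bytes:
    assert_eq!(packed_len(7, 5), 5);
    assert_eq!(packed_len(8, 5), 5);
}
```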
Precision Example
Consider a bounded-range scalar field spanning 90 units (e.g. a temperature field 220–310 K, a pressure field 950–1040 hPa, or any analogous bounded scientific quantity):
| bits_per_value | Step size | Max error |
|---|---|---|
| 8 | 0.353 units | ±0.18 units |
| 12 | 0.022 units | ±0.011 units |
| 16 | 0.00137 units | ±0.00069 units |
At 16 bits, the error is smaller than most practical sensor precisions. The same analysis applies to any physical quantity with a bounded dynamic range.
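The table's arithmetic is a one-liner (a quick sketch, not a library function):

```rust
// Quantisation step for a bounded range split across 2^bits - 1 intervals;
// the worst-case round-trip error is half a step.
fn step_size(range: f64, bits: u32) -> f64 {
    range / ((1u64 << bits) - 1) as f64
}

fn main() {
    for bits in [8u32, 12, 16] {
        let step = step_size(90.0, bits);
        println!("{bits:2} bits: step {step:.5}, max error ±{:.5}", step / 2.0);
    }
}
```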
Full Integration Example
#![allow(unused)]
fn main() {
use tensogram::{encode, decode, GlobalMetadata, DataObjectDescriptor,
ByteOrder, Dtype, EncodeOptions, DecodeOptions};
use tensogram_encodings::simple_packing;
use ciborium::Value;
use std::collections::BTreeMap;
// Source data: 1000 temperature values
let values: Vec<f64> = (0..1000).map(|i| 273.0 + i as f64 * 0.05).collect();
let raw: Vec<u8> = values.iter().flat_map(|v| v.to_ne_bytes()).collect();
// Compute packing parameters
let params = simple_packing::compute_params(&values, 16, 0).unwrap();
// Build descriptor with packing params
let mut p = BTreeMap::new();
p.insert("reference_value".into(), Value::Float(params.reference_value));
p.insert("binary_scale_factor".into(),
Value::Integer((params.binary_scale_factor as i64).into()));
p.insert("decimal_scale_factor".into(),
Value::Integer((params.decimal_scale_factor as i64).into()));
p.insert("bits_per_value".into(),
Value::Integer((params.bits_per_value as i64).into()));
let desc = DataObjectDescriptor {
obj_type: "ntensor".into(),
ndim: 1,
shape: vec![1000],
strides: vec![1],
dtype: Dtype::Float64,
byte_order: ByteOrder::Big,
encoding: "simple_packing".into(),
filter: "none".into(),
compression: "none".into(),
params: p,
hash: None,
};
let global = GlobalMetadata { version: 2, ..Default::default() };
let msg = encode(&global, &[(&desc, &raw)], &EncodeOptions::default()).unwrap();
println!("Packed size: {} bytes (was {} bytes)", msg.len(), raw.len());
let (_, objects) = decode(&msg, &DecodeOptions::default()).unwrap();
let decoded: Vec<f64> = objects[0].1.chunks_exact(8)
.map(|c| f64::from_ne_bytes(c.try_into().unwrap()))
.collect();
// Check precision
for (orig, dec) in values.iter().zip(decoded.iter()) {
assert!((orig - dec).abs() < 0.001);
}
}
Byte Shuffle Filter
The shuffle filter rearranges the bytes of a multi-byte array to improve compression. It is the same algorithm used by HDF5 and NetCDF4.
Why Shuffle Helps
For float32 data, each value occupies 4 bytes. The bytes within a float are not independent — nearby values tend to share their most-significant bytes (exponent + high mantissa) while the least-significant bytes are more random.
Without shuffle, the bytes are interleaved:
[B0 B1 B2 B3][B0 B1 B2 B3][B0 B1 B2 B3]...
A compressor sees B0 B1 B2 B3 B0 B1 B2 B3 B0 B1 B2 B3 ... — not very compressible because the predictable (B0, B1) bytes are mixed with the random (B3) bytes.
After shuffle, all byte-0s come first, then all byte-1s, etc.:
[B0 B0 B0 ...][B1 B1 B1 ...][B2 B2 B2 ...][B3 B3 B3 ...]
Now the B0 run and B1 run are highly compressible (long runs of similar values). The B3 run is still noisy, but it’s isolated. Overall compression improves significantly.
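The rearrangement is a plain byte transpose. A minimal sketch (the library's `shuffle()` performs the same permutation, with error handling instead of an assert):

```rust
// Transpose an n × element_size byte matrix: gather byte position b of
// every element into one contiguous run.
fn shuffle_bytes(data: &[u8], element_size: usize) -> Vec<u8> {
    assert_eq!(data.len() % element_size, 0);
    let n = data.len() / element_size;
    let mut out = vec![0u8; data.len()];
    for i in 0..n {
        for b in 0..element_size {
            out[b * n + i] = data[i * element_size + b];
        }
    }
    out
}

fn main() {
    // Two 2-byte elements [A0 A1][B0 B1] become [A0 B0][A1 B1]
    assert_eq!(shuffle_bytes(&[0xA0, 0xA1, 0xB0, 0xB1], 2),
               vec![0xA0, 0xB0, 0xA1, 0xB1]);
}
```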
API
shuffle
#![allow(unused)]
fn main() {
pub fn shuffle(data: &[u8], element_size: usize) -> Result<Vec<u8>, ShuffleError>
}
Rearranges bytes. element_size is the byte width of each element (e.g. 4 for float32, 8 for float64).
#![allow(unused)]
fn main() {
let floats: Vec<f32> = vec![1.0, 2.0, 3.0, 4.0];
let raw: Vec<u8> = floats.iter().flat_map(|f| f.to_ne_bytes()).collect();
let shuffled = shuffle(&raw, 4)?;
// shuffled is ready for compression
}
unshuffle
#![allow(unused)]
fn main() {
pub fn unshuffle(data: &[u8], element_size: usize) -> Result<Vec<u8>, ShuffleError>
}
Reverses the shuffle. Applied automatically by the decode pipeline.
Using Shuffle in a Message
Set filter: "shuffle" in the DataObjectDescriptor and provide shuffle_element_size:
#![allow(unused)]
fn main() {
use ciborium::Value;
let mut params = BTreeMap::new();
params.insert(
"shuffle_element_size".to_string(),
Value::Integer(4.into()), // 4 bytes per float32
);
let desc = DataObjectDescriptor {
obj_type: "ntensor".to_string(),
ndim: 1,
shape: vec![100],
strides: vec![1],
dtype: Dtype::Float32,
byte_order: ByteOrder::Big,
encoding: "none".to_string(),
filter: "shuffle".to_string(),
compression: "none".to_string(),
params,
hash: None,
};
}
Edge Cases
Element Size Must Divide the Buffer
The shuffle operation requires data.len() % element_size == 0. If this is not true, the function returns Err(ShuffleError::Misaligned). Ensure your data buffer is a whole number of elements.
Shuffle Alone Does Not Compress
Shuffle rearranges bytes but does not reduce the total byte count. It only helps when followed by a compression stage (e.g. szip, zstd, lz4, blosc2). Set compression in the descriptor to apply compression after the shuffle step.
Combining with simple_packing
When using both encoding: "simple_packing" and filter: "shuffle", the pipeline applies them in order: encode first, then shuffle. The simple_packing output is an MSB-first bitstream with no multi-byte element structure, so shuffle_element_size should be 1 in this case (there is no benefit from shuffling already-packed data). In practice, the combination is unusual — either use simple_packing alone (when quantising float values) or shuffle alone (before a lossless compressor).
Compression
Compression is the third stage of the encoding pipeline. It reduces the total byte count of the already-encoded and filtered payload.
Supported Compressors
| Compressor | Type | Random Access | Notes |
|---|---|---|---|
| `none` | Pass-through | Yes (trivial) | No compression |
| `szip` | Lossless | Yes (RSI blocks) | CCSDS 121.0-B-3 via libaec. Best for integer/packed data |
| `zstd` | Lossless | No | Zstandard. Excellent ratio/speed tradeoff |
| `lz4` | Lossless | No | Fastest decompression. Good for real-time pipelines |
| `blosc2` | Lossless | Yes (chunks) | Multi-codec meta-compressor with chunk-level access |
| `zfp` | Lossy | Yes (fixed-rate) | Purpose-built for floating-point arrays |
| `sz3` | Lossy | No | Error-bounded lossy compression for scientific data |
The Compressor Trait
All compressors implement a common interface with three operations:
#![allow(unused)]
fn main() {
pub trait Compressor {
fn compress(&self, data: &[u8]) -> Result<CompressResult, CompressionError>;
fn decompress(&self, data: &[u8], expected_size: usize) -> Result<Vec<u8>, CompressionError>;
fn decompress_range(
&self,
data: &[u8],
block_offsets: &[u64],
byte_pos: usize,
byte_size: usize,
) -> Result<Vec<u8>, CompressionError>;
}
}
decompress_range enables partial decode without decompressing the entire payload. Compressors that don’t support it return CompressionError::RangeNotSupported.
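To make the contract concrete, here is a hypothetical pass-through implementation. The `CompressResult` and `CompressionError` definitions below are simplified stand-ins for the real tensogram_encodings types, included only so the sketch is self-contained:

```rust
// Simplified stand-ins for the library's types (illustrative only).
pub struct CompressResult { pub bytes: Vec<u8>, pub block_offsets: Option<Vec<u64>> }
#[derive(Debug)]
pub enum CompressionError { RangeNotSupported }

pub trait Compressor {
    fn compress(&self, data: &[u8]) -> Result<CompressResult, CompressionError>;
    fn decompress(&self, data: &[u8], expected_size: usize) -> Result<Vec<u8>, CompressionError>;
    fn decompress_range(&self, data: &[u8], block_offsets: &[u64],
                        byte_pos: usize, byte_size: usize) -> Result<Vec<u8>, CompressionError>;
}

struct NoopCompressor;

impl Compressor for NoopCompressor {
    fn compress(&self, data: &[u8]) -> Result<CompressResult, CompressionError> {
        Ok(CompressResult { bytes: data.to_vec(), block_offsets: None })
    }
    fn decompress(&self, data: &[u8], _expected_size: usize) -> Result<Vec<u8>, CompressionError> {
        Ok(data.to_vec())
    }
    fn decompress_range(&self, data: &[u8], _block_offsets: &[u64],
                        byte_pos: usize, byte_size: usize) -> Result<Vec<u8>, CompressionError> {
        // Pass-through is trivially random-access: slice the requested range.
        Ok(data[byte_pos..byte_pos + byte_size].to_vec())
    }
}

fn main() {
    let c = NoopCompressor;
    let out = c.decompress_range(b"hello world", &[], 6, 5).unwrap();
    assert_eq!(out, b"world".to_vec());
}
```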
Lossless Compressors
Szip (libaec)
Szip implements CCSDS 121.0-B-3, a lossless compressor designed for scientific data. It works on integer data and exploits the block structure of packed values.
Random access: Szip records RSI (Reference Sample Interval) block boundaries during encoding. These offsets are stored in metadata as szip_block_offsets, enabling seek-to-block partial decode via decompress_range. When using encode_pre_encoded, the caller must provide these bit-precise block offsets themselves to enable random access (see Pre-encoded Payloads).
| Parameter | Type | Description |
|---|---|---|
| `szip_rsi` | uint | Reference sample interval (samples per RSI block) |
| `szip_block_size` | uint | Block size (typically 8 or 16) |
| `szip_flags` | uint | AEC encoding flags (e.g., `AEC_DATA_PREPROCESS`) |
| `szip_block_offsets` | array of uint | Bit offsets of RSI block boundaries (computed during encoding) |
Important: libaec encodes integers only. For floating-point data, use either:
- `simple_packing` → `szip` (lossy quantization to integers, then compress)
- `shuffle` → `szip` (byte rearrangement, then compress as uint8)
Zstd (Zstandard)
General-purpose lossless compression with excellent ratio/speed tradeoff. Widely used and well-optimized.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `zstd_level` | int | 3 | Compression level (1-22). Higher = better ratio, slower |
No random access — decode_range is not supported with zstd.
LZ4
Fastest decompression of any compressor in the library. Slightly lower compression ratio than Zstd, but 3-5x faster to decompress.
No configurable parameters. No random access.
Blosc2
A meta-compressor that splits data into independently-compressed chunks, then stores them in a frame. Supports multiple internal codecs.
Random access: Because each chunk is independent, Blosc2 can decompress only the chunks covering the requested byte range. decompress_range works by mapping byte offsets to chunk indices.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `blosc2_codec` | string | "lz4" | Internal codec: blosclz, lz4, lz4hc, zlib, zstd |
| `blosc2_clevel` | int | 5 | Compression level (0-9) |
| `blosc2_typesize` | uint | (auto) | Element byte width for shuffle optimization |
`blosc2_typesize` is automatically computed from the preceding pipeline stage: dtype byte width for unencoded data, 1 for shuffled bytes, or packed byte width for simple_packing output.
Lossy Compressors
ZFP
Purpose-built compression for floating-point arrays. ZFP compresses data in blocks of 4 elements (1D) and supports three modes:
| Mode | Parameter | Description |
|---|---|---|
| `fixed_rate` | `zfp_rate` (float) | Fixed bits per value. Enables O(1) random access |
| `fixed_precision` | `zfp_precision` (uint) | Fixed number of uncompressed bit planes |
| `fixed_accuracy` | `zfp_tolerance` (float) | Maximum absolute error bound |
Random access: In fixed-rate mode, every block compresses to exactly the same number of bits. This means the byte offset of any block is computable from its index, enabling decompress_range without stored block offsets.
| Parameter | Type | Description |
|---|---|---|
| `zfp_mode` | string | One of "fixed_rate", "fixed_precision", "fixed_accuracy" |
| `zfp_rate` | float | Bits per value (only for fixed_rate) |
| `zfp_precision` | uint | Bit planes to keep (only for fixed_precision) |
| `zfp_tolerance` | float | Max absolute error (only for fixed_accuracy) |
Important: ZFP operates directly on floating-point data. Use `encoding: "none"` and `filter: "none"` — ZFP replaces both encoding and compression.
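The fixed-rate offset arithmetic can be sketched as follows (illustrative reasoning, not the ZFP API; assumes 1-D blocks of 4 values, as described above):

```rust
// In fixed-rate mode every 4-value block occupies exactly 4 * rate bits,
// so a block's position is pure arithmetic on its index — no seek table.
fn block_bit_offset(element_index: usize, rate_bits_per_value: usize) -> usize {
    (element_index / 4) * 4 * rate_bits_per_value
}

fn main() {
    assert_eq!(block_bit_offset(0, 8), 0);   // element 0 is in block 0
    assert_eq!(block_bit_offset(5, 8), 32);  // element 5 lives in block 1
}
```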
SZ3
Error-bounded lossy compression for scientific data. SZ3 uses prediction-based methods (interpolation, Lorenzo, regression) to achieve high compression ratios within strict error bounds.
| Parameter | Type | Description |
|---|---|---|
| `sz3_error_bound_mode` | string | One of "abs", "rel", "psnr" |
| `sz3_error_bound` | float | Error bound value (meaning depends on mode) |
Error bound modes:
- `abs` — Absolute error: `|original - decompressed| <= bound` for every element
- `rel` — Relative error: `|original - decompressed| / value_range <= bound`
- `psnr` — Peak signal-to-noise ratio lower bound
No random access — decode_range is not supported with SZ3.
Important: Like ZFP, SZ3 operates on floating-point data. Use `encoding: "none"` and `filter: "none"`.
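As a sketch of what the abs guarantee means per element (an illustrative check, not the SZ3 API):

```rust
// sz3's "abs" mode promises every element's round-trip error stays within
// the bound; this is the check a consumer could apply after decompression.
fn within_abs_bound(original: &[f64], decoded: &[f64], bound: f64) -> bool {
    original.iter().zip(decoded).all(|(a, b)| (a - b).abs() <= bound)
}

fn main() {
    let orig = [1.00, 2.00, 3.00];
    let lossy = [1.01, 1.99, 3.02]; // per-element errors: 0.01, 0.01, 0.02
    assert!(within_abs_bound(&orig, &lossy, 0.05));
    assert!(!within_abs_bound(&orig, &lossy, 0.01));
}
```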
Choosing a Compressor
flowchart TD
A{"Data type?"}
A -->|"Integer / packed"| B{"Need random access?"}
A -->|"Float, lossy OK"| C{"Need random access?"}
A -->|"Float, lossless"| D{"Speed priority?"}
B -->|Yes| E["szip"]
B -->|No| F{"Speed or ratio?"}
F -->|Speed| G["lz4"]
F -->|Ratio| H["zstd"]
C -->|Yes| I["zfp (fixed_rate)"]
C -->|No| J{"Error bound type?"}
J -->|"Bits/precision"| K["zfp"]
J -->|"Absolute/relative"| L["sz3"]
D -->|"Fastest decompress"| M["lz4"]
D -->|"Best ratio"| N["blosc2 or zstd"]
D -->|"Need random access"| O["blosc2"]
style E fill:#388e3c,stroke:#2e7d32,color:#fff
style I fill:#388e3c,stroke:#2e7d32,color:#fff
style O fill:#388e3c,stroke:#2e7d32,color:#fff
| Use case | Recommended | Why |
|---|---|---|
| Quantised floats with partial-access support | simple_packing + szip | RSI-block random access; interoperable with GRIB 2 CCSDS packing |
| Real-time streaming | lz4 | Fastest decompression, low latency |
| Archival storage | zstd (level 9-15) | Best lossless ratio |
| ML model weights | blosc2 | Chunk random access, good for large tensors |
| Float fields, lossy OK | zfp (fixed_rate) | Best lossy ratio with random access |
| Error-bounded science | sz3 (abs) | Guaranteed error bounds per element |
| Exact integers | none or lz4 | No information loss |
Invalid Combinations
Some pipeline combinations are rejected at configuration time:
| Combination | Rejected? | Reason |
|---|---|---|
| `zfp` + `shuffle` | Yes | ZFP operates on typed floats; shuffle rearranges bytes |
| `zfp` + `simple_packing` | Yes | ZFP is itself the encoding for floats |
| `sz3` + `shuffle` | Yes | SZ3 operates on typed data |
| `sz3` + `simple_packing` | Yes | SZ3 is itself a lossy encoding for floats |
| `shuffle` + `decode_range` | Yes | Byte rearrangement breaks contiguous sample ranges |
| `zstd`/`lz4`/`sz3` + `decode_range` | Yes | Stream compressors don't support partial decode |
tensogram info
Displays a summary of a Tensogram file: number of messages, total file size, and format version.
Usage
tensogram info [FILES]...
Options
| Option | Description |
|---|---|
| `-h, --help` | Print help |
Example
$ tensogram info forecast.tgm
Messages : 48
File size: 1.2 GB
Version : 1
What it Shows
| Field | Description |
|---|---|
| Messages | Total number of valid messages found by scanning the file |
| File size | Raw byte count of the file on disk |
| Version | Format version from the first message’s metadata |
Notes
- The scan counts only valid messages (those with a matching `TENSOGRM` header and `39277777` terminator). Corrupted regions are skipped.
- If the file is empty, `Messages: 0` is shown.
- Version is read from the first message. If messages have different versions, only the first is shown.
tensogram ls
Lists messages in a Tensogram file, showing metadata in tabular or JSON format.
Usage
tensogram ls [OPTIONS] [FILES]...
Options
| Option | Description |
|---|---|
| `-w <WHERE_CLAUSE>` | Where-clause filter (e.g., `mars.param=2t/10u`) |
| `-p <KEYS>` | Comma-separated keys to display |
| `-j` | JSON output |
| `-h, --help` | Print help |
Examples
# List all messages with default columns
tensogram ls forecast.tgm
# Only temperature fields
tensogram ls forecast.tgm -w "mars.param=2t"
# Temperature or wind
tensogram ls forecast.tgm -w "mars.param=2t/10u/10v"
# Exclude ensemble members
tensogram ls forecast.tgm -w "mars.type!=em"
# Show only date and step columns
tensogram ls forecast.tgm -p "mars.date,mars.step"
# JSON output (one object per line, good for jq)
tensogram ls forecast.tgm -j | jq '.["mars.param"]'
Where Clause Syntax
The -w flag accepts a single expression:
key=value # exact match
key=v1/v2/v3 # OR — matches any of v1, v2, v3
key!=value # not equal
key!=v1/v2 # not any of v1, v2
Key format: namespace.field for namespaced keys (e.g. mars.param) or just field for top-level keys (e.g. version).
Missing key: For key=value, a missing key is treated as non-matching. For key!=value, a missing key passes the filter.
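The matching rules above can be sketched as predicates (hypothetical helpers, not the CLI's parser):

```rust
// `=` with a missing key does not match; `!=` with a missing key passes.
fn matches_eq(value: Option<&str>, wanted: &[&str]) -> bool {
    value.map_or(false, |v| wanted.contains(&v))
}

fn matches_ne(value: Option<&str>, unwanted: &[&str]) -> bool {
    value.map_or(true, |v| !unwanted.contains(&v))
}

fn main() {
    assert!(matches_eq(Some("2t"), &["2t", "10u"])); // key=v1/v2 is an OR
    assert!(!matches_eq(None, &["2t"]));             // missing key: `=` fails
    assert!(matches_ne(None, &["em"]));              // missing key: `!=` passes
}
```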
Only one -w expression can be specified per command. To apply multiple filters, pipe commands:
tensogram ls forecast.tgm -w "mars.type=fc" | grep "2t"
Pick Keys
The -p flag selects which metadata columns to display. Keys use the same dot-notation as -w:
tensogram ls forecast.tgm -p "mars.date,mars.step,mars.param"
Without -p, all available metadata keys are shown.
Default Table Output
mars.date mars.step mars.param mars.type shape
20260401 0 2t fc [721, 1440]
20260401 0 10u fc [721, 1440]
20260401 0 10v fc [721, 1440]
20260401 6 2t fc [721, 1440]
...
JSON Output
With -j, each matching message is printed as a JSON object on its own line:
{"mars.date": "20260401", "mars.step": "0", "mars.param": "2t", "shape": "[721, 1440]"}
{"mars.date": "20260401", "mars.step": "0", "mars.param": "10u", "shape": "[721, 1440]"}
This is compatible with jq, grep, and any tool that processes newline-delimited JSON.
tensogram dump
Prints the full contents of every message in a Tensogram file — metadata keys and optionally the raw data values.
Usage
tensogram dump [OPTIONS] [FILES]...
Options
| Option | Description |
|---|---|
| `-w <WHERE_CLAUSE>` | Filter messages (e.g. `mars.param=2t`, same syntax as ls) |
| `-p <KEYS>` | Comma-separated keys to display |
| `-j` | JSON output |
| `-h, --help` | Print help |
Example
$ tensogram dump forecast.tgm
─── Message 0 ───
version : 1
mars.class : od
mars.type : fc
mars.date : 20260401
mars.step : 0
Object 0
type : ntensor
ndim : 2
shape : [721, 1440]
strides : [1440, 1]
dtype : float32
mars.param: 2t
encoding : none
filter : none
compression: none
hash : xxh3:a3f0123456789abc
─── Message 1 ───
...
Filtering
Use -w to limit the dump to specific messages:
# Dump only wave spectra
tensogram dump forecast.tgm -w "mars.param=wave_spectra"
JSON Output
With -j, each message is a JSON object:
{
"message": 0,
"metadata": {
"version": 2,
"base": [
{
"mars": {"class": "od", "type": "fc", "date": "20260401", "step": 0, "param": "2t"},
"_reserved_": {"tensor": {"ndim": 2, "shape": [721, 1440], "strides": [1440, 1], "dtype": "float32"}}
}
]
},
"objects": [
{"type": "ntensor", "ndim": 2, "shape": [721, 1440], "dtype": "float32",
"encoding": "none", "hash": {"type": "xxh3", "value": "a3f0..."}}
]
}
When to Use dump vs ls
- Use `ls` for a quick overview of many messages (one line per message)
- Use `dump` when you need to see all keys for a specific message, or check encoding parameters
tensogram get
Extracts a single metadata value from messages in a file. Returns an error if the key is missing.
Usage
tensogram get [OPTIONS] -p <KEYS> [FILES]...
Options
| Option | Description |
|---|---|
| `-w <WHERE_CLAUSE>` | Filter messages (e.g. `mars.param=2t`, same syntax as ls) |
| `-p <KEYS>` | Comma-separated keys to extract (required) |
| `-h, --help` | Print help |
Examples
# Get the mars.param value from all messages
tensogram get -p mars.param forecast.tgm
# Get the date from messages where param is 2t
tensogram get -p mars.date -w "mars.param=2t" forecast.tgm
# Get the shape of object 0
tensogram get -p shape forecast.tgm
Strict Key Lookup
Unlike ls which shows a blank for missing keys, get exits with a non-zero status if any matching message does not have the requested key:
$ tensogram get -p mars.nonexistent forecast.tgm
Error: key not found: mars.nonexistent
This makes get safe to use in shell scripts where missing data should fail fast.
Multi-Object Messages
For messages with multiple objects, get returns the first matching value found. Lookup checks top-level metadata first and then scans objects in order until it finds a match.
tensogram set
Modifies metadata keys in messages and writes the result to a new file. Matching messages are decoded, their metadata is updated, and they are re-encoded with the original payload bytes and pipeline settings.
Usage
tensogram set [OPTIONS] -s <SET_VALUES> <INPUT> <OUTPUT>
Options
| Option | Description |
|---|---|
| `-s <SET_VALUES>` | Key=value pairs to set (comma-separated) |
| `-w <WHERE_CLAUSE>` | Only modify messages matching this filter (e.g. `mars.param=2t`) |
| `-h, --help` | Print help |
Examples
# Change mars.date to 20260402 in all messages
tensogram set -s mars.date=20260402 input.tgm output.tgm
# Set multiple keys at once
tensogram set -s mars.date=20260402,mars.step=12 input.tgm output.tgm
# Only modify temperature fields
tensogram set -s mars.class=rd -w "mars.param=2t" input.tgm output.tgm
Key=Value Syntax
Multiple mutations can be specified as a comma-separated list:
tensogram set -s key1=val1,key2=val2,key3=val3 in.tgm out.tgm
Keys use dot-notation: mars.param sets the param field inside the mars namespace. A top-level key like experiment sets a top-level metadata field.
Object-level metadata can be updated with objects.<index>.<path>:
# Add object-specific metadata to the first object
tensogram set -s objects.0.processing.version=2 input.tgm output.tgm
Structural/Integrity Keys
The following keys cannot be modified because they describe the physical structure of the payload. Changing them would make the metadata inconsistent with the actual bytes on disk:
| Key | Reason |
|---|---|
| `shape` | Tensor dimensions |
| `strides` | Memory layout |
| `dtype` | Element type |
| `ndim` | Number of dimensions |
| `type` | Object type |
| `encoding` | Encoding algorithm |
| `filter` | Filter algorithm |
| `compression` | Compression algorithm |
| `hash` | Payload integrity hash |
| `szip_rsi` | Szip compression block parameter |
| `szip_block_size` | Szip compression block parameter |
| `szip_flags` | Szip compression flags |
| `szip_block_offsets` | Szip block seek table |
| `reference_value` | Simple packing quantization parameter |
| `binary_scale_factor` | Simple packing quantization parameter |
| `decimal_scale_factor` | Simple packing quantization parameter |
| `bits_per_value` | Simple packing quantization parameter |
| `shuffle_element_size` | Shuffle filter parameter |
Attempting to modify any of these returns an error before any output is written.
Pass-Through for Non-Matching Messages
Messages that do not match the -w filter are copied verbatim to the output file. Their bytes are not re-encoded or re-hashed.
Note: Messages that are modified are re-encoded after the metadata mutation. Because the decoded payload bytes are unchanged, `set` preserves the original payload hash instead of recomputing it.
Workflow
flowchart TD
A[Read message] --> B{Matches -w?}
B -- No --> C[Write raw bytes to output]
B -- Yes --> D[Decode metadata]
D --> E[Apply mutations]
E --> F[Re-encode message\npreserve payload hash]
F --> G[Write to output]
C --> H[Next message]
G --> H
tensogram copy
Copies messages from one file to one or more output files. The output filename can include placeholders that expand to metadata values, allowing a single file to be split by parameter, date, step, or any other key.
Usage
tensogram copy [OPTIONS] <INPUT> <OUTPUT>
Options
| Option | Description |
|---|---|
| `-w <WHERE_CLAUSE>` | Only copy messages that match this filter |
| `-h, --help` | Print help |
Basic Copy
# Copy all messages from one file to another
tensogram copy input.tgm output.tgm
Filename Placeholders
Wrap any metadata key in square brackets to expand it in the output filename:
# One file per parameter
tensogram copy forecast.tgm "by_param/[mars.param].tgm"
# Produces: by_param/2t.tgm, by_param/10u.tgm, by_param/msl.tgm, ...
# One file per date+step combination
tensogram copy forecast.tgm "archive/[mars.date]_[mars.step].tgm"
# Produces: archive/20260401_0.tgm, archive/20260401_6.tgm, ...
# Split by type and param
tensogram copy forecast.tgm "split/[mars.type]/[mars.param].tgm"
# Produces: split/fc/2t.tgm, split/an/2t.tgm, etc.
Multiple messages with the same expanded filename are appended to the same output file. This is how you split-then-concatenate: a 1000-message file with 4 unique mars.param values produces 4 output files with ~250 messages each.
Filtering During Copy
Combine -w with placeholders for targeted extraction:
# Copy only forecasts, split by step
tensogram copy forecast.tgm "steps/[mars.step].tgm" -w "mars.type=fc"
Edge Cases
Missing Placeholder Key
If a message does not have the key referenced by a placeholder, that placeholder expands to unknown:
# If mars.param is missing, the message is written to by_param/unknown.tgm
tensogram copy forecast.tgm "by_param/[mars.param].tgm"
Output Directory
The output directory must exist before running copy. The command does not create directories. Use mkdir -p beforehand:
mkdir -p by_param
tensogram copy forecast.tgm "by_param/[mars.param].tgm"
Overwriting
If the expanded output filename already exists before the copy starts, it is truncated once and matching messages are then appended in order. This means running copy twice will duplicate messages. To avoid this, delete or rename existing outputs first.
Placeholder Syntax Conflicts
If a metadata value contains /, \, or other characters that are invalid in filenames on your OS, the resulting filename will be invalid. Choose placeholder keys whose values are filesystem-safe (e.g. dates, step numbers, short codes).
tensogram merge
Merge messages from one or more files into a single message.
Usage
tensogram merge [OPTIONS] --output <OUTPUT> [INPUTS]...
Options
| Option | Description |
|---|---|
| `-o, --output <OUTPUT>` | Output file |
| `-s, --strategy <STRATEGY>` | Merge strategy for conflicting metadata keys: `first` — first value wins, `last` — last value wins, `error` — fail on conflict [default: first] |
| `-h, --help` | Print help |
Description
All data objects from all input messages are collected into a single Tensogram message. Global metadata is merged according to --strategy: first (default) keeps the first value, last keeps the last, and error fails on conflict.
Examples
# Merge two files into one
tensogram merge file1.tgm file2.tgm -o merged.tgm
# Merge all messages in a single multi-message file
tensogram merge multi.tgm -o single.tgm
tensogram split
Split multi-object messages into separate single-object files.
Usage
tensogram split --output <OUTPUT> <INPUT>
Options
| Option | Description |
|---|---|
| `-o, --output <OUTPUT>` | Output template (use `[index]` for numbering) |
| `-h, --help` | Print help |
Description
Each data object from each message in the input file becomes its own Tensogram message, inheriting the global metadata.
Output files are named using the template:
- Use `[index]` for zero-padded numbering: `split_[index].tgm` → `split_0000.tgm`, `split_0001.tgm`, …
- Without `[index]`, the index is appended before the extension: `out.tgm` → `out_0000.tgm`, `out_0001.tgm`, …
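The naming rule can be sketched as follows (`output_name` is a hypothetical helper for illustration, not the CLI's implementation):

```rust
// Substitute [index] if the template contains it; otherwise insert the
// zero-padded index just before the file extension.
fn output_name(template: &str, index: usize) -> String {
    let idx = format!("{:04}", index);
    if template.contains("[index]") {
        template.replace("[index]", &idx)
    } else if let Some(dot) = template.rfind('.') {
        format!("{}_{}{}", &template[..dot], idx, &template[dot..])
    } else {
        format!("{}_{}", template, idx)
    }
}

fn main() {
    assert_eq!(output_name("split_[index].tgm", 1), "split_0001.tgm");
    assert_eq!(output_name("out.tgm", 0), "out_0000.tgm");
}
```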
Examples
# Split with index template
tensogram split multi_object.tgm -o 'field_[index].tgm'
# Split with auto-numbered names
tensogram split multi_object.tgm -o output.tgm
tensogram reshuffle
Reshuffle frames: move footer frames to header position.
Usage
tensogram reshuffle --output <OUTPUT> <INPUT>
Options
| Option | Description |
|---|---|
| `-o, --output <OUTPUT>` | Output file |
| `-h, --help` | Print help |
Description
Converts streaming-mode messages (footer-based index and hash frames) into random-access-mode messages (header-based index and hash frames).
This is a decode → re-encode operation. The data is not modified; only the frame layout changes so that index and hash information appears before the data objects, enabling efficient random access.
Examples
tensogram reshuffle streamed.tgm -o random_access.tgm
tensogram validate
Check whether .tgm files are well-formed and intact. Analogous to grib_check or h5check.
Usage
tensogram validate [OPTIONS] <FILES>...
Validation Levels
The command runs up to four validation levels:
| Level | Name | What it checks |
|---|---|---|
| 1 | Structure | Magic bytes, frame headers, ENDF markers, total_length, postamble, frame ordering, preceder legality, preamble flags vs observed frames |
| 2 | Metadata | CBOR parses correctly, required keys present (_reserved_.tensor, dtype, shape, strides), encoding/filter/compression types recognized, object count consistency, shape/strides/ndim consistency |
| 3 | Integrity | xxh3 hash in descriptor/hash-frame matches recomputed hash, compressed payloads decompress without error |
| 4 | Fidelity | Full decode succeeds, decoded size matches shape/dtype, NaN/Inf in float arrays are errors |
Modes
| Mode | Levels | Description |
|---|---|---|
| default | 1–3 | Structure + metadata + integrity |
| quick | 1 | Structure only, no payloads |
| checksum | 3 | Hash verification only (structural errors still reported, no decompression) |
| full | 1–4 | All levels including fidelity (NaN/Inf check) |
Level selectors (--quick, --checksum, --full) are mutually exclusive. --canonical is independent and can be combined with any level selector.
All flags
| Flag | Description |
|---|---|
| `--quick` | Quick mode: structure only (level 1) |
| `--checksum` | Checksum only: hash verification (structural errors still reported, but metadata/decompression/fidelity checks skipped) |
| `--full` | Full mode: all levels including fidelity (levels 1-4) |
| `--canonical` | Check RFC 8949 canonical CBOR key ordering (combinable with any level) |
| `--json` | Machine-parseable JSON output |
| `-h, --help` | Print help |
Output
Human-readable (default)
file.tgm: OK (3 messages, 47 objects, hash verified)
On failure:
bad.tgm: FAILED — message 2, object 5: hash mismatch (expected a3f7..., got 91c2...)
bad.tgm: FAILED (1 error, 1 message, 3 objects)
JSON (--json)
[
{
"file": "file.tgm",
"status": "ok",
"messages": 1,
"objects": 3,
"hash_verified": true,
"file_issues": [],
"message_reports": [
{
"issues": [],
"object_count": 3,
"hash_verified": true
}
]
}
]
On failure, issues within message_reports[i].issues contain (note: object_index is 0-based in JSON; absent fields are omitted, not null):
{
"code": "hash_mismatch",
"level": "integrity",
"severity": "error",
"object_index": 4,
"description": "hash mismatch (expected a3f7..., got 91c2...)"
}
Issue codes are stable snake_case strings (e.g. hash_mismatch, invalid_magic, buffer_too_short) suitable for machine parsing.
Exit Code
- `0` — all files pass validation
- `1` — one or more files have errors or file-level issues
Batch Mode
tensogram validate data/*.tgm
Validates all files. Reports per-file. Exits 1 if any file fails.
File-level Checks
When validating a file with multiple messages, the command also detects:
- Unrecognized bytes between messages (garbage or padding)
- Truncated messages at end of file
- Trailing bytes after the last valid message
These are reported as file-level issues and cause validation to fail (exit code 1).
Library API
The same validation is available programmatically:
#![allow(unused)]
fn main() -> Result<(), Box<dyn std::error::Error>> {
    use std::path::Path;
    use tensogram::{validate_message, validate_file, ValidateOptions};

    // Validate a single message buffer
    let bytes: Vec<u8> = std::fs::read("data.tgm")?; // any complete message buffer
    let report = validate_message(&bytes, &ValidateOptions::default());
    assert!(report.is_ok());

    // Validate a file
    let file_report = validate_file(Path::new("data.tgm"), &ValidateOptions::default())?;
    println!("{} messages, {} objects",
             file_report.messages.len(), file_report.total_objects());
    Ok(())
}
Examples
# Default validation (levels 1-3)
tensogram validate measurements.tgm
# Quick structural check
tensogram validate --quick *.tgm
# Verify checksums only
tensogram validate --checksum archive/*.tgm
# Full validation including NaN/Inf detection (levels 1-4)
tensogram validate --full output.tgm
# Full validation with canonical CBOR check
tensogram validate --full --canonical output.tgm
# Check canonical CBOR encoding
tensogram validate --canonical output.tgm
# JSON output for CI pipelines
tensogram validate --json data/*.tgm
GRIB Import
Tensogram provides tensogram-grib, a dedicated crate for importing GRIB
(GRIdded Binary) messages into Tensogram format. GRIB is widely used in
operational weather forecasting; this importer lets you bring existing GRIB
data into Tensogram pipelines while preserving the full MARS namespace
metadata. Conversion is one-way: GRIB → Tensogram.
System Requirement
The ecCodes C library must be installed:
brew install eccodes # macOS
apt install libeccodes-dev # Debian/Ubuntu
Building
The tensogram-grib crate is excluded from the default workspace build to
avoid requiring ecCodes on machines that do not need GRIB import.
# Build the library
cd rust/tensogram-grib && cargo build
# Build CLI with GRIB support
cargo build -p tensogram-cli --features grib
Conversion Modes
Merge All (default)
All GRIB messages are combined into a single Tensogram message with N data objects. ALL MARS keys for each GRIB message are placed into the corresponding base[i] entry independently — there is no common/varying partitioning in the output.
tensogram convert-grib forecast.grib -o forecast.tgm
One-to-One (split)
Each GRIB message becomes a separate Tensogram message with one data object. All MARS keys go into base[0].
tensogram convert-grib forecast.grib -o forecast.tgm --split
Rust API
#![allow(unused)]
fn main() -> Result<(), Box<dyn std::error::Error>> {
    use std::path::Path;
    use tensogram_grib::{convert_grib_file, ConvertOptions, Grouping};

    let options = ConvertOptions {
        grouping: Grouping::MergeAll,
        ..Default::default()
    };
    let messages = convert_grib_file(Path::new("forecast.grib"), &options)?;
    // messages is Vec<Vec<u8>> — each element is a complete Tensogram wire-format message
    Ok(())
}
Data Mapping
| Source (GRIB) | Target (Tensogram) |
|---|---|
| Grid values (values key) | Data object payload (float64, little-endian) |
| Grid dimensions (Ni, Nj) | DataObjectDescriptor.shape as [Nj, Ni] |
| Reduced Gaussian grids (Ni=0) | Shape [numberOfPoints] (1D) |
| MARS keys (all, per message) | GlobalMetadata.base[i]["mars"] (each entry independent) |
Scope
Only GRIB → Tensogram import is supported. Tensogram → GRIB is out of scope because Tensogram’s N-tensor data model is a superset of GRIB’s 2-D-field model; a faithful down-conversion is often impossible.
See also
- NetCDF Import — sister importer for NetCDF files; shares the --encoding/--bits/--filter/--compression pipeline flags with convert-grib.
- Vocabularies — other application vocabularies that can coexist with MARS in the same message.
MARS Key Mapping
The importer reads the following MARS namespace keys from each GRIB message using ecCodes’ read_key_dynamic API.
Keys Extracted
Identification
| GRIB Key | Description | Example |
|---|---|---|
| class | MARS class | "od" (operational) |
| type | Data type | "an" (analysis), "fc" (forecast) |
| stream | Data stream | "oper", "enfo" |
| expver | Experiment version | "0001" |
Parameter
| GRIB Key | Description | Example |
|---|---|---|
| param | Parameter ID | "2t" (2m temperature) |
| shortName | Short name | "2t" |
| name | Full name | "2 metre temperature" |
| paramId | Numeric ID | 167 |
| discipline | WMO discipline | 0 |
| parameterCategory | WMO category | 0 |
| parameterNumber | WMO number | 0 |
Vertical
| GRIB Key | Description | Example |
|---|---|---|
| level | Level value | 500 |
| typeOfLevel | Level type | "isobaricInhPa" |
| levtype | MARS level type | "pl" (pressure level) |
Temporal
| GRIB Key | Description | Example |
|---|---|---|
| date / dataDate | Reference date | 20260404 |
| time / dataTime | Reference time | 1200 |
| stepRange / step | Forecast step | "0", "6", "0-6" |
| stepUnits | Step units | 1 (hours) |
Spatial
| GRIB Key | Description | Example |
|---|---|---|
| gridType | Grid type | "regular_ll" |
| Ni, Nj | Grid dimensions | 360, 181 |
| numberOfPoints | Total grid points | 65160 |
| latitudeOfFirstGridPointInDegrees | First latitude | 90.0 |
| longitudeOfFirstGridPointInDegrees | First longitude | 0.0 |
| latitudeOfLastGridPointInDegrees | Last latitude | -90.0 |
| longitudeOfLastGridPointInDegrees | Last longitude | 359.0 |
| iDirectionIncrementInDegrees | Longitude step | 1.0 |
| jDirectionIncrementInDegrees | Latitude step | 1.0 |
Other
| GRIB Key | Description | Example |
|---|---|---|
| bitsPerValue | Packing precision | 16 |
| packingType | GRIB packing | "grid_simple" |
| centre | Originating centre | "ecmf" |
| subCentre | Sub-centre | 0 |
| generatingProcessIdentifier | Process ID | 148 |
Storage in Tensogram
Given N GRIB messages in merge-all mode:
- Extract all MARS keys from each message using read_key_dynamic
- Store ALL keys for each GRIB message in the corresponding base[i]["mars"] entry independently
- There is no common/varying partitioning in the output — each base[i] entry is self-contained
graph TD
A[N GRIB messages] --> B[Extract MARS keys from each]
B --> C["Store in base[i] independently"]
C --> D["base[0]: all keys from GRIB msg 0"]
C --> E["base[1]: all keys from GRIB msg 1"]
C --> F["base[N-1]: all keys from GRIB msg N-1"]
If you need to extract commonalities after decoding (e.g. for display), compute them in software with the compute_common() utility.
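Since each base[i] entry is self-contained, recovering a common/varying view is a post-decode operation. The text above mentions a compute_common() utility; the sketch below shows what such a split might look like in plain Python — the function name matches the utility mentioned, but this signature and implementation are illustrative assumptions, not the library's API.

```python
def compute_common(bases: list[dict]) -> tuple[dict, list[dict]]:
    """Split per-object metadata dicts into a (common, varying) pair.

    A key goes into `common` only if every entry carries the same value;
    everything else stays in the per-entry `varying` dicts."""
    if not bases:
        return {}, []
    common = {k: v for k, v in bases[0].items()
              if all(b.get(k) == v for b in bases[1:])}
    varying = [{k: v for k, v in b.items() if k not in common}
               for b in bases]
    return common, varying

# Two GRIB messages that differ only in forecast step
bases = [
    {"class": "od", "param": "2t", "step": "0"},
    {"class": "od", "param": "2t", "step": "6"},
]
common, varying = compute_common(bases)
print(common)   # {'class': 'od', 'param': '2t'}
print(varying)  # [{'step': '0'}, {'step': '6'}]
```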
Sentinel Handling
ecCodes uses sentinel values for missing keys:
- String: "MISSING" or "not_found" → skipped
- Integer: 2147483647 or -2147483647 → skipped
- Float: NaN or Inf → skipped
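The skip rule above can be expressed as a small predicate. This is a sketch of the filtering logic as described, not the importer's actual code:

```python
import math

# ecCodes sentinel values for missing keys, per the list above
STRING_SENTINELS = {"MISSING", "not_found"}
INT_SENTINELS = {2147483647, -2147483647}

def keep_key(value) -> bool:
    """Return False for sentinel values that mark a missing GRIB key."""
    if isinstance(value, str):
        return value not in STRING_SENTINELS
    if isinstance(value, bool):   # bool is an int subclass; check it first
        return True
    if isinstance(value, int):
        return value not in INT_SENTINELS
    if isinstance(value, float):
        return math.isfinite(value)  # drops NaN and ±Inf
    return True

raw = {"class": "od", "level": 2147483647, "step": float("nan"), "param": "2t"}
print({k: v for k, v in raw.items() if keep_key(v)})  # {'class': 'od', 'param': '2t'}
```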
NetCDF Import
Tensogram ships tensogram-netcdf, a dedicated crate for importing NetCDF
(both Classic and NetCDF-4) files into Tensogram messages. NetCDF is widely
used in climate, ocean, atmospheric, and Earth-observation science, but the
importer treats any NetCDF file the same way — the mapping is structural, not
domain-specific.
The crate is exposed through the CLI as tensogram convert-netcdf and through
a thin Rust library API. Conversion is one-way: NetCDF → Tensogram. There is
no Tensogram → NetCDF writer.
System requirement
The NetCDF C library must be installed on your system:
brew install netcdf # macOS
apt install libnetcdf-dev # Debian/Ubuntu
The crate transitively pulls in HDF5 (used internally by NetCDF-4 files), so
on Debian-family distros you also want libhdf5-dev.
Building
The tensogram-netcdf crate is excluded from the default workspace build to
avoid forcing libnetcdf on every contributor. Build it explicitly:
# Library
cargo build --manifest-path rust/tensogram-netcdf/Cargo.toml
# CLI with NetCDF support
cargo build -p tensogram-cli --features netcdf
The binary then exposes the new subcommand:
tensogram convert-netcdf --help
Quick example
# Convert one file
tensogram convert-netcdf input.nc -o output.tgm
# Convert multiple files into a single output
tensogram convert-netcdf jan.nc feb.nc mar.nc -o q1.tgm
# Stream to stdout (useful for piping)
tensogram convert-netcdf input.nc | tensogram info /dev/stdin
Command-line options
| Flag | Default | Description |
|---|---|---|
-o, --output PATH | stdout | Where to write the Tensogram file. |
--split-by MODE | file | Grouping mode: file, variable, or record. See Splitting modes. |
--cf | off | Extract the CF attribute allow-list into base[i]["cf"]. See CF metadata mapping. |
--encoding ENC | none | none or simple_packing. |
--bits N | auto (16) | Bits per value for simple_packing (1–64). |
--filter FILTER | none | none or shuffle. |
--compression CODEC | none | none, zstd, lz4, blosc2, or szip. |
--compression-level N | codec default | Level for zstd (1–22) and blosc2 (0–9). |
The --encoding/--bits/--filter/--compression/--compression-level
flags are the same set used by tensogram convert-grib. Both importers share
a PipelineArgs struct so the two commands stay symmetric.
How variables become objects
Each numeric NetCDF variable in the root group is mapped 1:1 to a Tensogram
data object. The variable’s name is stored under base[i]["name"], the dtype
and shape come from the NetCDF type and dimension list, and the raw bytes
become the object payload (always little-endian).
Dtype matrix
| NetCDF type | Tensogram Dtype |
|---|---|
| byte | Int8 |
| ubyte | Uint8 |
| short | Int16 |
| ushort | Uint16 |
| int | Int32 |
| uint | Uint32 |
| int64 | Int64 |
| uint64 | Uint64 |
| float | Float32 |
| double | Float64 |
char and string variables, as well as the NetCDF-4 enhanced types
(compound, vlen, enum, opaque), are skipped with a warning. They have
no clean tensor representation.
Scalar variables
A NetCDF scalar (zero dimensions) becomes an object with ndim = 0,
shape = [], and a single value in the payload.
Packed data
Variables with scale_factor or add_offset attributes are unpacked during
conversion: the raw integer values are read, multiplied by the scale factor,
the offset is added, and the result is stored as Float64 regardless of the
on-disk dtype. This matches the convention used by xarray and most NetCDF
tooling.
The fill value (_FillValue or missing_value) is replaced with NaN in the
unpacked output. The original sentinel is preserved under
base[i]["netcdf"]["_FillValue"] so consumers can recover it.
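The unpacking rule is the standard CF one: unpacked = raw × scale_factor + add_offset, with fill-value sentinels mapped to NaN. A minimal sketch of that transform (plain Python, not the importer's code):

```python
import math

def unpack(raw, scale_factor=1.0, add_offset=0.0, fill_value=None):
    """CF unpacking: unpacked = raw * scale_factor + add_offset.
    Fill-value sentinels become NaN in the Float64 output; the original
    sentinel is kept in metadata so consumers can recover it."""
    out = []
    for v in raw:
        if fill_value is not None and v == fill_value:
            out.append(math.nan)
        else:
            out.append(v * scale_factor + add_offset)
    return out

# int16-packed temperatures with scale_factor=0.01, add_offset=273.15,
# _FillValue=-32768 (matching the attribute example below)
vals = unpack([0, 100, -32768],
              scale_factor=0.01, add_offset=273.15, fill_value=-32768)
print([round(v, 2) for v in vals[:2]])  # [273.15, 274.15]
print(math.isnan(vals[2]))              # True
```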
Time coordinates
Time coordinate variables are stored as numeric values (typically Float64)
exactly as they appear in the file — Tensogram does not convert them to
calendar dates. The CF units string ("days since 1970-01-01") and
calendar ("gregorian", "noleap", etc.) are preserved under
base[i]["netcdf"] so a consumer can decode them on demand.
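Decoding those preserved units/calendar strings on demand is the consumer's job. The sketch below handles only the standard (gregorian) calendar with the stdlib — anything like "noleap" or "360_day" needs a CF-aware library such as cftime; the function and its unit table are assumptions for illustration:

```python
from datetime import datetime, timedelta

def decode_cf_time(values, units: str, calendar: str = "gregorian"):
    """Decode CF '<unit> since <epoch>' time values (standard calendar only)."""
    if calendar not in ("gregorian", "standard", "proleptic_gregorian"):
        raise ValueError(f"calendar {calendar!r} needs a CF-aware library")
    unit, _, epoch = units.partition(" since ")
    origin = datetime.fromisoformat(epoch.strip())
    step = {"days": timedelta(days=1),
            "hours": timedelta(hours=1),
            "seconds": timedelta(seconds=1)}[unit]
    return [origin + v * step for v in values]

print(decode_cf_time([0.0, 1.5], "days since 1970-01-01"))
# 1970-01-01 00:00 and 1970-01-02 12:00
```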
NetCDF-4 groups
Tensogram extracts only the root group of a NetCDF-4 file. If sub-groups are detected the importer prints a warning to stderr and continues with the root variables. Sub-group support is intentionally out of scope for v1 — most operational datasets keep their data variables at the root anyway.
Splitting modes
The --split-by flag controls how variables are grouped into Tensogram
messages.
--split-by=file (default)
All variables from one input file are bundled into a single Tensogram message containing N data objects. This is the most compact representation and is the right choice when you want to keep a NetCDF file as a single logical unit.
tensogram convert-netcdf forecast.nc -o forecast.tgm
# 1 message with N objects
--split-by=variable
Each variable becomes its own one-object Tensogram message. Useful when downstream consumers want to fetch individual variables without decoding the whole file.
tensogram convert-netcdf forecast.nc -o forecast.tgm --split-by variable
# N messages with 1 object each
--split-by=record
Splits along the unlimited (record) dimension. Each step along the unlimited
dimension produces a separate message. The unlimited dimension is detected
automatically; passing this mode against a file without one is a hard error
(NoUnlimitedDimension).
Variables that don’t depend on the unlimited dimension (e.g. a static mask
variable) are still included in every output message — that way each
record is fully self-describing.
tensogram convert-netcdf timeseries.nc -o timeseries.tgm --split-by record
# 1 message per record
Encoding pipeline flags
The pipeline flags are applied per data object before encoding into the
wire format. They use the same names and semantics as convert-grib:
| Stage | Flag | Notes |
|---|---|---|
| Encoding | --encoding simple_packing --bits N | Lossy quantization. Float64 only — non-f64 variables in the same file are skipped (with a warning) and pass through unencoded so mixed files convert cleanly. |
| Filter | --filter shuffle | Byte-shuffle filter, sets shuffle_element_size to the post-encoding byte width. |
| Compression | --compression zstd --compression-level 3 | zstd_level defaults to 3. |
| Compression | --compression lz4 | No params. |
| Compression | --compression blosc2 --compression-level 9 | Uses blosc2_codec=lz4 by default. |
| Compression | --compression szip | Sets szip_rsi=128, szip_block_size=16, szip_flags=8. Requires preceding simple_packing or shuffle because libaec szip caps at 32 bits per sample (raw f64 is 64 bits). |
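To see why the shuffle stage helps, here is the byte-shuffle transform in miniature. The real filter lives in tensogram-encodings; this pure-Python version is only an illustration of the reordering (element_size plays the role of the shuffle_element_size noted in the table):

```python
def shuffle(data: bytes, element_size: int) -> bytes:
    """Group byte 0 of every element, then byte 1, and so on.
    Slowly-varying high bytes end up adjacent, which gives a
    downstream compressor long runs to exploit."""
    n = len(data) // element_size
    return bytes(data[e * element_size + b]
                 for b in range(element_size)
                 for e in range(n))

def unshuffle(data: bytes, element_size: int) -> bytes:
    """Inverse transform: scatter the byte planes back per element."""
    n = len(data) // element_size
    return bytes(data[b * n + e]
                 for e in range(n)
                 for b in range(element_size))

# Three little-endian u16 values that differ only in the low byte
payload = b"".join(v.to_bytes(2, "little") for v in (1000, 1001, 1002))
assert unshuffle(shuffle(payload, 2), 2) == payload  # lossless round trip
```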
Variables that contain NaN or ±Inf (typically from unpacked
_FillValue / missing_value substitution or degenerate arithmetic
upstream) cannot be represented by simple_packing — the algorithm’s
range / scale-factor derivation has no slot for non-finite values.
The importer hard-fails when --encoding simple_packing is
requested on data containing NaN or Inf. The error names the
offending variable and suggests recovery options:
error: simple_packing failed for forecast_temperature: NaN value
encountered at index 42. The variable contains NaN or Inf which
cannot be represented by simple_packing. Pre-process the data or
choose a different encoding (e.g. encoding="none").
Recovery options, in order of effort:
- Drop the --encoding simple_packing flag AND pass --allow-nan. The default pipeline (encoding="none") combined with the NaN bitmask companion frame round-trips NaN values losslessly. See NaN / Inf Handling.
- Substitute non-finite values with an in-band sentinel before conversion if you need simple_packing throughout.
- Split the conversion with --split-by variable and re-run per-variable, using --encoding simple_packing only for the variables you know are NaN-free.
Prior behaviour (pre-0.17). The importer used to soft-downgrade NaN-bearing variables to encoding="none" with a stderr warning. That silently hid data-quality problems from automated pipelines; 0.17 surfaces them as hard errors and pairs the fix with the --allow-nan bitmask opt-in (preferred over pre-processing). The non-f64-payload branch (a structural mismatch rather than a data-quality problem) keeps its stderr-warning + fallback behaviour unchanged.
# Pack temperature to 24-bit + zstd
tensogram convert-netcdf --encoding simple_packing --bits 24 \
--compression zstd --compression-level 3 \
era5_t2m.nc -o era5_t2m.tgm
# Shuffle + szip on a multi-variable file
tensogram convert-netcdf --filter shuffle --compression szip \
forecast.nc -o forecast.tgm
CF metadata mapping
NetCDF attributes are always extracted into a netcdf sub-map under each
base entry:
base[0]:
  name: "temperature"
  netcdf:
    units: "K"
    long_name: "Air Temperature"
    standard_name: "air_temperature"
    _FillValue: -32768
    add_offset: 273.15
    scale_factor: 0.01
    _global:
      Conventions: "CF-1.10"
      title: "..."
      institution: "..."
When --cf is set, an additional cf sub-map is added containing only the
16 CF allow-list attributes. This duplicate
copy makes CF-aware tooling cheaper because it can ignore the verbose
netcdf map and rely on a stable, standardised key set.
Limitations
- No NetCDF writer. Conversion is one-way only.
- No string or char variables. They are skipped with a warning.
- No NetCDF-4 enhanced types (compound, vlen, enum, opaque).
- Root group only. Sub-groups are skipped with a warning.
- No tensogram-python bindings. The Python ecosystem talks to convert-netcdf through subprocess. The library API is Rust-only in v1.
- simple_packing is f64-only. Mixed-dtype files convert cleanly but only f64 variables get packed.
Library API
If you’d rather call the importer directly from Rust:
#![allow(unused)]
fn main() -> Result<(), Box<dyn std::error::Error>> {
    use std::path::Path;
    use tensogram_netcdf::{convert_netcdf_file, ConvertOptions, DataPipeline, SplitBy};

    let options = ConvertOptions {
        split_by: SplitBy::Variable,
        cf: true,
        pipeline: DataPipeline {
            encoding: "simple_packing".to_string(),
            bits: Some(24),
            compression: "zstd".to_string(),
            compression_level: Some(3),
            ..Default::default()
        },
        ..Default::default()
    };
    let messages = convert_netcdf_file(Path::new("forecast.nc"), &options)?;
    // messages: Vec<Vec<u8>> — each element is a complete wire-format message
    Ok(())
}
Note: DataPipeline is defined in tensogram::pipeline and
re-exported from both tensogram_netcdf and tensogram_grib. The
underlying apply_pipeline helper is the same for both importers,
guaranteeing that convert-grib and convert-netcdf produce
byte-identical descriptor fields for equivalent flag combinations.
See also
- GRIB Import — sister importer with the same pipeline-flag semantics.
- Simple Packing, Shuffle, Compression — the encoding stages applied to each object.
- CF Metadata Mapping — full table of the 16 attributes lifted by --cf.
NetCDF CF Metadata Mapping
When tensogram convert-netcdf --cf is set, the importer walks each
NetCDF variable and lifts a fixed set of 16 CF Conventions
v1.10
attributes into a cf sub-map under the corresponding base[i] entry. The
attributes are also still present in the verbose netcdf map alongside
every other variable attribute — the cf map is a curated, schema-stable
view that CF-aware tooling can rely on.
The allow-list lives in rust/tensogram-netcdf/src/metadata.rs
as the constant CF_ATTRIBUTES. If you change the list, update this page
to match.
Attributes lifted by --cf
| CF Attribute | Tensogram Key | Notes |
|---|---|---|
| standard_name | base[i]["cf"]["standard_name"] | CF standard name from the CF Standard Name Table, e.g. "air_temperature", "eastward_wind". |
| long_name | base[i]["cf"]["long_name"] | Free-form descriptive label, e.g. "2 metre temperature". |
| units | base[i]["cf"]["units"] | UDUNITS-compliant string, e.g. "K", "m s-1", "days since 1970-01-01". |
| calendar | base[i]["cf"]["calendar"] | Calendar for time coordinate variables, e.g. "gregorian", "noleap", "360_day". |
| cell_methods | base[i]["cf"]["cell_methods"] | Aggregation description, e.g. "time: mean", "area: sum". |
| coordinates | base[i]["cf"]["coordinates"] | Space-separated list of auxiliary coordinate variable names, e.g. "lon lat". |
| axis | base[i]["cf"]["axis"] | Dimension role flag: "X", "Y", "Z", or "T". |
| positive | base[i]["cf"]["positive"] | Direction of vertical coordinate: "up" (altitude) or "down" (depth/pressure). |
| valid_min | base[i]["cf"]["valid_min"] | Minimum valid value for QA/range checks. |
| valid_max | base[i]["cf"]["valid_max"] | Maximum valid value for QA/range checks. |
| valid_range | base[i]["cf"]["valid_range"] | Two-element array [min, max] — alternative to valid_min/valid_max. |
| bounds | base[i]["cf"]["bounds"] | Name of an associated cell-bounds variable (irregular grids). |
| grid_mapping | base[i]["cf"]["grid_mapping"] | Name of an associated coordinate reference system variable. |
| ancillary_variables | base[i]["cf"]["ancillary_variables"] | Space-separated list of related ancillary variable names (uncertainty, QA flags, etc.). |
| flag_values | base[i]["cf"]["flag_values"] | Array of integer flag values for categorical variables. |
| flag_meanings | base[i]["cf"]["flag_meanings"] | Space-separated list of meanings, paired with flag_values. |
That’s 16 attributes — the full CF allow-list as of v0.7.0.
Storage layout
For a CF-compliant temperature variable, the --cf flag produces:
base[0]:
  name: "temperature"
  netcdf:
    units: "K"
    long_name: "2 metre temperature"
    standard_name: "air_temperature"
    _FillValue: -32768
    add_offset: 273.15
    scale_factor: 0.01
    cell_methods: "time: mean"
    _global:
      Conventions: "CF-1.10"
      title: "ERA5 reanalysis"
  cf:
    units: "K"
    long_name: "2 metre temperature"
    standard_name: "air_temperature"
    cell_methods: "time: mean"
The netcdf map is a verbatim dump of every variable attribute (the
_global sub-map carries the file-level attributes). The cf map is a
filtered slice containing only the allow-listed keys, in the order they
appear on the variable.
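Conceptually, the cf map is nothing more than an allow-list filter over the verbose attribute map. The sketch below reimplements that filter in Python; the attribute names are the 16 from the table above, but the CF_ATTRIBUTES constant itself lives in rust/tensogram-netcdf/src/metadata.rs and this helper is illustrative:

```python
# The 16 allow-listed CF attributes (mirrors the table above).
CF_ATTRIBUTES = frozenset({
    "standard_name", "long_name", "units", "calendar", "cell_methods",
    "coordinates", "axis", "positive", "valid_min", "valid_max",
    "valid_range", "bounds", "grid_mapping", "ancillary_variables",
    "flag_values", "flag_meanings",
})

def cf_view(netcdf_attrs: dict) -> dict:
    """Filtered slice of the verbose attribute map, keeping only the
    allow-listed CF keys in the order they appear on the variable."""
    return {k: v for k, v in netcdf_attrs.items() if k in CF_ATTRIBUTES}

attrs = {"units": "K", "standard_name": "air_temperature",
         "_FillValue": -32768, "cell_methods": "time: mean"}
print(cf_view(attrs))
# {'units': 'K', 'standard_name': 'air_temperature', 'cell_methods': 'time: mean'}
```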
What is not extracted
The allow-list is intentionally narrow. The following CF concepts are out
of scope for v0.7.0 — they are accessible via the verbose netcdf map but
not surfaced under cf:
- Grid mapping variable contents — only the grid_mapping reference is lifted, not the projection parameters of the referenced variable.
- Coordinate variable contents — coordinate variables are converted to their own data objects, not inlined into other variables' metadata.
- Bounds variable contents — only the bounds reference is lifted.
- Cell measures — cell_measures is not in the allow-list.
- Climatology bounds — climatology is not lifted.
- Geometry containers — CF 1.8+ geometries are out of scope.
- Labels and string-valued auxiliary coordinates — not in the allow-list.
- Compound coordinates / compress — ragged-array support is out of scope.
If you need these, read the raw NetCDF metadata from base[i]["netcdf"]
instead — every original attribute is preserved there, byte-for-byte.
Why a curated allow-list?
Two reasons:
- Schema stability. Downstream tooling (xarray engines, dashboards, indexers) wants to rely on a small, fixed key set without having to inspect every NetCDF file's variable-attribute zoo. The cf map gives them that contract.
- Interop friendliness. The 16 allow-listed attributes are the ones that show up in essentially every CF-compliant climate or weather dataset. They are the lingua franca that makes CF data interoperable.
If you have a strong case for adding an attribute, file an issue on the GitHub project and we’ll evaluate it.
Related
- CF Conventions §3 — variable attributes.
- CF Conventions §8 — packed data, scale_factor / add_offset.
- CF Standard Name Table — the controlled vocabulary referenced by standard_name.
- NetCDF Import — main user guide for tensogram convert-netcdf.
Error Handling
Tensogram uses typed errors across all language bindings. Every fallible
operation returns a Result (Rust), raises an exception (Python / C++ /
TypeScript), or returns an error code (C). No library code panics.
Error Categories
| Category | Trigger | Rust | Python | C++ | TypeScript | C Code |
|---|---|---|---|---|---|---|
| Framing | Invalid magic bytes, truncated message, bad terminator | TensogramError::Framing | ValueError | framing_error | FramingError | TGM_ERROR_FRAMING (1) |
| Metadata | CBOR parse failure, missing required field, schema violation | TensogramError::Metadata | ValueError | metadata_error | MetadataError | TGM_ERROR_METADATA (2) |
| Encoding | Encoding pipeline failure (e.g. NaN in simple_packing) | TensogramError::Encoding | ValueError | encoding_error | EncodingError | TGM_ERROR_ENCODING (3) |
| Compression | Decompression failure, unknown codec | TensogramError::Compression | ValueError | compression_error | CompressionError | TGM_ERROR_COMPRESSION (4) |
| Object | Invalid descriptor, object index out of range | TensogramError::Object | ValueError | object_error | ObjectError | TGM_ERROR_OBJECT (5) |
| I/O | File not found, permission denied, disk full | TensogramError::Io | OSError | io_error | IoError | TGM_ERROR_IO (6) |
| Hash Mismatch | Payload integrity check fails on verify_hash=True | TensogramError::HashMismatch | RuntimeError | hash_mismatch_error | HashMismatchError | TGM_ERROR_HASH_MISMATCH (7) |
| Invalid Arg | NULL pointer or invalid argument at the API boundary | — | ValueError | invalid_arg_error | InvalidArgumentError | TGM_ERROR_INVALID_ARG (8) |
| Remote | S3 / GCS / Azure / HTTP(S) object-store failure | TensogramError::Remote | OSError | remote_error | RemoteError | TGM_ERROR_REMOTE (10) |
| Streaming Limit | decodeStream internal buffer exceeded the configured maximum | — | — | — | StreamingLimitError | — |
Notes on the TypeScript column:
- All TypeScript errors extend the abstract TensogramError base class, so a single catch (err) { if (err instanceof TensogramError) … } handles every library-raised error.
- HashMismatchError in TypeScript additionally carries parsed expected and actual hex digests when the underlying Rust message is in the canonical "hash mismatch: expected X, got Y" form.
- StreamingLimitError is TS-specific and is raised only from decodeStream when the internal buffer would grow past maxBufferBytes (default 256 MiB).
Error Paths by Operation
Encoding
Input data + metadata dict
│
├─ Missing 'version' ──────────► Metadata error
├─ Missing 'type'/'shape'/'dtype' ► Metadata error
├─ Unknown dtype string ────────► Metadata error
├─ Unknown byte_order ──────────► Metadata error
├─ Data size ≠ shape × dtype ───► Metadata error
├─ Shape product overflow ──────► Metadata error
├─ NaN in simple_packing ───────► Encoding error
├─ Inf reference_value ─────────► Metadata error
├─ Client wrote _reserved_ ─────► Metadata error (message or base[i])
├─ base.len() > descriptors ────► Metadata error (extra entries would be lost)
├─ emit_preceders in buffered ──► Encoding error (use StreamingEncoder)
├─ Param out of range (i32/u32) ► Metadata error (zstd_level, szip_rsi, etc.)
├─ Unknown compression codec ───► Encoding error
├─ Compression codec failure ───► Compression error
└─ File I/O failure ────────────► I/O error
Decoding
Raw bytes
│
├─ No magic bytes / truncated ──► Framing error
├─ Bad frame type codes ────────► Framing error
├─ Frame total_length overflow ─► Framing error
├─ Frame ordering violation ────► Framing error (header→data→footer)
├─ cbor_offset out of range ────► Framing error
├─ CBOR parse failure ──────────► Metadata error
├─ Preceder base ≠ 1 entry ─────► Metadata error
├─ Dangling preceder (no obj) ──► Framing error
├─ Consecutive preceders ────────► Framing error
├─ base.len() > object count ───► Metadata error
├─ Object index out of range ───► Object error
├─ Shape product overflow ──────► Metadata error
├─ Decompression failure ───────► Compression error
├─ Decoding pipeline failure ───► Encoding error
└─ Hash verification mismatch ──► HashMismatch error
File Operations
TensogramFile.open(path)
│
├─ File not found ──────────────► I/O error
├─ Permission denied ───────────► I/O error
└─ Invalid file content ────────► Framing error
TensogramFile.decode_message(index)
│
├─ Index out of range ──────────► Object error / IndexError
└─ Corrupt message at offset ───► Framing error
Streaming Encoder
StreamingEncoder
│
├─ write_preceder(_reserved_) ──► Metadata error
├─ write_preceder twice ─────────► Framing error (no intervening write_object)
├─ finish() with pending prec ──► Framing error (dangling preceder)
├─ write_object invalid shape ──► Metadata error
├─ Encoding pipeline failure ───► Encoding error
├─ Variable-length hash algo ───► Framing error (see below)
└─ I/O write failure ───────────► I/O error
The streaming path writes the frame header before the payload has been
hashed, so it needs to know the final CBOR descriptor length up front.
This works only when the configured HashAlgorithm produces a digest
whose hex representation has a fixed length — currently only Xxh3
(always 16 hex chars). If a future hash algorithm with variable-length
output is used, StreamingEncoder::write_object returns
TensogramError::Framing before writing any bytes, so the caller’s
sink is never corrupted. Use the buffered encode() API for such
algorithms.
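The fixed-digest-length constraint can be made concrete with a toy frame layout. This sketch is not the Tensogram wire format — it just shows why a header written before the payload is hashed forces the digest field to have a known width up front (a 64-bit hash always formats to 16 hex characters, as with Xxh3):

```python
import struct

def encode_streaming(payload: bytes, digest_hex_len: int = 16) -> bytes:
    """Toy length-prefixed frame: the total length is committed *before*
    the digest is computed, so the digest slot must be fixed-width."""
    total = 4 + digest_hex_len + len(payload)
    header = struct.pack("<I", total)  # written first; hash still unknown
    # A 64-bit value zero-padded to 16 hex chars is always 16 bytes of ASCII.
    digest = format(hash(payload) & 0xFFFFFFFFFFFFFFFF, "016x").encode()
    assert len(digest) == digest_hex_len
    return header + digest + payload

frame = encode_streaming(b"tensor-bytes")
# The pre-committed length matches the bytes actually emitted
assert struct.unpack("<I", frame[:4])[0] == len(frame)
```

A variable-length digest would make `total` unknowable at header-write time, which is exactly the case StreamingEncoder rejects with a Framing error before touching the sink.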
CLI Operations
set command
│
├─ Immutable key (shape, dtype) ► Error (cannot modify structural key)
├─ _reserved_ namespace ────────► Error (library-managed)
└─ Invalid object index ────────► Error (out of range)
merge command
│
├─ No input files ──────────────► Error
├─ Invalid strategy name ───────► Error
└─ Conflicting keys (error mode) ► Error (use first/last to resolve)
split command
│
└─ Single-object: pass through; multi-object: split per-object base metadata
Importer Operations (convert-grib / convert-netcdf)
Both importer crates (tensogram-grib, tensogram-netcdf) use typed
error enums and never panic on invalid or exotic input. Anything the
importer can’t represent cleanly is either surfaced as a typed error
or skipped with a warning: … line on stderr so the operator can see
what was dropped.
tensogram-netcdf errors (rust/tensogram-netcdf/src/error.rs)
│
├─ NetcdfError::Netcdf(netcdf::Error)
│ Low-level failure from libnetcdf — file missing, permission
│ denied, format error, truncated file, HDF5 error.
│
├─ NetcdfError::NoVariables
│ Input file has zero supported numeric variables after skipping
│ char/string/compound/vlen. Empty files also hit this.
│
├─ NetcdfError::NoUnlimitedDimension { file }
│ --split-by=record requested but the file has no unlimited
│ dimension. Contains the file path for diagnostics.
│
├─ NetcdfError::UnsupportedType { name, reason }
│ Variable has a type we can't represent (e.g. compound,
│ enum, opaque, vlen). Currently only the char / string
│ variants hit this path — the other complex types are
│ downgraded to a stderr warning and skipped because they
│ frequently coexist with valid numeric variables.
│
├─ NetcdfError::InvalidData(String)
│ Catch-all for:
│ - low-level read errors on a specific variable
│ - unknown --encoding / --filter / --compression names
│ - simple_packing compute_params failures on edge-case data
│ - extract_variable_record invariant violations (should be
│ unreachable; if it fires the importer is buggy)
│
├─ NetcdfError::Encode(String)
│ tensogram rejected the pipeline. Common cause:
│ szip on raw f64 (bits_per_sample=64 exceeds libaec's
│ 32-bit cap). Fix: add --filter shuffle or --encoding
│ simple_packing first.
│
└─ NetcdfError::Io(std::io::Error)
Reserved for future use — the current importer reads
through libnetcdf and writes through the CLI wrapper, so
stdlib I/O errors don't currently reach this variant.
Soft warnings (stderr, exit 0):
warning: {file}: sub-groups found; only root-group variables are converted
warning: skipping variable '{name}': Char variables are not supported
warning: skipping variable '{name}': complex type Compound(_) is not supported
warning: skipping simple_packing for variable '{name}' (not a float64 payload)
warning: variable '{name}': failed to read attribute '{attr}': {cause}
warning: failed to read global attribute '{name}': {cause}
Note: NaN/Inf in a variable that targets simple_packing now
hard-fails the conversion (see
NetCDF Importer — simple_packing on Mixed-dtype Files
below). The previous “warning: skipping simple_packing … NaN value
encountered” line no longer fires; that case is an error rather than
a warning.
The last two lines above are rare — they only fire on corrupt attribute values or unsupported upstream AttributeValue variants — but they surface instead of dropping data silently so operators can trace unexpected missing metadata.
tensogram-grib errors (rust/tensogram-grib/src/error.rs)
│
├─ GribError::Eccodes(String) — ecCodes C library error
├─ GribError::NoMessages — empty GRIB file
├─ GribError::MissingKey — required ecCodes/MARS namespace key absent
├─ GribError::InvalidShape — grid dimension mismatch
└─ GribError::Encode — tensogram encode failure
Language-Specific Patterns
Rust
#![allow(unused)]
fn main() {
    use tensogram::{decode, DecodeOptions, TensogramError};

    match decode(&buffer, &DecodeOptions::default()) {
        Ok((meta, objects)) => { /* use data */ }
        Err(TensogramError::Framing(msg)) => eprintln!("bad format: {msg}"),
        Err(TensogramError::HashMismatch { expected, actual }) =>
            eprintln!("integrity: {expected} ≠ {actual}"),
        Err(e) => eprintln!("error: {e}"),
    }
}
Python
import tensogram

# Decode errors
try:
    msg = tensogram.decode(buf, verify_hash=True)
except ValueError as e:
    # Framing, Metadata, Encoding, Compression, Object errors
    print(f"decode failed: {e}")
except RuntimeError as e:
    # Hash verification mismatch
    print(f"integrity error: {e}")
except OSError as e:
    # File I/O and Remote (S3/GCS/Azure/HTTP) errors
    print(f"I/O error: {e}")

# File errors
try:
    f = tensogram.TensogramFile.open("missing.tgm")
except OSError:
    print("file not found")

# Index errors
with tensogram.TensogramFile.open("data.tgm") as f:
    try:
        msg = f[999]
    except IndexError:
        print("message index out of range")

# Packing errors
try:
    tensogram.compute_packing_params(nan_array, 16, 0)
except ValueError as e:
    print(f"NaN rejected: {e}")
C++
#include <tensogram.hpp>
try {
auto msg = tensogram::decode(buf, len);
} catch (const tensogram::framing_error& e) {
// Invalid message structure
std::cerr << "framing: " << e.what() << " (code " << e.code() << ")\n";
} catch (const tensogram::hash_mismatch_error& e) {
// Payload integrity failure
std::cerr << "hash: " << e.what() << "\n";
} catch (const tensogram::error& e) {
// Any Tensogram error (base class)
std::cerr << "error: " << e.what() << "\n";
}
C
#include "tensogram.h"
tgm_message* msg = tgm_decode(buf, len, 0);
if (!msg) {
tgm_error code = tgm_last_error_code();
const char* message = tgm_last_error();
fprintf(stderr, "%s (%d): %s\n",
tgm_error_string(code), code, message);
}
Note: tgm_last_error() returns a thread-local string valid until the next FFI call on the same thread. Copy it if you need to keep it.
TypeScript
Every error thrown by @ecmwf/tensogram is an instance of the abstract
TensogramError base class. The concrete subclasses match the Rust
variants one-to-one, plus a TS-specific InvalidArgumentError and
StreamingLimitError.
import {
decode,
TensogramError,
FramingError,
HashMismatchError,
ObjectError,
StreamingLimitError,
} from '@ecmwf/tensogram';
try {
const { metadata, objects } = decode(buf, { verifyHash: true });
// ...
} catch (err) {
if (err instanceof HashMismatchError) {
// Structured fields are parsed from the Rust-side message.
console.error('integrity failure:', err.expected, err.actual);
} else if (err instanceof FramingError) {
console.error('bad wire format:', err.message);
} else if (err instanceof ObjectError) {
console.error('object index error:', err.message);
} else if (err instanceof TensogramError) {
console.error('tensogram error:', err.name, err.message);
} else {
throw err;
}
}
All concrete classes expose:
- err.rawMessage — the untruncated string from the WASM / Rust side, including any error-variant prefix ("framing error: ...").
- err.message — the human-readable message with the prefix stripped.
- err.name — stable string name ("FramingError", etc.).
HashMismatchError additionally exposes parsed expected and actual
hex digests when the underlying message follows the canonical
"hash mismatch: expected X, got Y" form.
Streaming decode does not throw on a single corrupt message — the
iterator skips and continues. Register an onError callback to observe
the skips:
import { decodeStream, StreamingLimitError } from '@ecmwf/tensogram';
try {
for await (const frame of decodeStream(res.body!, {
maxBufferBytes: 64 * 1024 * 1024,
onError: ({ message, skippedCount }) => {
console.warn(`skipped corrupt message (#${skippedCount}): ${message}`);
},
})) {
render(frame.descriptor.shape, frame.data());
frame.close();
}
} catch (err) {
if (err instanceof StreamingLimitError) {
// Stream exceeded maxBufferBytes; configure a larger limit or split.
} else {
throw err;
}
}
Note: decodeStream does throw for infrastructure-level failures (buffer limit exceeded, AbortSignal fired, non-ReadableStream input). Only per-message corruption is routed through onError.
Common Error Scenarios
Garbage or Truncated Input
Any non-Tensogram bytes passed to decode() produce a Framing error.
The decoder looks for the 8-byte magic TENSOGRM and a matching terminator.
Hash Mismatch After Corruption
v3 note. Frame-level integrity moved from the decoder to the
validator. verify_hash=True (Python DecodeOptions) or
TGM_DECODE_VERIFY_HASH (C) is retained for source compatibility
but is a no-op on the decode path in v3.
To detect corruption in a v3 message, run the message through
tensogram validate --checksum (CLI), validate_message (Rust),
tgm_validate (C), or the equivalent Python / TypeScript helpers.
The validator:
- Walks every frame and recomputes the xxh3-64 of its body (payload + masks + CBOR; cbor_offset, the hash slot, and ENDF are excluded — see plans/WIRE_FORMAT.md §2.4).
- Compares the recomputed digest to the inline hash slot at frame_end − 12. A mismatch emits a HashMismatch validation issue carrying the expected and actual hex values plus the frame offset.
- When both a HeaderHash and a FooterHash aggregate frame are present, cross-checks them against each other and against the inline slots. Disagreement also surfaces as a HashMismatch.
- An UnknownHashAlgorithm warning fires when the aggregate HashFrame.algorithm is not "xxh3" — the inline slots are still verified (they’re authoritative); only the aggregate’s algorithm identifier is advisory.
Messages encoded with hash_algorithm=None clear the
HASHES_PRESENT preamble flag and leave every inline slot at
0x00…00. On such messages, validate --checksum emits
NoHashAvailable at warning level and cannot detect corruption
beyond structural errors — re-encode with hash_algorithm = Some(Xxh3) to enable integrity checking.
Object Index Out of Range
Accessing decode_object(buf, index=N) where N ≥ number of objects
produces an Object error (Rust/C/C++) or ValueError (Python).
File indexing file[N] raises IndexError for out-of-range N.
NaN / Inf in Simple Packing
compute_packing_params() rejects both NaN and ±Inf values
with a ValueError that includes the index of the first offending
sample. simple_packing’s scale-factor derivation has no meaningful
value for non-finite input — rejecting them up front prevents the
silent corruption path where an i32::MAX-saturated
binary_scale_factor decodes to NaN everywhere.
0.17+ extends this contract to every pipeline: encoding="none"
(and every compressor) rejects NaN / ±Inf input by default. The
NaN / Inf Handling guide covers the
allow_nan / allow_inf opt-in that substitutes non-finite values
with 0.0 and records their positions in a bitmask companion
section.
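The up-front rejection can be sketched in plain Python. This is a simplified model of the finite check, not the library's actual compute_packing_params implementation; the error-message wording here is illustrative.

```python
import math

def check_finite(values):
    """Reject NaN and +/-Inf before deriving packing parameters.

    Sketch of the up-front finite check; like the real library,
    it reports the index of the first offending sample.
    """
    for i, v in enumerate(values):
        if not math.isfinite(v):
            kind = "NaN" if math.isnan(v) else "Inf"
            raise ValueError(f"non-finite value ({kind}) at index {i}")

check_finite([1.0, 2.0, 3.0])              # fine, returns None
try:
    check_finite([1.0, float("nan")])
except ValueError as e:
    print(e)                               # non-finite value (NaN) at index 1
```

Rejecting early keeps the quantization step from ever seeing an infinite range.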
File Not Found / Permission Denied
TensogramFile.open() raises OSError (Python), io_error (C++),
or returns TGM_ERROR_IO (C) for any file system failure.
NetCDF Importer — --split-by=record on Files Without Unlimited Dim
tensogram convert-netcdf --split-by record foo.nc where foo.nc has
no unlimited dimension hard-errors with
NetcdfError::NoUnlimitedDimension { file } (exit code 1). The error
message includes the path so the caller can identify which file in a
multi-input batch triggered it.
NetCDF Importer — simple_packing on Mixed-dtype Files
--encoding simple_packing is f64-only by design. Mixed files (a
typical CF temperature file has f32 lat/lon coordinates alongside
f64 data) are handled gracefully: non-f64 variables emit a stderr
warning and pass through with encoding="none", and the conversion
overall succeeds.
NaN or Inf in a targeted f64 variable is now a hard error (0.17+).
The importer fails with
NetcdfError::InvalidData("simple_packing failed for {var}: ...")
and a recovery hint, rather than silently downgrading the variable
to encoding="none". Pre-0.17 soft-downgrade hid data-quality
problems; the new behaviour surfaces them at conversion time.
Callers relying on the old fallback should either pick a
non-simple_packing encoding up front, opt into the NaN / Inf
bitmask companion via --allow-nan / --allow-inf (see
NaN / Inf Handling), pre-process NaN / Inf
out of the data, or use --split-by variable and choose
per-variable encodings.
NetCDF Importer — Unknown Codec Name
--encoding foo, --filter bar, --compression baz all hard-error
with NetcdfError::InvalidData listing the expected values. The
pre-validation fires inside apply_pipeline so the error surfaces
immediately, before any data is read from disk.
NetCDF Importer — szip on Raw f64
libaec szip caps at 32 bits per sample, but raw f64 gives
bits_per_sample = 64, so --compression szip on unencoded f64
produces a low-level aec_encode_init failed error from
tensogram wrapped in NetcdfError::Encode. Fix:
- Combine with --encoding simple_packing --bits N (N ≤ 32), or
- Combine with --filter shuffle (which makes the element size 8 bits).
Unknown Hash Algorithm (Forward Compatibility)
When the decoder encounters a hash algorithm string it doesn’t recognize
(e.g. a future "sha256" hash), it logs a warning via tracing::warn!
and skips verification rather than failing. This ensures forward
compatibility: older decoders can still read messages produced by newer
encoders that use new hash algorithms.
No-Panic Guarantee
All Rust library code in tensogram, tensogram-encodings, and
tensogram-ffi is free from panic!(), unwrap(), expect(), todo!(),
and unimplemented!() in non-test code paths. The library guarantees:
- All fallible operations return Result<T, TensogramError>.
- Integer arithmetic uses checked operations (checked_mul, try_from) to prevent overflow and truncation.
- u64 → usize conversions use usize::try_from() to prevent truncation on 32-bit platforms.
- Array indexing is guarded by prior bounds checks.
- FFI boundary code returns error codes instead of panicking, and uses unwrap_or_default() only for CString::new() (interior null fallback).
- The scan functions (scan, scan_file) tolerate truncation of total_length as usize because the subsequent bounds check catches it.
- The hash-while-encoding pipeline (PipelineConfig.compute_hash = true plus the streaming encoder’s inline-hash path) verifies its CBOR-length invariant before writing any bytes and surfaces a TensogramError::Framing if a variable-length hash algorithm is ever configured — the caller’s sink is never left in a partial-write state on that specific failure mode. Internal debug assertions guard against non-deterministic CBOR serialisation during development.
Edge Cases
A collection of non-obvious situations and how the library handles them.
Corrupted Messages
What happens: The scanner (scan()) searches for TENSOGRM magic bytes and validates the postamble (last 8 bytes should be 39277777). If total_length is set, the scanner checks for the end magic at the expected position.
Recovery: If a message fails validation, the scanner skips one byte and resumes searching. A single corrupted message in a multi-message file does not prevent reading the others.
#![allow(unused)]
fn main() {
let offsets = scan(&file_bytes);
// offsets only contains valid (start, length) pairs
// Corrupted regions are silently skipped
}
Edge case within edge case: If a random byte sequence inside a valid payload happens to match TENSOGRM, the scanner might try to parse a “message” starting mid-payload. The postamble cross-check catches this: the false start’s postamble won’t contain the expected 39277777 end magic.
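The skip-one-byte recovery loop can be modeled with a toy layout. The magic and end-magic strings are taken from the text above, but the 4-byte big-endian total_length field and the overall layout are invented for this sketch; the real wire format differs.

```python
MAGIC = b"TENSOGRM"
END = b"39277777"

def scan(buf: bytes):
    """Toy scanner returning (offset, length) pairs for valid messages.

    Assumed toy layout: MAGIC + 4-byte big-endian total_length +
    payload + END, where total_length counts the whole message.
    On any validation failure, skip one byte and resume searching.
    """
    out, pos = [], 0
    while True:
        start = buf.find(MAGIC, pos)
        if start < 0:
            return out
        hdr = buf[start + 8:start + 12]
        if len(hdr) == 4:
            total = int.from_bytes(hdr, "big")
            end = start + total
            # cross-check: the end magic must sit exactly where
            # total_length says the message ends
            if total >= 20 and end <= len(buf) and buf[end - 8:end] == END:
                out.append((start, total))
                pos = end          # valid message: jump past it
                continue
        pos = start + 1            # false start or corruption: skip one byte
```

A false TENSOGRM match inside a payload fails the end-magic cross-check, so the scanner resumes one byte later, exactly the recovery behaviour described above.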
NaN in Simple Packing
Simple packing cannot represent NaN. The quantization formula maps the range [min, max] onto integers, and NaN has no defined place in this range.
What happens: compute_params() returns PackingError::NanValue(index) if any value is NaN. The encode() function also rejects NaN inputs before packing.
Solution: Replace NaN values with a sentinel (e.g. the minimum representable value, or a separate bitmask object) before encoding.
Inf in Simple Packing — Silent Corruption
Subtle gotcha — simple_packing’s compute_params scans for NaN but not for Inf. Passing [1.0, +Inf, 3.0]:
- range = max - min = +Inf, which produces binary_scale_factor = i32::MAX (saturating cast from Inf as i32).
- Encoding yields all-zero packed integers.
- Decoding reconstructs NaN at every position (because Inf × 0 = NaN in IEEE 754).
Net effect: every decoded value silently becomes NaN.
Mitigation: turn on strict-finite encoding (see docs). It catches Inf upstream of the simple_packing encoder and fails with a clean EncodingError before the corruption path runs.
Also: extract_simple_packing_params catches a non-finite reference_value in the descriptor, so callers going through the high-level encode() API are protected when the computed reference happens to be ±Inf (e.g. data like [1.0, -Inf]). But for data like [1.0, +Inf, 3.0] the reference is 1.0 (finite) and only binary_scale_factor overflows — that’s not caught without the strict flag.
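The corruption path is plain IEEE 754 arithmetic and can be reproduced without the library:

```python
import math

mn, mx = 1.0, float("inf")     # data like [1.0, +Inf, 3.0]
rng = mx - mn                  # range = +Inf
print(rng)                     # inf

# In Rust, a saturating float-to-int cast of Inf yields i32::MAX,
# and the resulting 2**binary_scale_factor step overflows to +Inf.
step = float("inf")

packed = 0                     # every packed integer is zero
decoded = mn + packed * step   # 0 x Inf is NaN in IEEE 754
print(math.isnan(decoded))     # True
```

Every decoded sample is NaN, with no error raised anywhere, which is why the strict-finite flag exists.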
Decode Range on Compressed Data
decode_range() supports partial range decode for compressors that have random access capability: szip (via RSI block offsets), blosc2 (via chunk-based access), and zfp fixed-rate mode. Stream compressors (zstd, lz4, sz3) return CompressionError::RangeNotSupported.
Workaround for stream compressors: Decode the full object with decode_object() and slice the result in memory.
Bitmask Byte Width
Dtype::Bitmask returns 0 from byte_width(). This is a sentinel, not a real byte width.
Why: A bitmask of N elements occupies ceil(N / 8) bytes. The library cannot infer N from the byte width alone, so the “element size” concept doesn’t apply. Callers that need the payload size must compute it from the element count.
#![allow(unused)]
fn main() {
let num_elements: u64 = descriptor.shape.iter().product();
let payload_bytes = if descriptor.dtype == Dtype::Bitmask {
let n = usize::try_from(num_elements)?;
(n + 7) / 8
} else {
let n = usize::try_from(num_elements)?;
n * descriptor.dtype.byte_width()
};
}
verify_hash on Messages Without Hashes
If a message was encoded with hash_algorithm: None (no hash), and you decode it with verify_hash: true, the decoder silently skips hash verification for that object. No error is returned.
Rationale: The absence of a hash is not an error. The decoder cannot verify what was never stored. If you need to enforce that all messages have hashes, check descriptor.hash.is_some() after decoding.
Constant-Value Fields with simple_packing
If all values in a field are identical (range = 0), compute_params() sets binary_scale_factor such that all packed integers are 0, and the full value is recovered from reference_value alone. This is correct and handled without special cases.
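A toy model of the round trip shows why range = 0 needs no special case. This is a simplified quantizer, not the library's exact formula (real simple_packing also involves binary and decimal scale factors):

```python
def quantize(values, bits=16):
    """Toy linear quantizer: map [min, max] onto integers."""
    ref = min(values)
    rng = max(values) - ref
    # range 0: any step works, since every (v - ref) is 0
    step = 1.0 if rng == 0.0 else rng / (2**bits - 1)
    packed = [round((v - ref) / step) for v in values]
    return ref, step, packed

def dequantize(ref, step, packed):
    return [ref + p * step for p in packed]

ref, step, packed = quantize([273.15] * 4)   # constant field
assert packed == [0, 0, 0, 0]                # all packed integers are 0
assert dequantize(ref, step, packed) == [273.15] * 4
```

The constant value is carried entirely by the reference, and the packed integers contribute nothing, so the ordinary code path recovers it exactly.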
Very Short Buffers
Passing a buffer shorter than the preamble size (24 bytes) to any decode function returns TensogramError::Framing("buffer too short ..."). No panic.
Object Index Out of Range
decode_object(&message, 99, &options) when the message has fewer than 100 objects returns TensogramError::Object("object index N out of range").
Empty Files
TensogramFile::message_count() returns 0. read_message(0) returns an error.
CBOR Key Ordering
The library uses canonical CBOR key ordering (RFC 8949 §4.2). If you construct a GlobalMetadata struct with keys in one order and then check the CBOR bytes, the bytes may not match your insertion order. This is intentional and correct — it ensures deterministic output.
If you need to compare metadata across languages or implementations, always compare the decoded values, not the raw CBOR bytes from different encoders.
You can verify that any CBOR output is canonical using the verify_canonical_cbor() utility:
#![allow(unused)]
fn main() {
use tensogram::verify_canonical_cbor;
let cbor_bytes = /* ... */;
verify_canonical_cbor(&cbor_bytes)?; // Returns Ok(()) if canonical, Err if not
}
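For context, RFC 8949 §4.2 orders map keys by the bytewise lexicographic order of their encoded form. Because a text string's length lives in its head byte, shorter keys sort before longer ones, and same-length keys sort bytewise. A minimal sketch for short text keys (this is not the library's encoder, just an illustration of the ordering rule):

```python
def encode_text_key(s: str) -> bytes:
    """Minimal CBOR encoding of a text string (major type 3)."""
    b = s.encode("utf-8")
    n = len(b)
    if n < 24:
        return bytes([0x60 | n]) + b     # length packed into the head byte
    if n < 256:
        return bytes([0x78, n]) + b      # one-byte length argument
    raise ValueError("sketch handles keys under 256 bytes only")

def canonical_order(keys):
    # RFC 8949 section 4.2.1: sort by bytewise order of the encoded key
    return sorted(keys, key=encode_text_key)

print(canonical_order(["version", "base", "_extra_"]))
# ['base', '_extra_', 'version']  (shorter first, then bytewise)
```

This is why the serialized bytes may not match your insertion order, and why deterministic output falls out of the sort.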
Frame Ordering Violations
The decoder validates that frames appear in the expected order: header frames first, then data object frames, then footer frames. A message with frames out of order (e.g. a header metadata frame appearing after a data object frame) is rejected with TensogramError::Framing.
This catches malformed or tampered messages. Valid messages produced by the encoder always have correct ordering.
Streaming Mode (total_length = 0)
When encoding for a non-seekable output (e.g. TCP socket), the preamble’s total_length is set to 0. In this mode:
- Header index and header hash frames are omitted (the encoder doesn’t know the data object count or offsets upfront).
- The footer must contain at least the metadata frame.
- The first_footer_offset in the postamble points to the first footer frame.
Decoders that encounter total_length = 0 should read from the postamble backward to find the footer frames, then use the footer index (if present) for random access to data objects.
first_footer_offset is Never Zero
The postamble’s first_footer_offset field always points to a valid position:
- If footer frames exist: it points to the start of the first footer frame.
- If no footer frames exist: it points to the start of the postamble itself.
This invariant means decoders can always seek to first_footer_offset and determine whether they’ve landed on a footer frame or the postamble.
Inter-Frame Padding
The encoder may insert padding bytes between frames for memory alignment (e.g. 64-bit alignment). Padding appears between the ENDF marker of one frame and the FR marker of the next. Decoders should scan for the FR marker rather than assuming frames are contiguous.
Zero-Element Tensors
Shapes containing zero dimensions are valid: shape: [0], shape: [3, 0, 5]. This matches numpy and PyTorch semantics where zero-element tensors are legitimate objects (e.g. an empty batch). The encoded payload for a zero-element tensor is zero bytes.
Scalar Tensors
shape: [] (empty shape, ndim: 0) represents a scalar tensor containing exactly one element. The payload size equals dtype.byte_width() bytes.
Metadata-Only Messages
A message with zero data objects is valid. This can be used to transmit metadata without any tensor data (e.g. coordination signals, timestamps, provenance records). Both encode() with an empty descriptors slice and StreamingEncoder with no write_object() calls produce valid messages.
Mixed Dtypes in One Message
Multiple data objects in the same message may have different dtypes. For example, a Float32 tensor paired with a Bitmask object used as a missing-data mask. Each object’s pipeline (encoding, filter, compression) is configured independently.
Bitmask with Encoding/Compression
Bitmask data is internally packed into uint8 bytes. Any encoding or compression pipeline that supports uint8 should work with bitmask data. The total bit count must be stored separately (in the shape) since the byte count ceil(N / 8) may not equal N exactly.
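The byte packing can be sketched as follows. The LSB-first bit order within each byte is an assumption of this sketch; the actual wire format may specify a different order.

```python
def pack_bitmask(bits):
    """Pack a sequence of booleans into ceil(N / 8) uint8 bytes."""
    out = bytearray((len(bits) + 7) // 8)
    for i, bit in enumerate(bits):
        if bit:
            out[i // 8] |= 1 << (i % 8)   # assumption: LSB-first per byte
    return bytes(out)

mask = pack_bitmask([True, False, True] * 3)   # 9 bits
assert len(mask) == 2                          # ceil(9 / 8) = 2 bytes
```

The 9-bit example shows why the element count must live in the shape: two bytes could equally hold 10, 12, or 16 bits.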
Strides Validation
Strides are validated for length: strides.len() must match shape.len(). Non-contiguous strides (e.g. shape: [4, 4], strides: [8, 1]) are accepted — they indicate a view into a larger array and are semantically valid.
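For reference, row-major contiguous strides (counted in elements, matching the [20, 1] example for shape [10, 20] used elsewhere in this document) can be computed as:

```python
def contiguous_strides(shape):
    """Row-major strides in elements: the last axis varies fastest."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides

assert contiguous_strides([10, 20]) == [20, 1]
assert contiguous_strides([4, 4]) == [4, 1]   # contiguous, unlike [8, 1]
assert contiguous_strides([]) == []           # scalar: empty strides
```

Comparing against this reference is one way to tell a contiguous descriptor from a strided view.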
Version Constraints
- version: 0 and version: 1 are deprecated and must be rejected by the decoder.
- version: 2 is the current version.
- Higher versions (3+) are reserved for future use and will be valid once defined.
NaN/Infinity in Simple Packing Parameters
If reference_value is NaN or Infinity, encoding fails immediately with a clear error. This value is used in the quantization formula and would produce corrupt output. (binary_scale_factor and decimal_scale_factor are integers and cannot be NaN/Infinity.)
Duplicate CBOR Keys
Duplicate keys at the same level in a CBOR map are never accepted. The library uses canonical CBOR (RFC 8949 §4.2) which inherently rejects duplicate keys. Same-name keys at different nesting levels are acceptable: base[0]["foo"] and _extra_["foo"] are distinct keys.
Unknown Hash Algorithm on Decode
If a message contains a hash with an algorithm the decoder doesn’t recognize (e.g. "sha256" when only xxh3 is implemented), verify_hash: true issues a warning and skips verification rather than returning an error. This ensures forward compatibility when new hash algorithms are added.
decode_range with Empty Ranges
Calling decode_range() with an empty ranges slice (&[]) returns (descriptor, vec![]) — the parts vector is empty. This is not an error.
Preceder Metadata Error Paths
The decoder validates PrecederMetadata frames strictly:
| Condition | Error type | Message |
|---|---|---|
| Consecutive preceders without DataObject | Framing | “PrecederMetadata must be followed by a DataObject frame, got {type}” |
| Dangling preceder (no DataObject follows) | Framing | “dangling PrecederMetadata: no DataObject frame followed” |
| Base has 0 or 2+ entries | Metadata | “PrecederMetadata base must have exactly 1 entry, got {n}” |
| Metadata base entries > data objects | Metadata | “metadata base has {n} entries but message contains {m} objects” |
On the encoder side:
- StreamingEncoder::write_preceder() errors if called twice without an intervening write_object().
- StreamingEncoder::finish() errors if a preceder was written without a following write_object().
- encode() (buffered mode) errors if emit_preceders: true — use StreamingEncoder::write_preceder() instead.
File Concatenation
Tensogram is a message format, not a file format. Multiple .tgm files can be concatenated:
cat 1.tgm 2.tgm > all.tgm
The resulting file is valid. scan() and TensogramFile will find all messages from both source files.
xarray Layer Edge Cases
meta.base Out-of-Range
If a message has more data objects than meta.base entries (e.g. 3 objects but base has only 1 entry), the xarray layer logs a warning and treats the missing base entries as empty dicts. The objects are still decoded — they just have no per-object metadata attributes.
This can happen when a message is encoded with an incomplete base array, or when objects are appended to a message without updating base. The warning helps diagnose silent metadata loss:
WARNING: meta.base has 1 entries but object index 2 requested;
per-object metadata will be empty for this object
Empty or Missing base Attribute
A message with base: [] or no base key at all is valid. All objects get empty per-object metadata and are named object_0, object_1, etc. The _reserved_ key (auto-populated by the encoder in each base entry) is always filtered out — it never appears in user-facing variable attributes.
Variable Naming with Dot Paths
When variable_key="mars.param" is used, the resolve_variable_name() function traverses the nested dict path. If any segment is missing, the function falls back to the generic object_<index> name. The obj_index used is the object’s position in the message (not its position among data variables), so a file with objects 0 (coord), 1 (data), 2 (data) would produce names like "object_1" and "object_2" for the data variables.
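The fallback behaviour can be sketched as a dot-path walk over plain dicts (a hypothetical simplified model of resolve_variable_name, not the xarray layer's actual code):

```python
def resolve_variable_name(per_object_meta: dict, variable_key: str,
                          obj_index: int) -> str:
    """Walk a dot path through nested dicts; fall back to object_<index>."""
    node = per_object_meta
    for segment in variable_key.split("."):
        if not isinstance(node, dict) or segment not in node:
            return f"object_{obj_index}"   # any missing segment falls back
        node = node[segment]
    return str(node)

assert resolve_variable_name({"mars": {"param": "2t"}}, "mars.param", 1) == "2t"
assert resolve_variable_name({"mars": {}}, "mars.param", 2) == "object_2"
```

Note that obj_index is the object's position in the message, which is why data variables can be named "object_1" and "object_2" even when they are the only data variables.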
Coordinate Name Case Insensitivity
Coordinate detection (detect_coords) is case-insensitive: "LATITUDE", "Lat", and "latitude" all match the known coordinate name "latitude". The canonical dimension name is always lowercase (e.g. "latitude", not "LATITUDE").
Ambiguous Dimension Size Matching
When two coordinate arrays have the same size (e.g. latitude with 5 points and depth with 5 points), the dimension resolution assigns the first matching coord to the first axis that matches the size, and the second to the next axis. If the data variable is 2D [5, 5], one axis gets "latitude" and the other gets "depth". When no coord has the matching size, the axis gets a generic "dim_N" name.
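The size-matching rule can be sketched as a greedy first-match pass over the axes. This is a simplified model (it relies on the insertion order of the coords dict to define "first"), not the library's actual resolver:

```python
def assign_dims(shape, coords):
    """coords maps coordinate name -> length; each coord is used at most once."""
    remaining = dict(coords)
    dims = []
    for axis, size in enumerate(shape):
        match = next((name for name, n in remaining.items() if n == size), None)
        if match is None:
            dims.append(f"dim_{axis}")   # no coord of this size: generic name
        else:
            dims.append(match)
            del remaining[match]         # consume the coord
    return dims

assert assign_dims([5, 5], {"latitude": 5, "depth": 5}) == ["latitude", "depth"]
assert assign_dims([3, 7], {"latitude": 5}) == ["dim_0", "dim_1"]
```

Consuming each coordinate after its first match is what resolves the [5, 5] ambiguity deterministically.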
Multi-Message Merge with Different Keys
When open_datasets() merges multiple messages, objects whose base entries have different key sets are handled as follows:
- Keys present in all objects with identical values become Dataset attributes (constant).
- Keys present in all objects with varying values become outer dimensions (if they form a hypercube) or separate variables.
- Keys present in some objects but not others are treated as varying with None for missing entries.
_reserved_ Filtering Consistency
The _reserved_ key is filtered at every access point:
- TensogramDataStore._get_per_object_meta() (store.py)
- _base_entry_from_meta() (scanner.py)
- _filter_reserved() (zarr store.py)
This ensures the encoder’s auto-populated tensor info (ndim, shape, strides, dtype) never leaks into user-facing metadata.
Zarr Layer Edge Cases
Group Attributes from meta.extra
Group-level attributes in the root zarr.json come from meta.extra (message-level annotations). If meta.extra is empty or absent, the group zarr.json only contains internal attributes (_tensogram_version, _tensogram_variables).
Per-Array Attributes from meta.base[i]
Per-array attributes come from meta.base[i] with the _reserved_ key filtered out. Descriptor encoding params are stored under _tensogram_params to avoid namespace collisions.
Variable Name Resolution — No Extra Fallback
Variable names are resolved exclusively from per_object_meta (from meta.base[i]). The common_meta (from meta.extra) is not searched for variable naming. This prevents all objects in a message from sharing the same name when a name key exists only at the message level.
This is consistent across both xarray and zarr layers.
Zarr Metadata Key Collision
If a base entry has keys like "zarr", "chunks", or "shape", they go into the Zarr array’s attributes dict — not the top-level metadata. There is no collision with Zarr’s own shape, chunk_grid, etc. fields.
Write Path: _reserved_ Filtering
When writing through TensogramStore, user-set array attributes are written into base[i] entries. The _reserved_ key is explicitly filtered from these entries to prevent collision with the encoder’s auto-populated _reserved_.tensor info.
Write Path: Group Attributes
Group attributes set via Zarr become unknown top-level keys in GlobalMetadata, which the encoder preserves as _extra_. On re-read, they appear in meta.extra. Internal keys (starting with _tensogram_) and reserved structural keys (version, base, _extra_, _reserved_) are excluded.
Empty TGM File
A .tgm file with zero messages produces a root group zarr.json with no arrays. A message with zero data objects produces a root group with the message’s extra metadata but no arrays.
Variable Name Deduplication
When multiple objects resolve to the same name, suffixes _1, _2, etc. are appended. For example, three objects named "x" become "x", "x_1", "x_2".
Variable Name Sanitization
Slashes and backslashes in resolved variable names are replaced with underscores to prevent spurious directory nesting in the Zarr virtual key space. Empty names are replaced with "_".
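Both rules together, as a sketch (this mirrors the described behaviour but is not the zarr layer's actual code; it also ignores the corner case where a literal "x_1" already exists among the inputs):

```python
def sanitize(name: str) -> str:
    """Replace path separators; map empty names to '_'."""
    name = name.replace("/", "_").replace("\\", "_")
    return name if name else "_"

def dedupe(names):
    """Append _1, _2, ... to repeated names, first occurrence unchanged."""
    counts, out = {}, []
    for name in names:
        if name in counts:
            counts[name] += 1
            out.append(f"{name}_{counts[name]}")
        else:
            counts[name] = 0
            out.append(name)
    return out

assert sanitize("a/b\\c") == "a_b_c"
assert sanitize("") == "_"
assert dedupe(["x", "x", "x"]) == ["x", "x_1", "x_2"]
```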
GRIB Importer Edge Cases
This section covers behaviour specific to the tensogram-grib importer and
the tensogram convert-grib CLI — these notes apply when you are bringing
GRIB data into Tensogram, not to Tensogram itself.
Single GRIB to base[0] Has ALL MARS Keys
In OneToOne mode, each GRIB message becomes one Tensogram message. All MARS namespace keys (plus gridType as "grid") go into base[0]["mars"]. When --all-keys is enabled, non-MARS namespace keys (geography, time, vertical, parameter, statistics) go into base[0]["grib"].
MergeAll with N Fields
In MergeAll mode, N GRIB fields become one Tensogram message with N data objects. Each base[i] holds ALL metadata for that object independently — there is no common/varying partitioning at encode time. This means metadata keys are duplicated across base entries.
Performance note: With 1000 GRIB fields, this means 1000 copies of common keys (class, type, stream, expver, date, time, etc.). This is by design — the wire format prioritizes simplicity and independent object access over byte savings. Use tensogram::compute_common() at display/merge time to extract shared keys.
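A flat-key sketch of the shared-key extraction (the real tensogram::compute_common presumably handles nested maps; this only illustrates the idea):

```python
def compute_common(base):
    """Return keys present in every base entry with identical values."""
    if not base:
        return {}
    common = dict(base[0])
    for entry in base[1:]:
        # keep a key only if this entry agrees on it
        common = {k: v for k, v in common.items()
                  if k in entry and entry[k] == v}
    return common

entries = [
    {"class": "od", "param": "2t"},
    {"class": "od", "param": "msl"},
]
assert compute_common(entries) == {"class": "od"}
```

Extracting shared keys at display or merge time is the intended counterpart to the duplicate-everything encoding strategy.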
Different Grid Types in MergeAll
GRIB fields with different grid types (e.g. regular_ll and reduced_gg) can be merged into the same Tensogram message. Each base[i]["mars"]["grid"] independently records its grid type. Downstream consumers (xarray, zarr) must handle the structural differences (e.g. different shapes).
GRIB Shape from Ni/Nj
The shape is derived from ecCodes Ni and Nj keys (row-major: [Nj, Ni]). If either is zero or missing (e.g. reduced Gaussian grids), the shape falls back to [numberOfPoints] (1-D).
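The derivation reduces to a simple rule, sketched here with hypothetical example values (treating 0 or a missing key as "unusable"):

```python
def grib_shape(ni, nj, number_of_points):
    """Row-major [Nj, Ni] when both are usable; 1-D fallback otherwise."""
    if ni and nj:                 # 0 or None counts as missing
        return [nj, ni]
    return [number_of_points]     # e.g. reduced Gaussian: Ni varies per row

assert grib_shape(360, 181, 65160) == [181, 360]   # regular lat-lon grid
assert grib_shape(0, 181, 108160) == [108160]      # reduced grid: 1-D fallback
```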
Empty params in DataObjectDescriptor
GRIB-converted data objects have empty desc.params — all metadata lives in base[i]["mars"] and base[i]["grib"], not in the per-object descriptor. This is by design: the descriptor carries only what’s needed to decode the payload (shape, dtype, encoding pipeline).
Metadata Model Edge Cases (base / _reserved_ / _extra_)
The v2 metadata model has three sections: base (per-object), _reserved_ (library internals), and _extra_ (client annotations). These create several non-obvious edge cases.
_reserved_ is Protected
Client code must not set _reserved_ in any context:
- Python: tensogram.encode({"version": 2, "_reserved_": {...}}) raises ValueError.
- Python: encode({"version": 2, "base": [{"_reserved_": {...}}]}) raises ValueError.
- FFI: JSON with "base": [{"_reserved_": {...}}] returns TgmError::Metadata.
- CLI: set -s _reserved_.tensor.ndim=5 returns an error.
The encoder auto-populates _reserved_.tensor in each base entry (ndim, shape, strides, dtype) and _reserved_ at the message level (encoder, time, uuid).
Metadata Lookup Semantics (base first-match)
All lookup functions (__getitem__ in Python, tgm_metadata_get_string in FFI, lookup_key in CLI) use first-match semantics:
- Search base[0], then base[1], …, skipping the _reserved_ key within each entry.
- If not found in any base entry, search _extra_.
- If not found → None (FFI/CLI) or KeyError (Python).
Implication: If base[0] has product.name="temperature" and base[1] has product.name="pressure", lookups return "temperature" (the first match). This is message-level lookup, not per-object. The same applies to any namespace (MARS, BIDS, DICOM, etc.).
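First-match lookup can be modeled like so. This is a sketch over plain dicts with dot paths, not the library's CBOR-backed implementation:

```python
def lookup(meta: dict, path: str):
    """Message-level first-match lookup: base entries in order, then _extra_."""
    def walk(node, segments):
        for seg in segments:
            if not isinstance(node, dict) or seg not in node:
                return None
            node = node[seg]
        return node

    segments = path.split(".")
    if not path or segments[0] == "_reserved_":
        return None                        # empty keys and _reserved_ never match
    for entry in meta.get("base", []):     # base[0], base[1], ... first match wins
        visible = {k: v for k, v in entry.items() if k != "_reserved_"}
        hit = walk(visible, segments)
        if hit is not None:
            return hit
    return walk(meta.get("_extra_", {}), segments)

meta = {
    "base": [
        {"product": {"name": "temperature"}, "_reserved_": {"tensor": {}}},
        {"product": {"name": "pressure"}},
    ],
    "_extra_": {"custom": "value"},
}
assert lookup(meta, "product.name") == "temperature"   # first match, message-level
assert lookup(meta, "custom") == "value"               # falls through to _extra_
assert lookup(meta, "_reserved_.tensor") is None       # blocked
```

The same walk also covers the deeply nested paths described later: hitting a non-dict value before the path is exhausted returns None.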
_reserved_ is Hidden from Dict Access
- meta["_reserved_"] → KeyError (Python). The key is skipped during base entry iteration. "_reserved_" in meta → False.
- tgm_metadata_get_string(meta, "_reserved_.tensor") → NULL (FFI). The path is blocked.
- To read _reserved_ data, use meta.reserved (Python) or read the base entry directly via meta.base[i]["_reserved_"].
Explicit _extra_ / extra Prefix
The CLI and FFI support explicit _extra_.key or extra.key prefixes to target the _extra_ map directly, bypassing the base search:
# CLI: write to _extra_ map
tensogram set -s "extra.custom=value" input.tgm output.tgm
tensogram set -s "_extra_.custom=value" input.tgm output.tgm
# CLI: read from _extra_ map
tensogram get -p "_extra_.custom" input.tgm
Without the prefix, set writes to all base entries. With the prefix, it writes to _extra_ specifically.
Empty Key String
An empty key "" returns None (FFI/CLI) or raises KeyError (Python). This is not an error — it simply finds no match.
base vs Descriptor Count
The base array length should match the number of data objects. The encoder auto-extends base entries (adding _reserved_.tensor) for each object. If the user provides fewer base entries than objects, the encoder creates entries for the missing ones. If the user provides more base entries than objects, the encoder returns an error.
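The alignment rule, sketched (error wording borrowed from the table later in this section; this is a model, not the encoder's code):

```python
def align_base(base, num_objects):
    """Auto-extend base to one entry per object; excess entries are an error."""
    if len(base) > num_objects:
        raise ValueError(
            f"metadata base has {len(base)} entries but only "
            f"{num_objects} descriptors provided; extra base entries "
            f"would be discarded"
        )
    # missing entries are created empty; the encoder then populates
    # _reserved_.tensor in each
    return base + [{} for _ in range(num_objects - len(base))]

assert align_base([{"a": 1}], 3) == [{"a": 1}, {}, {}]
```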
tgm_metadata_num_objects (FFI)
tgm_metadata_num_objects() returns base.len(), which is the number of per-object metadata entries. After encoding, this matches the actual data object count because the encoder populates one base entry per object.
set Command on Zero-Object Messages
The CLI set command redirects mutations to _extra_ when the message has zero data objects. This is because base entries must align 1:1 with descriptors, and a zero-object message has no descriptors.
Both _extra_ and extra in Python Dict
When both "_extra_" and "extra" are present in a Python metadata dict, _extra_ takes precedence (it’s the wire-format name). The "extra" key is treated as a convenience alias and only used if "_extra_" is absent.
Filter Matching with Multi-Object Messages
CLI where-clause filters (-w mars.param=2t) match at the message level. If base[0] has mars.param=2t and base[1] has mars.param=msl, the filter matches "2t" (first base entry match). To filter by per-object values, split the message first.
Split Preserves Per-Object Metadata
When splitting a multi-object message, the CLI split command assigns each object its own base entry from the original message. The _reserved_ key is stripped from each entry (the encoder regenerates it). Extra metadata is copied to all split messages.
Merge Concatenates Base Arrays
When merging messages, the CLI merge command concatenates all base arrays. The merge strategy (first/last/error) only applies to _extra_ key conflicts. The _reserved_ section is cleared and regenerated by the encoder.
Deeply Nested Paths
Dot-notation paths support arbitrary nesting depth: grib.geography.Ni, a.b.c.d.e. The recursive resolver walks through CBOR Map values at each level. If a non-Map value is encountered before the path is fully resolved, the lookup returns None.
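The lookup logic can be sketched in Python over plain dicts standing in for CBOR Maps (the function name resolve_path is illustrative, not part of the bindings):

```python
def resolve_path(node, path):
    """Walk a dot-notation path through nested dicts; None if unresolvable."""
    for part in path.split("."):
        if not isinstance(node, dict):  # non-Map hit before path is exhausted
            return None
        if part not in node:
            return None
        node = node[part]
    return node

meta = {"grib": {"geography": {"Ni": 360}}}
print(resolve_path(meta, "grib.geography.Ni"))    # → 360
print(resolve_path(meta, "grib.geography.Ni.x"))  # → None (360 is not a Map)
```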
JSON Output Structure
CLI dump -j and ls -j output uses the wire-format structure:
{
"version": 2,
"base": [{"mars": {"param": "2t"}, "_reserved_": {"tensor": {"ndim": 1}}}],
"extra": {"custom": "value"}
}
The _reserved_ keys within base entries are included in JSON output for transparency.
Metadata Refactor: Detailed Edge Cases
The following edge cases were identified during systematic review of the Rust core crate (tensogram) after the metadata refactor.
base Array Count Validation
| Scenario | Behaviour |
|---|---|
| base.len() < descriptors.len() | Auto-extended with empty entries. _reserved_.tensor is inserted in each. |
| base.len() == descriptors.len() | Normal path. Pre-existing application keys preserved. |
| base.len() > descriptors.len() | Error: "metadata base has N entries but only M descriptors provided; extra base entries would be discarded". |
Rationale: Silently truncating excess base entries would lose user data. Auto-extending is safe because the library adds _reserved_.tensor to each new entry.
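A minimal sketch of this rule, using plain dicts for base entries (the helper name validate_base is hypothetical):

```python
def validate_base(base, num_descriptors):
    """Auto-extend base to match the descriptor count, or error if too long."""
    if len(base) > num_descriptors:
        raise ValueError(
            f"metadata base has {len(base)} entries but only "
            f"{num_descriptors} descriptors provided; extra base entries "
            f"would be discarded"
        )
    # Auto-extend with empty entries; the encoder later fills in
    # _reserved_.tensor for each one.
    return base + [{} for _ in range(num_descriptors - len(base))]

print(len(validate_base([{"mars": {"param": "2t"}}], 3)))  # → 3
```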
_reserved_.tensor After Encode
After encoding, each base[i]["_reserved_"]["tensor"] always contains exactly four keys:
| Key | Value | Example |
|---|---|---|
| ndim | CBOR integer | 0 for scalar, 2 for matrix |
| shape | CBOR array of integers | [] for scalar, [10, 20] for matrix |
| strides | CBOR array of integers | [] for scalar, [20, 1] for matrix |
| dtype | CBOR text | "float32", "int64", etc. |
For scalar tensors (ndim: 0), shape and strides are empty arrays [].
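The strides column follows from the shape alone for contiguous tensors: the example [20, 1] for shape [10, 20] is row-major element strides. A sketch of that derivation (assuming row-major layout, as the example suggests):

```python
def contiguous_strides(shape):
    """Row-major element strides: the last axis has stride 1, and each
    earlier axis strides by the product of the dimensions after it."""
    strides = []
    acc = 1
    for dim in reversed(shape):
        strides.append(acc)
        acc *= dim
    return list(reversed(strides))

print(contiguous_strides([10, 20]))  # → [20, 1]
print(contiguous_strides([]))        # → [] (scalar)
```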
Preceder _reserved_ Protection
Encoder side: StreamingEncoder::write_preceder() rejects any metadata map containing a _reserved_ key. Error: “client code must not write ‘reserved’ in preceder metadata”.
Decoder side: When the decoder encounters a _reserved_ key in a preceder’s base[0], it strips the key rather than rejecting the message. This is permissive — the data may come from a non-standard producer. The encoder-populated _reserved_.tensor from the footer metadata is preserved.
Merge order in finish(): Footer metadata is populated first (_reserved_.tensor), then preceder payloads are merged on top. Since the decoder strips _reserved_ from preceders, there is no risk of preceder _reserved_ clobbering the encoder’s _reserved_.tensor.
Backward Compatibility with Old CBOR Keys
| Old key | Behaviour on decode |
|---|---|
| "common" (v2 pre-refactor) | Silently ignored (unknown CBOR key). |
| "payload" (v2 pre-refactor) | Silently ignored. |
| "reserved" (old name) | Silently ignored — only "_reserved_" is recognized. |
| Both "reserved" and "_reserved_" | Only "_reserved_" is captured; "reserved" is ignored. |
GlobalMetadata does not use #[serde(deny_unknown_fields)], so serde drops unrecognized keys.
compute_common() Key Selection
compute_common() only examines keys from the first base entry as candidates for common keys. Keys present in later entries but absent from the first entry are never promoted to common.
Example: if entry 0 has keys {a, b} and entry 1 has {b, c}, only b is a candidate (and becomes common if values match). Key c appears only in entry 1’s remaining set.
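The selection rule can be sketched with plain dicts, using Python == as a stand-in for cbor_values_equal():

```python
def compute_common(entries):
    """Keys of the first entry whose values match in every other entry."""
    if not entries:
        return {}
    return {
        k: v
        for k, v in entries[0].items()
        # Keys absent from entry 0 are never candidates, even if later
        # entries agree on them.
        if all(k in e and e[k] == v for e in entries[1:])
    }

entries = [{"a": 1, "b": 2}, {"b": 2, "c": 3}]
print(compute_common(entries))  # → {'b': 2}
```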
compute_common() NaN Handling
CBOR Float(NaN) values with identical bit patterns are treated as equal by cbor_values_equal(), using f64::to_bits() comparison. This means NaN values are classified as common when all entries share the same NaN bit pattern. Standard CBOR equality (PartialEq) would fail because NaN != NaN.
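The bit-pattern comparison can be reproduced in Python with the stdlib struct module (a stand-in for f64::to_bits()):

```python
import struct

def f64_bits(x):
    """Reinterpret a float's 8 IEEE-754 bytes as an unsigned 64-bit integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

nan = float("nan")
print(nan == nan)                      # → False (IEEE NaN semantics)
print(f64_bits(nan) == f64_bits(nan))  # → True (same bit pattern)
```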
compute_common() CBOR Map Ordering
cbor_values_equal() compares CBOR maps positionally (entry-by-entry). Two maps with the same keys and values in different order are NOT equal. This is correct because canonical CBOR encoding ensures all maps are always sorted — different-order maps can only arise from non-canonical input.
Shape Product Overflow
All shape-product computations use checked_mul to detect overflow. This applies to encode(), decode(), ObjectIter::next(), and decode_range(). If the product overflows u64, a TensogramError::Metadata("shape product overflow") is returned. No silent wraparound.
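A Python sketch of the checked product (Python integers are unbounded, so the u64 limit is tested explicitly; the function name is illustrative):

```python
U64_MAX = 2**64 - 1

def checked_shape_product(shape):
    """Product of the dimensions, erroring instead of wrapping past u64."""
    product = 1
    for dim in shape:
        product *= dim
        if product > U64_MAX:
            raise ValueError("shape product overflow")
    return product

print(checked_shape_product([10, 20, 30]))  # → 6000
```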
_extra_ Scope Independence
_extra_ is message-level, while base[i] entries are per-object. Keys with the same name can exist in both:
meta.base[0].insert("mars".into(), ...); // per-object
meta.extra.insert("mars".into(), ...);   // message-level
// Both preserved after encode/decode round-trip
Empty _extra_ in CBOR
An empty _extra_ map is omitted from CBOR output via skip_serializing_if = "BTreeMap::is_empty". On decode, a missing _extra_ key is deserialized as an empty BTreeMap. Round-trips correctly.
Deeply Nested _reserved_ in base Entries
Only the top-level _reserved_ key in base[i] is rejected by the encoder. Deeply nested _reserved_ keys (like {"foo": {"_reserved_": ...}}) are allowed and preserved. The encoder only checks entry.contains_key("_reserved_").
CLI set on Zero-Object Messages
When tensogram set modifies a zero-object message, keys that would normally go into base are redirected to _extra_ instead (since base entries must align 1:1 with data objects, and there are none).
Error Handling Reference
This section documents all error types, how they propagate across languages, and what messages users can expect.
TensogramError Variants (Rust)
The core library defines seven error variants in TensogramError:
| Variant | When it occurs | Example message |
|---|---|---|
| Framing(String) | Invalid wire format — magic bytes, postamble, frame ordering | "buffer too short (12 bytes, need >= 24)" |
| Metadata(String) | Metadata validation failures — version, base count, CBOR parse | "metadata base has 3 entries but only 2 descriptors provided" |
| Encoding(String) | Encoding pipeline errors — simple_packing NaN, bit-width | "NaN value at index 42" |
| Compression(String) | Compression/decompression failures — codec errors, range access | "RangeNotSupported: zstd does not support partial decode" |
| Object(String) | Per-object errors — index out of range, shape overflow | "object index 99 out of range (num_objects=2)" |
| Io(io::Error) | File system errors — open, read, write, seek | "data.tgm: No such file or directory" |
| HashMismatch { expected, actual } | Integrity check failure | "hash mismatch: expected=abc123, actual=def456" |
Python Exception Mapping
The Python bindings convert TensogramError to Python exceptions:
| Rust variant | Python exception | Prefix in message |
|---|---|---|
| Framing | ValueError | FramingError: |
| Metadata | ValueError | MetadataError: |
| Encoding | ValueError | EncodingError: |
| Compression | ValueError | CompressionError: |
| Object | ValueError | ObjectError: |
| Io | IOError | (raw io message) |
| HashMismatch | RuntimeError | HashMismatch: |
Additional Python-side exceptions:
| Function | Exception | Condition |
|---|---|---|
| encode() | ValueError | Missing version key, _reserved_ in dict, unknown dtype |
| decode() | ValueError | Corrupted buffer, invalid CBOR |
| Metadata.__getitem__() | KeyError | Key not found in base or extra |
| Metadata.__getitem__("_reserved_") | KeyError | _reserved_ is always hidden from dict access |
| TensogramFile.__getitem__() | IndexError | Message index out of range |
| TensogramFile.__getitem__() | TypeError | Non-integer, non-slice index |
| compute_packing_params() | ValueError | NaN in input array |
| encode(hash="sha256") | ValueError | "unknown hash: sha256" |
Example: handling errors in Python:
import tensogram
# File not found
try:
with tensogram.TensogramFile.open("missing.tgm") as f:
pass
except IOError as e:
print(f"File error: {e}")
# → "File error: file not found: missing.tgm"
# Corrupted buffer
try:
tensogram.decode(b"garbage")
except ValueError as e:
print(f"Decode error: {e}")
# → "Decode error: FramingError: buffer too short ..."
# Hash verification failure
try:
meta, objects = tensogram.decode(buf, verify_hash=True)
except RuntimeError as e:
print(f"Integrity error: {e}")
# → "Integrity error: HashMismatch: expected=..., actual=..."
# Missing metadata key
meta, objects = tensogram.decode(buf)
try:
val = meta["nonexistent"]
except KeyError:
print("Key not found")
# Index out of range
with tensogram.TensogramFile.open("data.tgm") as f:
try:
msg = f[999]
except IndexError as e:
print(f"Index error: {e}")
# → "message index 999 out of range for file with 2 messages"
CLI Error Handling
All CLI commands:
- Print errors to stderr with an error: prefix
- Show the full error chain (nested causes)
- Exit with code 1 on any error
- Exit with code 0 on success
Common CLI error scenarios:
# File not found
$ tensogram ls nonexistent.tgm
error: file not found: nonexistent.tgm
# Invalid where clause
$ tensogram ls -w "bad-clause" data.tgm
error: invalid where clause: invalid where-clause: bad-clause (expected key=value or key!=value)
# Missing key in strict get
$ tensogram get -p "nonexistent" data.tgm
error: key not found: nonexistent
# Protected namespace
$ tensogram set -s "_reserved_.tensor.ndim=5" input.tgm output.tgm
error: cannot modify '_reserved_' — this namespace is managed by the library
# Immutable descriptor key
$ tensogram set -s "shape=broken" input.tgm output.tgm
error: cannot modify immutable key: shape
# Merge conflict with error strategy
$ tensogram merge --strategy error a.tgm b.tgm -o merged.tgm
error: conflicting values for key 'param' (use --strategy first or last to resolve)
# Invalid merge strategy
$ tensogram merge --strategy unknown a.tgm b.tgm -o merged.tgm
error: unknown merge strategy 'unknown': expected first, last, or error
# Corrupt file
$ tensogram dump corrupt.tgm
error: framing error: buffer too short ...
xarray Backend Error Handling
| Scenario | Behaviour |
|---|---|
| File not found | IOError from tensogram.TensogramFile.open() |
| Corrupt file | ValueError from tensogram.decode_descriptors() |
| message_index out of range | ValueError from TensogramFile.read_message() |
| message_index < 0 | ValueError("message_index must be >= 0, got -1") |
| meta.base shorter than objects | Warning logged; missing entries treated as empty dicts |
| Unsupported dtype | TypeError("unsupported tensogram dtype ...") |
| dim_names count mismatch | ValueError("dim_names has N entries but tensor has M dimensions") |
| decode_range failure | Warning logged; falls back to full decode_object() |
| File with zero messages + merge_objects=True | Returns empty xr.Dataset() |
Zarr Store Error Handling
| Scenario | Behaviour |
|---|---|
| File not found | OSError("failed to open TGM file ...") wrapping the original error |
| Corrupt message | ValueError("failed to decode message ...") wrapping the original error |
| Failed object decode | ValueError("failed to decode object N ...") wrapping the original error |
| message_index out of range | IndexError("message_index N out of range (file has M message(s))") |
| message_index < 0 | ValueError("message_index must be >= 0, got -1") |
| Invalid mode | ValueError("invalid mode 'x'; expected 'r', 'w', or 'a'") |
| Empty path | ValueError("path must be a non-empty string, got ''") |
| Store already open | ValueError("store is already open") |
| Write to read-only store | Raises from Zarr base class |
| Flush failure during exception | Warning logged; original exception preserved |
| Unsupported dtype on write | ValueError("unsupported dtype for variable ...") |
| Chunk size mismatch on write | ValueError("chunk data for 'var': expected N bytes ... got M") |
| Multiple chunks per variable | ValueError("variable 'var' has N chunk keys; TensogramStore only supports single-chunk arrays") |
| Unsupported ByteRequest type | TypeError("unsupported ByteRequest type: ...") |
| Zero messages in file | Root group zarr.json with empty attributes; no arrays |
IO Error Path Context
All file I/O errors include the file path in the error message. This applies to:
- TensogramFile::open() — "file not found: /path/to/file.tgm"
- TensogramFile::create() — "cannot create /path/to/file.tgm: Permission denied"
- Internal re-opens (scan, read, append) — "/path/to/file.tgm: No such file or directory"
This ensures that when errors propagate through multiple layers (e.g. Rust → Python → xarray), the original file path is always visible in the error message.
Internals
This page explains implementation decisions that are not obvious from the public API. Useful if you’re contributing to the library or implementing a compatible reader in another language.
Deterministic CBOR Canonicalization
The library encodes all CBOR structures (global metadata, data object descriptors, index frames, hash frames) using a three-step process:
1. Serialize the struct to a ciborium::Value tree using serde.
2. Recursively sort all map keys by their CBOR byte encoding.
3. Write the sorted Value tree to bytes.
Standard serde serialization into ciborium does not guarantee key order (it depends on the HashMap/BTreeMap iteration order of the struct). Even though the library uses BTreeMap throughout (which gives alphabetical iteration order for string keys), relying on that would be fragile. The explicit canonicalization step ensures the output matches RFC 8949 §4.2 regardless of how the keys were stored.
GlobalMetadata / DataObjectDescriptor struct
↓ serde serialization
ciborium::Value::Map (arbitrary key order)
↓ canonicalize() — sort all maps recursively by CBOR-encoded key bytes
ciborium::Value::Map (canonical order)
↓ write to bytes
CBOR bytes (deterministic)
Note: canonicalize() returns Result<()> and propagates errors rather than panicking.
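As a rough illustration, the recursive sort can be sketched over nested dicts. For short UTF-8 text keys, sorting by encoded length and then bytes approximates the CBOR-encoded-key order described above; the real implementation sorts by the full CBOR encoding of each key:

```python
def canonicalize(value):
    """Recursively sort dict keys by (encoded length, bytes), a simplified
    stand-in for RFC 8949 deterministic order on short text keys."""
    if isinstance(value, dict):
        items = sorted(
            value.items(),
            key=lambda kv: (len(kv[0].encode()), kv[0].encode()),
        )
        # dicts preserve insertion order, so the result iterates canonically
        return {k: canonicalize(v) for k, v in items}
    if isinstance(value, list):
        return [canonicalize(v) for v in value]
    return value

print(list(canonicalize({"bb": 1, "a": {"zz": 2, "y": 3}}).keys()))  # → ['a', 'bb']
```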
BTreeMap Throughout
The extra (serialized as _extra_), reserved (serialized as _reserved_), and base entry fields in GlobalMetadata, as well as the params field in DataObjectDescriptor, are BTreeMap<String, ciborium::Value>. This:
- Gives alphabetical iteration order for string keys (which matches CBOR canonical order for short strings).
- Avoids the non-determinism of HashMap.
- Makes it easy to read and write keys without worrying about order.
Frame-Based Wire Format (v2)
The v2 wire format uses a frame-based structure instead of the v1 monolithic binary header.
Preamble (24 bytes)
MAGIC "TENSOGRM" (8) + version u16 (2) + flags u16 (2) + reserved u32 (4) + total_length u64 (8)
The preamble flags indicate which optional frames are present (header/footer metadata, index, hashes). total_length = 0 signals streaming mode.
Frame Header (16 bytes)
Every frame (metadata, index, hash, data object) starts with:
"FR" (2) + frame_type u16 (2) + version u16 (2) + flags u16 (2) + total_length u64 (8)
And ends with "ENDF" (4 bytes). Frame versions are independent of message version.
Data Object Frame Layout
Each data object is a self-contained frame:
Frame header (16B) + [CBOR descriptor] + payload bytes + [CBOR descriptor] + cbor_offset u64 (8B) + "ENDF" (4B)
The cbor_offset is the byte offset from the frame start to the CBOR descriptor. A flag bit controls whether the CBOR descriptor appears before or after the payload (default: after, since encoding parameters like hash are only known after encoding completes).
Postamble (16 bytes)
first_footer_offset u64 (8) + END_MAGIC "39277777" (8)
first_footer_offset is never zero. It points to the first footer frame, or to the postamble itself when no footer frames are present.
Two-Pass Index Construction
When encoding a non-streaming message, the index frame contains byte offsets of each data object. But the index frame’s own size affects those offsets (circular dependency). The encoder solves this with a two-pass approach:
- First pass: compute index CBOR with placeholder offsets to determine the index frame size.
- Second pass: compute final offsets using the known index frame size, re-encode the index CBOR.
If the re-encoded CBOR changes size (edge case), the encoder returns an error rather than silently producing incorrect offsets.
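The two passes can be sketched as follows, using fixed-width u64 offsets in place of CBOR (real CBOR integers vary in size, which is exactly why the pass-2 size check exists; all names here are illustrative):

```python
import struct

def encode_index(offsets):
    """Fixed-width u64 offsets; real code emits CBOR, whose integer
    encoding is size-dependent."""
    return b"".join(struct.pack("<Q", o) for o in offsets)

def build_index(object_sizes, header_len):
    # Pass 1: placeholder offsets, just to measure the index frame's size.
    index_len = len(encode_index([0] * len(object_sizes)))
    # Pass 2: real offsets, computed with the known index frame size.
    offsets, pos = [], header_len + index_len
    for size in object_sizes:
        offsets.append(pos)
        pos += size
    final = encode_index(offsets)
    if len(final) != index_len:
        # The documented edge case: error out rather than emit bad offsets.
        raise ValueError("index frame size changed between passes")
    return offsets, final

offsets, _ = build_index([100, 200], header_len=24)
print(offsets)  # → [40, 140]
```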
Encoder Structure
The encode_message() function delegates to five focused helpers:
- build_hash_frame_cbor() — collects hashes from objects and serializes the HashFrame
- build_index_frame() — runs the two-pass index construction described above
- compute_object_offsets() — calculates byte offsets with 8-byte alignment
- compute_message_flags() — sets preamble flags from optional frame presence
- assemble_message() — writes preamble, frames, and postamble into the final buffer
simple_packing Bit Layout
Values are packed MSB-first (most significant bit first), following the same bit layout as the GRIB 2 simple_packing specification so that quantised payloads are interoperable with existing GRIB tooling:
Element 0: bits [0 .. B-1]
Element 1: bits [B .. 2B-1]
Element 2: bits [2B .. 3B-1]
...
The last byte is zero-padded on the right if N × B is not a multiple of 8.
The decode formula is:
V[i] = R + (packed[i] × 2^E) / 10^D
Where:
- R = reference_value (minimum of original data)
- E = binary_scale_factor
- D = decimal_scale_factor
- packed[i] = the integer read from the packed bits
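A Python sketch of the MSB-first packing and the decode formula above (illustrative helper names, not the library's codec):

```python
def pack_msb_first(values, bits):
    """Pack unsigned ints MSB-first, zero-padding the last byte on the right."""
    out, acc, nbits = bytearray(), 0, 0
    for v in values:
        acc = (acc << bits) | v
        nbits += bits
        while nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
    if nbits:
        out.append((acc << (8 - nbits)) & 0xFF)  # right-pad with zeros
    return bytes(out)

def decode_value(packed, R, E, D):
    """V[i] = R + (packed[i] * 2**E) / 10**D, per the formula above."""
    return R + (packed * 2**E) / 10**D

# Three 3-bit values occupy 9 bits, so two bytes with 7 padding bits.
print(pack_msb_first([0b101, 0b011, 0b110], bits=3).hex())  # → 'af00'
print(decode_value(1000, R=250.0, E=-2, D=1))               # → 275.0
```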
Lazy File Scanning
TensogramFile::open() does not read the file. The first call that needs the message list (e.g. message_count(), read_message()) triggers a streaming scan using scan_file(). The scanner reads only preamble-sized chunks and seeks forward, so it never loads the entire file into memory. After that, the list of (offset, length) pairs is cached in memory for the lifetime of the TensogramFile object.
// No I/O here
let mut file = TensogramFile::open("huge.tgm")?;
// Streaming scan happens here (once) — reads preamble chunks, seeks forward
let count = file.message_count()?;
// O(1) seek + read
let msg = file.read_message(999)?;
Error Hierarchy
TensogramError
├── Framing — invalid magic, truncated preamble, bad frame markers, missing postamble
├── Metadata — CBOR serialization/deserialization failure
├── Encoding — invalid encoding params, NaN in simple_packing
├── Compression — compressor error (szip, zstd, lz4, blosc2, zfp, sz3)
├── Object — index out of range
├── Io — filesystem errors (wraps std::io::Error)
└── HashMismatch { expected, actual } — payload integrity failure
All public functions return Result<T> where the error is TensogramError. The Io variant wraps std::io::Error via the From impl, so ? on any std::io::Result produces a TensogramError::Io automatically.
Memory-Mapped I/O (mmap feature)
The mmap feature gate enables memory-mapped file access via memmap2. When you open a file with TensogramFile::open_mmap(), the file is mapped into virtual memory and the existing scan() function runs directly on the mapped buffer. Subsequent read_message() calls return copies from the mapped region without additional seeks.
// Requires: cargo build --features mmap
let mut file = TensogramFile::open_mmap("huge.tgm")?;
let count = file.message_count()?; // already scanned during open_mmap
let msg = file.read_message(42)?; // copies from mmap, no seek
The regular open() path still works without the feature and uses streaming seek-based scanning.
Async I/O (async feature)
The async feature gate adds tokio-based async variants: open_async(), read_message_async(), and decode_message_async(). All CPU-intensive work (scanning, decoding, FFI calls to libaec/zfp/blosc2) runs via spawn_blocking to avoid blocking the async runtime.
// Requires: cargo build --features async
let mut file = TensogramFile::open_async("forecast.tgm").await?;
let (meta, objects) = file.decode_message_async(0, &opts).await?;
Frame Ordering Validation
The decoder enforces that frames appear in the expected order within a message: header frames first, then data object frames, then footer frames. A DecodePhase state machine tracks the current phase and returns TensogramError::Framing if a frame type appears out of order.
This catches malformed messages where, for example, a header metadata frame appears after a data object frame.
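The phase check can be sketched as a small state machine (a Python stand-in with illustrative frame-type names):

```python
from enum import IntEnum

class Phase(IntEnum):
    HEADER = 0
    DATA = 1
    FOOTER = 2

# Which phase each frame type belongs to (illustrative mapping).
FRAME_PHASE = {
    "header_meta": Phase.HEADER,
    "data_object": Phase.DATA,
    "footer_meta": Phase.FOOTER,
}

def validate_order(frame_types):
    """Raise if any frame belongs to an earlier phase than one already seen."""
    phase = Phase.HEADER
    for ft in frame_types:
        p = FRAME_PHASE[ft]
        if p < phase:
            raise ValueError(f"framing error: {ft} frame after {phase.name} phase")
        phase = p

validate_order(["header_meta", "data_object", "footer_meta"])  # ok
# validate_order(["data_object", "header_meta"])  # would raise
```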
Canonical CBOR Verification
The library provides verify_canonical_cbor() to check that a CBOR byte slice is in RFC 8949 §4.2.1 canonical form. This is used internally by tests to verify that all CBOR output (metadata, descriptors, index frames, hash frames) is deterministic. It can also be used by external tools that need to validate Tensogram CBOR output against the spec.