Introduction
Tensogram is a binary message format for N-dimensional scientific tensors — the kind of data that appears in weather and climate forecasting, Earth observation, medical and microscopy imaging, genomics, particle physics, materials simulation, and machine-learning pipelines. It carries its own metadata, supports arbitrary tensor dimensions, and is fast to encode and decode.
What Tensogram gives you
- Self-describing messages. Every message carries the metadata needed to decode it — shape, dtype, encoding pipeline, application annotations — using CBOR. No external schema required.
- Any number of dimensions. A single message can carry multiple tensors, each with its own shape, dtype, and encoding. A 3-D spectrum, a 2-D field, and a 4-D ensemble tensor can coexist in one message.
- Vocabulary-agnostic. The library never interprets metadata keys. Application layers (MARS at ECMWF, CF in climate, BIDS in neuroimaging, your in-house taxonomy) own key names.
- Transport and file in one format. The same bytes that traverse a socket can be appended to a .tgm file; both support O(1) random access to any object.
- Interop with existing formats. Importers for GRIB and NetCDF let you bring existing data into Tensogram pipelines without a lossy re-modelling step.
- Partial range decode. Extract sub-tensor slices without decoding the whole object — useful for remote data at scale.
Tensogram is developed and maintained by ECMWF and is used in operational weather-forecasting workloads, but nothing in the format is weather-specific. The design targets the N-tensor-at-scale problem common to many scientific domains.
Crate Layout
Four primary Rust crates make up the default workspace build:
```
tensogram/
├── rust/
│   ├── tensogram            ← encode, decode, framing, file API,
│   │                          validation, remote object store
│   ├── tensogram-encodings  ← simple_packing, shuffle, compression
│   ├── tensogram-cli        ← `tensogram` command-line tool
│   └── tensogram-ffi        ← C FFI layer for C/C++ callers
├── python/
│   └── bindings/            ← Python bindings (PyO3 / maturin)
├── cpp/
│   └── include/             ← C++ wrapper header + C header
```
On top of those, the repository ships several opt-in crates — the
tensogram-grib / tensogram-netcdf importers (exposed as the
convert-grib / convert-netcdf CLI subcommands), the tensogram-wasm
WebAssembly bindings, and the pure-Rust tensogram-szip /
tensogram-sz3 / tensogram-sz3-sys compression crates — together
with the separate Python packages tensogram-xarray (xarray backend)
and tensogram-zarr (Zarr v3 store backend), and a tensogram-benchmarks
crate. See plans/ARCHITECTURE.md
for the full crate list and build recipes.
Most users interact with tensogram and the CLI. The encodings
crate is used internally by the core but is also importable directly
if you need to call the encoding functions outside of a full message.
Installation
Rust:

```sh
cargo add tensogram
```

Python:

```sh
pip install tensogram        # core
pip install tensogram[all]   # with xarray + zarr backends
```

CLI:

```sh
cargo install tensogram-cli
```
See the Quick Start for feature flags, optional dependencies, and detailed setup.
Quick Example
```rust
#![allow(unused)]
fn main() {
use std::collections::BTreeMap;
use tensogram::{
    encode, decode, GlobalMetadata, DataObjectDescriptor,
    ByteOrder, Dtype, EncodeOptions, DecodeOptions,
};

// Describe what you're storing: a 100×200 grid of f32 values
let desc = DataObjectDescriptor {
    obj_type: "ntensor".to_string(),
    ndim: 2,
    shape: vec![100, 200],
    strides: vec![200, 1],
    dtype: Dtype::Float32,
    byte_order: ByteOrder::Big,
    encoding: "none".to_string(),
    filter: "none".to_string(),
    compression: "none".to_string(),
    params: BTreeMap::new(),
    hash: None,
};

let global_meta = GlobalMetadata {
    version: 2,
    ..Default::default()
};

// Your raw bytes (100 × 200 × 4 bytes = 80,000 bytes)
let data = vec![0u8; 100 * 200 * 4];

// Encode into a self-contained message
let message = encode(&global_meta, &[(&desc, &data)], &EncodeOptions::default()).unwrap();

// Decode it back
let (meta, objects) = decode(&message, &DecodeOptions::default()).unwrap();
assert_eq!(objects[0].0.shape, vec![100, 200]);
assert_eq!(objects[0].1, data);
}
```
The message bytes can be written to a file, sent over a socket, or stored in a database. The receiver does not need any external schema — everything is self-describing.
What is a Message?
A Tensogram message is a single, self-contained binary blob. It carries:
- A Preamble – fixed-size header with magic bytes, version, flags, and total length
- Optional header frames – metadata, index, and hash frames for fast random access
- One or more data object frames – each containing a CBOR descriptor and the actual tensor bytes
- Optional footer frames – metadata, index, and hash frames (used in streaming mode)
- A Postamble – footer offset and terminator magic
Every message begins with the ASCII string TENSOGRM and ends with 39277777. This makes it trivial to find message boundaries even in a file containing hundreds of concatenated messages.
Structure at a Glance
```mermaid
block-beta
  columns 1
  A["PREAMBLE (24 bytes)\nTENSOGRM · version · flags · total_length"]
  B["Header Metadata Frame (optional)\nCBOR GlobalMetadata"]
  C["Header Index Frame (optional)\nobject count + offsets"]
  D["Header Hash Frame (optional)\nobject count + hash type + hashes"]
  E["Data Object Frame 0\nCBOR descriptor + payload bytes"]
  F["Data Object Frame 1 (if present)\nCBOR descriptor + payload bytes"]
  G["... (more data object frames)"]
  H["Footer Hash / Index / Metadata Frames (optional)"]
  I["POSTAMBLE (16 bytes)\nfirst_footer_offset · 39277777"]
```
Frame-Based Design
The v2 wire format is entirely frame-based. Every piece of data between the Preamble and Postamble is wrapped in a frame. Each frame starts with a 4-byte marker (FR + a uint16 frame type), a version, flags, and a length field. This uniform structure means a decoder can skip any frame it does not understand by jumping over its declared length.
Frame types:
| Type ID | Name | Location |
|---|---|---|
| 1 | Header Metadata Frame | Header |
| 2 | Header Index Frame | Header |
| 3 | Header Hash Frame | Header |
| 4 | Data Object Frame | Body |
| 5 | Footer Hash Frame | Footer |
| 6 | Footer Index Frame | Footer |
| 7 | Footer Metadata Frame | Footer |
| 8 | Preceder Metadata Frame | Body (before a Data Object) |
Padding between frames is allowed (from ENDF to the next FR marker) for 64-bit memory alignment.
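The skip-unknown-frames property is easy to see in code. The sketch below assumes the 16-byte header layout described on the wire-format page (2-byte ASCII `FR` marker, uint16 type, uint16 version, uint16 flags, uint64 length, all big-endian) and assumes the length is counted from the frame's first byte; it is an illustration, not a reference decoder:

```rust
/// Minimal frame-walking sketch. ASSUMED layout: [b"FR"][type u16]
/// [version u16][flags u16][length u64], all big-endian, with `length`
/// counted from the frame's first byte. Returns (type, length) pairs.
fn walk_frames(buf: &[u8]) -> Vec<(u16, u64)> {
    let mut frames = Vec::new();
    let mut pos = 0usize;
    while pos + 16 <= buf.len() {
        // Skip any inter-frame padding until the next "FR" marker.
        if &buf[pos..pos + 2] != b"FR" {
            pos += 1;
            continue;
        }
        let ftype = u16::from_be_bytes([buf[pos + 2], buf[pos + 3]]);
        let len = u64::from_be_bytes(buf[pos + 8..pos + 16].try_into().unwrap());
        if len < 16 {
            break; // malformed: a frame is at least its own header
        }
        frames.push((ftype, len));
        // A decoder that does not understand `ftype` simply jumps over it.
        pos += len as usize;
    }
    frames
}
```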
Why Header Frames?
When a message is encoded in a single buffer (the common case), the index and hash frames are placed in the header, right after the Preamble. A decoder reads the Preamble, then the metadata frame, then the index frame, and can immediately seek to any data object by offset. That is O(1) random access, which matters when a message carries many large tensors.
Streaming Support
When encoding in streaming mode, the producer may not know in advance how many data objects the message will contain. In this case:
- `total_length` in the Preamble is set to 0 (unknown)
- Index and hash frames are written in the footer instead of the header
- The Postamble’s `first_footer_offset` field points back to where the footer frames begin
A decoder reading a streamed message seeks to the end, reads the Postamble, then jumps to the footer frames to find the index. Both paths (header index and footer index) give O(1) access to any object.
Data Object Frames
Each data object is self-contained in its own frame. The frame carries:
- A CBOR descriptor (`DataObjectDescriptor`) describing the tensor shape, dtype, encoding pipeline, and optional hash
- The binary payload (the actual encoded tensor bytes)
The CBOR descriptor can appear before or after the payload within the frame. By default it is placed after the payload, since some encoding parameters (like hash values) are only known after the payload has been written. A flag in the frame header indicates the position.
Messages vs Files
A .tgm file is just a sequence of messages written one after another:
```
[message 1][message 2][message 3]...
```
There is no file-level index or header. The TensogramFile API scans the file once (lazily, on first access) and builds an in-memory list of (offset, length) pairs for each message. After that, reading any message is a seek + read – no scan needed.
To find message boundaries in a file:
- Scan for `TENSOGRM` magic (8 bytes)
- If `total_length` is non-zero, use it to advance to the next message
- Otherwise, walk frames using their length fields until the next magic or EOF
Self-Description
Every message carries all the information needed to decode it:
- The dtype of every object (float32, int16, etc.)
- The shape and strides (dimensions and memory layout)
- The full encoding pipeline applied to the payload (encoding, filter, compression)
- The byte order of each object’s data
- Any application-level metadata (MARS keys, units, timestamps, etc.)
This means a decoder never needs an external schema. You can receive a Tensogram message on a new machine, years after it was encoded, and decode it correctly.
Edge Case: Zero-Object Messages
A message with no data object frames is valid. It contains only the Preamble, a metadata frame, and the Postamble. This is useful for sending pure metadata (e.g. a control message or an acknowledgement with provenance information) without any tensor payload.
```rust
#![allow(unused)]
fn main() {
let metadata = GlobalMetadata {
    version: 2,
    ..Default::default()
};
let msg = encode(&metadata, &[], &EncodeOptions::default()).unwrap();
}
```
Metadata
Metadata in Tensogram is stored as CBOR – Concise Binary Object Representation (RFC 8949). Think of it as a compact, binary version of JSON. It supports the same types (strings, integers, floats, booleans, arrays, maps), but is smaller and faster to parse.
Metadata Locations
In v2, metadata lives in two distinct places:
| Level | Where it lives | What it contains |
|---|---|---|
| Global | Header or footer metadata frame | GlobalMetadata: version + base (per-object metadata array) + _reserved_ (library internals) + _extra_ (client annotations) |
| Per-object | Each data object frame’s CBOR descriptor | DataObjectDescriptor: tensor shape, encoding pipeline, hash, plus params for encoding parameters |
Each data object carries its own descriptor inline within its frame.
GlobalMetadata
The global metadata frame contains a GlobalMetadata struct with three named sections:
```rust
#![allow(unused)]
fn main() {
GlobalMetadata {
    version: 2,
    base: Vec::new(),          // one BTreeMap per data object (independent entries)
    reserved: BTreeMap::new(), // library internals (_reserved_ in CBOR)
    extra: BTreeMap::new(),    // client-writable catch-all (_extra_ in CBOR)
}
}
```
In CBOR, this looks like (using ECMWF MARS keys as one concrete example vocabulary):
```json
{
  "version": 2,
  "base": [
    {
      "mars": {
        "class": "od", "type": "fc",
        "date": "20260401", "time": "1200", "param": "2t"
      }
    }
  ],
  "_extra_": {
    "source": "ifs-cycle49r2"
  }
}
```
The same mechanism works for any application vocabulary. A neuroimaging pipeline might use a BIDS namespace:
```json
{
  "version": 2,
  "base": [{
    "bids": { "subject": "sub-01", "session": "ses-01",
              "task": "rest", "run": 1 }
  }]
}
```
A materials-simulation pipeline might use a custom namespace:
```json
{
  "version": 2,
  "base": [{
    "material": { "composition": "Fe3O4", "lattice": "cubic", "T_K": 300.0 }
  }]
}
```
The library does not know or care which vocabulary is used — it simply stores, serialises, and returns the keys you supply.
The version field is required (u16). The base array holds per-object metadata. _extra_ is a free-form catch-all – you can add any key using any CBOR value type. The library does not interpret or validate these keys. Your application layer assigns meaning.
Per-Object Metadata in base
The base section is a CBOR array of maps — one entry per data object. Each entry holds ALL structured metadata for that object independently. Entries are self-contained — there is no tracking of which keys are common across objects.
The encoder auto-populates _reserved_.tensor (with ndim, shape, strides, dtype) in each entry when you call encode() or StreamingEncoder::finish(). Application keys are preserved:
```json
{
  "base": [
    {
      "mars": { "class": "od", "type": "fc", "param": "2t", "levtype": "sfc" },
      "_reserved_": {
        "tensor": { "ndim": 2, "shape": [721, 1440], "strides": [1440, 1], "dtype": "float64" }
      }
    },
    {
      "mars": { "class": "od", "type": "fc", "param": "10u", "levtype": "sfc" },
      "_reserved_": {
        "tensor": { "ndim": 2, "shape": [721, 1440], "strides": [1440, 1], "dtype": "float64" }
      }
    }
  ]
}
```
This lets readers discover the shape, type, and per-object metadata of every object by reading only the global metadata frame — without opening each data object frame.
No common/varying split: Every `base[i]` entry is self-contained. MARS keys shared across all objects (e.g. `class`, `type`) are simply repeated in each entry. If you need to extract commonalities (e.g. for display or merges), use the `compute_common()` utility after decoding.
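The common-key extraction idea behind `compute_common()` can be sketched over plain string maps. The real utility operates on CBOR values; this standalone version is illustrative only:

```rust
use std::collections::BTreeMap;

/// Sketch: keep only the (key, value) pairs that appear identically
/// in every per-object entry. Illustrative — not the library's API.
fn common_keys(entries: &[BTreeMap<String, String>]) -> BTreeMap<String, String> {
    let mut iter = entries.iter();
    let mut common = match iter.next() {
        Some(first) => first.clone(),
        None => return BTreeMap::new(),
    };
    for entry in iter {
        // Drop any pair that is missing or differs in this entry.
        common.retain(|k, v| entry.get(k) == Some(v));
    }
    common
}
```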
DataObjectDescriptor
The params field of each DataObjectDescriptor is a BTreeMap<String, ciborium::Value> for encoding parameters only (e.g. reference_value, bits_per_value). These are flattened into the CBOR descriptor alongside the fixed tensor fields.
For example, a data object’s CBOR descriptor might look like:
```json
{
  "type": "ntensor",
  "ndim": 2,
  "shape": [721, 1440],
  "strides": [1440, 1],
  "dtype": "float32",
  "byte_order": "big",
  "encoding": "simple_packing",
  "filter": "none",
  "compression": "szip",
  "reference_value": 230.5,
  "bits_per_value": 16,
  "hash": { "type": "xxh3", "value": "a1b2c3d4e5f6..." }
}
```
Here, reference_value and bits_per_value live in the params map. Application metadata such as MARS keys belongs in base[i]["mars"] in the global metadata.
Namespaced Keys
Convention: application-layer keys are grouped under a namespace key, so
that multiple vocabularies can coexist in the same message. For example, ECMWF’s
MARS vocabulary lives under "mars":
```json
{
  "version": 2,
  "base": [
    {
      "mars": {
        "class": "od", "type": "fc",
        "param": "2t", "date": "20260401", "step": 6
      }
    }
  ]
}
```
Other pipelines use other namespaces — "cf" for CF conventions, "bids" for
neuroimaging, "dicom" for medical imaging, or anything your application
defines. This convention applies at both levels — global metadata and
per-object params.
Filtering with the CLI
The -w flag on ls, dump, get, and copy uses dot-notation to filter
messages on any namespace. The examples below use the MARS vocabulary, but the
same syntax works with any application namespace (e.g. bids.subject,
dicom.Modality, product.name):
```sh
# Only messages where mars.param equals "2t" or "10u"
tensogram ls data.tgm -w "mars.param=2t/10u"

# Exclude messages where mars.class equals "od"
tensogram ls data.tgm -w "mars.class!=od"
```
The / character separates OR values. Key lookup searches base[i] entries first (skipping _reserved_, first match across entries), then _extra_ for backwards compatibility.
Preceder Metadata Frames
In streaming mode, per-object metadata is normally only available in the footer metadata frame (written after all objects). A Preceder Metadata Frame (frame type 8) allows producers to send per-object metadata before the data object, without waiting for the footer.
A preceder carries a GlobalMetadata CBOR with a single-entry base array for the next data object:
```json
{
  "version": 2,
  "base": [{"product": {"name": "temperature"}, "units": "K"}]
}
```
Merge rule: On decode, preceder keys override footer base[i] keys on conflict. Structural keys auto-populated by the encoder (in _reserved_.tensor: ndim, shape, strides, dtype) are preserved from the footer when absent from the preceder. The consumer sees a unified GlobalMetadata.base — the preceder/footer distinction is transparent.
Use StreamingEncoder::write_preceder() before write_object() to emit a preceder frame. Preceders are optional per-object: some objects may have them, others may not.
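The merge rule can be sketched as a plain map merge. Illustrative only — the real merge operates on the CBOR metadata maps and additionally preserves the `_reserved_.tensor` structural keys from the footer:

```rust
use std::collections::BTreeMap;

/// Sketch of the documented rule: preceder keys override footer keys
/// on conflict; footer keys absent from the preceder are kept.
fn merge_preceder(
    footer: &BTreeMap<String, String>,
    preceder: &BTreeMap<String, String>,
) -> BTreeMap<String, String> {
    let mut merged = footer.clone();
    for (k, v) in preceder {
        merged.insert(k.clone(), v.clone()); // preceder wins on conflict
    }
    merged
}
```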
Value Type Rules
Keys must be text strings. Values must be JSON-compatible CBOR types: string, integer, float, boolean, null, array, or map. Byte strings, CBOR tags, undefined, and half-precision floats are not allowed. See Metadata Value Types for the full rules and rationale.
Deterministic Encoding
When Tensogram encodes metadata to CBOR, it sorts all map keys by their CBOR byte representation (RFC 8949 Section 4.2 canonical form). This guarantees that the same metadata always produces the same bytes, regardless of the order you inserted keys in your application code. This matters for hashing and reproducibility.
Edge case: Nested maps are also sorted recursively. Even metadata stored inside a CBOR map value (like the `"mars"` namespace) gets canonical ordering.
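For short text keys the canonical order is easy to reproduce by hand: the encoded key is a length-carrying head byte followed by the UTF-8 bytes, so sorting encodings bytewise is equivalent to sorting by (length, content). A standalone sketch, assuming text-string keys only:

```rust
/// Sketch of RFC 8949 deterministic map-key ordering for text keys:
/// sort by the bytes of the encoded key, which for text strings works
/// out to length-first, then bytewise content comparison.
fn canonical_key_order(mut keys: Vec<&str>) -> Vec<&str> {
    keys.sort_by(|a, b| {
        a.len()
            .cmp(&b.len())
            .then_with(|| a.as_bytes().cmp(b.as_bytes()))
    });
    keys
}
```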
Objects and Dtypes
An object is one N-dimensional tensor inside a message. A message can carry multiple objects. In v2, each object is fully described by a single struct:
- A `DataObjectDescriptor` carrying tensor metadata, encoding pipeline, and integrity hash – all in one place
- The actual binary payload within the object’s frame
There is no separate “payload descriptor” array. The descriptor travels with the data inside the same frame.
DataObjectDescriptor
```rust
#![allow(unused)]
fn main() {
DataObjectDescriptor {
    // ── Tensor metadata ──
    obj_type: "ntensor".into(),        // always "ntensor" for now
    ndim: 2,                           // number of dimensions
    shape: vec![100, 200],             // size of each dimension
    strides: vec![200, 1],             // elements to skip per dimension step
    dtype: Dtype::Float32,             // element type

    // ── Encoding pipeline ──
    byte_order: ByteOrder::Big,        // big or little endian
    encoding: "simple_packing".into(), // or "none"
    filter: "shuffle".into(),          // or "none"
    compression: "szip".into(),        // or "none", "zstd", "lz4", etc.

    // ── Flexible parameters (encoding only) ──
    params: BTreeMap::from([           // BTreeMap<String, ciborium::Value>
        ("reference_value".into(), ciborium::Value::Float(230.5)),
        ("bits_per_value".into(), ciborium::Value::Integer(16.into())),
    ]),

    // ── Integrity ──
    hash: Some(HashDescriptor {
        hash_type: "xxh3".into(),
        value: "a1b2c3d4e5f6...".into(),
    }),
}
}
```
The params map is flattened into the CBOR alongside the fixed fields, so the on-wire CBOR is a single flat map. This keeps things simple for decoders – no nested “encoding” or “tensor” sub-objects to navigate.
Each data object has its own descriptor, so different objects in the same message can use different encodings, byte orders, and hash algorithms.
Strides
Strides tell you how to navigate the memory layout. For a C-contiguous (row-major) array of shape [100, 200]:
- Advancing along axis 0 (rows) skips 200 elements
- Advancing along axis 1 (columns) skips 1 element
So strides = [200, 1]. For a Fortran-contiguous (column-major) array the strides would be reversed: [1, 100].
To compute C-contiguous strides from shape:
```rust
#![allow(unused)]
fn main() {
fn compute_strides(shape: &[u64]) -> Vec<u64> {
    let mut strides = vec![1u64; shape.len()];
    // saturating_sub avoids an underflow panic for an empty shape
    for i in (0..shape.len().saturating_sub(1)).rev() {
        strides[i] = strides[i + 1] * shape[i + 1];
    }
    strides
}

// shape [100, 200] → strides [200, 1]
// shape [4, 5, 6]  → strides [30, 6, 1]
}
```
Supported Data Types
| Name | Size | Description |
|---|---|---|
| `float16` | 2 bytes | IEEE 754 half-precision float |
| `bfloat16` | 2 bytes | Brain float (truncated float32) |
| `float32` | 4 bytes | IEEE 754 single-precision float |
| `float64` | 8 bytes | IEEE 754 double-precision float |
| `complex64` | 8 bytes | Two float32 (real + imag) |
| `complex128` | 16 bytes | Two float64 (real + imag) |
| `int8` | 1 byte | Signed integer |
| `int16` | 2 bytes | Signed integer |
| `int32` | 4 bytes | Signed integer |
| `int64` | 8 bytes | Signed integer |
| `uint8` | 1 byte | Unsigned integer |
| `uint16` | 2 bytes | Unsigned integer |
| `uint32` | 4 bytes | Unsigned integer |
| `uint64` | 8 bytes | Unsigned integer |
| `bitmask` | < 1 byte | Packed bits (sub-byte; size depends on element count) |
Edge case: `bitmask` returns `0` from `byte_width()`. Callers that need the actual byte count must compute it from the element count: `(num_elements + 7) / 8`.
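A helper for that computation is one line, but worth pinning down the round-up behaviour:

```rust
/// Byte count for a bitmask payload: one bit per element, rounded up
/// to the next whole byte, per the formula (num_elements + 7) / 8.
fn bitmask_bytes(num_elements: u64) -> u64 {
    (num_elements + 7) / 8
}
```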
Multiple Objects in One Message
A message can carry several related tensors. Two concrete examples:
- A wave-spectrum message with the spectrum itself as a 3-tensor and a land/sea mask as a 2-tensor.
- A medical-imaging message with a 4-D time-series volume, a 3-D segmentation mask, and a 1-D array of acquisition timestamps.
```mermaid
block-beta
  columns 3
  A["Object 0\nSpectrum\nf32 · 721×1440×30\nencoding: simple_packing"]:2
  B["Object 1\nLand mask\nuint8 · 721×1440\nencoding: none"]:1
```
All objects live in the same message. Each object has its own
DataObjectDescriptor embedded in its frame and its own entry in
GlobalMetadata.base holding per-object application metadata. Different
objects can use completely different encoding pipelines.
Edge case: The number of `DataObjectDescriptor` entries and the data slices passed to `encode()` must be equal. The encoder returns an error if they do not match.
The Encoding Pipeline
Every object payload passes through a three-stage pipeline on the way in (encoding) and out (decoding). The stages always run in the same order:
```mermaid
flowchart TD
  subgraph Encode["Encode Path"]
    direction TB
    A["Raw bytes"]
    B["Stage 1 — Encoding
(lossy quantization)"]
    C["Stage 2 — Filter
(byte shuffle)"]
    D["Stage 3 — Compression
(szip / zstd / lz4 / blosc2 / zfp / sz3)"]
    A --> B --> C --> D
  end
  S[("Stored bytes")]
  subgraph Decode["Decode Path"]
    direction TB
    F["Stage 3 — Decompress"]
    G["Stage 2 — Unshuffle"]
    H["Stage 1 — Dequantize"]
    I["Raw bytes"]
    F --> G --> H --> I
  end
  D --> S --> F
  style A fill:#e8f5e9,stroke:#388e3c
  style S fill:#fff3e0,stroke:#f57c00,stroke-width:2px
  style I fill:#e8f5e9,stroke:#388e3c
  style Encode fill:#e3f2fd,stroke:#1565c0,color:#1565c0
  style Decode fill:#fce4ec,stroke:#c62828,color:#c62828
```
Each stage is independently configurable per object via fields in the DataObjectDescriptor. Set a stage to "none" to skip it. For callers with already-encoded payloads, a pipeline-bypass option exists via encode_pre_encoded (see Pre-encoded Payloads).
Stage 1: Encoding
Encoding transforms values to reduce the number of bits needed to represent them. The only supported encoding right now is simple_packing — a lossy quantisation that maps a bounded range of floating-point values onto N-bit integers. The bit layout matches GRIB 2 simple_packing so quantised payloads are interoperable with existing GRIB tooling.
| Value | Meaning |
|---|---|
| `"none"` | Pass through unchanged |
| `"simple_packing"` | Lossy quantization (see Simple Packing) |
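To convey the quantisation idea, here is a heavily simplified quantise/dequantise sketch: it maps floats in [min, max] onto n-bit integers relative to a reference value. The real GRIB 2 layout additionally involves binary and decimal scale factors and a packed bit stream, so this sketch is not wire-compatible:

```rust
/// Simplified quantisation sketch (NOT the GRIB 2 bit layout).
/// Returns (reference_value, scale, n-bit integer codes).
/// Assumes non-empty input and bits < 32.
fn quantize(values: &[f32], bits: u32) -> (f32, f32, Vec<u32>) {
    let min = values.iter().cloned().fold(f32::INFINITY, f32::min);
    let max = values.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let levels = (1u32 << bits) - 1;
    let scale = if max > min { (max - min) / levels as f32 } else { 1.0 };
    let packed = values.iter().map(|v| ((v - min) / scale).round() as u32).collect();
    (min, scale, packed)
}

/// Inverse: reconstruct approximate floats from the integer codes.
fn dequantize(reference: f32, scale: f32, packed: &[u32]) -> Vec<f32> {
    packed.iter().map(|&n| reference + n as f32 * scale).collect()
}
```

The round-trip is lossy: each value is reconstructed to within half a quantisation step (scale / 2).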
Stage 2: Filter
Filters rearrange bytes to improve compression ratios. The shuffle filter reorders bytes by their significance level (all most-significant bytes first, then all second-most-significant bytes, etc.), which makes float data much more compressible because nearby values have similar high bytes.
| Value | Meaning |
|---|---|
| `"none"` | Pass through unchanged |
| `"shuffle"` | Byte-level shuffle (see Byte Shuffle Filter) |
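The transform itself is straightforward: for k-byte elements, gather byte 0 of every element, then byte 1, and so on (for big-endian data, byte 0 is the most significant byte, matching the description above). A standalone sketch, assuming the buffer length is a multiple of the element size:

```rust
/// Byte-shuffle sketch: regroup bytes by position within the element.
fn shuffle(data: &[u8], elem_size: usize) -> Vec<u8> {
    let n = data.len() / elem_size;
    let mut out = vec![0u8; data.len()];
    for i in 0..n {
        for b in 0..elem_size {
            out[b * n + i] = data[i * elem_size + b];
        }
    }
    out
}

/// Inverse transform: scatter bytes back into element order.
fn unshuffle(data: &[u8], elem_size: usize) -> Vec<u8> {
    let n = data.len() / elem_size;
    let mut out = vec![0u8; data.len()];
    for i in 0..n {
        for b in 0..elem_size {
            out[i * elem_size + b] = data[b * n + i];
        }
    }
    out
}
```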
Stage 3: Compression
Compression reduces the total byte count. Seven compressors are implemented:
| Value | Type | Random Access | Notes |
|---|---|---|---|
| `"none"` | Pass-through | Yes | No compression |
| `"szip"` | Lossless | Yes | CCSDS 121.0-B-3 via libaec |
| `"zstd"` | Lossless | No | Excellent ratio/speed tradeoff |
| `"lz4"` | Lossless | No | Fastest decompression |
| `"blosc2"` | Lossless | Yes | Multi-codec, chunk-level access |
| `"zfp"` | Lossy | Yes (fixed-rate) | Floating-point arrays |
| `"sz3"` | Lossy | No | Error-bounded scientific data |
See Compression for full details on each compressor, including parameters and random access support.
Note: ZFP and SZ3 operate directly on typed floating-point data. Use them with `encoding: "none"` and `filter: "none"` – they replace both encoding and compression.
Typical Combinations
| Use case | encoding | filter | compression |
|---|---|---|---|
| Exact integers (e.g. a mask) | none | none | none |
| Lossy bounded-range floats | simple_packing | none | szip |
| Best lossless (floats) | none | shuffle | szip or blosc2 |
| GRIB 2 CCSDS-interoperable | simple_packing | none | szip |
| Real-time streaming | none | none | lz4 |
| Archival storage | none | shuffle | zstd |
| ML model weights | none | none | blosc2 |
| Lossy float w/ random access | none | none | zfp (fixed_rate) |
| Error-bounded science | none | none | sz3 |
How It Looks in Code
The entire pipeline is configured through the DataObjectDescriptor:
```rust
#![allow(unused)]
fn main() {
use ciborium::Value;

DataObjectDescriptor {
    obj_type: "ntensor".into(),
    ndim: 2,
    shape: vec![721, 1440],
    strides: vec![1440, 1],
    dtype: Dtype::Float32,
    byte_order: ByteOrder::Big,
    encoding: "simple_packing".into(),
    filter: "none".into(),
    compression: "szip".into(),
    params: BTreeMap::from([
        ("reference_value".into(), Value::Float(230.5)),
        ("bits_per_value".into(), Value::Integer(16.into())),
    ]),
    hash: None, // set automatically during encoding
}
}
```
}
All encoding parameters (reference_value, bits_per_value, szip_block_offsets, etc.) go into the params map. The encoder populates additional params during encoding (like block offsets for szip), and the decoder reads them back.
Integrity Hashing
After all three stages, the stored bytes can be hashed. The hash is stored in the DataObjectDescriptor’s hash field alongside the encoded bytes. On decode, if verify_hash: true is set, the hash is recomputed and compared.
| Algorithm | Hash length | Notes |
|---|---|---|
| `xxh3` | 16 hex chars (64-bit) | Default. Fast, non-cryptographic |
Edge case: The hash covers the stored bytes (after encoding + filter + compression), not the original raw bytes. This means a hash mismatch always indicates storage or transmission corruption, not a quantization difference from lossy encoding.
Wire Format (v3)
This page describes the exact byte layout of a Tensogram v3
message — the format shipped in 0.17.0. You need this if you are
implementing a reader in another language, debugging a corrupted
file, or just want to understand what is happening under the hood.
For the normative specification, see
plans/WIRE_FORMAT.md.
All integer fields are big-endian (network byte order).
Overview
A Tensogram message is built from three sections: a header (preamble + optional frames), one or more data object frames, and a footer (optional frames + postamble).
```
┌────────────────────────────────────────────────────────────────────┐
│ PREAMBLE                magic, version, flags, length      (24 B)  │
├────────────────────────────────────────────────────────────────────┤
│ HEADER METADATA FRAME   CBOR global metadata           (optional)  │
├────────────────────────────────────────────────────────────────────┤
│ HEADER INDEX FRAME      CBOR object offsets            (optional)  │
├────────────────────────────────────────────────────────────────────┤
│ HEADER HASH FRAME       CBOR object hashes             (optional)  │
├────────────────────────────────────────────────────────────────────┤
│ PRECEDER METADATA FRAME per-object metadata            (optional)  │
│ DATA OBJECT FRAME 0     header + payload + descriptor              │
│ PRECEDER METADATA FRAME per-object metadata            (optional)  │
│ DATA OBJECT FRAME 1     ...                                        │
│ DATA OBJECT FRAME 2     (no preceder)                              │
│ ...                     (any number of objects)                    │
├────────────────────────────────────────────────────────────────────┤
│ FOOTER HASH FRAME       CBOR object hashes             (optional)  │
├────────────────────────────────────────────────────────────────────┤
│ FOOTER INDEX FRAME      CBOR object offsets            (optional)  │
├────────────────────────────────────────────────────────────────────┤
│ FOOTER METADATA FRAME   CBOR global metadata           (optional)  │
├────────────────────────────────────────────────────────────────────┤
│ POSTAMBLE               first_footer_offset, total_length,         │
│                         end_magic                          (24 B)  │
└────────────────────────────────────────────────────────────────────┘
```
At least one metadata frame (header or footer) must be present — messages cannot exist without metadata. Index and hash frames are optional but highly encouraged. By default, the encoder places them in the header when writing to a buffer, or in the footer when streaming.
Frame ordering: The decoder enforces that frames appear in order: header frames, then data object frames, then footer frames. A header frame appearing after a data object frame, or a data object frame appearing after a footer frame, is rejected as malformed.
Preamble (24 bytes)
The preamble is the fixed-size start of every message.
```
Offset  Size   Field
──────  ────── ─────────────────────────────────
0       8      Magic: "TENSOGRM" (ASCII)
8       2      Version (uint16 BE) — must be 3 in v3
10      2      Flags (uint16 BE)
12      4      Reserved (uint32 BE) — set to zero
16      8      Total length (uint64 BE)
```
Total length is the byte count of the entire message from the first byte of the preamble to the last byte of the postamble. A value of zero means the encoder is in streaming mode — the total length was not known when the preamble was written.
Version compatibility. v3 decoders reject any preamble whose
version field is not exactly 3. Older v1/v2 messages must be
re-encoded.
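A reader for the preamble is a few lines; the field offsets below are taken directly from the table above (error handling and the reserved-field check are minimal):

```rust
/// Parse the fixed 24-byte preamble described above.
/// Returns (version, flags, total_length), or None if the buffer is
/// too short or the magic does not match.
fn parse_preamble(buf: &[u8]) -> Option<(u16, u16, u64)> {
    if buf.len() < 24 || &buf[0..8] != b"TENSOGRM" {
        return None;
    }
    let version = u16::from_be_bytes([buf[8], buf[9]]);
    let flags = u16::from_be_bytes([buf[10], buf[11]]);
    // bytes 12..16 are the reserved uint32 (must be zero on write)
    let total_length = u64::from_be_bytes(buf[16..24].try_into().ok()?);
    Some((version, flags, total_length))
}
```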
Preamble flags
The flags field is a bitmask indicating which optional frames are present and, new in v3, whether inline per-frame hash slots are populated:
| Bit | Flag | Meaning |
|---|---|---|
| 0 | HEADER_METADATA | A HeaderMetadata frame is present. |
| 1 | FOOTER_METADATA | A FooterMetadata frame is present. |
| 2 | HEADER_INDEX | A HeaderIndex frame is present. |
| 3 | FOOTER_INDEX | A FooterIndex frame is present. |
| 4 | HEADER_HASHES | A HeaderHash aggregate frame is present. |
| 5 | FOOTER_HASHES | A FooterHash aggregate frame is present. |
| 6 | PRECEDER_METADATA | At least one PrecederMetadata frame is present. |
| 7 | HASHES_PRESENT | Every frame’s inline hash slot is populated with a non-zero xxh3-64 digest (new in v3). |
Unused flag bits must be set to zero.
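Testing these bits is a simple mask check; the constants below mirror the bit positions in the table (a subset shown for illustration):

```rust
// Preamble flag bit positions, per the table above (illustrative subset).
const HEADER_METADATA: u16 = 1 << 0;
const HEADER_INDEX: u16 = 1 << 2;
const HASHES_PRESENT: u16 = 1 << 7;

/// True when the given flag bit is set in the preamble flags field.
fn has_flag(flags: u16, flag: u16) -> bool {
    flags & flag != 0
}
```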
Frames
Every frame (header, footer, and data object) shares a common
16-byte frame header and ends with a type-specific footer whose
last 12 bytes are always [hash u64][ENDF 4] (new in v3).
Frame header (16 bytes)
```
Offset  Size   Field
──────  ────── ─────────────────────────────────
0       2      Start marker: "FR" (ASCII)
2       2      Frame type (uint16 BE)
4       2      Frame version (uint16 BE)
6       2      Reserved flags (uint16 BE)
8       8      Frame length — offset to end of frame (uint64 BE)
```
Frame versions are independent from the message version and from each other.
Frame common footer (12 bytes)
Every frame ends with this fixed-size tail:
```
Offset (from frame end)  Size   Field
───────────────────────  ────── ─────────────────────────────────
-12                      8      hash (uint64 BE) — xxh3-64 digest of the
                                frame body, or 0x0000000000000000 when
                                HASHES_PRESENT = 0
-4                       4      End marker: "ENDF" (ASCII)
```
Data-object frames (type 9) have a larger 20-byte footer that
adds an 8-byte cbor_offset field before the common tail.
Frame types
| Type | Name | Contents |
|---|---|---|
| 1 | Header Metadata | CBOR global metadata map |
| 2 | Header Index | CBOR index of data object offsets |
| 3 | Header Hash | CBOR aggregate of per-object hashes |
| 4 | (reserved) | Occupied by the obsolete v2 NTensorFrame; any v3 decoder errors on read |
| 5 | Footer Hash | CBOR aggregate of per-object hashes |
| 6 | Footer Index | CBOR index of data object offsets |
| 7 | Footer Metadata | CBOR global metadata map |
| 8 | Preceder Metadata | Per-object CBOR metadata (see below) |
| 9 | NTensorFrame | Descriptor + payload + optional NaN / Inf bitmask companion sections (see NaN / Inf Handling) |
The body phase of a v3 message carries one or more
data-object frames. In v3 only NTensorFrame (type 9) is
defined; future types can slot in at fresh unused numbers without
bumping the wire version.
Padding between frames
It is valid to have padding bytes between a frame’s ENDF marker
and the next frame’s FR marker. This allows encoders to align
frame starts to 8-byte (64-bit) boundaries for memory-mapped
access.
Data Object Frames
A data object frame wraps one tensor’s payload together with its
CBOR descriptor. v3 defines exactly one concrete data-object
type, NTensorFrame (type 9). The descriptor can go either
before or after the payload — flag bit 0 in the frame
header controls this. The default is after, because when
encoding the descriptor is sometimes only fully known once the
payload has been written (e.g. after computing a hash or
determining compressed size).
NTensorFrame (type 9) — v3 canonical layout
┌──────────────────────────────────────────────────────────────┐
│ FRAME HEADER "FR" + type(9) + ver + flags + len (16 B)│
├──────────────────────────────────────────────────────────────┤
│ DATA PAYLOAD raw or compressed bytes, NaN/Inf │
│ positions substituted with 0.0 │
├──────────────────────────────────────────────────────────────┤
│ mask_nan blob OPTIONAL — compressed NaN position mask │
├──────────────────────────────────────────────────────────────┤
│ mask_inf+ blob OPTIONAL — compressed +Inf position mask │
├──────────────────────────────────────────────────────────────┤
│ mask_inf- blob OPTIONAL — compressed -Inf position mask │
├──────────────────────────────────────────────────────────────┤
│ CBOR DESCRIPTOR carries a top-level "masks" sub-map │
│ when any mask is present (see below) │
├──────────────────────────────────────────────────────────────┤
│ cbor_offset (uint64 BE, 8 B) │
│ hash (uint64 BE, 8 B) xxh3-64 of body │
│ "ENDF" (4 B) │
└──────────────────────────────────────────────────────────────┘
The data-object footer is 20 bytes: [cbor_offset u64] [hash u64] [ENDF 4]. The cbor_offset field points at the CBOR
descriptor’s start relative to the frame’s first byte. The inline
hash slot carries the xxh3-64 of the frame body (everything
between the 16-byte header and this 20-byte footer) when the
message’s HASHES_PRESENT preamble flag is set; otherwise it is
0x0000000000000000.
Hash scope includes payload + masks + CBOR. It does NOT include
the header, the cbor_offset field, the hash slot itself, or
ENDF.
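Putting the layout together, a type-9 frame splits into a 16-byte header, the hashed body, and the 20-byte footer fields. A std-only sketch (`split_ntensor_frame` is an invented helper name, not the library API):

```rust
/// Split a type-9 data-object frame into (body, cbor_offset, stored_hash).
/// The hash covers only `body` (payload + masks + CBOR descriptor); the
/// 16-byte header and the 20-byte footer are excluded from its scope.
fn split_ntensor_frame(frame: &[u8]) -> Option<(&[u8], u64, u64)> {
    const HEADER: usize = 16;
    const FOOTER: usize = 20;
    if frame.len() < HEADER + FOOTER || &frame[frame.len() - 4..] != b"ENDF" {
        return None;
    }
    let f = frame.len() - FOOTER;
    let cbor_offset = u64::from_be_bytes(frame[f..f + 8].try_into().unwrap());
    let hash = u64::from_be_bytes(frame[f + 8..f + 16].try_into().unwrap());
    Some((&frame[HEADER..f], cbor_offset, hash))
}

fn main() {
    let mut fr = vec![0u8; 16];                 // frame header placeholder
    fr.extend_from_slice(&[9, 9, 9]);           // body: payload + masks + CBOR
    fr.extend_from_slice(&16u64.to_be_bytes()); // cbor_offset
    fr.extend_from_slice(&0u64.to_be_bytes());  // hash slot (HASHES_PRESENT = 0)
    fr.extend_from_slice(b"ENDF");
    let (body, cbor_offset, hash) = split_ntensor_frame(&fr).unwrap();
    assert_eq!((body, cbor_offset, hash), (&[9u8, 9, 9][..], 16, 0));
}
```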
The CBOR descriptor fully describes the data object: its type, shape, strides, data type, byte order, encoding pipeline, and optional per-object metadata. See the CBOR Metadata page for the schema.
See NaN / Inf Handling for the mask encode / decode semantics and the documented lossy-reconstruction caveat.
Preceder Metadata Frame
A Preceder Metadata Frame (type 8) optionally appears immediately before a Data Object Frame. It carries per-object metadata for the following data object, using the same GlobalMetadata CBOR format but with a single-entry base array.
Use case: Streaming producers that do not know ahead of time when the message will end can emit per-object metadata early via preceders, rather than waiting for the footer.
Ordering rules:
- Must appear in the data objects phase (after headers, before footers).
- Must be followed by exactly one Data Object Frame.
- Two consecutive preceders without an intervening DataObject are invalid.
- A dangling preceder (not followed by a DataObject) is invalid.
- Preceders are optional per-object.
CBOR structure:
{
"version": 2,
"base": [{"mars": {"param": "2t"}, "units": "K"}]
}
Merge on decode: Preceder keys override footer base[i] keys on conflict. Footer-only keys (e.g., auto-populated _reserved_.tensor with ndim, shape, strides, dtype) are preserved. The consumer sees a unified GlobalMetadata.base — the preceder/footer distinction is transparent.
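The merge rule amounts to a map union in which the preceder side wins. A sketch with String values standing in for CBOR values (this is not the library's actual signature):

```rust
use std::collections::BTreeMap;

/// Merge per-object metadata on decode: preceder keys override footer
/// keys on conflict; footer-only keys are preserved.
fn merge_object_metadata(
    footer: &BTreeMap<String, String>,
    preceder: &BTreeMap<String, String>,
) -> BTreeMap<String, String> {
    let mut merged = footer.clone();
    for (k, v) in preceder {
        merged.insert(k.clone(), v.clone()); // preceder wins on conflict
    }
    merged
}

fn main() {
    let footer = BTreeMap::from([
        ("param".to_string(), "2t".to_string()),
        ("dtype".to_string(), "float32".to_string()), // footer-only, preserved
    ]);
    let preceder = BTreeMap::from([("param".to_string(), "10u".to_string())]);
    let merged = merge_object_metadata(&footer, &preceder);
    assert_eq!(merged["param"], "10u");
    assert_eq!(merged["dtype"], "float32");
}
```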
Postamble (16 bytes)
The postamble sits at the very end of every message.
Offset Size Field
────── ────── ─────────────────────────────────
0 8 first_footer_offset (uint64 BE)
8 8 End magic: "39277777" (ASCII)
first_footer_offset is the byte offset (from the start of the message) to the first footer frame. This is never zero:
- If footer frames exist, it points to the start of the first one (e.g., the Footer Hash Frame).
- If no footer frames exist, it points to the postamble itself.
This guarantee means a reader can always distinguish “no footer frames” from “footer at offset 0” without ambiguity.
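A reader can implement this check from the last 16 bytes alone. A std-only sketch (`parse_postamble` is an invented helper name):

```rust
/// Parse the 16-byte postamble at the end of a message. Returns
/// (first_footer_offset, has_footer_frames), or None if the end magic
/// does not match.
fn parse_postamble(message: &[u8]) -> Option<(u64, bool)> {
    if message.len() < 16 || &message[message.len() - 8..] != b"39277777" {
        return None;
    }
    let p = message.len() - 16;
    let off = u64::from_be_bytes(message[p..p + 8].try_into().unwrap());
    // When the offset points at the postamble itself, no footer frames exist.
    Some((off, off as usize != p))
}

fn main() {
    let mut msg = vec![0u8; 24]; // preamble placeholder
    let p = msg.len() as u64;    // postamble starts at offset 24
    msg.extend_from_slice(&p.to_be_bytes()); // points at the postamble: no footers
    msg.extend_from_slice(b"39277777");
    assert_eq!(parse_postamble(&msg), Some((24, false)));
}
```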
The end magic 39277777 was chosen because it is unlikely to appear naturally in floating-point or integer data, making it useful as a corruption boundary detector.
Random Access Patterns
With a header index (most common)
When a message was written in non-streaming mode, the index is in the header. This is the fastest path — no seeking to the end required.
1. Read preamble (24 B) → check flags
2. Read header metadata frame → global context
3. Read header index frame → offsets[], lengths[]
4. Seek to offsets[N], read data object frame → decode
With a footer index only (streaming mode)
When a message was written in streaming mode, the encoder did not know the object count or offsets up front. The index lives in the footer.
1. Seek to end − 16, read postamble → first_footer_offset
2. Seek to first_footer_offset, scan footer frames → find index
3. Read footer index frame → offsets[], lengths[]
4. Seek to offsets[N], read data object frame → decode
Both paths give O(1) access to any data object by index. The
object count is derived from offsets.len().
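Given the decoded index arrays, object access reduces to a bounds-checked slice. A sketch, assuming the whole message is in memory:

```rust
/// O(1) access to data object `n` via the index arrays. Returns None on
/// a malformed index (mismatched array lengths) or an out-of-range slice.
fn object_frame<'a>(
    msg: &'a [u8],
    offsets: &[u64],
    lengths: &[u64],
    n: usize,
) -> Option<&'a [u8]> {
    if offsets.len() != lengths.len() || n >= offsets.len() {
        return None;
    }
    let start = offsets[n] as usize;
    let end = start.checked_add(lengths[n] as usize)?;
    msg.get(start..end)
}

fn main() {
    let msg: Vec<u8> = (0u8..10).collect();
    // Two fake "frames" at offsets 2 and 5, lengths 3 and 4.
    assert_eq!(object_frame(&msg, &[2, 5], &[3, 4], 0), Some(&msg[2..5]));
    assert_eq!(object_frame(&msg, &[2, 5], &[3], 0), None); // mismatched index
}
```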
Scanning a Multi-Message File
Multiple messages can be concatenated into a single .tgm file. To find message boundaries:
- Scan forward for the `TENSOGRM` magic (8 bytes).
- Read `total_length` from the preamble.
  - If `total_length` is non-zero, advance by that many bytes to reach the next message.
  - If `total_length` is zero (streaming mode), use the header index frame length if present.
- If neither total length nor header index is available, walk frame-by-frame — each frame header contains a length field — until the next `TENSOGRM` magic or EOF.
- Verify the `39277777` end magic at the expected position to confirm message integrity.
flowchart TD
A[Start of file] --> B{Find TENSOGRM?}
B -- No --> Z[End of scan]
B -- Yes --> C[Read total_length at +16]
C --> D{total_length > 0?}
D -- Yes --> E[Advance to offset + total_length]
D -- No --> F[Walk frame-by-frame to next magic]
E --> G[Verify 39277777 end magic]
F --> G
G -- Valid --> H[Record message]
H --> B
G -- Invalid --> I[Skip 1 byte, resume scan]
I --> B
If the end magic does not match, the message is likely corrupt. The scanner skips one byte and resumes searching — this is the corruption recovery path.
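The scan loop can be sketched in a few lines. This simplified version assumes non-streaming messages (a non-zero total_length at preamble offset +16, as in the flowchart) and omits the frame-walking fallback:

```rust
/// Scan a buffer for concatenated Tensogram messages: find the magic,
/// hop by total_length, verify the end magic, and on mismatch skip one
/// byte and resume (the corruption-recovery path).
fn scan_messages(buf: &[u8]) -> Vec<(usize, usize)> {
    let mut found = Vec::new();
    let mut i = 0;
    while i + 24 <= buf.len() {
        if &buf[i..i + 8] != b"TENSOGRM" {
            i += 1; // not a message start: keep searching
            continue;
        }
        let total = u64::from_be_bytes(buf[i + 16..i + 24].try_into().unwrap()) as usize;
        let end = match i.checked_add(total) {
            Some(e) => e,
            None => { i += 1; continue; }
        };
        // Minimum message = 24 B preamble + 16 B postamble = 40 B.
        if total >= 40 && end <= buf.len() && &buf[end - 8..end] == b"39277777" {
            found.push((i, total)); // record (offset, length)
            i = end;                // hop to the next candidate message
        } else {
            i += 1;                 // corrupt or streaming: skip one byte, resume
        }
    }
    found
}

fn main() {
    // One minimal fake message: magic, 8 placeholder bytes, total_length,
    // 8 body bytes, end magic (40 bytes total).
    let mut msg = Vec::new();
    msg.extend_from_slice(b"TENSOGRM");
    msg.extend_from_slice(&[0u8; 8]);
    msg.extend_from_slice(&40u64.to_be_bytes());
    msg.extend_from_slice(&[0u8; 8]);
    msg.extend_from_slice(b"39277777");
    assert_eq!(scan_messages(&msg), vec![(0, 40)]);
}
```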
A Note on CBOR
Frames that contain CBOR data (metadata, index, hash) use length-prefixed CBOR encoding — there are no explicit start/end markers within the CBOR stream itself. The CBOR decoder reads each item’s head byte (plus any extended-length bytes) to determine the data type and item count, then consumes exactly that much data. The frame boundaries (FR…ENDF) provide the outer containment.
All CBOR maps use deterministic encoding with canonical key ordering (RFC 8949 section 4.2). See CBOR Metadata for details.
CBOR Metadata Schema
Tensogram uses CBOR (Concise Binary Object Representation) for all structured metadata. There are four kinds of CBOR structures in a message, each living in its own frame:
- GlobalMetadata — in header or footer metadata frames
- DataObjectDescriptor — inside each data object frame
- IndexFrame — in header or footer index frames
- HashFrame — in header or footer hash frames
All CBOR maps use deterministic encoding with canonical key ordering per RFC 8949 section 4.2. Keys are sorted by the byte representation of their CBOR-encoded key, applied recursively to nested maps. This means the same metadata always produces the same bytes — important if you hash messages or compare them by digest.
GlobalMetadata
The global metadata frame contains a single CBOR map. The only required key is version; everything else is optional.
| Key | Type | Required | Description |
|---|---|---|---|
| version | uint | Yes | Format version. Currently 2 |
| base | array of maps | No | Per-object metadata — one entry per data object; each entry holds ALL metadata for that object independently |
| _reserved_ | map | No | Library internals (provenance: encoder, time, uuid). Client code MUST NOT write to this. |
| _extra_ | map | No | Client-writable catch-all for ad-hoc message-level annotations |
| any unknown key | any | No | Silently ignored on decode (forward compatibility) |
Each data object is self-describing via its own per-frame descriptor (see below). The base array provides per-object metadata at the message level so readers can discover object metadata from the global frame alone, without opening each data object frame.
The base Array
The base array is one entry per data object. Each entry is a CBOR map holding ALL structured metadata for that object. The encoder auto-populates _reserved_.tensor (containing ndim, shape, strides, dtype) in each entry. Application keys (e.g. "mars") are preserved:
{
"base": [
{
"mars": { "class": "od", "stream": "oper", "param": "2t", "date": "20260404" },
"_reserved_": {
"tensor": { "ndim": 2, "shape": [721, 1440], "strides": [1440, 1], "dtype": "float32" }
}
},
{
"mars": { "class": "od", "stream": "oper", "param": "10u", "date": "20260404" },
"_reserved_": {
"tensor": { "ndim": 2, "shape": [721, 1440], "strides": [1440, 1], "dtype": "float32" }
}
}
]
}
Each entry corresponds to one data object in order. Entries are independent — there is no tracking of which keys are common across objects. If you need to extract commonalities (e.g. for display or merge operations), use the compute_common() utility in software after decoding.
Key difference from earlier versions: There is no common/payload split. Every base[i] entry is self-contained. MARS keys that are shared across all objects (e.g. class, stream, date) are simply repeated in each entry.
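The compute_common() utility mentioned above plausibly reduces to a key-by-key intersection across entries. A sketch with String values standing in for CBOR values; the real utility's signature may differ:

```rust
use std::collections::BTreeMap;

/// Keys whose values are identical across every base entry.
fn compute_common(entries: &[BTreeMap<String, String>]) -> BTreeMap<String, String> {
    let mut common = match entries.first() {
        Some(first) => first.clone(),
        None => return BTreeMap::new(),
    };
    for entry in &entries[1..] {
        // Keep only keys that appear in this entry with an equal value.
        common.retain(|k, v| entry.get(k).map(|x| x == v).unwrap_or(false));
    }
    common
}

fn main() {
    let a = BTreeMap::from([
        ("class".to_string(), "od".to_string()),
        ("param".to_string(), "2t".to_string()),
    ]);
    let b = BTreeMap::from([
        ("class".to_string(), "od".to_string()),
        ("param".to_string(), "10u".to_string()),
    ]);
    let common = compute_common(&[a, b]);
    assert_eq!(common.len(), 1);       // only "class" is shared with an equal value
    assert_eq!(common["class"], "od");
}
```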
The _reserved_ Section
The _reserved_ section at the message level holds library-managed provenance information. Client code can read these values but must not write to _reserved_ — the encoder validates this and rejects messages where client code has written to it.
{
"_reserved_": {
"encoder": { "name": "tensogram", "version": "0.1.0" },
"time": "2026-04-06T12:00:00Z",
"uuid": "550e8400-e29b-41d4-a716-446655440000"
}
}
Note: _reserved_.encoder.version is set to the library’s crate version at compile time via env!("CARGO_PKG_VERSION") — the value above reflects the tensogram version in use.
Within each base[i] entry, the encoder also auto-populates _reserved_.tensor:
{
"_reserved_": {
"tensor": {
"ndim": 2,
"shape": [721, 1440],
"strides": [1440, 1],
"dtype": "float32"
}
}
}
The _extra_ Section
The _extra_ section is a client-writable catch-all for ad-hoc message-level annotations:
{
"_extra_": {
"source": "ifs-cycle49r2",
"experiment_tag": "alpha-run-003"
}
}
Example GlobalMetadata
A complete example with two data objects (temperature and wind fields):
{
"version": 2,
"base": [
{
"mars": {
"class": "od", "stream": "oper", "expver": "0001",
"date": "20260404", "time": "0000", "step": "0",
"levtype": "sfc", "grid": "regular_ll", "param": "2t"
},
"_reserved_": {
"tensor": { "ndim": 2, "shape": [721, 1440], "strides": [1440, 1], "dtype": "float32" }
}
},
{
"mars": {
"class": "od", "stream": "oper", "expver": "0001",
"date": "20260404", "time": "0000", "step": "0",
"levtype": "sfc", "grid": "regular_ll", "param": "10u"
},
"_reserved_": {
"tensor": { "ndim": 2, "shape": [721, 1440], "strides": [1440, 1], "dtype": "float32" }
}
}
],
"_reserved_": {
"encoder": { "name": "tensogram", "version": "0.6.0" },
"time": "2026-04-06T12:00:00Z",
"uuid": "550e8400-e29b-41d4-a716-446655440000"
},
"_extra_": {
"source": "ifs-cycle49r2"
}
}
Each base[i] entry is fully self-contained. The only key that varies between the two entries above is param. All other MARS keys are repeated — this is by design. Commonalities can be computed in software via compute_common() when needed.
Optional: Full GRIB Namespace Keys
When the GRIB importer runs with preserve_all_keys (CLI: --all-keys), all non-mars ecCodes namespace keys are stored under a "grib" sub-object within each base[i] entry:
{
"base": [
{
"mars": { "class": "od", "grid": "regular_ll", "param": "2t", "..." : "..." },
"grib": {
"geography": { "Ni": 1440, "Nj": 721, "gridType": "regular_ll" },
"time": { "dataDate": 20260404, "dataTime": 0 },
"ls": { "edition": 2, "centre": "ecmf", "packingType": "grid_ccsds" },
"parameter": { "paramId": 167, "shortName": "2t", "units": "K" },
"statistics": { "max": 311.03, "min": 212.84, "avg": 277.6 }
},
"_reserved_": {
"tensor": { "ndim": 2, "shape": [721, 1440], "strides": [1440, 1], "dtype": "float64" }
}
}
]
}
The namespaces captured are: ls, geography, time, vertical, parameter, statistics. Keys may overlap between namespaces (e.g. gridType appears in both ls and geography); each namespace stores its own copy. Empty namespaces are omitted.
DataObjectDescriptor
Each data object frame contains its own CBOR descriptor. This descriptor fully describes how to decode the payload — its type, shape, encoding pipeline, and optional per-object metadata. It lives inside the data object frame (not in a central metadata block).
| Key | Type | Required | Description |
|---|---|---|---|
| type | text | Yes | Object type, e.g. "ntensor" (Rust field: obj_type) |
| ndim | uint | Yes | Number of dimensions |
| shape | array of uint | Yes | Size of each dimension |
| strides | array of uint | Yes | Element stride per dimension |
| dtype | text | Yes | Data type string (see Data Types) |
| byte_order | text | Yes | "big" or "little" |
| encoding | text | Yes | "none" or "simple_packing" |
| filter | text | Yes | "none" or "shuffle" |
| compression | text | Yes | "none", "szip", "zstd", "lz4", "blosc2", "zfp", or "sz3" |
| hash | map | No | Integrity hash of the payload (see below) |
| masks | map | No | NaN / Inf bitmask companion descriptors (see below) |
| encoding params | various | Conditional | Required when encoding != "none" |
| filter params | various | Conditional | Required when filter != "none" |
| compression params | various | Conditional | Required when compression != "none" |
| any other key | any | No | Per-object encoding parameters |
Example: Temperature Field Descriptor
Here is what a descriptor might look like for a global temperature field at 0.25-degree resolution, compressed with zstd:
{
"type": "ntensor",
"ndim": 2,
"shape": [721, 1440],
"strides": [1440, 1],
"dtype": "float32",
"byte_order": "little",
"encoding": "simple_packing",
"reference_value": 193.72,
"binary_scale_factor": -16,
"decimal_scale_factor": 0,
"bits_per_value": 16,
"filter": "none",
"compression": "zstd",
"zstd_level": 3,
"hash": {
"type": "xxh3",
"value": "a1b2c3d4e5f60718"
}
}
The params field in DataObjectDescriptor is for encoding parameters only (e.g. reference_value, bits_per_value). MARS keys and other application metadata are stored in the global metadata base[i]["mars"].
Encoding Parameters (simple_packing)
| Key | Type | Description |
|---|---|---|
reference_value | float | Minimum value in the original data |
binary_scale_factor | int | Power-of-2 scaling factor |
decimal_scale_factor | int | Power-of-10 scaling factor |
bits_per_value | uint | Number of bits per packed value (1-64) |
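These four parameters combine as in the GRIB simple-packing convention they mirror: value = (reference_value + packed × 2^binary_scale_factor) / 10^decimal_scale_factor. A sketch of the per-value reconstruction:

```rust
/// Reconstruct one value from its packed integer, following the GRIB
/// simple-packing convention:
///   value = (reference_value + packed * 2^binary_scale_factor)
///           / 10^decimal_scale_factor
fn unpack(
    packed: u64,
    reference_value: f64,
    binary_scale_factor: i32,
    decimal_scale_factor: i32,
) -> f64 {
    (reference_value + packed as f64 * 2f64.powi(binary_scale_factor))
        / 10f64.powi(decimal_scale_factor)
}

fn main() {
    // packed == 0 always reconstructs the reference value (the minimum).
    assert!((unpack(0, 193.72, -16, 0) - 193.72).abs() < 1e-9);
    // With binary_scale_factor = -16, each packed step adds 2^-16.
    assert!((unpack(1, 0.0, -16, 0) - 1.0 / 65536.0).abs() < 1e-12);
}
```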
Filter Parameters (shuffle)
| Key | Type | Description |
|---|---|---|
shuffle_element_size | uint | Byte width of each element (e.g., 4 for float32) |
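The shuffle filter itself is a simple byte transposition: byte 0 of every element first, then byte 1, and so on, which tends to group similar bytes and improve compressibility. A sketch for illustration (the library's implementation may differ):

```rust
/// Byte shuffle: transpose an element-major byte layout into a
/// byte-plane-major layout. `element_size` is shuffle_element_size
/// (e.g. 4 for float32). Decoding applies the inverse transposition.
fn shuffle(data: &[u8], element_size: usize) -> Vec<u8> {
    let n = data.len() / element_size;
    let mut out = vec![0u8; data.len()];
    for e in 0..n {
        for b in 0..element_size {
            out[b * n + e] = data[e * element_size + b];
        }
    }
    out
}

fn main() {
    // Three 2-byte elements: [1,2], [3,4], [5,6]
    // -> all first bytes, then all second bytes.
    assert_eq!(shuffle(&[1, 2, 3, 4, 5, 6], 2), vec![1, 3, 5, 2, 4, 6]);
}
```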
Compression Parameters
szip:
| Key | Type | Description |
|---|---|---|
szip_rsi | uint | Reference sample interval |
szip_block_size | uint | Block size (typically 8 or 16) |
szip_flags | uint | AEC encoding flags |
szip_block_offsets | array of uint | Bit offsets of RSI block boundaries (computed by the library or provided via encode_pre_encoded, see Pre-encoded Payloads) |
zstd:
| Key | Type | Default | Description |
|---|---|---|---|
zstd_level | int | 3 | Compression level (1-22) |
lz4: No additional parameters required.
blosc2:
| Key | Type | Default | Description |
|---|---|---|---|
blosc2_codec | text | "lz4" | Internal codec: blosclz, lz4, lz4hc, zlib, zstd |
blosc2_clevel | int | 5 | Compression level (0-9) |
blosc2_typesize | uint | (auto) | Element byte width for shuffle optimization |
zfp:
| Key | Type | Description |
|---|---|---|
zfp_mode | text | "fixed_rate", "fixed_precision", or "fixed_accuracy" |
zfp_rate | float | Bits per value (only for fixed_rate) |
zfp_precision | uint | Bit planes to keep (only for fixed_precision) |
zfp_tolerance | float | Max absolute error (only for fixed_accuracy) |
sz3:
| Key | Type | Description |
|---|---|---|
sz3_error_bound_mode | text | "abs", "rel", or "psnr" |
sz3_error_bound | float | Error bound value |
Hash Descriptor
The optional hash field records an integrity digest of the raw payload bytes.
| Key | Type | Description |
|---|---|---|
type | text | "xxh3" |
value | text | Hex-encoded digest |
NaN / Inf mask companion (masks)
When the object was encoded with allow_nan=true and/or
allow_inf=true AND the payload actually contained at least one
matching non-finite value, the descriptor carries a masks
sub-map. Each kind (nan, inf+, inf-) is independently
optional — only the kinds that appeared are present.
{
... standard DataObjectDescriptor fields ...,
"masks": {
"nan": {
"method": "roaring",
"offset": 40,
"length": 12
},
"inf+": {
"method": "rle",
"offset": 52,
"length": 3
}
}
}
Each entry:
| Key | Type | Description |
|---|---|---|
method | text | "none", "rle", "roaring", "blosc2", "zstd", or "lz4" — compression method actually used (may differ from the requested method due to the small-mask auto-fallback) |
offset | uint | Byte offset of the mask blob from the start of the payload region (= first byte after the 16-byte frame header) |
length | uint | Byte length of the mask blob on disk |
params | map | Optional method-specific parameters (e.g. {"level": 3} for zstd, {"codec": "lz4", "level": 5} for blosc2) |
Canonical key order for masks is the byte-lex sort inf+ < inf- < nan.
The encoder writes mask blobs between the payload and the CBOR
descriptor in the same canonical order. See
NaN / Inf Handling for the encode
/ decode semantics.
IndexFrame
Index frames (header or footer) contain a CBOR map that lets readers jump directly to any data object without scanning.
| Key | Type | Description |
|---|---|---|
offsets | array of uint | Byte offset of each data object frame from message start |
lengths | array of uint | Byte length of each data object frame |
Object count is derived from offsets.len(); lengths.len() must
equal offsets.len() or the decoder emits a MetadataError.
Example IndexFrame
{
"offsets": [256, 1048832, 2097408],
"lengths": [1048576, 1048576, 524288]
}
The offsets array gives O(1) random access to any object — seek
to offsets[i] and read lengths[i] bytes.
HashFrame
Hash frames (header or footer) mirror the per-object inline hash slots of each data-object frame’s footer (see wire-format.md §2.4), so readers can inspect the aggregate without walking every frame.
| Key | Type | Description |
|---|---|---|
algorithm | text | Hash algorithm name. "xxh3" is the only value a v3 encoder emits. |
hashes | array of text | Hex-encoded digest for each object, in emission order. |
Object count is derived from hashes.len(). An unknown
algorithm value triggers an UnknownHashAlgorithm warning at
validate time; the inline slots remain the authoritative check.
Example HashFrame
{
"algorithm": "xxh3",
"hashes": [
"a1b2c3d4e5f60718",
"b2c3d4e5f6071829",
"c3d4e5f60718293a"
]
}
Canonical Encoding
All CBOR maps are encoded with keys sorted by the byte representation of their CBOR-encoded key (RFC 8949 section 4.2). This sorting is applied recursively — nested maps are also sorted.
For short string keys (the common case), this is equivalent to sorting by the key string itself. For long keys or non-string keys, the CBOR byte encoding determines the order.
Why does this matter? If you hash an entire message or compare messages by digest, deterministic encoding ensures that logically identical messages produce identical bytes even if the keys were inserted in different order during construction.
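In Rust, a BTreeMap gives this determinism for free: iteration order depends only on the keys, never on insertion order. For the short ASCII keys used throughout these examples, that sorted order coincides with the canonical order described above:

```rust
use std::collections::BTreeMap;

/// The order a map's keys would be emitted in: BTreeMap iterates keys in
/// sorted order regardless of how they were inserted.
fn emission_order(pairs: &[(&str, &str)]) -> Vec<String> {
    let map: BTreeMap<_, _> = pairs.iter().cloned().collect();
    map.keys().map(|k| k.to_string()).collect()
}

fn main() {
    // Two logically identical maps built in different insertion orders
    // produce the same key sequence, hence the same encoded bytes.
    let a = emission_order(&[("stream", "oper"), ("class", "od")]);
    let b = emission_order(&[("class", "od"), ("stream", "oper")]);
    assert_eq!(a, b);
    assert_eq!(a, ["class", "stream"]);
}
```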
Metadata Value Types
All Tensogram metadata — whether in GlobalMetadata, the base / _reserved_ / _extra_ sections, or per-object params — is stored as CBOR. This page describes which value types are valid, which are forbidden, and why.
Allowed Types
Use only the subset of CBOR types that have direct JSON equivalents:
| CBOR type | Rust / Python equivalent | Example |
|---|---|---|
| text string | String / str | "imaging", "2026-01-12" |
| integer | i64 / int | 850, -1, 0 |
| float | f64 / float | 3.14, -273.15 |
| boolean | bool / bool | true, false |
| null | None / None | (absence of a value) |
| array | Vec<Value> / list | [1440, 721], ["t2", "flair"] |
| map | BTreeMap<String, Value> / dict | {"device": "mri", "sequence": "t2_flair"} |
Map keys must be text strings. Nested arrays and maps are allowed and encoded recursively.
Forbidden Types
The following CBOR types are not allowed in Tensogram metadata:
| Type | Reason |
|---|---|
| byte strings | Opaque blobs break cross-language interoperability; use base64 text instead |
| CBOR tags | Tags (#6.<n>) are not parsed by most CBOR libraries and can change value semantics |
| undefined | Only valid in streaming CBOR; never appears in map values |
| half-precision floats (f16) | Not supported by many JSON bridges; use f64 |
| non-string map keys | Integer or binary keys are non-canonical and not searchable |
The base Section
The base section of GlobalMetadata is a CBOR array of maps — one entry
per data object. Each entry holds ALL structured metadata for that object
independently. The encoder auto-populates _reserved_.tensor (with ndim, shape,
strides, dtype) in each entry when you call encode() or
StreamingEncoder::finish(). Any other keys the application placed in a base
entry before encoding (e.g. a per-object vocabulary namespace) are preserved.
The example below uses the MARS vocabulary; any application namespace works the
same way:
{
"version": 2,
"base": [
{
"mars": { "class": "od", "type": "fc", "grid": "O1280", "param": "2t", "levtype": "sfc" },
"_reserved_": {
"tensor": { "ndim": 2, "shape": [721, 1440], "strides": [1440, 1], "dtype": "float64" }
}
},
{
"mars": { "class": "od", "type": "fc", "grid": "O1280", "param": "lnsp", "levtype": "ml" },
"_reserved_": {
"tensor": { "ndim": 1, "shape": [137], "strides": [1], "dtype": "float64" }
}
}
]
}
Each entry is fully self-contained — all keys for that object appear in its
entry. There is no separate “common” section for shared keys. If you need to
extract commonalities (e.g. for display), use the compute_common() utility in
software after decoding.
Note:
basedescribes the collection of objects at the message level. Individual tensor encoding details (encoding pipeline, hash) remain in each object’s ownDataObjectDescriptor. TheDataObjectDescriptor.paramsfield is reserved for encoding parameters only — it does not carry application metadata.
Practical Guidance
- Prefer integers for numeric identifiers (paramId, date, run_id).
- Use text strings for classification codes even if they happen to be numeric-looking — consistency with your chosen vocabulary is more important than type optimisation.
- Use nested maps for namespaced keys (e.g., "mars": {...}, "bids": {...}, "dicom": {...}).
- Keep individual values small. Avoid storing large arrays (e.g., grid coordinates) in metadata — they belong in data objects.
See Also
- CBOR Metadata Schema — field-level reference for all CBOR structures
- Metadata Concepts — how global and per-object metadata relate
- Vocabularies — example application-layer vocabularies used with Tensogram
- GRIB MARS Key Mapping — how GRIB keys are mapped during import
Data Types
The dtype field in an object descriptor names the element type of the tensor. It is stored as a lowercase text string in CBOR.
Type Table
| CBOR string | Rust variant | Bytes per element | Notes |
|---|---|---|---|
float16 | Dtype::Float16 | 2 | IEEE 754 half-precision |
bfloat16 | Dtype::Bfloat16 | 2 | Brain float — same exponent range as float32, less mantissa precision |
float32 | Dtype::Float32 | 4 | IEEE 754 single-precision |
float64 | Dtype::Float64 | 8 | IEEE 754 double-precision |
complex64 | Dtype::Complex64 | 8 | Pair of float32 (real, imaginary) |
complex128 | Dtype::Complex128 | 16 | Pair of float64 (real, imaginary) |
int8 | Dtype::Int8 | 1 | Signed |
int16 | Dtype::Int16 | 2 | Signed |
int32 | Dtype::Int32 | 4 | Signed |
int64 | Dtype::Int64 | 8 | Signed |
uint8 | Dtype::Uint8 | 1 | Unsigned |
uint16 | Dtype::Uint16 | 2 | Unsigned |
uint32 | Dtype::Uint32 | 4 | Unsigned |
uint64 | Dtype::Uint64 | 8 | Unsigned |
bitmask | Dtype::Bitmask | 0* | Packed bits |
*bitmask returns 0 from byte_width() — see the edge case note below.
Byte Order
The byte_order field in the payload descriptor specifies whether multi-byte elements are stored in big-endian ("big") or little-endian ("little") order. This applies to the stored payload bytes after encoding.
Single-byte types (int8, uint8, bitmask) are unaffected by byte order.
Bitmask Edge Case
Dtype::Bitmask is for packing boolean or categorical data sub-byte. The payload size is ceil(num_elements / 8) bytes. The byte_width() method returns 0 as a sentinel; callers that need the actual payload size must compute it:
#![allow(unused)]
fn main() {
let payload_bytes = if dtype == Dtype::Bitmask {
(num_elements + 7) / 8
} else {
num_elements * dtype.byte_width()
};
}
Choosing a dtype
| Situation | Recommended dtype |
|---|---|
| Temperature, wind speed, pressure (weather) | float32 |
| High-precision scientific analysis | float64 |
| ML model weights | bfloat16 or float16 |
| Integer indices, counts | int32 or int64 |
| Land-sea masks, validity flags | uint8 or bitmask |
| Complex wave spectra | complex64 |
Quick Start
This page walks you through encoding and decoding a real tensor — a 2D temperature field — in a few dozen lines of Rust.
Installation
Rust
cargo add tensogram
Or add it to your Cargo.toml manually:
[dependencies]
tensogram = "0.15"
Optional features:
| Feature | What it adds |
|---|---|
mmap | Zero-copy memory-mapped file reads |
async | Async I/O via tokio |
remote | Read from S3, GCS, Azure Blob, or HTTP |
szip-pure | Pure-Rust szip (no C dependency) |
zstd-pure | Pure-Rust zstd (no C dependency) |
All compression codecs (szip, zstd, lz4, blosc2, zfp, sz3) and multi-threading are enabled by default.
cargo add tensogram --features mmap,async,remote
Python
pip install tensogram
With xarray and Zarr backends:
pip install tensogram[all] # everything
pip install tensogram[xarray] # xarray backend only
pip install tensogram[zarr] # Zarr v3 store only
CLI
cargo install tensogram-cli
Encode a 2-D Float Field
This example encodes a 100×200 float32 grid — representative of many
scientific 2-D fields (temperature, pressure, intensity, density, …).
use std::collections::BTreeMap;
use tensogram::{
encode, decode, GlobalMetadata, DataObjectDescriptor,
ByteOrder, Dtype, EncodeOptions, DecodeOptions,
};
fn main() {
// 1. Make some synthetic data: 100×200 float32 grid
// In production, this would come from your model output, sensor,
// or upstream pipeline.
let shape = vec![100u64, 200];
let strides = vec![200u64, 1]; // C-contiguous (row-major)
let num_elements = 100 * 200;
let data: Vec<u8> = (0..num_elements)
.flat_map(|i| (273.15f32 + (i as f32 / 100.0)).to_be_bytes())
.collect();
// 2. Describe the tensor
let global = GlobalMetadata {
version: 2,
..Default::default()
};
let desc = DataObjectDescriptor {
obj_type: "ntensor".to_string(),
ndim: 2,
shape,
strides,
dtype: Dtype::Float32,
byte_order: ByteOrder::Big,
encoding: "none".to_string(),
filter: "none".to_string(),
compression: "none".to_string(),
params: BTreeMap::new(),
hash: None, // hash is added automatically by EncodeOptions::default()
};
// 3. Encode — produces a self-contained message
let message = encode(&global, &[(&desc, &data)], &EncodeOptions::default()).unwrap();
println!("Encoded {} bytes", message.len());
// 4. Decode it back
let (meta, objects) = decode(&message, &DecodeOptions::default()).unwrap();
println!(
"Decoded: {} objects, shape {:?}, dtype {}",
objects.len(),
objects[0].0.shape,
objects[0].0.dtype,
);
assert_eq!(objects[0].1, data);
}
Add Application Metadata
Real messages need application-layer metadata so downstream tools know what
the data represents. Per-object metadata goes into the base array — one
entry per data object — and is organised under a namespace key so that
multiple vocabularies can coexist.
The example below uses ECMWF’s MARS vocabulary for concreteness. The same
mechanism works with any vocabulary: CF conventions ("cf"), BIDS
("bids"), DICOM ("dicom"), or your own ("product", "experiment",
"device", …).
#![allow(unused)]
fn main() {
use ciborium::Value;
// Build a "mars" namespace for the object — one concrete vocabulary example.
// You can just as easily use "bids", "dicom", "product", or any custom name.
let mars_map = vec![
(Value::Text("class".into()), Value::Text("od".into())),
(Value::Text("date".into()), Value::Text("20260401".into())),
(Value::Text("step".into()), Value::Integer(6.into())),
(Value::Text("type".into()), Value::Text("fc".into())),
(Value::Text("param".into()), Value::Text("2t".into())),
];
let mut entry = BTreeMap::new();
entry.insert("mars".to_string(), Value::Map(mars_map));
let global = GlobalMetadata {
version: 2,
base: vec![entry], // one entry per data object
..Default::default()
};
let desc = DataObjectDescriptor {
obj_type: "ntensor".to_string(),
ndim: 2,
shape: vec![100, 200],
strides: vec![200, 1],
dtype: Dtype::Float32,
byte_order: ByteOrder::Big,
encoding: "none".to_string(),
filter: "none".to_string(),
compression: "none".to_string(),
params: BTreeMap::new(),
hash: None,
};
}
What’s Next?
- Use simple_packing to reduce payload size by 4-8x
- Use the File API to append many messages to a
.tgmfile - Use the CLI to inspect files without writing any code
Vocabularies
Tensogram is vocabulary-agnostic: the library never interprets metadata keys. The same message can carry any combination of application-defined namespaces alongside the auto-populated library-reserved keys. This page collects example vocabularies that have been (or could naturally be) used with Tensogram, so you can pick a convention that matches your domain — or invent your own.
How metadata is structured
A Tensogram message’s per-object metadata lives in base[i], a
BTreeMap<String, ciborium::Value>. By convention, each application
vocabulary sits under its own top-level namespace key so that multiple
vocabularies can coexist without collision:
{
"version": 2,
"base": [{
"mars": { "class": "od", "param": "2t" },
"cf": { "standard_name": "air_temperature", "units": "K" },
"custom": { "experiment": "run-042" }
}]
}
All three namespaces above are valid, visible to tooling, and survive round-trip. The library never reads or validates their contents.
Example vocabularies
MARS (ECMWF, weather forecasting)
Used internally at ECMWF and by downstream consumers of ECMWF’s MARS archive. Keys describe the operational provenance of a forecast field: class, stream, type, parameter, level, date/time, step, etc.
{
"mars": {
"class": "od", "stream": "oper", "type": "fc",
"date": "20260401", "time": "1200", "step": 6,
"param": "2t", "levtype": "sfc"
}
}
The GRIB importer (tensogram convert-grib) automatically populates this
namespace from GRIB MARS keys. See
MARS Key Mapping for the full key list.
CF Conventions (climate, ocean, atmospheric)
CF Conventions are the standard attribute
vocabulary for climate and forecast data in NetCDF. The NetCDF importer
(tensogram convert-netcdf --cf) lifts the CF allow-list into a "cf"
sub-map. See NetCDF CF Metadata Mapping.
{
"cf": {
"standard_name": "air_temperature",
"long_name": "2 metre temperature",
"units": "K",
"cell_methods": "time: mean"
}
}
BIDS (neuroimaging)
The Brain Imaging Data Structure organises neuroimaging datasets with entity-level metadata. A natural fit for Tensogram messages carrying fMRI, dMRI, or EEG tensors.
{
"bids": {
"subject": "sub-01", "session": "ses-01",
"task": "rest", "run": 1, "acq": "hires"
}
}
DICOM (medical imaging)
DICOM tags are the standard descriptors
for medical imaging studies. They can be mapped into a "dicom" namespace
for use with Tensogram messages carrying imaging volumes, time-series, or
segmentation masks.
{
"dicom": {
"Modality": "MR", "SeriesDescription": "T2_FLAIR",
"SliceThickness": 1.0, "RepetitionTime": 8000
}
}
Zarr attributes (generic)
Zarr v3 attribute maps are generic key-value stores. When using the Zarr
backend (tensogram-zarr), group-level and array-level attributes are
surfaced through _extra_ and per-array descriptor params.
Custom namespaces
For any domain that does not have an established vocabulary, or when a pipeline wants to carry bespoke fields alongside a standard namespace, invent your own:
{
"experiment": {
"id": "run-042",
"operator": "alice",
"hypothesis": "beam stability",
"started_at": "2026-04-18T10:30:00Z"
}
}
Suggested conventions for custom namespaces:
- Use a short, lowercase namespace key ("product", "instrument", "run", "experiment", "device").
- Group related fields under a single namespace rather than scattering them at the top level of base[i].
- Prefer ISO 8601 timestamps, SI units in units fields, and UTF-8 text for identifiers.
- Document your namespace schema somewhere versioned (a README, a JSON schema, a wiki page) so downstream consumers can interpret it consistently.
Multiple vocabularies in one message
You can freely mix vocabularies in the same base[i] entry — the library
preserves all of them:
{
"base": [{
"mars": { "param": "2t", "levtype": "sfc" },
"cf": { "standard_name": "air_temperature", "units": "K" },
"provenance": { "pipeline_id": "pp-17", "stage": "post-process" }
}]
}
This lets one team’s producers emit messages that are simultaneously interpretable by MARS-expecting tools, CF-aware tooling, and an internal provenance tracker.
Looking up keys
The dotted-path helpers exposed by each binding vary. The CLI, the C FFI
(tgm_metadata_get_string / _get_int / _get_float), the C++ wrapper
(metadata::get_string / get_int / get_float), and the TypeScript
package (getMetaKey) all accept a full dotted path. The Rust crate and
the Python package do not expose a dotted-path helper at this time; use
direct nested access instead.
TypeScript — dotted path
import { getMetaKey } from '@ecmwf/tensogram';
const param = getMetaKey(meta, 'mars.param');
const subject = getMetaKey(meta, 'bids.subject');
CLI — dotted path
# Filter messages on a namespaced key
tensogram ls data.tgm -w "mars.param=2t/10u"
tensogram ls data.tgm -w "bids.subject=sub-01"
# Print specific keys
tensogram get -p "cf.standard_name,cf.units" data.tgm
Python — dict-style nested access
# Metadata.__getitem__ does a top-level search across base[i] (skipping
# _reserved_) and falls back to the message-level _extra_ map. The returned
# value is a plain Python dict, so the next lookup is standard dict access.
param = meta["mars"]["param"]
subject = meta["bids"]["subject"]
# meta.base[i], meta.reserved, and meta.extra are also available directly
# if you want the raw per-object / reserved / extra dicts.
first_base = meta.base[0]
Rust — pattern-match on ciborium::Value
#![allow(unused)]
fn main() {
use ciborium::Value;
use tensogram::GlobalMetadata;
// `meta.base` is `Vec<BTreeMap<String, Value>>`. Find the namespace on
// the first-matching base entry, then pull a text field from the nested
// map. Falls back to `meta.extra` for message-level annotations.
fn get_text<'a>(meta: &'a GlobalMetadata,
namespace: &str, field: &str) -> Option<&'a str> {
let pull = |map: &'a [(Value, Value)]| -> Option<&'a str> {
map.iter().find_map(|(k, v)| match (k, v) {
(Value::Text(k), Value::Text(v)) if k == field => Some(v.as_str()),
_ => None,
})
};
for entry in &meta.base {
if let Some(Value::Map(items)) = entry.get(namespace)
&& let Some(val) = pull(items)
{
return Some(val);
}
}
if let Some(Value::Map(items)) = meta.extra.get(namespace) {
return pull(items);
}
None
}
let param = get_text(&meta, "mars", "param");
}
Tensogram keeps the Rust surface small on purpose. If your pipeline needs dotted-path lookup in Rust, wrap the snippet above in a helper of your own, or call out to the CLI.
Lookup semantics (all bindings that support dotted paths)
First match across base[0], base[1], … (skipping _reserved_ within
each entry), then fall back to the message-level _extra_ map. An
explicit _extra_.key (or extra.key) prefix bypasses the base search.
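The Rust and Python bindings do not ship a dotted-path helper, but the semantics above are easy to replicate over the plain dicts the Python binding returns. The sketch below is illustrative; get_meta_key is a hypothetical name, not part of the package:

```python
def get_meta_key(meta: dict, path: str):
    """Hypothetical helper replicating the documented lookup order:
    first match across base[0], base[1], ... (skipping _reserved_),
    then the message-level _extra_ map; an explicit '_extra_.' prefix
    bypasses the base search."""
    head, _, rest = path.partition(".")

    def dig(node, dotted):
        # Walk the remaining dotted segments through nested dicts.
        for seg in dotted.split(".") if dotted else []:
            if not isinstance(node, dict) or seg not in node:
                return None
            node = node[seg]
        return node

    if head in ("_extra_", "extra"):
        return dig(meta.get("_extra_", {}), rest)
    for entry in meta.get("base", []):
        if head != "_reserved_" and head in entry:
            return dig(entry[head], rest)
    return dig(meta.get("_extra_", {}).get(head), rest)

meta = {
    "base": [{"mars": {"param": "2t"}, "_reserved_": {"uuid": "x"}}],
    "_extra_": {"note": "demo"},
}
assert get_meta_key(meta, "mars.param") == "2t"
assert get_meta_key(meta, "_extra_.note") == "demo"
```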
See also
- Metadata concepts — how the base, _reserved_, and _extra_ sections fit together.
- CBOR Metadata Schema — field-level reference.
- Metadata Value Types — which CBOR types are allowed inside metadata.
- GRIB MARS Key Mapping — what the GRIB importer produces.
- NetCDF CF Metadata Mapping — what the NetCDF importer produces with --cf.
Jupyter Notebook Walk-through
The examples/jupyter/
directory carries a curated set of narrative notebooks that introduce
Tensogram interactively, with live visualisations. Unlike the flat
.py examples under examples/python/
— which are minimal reference snippets for copy-paste — the notebooks
are for learning.
Every notebook is executed end-to-end on every PR by the
notebooks CI job, so they cannot rot.
The five journeys
| # | Notebook | What you will learn |
|---|---|---|
| 1 | 01_quickstart_and_mars.ipynb | Encode & decode a 2D tensor, visualise it, attach MARS metadata, walk the base / _reserved_ / _extra_ layout. |
| 2 | 02_encoding_and_fidelity.ipynb | Sweep every encoding × filter × compression combination and plot ratio vs time vs fidelity. |
| 3 | 03_from_grib_to_tensogram.ipynb | Convert a real ECMWF opendata GRIB2 file with the new Python API (tensogram.convert_grib + tensogram.convert_grib_buffer). |
| 4 | 04_from_netcdf_to_tensogram.ipynb | Build a small CF-compliant NetCDF in-process, convert it with tensogram.convert_netcdf, and open the result as an xarray Dataset via engine="tensogram". |
| 5 | 05_validation_and_parallelism.ipynb | Run the four validation levels, inject corruption, sweep threads=0…N and plot the speedup. |
Running the notebooks locally
Option 1 — uv pip install (recommended)
# Build the Python bindings with GRIB and NetCDF support.
# Requires libeccodes + libnetcdf installed at the OS level.
uv venv .venv --python 3.13
source .venv/bin/activate
uv pip install maturin
cd python/bindings
maturin develop --features grib,netcdf
cd ../..
# Install notebook-only dependencies + the xarray backend.
uv pip install -e examples/jupyter
# Launch JupyterLab.
jupyter lab examples/jupyter/
Option 2 — conda env create
conda env create -f examples/jupyter/environment.yml
conda activate tensogram-jupyter
jupyter lab examples/jupyter/
Option 3 — Binder / Colab
Launch badges in the notebook directory’s
README.md
— zero local install.
OS-level dependencies
Notebooks 03 (GRIB) and 04 (NetCDF) need C libraries installed at the operating system level. They are not Python packages.
| Library | Needed by | macOS (Homebrew) | Debian / Ubuntu |
|---|---|---|---|
| libeccodes | notebook 03 | brew install eccodes | apt install libeccodes-dev |
| libnetcdf + libhdf5 | notebook 04 | brew install netcdf hdf5 | apt install libnetcdf-dev libhdf5-dev |
The official PyPI wheels (pip install tensogram) do not ship
GRIB / NetCDF support: the manylinux_2_28 base image lacks the C
libraries. If you try to call tensogram.convert_grib(...) on a
wheel without the feature, you get a clean
RuntimeError("tensogram was built without GRIB support...") that
points you at this page.
To enable the feature, rebuild from source:
git clone https://github.com/ecmwf/tensogram
cd tensogram/python/bindings
maturin develop --features grib,netcdf
Running the notebooks in CI
The repository runs the notebooks end-to-end on every PR via a
dedicated notebooks job. The gate is:
pytest --nbval-lax examples/jupyter/ -v
--nbval-lax executes every cell in every notebook and fails the
build on any exception. Cell outputs are not compared — we commit
the notebooks with empty outputs (enforced by the
python/tests/test_jupyter_structure.py guard).
Output hygiene
Committed notebooks must have empty cell outputs. Install the
nbstripout pre-commit hook once:
uv pip install nbstripout
nbstripout --install
With the hook installed, git commit automatically strips outputs.
Adding a new notebook
- Copy an existing .ipynb as a template.
- First cell must be a markdown license banner mentioning “ECMWF” or “Apache”.
- Last cell must be a “Where to go next” markdown pointer.
- If you import matplotlib, call matplotlib.use("Agg") before the first import matplotlib.pyplot.
- Update EXPECTED_NOTEBOOKS in python/tests/test_jupyter_structure.py.
- Link it from examples/jupyter/README.md and this guide page.
- Run pytest --nbval-lax examples/jupyter/ locally before committing.
Encoding Data
This page covers the encode() function and EncodeOptions in detail.
Function Signature
#![allow(unused)]
fn main() {
pub fn encode(
global_metadata: &GlobalMetadata,
descriptors: &[(&DataObjectDescriptor, &[u8])],
options: &EncodeOptions,
) -> Result<Vec<u8>>
}
- global_metadata — reference to message-level metadata (version, base entries, _extra_ fields)
- descriptors — a slice of (descriptor, data) pairs, one per object
- options — controls hash algorithm and compression backend selection (the emit_preceders field is reserved for future buffered-mode support; preceders are currently only emitted via StreamingEncoder::write_preceder)
Returns a complete, self-contained message as a Vec<u8>.
EncodeOptions
#![allow(unused)]
fn main() {
pub struct EncodeOptions {
/// Hash algorithm to use. None disables hashing entirely.
pub hash_algorithm: Option<HashAlgorithm>,
/// Reserved — buffered `encode()` rejects `true`. Use
/// `StreamingEncoder::write_preceder()` instead.
pub emit_preceders: bool,
/// Which backend to use for szip / zstd when both FFI and pure-Rust
/// implementations are compiled in.
pub compression_backend: CompressionBackend,
}
impl Default for EncodeOptions {
fn default() -> Self {
Self {
hash_algorithm: Some(HashAlgorithm::Xxh3),
emit_preceders: false,
compression_backend: CompressionBackend::default(),
}
}
}
}
The default applies xxh3 hashing to every object payload. Use None to skip hashing:
#![allow(unused)]
fn main() {
let options = EncodeOptions {
hash_algorithm: None,
..Default::default()
};
}
What Encode Does
For each object, in order:
- Validate — checks that each pair has a descriptor and corresponding data
- Run the encoding pipeline — applies encoding, filter, compression from the object’s DataObjectDescriptor
- Hash — if hash_algorithm is set, computes and stores the hash in the descriptor
- Serialize CBOR — encodes the GlobalMetadata and all DataObjectDescriptors to canonical CBOR
- Frame — assembles preamble, header frames (metadata/index/hash), data object frames, and postamble
Encoding with Simple Packing
To use simple_packing, you need to compute the quantization parameters first, then put them in the DataObjectDescriptor:
#![allow(unused)]
fn main() {
use tensogram_encodings::simple_packing;
use tensogram::{ByteOrder, DataObjectDescriptor, Dtype};
use std::collections::BTreeMap;
use ciborium::Value;
// Your original values as f64 (simple_packing always works on f64).
// source_data might be a temperature grid, pressure field, intensity
// image, or any other bounded-range scalar field.
let values: Vec<f64> = source_data.iter().map(|&x| x as f64).collect();
// Compute quantization parameters for 16 bits per value
let params = simple_packing::compute_params(&values, 16, 0)?;
// Put the parameters into the descriptor
let mut packing_params = BTreeMap::new();
packing_params.insert("reference_value".into(),
Value::Float(params.reference_value));
packing_params.insert("binary_scale_factor".into(),
Value::Integer((params.binary_scale_factor as i64).into()));
packing_params.insert("decimal_scale_factor".into(),
Value::Integer((params.decimal_scale_factor as i64).into()));
packing_params.insert("bits_per_value".into(),
Value::Integer((params.bits_per_value as i64).into()));
let desc = DataObjectDescriptor {
obj_type: "ntensor".to_string(),
ndim: 2,
shape: vec![100, 200],
strides: vec![200, 1],
dtype: Dtype::Float64,
byte_order: ByteOrder::Big,
encoding: "simple_packing".to_string(),
filter: "none".to_string(),
compression: "none".to_string(),
params: packing_params,
hash: None,
};
}
Then encode as normal, passing the original raw bytes (as f64 bytes):
#![allow(unused)]
fn main() {
let raw: Vec<u8> = values.iter().flat_map(|v| v.to_ne_bytes()).collect();
let global = GlobalMetadata { version: 2, ..Default::default() };
let message = encode(&global, &[(&desc, &raw)], &EncodeOptions::default())?;
}
The encoder applies simple_packing internally. The payload stored in the message is the packed bits, not the original f64 bytes.
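To see what the four parameters mean numerically, here is a minimal pure-Python sketch of GRIB-style simple packing, assuming the relation Y = (R + X * 2^E) / 10^D that the parameter names suggest. This is an illustration only, not the crate’s implementation:

```python
def pack(values, ref, E, D):
    # X = round((Y * 10**D - R) / 2**E): the integers stored on the wire
    return [round((y * 10**D - ref) / 2**E) for y in values]

def unpack(packed, ref, E, D):
    # Y = (R + X * 2**E) / 10**D: reconstruction on decode
    return [(ref + x * 2**E) / 10**D for x in packed]

values = [273.15, 274.2, 280.0]    # e.g. temperatures in K
ref, E, D = min(values), -4, 0     # reference_value, binary/decimal scale factors
packed = pack(values, ref, E, D)
restored = unpack(packed, ref, E, D)
# With bits_per_value = 16, every packed integer must fit in 16 bits.
assert all(0 <= x < 2**16 for x in packed)
# Quantization error is bounded by half a quantum: 2**E / (2 * 10**D).
assert all(abs(a - b) <= 2**E / 2 for a, b in zip(values, restored))
```

The reference value pins the smallest input to packed integer 0; the binary scale factor sets the quantization step.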
Encoding Multiple Objects
Pass multiple (descriptor, data) pairs:
#![allow(unused)]
fn main() {
let global = GlobalMetadata { version: 2, ..Default::default() };
let message = encode(
&global,
&[(&spectrum_desc, &spectrum_data), (&mask_desc, &land_mask_data)],
&EncodeOptions::default(),
)?;
}
Each descriptor independently specifies its own encoding, compression, dtype, and byte order. The encoder processes each pair in sequence.
Error Conditions
| Error | Cause |
|---|---|
| Encoding | NaN in data when using simple_packing |
| Encoding | bits_per_value out of range (0–64) |
| Compression | Compressor-specific error (invalid params, unsupported dtype) |
| Metadata | CBOR serialization failed |
Pre-Encoded Data API (Advanced)
When to use this API
The encode_pre_encoded API is for advanced callers whose data is already
encoded by an external pipeline (e.g., a GPU kernel that emits packed bytes,
or a streaming receiver passing payloads through). It bypasses Tensogram’s
internal encoding pipeline and uses the supplied bytes verbatim.
Do NOT use this API for ordinary encoding. Use encode() instead.
⚠️ The bit-vs-byte trap
WARNING: When using compression="szip", the szip_block_offsets parameter contains bit offsets, not byte offsets. The first offset must be 0 and every offset must satisfy offset <= encoded_bytes_len * 8. This matches the libaec/szip wire format. See cbor-metadata.md for the format reference.

Getting this wrong is the #1 caller mistake. Tensogram validates the offsets structurally (monotonicity, bounds) but cannot detect a byte-instead-of-bit mistake until decode_range fails.
API surface
Rust
#![allow(unused)]
fn main() {
pub fn encode_pre_encoded(
metadata: &GlobalMetadata,
descriptors_and_data: &[(&DataObjectDescriptor, &[u8])],
options: &EncodeOptions,
) -> Result<Vec<u8>, TensogramError>
}
Python
import tensogram
msg: bytes = tensogram.encode_pre_encoded(
global_meta_dict={"version": 2},
descriptors_and_data=[(descriptor_dict, raw_bytes)],
hash="xxh3",
)
C
tgm_error tgm_encode_pre_encoded(
const char *metadata_json,
const uint8_t *const *data_ptrs,
const size_t *data_lens,
size_t num_objects,
const char *hash_algo,
tgm_bytes_t *out
);
C++
std::vector<std::uint8_t> tensogram::encode_pre_encoded(
const std::string& metadata_json,
const std::vector<std::pair<const std::uint8_t*, std::size_t>>& objects,
const encode_options& opts = {}
);
Hash semantics
The library always recomputes the hash of the pre-encoded bytes using
the algorithm specified in EncodeOptions.hash_algorithm (default xxh3). Any hash
the caller stored on the descriptor is silently overwritten. This guarantees
the wire format invariant descriptor.hash == hash_algo(bytes) always holds.
Provenance semantics
The encoded message is byte-format-indistinguishable from one produced by
encode(). The decoder cannot tell which API produced it. The provenance
fields _reserved_.encoder.name, _reserved_.time, and _reserved_.uuid
are populated identically.
Self-consistency checks
Before encoding, the library validates:
- Caller has not set EncodeOptions.emit_preceders (rejected).
- Caller has not put _reserved_ in their metadata (rejected).
- Each descriptor passes the standard validate_object checks.
- If compression="szip" and szip_block_offsets is supplied:
  - It’s a CBOR Array of u64.
  - First offset is 0.
  - Strictly monotonically increasing.
  - All bit offsets <= bytes_len * 8.
- If szip_block_offsets is supplied but compression != "szip", rejected.
These are structural checks only. The library does NOT trial-decode the bytes to verify they actually decode correctly.
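The offset checks amount to a few lines of arithmetic. The following pure-Python mirror of the documented rules is illustrative, not the crate’s code:

```python
def validate_szip_block_offsets(offsets, encoded_bytes_len):
    """Structural checks only: first offset 0, strictly increasing,
    every BIT offset within encoded_bytes_len * 8."""
    if not offsets or offsets[0] != 0:
        raise ValueError("first offset must be 0")
    if any(b <= a for a, b in zip(offsets, offsets[1:])):
        raise ValueError("offsets must be strictly increasing")
    if any(o > encoded_bytes_len * 8 for o in offsets):
        raise ValueError("offset exceeds bit bound")

validate_szip_block_offsets([0, 8192, 16384], 4096)  # ok: 16384 <= 32768 bits
# Note: offsets mistakenly given in BYTES for a large payload still pass
# these checks -- that is exactly the bit-vs-byte trap described above.
```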
Limitation: encoding=“none” size check
When encoding="none", the validate_object check enforces
payload_len == shape_product * dtype_byte_width. This means you cannot pass
compression-only payloads (e.g., zstd-compressed raw bytes) with
encoding="none" because the compressed size will not match the expected raw
size. Wrap such payloads in at least simple_packing or another encoding.
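The invariant is simple arithmetic. A quick sketch (expected_raw_len is a hypothetical helper, not library API):

```python
from math import prod

def expected_raw_len(shape, dtype_byte_width):
    # encoding="none" invariant: payload_len == shape_product * dtype_byte_width
    return prod(shape) * dtype_byte_width

# A 100 x 200 float64 tensor must arrive as exactly 160 000 raw bytes.
assert expected_raw_len([100, 200], 8) == 160_000
# A zstd-compressed payload is almost certainly a different length, so it
# fails this check -- hence the requirement to declare a real encoding.
```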
Worked example: simple_packing + szip with decode_range
#![allow(unused)]
fn main() {
use tensogram::{
encode_pre_encoded, DataObjectDescriptor, EncodeOptions,
GlobalMetadata, ByteOrder, Dtype,
};
use std::collections::BTreeMap;
use ciborium::Value;
// Pre-encoded bytes from a GPU kernel + szip block offsets in BITS
let pre_encoded_bytes: Vec<u8> = /* from GPU */;
let szip_offsets_bits: Vec<u64> = vec![0, 8192, 16384, /* ... */];
let mut params: BTreeMap<String, ciborium::Value> = BTreeMap::new();
params.insert("bits_per_value".into(), Value::Integer(24u64.into()));
params.insert("reference_value".into(), Value::Float(0.0));
params.insert("binary_scale_factor".into(), Value::Integer((-10i64).into()));
params.insert("decimal_scale_factor".into(), Value::Integer(0i64.into()));
params.insert("szip_rsi".into(), Value::Integer(128i64.into()));
params.insert("szip_block_size".into(), Value::Integer(16i64.into()));
params.insert("szip_flags".into(), Value::Integer(8i64.into()));
params.insert("szip_block_offsets".into(),
Value::Array(szip_offsets_bits.into_iter()
.map(|o| Value::Integer(o.into()))
.collect()));
let desc = DataObjectDescriptor {
obj_type: "ntensor".into(),
ndim: 2,
shape: vec![1024, 1024],
strides: vec![1024, 1],
dtype: Dtype::Float32,
byte_order: ByteOrder::Big,
encoding: "simple_packing".into(),
filter: "none".into(),
compression: "szip".into(),
params,
hash: None,
};
let msg = encode_pre_encoded(
&GlobalMetadata::default(),
&[(&desc, &pre_encoded_bytes)],
&EncodeOptions::default(),
)?;
// decode_range works because szip_block_offsets is present.
}
How it works
flowchart TD
subgraph pre["encode_pre_encoded path"]
A[Caller bytes] --> B[validate_object]
B --> C[validate_szip_block_offsets]
C --> D[Recompute hash]
end
subgraph normal["encode path"]
G[Caller bytes] --> H[Run encoding pipeline]
H --> D
end
D --> E[Wrap in CBOR framing]
E --> F[Wire message]
The pre-encoded path skips the pipeline entirely. The wire format is identical.
Byte order
When using encoding="none", the caller’s bytes are stored verbatim — the
library does NOT validate or flip byte order on encode. The bytes must be in
the byte order declared in the descriptor’s byte_order field.
For example, if byte_order="big" and encoding="none", the caller must
provide big-endian bytes.
On decode, the library automatically converts to native byte order by
default (native_byte_order=true). Callers can use from_ne_bytes() or
data_as<T>() directly without worrying about which byte order was used on
the wire. Set native_byte_order=false to get the raw wire-order bytes.
Streaming API
StreamingEncoder::write_object_pre_encoded() is the streaming counterpart of
encode_pre_encoded(). It writes a single pre-encoded object to the stream.
It can be interleaved freely with write_object() (normal encode) calls.
Rust
#![allow(unused)]
fn main() {
let mut enc = StreamingEncoder::new(output, &metadata, &options)?;
enc.write_object_pre_encoded(&descriptor, &pre_encoded_bytes)?;
enc.finish()?;
}
Python
enc = tensogram.StreamingEncoder({"version": 2})
enc.write_object_pre_encoded(descriptor_dict, raw_bytes)
msg = enc.finish()
C++
tensogram::streaming_encoder enc(path, metadata_json);
enc.write_object_pre_encoded(descriptor_json, data_ptr, data_len);
enc.finish();
Error reference
encode_pre_encoded can raise the following errors:
| Error condition | Message contains |
|---|---|
| obj_type is empty | "obj_type must not be empty" |
| ndim doesn’t match shape.len() | "ndim … does not match shape.len()" |
| strides.len() doesn’t match shape.len() | "strides.len() … does not match shape.len()" |
| encoding="none" and data size wrong | "data_len … does not match expected … bytes" |
| emit_preceders=true in buffered mode | "emit_preceders is not supported" |
| Caller set _reserved_ in metadata | "_reserved_" |
| szip_block_offsets not starting at 0 | "first offset must be 0" |
| szip_block_offsets not strictly increasing | "strictly increasing" |
| szip_block_offsets exceeds bit bound | "exceeds … bit bound" |
| szip_block_offsets with non-szip compression | "szip_block_offsets provided but compression" |
| Unknown encoding string | "encoding" |
| Unknown dtype | "unknown dtype" |
Strides convention
The library treats strides as opaque metadata — it only validates that
strides.len() == shape.len(). The convention differs between language bindings:
- Rust tests use element strides (e.g., [1] for 1D, [5, 1] for shape [4, 5])
- C++ tests use byte strides (e.g., [4] for float32, [12, 4] for shape [2, 3] float32)
Both conventions work correctly since the library does not interpret stride values.
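Either convention follows directly from the shape. A sketch with hypothetical helper names, matching the examples above:

```python
from math import prod

def element_strides(shape):
    # Row-major element strides: stride[i] = product of the trailing dims.
    return [prod(shape[i + 1:]) for i in range(len(shape))]

def byte_strides(shape, itemsize):
    # Byte strides scale element strides by the dtype width.
    return [s * itemsize for s in element_strides(shape)]

assert element_strides([4, 5]) == [5, 1]    # Rust-style, shape [4, 5]
assert byte_strides([2, 3], 4) == [12, 4]   # C++-style, float32 shape [2, 3]
```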
Cross-references
- Encoding — the normal encode() API
- Decoding — decode_range requirements for partial reads
- Compression — szip details
- CBOR metadata — wire format reference
Decoding Data
Tensogram provides four decode functions for different use cases. Choose the one that does the least work for your situation — they are all zero-copy on the metadata path.
The DecodedObject Type
Before diving in, it helps to know the common return type:
#![allow(unused)]
fn main() {
type DecodedObject = (DataObjectDescriptor, Vec<u8>);
}
A DecodedObject is a tuple of the object’s descriptor (shape, dtype, encoding info, etc.) and the decoded raw bytes. You will see this pattern throughout the decode API.
Four Decode Functions
decode — full message
#![allow(unused)]
fn main() {
pub fn decode(
message: &[u8],
options: &DecodeOptions,
) -> Result<(GlobalMetadata, Vec<(DataObjectDescriptor, Vec<u8>)>)>
}
Decodes all objects. Returns the global metadata and a vector of DecodedObject tuples — one per object, with raw bytes in the logical dtype after de-quantization.
#![allow(unused)]
fn main() {
let (meta, objects) = decode(&message, &DecodeOptions::default())?;
// Each element is (DataObjectDescriptor, Vec<u8>)
let (ref desc, ref data) = objects[0];
println!("shape: {:?}, dtype: {}, bytes: {}", desc.shape, desc.dtype, data.len());
}
decode_metadata — metadata only
#![allow(unused)]
fn main() {
pub fn decode_metadata(message: &[u8]) -> Result<GlobalMetadata>
}
Reads only the CBOR section. Does not touch any payload bytes. Use this for filtering and listing.
#![allow(unused)]
fn main() {
let meta = decode_metadata(&message)?;
println!("version: {}", meta.version);
}
decode_object — single object by index
#![allow(unused)]
fn main() {
pub fn decode_object(
message: &[u8],
index: usize,
options: &DecodeOptions,
) -> Result<(GlobalMetadata, DataObjectDescriptor, Vec<u8>)>
}
Decodes one object without reading the others. Uses the binary header’s offset table to seek directly to the right payload. O(1) seek regardless of how many objects the message contains.
Returns the global metadata, the object’s descriptor, and the decoded bytes as a three-element tuple.
#![allow(unused)]
fn main() {
// Decode only the second object (index 1)
let (meta, descriptor, payload) = decode_object(&message, 1, &DecodeOptions::default())?;
println!("shape: {:?}, dtype: {}", descriptor.shape, descriptor.dtype);
}
Edge case: If index >= num_objects, returns TensogramError::Object("index out of range").
decode_range — partial sub-tensor
#![allow(unused)]
fn main() {
pub fn decode_range(
message: &[u8],
object_index: usize,
ranges: &[(u64, u64)], // (offset, count) in flattened element order
options: &DecodeOptions,
) -> Result<(DataObjectDescriptor, Vec<Vec<u8>>)>
}
Decodes one or more contiguous slices of elements from an object. Each (offset, count) pair in ranges selects a span of elements along the flattened dimension; the function returns one byte vector per range by default. This split-result design avoids an unnecessary copy when the caller needs the ranges individually (e.g. to feed separate array slices).
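When computing flattened ranges by hand, a row-major 2-D sub-block maps to one (offset, count) span per selected row. A minimal sketch (block_to_ranges is a hypothetical helper, not library API):

```python
def block_to_ranges(shape, row_slice, col_slice):
    """Convert a row-major 2-D sub-block into the flattened
    (offset, count) pairs that decode_range expects: one contiguous
    span of elements per selected row."""
    nrows, ncols = shape
    r0, r1 = row_slice
    c0, c1 = col_slice
    return [(r * ncols + c0, c1 - c0) for r in range(r0, r1)]

# Rows 2..4, columns 10..15 of a 100 x 200 tensor: two 5-element spans.
assert block_to_ranges((100, 200), (2, 4), (10, 15)) == [(410, 5), (610, 5)]
```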
Rust — split results (default)
#![allow(unused)]
fn main() {
// Two separate ranges from object 0
let (desc, parts) = decode_range(
&message, 0,
&[(100, 50), (300, 25)],
&DecodeOptions::default(),
)?;
assert_eq!(parts.len(), 2); // one Vec<u8> per range
println!("first range bytes: {}", parts[0].len());
println!("second range bytes: {}", parts[1].len());
}
Rust — joined result
If you prefer a single contiguous buffer, flatten the results:
#![allow(unused)]
fn main() {
let joined: Vec<u8> = parts.into_iter().flatten().collect();
}
Python — split results (default, join=False)
import tensogram
parts = tensogram.decode_range(buf, object_index=0, ranges=[(100, 50), (300, 25)])
# parts is a list of numpy arrays, one per range
print(len(parts)) # 2
print(parts[0].shape) # (50,)
Python — joined result (join=True)
arr = tensogram.decode_range(buf, object_index=0, ranges=[(100, 50), (300, 25)], join=True)
# arr is a single flat numpy array with all ranges concatenated
print(arr.shape) # (75,)
N-dimensional slicing: The xarray backend maps N-dimensional slice notation (e.g. ds["temperature"].sel(lat=slice(10, 20), lon=slice(30, 40))) into the (offset, count) pairs that decode_range expects, so you rarely need to compute flattened offsets by hand when working through xarray.
Pre-encoded messages: Messages produced via encode_pre_encoded only support decode_range if the caller provided the necessary bit-precise szip_block_offsets (see Pre-encoded Payloads).
Edge case: decode_range works with all encoding+compression combinations that support random access: uncompressed data, simple_packing (bit extraction), szip (RSI block seeking), blosc2 (chunk access), and zfp fixed-rate mode. It returns an error for the shuffle filter (byte rearrangement breaks contiguous sample ranges) and for stream compressors (zstd, lz4, sz3) that don’t support partial decode.
DecodeOptions
#![allow(unused)]
fn main() {
pub struct DecodeOptions {
/// If true, verify the hash of each decoded payload.
pub verify_hash: bool,
/// When true (the default), decoded payloads are converted to the
/// caller's native byte order. Set to false to receive bytes in the
/// message's declared wire byte order.
pub native_byte_order: bool,
/// Which backend to use for szip / zstd when both FFI and pure-Rust
/// implementations are compiled in.
pub compression_backend: CompressionBackend,
}
impl Default for DecodeOptions {
fn default() -> Self {
Self {
verify_hash: false,
native_byte_order: true,
compression_backend: CompressionBackend::default(),
}
}
}
}
Native byte order (default)
By default, all decoded data is returned in the caller’s native byte order — the library handles any necessary byte-swapping automatically. You never need to check byte_order or call .byteswap():
#![allow(unused)]
fn main() {
let (_, objects) = decode(&message, &DecodeOptions::default())?;
let floats: Vec<f32> = objects[0].1
.chunks_exact(4)
.map(|c| f32::from_ne_bytes(c.try_into().unwrap()))
.collect();
}
In Python, numpy arrays are always directly usable:
_, objects = tensogram.decode(msg)
arr = objects[0][1] # numpy array — values are correct, no byteswap needed
This applies to all decode functions (decode, decode_object, decode_range), all encodings (none, simple_packing), all compression codecs, and all language bindings (Rust, Python, C, C++).
Wire byte order (opt-in)
Set native_byte_order: false to receive the raw bytes in the message’s declared wire byte order. This is useful for zero-copy forwarding or when you need the exact on-wire representation:
#![allow(unused)]
fn main() {
let opts = DecodeOptions { native_byte_order: false, ..Default::default() };
let (_, objects) = decode(&message, &opts)?;
// objects[0].1 is in the descriptor's declared byte_order (e.g. big-endian)
}
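The difference is easy to demonstrate with Python’s struct module: the same four wire bytes yield a sensible value only when read with the declared byte order. This standalone sketch is independent of the library:

```python
import struct

# Pack 273.15 as a big-endian float32, as it would sit on the wire
# for a descriptor declaring byte_order="big".
wire = struct.pack(">f", 273.15)

(be_value,) = struct.unpack(">f", wire)  # honour the declared byte order
(le_value,) = struct.unpack("<f", wire)  # a naive little-endian read

assert abs(be_value - 273.15) < 1e-3     # correct interpretation
assert abs(le_value - 273.15) > 1.0      # garbage without the swap
```

With the default native_byte_order=true the library performs this swap for you; the opt-out exists precisely so forwarding code can skip it.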
Hash verification
Hash verification is opt-in. Enable it when data integrity is critical:
#![allow(unused)]
fn main() {
let options = DecodeOptions { verify_hash: true, ..Default::default() };
let result = decode(&message, &options);
// Returns Err(TensogramError::HashMismatch { expected, actual }) if corrupted
}
Edge case: If the descriptor has no hash (i.e. the message was encoded with hash_algorithm: None), verify_hash: true silently skips verification for that object. No error is returned.
Working with the Decoded Bytes
Decoded bytes are in native byte order (with the default DecodeOptions). Cast them as native:
#![allow(unused)]
fn main() {
// float32 object → use from_ne_bytes
let floats: Vec<f32> = data
.chunks_exact(4)
.map(|c| f32::from_ne_bytes(c.try_into().unwrap()))
.collect();
}
For simple_packing decoded data, the output is always f64 bytes (8 bytes per element), regardless of the original dtype stored in the descriptor:
#![allow(unused)]
fn main() {
// simple_packing always decodes to f64, in native byte order
let values: Vec<f64> = data
.chunks_exact(8)
.map(|c| f64::from_ne_bytes(c.try_into().unwrap()))
.collect();
}
Scanning for Messages First
If you’re working with a buffer that might contain multiple messages (e.g. a .tgm file loaded into memory), scan it first to get message boundaries:
#![allow(unused)]
fn main() {
let offsets = scan(&big_buffer); // Vec<(usize, usize)> = (start, length)
for (start, len) in offsets {
let msg = &big_buffer[start..start + len];
let meta = decode_metadata(msg)?;
println!("version: {}", meta.version);
}
}
The scan function is tolerant of corruption — it skips invalid regions and continues looking for the next valid TENSOGRM marker.
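The corruption-tolerant behaviour boils down to resynchronising on the marker. The sketch below only locates marker positions; the real scan() additionally validates framing and returns (start, length) pairs:

```python
def scan_markers(buffer: bytes, marker: bytes = b"TENSOGRM"):
    """Illustrative resync loop: collect every candidate message start
    by searching for the preamble marker, skipping any junk between
    occurrences instead of aborting on the first bad region."""
    starts, pos = [], buffer.find(marker)
    while pos != -1:
        starts.append(pos)
        pos = buffer.find(marker, pos + 1)
    return starts

buf = b"garbage" + b"TENSOGRM....msg1" + b"\x00junk\x00" + b"TENSOGRM..msg2"
assert scan_markers(buf) == [7, 29]
```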
NaN / Inf handling
By default the Tensogram encoder rejects any NaN or ±Inf in
float / complex payloads. The encode call fails with
TensogramError::Encoding (C FFI: TgmError::Encoding; Python:
EncodingError; TypeScript: EncodingError; C++: tensogram::encoding_error)
and names the element index, dtype, and a hint that points at the
opt-in flags described below.
This chapter walks through the three policies available on encode:
- Reject (default) — any non-finite input fails the call. Use this when your pipeline guarantees finite values and any NaN / Inf is a bug you want to surface loudly.
- Allow NaN — NaN values are substituted with 0.0 on the wire and their positions are recorded in a compressed bitmask stored alongside the payload. Decode restores canonical NaN at those positions by default.
- Allow ±Inf — same as allow_nan but for +∞ and −∞ together (the flag covers both signs; two per-sign bitmasks are written when both kinds appear in the payload).
The mask companion is formally called the NTensorFrame —
wire-format type 9, defined in
plans/BITMASK_FRAME.md
and the wire-format reference.
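Conceptually, the substitute-and-mask round trip looks like the sketch below, where plain per-element bool lists stand in for the compressed Roaring bitmask actually written to the frame:

```python
import math

def mask_and_substitute(values):
    # Encode side: record NaN positions, write 0.0 on the wire.
    mask = [math.isnan(v) for v in values]
    wire = [0.0 if m else v for m, v in zip(mask, values)]
    return wire, mask

def restore(wire, mask):
    # Decode side: put canonical NaN back at every masked position.
    return [math.nan if m else v for m, v in zip(mask, wire)]

wire, mask = mask_and_substitute([1.0, math.nan, 3.0])
assert wire == [1.0, 0.0, 3.0]            # zero-substituted payload
out = restore(wire, mask)
assert math.isnan(out[1]) and out[0] == 1.0 and out[2] == 3.0
```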
When to use which policy
| Situation | Flag to set |
|---|---|
| Finite data only, want hard failure on contamination | default (both off) |
| NetCDF _FillValue → NaN, Zarr missing data, sensor gaps | allow_nan=true |
| Propagating numerical overflow as ±Inf | allow_inf=true |
| Mixed missing-value / overflow data | both true |
Don’t pre-process to a sentinel value when allow_nan /
allow_inf does the job — the bitmask is designed to compress
aggressively (hybrid Roaring containers by default) and keeps the
missing-data semantics visible to the decoder. Sentinel values
throw that information away.
Cross-language opt-in
Rust
#![allow(unused)]
fn main() {
use tensogram::{encode, EncodeOptions, GlobalMetadata, DataObjectDescriptor};
let options = EncodeOptions {
allow_nan: true,
allow_inf: true,
..Default::default()
};
let msg = encode(&meta, &[(&desc, payload_bytes)], &options)?;
}
Python
import numpy as np
import tensogram
data = np.array([1.0, np.nan, 3.0], dtype=np.float64)
msg = tensogram.encode(
{"version": 2},
[(desc, data)],
allow_nan=True,
)
decoded = tensogram.decode(msg)
# decoded.objects[0].data() → [1.0, nan, 3.0]
TypeScript
import { encode, decode } from '@ecmwf/tensogram';
const msg = encode(
{ version: 2 },
[{ descriptor, data: new Float64Array([1, NaN, 3]) }],
{ allowNan: true },
);
const decoded = decode(msg);
C++
tensogram::encode_options opts;
opts.allow_nan = true;
auto msg = tensogram::encode(metadata_json, objects, opts);
CLI
$ tensogram --allow-nan reshuffle -o out.tgm input.tgm
$ TENSOGRAM_ALLOW_NAN=1 tensogram convert-netcdf data.nc -o data.tgm
Decode-side reconstruction
By default every decode path restores the canonical quiet-NaN / ±Inf
bit pattern at every masked position. Opt out (e.g. to inspect
the on-disk zero-substituted representation) by passing
restore_non_finite=false:
# Get the 0.0-substituted payload without the NaN bits.
raw = tensogram.decode(msg, restore_non_finite=False)
# raw.objects[0].data() → [1.0, 0.0, 3.0]
The advanced decode_with_masks API (Rust + Python) returns both
the zero-substituted payload AND the raw decompressed
per-kind Vec<bool> masks, so callers can build custom
missing-value representations without materialising canonical NaN
bytes.
Lossy reconstruction — read this carefully
The masked encode path does not preserve the original NaN payload bits. On decode every masked NaN is restored with the canonical quiet-NaN pattern:
- `f32::NAN` bits = `0x7FC00000`
- `f64::NAN` bits = `0x7FF8000000000000`
- Float16 / bfloat16 use their dtype-native quiet-NaN patterns
- Complex64 / complex128 restore the canonical pattern to both real and imag components
Signalling NaNs, custom payload bits, and mixed real / imag
kinds for complex dtypes are therefore flattened to the canonical
form through a mask round-trip. If you need bit-exact NaN
preservation, pre-encode your payload and use
encode_pre_encoded to bypass the substitute-and-mask stage
entirely. See plans/BITMASK_FRAME.md §7.1
for the full design rationale.
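You can observe the canonical bit patterns directly with the standard `struct` module (on CPython with mainstream IEEE-754 hardware, `float("nan")` is the canonical quiet NaN):

```python
import struct

def f32_bits(x: float) -> int:
    # Reinterpret a value as its 32-bit IEEE-754 bit pattern.
    return struct.unpack("<I", struct.pack("<f", x))[0]

def f64_bits(x: float) -> int:
    # Reinterpret a value as its 64-bit IEEE-754 bit pattern.
    return struct.unpack("<Q", struct.pack("<d", x))[0]

# The canonical quiet-NaN patterns named above:
assert f32_bits(float("nan")) == 0x7FC00000
assert f64_bits(float("nan")) == 0x7FF8000000000000
```

A signalling NaN or a NaN with custom payload bits has a different bit pattern on input, but comes back as one of the two values above after a mask round-trip — which is exactly the lossiness the warning describes.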
Mask compression methods
Six methods are available per-kind:
| Method | Best for | Feature |
|---|---|---|
| `roaring` (default) | any mask shape | pure Rust, works on WASM |
| `rle` | highly clustered masks (land / sea, swath gaps) | pure Rust |
| `blosc2` | dense dtype-aligned masks | `blosc2` feature |
| `zstd` | generic good-ratio | `zstd` feature |
| `lz4` | decode-speed priority | `lz4` feature |
| `none` | tiny masks (auto-fallback) | always available |
Small masks (uncompressed bit-packed byte count ≤ 128 by default)
automatically fall back to none regardless of the requested
method — compressing a few bytes costs more than it saves. Set
small_mask_threshold_bytes = 0 to disable the auto-fallback.
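To see why `rle` wins on clustered masks, consider a toy run-length encoder (a sketch of the idea, not the library's wire encoding):

```python
def rle_encode(mask):
    # Collapse a boolean mask into (value, run_length) pairs.
    runs = []
    for bit in mask:
        if runs and runs[-1][0] == bit:
            runs[-1][1] += 1
        else:
            runs.append([bit, 1])
    return [tuple(r) for r in runs]

def rle_decode(runs):
    out = []
    for bit, n in runs:
        out.extend([bit] * n)
    return out

# A land/sea-style mask with long runs collapses to three pairs:
mask = [True] * 1000 + [False] * 24 + [True] * 6
```

A 1030-element mask with three runs encodes to three pairs; a mask that alternates every element would instead expand, which is why the method is selectable per-kind.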
Set per-kind methods via the matching options:
msg = tensogram.encode(
meta, [(desc, data)],
allow_nan=True, allow_inf=True,
nan_mask_method='rle',
pos_inf_mask_method='roaring',
neg_inf_mask_method='roaring',
small_mask_threshold_bytes=0,
)
Validation
tensogram validate --full cross-checks every NaN / ±Inf in the
decoded payload against the frame’s mask companion: masked
positions are expected and pass; any NaN / Inf at a non-masked
position is reported as NanDetected / InfDetected
(see the validator reference).
Files without a mask companion keep the pre-0.17 semantics — any non-finite value in the decoded output is an error.
Migration from pre-0.17
Prior to 0.17 the reject_nan / reject_inf opt-in flags upgraded
the NaN check to be pipeline-independent. These flags are
removed in 0.17 (breaking change). Rejection is now always on by
default; opt in to masked substitution with the replacement flags:
| Pre-0.17 | 0.17+ |
|---|---|
| `reject_nan=False` (default, pass-through) | `allow_nan=True` (substitute + mask) |
| `reject_nan=True` (opt-in reject) | default (always reject) |
| `reject_inf=False` / `True` | same split, `allow_inf` |
See CHANGELOG.md for the full breaking-change list and upgrade notes.
Working with Files
The TensogramFile struct provides a high-level API for reading and writing .tgm files. It handles lazy scanning, buffered appending, and random access by message index.
Creating a File
#![allow(unused)]
fn main() {
use tensogram::{TensogramFile, EncodeOptions};
let mut file = TensogramFile::create("forecast.tgm")?;
}
This creates (or truncates) the file. No data is written yet.
Appending Messages
#![allow(unused)]
fn main() {
use std::collections::BTreeMap;
use tensogram::{
GlobalMetadata, DataObjectDescriptor, ByteOrder, Dtype, EncodeOptions,
};
let global = GlobalMetadata { version: 2, ..Default::default() };
let desc = DataObjectDescriptor {
obj_type: "ntensor".to_string(),
ndim: 2,
shape: vec![100, 200],
strides: vec![200, 1],
dtype: Dtype::Float32,
byte_order: ByteOrder::Big,
encoding: "none".to_string(),
filter: "none".to_string(),
compression: "none".to_string(),
params: BTreeMap::new(),
hash: None,
};
file.append(&global, &[(&desc, &data)], &EncodeOptions::default())?;
}
Each append encodes one message and appends it to the end of the file. You can call it as many times as you like — each message is independent and self-describing.
Typical pattern for writing a multi-message file (one message per parameter, run, subject, sample, experiment — whatever your pipeline produces):
#![allow(unused)]
fn main() {
let mut file = TensogramFile::create("output.tgm")?;
for key in ["2t", "10u", "10v", "msl"] {
let (global, desc, data) = produce_field(key);
file.append(&global, &[(&desc, &data)], &EncodeOptions::default())?;
}
}
Opening and Counting Messages
#![allow(unused)]
fn main() {
let mut file = TensogramFile::open("forecast.tgm")?;
// Streaming scan happens here (lazily, on first access)
let count = file.message_count()?;
println!("{} messages in file", count);
}
The first access triggers a streaming scan that reads preamble-sized chunks and seeks forward, so it never loads the entire file into memory. After that, every read_message call is a seek + read — no further scanning.
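The scan loop can be sketched in pure Python. This is illustrative only: it assumes, for simplicity, a preamble that begins with the 8-byte `TENSOGRM` magic followed by a little-endian u64 `total_length` — the real 24-byte preamble layout is defined by the wire-format reference.

```python
import io
import struct

MAGIC = b"TENSOGRM"

def scan_offsets(stream):
    # Streaming scan sketch: read each preamble, record (offset, length),
    # then seek past the payload. Only preamble-sized reads are issued,
    # so the file is never loaded into memory.
    offsets = []
    pos = 0
    while True:
        stream.seek(pos)
        head = stream.read(16)
        if len(head) < 16 or head[:8] != MAGIC:
            break
        total_length = struct.unpack("<Q", head[8:16])[0]
        offsets.append((pos, total_length))
        pos += total_length
    return offsets

assert scan_offsets(io.BytesIO(b"")) == []  # empty file: no messages
```

After this scan, `read_message(i)` only needs the stored `(offset, length)` pair: one seek plus one read.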
Reading Messages
#![allow(unused)]
fn main() {
use tensogram::{decode, DecodeOptions};
// Read raw bytes of message 3
let raw_bytes = file.read_message(3)?;
// Decode message 3
let (meta, objects) = decode(&raw_bytes, &DecodeOptions::default())?;
// Each element is (DataObjectDescriptor, Vec<u8>)
let (ref desc, ref data) = objects[0];
println!("shape: {:?}, dtype: {}", desc.shape, desc.dtype);
}
Both are O(1) after the initial scan: they seek to the stored offset and read length bytes.
Iterating Over All Messages
#![allow(unused)]
fn main() {
let mut file = TensogramFile::open("forecast.tgm")?;
for raw in file.iter()? {
let raw = raw?;
let meta = tensogram::decode_metadata(&raw)?;
println!("version: {}", meta.version);
}
}
Memory note: For files with many large messages, prefer iterating by index with `read_message(i)` inside a loop to process one at a time.
Random Access by Index
One of Tensogram’s design goals is O(1) object access. After scanning, any message is reachable in constant time. Within a message, any object is reachable in constant time via the binary header’s offset table:
flowchart TD
A["file.read_message(42)"]
B["Message bytes"]
C["Binary header"]
D["Seek to payload 2"]
E["Decode only object 2"]
A -- "seek + read" --> B
B --> C
C -- "lookup offset for object 2" --> D
D --> E
style A fill:#388e3c,stroke:#2e7d32,color:#fff
style E fill:#1565c0,stroke:#0d47a1,color:#fff
File Layout Diagram
forecast.tgm
├── [message 0] — TENSOGRM ... 39277777
├── [message 1] — TENSOGRM ... 39277777
├── [message 2] — TENSOGRM ... 39277777
│   ├── Preamble (24B)
│   ├── Header Metadata Frame (CBOR GlobalMetadata)
│   ├── Header Index Frame (CBOR offsets)
│   ├── Data Object Frame 0 (payload + CBOR descriptor)
│   ├── Data Object Frame 1 (payload + CBOR descriptor)
│   └── Postamble (16B)
└── ...
No file-level header, no file-level index. All indexing is per-message, built in-memory at scan time.
Remote Access (optional)
Enable the remote feature to open .tgm files on S3, GCS, Azure, or HTTP with selective range-based reads:
[dependencies]
tensogram = { path = "...", features = ["remote"] }
#![allow(unused)]
fn main() {
use tensogram::{TensogramFile, DecodeOptions};
let mut file = TensogramFile::open_source("s3://bucket/forecast.tgm")?;
// Fetch only the second object from message 0 — no full download
let (meta, desc, data) = file.decode_object(0, 1, &DecodeOptions::default())?;
}
Supports header-indexed and footer-indexed files (read-only) from Rust, Python, xarray, and zarr. See the Remote Access guide for storage options, request budgets, and limitations.
Memory-Mapped I/O (optional)
Enable the mmap feature to use memory-mapped file access:
[dependencies]
tensogram = { path = "...", features = ["mmap"] }
#![allow(unused)]
fn main() {
let mut file = TensogramFile::open_mmap("forecast.tgm")?;
// Scan happens during open_mmap — no lazy scan needed
let count = file.message_count()?;
// Reads from the memory-mapped region (no additional seek)
let raw = file.read_message(0)?;
}
This is useful for large files where you want to avoid per-message seek + read overhead. The file is mapped read-only. All existing decode functions work unchanged.
Async I/O (optional)
Enable the async feature for tokio-based non-blocking file operations:
[dependencies]
tensogram = { path = "...", features = ["async"] }
#![allow(unused)]
fn main() {
let mut file = TensogramFile::open_async("forecast.tgm").await?;
// Read a message without blocking the async runtime
let raw = file.read_message_async(0).await?;
// Decode also runs on a blocking thread (safe for FFI codecs)
let (meta, objects) = file.decode_message_async(0, &opts).await?;
}
All CPU-intensive work (scanning, decoding, FFI calls to compression libraries) runs via tokio::task::spawn_blocking, so it won’t block the async runtime.
Edge Cases
Appending to an Existing File
TensogramFile::create truncates. To append to an existing file, use standard file I/O:
#![allow(unused)]
fn main() {
use std::io::Write;
let mut f = std::fs::OpenOptions::new().append(true).open("forecast.tgm")?;
let global = GlobalMetadata { version: 2, ..Default::default() };
let message = encode(&global, &[(&desc, &data)], &EncodeOptions::default())?;
f.write_all(&message)?;
}
Or open the file with TensogramFile::open and use append() — the append method always writes at the end regardless of how the file was opened.
Corrupted Messages
The scanner skips corrupted messages and continues. A message is considered corrupted if:
- The `total_length` field points to a location where `39277777` is not present
- The header is truncated

The scanner recovers by advancing one byte and searching for the next `TENSOGRM`.
Empty Files
message_count() returns 0 for an empty file. read_message(0) returns an error.
Remote Access
Enable the remote feature to open .tgm files on HTTP, S3, GCS, or Azure without downloading the whole file. Individual objects are fetched via targeted range requests.
[dependencies]
tensogram = { path = "...", features = ["remote"] }
Opening a Remote File
#![allow(unused)]
fn main() {
use tensogram::TensogramFile;
// Auto-detect: local path or remote URL
let mut file = TensogramFile::open_source("https://example.com/data.tgm")?;
// S3
let mut file = TensogramFile::open_source("s3://bucket/data.tgm")?;
}
open_source inspects the URL scheme and routes to the remote backend for s3://, s3a://, gs://, az://, azure://, http://, https://. Everything else is treated as a local path.
The Rust open() method is unchanged and always opens a local file. In Python, TensogramFile.open() auto-detects remote URLs.
You can also check whether a string is a remote URL without opening:
#![allow(unused)]
fn main() {
use tensogram::is_remote_url;
assert!(is_remote_url("s3://bucket/file.tgm"));
assert!(!is_remote_url("/local/path/file.tgm"));
}
Storage Options (Credentials, Region, etc.)
Pass an explicit options map for fine-grained control:
#![allow(unused)]
fn main() {
use std::collections::BTreeMap;
use tensogram::TensogramFile;
let mut opts = BTreeMap::new();
opts.insert("aws_access_key_id".to_string(), "AKIA...".to_string());
opts.insert("aws_secret_access_key".to_string(), "...".to_string());
opts.insert("region".to_string(), "eu-west-1".to_string());
let mut file = TensogramFile::open_remote("s3://bucket/data.tgm", &opts)?;
}
When no options are passed, credentials are read from the environment (e.g. AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_DEFAULT_REGION, GOOGLE_APPLICATION_CREDENTIALS).
Python Usage
import tensogram
# Auto-detect remote URL
with tensogram.TensogramFile.open("s3://bucket/data.tgm") as f:
meta = f.file_decode_metadata(0)
result = f.file_decode_object(0, 0)
data = result["data"] # numpy array
# With explicit storage options
with tensogram.TensogramFile.open_remote(
"s3://bucket/data.tgm",
{"region": "eu-west-1"}
) as f:
print(f.source()) # "s3://bucket/data.tgm"
print(f.is_remote()) # True
xarray Usage
import xarray as xr
ds = xr.open_dataset(
"s3://bucket/data.tgm",
engine="tensogram",
storage_options={"region": "eu-west-1"},
)
Supported Schemes
| Scheme | Backend | Notes |
|---|---|---|
| `http://`, `https://` | HTTP | `allow_http` is set automatically for `http://` |
| `s3://`, `s3a://` | Amazon S3 | Env-based or explicit credentials |
| `gs://` | Google Cloud Storage | Service account or env |
| `az://`, `azure://` | Azure Blob Storage | MSI or env |
All backends are provided by the object_store crate.
Object-Level Access
Three methods provide selective access without downloading full messages:
#![allow(unused)]
fn main() {
use tensogram::DecodeOptions;
// Metadata only — triggers layout discovery on first call, then cached
let meta = file.decode_metadata(0)?;
// Descriptors — reads only the descriptor data needed for each object
let (meta, descriptors) = file.decode_descriptors(0)?;
// Single object by index — fetches only the target object frame
let (meta, desc, data) = file.decode_object(0, 2, &DecodeOptions::default())?;
}
These methods also work on local files, where they read the full message and decode the requested parts.
Request Budget
Header-indexed files (buffered writes)
| Phase | Operation | HTTP Requests |
|---|---|---|
| Open | open_source / open_remote | 1 HEAD + 1 GET (first preamble only, 24 B) |
| Next message | first data access to message i | 1 GET (preamble + layout combined) |
| Cached | decode_metadata(i) again | 0 (served from cache) |
| Object read | decode_object(i, j) | 1 GET per object (if layout already cached) |
| Descriptors | decode_descriptors(i) | 1–3 GETs per object (descriptor-only reads for large frames) |
| Message count | message_count() | 1 GET per undiscovered message (24 B each, preamble only) |
Footer-indexed files (buffered with known total_length)
| Phase | Operation | HTTP Requests |
|---|---|---|
| Open | open_source / open_remote | 1 HEAD + 1 GET (first preamble only, 24 B) |
| Next message | first data access to message i | 1 GET (preamble) + 1 GET (suffix) |
| Cached | decode_metadata(i) again | 0 (served from cache) |
| Object read | decode_object(i, j) | 1 GET per object (if layout already cached) |
| Descriptors | decode_descriptors(i) | 1–3 GETs per object |
| Message count | message_count() | 1 GET per undiscovered message (24 B each) |
Streaming files (total_length=0)
| Phase | Operation | HTTP Requests |
|---|---|---|
| Open | open_source / open_remote | 1 HEAD + 1 GET (preamble) + 1 GET (END_MAGIC check) |
| First access | decode_metadata(0) | 2 GETs (postamble + footer region) |
| Object read | decode_object(0, j) | 1 GET per object |
| Message count | message_count() | 0 (streaming is always the last message) |
Layout discovery is combined with message scanning for both header-indexed and footer-indexed messages — the library reads the preamble and layout in one GET (header-indexed) or two GETs (footer-indexed suffix read). message_count() uses a lean scan path (24 bytes per preamble). Streaming messages (total_length=0) must be the last message in a multi-message file.
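Every GET in the budget tables above is a byte-range request. A tiny helper (hypothetical, not part of the library) shows how a byte span maps onto the HTTP `Range` header, whose end offset is inclusive per RFC 9110:

```python
def range_header(offset: int, length: int) -> dict:
    # HTTP byte ranges are inclusive at both ends, so a 24-byte
    # preamble read at offset 0 requests bytes 0 through 23.
    return {"Range": f"bytes={offset}-{offset + length - 1}"}

# Preamble read for the first message:
hdr = range_header(0, 24)
```

For example, the "header chunk, up to 256KB" read at offset 24 becomes `bytes=24-262167`.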
How It Works (Header-Indexed Example)
sequenceDiagram
participant App
participant TensogramFile
participant ObjectStore
App->>TensogramFile: open_source("s3://bucket/file.tgm")
TensogramFile->>ObjectStore: HEAD (get file size)
TensogramFile->>ObjectStore: GET range 0..24 (preamble)
Note right of TensogramFile: Discover message offsets
App->>TensogramFile: decode_object(0, 2)
TensogramFile->>ObjectStore: GET range 24..N (header chunk, up to 256KB)
Note right of TensogramFile: First access: parse metadata + index, cache layout
TensogramFile->>ObjectStore: GET range offset..offset+len (object frame 2)
TensogramFile-->>App: (metadata, descriptor, decoded_bytes)
Checking if a File is Remote
#![allow(unused)]
fn main() {
use tensogram::TensogramFile;
let file = TensogramFile::open_source("s3://bucket/data.tgm")?;
assert!(file.is_remote());
println!("source: {}", file.source()); // "s3://bucket/data.tgm"
}
source() returns the original URL for remote files and the file path for local files.
Error Handling
Remote access can return different TensogramError variants depending on the failure:
| Error condition | Error type | When it happens |
|---|---|---|
| Invalid URL | Remote | open_source / open_remote with a malformed URL |
| Connection failure | Remote | Network unreachable, DNS failure, timeout |
| File not found | Remote | HTTP 404, S3 NoSuchKey |
| No valid messages | Remote | File contains no parseable messages |
| Unsupported layout | Remote | Message lacks both header-index and footer-index flags |
| Object index out of range | Object | decode_object(i, j) where j >= object_count |
All errors are returned as Result. The library avoids panics.
Shared Runtime
Remote I/O uses a process-wide shared tokio runtime (multi-thread, 2 workers) created on first use. All RemoteBackend instances share the same runtime, so TCP connection pools and DNS caches are reused across calls.
The sync bridge adapts to the calling context:
- Not in a tokio runtime (Python, CLI): the shared runtime’s handle drives the future directly — no extra thread creation.
- Inside a multi-thread tokio runtime (`#[tokio::test]`, server handler): `block_in_place` tells tokio to spawn a replacement worker so the blocked thread doesn’t cause runtime starvation.
- Inside a current-thread tokio runtime: falls back to a scoped thread, since `block_in_place` is not supported on single-threaded runtimes.
Async API
The async feature enables async methods for decode, read, and metadata extraction. These work for both local and remote files:
#![allow(unused)]
fn main() {
use tensogram::{TensogramFile, DecodeOptions};
// Async decode methods (feature = "async")
let meta = file.decode_metadata_async(0).await?;
let (meta, descs) = file.decode_descriptors_async(0).await?;
let (meta, desc, data) = file.decode_object_async(0, 0, &DecodeOptions::default()).await?;
let msg = file.read_message_async(0).await?;
}
When both remote and async features are enabled, async open methods are also available:
#![allow(unused)]
fn main() {
// Async open (auto-detects local vs remote) — requires remote + async
let mut file = TensogramFile::open_source_async("s3://bucket/data.tgm").await?;
// Async open with explicit storage options
let mut file = TensogramFile::open_remote_async(
"s3://bucket/data.tgm",
&opts,
).await?;
}
For remote backends, async methods directly await object store operations, bypassing the sync bridge entirely. For local backends, they use spawn_blocking for file I/O.
[dependencies]
tensogram = { path = "...", features = ["remote", "async"] }
Range Reads
TensogramFile::decode_range() supports partial object decoding for both local and remote files. It takes an object index and a list of (offset, count) element ranges, returning only the requested elements without decoding the entire object.
For remote files, it fetches the full object frame (via indexed access) then runs the range decode pipeline on the raw payload. This is most beneficial with szip-compressed objects that have szip_block_offsets, where only the compressed blocks covering the requested range are decompressed.
#![allow(unused)]
fn main() {
// Rust: decode elements 100..200 from object 0
let ranges = vec![(100, 100)];
let (desc, parts) = file.decode_range(0, 0, &ranges, &DecodeOptions::default())?;
}
# Python: decode elements 100..200 from object 0
arr = file.file_decode_range(0, 0, [(100, 100)], join=True)
The xarray backend uses file_decode_range automatically when slicing remote arrays that support partial decode (uncompressed or szip-compressed objects without shuffle filters).
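For the uncompressed case, mapping `(offset, count)` element ranges onto byte spans of the payload is plain arithmetic. A sketch (hypothetical helper, not the library API):

```python
def ranges_to_byte_spans(ranges, itemsize):
    # Map (element_offset, element_count) ranges onto (byte_offset,
    # byte_length) spans of a contiguous uncompressed payload — the
    # trivial case of partial decode.
    return [(off * itemsize, cnt * itemsize) for off, cnt in ranges]

# Elements 100..200 of a float32 object live at bytes 400..800:
spans = ranges_to_byte_spans([(100, 100)], itemsize=4)
```

With szip block offsets the same idea applies at block granularity: only the compressed blocks whose decompressed extent overlaps the requested spans need to be fetched and decoded.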
Descriptor-Only Reads
decode_descriptors() fetches only the CBOR descriptor from each data object frame, not the full payload. For large objects (hundreds of MB), this avoids downloading the entire frame just to extract a few hundred bytes of metadata.
For frames smaller than 64 KB, the full frame is read in a single request (fewer round-trips). For larger frames, the library reads only the frame header (16 bytes), footer (12 bytes), and the CBOR descriptor region.
Limitations
- Streaming messages must be last. In multi-message files, streaming-encoded messages (`total_length=0`) must be the last message. The remote scanner assumes the streaming message extends to the end of the file.
- Optimistic scan for buffered messages. Remote message scanning validates preamble magic and `total_length` plausibility but does not verify end-of-message markers for buffered messages. Streaming messages (`total_length=0`) do validate the END_MAGIC at EOF.
- Read-only. Remote writes are not supported.
- Header probe size. Layout discovery reads a single chunk of up to 256 KB from the header region. If the metadata or index frame does not fit in this chunk, `decode_metadata()` will error (it does not retry with a larger read).
- HTTP server requirements. The remote HTTP server must support `HEAD` requests (for file size) and `Range` request headers (for partial reads).
- `read_message()` and `decode_message()` download the full message even for remote files. Use `decode_metadata()`, `decode_descriptors()`, or `decode_object()` for selective access.
- Zarr remote reads are lazy per-chunk. The zarr store fetches only metadata at open time; individual chunks are decoded on first access. Local files still use eager decode for lower latency.
- Sequential async access. Async methods take `&mut self`, so a single file handle cannot serve concurrent async reads. Open separate handles for parallelism.
Iterators
Tensogram provides lazy iterator APIs for traversing messages and objects without loading everything into memory at once.
Hierarchy
graph TD
F[File / Buffer] -->|messages| M1[Message 1]
F -->|messages| M2[Message 2]
F -->|messages| M3[Message N]
M1 -->|objects| O1["(DataObjectDescriptor, Vec<u8>)"]
M1 -->|objects| O2["(DataObjectDescriptor, Vec<u8>)"]
O1 -->|access| D1["descriptor + data"]
O2 -->|access| D2["descriptor + data"]
Rust API
Buffer message iterator
Iterate over messages in a &[u8] byte buffer. Zero-copy: yields slices pointing into the original buffer.
#![allow(unused)]
fn main() {
use tensogram::{messages, decode, DecodeOptions};
let buf: Vec<u8> = std::fs::read("multi.tgm")?;
for msg_bytes in messages(&buf) {
let (meta, objects) = decode(msg_bytes, &DecodeOptions::default())?;
println!("version={} objects={}", meta.version, objects.len());
}
}
The iterator calls scan() once on construction, then yields &[u8] slices in sequence. Garbage between valid messages is silently skipped.
MessageIter implements ExactSizeIterator, so .len() returns the remaining count at any point.
Object iterator
Iterate over the decoded objects (tensors) inside a single message. Each item is a (DataObjectDescriptor, Vec<u8>) tuple:
#![allow(unused)]
fn main() {
use tensogram::{objects, DecodeOptions};
for result in objects(&msg_bytes, DecodeOptions::default())? {
let (descriptor, data) = result?;
println!("shape={:?} dtype={} encoding={} bytes={}",
descriptor.shape, descriptor.dtype, descriptor.encoding, data.len());
}
}
Each object is decoded through the full pipeline on demand — objects you don’t consume are never decoded.
For metadata-only access (no payload decode), use objects_metadata. This returns DataObjectDescriptors without decoding any payloads:
#![allow(unused)]
fn main() {
use tensogram::objects_metadata;
for desc in objects_metadata(&msg_bytes)? {
println!("shape={:?} dtype={} byte_order={}", desc.shape, desc.dtype, desc.byte_order);
}
}
File iterator
Iterate over messages stored in a .tgm file with seek-based lazy I/O:
#![allow(unused)]
fn main() {
use tensogram::{TensogramFile, objects, DecodeOptions};
let mut file = TensogramFile::open("forecast.tgm")?;
for raw in file.iter()? {
let raw = raw?;
// Nested: iterate objects within this message
for result in objects(&raw, DecodeOptions::default())? {
let (desc, data) = result?;
println!("{:?} {} {} bytes", desc.shape, desc.dtype, data.len());
}
}
}
file.iter() scans the file once (if not already scanned), then returns a FileMessageIter that reads each message via seek + read. The iterator does not borrow the TensogramFile — it owns an open file handle and a copy of the message offsets.
C / C++ API
The C FFI uses an opaque-handle + next() pattern. Each iterator returns TGM_OK while items remain, and TGM_END_OF_ITER as an end sentinel.
Buffer iterator
tgm_buffer_iter_t *iter;
tgm_buffer_iter_create(buf, buf_len, &iter);
const uint8_t *msg_ptr;
size_t msg_len;
while (tgm_buffer_iter_next(iter, &msg_ptr, &msg_len) == TGM_OK) {
// msg_ptr borrows from the original buffer
tgm_message_t *msg;
tgm_decode(msg_ptr, msg_len, 0, &msg);
// ... use msg ...
tgm_message_free(msg);
}
tgm_buffer_iter_free(iter);
Lifetime: the buffer must remain valid until `tgm_buffer_iter_free`.
File iterator
tgm_file_t *file;
tgm_file_open("data.tgm", &file);
tgm_file_iter_t *iter;
tgm_file_iter_create(file, &iter);
tgm_bytes_t raw;
while (tgm_file_iter_next(iter, &raw) == TGM_OK) {
// raw.data is owned — free with tgm_bytes_free
tgm_message_t *msg;
tgm_decode(raw.data, raw.len, 0, &msg);
// ... use msg ...
tgm_message_free(msg);
tgm_bytes_free(raw);
}
tgm_file_iter_free(iter);
tgm_file_close(file);
Object iterator
tgm_object_iter_t *iter;
tgm_object_iter_create(msg_ptr, msg_len, 0, &iter);
tgm_message_t *obj;
while (tgm_object_iter_next(iter, &obj) == TGM_OK) {
uint64_t ndim = tgm_object_ndim(obj, 0);
const uint64_t *shape = tgm_object_shape(obj, 0);
// ... use shape, data ...
tgm_message_free(obj);
}
tgm_object_iter_free(iter);
C++ API
The C++ wrapper (include/tensogram.hpp) provides RAII iterator classes that manage the underlying C handles automatically.
Buffer iterator
#include <tensogram.hpp>
auto buf = /* read file into std::vector<uint8_t> */;
tensogram::buffer_iterator iter(buf.data(), buf.size());
const std::uint8_t* msg_ptr;
std::size_t msg_len;
while (iter.next(msg_ptr, msg_len)) {
auto msg = tensogram::decode(msg_ptr, msg_len);
std::printf("version=%llu objects=%zu\n", msg.version(), msg.num_objects());
}
File iterator
auto f = tensogram::file::open("forecast.tgm");
tensogram::file_iterator iter(f);
std::vector<std::uint8_t> raw;
while (iter.next(raw)) {
auto msg = tensogram::decode(raw.data(), raw.size());
std::printf("objects=%zu\n", msg.num_objects());
}
Object iterator
tensogram::object_iterator iter(msg_ptr, msg_len);
tensogram::message obj = tensogram::decode(msg_ptr, msg_len); // placeholder for next()
while (iter.next(obj)) {
auto o = obj.object(0);
auto shape = o.shape();
std::printf("dtype=%s shape=[%llu, %llu]\n",
o.dtype_string().c_str(), shape[0], shape[1]);
}
Range-based for on message
auto msg = tensogram::decode(buf, len);
for (const auto& obj : msg) {
std::printf("dtype=%s bytes=%zu\n",
obj.dtype_string().c_str(), obj.data_size());
}
Python API
TensogramFile supports iteration, indexing, and slicing:
import tensogram
# Iterate all messages
with tensogram.TensogramFile.open("forecast.tgm") as f:
for meta, objects in f:
for desc, arr in objects:
print(f" shape={arr.shape} dtype={desc.dtype}")
# Index and slice
with tensogram.TensogramFile.open("forecast.tgm") as f:
meta, objects = f[0] # first message
meta, objects = f[-1] # last message
subset = f[10:20] # range of messages
every_5th = f[::5] # strided access
# Buffer iteration
buf = open("data.tgm", "rb").read()
for meta, objects in tensogram.iter_messages(buf):
desc, arr = objects[0]
print(f" shape={arr.shape}")
decode(), decode_message(), file iteration, and iter_messages() return Message namedtuples with .metadata and .objects fields.
Tuple unpacking (meta, objects = msg) also works. TensogramFile supports len(f) and context manager (with).
Thread safety: iterators own independent file handles and buffer copies — no shared mutable state. Safe under free-threaded Python (PEP 703, no GIL).
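The dual attribute/tuple behaviour described above is exactly what a `namedtuple` provides. An illustrative re-creation of the return shape (not the binding's actual class):

```python
from collections import namedtuple

# A namedtuple supports both attribute access (.metadata, .objects)
# and positional tuple unpacking from the same object.
Message = namedtuple("Message", ["metadata", "objects"])

msg = Message(metadata={"version": 2}, objects=[("desc", b"\x00\x01")])
meta, objects = msg  # same pattern as `meta, objects = tensogram.decode(buf)`
```

Because a namedtuple is immutable and a plain tuple subclass, it is also safe to share across threads and cheap to destructure.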
Edge cases
| Scenario | Behavior |
|---|---|
| Empty buffer / file | Iterator yields zero items |
| Garbage between messages | Silently skipped by scanner |
| Truncated message at end | Skipped (not yielded) |
| Zero-object message | objects() returns empty iterator |
| I/O error during file iteration | FileMessageIter::next() yields Err(...) |
Python API
Tensogram provides native Python bindings via PyO3. All tensor data crosses the boundary as NumPy arrays.
Installation
# From PyPI (once published)
pip install tensogram
# From source
pip install maturin numpy
cd python/bindings && maturin develop
Quick Start
import numpy as np
import tensogram
# Encode a 2D temperature field
temps = np.random.randn(100, 200).astype(np.float32) + 273.15
meta = {"version": 2}
desc = {"type": "ntensor", "shape": [100, 200], "dtype": "float32"}
msg = tensogram.encode(meta, [(desc, temps)])
# Decode it back
meta, objects = tensogram.decode(msg)
desc, array = objects[0]
print(array.shape) # (100, 200)
Encoding
Basic encoding
tensogram.encode() takes metadata, a list of (descriptor, array) pairs, and returns wire-format bytes:
msg = tensogram.encode(
{"version": 2},
[({"type": "ntensor", "shape": [3], "dtype": "float32"}, np.array([1, 2, 3], dtype=np.float32))],
hash="xxh3", # default; use None to skip hashing
)
Descriptor keys
Every object in a message is described by a dict. The three required keys define what the tensor looks like; the optional keys control how it is stored on the wire.
| Key | Required | Default | Description |
|---|---|---|---|
| `"type"` | yes | — | Object type, e.g. `"ntensor"` |
| `"shape"` | yes | — | Tensor dimensions, e.g. `[100, 200]` |
| `"dtype"` | yes | — | Data type name (see Data Types) |
| `"strides"` | no | row-major | Element strides; computed automatically if omitted |
| `"byte_order"` | no | native | `"little"` or `"big"`; defaults to host byte order |
| `"encoding"` | no | `"none"` | Encoding stage — see below |
| `"filter"` | no | `"none"` | Filter stage — see below |
| `"compression"` | no | `"none"` | Compression stage — see below |
Any additional keys (e.g. "reference_value", "bits_per_value") are stored in the descriptor’s .params dict and passed through to the encoding pipeline.
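The row-major default for `"strides"` is the standard C-order rule: the last axis is contiguous, and each earlier axis steps over the product of all axes after it. A sketch of that computation:

```python
def row_major_strides(shape):
    # Default element strides for C-order (row-major) layout:
    # strides[-1] = 1, strides[i] = strides[i+1] * shape[i+1].
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides

# Matches the Rust example: shape [100, 200] → strides [200, 1].
```

Note these are element strides (counted in elements, not bytes), matching the `strides: vec![200, 1]` in the Rust `DataObjectDescriptor` example earlier.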
The encoding pipeline
Each object passes through a three-stage pipeline before it is stored. You control each stage via descriptor keys:
raw bytes → encoding → filter → compression → wire payload
Encoding transforms the data representation:
| Value | What it does | Use case |
|---|---|---|
| `"none"` | Pass-through (default) | Exact values, integer data |
| `"simple_packing"` | Quantize floats to packed integers | Bounded-range scalar fields (GRIB-compatible) |
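The quantisation idea behind simple packing can be sketched in a few lines. This toy version (not the library's implementation — real GRIB-style packing also supports a decimal scale factor and negative binary scales, and `compute_packing_params` below handles parameter selection) restricts itself to `v ≈ reference + n · 2^E` with a non-negative `E`:

```python
def pack_simple(values, bits=16):
    # Choose the smallest non-negative binary scale E so that every
    # quantised integer n = round((v - ref) / 2**E) fits in `bits` bits.
    ref = min(values)
    max_int = (1 << bits) - 1
    E = 0
    while (max(values) - ref) / (2 ** E) > max_int:
        E += 1
    packed = [round((v - ref) / (2 ** E)) for v in values]
    return ref, E, packed

def unpack_simple(ref, E, packed):
    # Reconstruct: v ≈ ref + n * 2**E (lossy when E > 0).
    return [ref + n * (2 ** E) for n in packed]
```

With 16 bits and a small value range the round-trip is exact; with fewer bits the reconstruction error is bounded by half a quantisation step, `2**E / 2`.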
Filter rearranges bytes to improve compressibility:
| Value | What it does | Use case |
|---|---|---|
| `"none"` | Pass-through (default) | Most cases |
| `"shuffle"` | Byte-transpose by element width (requires `"shuffle_element_size"`) | Improves lz4/zstd ratio on typed data |
Compression reduces the payload size:
| Value | Random access | Type | Use case |
|---|---|---|---|
| `"none"` | yes | — | No compression |
| `"zstd"` | no | lossless | General-purpose, best ratio/speed tradeoff |
| `"lz4"` | no | lossless | Fastest decompression |
| `"szip"` | yes (RSI blocks) | lossless | Integer/packed data (CCSDS 121.0-B-3) |
| `"blosc2"` | yes (chunks) | lossless | Large tensors, multi-codec |
| `"zfp"` | yes (fixed-rate) | lossy | Floating-point arrays |
| `"sz3"` | no | lossy | Error-bounded scientific data |
Compression parameters are passed as extra descriptor keys. For example, zstd level:
desc = {
"type": "ntensor", "shape": [1000], "dtype": "float32",
"compression": "zstd", "zstd_level": 9,
}
For the full list of compressor parameters, see Compression.
Common pipeline combinations
# Lossless, fast decompression
desc = {"type": "ntensor", "shape": shape, "dtype": "float32",
"compression": "lz4"}
# Lossless, best ratio (shuffle_element_size must match dtype byte width)
desc = {"type": "ntensor", "shape": shape, "dtype": "float32",
"filter": "shuffle", "shuffle_element_size": 4, "compression": "zstd", "zstd_level": 12}
# Quantise a bounded-range float field to 16-bit packed ints, then compress
# (the same pipeline GRIB 2 uses for simple_packing + CCSDS).
# compute_packing_params expects a flat float64 array
values = data.astype(np.float64).ravel()
params = tensogram.compute_packing_params(values, bits_per_value=16, decimal_scale_factor=0)
desc = {"type": "ntensor", "shape": shape, "dtype": "float64",
"encoding": "simple_packing", "compression": "zstd", **params}
# Lossy float compression with error bound (zfp operates on float64)
desc = {"type": "ntensor", "shape": shape, "dtype": "float64",
"compression": "zfp", "zfp_mode": "fixed_accuracy", "zfp_tolerance": 0.01}
Invalid combinations: Some pipeline combinations are rejected at encode time — e.g. `zfp` + `shuffle` (ZFP operates on typed floats, not byte-shuffled data) or `simple_packing` + `sz3` (stacking two lossy stages). See Compression — Invalid Combinations.
Multiple objects per message
A single message can contain multiple tensors, each with its own descriptor:
spectrum = np.random.randn(256).astype(np.float64)
mask = np.array([1, 0, 1, 1, 0], dtype=np.uint8)
msg = tensogram.encode(
{"version": 2},
[
({"type": "ntensor", "shape": [256], "dtype": "float64", "compression": "zstd"}, spectrum),
({"type": "ntensor", "shape": [5], "dtype": "uint8"}, mask),
],
)
Pre-encoded data
If you already have compressed/packed payloads (e.g. from another system), use tensogram.encode_pre_encoded() with the same interface. The library skips the encoding pipeline and writes the bytes as-is:
msg = tensogram.encode_pre_encoded(meta, [(desc, pre_compressed_bytes)])
See Pre-Encoded Data API for details.
Decoding
Full decode
meta, objects = tensogram.decode(msg)
Returns a Message namedtuple with .metadata and .objects. Tuple unpacking works directly.
By default, decoded arrays are in the caller’s native byte order — the library handles byte-swapping automatically. Pass native_byte_order=False to receive the raw wire byte order instead:
meta, objects = tensogram.decode(msg, native_byte_order=False)
Metadata
meta is a Metadata object:
meta.version # int — always 2
meta.base # list[dict] — per-object metadata (one entry per object)
meta.extra # dict — message-level annotations (_extra_ in CBOR)
meta.reserved # dict — library internals (_reserved_ in CBOR, read-only)
meta["key"] # dict-style access (checks base entries, then extra)
To read metadata without decoding any payloads:
meta = tensogram.decode_metadata(msg)
To read metadata and descriptors (no payload decode):
meta, descriptors = tensogram.decode_descriptors(msg)
for desc in descriptors:
print(desc.shape, desc.dtype, desc.compression)
Selective decode
Decode a single object without touching the others — O(1) seek via the binary header’s offset table:
meta, desc, array = tensogram.decode_object(msg, index=2)
Decode a sub-range of elements from one object (for compressors that support random access):
# Elements 100-149 and 300-324 from object 0
parts = tensogram.decode_range(msg, object_index=0, ranges=[(100, 50), (300, 25)])
# parts is a list of numpy arrays, one per range
# Or join into a single contiguous array
joined = tensogram.decode_range(msg, object_index=0, ranges=[(100, 50), (300, 25)], join=True)
# joined is a single flat numpy array of shape (75,)
`decode_range` works with uncompressed data, `simple_packing`, `szip`, `blosc2`, and `zfp` fixed-rate mode. It returns an error for stream compressors (`zstd`, `lz4`, `sz3`) and for the `shuffle` filter. See Decoding Data for details.
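The `(offset, count)` pairs are in element units, not bytes. The selection and `join` semantics can be sketched over a plain Python list standing in for the decoded flat tensor (the real call seeks into the compressed payload instead of decoding everything):

```python
def decode_range_sketch(flat, ranges, join=False):
    """Select (offset, count) element ranges from a flat sequence."""
    parts = [flat[off:off + cnt] for off, cnt in ranges]
    if join:
        merged = []
        for p in parts:
            merged.extend(p)           # concatenate ranges in request order
        return merged
    return parts

flat = list(range(1000))               # pretend this is decoded object 0
parts = decode_range_sketch(flat, [(100, 50), (300, 25)])
assert [len(p) for p in parts] == [50, 25]
joined = decode_range_sketch(flat, [(100, 50), (300, 25)], join=True)
assert len(joined) == 75 and joined[0] == 100 and joined[-1] == 324
```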
Scanning and iteration
To find message boundaries in a buffer without decoding:
offsets = tensogram.scan(buf) # list of (offset, length) pairs
To iterate messages in a multi-message buffer:
for meta, objects in tensogram.iter_messages(buf):
print(meta.version, len(objects))
Hash verification
meta, objects = tensogram.decode(msg, verify_hash=True)
Raises RuntimeError if any object’s payload hash doesn’t match. If the message was encoded without a hash (hash=None), verification is silently skipped.
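The verify-then-raise contract can be emulated with the standard library: compute a digest over the payload at encode time, store it next to the bytes, and compare on decode. A sketch using `hashlib.blake2b` as a stand-in for xxh3 (which is not in the stdlib); function names are illustrative:

```python
import hashlib

def encode_with_hash(payload: bytes):
    return payload, hashlib.blake2b(payload, digest_size=8).digest()

def decode_with_verify(payload: bytes, stored_hash):
    if stored_hash is not None:        # hash=None at encode time: skip silently
        actual = hashlib.blake2b(payload, digest_size=8).digest()
        if actual != stored_hash:
            raise RuntimeError("payload hash mismatch")
    return payload

payload, digest = encode_with_hash(b"tensor bytes")
assert decode_with_verify(payload, digest) == payload
try:
    decode_with_verify(b"corrupted!", digest)
except RuntimeError:
    pass                               # corruption detected, as expected
else:
    raise AssertionError("corruption not detected")
```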
File API
Writing
with tensogram.TensogramFile.create("forecast.tgm") as f:
for step in range(24):
data = model.run(step)
desc = {"type": "ntensor", "shape": list(data.shape), "dtype": "float32",
"compression": "zstd"}
f.append({"version": 2, "base": [{"step": step}]}, [(desc, data)])
Each append encodes one message and writes it to the end of the file. Messages are independent and self-describing.
Reading
with tensogram.TensogramFile.open("forecast.tgm") as f:
print(len(f)) # message count
meta, objects = f[0] # index (supports negative indices)
subset = f[1:10:2] # slice → list[Message]
for meta, objects in f: # iterate all messages
for desc, array in objects:
print(desc.shape, array.dtype)
raw = f.read_message(0) # raw bytes for forwarding/caching
The first access triggers a streaming scan that records message offsets. After that, every read is an O(1) seek.
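The scan-once, seek-after pattern can be sketched with simple length-prefixed frames. The frame layout here is invented for illustration (the real wire format is richer), but the mechanics are the same: one linear pass builds an offset table, and every later read is a direct slice:

```python
import struct

def build_index(buf: bytes):
    """One linear pass records (offset, length) of every frame."""
    offsets, pos = [], 0
    while pos < len(buf):
        (length,) = struct.unpack_from("<I", buf, pos)  # 4-byte length prefix
        offsets.append((pos, 4 + length))
        pos += 4 + length
    return offsets

def read_frame(buf: bytes, index, i):
    off, ln = index[i]                 # O(1) lookup, no rescan
    return buf[off + 4 : off + ln]

frames = [b"alpha", b"bb", b"gamma-payload"]
buf = b"".join(struct.pack("<I", len(f)) + f for f in frames)
idx = build_index(buf)
assert len(idx) == 3
assert read_frame(buf, idx, 1) == b"bb"
assert read_frame(buf, idx, 2) == b"gamma-payload"
```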
Streaming encoder
For building a message one object at a time in memory:
enc = tensogram.StreamingEncoder({"version": 2}, hash="xxh3")
for desc, data in objects:
enc.write_object(desc, data)
msg = enc.finish() # returns complete message as bytes
For pre-encoded payloads, use enc.write_object_pre_encoded(desc, raw_bytes).
Async API
AsyncTensogramFile provides the same operations as TensogramFile but as
asyncio coroutines. A single handle supports truly concurrent operations
with no per-handle mutex; internal caches are thread-safe.
Opening and decoding
import asyncio
import tensogram
async def main():
f = await tensogram.AsyncTensogramFile.open("forecast.tgm")
meta, objects = await f.decode_message(0)
result = await f.file_decode_object(0, 0)
print(result["data"].shape)
asyncio.run(main())
For remote files with credentials:
f = await tensogram.AsyncTensogramFile.open_remote(
"s3://bucket/data.tgm", {"region": "eu-west-1"}
)
Concurrent decoding with asyncio.gather
Multiple decode calls run concurrently on a single handle:
results = await asyncio.gather(
f.file_decode_object(0, 0),
f.file_decode_object(1, 0),
f.file_decode_object(2, 0),
)
Batch decoding from many messages at once
When you need the same data from many messages, for example reading how a value at one grid point changes over 300 time steps, individual requests are slow because each one is a separate HTTP round-trip.
file_decode_range_batch collects the requested element ranges across
messages and fetches the underlying data in a batched HTTP call.
file_decode_object_batch does the same for full frames:
indices = list(range(300))
row, col, grid = 100, 200, 528
offset = row * grid + col
values = await f.file_decode_range_batch(indices, 0, [(offset, 1)], join=True)
frames = await f.file_decode_object_batch(indices, 0)
For even more speed, split the work into chunks and run them concurrently:
chunks = [indices[i::16] for i in range(16)]
batch_results = await asyncio.gather(
*[f.file_decode_range_batch(chunk, 0, [(offset, 1)], join=True)
for chunk in chunks]
)
The sync TensogramFile also has file_decode_range_batch and
file_decode_object_batch with the same signatures. Both batch methods
require a remote backend; calling them on a local file raises OSError.
Layout prefetching
Before running many concurrent decodes on a remote file, prefetch the internal layout metadata to avoid repeated discovery requests:
count = await f.message_count()
await f.prefetch_layouts(list(range(count)))
Context manager and iteration
async with await tensogram.AsyncTensogramFile.open("data.tgm") as f:
await f.message_count() # required before async for or len(f)
async for meta, objects in f:
print(objects[0][1].shape)
Async iteration works on remote files (sync iteration does not).
await f.message_count() must be called once before using async for
or len(f), to discover the message count without blocking the event loop.
Other methods
count = await f.message_count()
raw = await f.read_message(0)
all_raw = await f.messages()
print(f.is_remote(), f.source())
Note:
`len(f)` requires a prior `await f.message_count()` call. Without it, `len(f)` raises `RuntimeError`.
When to use async vs sync
| Scenario | Recommendation |
|---|---|
| Script, CLI, or notebook | TensogramFile (sync) |
| Inside an asyncio event loop | AsyncTensogramFile |
| xarray or zarr | Sync (those frameworks are synchronous) |
| Many concurrent remote reads | asyncio.gather on one AsyncTensogramFile |
| Same data from many messages | file_decode_range_batch or file_decode_object_batch |
Validation
Two functions check whether messages and files are well-formed without consuming the data. See also the CLI reference.
report = tensogram.validate(msg)
file_report = tensogram.validate_file("data.tgm")
Levels
| Level | Checks | hash_verified |
|---|---|---|
"quick" | Structure only: magic bytes, frame layout, lengths | always False |
"default" | + metadata (CBOR) + integrity (hash verification, decompression) | True only if hash succeeds and no errors |
"checksum" | Hash verification only, structural warnings suppressed | True only if hash succeeds and no errors |
"full" | + fidelity (full decode, decoded-size check, NaN/Inf scan) | True only if hash succeeds and no errors |
# Full validation with canonical CBOR key-order checking
report = tensogram.validate(msg, level="full", check_canonical=True)
Return values
validate() returns:
{
"issues": [
{
"code": "hash_mismatch", # stable snake_case string
"level": "integrity", # which validation level found it
"severity": "error", # "error" or "warning"
"description": "...", # human-readable message
"object_index": 0, # optional — which object
"byte_offset": 1234, # optional — position in buffer
}
],
"object_count": 1,
"hash_verified": False,
}
validate_file() returns file-level issues plus per-message reports:
{
"file_issues": [
{"byte_offset": 100, "length": 19, "description": "trailing bytes after last message"}
],
"messages": [
{"issues": [], "object_count": 1, "hash_verified": True}
],
}
Interpreting results
report = tensogram.validate(msg)
if not report["issues"]:
print(f"OK — {report['object_count']} objects, hash verified")
else:
for issue in report["issues"]:
print(f"[{issue['severity']}] {issue['code']}: {issue['description']}")
GRIB / NetCDF conversion
Three PyO3-bound helpers wrap tensogram-grib
and tensogram-netcdf.
They are always callable — when the Python wheel was built without
the corresponding Cargo feature, each raises RuntimeError with a
pointer to rebuild instructions.
You can probe availability at runtime:
import tensogram
if tensogram.__has_grib__:
msgs = tensogram.convert_grib("forecast.grib2")
if tensogram.__has_netcdf__:
msgs = tensogram.convert_netcdf("data.nc")
convert_grib(path, **options) -> list[bytes]
Convert a GRIB file (as many messages as it contains) to Tensogram
wire format. Returns one `bytes` object per output Tensogram message —
write them sequentially (or join them) to produce a `.tgm` file.
msgs = tensogram.convert_grib(
"forecast.grib2",
grouping="merge_all", # "merge_all" | "one_to_one"
preserve_all_keys=False, # lift every ecCodes namespace into base[i]["grib"]
encoding="simple_packing", # "none" | "simple_packing"
bits=16, # None -> defaults to 16; ignored for encoding="none"
filter="none", # "none" | "shuffle"
compression="szip", # "none" | "zstd" | "lz4" | "blosc2" | "szip"
compression_level=None, # applies to zstd / blosc2 (None = codec default)
threads=0, # 0 = sequential; honours TENSOGRAM_THREADS env var
hash="xxh3", # "xxh3" | None
# NaN / Inf handling — see docs/src/guide/nan-inf-handling.md
allow_nan=False, # False (default) rejects any NaN input
allow_inf=False, # False (default) rejects any ±Inf input
)
with open("forecast.tgm", "wb") as fh:
for msg in msgs:
fh.write(msg)
Pipeline defaults and edge cases:
- `bits=None` with `encoding="simple_packing"` defaults to 16 bits. `bits` outside `1..=64` silently falls back to `encoding="none"` and emits a warning to stderr. Validate your inputs before calling if fail-fast is important.
- Unknown `compression` / `encoding` names raise `ValueError` with the list of valid choices in the message.
- Unknown `grouping` / `split_by` / `hash` values raise `ValueError`.
- Missing input paths raise `FileNotFoundError`.
- Building the wheel without the `grib` / `netcdf` feature causes the corresponding function to raise `RuntimeError` at call time with rebuild instructions.
Requires libeccodes at the OS level and the wheel built with
--features grib (maturin develop --features grib). Official PyPI
wheels do not currently include the grib feature — see
Jupyter Notebook Walk-through.
convert_grib_buffer(buf, **options) -> list[bytes]
In-memory variant of convert_grib. Accepts any Python bytes-like
object (bytes, bytearray, memoryview, numpy.uint8[:]).
Useful when the GRIB bytes come from a byte-range HTTP fetch, a
cache, or any other in-memory source — no filesystem staging needed.
import requests
# Byte-range download of a single GRIB message from data.ecmwf.int.
resp = requests.get(
"https://data.ecmwf.int/forecasts/.../...grib2",
headers={"Range": "bytes=74573515-75234113"},
)
msgs = tensogram.convert_grib_buffer(
resp.content,
encoding="simple_packing",
bits=16,
compression="szip",
# See [NaN / Inf Handling](nan-inf-handling.md) for the
# `allow_nan` / `allow_inf` opt-in if your data contains
# non-finite values.
)
convert_grib and convert_grib_buffer produce bit-identical
decoded payloads for the same input. The encoded bytes may
differ — each call stamps a fresh timestamp and UUID into
_reserved_.
convert_netcdf(path, **options) -> list[bytes]
Convert a NetCDF-3 or NetCDF-4 file to Tensogram. Packed variables
(scale_factor / add_offset) are automatically unpacked to
float64.
msgs = tensogram.convert_netcdf(
"data.nc",
split_by="file", # "file" | "variable" | "record"
cf=False, # lift 16 CF attributes into base[i]["cf"]
encoding="none",
bits=None,
filter="none",
compression="zstd",
compression_level=3,
threads=0,
hash="xxh3",
# NaN / Inf handling — see docs/src/guide/nan-inf-handling.md
allow_nan=False, # False (default) rejects any NaN input
allow_inf=False, # False (default) rejects any ±Inf input
)
Note on NaN and `--encoding simple_packing`. Since 0.17 the importer hard-fails on NaN or Inf in a variable targeted for `simple_packing` (previous behaviour: stderr warning + fallback to `encoding="none"`). If your NetCDF has `_FillValue` / `missing_value` fields unpacked to NaN, either stick with the default `encoding="none"` or pre-process the values. See the NetCDF Importer error-handling reference for the full contract.
Requires libnetcdf + libhdf5 at the OS level and the wheel built
with --features netcdf.
Error Handling
| Exception | When |
|---|---|
FileNotFoundError | convert_grib(path) / convert_netcdf(path) called with a non-existent path (subclass of OSError). |
OSError | Other file I/O failures (permission denied, disk error, etc.). |
ValueError | Invalid parameters; unknown dtype; NaN in simple packing; unknown validation level; invalid grouping / split_by / hash; unknown codec / bit width in the conversion pipeline; empty/non-GRIB input buffer; split_by="record" on a NetCDF without an unlimited dimension. |
RuntimeError | Hash mismatch during decode(..., verify_hash=True); calling convert_grib / convert_grib_buffer / convert_netcdf on a wheel built without the feature; internal ecCodes / libnetcdf C-library failures that cannot be classified as caller-input errors. |
KeyError | Missing metadata key via meta["key"]. |
Supported dtypes
| Category | Types |
|---|---|
| Floating point | float16, bfloat16, float32, float64 |
| Complex | complex64, complex128 |
| Signed integer | int8, int16, int32, int64 |
| Unsigned integer | uint8, uint16, uint32, uint64 |
| Special | bitmask |
bfloat16 is returned as ml_dtypes.bfloat16 when ml_dtypes is installed; otherwise it falls back to np.uint16.
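If you receive the `np.uint16` fallback, each value holds the raw bfloat16 bit pattern. Since bfloat16 is simply a float32 with the low 16 mantissa bits dropped, you can widen it yourself with the stdlib; a sketch (when `ml_dtypes` is installed the library hands you real bfloat16 values and none of this is needed):

```python
import struct

def bf16_bits_to_float(bits: int) -> float:
    """bfloat16 is float32 with the low 16 mantissa bits dropped."""
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]

def float_to_bf16_bits(value: float) -> int:
    (u,) = struct.unpack("<I", struct.pack("<f", value))
    return u >> 16          # truncate (real encoders usually round)

assert bf16_bits_to_float(0x3F80) == 1.0    # bfloat16 pattern for 1.0
assert bf16_bits_to_float(0xC000) == -2.0   # bfloat16 pattern for -2.0
bits = float_to_bf16_bits(3.140625)         # exactly representable in bfloat16
assert bf16_bits_to_float(bits) == 3.140625
```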
See Data Types for byte widths and wire-format details.
Examples
See examples/python/ for complete working examples:
| Example | Topic |
|---|---|
01_encode_decode.py | Basic round-trip |
02_mars_metadata.py | Per-object metadata (ECMWF MARS vocabulary example) |
02b_generic_metadata.py | Per-object metadata using a generic application namespace |
03_simple_packing.py | Simple-packing encoding |
04_multi_object.py | Multi-object messages, selective decode |
05_file_api.py | Multi-message .tgm files |
06_hash_and_errors.py | Hash verification and error handling |
07_iterators.py | File iteration, indexing, slicing |
08_xarray_integration.py | Opening .tgm as xarray Datasets |
08_zarr_backend.py | Reading/writing through Zarr v3 |
09_dask_distributed.py | Dask distributed computing over 4-D tensors |
09_streaming_consumer.py | Streaming consumer pattern |
11_encode_pre_encoded.py | Pre-encoded data API |
12_convert_netcdf.py | NetCDF → Tensogram import via the Python API |
13_validate.py | Message and file validation |
15_async_operations.py | Async open, decode, and asyncio.gather |
17_convert_grib.py | GRIB → Tensogram import (file + in-memory buffer) |
For narrative walk-throughs with plots and explanations, see also
examples/jupyter/*.ipynb — five journey notebooks covering
quickstart/MARS, encoding pipeline fidelity, GRIB conversion, NetCDF
conversion with xarray, and validation with multi-threaded encoding.
C++ API
Tensogram provides a header-only C++17 wrapper at cpp/include/tensogram.hpp. It delegates all work to the C FFI and adds RAII handle management, typed exceptions, and idiomatic C++ patterns.
Requirements
- C++17 compiler (GCC 7+, Clang 5+, MSVC 19.14+)
- Rust static library built via `cargo build --release`
- CMake 3.16+ (recommended)
Build
cargo build --release
cmake -S cpp -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
Quick Start
#include <tensogram.hpp>
// Encode
std::string meta_json = R"({"version": 2, "descriptors": [...]})";
std::vector<float> data(100 * 200, 0.0f);
auto encoded = tensogram::encode(
meta_json,
{{reinterpret_cast<const uint8_t*>(data.data()), data.size() * sizeof(float)}});
// Decode
auto msg = tensogram::decode(encoded.data(), encoded.size());
auto obj = msg.object(0);
const float* values = obj.data_as<float>();
RAII Classes
| Class | Wraps | Cleanup |
|---|---|---|
message | tgm_message_t | tgm_message_free |
metadata | tgm_metadata_t | tgm_metadata_free |
file | tgm_file_t | tgm_file_close |
buffer_iterator | tgm_buffer_iter_t | tgm_buffer_iter_free |
file_iterator | tgm_file_iter_t | tgm_file_iter_free |
object_iterator | tgm_object_iter_t | tgm_object_iter_free |
streaming_encoder | tgm_streaming_encoder_t | tgm_streaming_encoder_free |
All classes are move-only (copy deleted). Handles are released automatically when the object goes out of scope.
Error Handling
C error codes are mapped to a typed exception hierarchy:
try {
auto msg = tensogram::decode(buf, len);
} catch (const tensogram::framing_error& e) {
// Invalid message framing
} catch (const tensogram::hash_mismatch_error& e) {
// Payload integrity check failed
} catch (const tensogram::error& e) {
// Any Tensogram error (base class)
std::cerr << e.what() << " (code=" << e.code() << ")\n";
}
Validation
Two free functions validate messages and files, returning JSON strings:
// Validate a single message buffer (default level)
auto report = tensogram::validate(buf, len);
// Full validation with canonical CBOR check
auto full_report = tensogram::validate(buf, len, "full", /*check_canonical=*/true);
// Validate a .tgm file
auto file_report = tensogram::validate_file("data.tgm");
auto file_full = tensogram::validate_file("data.tgm", "full");
Validation levels: "quick", "default", "checksum", "full".
The returned JSON contains issues, object_count, and hash_verified for single messages, or file_issues and messages for files. Parse with your preferred JSON library.
An invalid level string or a missing file throws tensogram::invalid_arg_error or tensogram::io_error respectively. Validation issues (corrupted data, hash mismatches) are reported in the JSON — they do not throw.
Iterators
See Iterators for buffer, file, and object iterator usage.
Examples
See examples/cpp/ for complete working examples covering encode/decode, metadata, file API, simple packing, and iterators.
TypeScript API
Tensogram ships @ecmwf/tensogram, a TypeScript package that wraps the
WebAssembly build with typed, idiomatic helpers. Use it in any modern
browser or Node ≥ 20.
Status: Scopes B and C are complete. Typed encode / decode / scan, dtype dispatch, metadata helpers, progressive streaming decode, the `TensogramFile` file / URL helper, `validate`, and `encodePreEncoded` are all available. Remaining follow-ups (first-class `float16` / `bfloat16` / `complex*` types, npm publish pipeline) are tracked in `plans/TYPESCRIPT_WRAPPER.md`.
Installation
The package is not yet published to npm. Build it locally:
# First, build the WebAssembly blob from the Rust source
cd typescript
npm install
npm run build:wasm # runs wasm-pack build -t web -d typescript/wasm
npm run build # runs wasm-pack + tsc
Or use the top-level Makefile:
make ts-build # build WASM + tsc
make ts-test # vitest
make ts-typecheck # strict tsc --noEmit on src + tests
Quick start
import {
init, encode, decode,
type DataObjectDescriptor,
type GlobalMetadata,
} from '@ecmwf/tensogram';
// One-time WASM initialisation (idempotent)
await init();
// ── Encode ────────────────────────────────────────────────────────────
const temps = new Float32Array(100 * 200);
for (let i = 0; i < temps.length; i++) temps[i] = 273.15 + i / 100;
const meta: GlobalMetadata = { version: 2 };
const descriptor: DataObjectDescriptor = {
type: 'ntensor',
ndim: 2,
shape: [100, 200],
strides: [200, 1],
dtype: 'float32',
byte_order: 'little',
encoding: 'none',
filter: 'none',
compression: 'none',
};
const msg: Uint8Array = encode(meta, [{ descriptor, data: temps }]);
// ── Decode ────────────────────────────────────────────────────────────
const { metadata, objects } = decode(msg);
const arr = objects[0].data(); // Float32Array (inferred from dtype)
console.log(arr.length); // 20000
API surface
init(opts?)
Loads and instantiates the WASM blob. Must be awaited before any other function is called. Safe to call multiple times — subsequent calls reuse the same instance.
await init(); // defaults
await init({ wasmInput: new URL('...', import.meta.url) }); // custom location
encode(metadata, objects, opts?)
| Parameter | Type | Description |
|---|---|---|
metadata | GlobalMetadata | Wire-format metadata; version: 2 is required |
objects | Array<{ descriptor, data }> | Each data is a TypedArray or Uint8Array |
opts.hash | 'xxh3' | false | Hash algorithm. Default 'xxh3'. Pass false to disable. |
Returns: Uint8Array containing the complete wire-format message.
decode(buf, opts?)
| Parameter | Type | Description |
|---|---|---|
buf | Uint8Array | Raw message bytes |
opts.verifyHash | boolean | Default false. If true, throws HashMismatchError on corruption. |
Returns: { metadata: GlobalMetadata, objects: DecodedObject[], close() }.
decodeMetadata(buf)
Returns only the metadata; does not touch any payload bytes.
decodeObject(buf, index, opts?)
O(1) seek to object index, decoding only that object.
scan(buf)
Returns Array<{ offset: number; length: number }> for each
Tensogram message found in a (potentially multi-message) buffer.
Garbage between messages is silently skipped.
DecodedObject / DecodedFrame
interface DecodedObject {
readonly descriptor: DataObjectDescriptor;
/** Copy into the JS heap. Safe across WASM memory growth. */
data(): TypedArray;
/** Zero-copy view. Invalidated if WASM memory grows. */
dataView(): TypedArray;
readonly byteLength: number;
}
interface DecodedFrame extends /* structurally */ DecodedObject {
/** The matching `base[i]` entry from the containing message. */
readonly baseEntry: BaseEntry | null;
close(): void;
}
The returned array type is picked from descriptor.dtype:
dtype | Returned TypedArray |
|---|---|
float32 | Float32Array |
float64 | Float64Array |
int8 | Int8Array |
int16 | Int16Array |
int32 | Int32Array |
int64 | BigInt64Array |
uint8 | Uint8Array |
uint16 | Uint16Array |
uint32 | Uint32Array |
uint64 | BigUint64Array |
float16 / bfloat16 | Uint16Array (no native half-precision in JS) |
complex64 | Float32Array (interleaved real, imag) |
complex128 | Float64Array (interleaved real, imag) |
bitmask | Uint8Array (packed bits) |
getMetaKey(meta, path)
Dot-path lookup matching the Rust / Python / CLI first-match-across-base
semantics: searches base[0], base[1], …, skipping the _reserved_
key in each, then falls back to _extra_.
getMetaKey(meta, 'mars.param') // 'base[0].mars.param' first match
getMetaKey(meta, '_extra_.source') // explicit _extra_ prefix
Returns undefined if the key is missing (never throws).
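The lookup order can be sketched in a few lines of plain TypeScript. This is an illustrative re-implementation, not the package source:

```typescript
type Meta = { base?: Record<string, unknown>[]; _extra_?: Record<string, unknown> };

// Walk a dot path like "mars.param" through nested records.
function walk(obj: unknown, path: string[]): unknown {
  return path.reduce<unknown>(
    (cur, key) =>
      cur && typeof cur === "object" ? (cur as Record<string, unknown>)[key] : undefined,
    obj,
  );
}

function getMetaKeySketch(meta: Meta, dotPath: string): unknown {
  const path = dotPath.split(".");
  if (path[0] === "_extra_") return walk(meta._extra_, path.slice(1));
  for (const entry of meta.base ?? []) {
    const { _reserved_, ...rest } = entry;      // skip _reserved_ in each entry
    const hit = walk(rest, path);
    if (hit !== undefined) return hit;          // first match across base[i]
  }
  return walk(meta._extra_, path);              // then fall back to _extra_
}

const meta: Meta = {
  base: [{ _reserved_: { internal: 1 } }, { mars: { param: "2t" } }],
  _extra_: { source: "demo" },
};
console.assert(getMetaKeySketch(meta, "mars.param") === "2t");
console.assert(getMetaKeySketch(meta, "_extra_.source") === "demo");
console.assert(getMetaKeySketch(meta, "missing.key") === undefined);
```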
computeCommon(meta)
Mirror of tensogram::compute_common. Returns a
Record<string, CborValue> of keys that are present with identical
values in every entry of meta.base. Useful for display and merge
operations.
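A sketch of the intersection semantics — keep only keys present with an identical value in every `base` entry. Illustrative only (it uses `JSON.stringify` for deep equality, which is sensitive to key order; the real implementation compares CBOR values):

```typescript
type Entry = Record<string, unknown>;

function computeCommonSketch(base: Entry[]): Entry {
  if (base.length === 0) return {};
  // Crude deep equality via serialisation (sufficient for a sketch).
  const same = (a: unknown, b: unknown) => JSON.stringify(a) === JSON.stringify(b);
  const common: Entry = {};
  for (const [key, value] of Object.entries(base[0])) {
    if (base.every((entry) => key in entry && same(entry[key], value))) {
      common[key] = value;   // present with an identical value everywhere
    }
  }
  return common;
}

const base = [
  { stream: "oper", step: 0, levtype: "sfc" },
  { stream: "oper", step: 6, levtype: "sfc" },
];
const common = computeCommonSketch(base);
console.assert(common.stream === "oper" && common.levtype === "sfc");
console.assert(!("step" in common));       // differs between entries -> dropped
```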
Error classes
All errors thrown from this package are instances of the abstract
TensogramError class. Eight concrete subclasses match the Rust
TensogramError variants plus the TS-layer InvalidArgumentError
and StreamingLimitError:
import {
TensogramError,
FramingError,
MetadataError,
EncodingError,
CompressionError,
ObjectError,
IoError,
RemoteError,
HashMismatchError,
InvalidArgumentError,
StreamingLimitError,
} from '@ecmwf/tensogram';
try {
decode(corruptBuffer);
} catch (err) {
if (err instanceof FramingError) {
console.error('bad wire format:', err.message);
} else if (err instanceof HashMismatchError) {
console.error('integrity failure:', err.expected, err.actual);
} else {
throw err;
}
}
Memory model
- Safe-copy by default. `object.data()` / `frame.data()` always allocate a new `TypedArray` on the JS heap. It remains valid even after the underlying `DecodedMessage` / `DecodedFrame` is freed or WASM memory grows.
- Zero-copy opt-in. `object.dataView()` / `frame.dataView()` return a view directly into WASM linear memory. It is invalidated the next time any WASM call grows linear memory — which can happen on the next `encode()` / `decode()`. Read the view immediately or copy it.
- Explicit cleanup. `DecodedMessage`, `DecodedFrame`, and `TensogramFile` all expose `.close()` to release WASM-side memory. A `FinalizationRegistry` also calls `.free()` on the underlying WASM handle when the wrapper is garbage-collected, but explicit `.close()` is strongly recommended for deterministic cleanup.
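The copy-vs-view distinction mirrors plain typed-array semantics: a sliced copy survives changes to the source buffer, a subarray view does not. A plain-JS illustration, no WASM needed:

```typescript
const backing = new Float32Array([1, 2, 3, 4]);

const copy = backing.slice(0, 2);     // like data(): a new allocation
const view = backing.subarray(0, 2);  // like dataView(): aliases the buffer

backing[0] = 99;                      // "memory changed underneath us"
console.assert(copy[0] === 1);        // the copy is unaffected
console.assert(view[0] === 99);       // the view observes the change
```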
Streaming decode
Use decodeStream(readable, opts?) to progressively decode a
ReadableStream<Uint8Array>. Works against any stream source —
fetch().body, a Node Readable.toWeb(), a Blob.stream(), or a
hand-rolled ReadableStream.
import { decodeStream } from '@ecmwf/tensogram';
const res = await fetch('/data.tgm');
for await (const frame of decodeStream(res.body!)) {
render(frame.descriptor.shape, frame.data());
frame.close();
}
Options:
| Option | Type | Description |
|---|---|---|
signal | AbortSignal | Cancels the iteration. The underlying reader is cancelled and the decoder is freed cleanly. |
maxBufferBytes | number | Max size of the internal staging buffer. Default: 256 MiB. Exceeding this throws StreamingLimitError. |
onError | (err: StreamDecodeError) => void | Called whenever a corrupt message is skipped. The iterator does not throw on skips — it keeps going. |
Key behaviours:
- Chunk-boundary tolerant. A message can be split across any number of chunks. The decoder accumulates until a complete message is seen, then emits every object as a separate frame.
- Corruption resilient. A single bad message is skipped; the iterator keeps going with subsequent messages. Pass `onError` to observe the skips.
- Early break is safe. Breaking out of the `for await` loop runs the generator’s `finally` block, which releases the stream reader and frees the decoder.
- AbortSignal cancels cleanly. Firing the signal cancels the underlying reader; the generator throws whatever error the signal carries.
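Chunk-boundary tolerance is the interesting mechanic: bytes accumulate in a staging buffer until one whole message is available, then it is emitted and the tail kept for the next chunk. A toy accumulator over 4-byte length-prefixed frames (the real decoder parses the Tensogram preamble instead; this frame layout is invented for illustration):

```typescript
class FrameAccumulator {
  private buf = new Uint8Array(0);

  push(chunk: Uint8Array): Uint8Array[] {
    // Append the chunk to the staging buffer.
    const next = new Uint8Array(this.buf.length + chunk.length);
    next.set(this.buf);
    next.set(chunk, this.buf.length);
    this.buf = next;

    const frames: Uint8Array[] = [];
    // Emit every complete [u32 length][payload] frame; keep the tail.
    while (this.buf.length >= 4) {
      const len = new DataView(this.buf.buffer, this.buf.byteOffset).getUint32(0, true);
      if (this.buf.length < 4 + len) break;   // incomplete — wait for more bytes
      frames.push(this.buf.slice(4, 4 + len));
      this.buf = this.buf.slice(4 + len);
    }
    return frames;
  }
}

// A 5-byte frame split across three chunks still decodes once complete.
const acc = new FrameAccumulator();
const msg = new Uint8Array([5, 0, 0, 0, 10, 20, 30, 40, 50]);
console.assert(acc.push(msg.slice(0, 3)).length === 0);
console.assert(acc.push(msg.slice(3, 6)).length === 0);
const out = acc.push(msg.slice(6));
console.assert(out.length === 1 && out[0].length === 5 && out[0][4] === 50);
```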
File API
TensogramFile gives you random-access reads over a .tgm file,
whether it lives on the local file system, behind an HTTPS URL, or
already in memory.
import { TensogramFile } from '@ecmwf/tensogram';
// Node: from the local file system
const file = await TensogramFile.open('/data/input.tgm');
// Browser or Node: over HTTPS
const file = await TensogramFile.fromUrl('https://example.com/input.tgm');
// Any runtime: from pre-loaded bytes
const file = TensogramFile.fromBytes(uint8ArrayFromSomewhere);
All three factories produce an identical object:
interface TensogramFile extends AsyncIterable<DecodedMessage> {
readonly messageCount: number;
readonly byteLength: number;
readonly source: 'local' | 'remote' | 'buffer';
message(index: number, opts?: DecodeOptions): Promise<DecodedMessage>;
messageMetadata(index: number): Promise<GlobalMetadata>;
rawMessage(index: number): Uint8Array;
[Symbol.asyncIterator](): AsyncIterator<DecodedMessage>;
close(): void;
}
Usage:
const file = await TensogramFile.open('/data/input.tgm');
try {
console.log(`${file.messageCount} messages, ${file.byteLength} bytes`);
// Random access
const first = await file.message(0);
console.log(first.objects[0].descriptor.shape);
first.close();
// Async iteration
for await (const msg of file) {
// ...
msg.close();
}
} finally {
file.close();
}
TensogramFile.open(path, opts?) (Node only)
Loads the file via node:fs/promises. The node:fs/promises import
is dynamic so browser bundlers can tree-shake this code path.
| Option | Type | Description |
|---|---|---|
signal | AbortSignal | Cancels the initial read. |
TensogramFile.fromUrl(url, opts?) (any fetch-capable runtime)
Downloads the file over HTTPS using the ambient globalThis.fetch.
| Option | Type | Description |
|---|---|---|
fetch | typeof fetch | Override the fetch implementation (useful for tests and for browsers with a polyfill). |
headers | HeadersInit | Extra request headers (auth, etc.). |
signal | AbortSignal | Cancels the download. |
TensogramFile.fromBytes(bytes)
Wraps an already-loaded Uint8Array. The buffer is defensively
copied, so later mutation of the caller’s buffer is invisible to
the TensogramFile.
Range-based lazy access
Since Scope C, TensogramFile.fromUrl automatically probes the server
for HTTP Range support. When the HEAD response advertises
Accept-Ranges: bytes and a finite Content-Length, the file
switches to a lazy backend:
- The initial open issues a small `HEAD` plus one 24-byte Range read per message preamble to build the boundary index. No payload data is downloaded.
- `rawMessage(i)` / `message(i)` fetch just the requested message’s bytes via a `Range: bytes=offset-(offset+length-1)` GET.
- A small LRU caches recently-fetched message bytes so repeat reads are free.
When the server omits Accept-Ranges, returns non-200 on HEAD, or
the file uses streaming-mode messages (total_length=0 — the writer
did not know the final length up front), the open falls back to a
single eager GET. Behaviour is indistinguishable to callers except in
memory use and timing.
Browser callers using fromUrl directly need CORS to expose the
Accept-Ranges, Content-Range, and Content-Length headers.
Append (Node local file system)
TensogramFile#append(meta, objects, opts?) encodes the new message
in-memory, appends it to the on-disk file, refreshes the position
index, and makes the new message reachable via message(i) on the
same handle. Only supported when the file was opened via
TensogramFile.open(path) — fromBytes- and fromUrl-backed files
throw InvalidArgumentError, matching the contract in the other
language bindings.
const file = await TensogramFile.open('/data/forecast.tgm');
try {
await file.append({ version: 2 }, [{ descriptor, data }]);
console.log(`now has ${file.messageCount} messages`);
} finally {
file.close();
}
Scope-C API additions
Scope C brought the TypeScript wrapper to full API parity with Rust / Python / FFI / C++. The surface additions are:
| Function / class | What it does |
|---|---|
| decodeRange(buf, objIndex, ranges, opts?) | Partial sub-tensor decode. ranges is an array of [offset, count] pairs in element units; each returned parts[i] is a dtype-typed view. Option join: true concatenates every range into a single view. |
| computeHash(bytes, algo?) | Standalone xxh3 hash — matches the digest stamped by encode() on the same bytes. |
| simplePackingComputeParams(values, bits, decScale?) | GRIB-style simple-packing parameter computation. Return shape uses snake-case keys so the result spreads directly into a descriptor. |
| validate(buf, opts?) | Report-only validation (never throws on bad input). Modes: quick, default, checksum, full. |
| validateBuffer(buf, opts?) | Multi-message buffer: reports file-level gaps / trailing garbage plus per-message reports. |
| validateFile(path, opts?) | Node-only helper: reads the file via node:fs/promises then delegates to validateBuffer. |
| encodePreEncoded(meta, objects, opts?) | Wrap already-encoded bytes verbatim into a wire-format message. The library still validates descriptor structure and stamps a fresh hash. |
| StreamingEncoder | Frame-at-a-time construction. Two modes: buffered (default, finish() returns the complete Uint8Array) or streaming via an opts.onBytes callback (bytes flow through the callback as they’re produced; finish() returns an empty Uint8Array). |
| TensogramFile#append | Append a new message to a file opened via TensogramFile.open(path). Node-only. |
StreamingEncoder in streaming mode (no full-message buffering)
For browser uploads, WebSocket pushes, or any sink that needs bytes as
soon as they are produced, pass an onBytes callback to the
StreamingEncoder constructor:
const enc = new StreamingEncoder({ version: 2 }, {
onBytes: (chunk) => uploadSocket.send(chunk), // e.g. WebSocket.send
});
enc.writeObject(descriptor, new Float32Array([1, 2, 3]));
enc.finish(); // flushes footer; returns empty Uint8Array in streaming mode
enc.close();
Semantics:
- The callback is invoked during construction (preamble + header metadata frame), during each writeObject / writeObjectPreEncoded (one data-object frame’s bytes, potentially across multiple invocations), and during finish() (footer frames + postamble).
- Concatenating every chunk the callback sees (in order) yields a message byte-for-byte identical to what buffered mode would return. Tested via round-trip with decode().
- The callback must be synchronous — Promise return values are silently discarded because the Rust/WASM writer contract is synchronous. Buffer internally first if you need async work.
- Each chunk is JS-owned and fresh per invocation. Copy (new Uint8Array(chunk) or chunk.slice()) if you need to keep it past the next writeObject — the underlying ArrayBuffer is invalidated when WASM memory grows.
- If the callback throws, the exception surfaces as an IoError on the next writeObject / finish. The encoder state is undefined after an error — call close() and start over.
- enc.streaming (getter) reports whether an onBytes sink was supplied — useful for code that needs to branch on mode.
Parity note: the Rust core StreamingEncoder<W: Write> has always
supported arbitrary sinks; the WASM/TS surface now exposes this
capability to JS code. Python / FFI / C++ bindings remain
buffered-only; extending them would follow the same JsCallbackWriter
pattern with a language-specific sink abstraction and is tracked in
plans/TYPESCRIPT_WRAPPER.md.
First-class half-precision and complex dtypes
Scope C also upgraded the dtype dispatch in typedArrayFor.
obj.data() now returns a first-class view for dtypes JS does not
have a native TypedArray for:
| Dtype | data() return type |
|---|---|
| float16 | Float16Array (native when available) or Float16Polyfill (TC39-accurate) |
| bfloat16 | Bfloat16Array — 1-8-7 layout, round-to-nearest-even narrowing |
| complex64 / complex128 | ComplexArray — .real(i), .imag(i), .get(i) → {re, im}, iteration |
All three classes expose .bits / .data for zero-copy access to the
underlying raw storage if you need it.
const m = decode(buf);
const f16 = m.objects[0].data(); // Float16Array or polyfill
const asFloat32 = f16.toFloat32Array(); // widened copy
const bits = f16.bits; // raw binary16
const cplx = m.objects[1].data() as ComplexArray;
for (let i = 0; i < cplx.length; i++) {
console.log(cplx.real(i), cplx.imag(i));
}
The polyfill is used automatically when the host runtime does not
ship globalThis.Float16Array. hasNativeFloat16Array() and
getFloat16ArrayCtor() expose the detection machinery for callers
that want direct control.
Breaking change from Scope B: before Scope C, obj.data() on float16 / bfloat16 returned a raw Uint16Array of bits, and complex dtypes returned an interleaved Float32Array / Float64Array. Consumers that relied on that shape can reach the same bytes via .bits (for f16/bf16) or .data (for complex).
The low-level bit-conversion helpers (halfBitsToFloat,
floatToHalfBits, bfloat16BitsToFloat, floatToBfloat16Bits) and
the isComplexDtype type-guard are internal and are not re-exported
from @ecmwf/tensogram. Callers that need bit-level manipulation
should grab the raw storage from a view’s .bits / .data accessor
and do the conversion themselves, or import directly from
@ecmwf/tensogram/float16, …/bfloat16, …/complex with the
understanding that these module paths are not part of the stable API.
Examples
See examples/typescript/ in the repository for runnable scripts:
- 01_encode_decode.ts — basic round-trip
- 02_mars_metadata.ts — per-object metadata using the MARS vocabulary
- 02b_generic_metadata.ts — per-object metadata using a generic application namespace
- 03_multi_object.ts — multiple dtypes in one message
- 04_decode_range.ts — partial sub-tensor decode
- 05_streaming_fetch.ts — progressive decode over a ReadableStream
- 06_file_api.ts — TensogramFile over Node fs, fetch, and in-memory bytes
- 07_hash_and_errors.ts — hash verification and typed errors
- 08_validate.ts — validate(buf) + validateFile(path)
- 11_encode_pre_encoded.ts — wrap already-encoded bytes
- 12_streaming_encoder.ts — frame-at-a-time encoder with preceders
- 13_range_access.ts — lazy TensogramFile.fromUrl over HTTP Range
- 14_streaming_callback.ts — StreamingEncoder with onBytes callback sink
Run them with:
cd examples/typescript
npm install
npx tsx 01_encode_decode.ts # or any other file
Design notes
See plans/TYPESCRIPT_WRAPPER.md for the full design document covering
architecture, phases, test strategy, memory model, and open follow-ups.
Cross-language parity
This TypeScript package decodes the same golden .tgm files used
by the Rust, Python, and C++ test suites. The committed files at
rust/tensogram/tests/golden/*.tgm are decoded by each language’s
test runner; any drift in wire-format semantics fails all four suites.
Specifically, typescript/tests/golden.test.ts decodes:
- simple_f32.tgm — single-object Float32 round-trip
- multi_object.tgm — mixed-dtype message (f32 / i64 / u8)
- mars_metadata.tgm — MARS keys under base[0].mars
- multi_message.tgm — two concatenated messages (via scan())
- hash_xxh3.tgm — verifyHash success + tamper detection
typescript/tests/property.test.ts and the Scope-C dtype suites add
fast-check property tests pinning:
- mapTensogramError never throws for any finite-string input and always returns a TensogramError subclass;
- encode → decode is bit-exact for random Float32 shapes across random application metadata;
- decode on random byte input either succeeds with a structurally valid message or throws a typed TensogramError — never panics;
- float32 → float16 → float32 round-trip stays within half-precision ulp for any random value in a reasonable magnitude band;
- float32 → bfloat16 → float32 round-trip stays within bfloat16 ulp;
- complex64 encode → decode preserves real(i) / imag(i) byte-for-byte across random shapes and values.
The CI typescript job rebuilds and runs every TS test on every PR.
Tensoscope
Tensoscope is an interactive web viewer for .tgm files. It runs entirely in the
browser — no server-side component — by decoding data via the @ecmwf/tensogram
WebAssembly package.
Quick start
Build the WASM package first, then start the dev server:
cd typescript && make ts-build
cd tensoscope && npm install && npm run dev
Open http://localhost:5173 in your browser, then drag-and-drop a .tgm file onto
the page or paste a URL into the file open dialog.
Loading a file
Two modes are supported:
- Local file — drag the .tgm file onto the drop zone, or click Open file.
- Remote URL — paste an HTTP/HTTPS URL. The file is fetched in full before scanning. (HTTP Range support for lazy loading is planned.)
Once loaded, Tensoscope scans all messages and builds a field index without decoding any payloads.
Field browser
The left sidebar lists every decodable field in the file. Each entry shows:
- Variable name (resolved from mars.param, name, or param metadata keys)
- Shape and dtype
Click a field to decode it and render it on the map.
Map view
Fields with two spatial dimensions (latitude × longitude) are rendered as a coloured overlay on an interactive map. Regridding from the unstructured source grid onto the display pixel grid runs in a web worker so the UI stays responsive while large arrays are processed.
Projections
Switch between flat (Mercator, powered by MapLibre GL JS) and globe (3D sphere, powered by CesiumJS with OpenStreetMap base tiles) using the projection picker in the bottom-left of the map. Camera position is preserved when switching between the two renderers.
Render modes
A Heatmap / Contours toggle in the top-left of the map switches between two rendering styles:
- Heatmap — smooth continuous gradient from the active colour scale. Pixel colours are interpolated linearly across the data range.
- Contours — filled colour bands (like matplotlib.contourf). The data range is divided into N discrete bands, where N is the number of colour steps in the active palette (default 10 for continuous palettes; the stop count for custom palettes). Each band is rendered with a single solid colour.
Colour scale
The colour bar at the bottom of the map shows the current field range. Use the colour scale controls to:
- Change the colour map (perceptually uniform maps from d3-scale-chromatic)
- Lock or reset the min/max range
Animation
For files with a time or step dimension, the step slider appears below the map. Use play/pause to animate through steps at a fixed frame rate.
Docker deployment
cd tensoscope
make build # build the container image
make run # serve at http://localhost:8000
BASE_PATH=/scope make run # serve under a subpath
The image uses nginx and accepts a BASE_PATH environment variable for subpath
deployments behind a reverse proxy.
Known limitations
- Only lat/lon grids are currently regridded; polar stereographic and other projections are not yet handled.
- 3D fields (pressure levels) cannot yet be sliced via the level selector (the UI component exists but is not yet wired up).
- HTTP Range-based lazy loading is not yet implemented; the full file is fetched before any field can be displayed.
xarray Integration
The tensogram-xarray package provides a read-only xarray backend engine
for .tgm files. Once installed, you can open tensogram data with:
import xarray as xr
ds = xr.open_dataset("data.tgm", engine="tensogram")
This chapter explains the conversion philosophy, the mapping rules, and walks through progressively complex examples so you know exactly what to expect – and what to provide – when loading tensogram data into xarray.
Philosophy: Why Mapping is Needed
Tensogram and xarray have fundamentally different data models:
| Concept | Tensogram | xarray |
|---|---|---|
| Dimensions | Unnamed, positional (shape = [512, 512]) | Named ("x", "y", "latitude", "time") |
| Coordinates | Not built-in; application metadata | Arrays of values labelling each dimension |
| Variables | Data objects, indexed by position | Named DataArrays inside a Dataset |
| Attributes | CBOR maps at message and per-object level | Key-value dicts on Dataset and DataArray |
Tensogram is vocabulary-agnostic by design. The library never interprets
metadata keys – it does not know what "mars.param", "bids.subject", or
"product.name" means. xarray, on the other hand, requires named dimensions
and coordinate arrays to enable its powerful label-based indexing and
alignment.
The tensogram-xarray backend bridges this gap. It applies a set of rules
to translate tensogram structure into xarray structure, and lets you override
those rules when the defaults are not enough.
flowchart LR
A["Tensogram Message"] --> B["tensogram-xarray"]
B --> C["xr.Dataset"]
D["User Mapping<br/>(optional)"] -.-> B
E["Coordinate<br/>Auto-Detection"] -.-> B
The Mapping Pipeline
When you call xr.open_dataset("file.tgm", engine="tensogram"):
- Read metadata – only the CBOR metadata is parsed (no payload decode).
- Detect coordinates – data objects whose name or param matches a known coordinate name (latitude, longitude, time, …) become coordinate arrays.
- Name dimensions – if you provided dim_names, those are used. Otherwise, axes matching a detected coordinate use that coordinate’s name; remaining axes become dim_0, dim_1, …
- Name variables – if you provided variable_key, the value at that metadata path becomes the variable name. Otherwise object_0, object_1, …
- Wrap data lazily – each tensor is backed by a BackendArray that decodes on demand. No payload bytes are read until you access .values.
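The dimension-naming step can be pictured in a few lines. This is an illustrative simplification — the helper name resolve_dims and the coord_sizes mapping are hypothetical, not the backend's API — and it ignores the ambiguity of two detected coordinates with the same length:

```python
def resolve_dims(shape, coord_sizes, dim_names=None):
    """Sketch of the dimension-naming rule.

    coord_sizes maps a detected coordinate name to its length,
    e.g. {"latitude": 5, "longitude": 8}. Hypothetical helper,
    not the tensogram-xarray implementation.
    """
    if dim_names is not None:
        if len(dim_names) != len(shape):
            raise ValueError(f"expected {len(shape)} dim names, got {len(dim_names)}")
        return list(dim_names)
    # match each axis length against the detected coordinate lengths
    by_size = {size: name for name, size in coord_sizes.items()}
    return [by_size.get(size, f"dim_{i}") for i, size in enumerate(shape)]

assert resolve_dims([5, 8], {"latitude": 5, "longitude": 8}) == ["latitude", "longitude"]
assert resolve_dims([6, 10], {}) == ["dim_0", "dim_1"]
```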
Example 1: Simplest Case – Single Object, No Metadata
Creating the file:
import numpy as np
import tensogram
data = np.arange(60, dtype=np.float32).reshape(6, 10)
meta = {"version": 2}
desc = {"type": "ntensor", "shape": [6, 10], "dtype": "float32",
"byte_order": "little", "encoding": "none",
"filter": "none", "compression": "none"}
with tensogram.TensogramFile.create("simple.tgm") as f:
f.append(meta, [(desc, data)])
Opening in xarray:
>>> import xarray as xr
>>> ds = xr.open_dataset("simple.tgm", engine="tensogram")
>>> ds
<xarray.Dataset>
Dimensions: (dim_0: 6, dim_1: 10)
Dimensions without coordinates: dim_0, dim_1
Data variables:
object_0 (dim_0, dim_1) float32 ...
Attributes:
tensogram_version: 2
The data object became a variable named object_0. Dimensions are
auto-generated as dim_0, dim_1. No coordinates – tensogram has
no information to generate them.
Adding dimension names:
>>> ds = xr.open_dataset("simple.tgm", engine="tensogram",
... dim_names=["latitude", "longitude"])
>>> ds["object_0"].dims
('latitude', 'longitude')
Example 2: Single Object with Coordinate Objects
When coordinate arrays are stored as separate data objects in the same message, the backend auto-detects them by name.
Creating the file:
lat = np.linspace(-90, 90, 5, dtype=np.float64)
lon = np.linspace(0, 360, 8, endpoint=False, dtype=np.float64)
temp = np.random.default_rng(42).random((5, 8)).astype(np.float32)
meta = {"version": 2, "base": [
{"name": "latitude"},
{"name": "longitude"},
{"name": "temperature"},
]}
with tensogram.TensogramFile.create("with_coords.tgm") as f:
f.append(meta, [
({"type": "ntensor", "shape": [5], "dtype": "float64", ...}, lat),
({"type": "ntensor", "shape": [8], "dtype": "float64", ...}, lon),
({"type": "ntensor", "shape": [5, 8], "dtype": "float32", ...}, temp),
])
Opening in xarray:
>>> ds = xr.open_dataset("with_coords.tgm", engine="tensogram")
>>> ds
<xarray.Dataset>
Dimensions: (latitude: 5, longitude: 8)
Coordinates:
* latitude (latitude) float64 -90.0 -45.0 0.0 45.0 90.0
* longitude (longitude) float64 0.0 45.0 90.0 135.0 180.0 225.0 270.0 315.0
Data variables:
temperature (latitude, longitude) float32 ...
Attributes:
tensogram_version: 2
How it works:
- Objects with name: "latitude" and name: "longitude" match known coordinate names (case-insensitive).
- They become coordinate arrays on the Dataset.
- The temperature object’s shape (5, 8) matches the sizes of latitude (5) and longitude (8), so its dimensions are automatically resolved to ("latitude", "longitude").
Known Coordinate Names
The following names are recognized (case-insensitive):
| Name | Canonical dimension |
|---|---|
| lat, latitude | latitude |
| lon, longitude | longitude |
| x | x |
| y | y |
| time | time |
| level | level |
| pressure | pressure |
| height | height |
| depth | depth |
| frequency | frequency |
| step | step |
If no matching coordinate objects are found and no dim_names are provided,
dimensions remain generic (dim_0, dim_1, …).
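The case-insensitive lookup amounts to a small alias table. A sketch under assumed names (_COORD_ALIASES and canonical_coord are illustrative — the package does not export its internal table):

```python
# Hypothetical alias table mirroring the documented coordinate names.
_COORD_ALIASES = {
    "lat": "latitude", "latitude": "latitude",
    "lon": "longitude", "longitude": "longitude",
    "x": "x", "y": "y", "time": "time", "level": "level",
    "pressure": "pressure", "height": "height", "depth": "depth",
    "frequency": "frequency", "step": "step",
}

def canonical_coord(name: str):
    """Map an object's name/param to a canonical dimension, or None."""
    return _COORD_ALIASES.get(name.lower())

assert canonical_coord("Lat") == "latitude"
assert canonical_coord("temperature") is None
```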
Example 3: Multi-Object with variable_key
When a message contains multiple data objects, each with per-object metadata
identifying the parameter, you can use variable_key to name the variables.
Creating the file:
t2m = np.ones((3, 4), dtype=np.float32) * 273.15
u10 = np.ones((3, 4), dtype=np.float32) * 5.0
meta = {"version": 2,
"base": [
{"mars": {"class": "od", "date": "20260401", "type": "fc", "param": "2t", "levtype": "sfc"}},
{"mars": {"class": "od", "date": "20260401", "type": "fc", "param": "10u", "levtype": "sfc"}},
],
}
with tensogram.TensogramFile.create("mars.tgm") as f:
f.append(meta, [
({"type": "ntensor", "shape": [3, 4], "dtype": "float32", ...}, t2m),
({"type": "ntensor", "shape": [3, 4], "dtype": "float32", ...}, u10),
])
Without variable_key:
>>> ds = xr.open_dataset("mars.tgm", engine="tensogram")
>>> list(ds.data_vars)
['object_0', 'object_1']
With variable_key:
>>> ds = xr.open_dataset("mars.tgm", engine="tensogram",
... variable_key="mars.param")
>>> list(ds.data_vars)
['2t', '10u']
>>> ds.attrs
{'tensogram_version': 2}
The variable_key supports dotted paths: "mars.param" navigates into
the nested mars dict within each object’s metadata.
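The dotted-path navigation can be sketched as a simple nested-dict walk; this helper (lookup_dotted is a hypothetical name, and the fallback behaviour for missing keys may differ from the backend's):

```python
def lookup_dotted(meta: dict, dotted_path: str):
    """Resolve a dotted path like "mars.param" in per-object metadata."""
    node = meta
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node

assert lookup_dotted({"mars": {"param": "2t"}}, "mars.param") == "2t"
assert lookup_dotted({"name": "temp"}, "mars.param") is None
```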
Example 4: Multi-Message File with Auto-Merge
When a .tgm file contains many messages (one object each) that differ
only in metadata, open_datasets() can stack them along outer dimensions.
Creating the file:
import tensogram_xarray
rng = np.random.default_rng(99)
with tensogram.TensogramFile.create("multi.tgm") as f:
for param in ["2t", "10u"]:
for date in ["20260401", "20260402"]:
data = rng.random((3, 4), dtype=np.float32)
meta = {"version": 2,
"base": [{"mars": {"param": param, "date": date}}]}
desc = {"type": "ntensor", "shape": [3, 4], "dtype": "float32",
"byte_order": "little", "encoding": "none",
"filter": "none", "compression": "none"}
f.append(meta, [(desc, data)])
Opening with open_datasets():
>>> datasets = tensogram_xarray.open_datasets(
... "multi.tgm", variable_key="mars.param"
... )
>>> len(datasets)
1
>>> ds = datasets[0]
>>> list(ds.data_vars)
['2t', '10u']
What happened:
- The scanner read metadata from all 4 messages (no payload decode).
- Objects were grouped by structure: all have shape (3, 4) and dtype float32.
- variable_key="mars.param" split them by parameter: 2t (2 objects) and 10u (2 objects).
- Within each sub-group, mars.date varies across ["20260401", "20260402"], so it became an outer dimension.
- Each variable has shape (2, 3, 4) with a mars.date coordinate.
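The outer-dimension detection within a sub-group amounts to finding metadata keys whose values vary across objects. A simplified sketch (top-level keys only; the real merge logic also checks that the value combinations form a complete hypercube, and varying_keys is a hypothetical name):

```python
def varying_keys(metas):
    """Return metadata keys whose values differ across objects.

    Each such key is a candidate outer dimension for the merge.
    Illustrative sketch, not the backend's merge implementation.
    """
    keys = set().union(*metas)
    return sorted(k for k in keys if len({m.get(k) for m in metas}) > 1)

# the "2t" sub-group from the example: only the date varies
assert varying_keys([
    {"param": "2t", "date": "20260401"},
    {"param": "2t", "date": "20260402"},
]) == ["date"]
```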
Example 5: Heterogeneous File – Auto-Split
When a file contains objects of different shapes or dtypes, they cannot
be merged into a single Dataset. open_datasets() automatically splits
them into compatible groups.
Creating the file:
with tensogram.TensogramFile.create("hetero.tgm") as f:
# Message 0: 2D float32 temperature field
f.append({"version": 2, "base": [{"name": "temp"}]},
[({"type": "ntensor", "shape": [3, 4], "dtype": "float32", ...},
np.ones((3, 4), dtype=np.float32))])
# Message 1: 2D float32 wind field (same shape -- compatible)
f.append({"version": 2, "base": [{"name": "wind"}]},
[({"type": "ntensor", "shape": [3, 4], "dtype": "float32", ...},
np.ones((3, 4), dtype=np.float32) * 2)])
# Message 2: 1D int32 counts (different shape AND dtype -- incompatible)
f.append({"version": 2, "base": [{"name": "counts"}]},
[({"type": "ntensor", "shape": [5], "dtype": "int32", ...},
np.array([1, 2, 3, 4, 5], dtype=np.int32))])
Opening:
>>> datasets = tensogram_xarray.open_datasets("hetero.tgm")
>>> len(datasets)
2
>>> datasets[0] # The (3, 4) float32 group
<xarray.Dataset>
Dimensions: (dim_0: 3, dim_1: 4)
Data variables:
temp (dim_0, dim_1) float32 ...
wind (dim_0, dim_1) float32 ...
>>> datasets[1] # The (5,) int32 group
<xarray.Dataset>
Dimensions: (dim_0: 5)
Data variables:
counts (dim_0) int32 ...
Objects that share (shape, dtype) are grouped together. Incompatible
objects go to separate Datasets.
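The grouping rule can be sketched as a bucketing by (shape, dtype); the function name group_objects and the descriptor shape here are illustrative, not the package's API:

```python
from collections import defaultdict

def group_objects(objects):
    """Group object descriptors into merge-compatible sets by (shape, dtype)."""
    groups = defaultdict(list)
    for obj in objects:
        groups[(tuple(obj["shape"]), obj["dtype"])].append(obj)
    return list(groups.values())

objs = [
    {"name": "temp",   "shape": [3, 4], "dtype": "float32"},
    {"name": "wind",   "shape": [3, 4], "dtype": "float32"},
    {"name": "counts", "shape": [5],    "dtype": "int32"},
]
groups = group_objects(objs)
assert len(groups) == 2  # the (3,4) float32 pair, and the (5,) int32 object
assert [o["name"] for o in groups[0]] == ["temp", "wind"]
```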
Example 6: Providing Full User Mapping
For complete control, pass all mapping parameters:
ds = xr.open_dataset(
"forecast.tgm",
engine="tensogram",
dim_names=["latitude", "longitude"],
variable_key="mars.param",
message_index=0, # which message in a multi-message file
verify_hash=True, # verify xxh3 integrity on decode
)
| Parameter | Type | Effect |
|---|---|---|
| dim_names | list[str] | Names for the innermost tensor axes (positional) |
| variable_key | str | Dotted path in per-object metadata for variable naming |
| message_index | int | Which message to open (default 0) |
| merge_objects | bool | If True, calls open_datasets() and returns first result |
| verify_hash | bool | Verify xxh3 hashes during decode |
| drop_variables | list[str] | Variables to exclude from the Dataset |
| range_threshold | float | Fraction of total elements below which partial reads are used (default 0.5) |
For multi-message files, use tensogram_xarray.open_datasets() directly:
import tensogram_xarray
datasets = tensogram_xarray.open_datasets(
"forecast.tgm",
dim_names=["latitude", "longitude"],
variable_key="mars.param",
verify_hash=True,
)
Example 7: Lazy Loading with Dask
Data is always loaded lazily. Opening a file only reads metadata – the tensor payloads are decoded on first access. This enables working with larger-than-memory files via dask.
# Open with dask chunking
ds = xr.open_dataset("large.tgm", engine="tensogram", chunks={})
print(ds["object_0"])
# <xarray.DataArray 'object_0' (dim_0: 10000, dim_1: 10000)>
# dask.array<...>
# Compute a mean without loading the full array
mean = ds["object_0"].mean().compute()
See also: Dask Integration for a complete walkthrough with distributed computation, performance tuning, and a runnable 4-D tensor example.
When Partial Reads Are Used
The backend inspects each data object’s encoding pipeline to determine
whether partial reads via decode_range(join=False) are available:
| Compression | Filter | Partial Read? | Mechanism |
|---|---|---|---|
| none | none | Yes | Direct byte offset |
| szip | none | Yes | RSI block offset seeking |
| blosc2 | none | Yes | Independent chunk decompression |
| zfp (fixed_rate) | none | Yes | Fixed-size blocks, computable offsets |
| zfp (fixed_precision) | none | No | Variable-size blocks |
| zfp (fixed_accuracy) | none | No | Variable-size blocks |
| zstd | none | No | Stream compressor |
| lz4 | none | No | Stream compressor |
| sz3 | none | No | Stream compressor |
| Any | shuffle | No | Byte rearrangement breaks contiguous ranges |
When partial reads are available, slicing a lazy array decodes only the requested region:
ds = xr.open_dataset("szip_data.tgm", engine="tensogram")
# Only the bytes for rows 100-110 are decompressed:
subset = ds["object_0"][100:110, :].values
When partial reads are not available (stream compressors or shuffle filter), the full object is decoded and then sliced in memory. This is transparent to the user – the API is identical.
N-Dimensional Slice Mapping
When you slice a lazy xarray variable backed by tensogram, the backend must
convert an N-dimensional slice into flat element ranges that decode_range()
understands. Here is how the decomposition works:
- Find the split point – scan the slice dimensions from innermost to outermost and take the first (innermost) dimension whose slice does not cover the full axis. All dimensions inner to this point are fully covered, so they are contiguous in memory and form a single block per outer-index combination.
- Compute the contiguous block size – multiply the split dimension’s slice width by the full lengths of all dimensions inner to it. This gives the number of elements in each flat range.
- Generate one range per outer-index combination – iterate over the Cartesian product of sliced indices in all dimensions outer to the split point. Each combination produces one (offset, count) pair.
- Merge adjacent ranges – if two consecutive ranges abut in the flat layout (i.e. offset_i + count_i == offset_{i+1}), they are merged into a single wider range to reduce I/O calls.
Concrete example: an array of shape (100, 200) sliced as [10:20, 50:100]:
- The innermost dimension (axis 1) has slice 50:100 (width 50), which does not cover the full axis (length 200), so it is the split point.
- Contiguous block size = 50 elements (just the inner slice width; no dimensions lie inner to axis 1).
- Outer indices: the axis-0 slice 10:20 gives indices [10, 11, ..., 19] – 10 combinations.
- This produces 10 flat ranges of 50 elements each: (10*200+50, 50), (11*200+50, 50), …, (19*200+50, 50).
- None are adjacent (a gap of 150 elements separates each pair), so no merging occurs.
If the slice were [10:20, :] instead (full inner axis), the split point
moves to axis 0 and the 10 individual ranges of 200 elements each are
adjacent in memory – they merge into a single range (10*200, 10*200).
flowchart TD
A["N-D slice<br/>arr[10:20, 50:100]"] --> B["Find split point<br/>axis 1 (not full)"]
B --> C["Block size = 50"]
C --> D["Outer indices:<br/>axis 0 → [10..19]"]
D --> E["10 ranges of 50 elements"]
E --> F["Merge adjacent?<br/>No — gap of 150"]
F --> G["decode_range()<br/>10 × (offset, 50)"]
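The four steps can be condensed into a short reference sketch. This is illustrative Python (the function name slice_ranges is hypothetical, not the backend's code), assuming step-1 slices already clamped to the array bounds:

```python
from itertools import product

def slice_ranges(shape, slices):
    """Map an N-D slice to flat (offset, count) element ranges.

    Assumes row-major layout and step-1 slices clamped to the bounds.
    """
    n = len(shape)
    strides = [1] * n                      # row-major element strides
    for ax in range(n - 2, -1, -1):
        strides[ax] = strides[ax + 1] * shape[ax + 1]
    # 1. split point: innermost axis whose slice is not the full axis
    split = 0
    for ax in range(n - 1, -1, -1):
        if slices[ax].stop - slices[ax].start != shape[ax]:
            split = ax
            break
    # 2. contiguous block: split-axis slice width x all full inner axes
    block = (slices[split].stop - slices[split].start) * strides[split]
    # 3. one range per combination of outer indices
    ranges = []
    for combo in product(*(range(s.start, s.stop) for s in slices[:split])):
        offset = sum(i * strides[ax] for ax, i in enumerate(combo))
        ranges.append((offset + slices[split].start * strides[split], block))
    # 4. merge ranges that abut in the flat layout
    merged = [ranges[0]]
    for off, cnt in ranges[1:]:
        prev_off, prev_cnt = merged[-1]
        if prev_off + prev_cnt == off:
            merged[-1] = (prev_off, prev_cnt + cnt)
        else:
            merged.append((off, cnt))
    return merged

# The worked example: 10 ranges of 50 elements, none mergeable
assert slice_ranges((100, 200), (slice(10, 20), slice(50, 100)))[0] == (2050, 50)
# Full inner axis: the 10 ranges of 200 merge into one
assert slice_ranges((100, 200), (slice(10, 20), slice(0, 200))) == [(2000, 2000)]
```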
Range Threshold Heuristic
Even when partial reads are technically available, reading many small ranges can be slower than decoding the entire array – especially for compressed data where decompression has fixed overhead per block.
The backend uses a ratio-based heuristic controlled by the
range_threshold parameter (default 0.5):
Rule: partial reads are used only when the total number of requested elements is less than
range_threshold × total_elements.
With the default of 0.5, if you request more than 50% of the array, the
backend falls back to a full decode and slices in memory. Lower values
make the backend more aggressive about using partial reads; higher values
make it prefer full decodes.
# More aggressive partial reads (use when each range is cheap, e.g. uncompressed)
ds = xr.open_dataset("file.tgm", engine="tensogram", range_threshold=0.3)
# Almost always full decode (use when decode overhead is very low)
ds = xr.open_dataset("file.tgm", engine="tensogram", range_threshold=0.9)
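The rule itself is a one-line comparison; a sketch (use_partial_reads is an illustrative name, not an exported function):

```python
def use_partial_reads(requested_elements: int, total_elements: int,
                      range_threshold: float = 0.5) -> bool:
    # Partial reads only pay off when the request is a small fraction
    # of the object; otherwise a single full decode is cheaper.
    return requested_elements < range_threshold * total_elements

# a 10-row slice of a 100x200 field: 2000 of 20000 elements -> partial
assert use_partial_reads(10 * 200, 100 * 200)
# 60% of the array -> full decode, then slice in memory
assert not use_partial_reads(12_000, 20_000)
```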
Installation
uv venv .venv && source .venv/bin/activate # if not already in a virtualenv
uv pip install tensogram-xarray
This pulls in tensogram and xarray as dependencies. The xarray backend
is registered automatically via entry points – no extra configuration needed.
>>> import xarray as xr
>>> "tensogram" in xr.backends.list_engines()
True
For dask support:
source .venv/bin/activate # if not already in the virtualenv
uv pip install "tensogram-xarray[dask]"
Error Handling
The backend reports errors with enough context for diagnosis. Common error scenarios and their messages:
| Scenario | Error type | Message includes |
|---|---|---|
| File not found | OSError | File path (from OS) |
| Negative message_index | ValueError | "message_index must be >= 0, got -1" |
| message_index out of range | ValueError | Index and file message count |
| dim_names length mismatch | ValueError | Actual vs expected count |
| Unsupported dtype | TypeError | "unsupported tensogram dtype 'foo'" |
| decode_range failure | Falls back to decode_object | Warning logged at DEBUG level with file, message, object, and cause |
| Incomplete hypercube in merge | ValueError | Which coordinate combination is missing |
| Silent data loss in merge | WARNING log | Variable name and count of dropped objects |
| Hash verification failure | ValueError | Object index and expected/actual hash |
| Conflicting coordinate objects | ValueError | Dimension name and mismatch details |
Hash Verification and Partial Reads
When verify_hash=True is passed, xxh3 hash verification is performed on
full object reads (decode_object) only. Partial reads via
decode_range() intentionally skip hash verification because:
- Partial reads decode only a subset of the payload, so the full-object hash cannot be validated.
- The purpose of partial reads is to minimise I/O; verifying the hash would require reading the entire payload, defeating the optimisation.
This means that for lazily-loaded arrays, hash verification happens when
a slice triggers a full-object decode (i.e. when the requested fraction
exceeds range_threshold), but not when partial decode_range() is used.
Logging
The backend uses Python’s standard logging module. To see partial-read
fallback diagnostics:
import logging
logging.getLogger("tensogram_xarray").setLevel(logging.DEBUG)
To see merge data-loss warnings (enabled by default at WARNING level):
import logging
logging.basicConfig(level=logging.WARNING)
Dask Integration
Tensogram supports Dask natively through its xarray
backend. When you open a .tgm file with chunks={}, xarray wraps every
tensor variable in a dask.array.Array. No data is read from disk until
you call .compute() or .values.
import xarray as xr
ds = xr.open_dataset("forecast.tgm", engine="tensogram", chunks={})
# ds["temperature"].data is now a dask.array -- zero I/O so far
mean = ds["temperature"].mean().compute() # data decoded here
This chapter explains how the integration works, walks through a complete example with distributed computation, and covers the performance knobs you can tune.
How It Works
The tensogram xarray backend implements BackendArray, xarray’s lazy-loading
protocol. When dask requests a chunk, the backend:
- Opens the .tgm file and reads the raw message bytes.
- For small slices on compressors that support random access (none, szip, blosc2, zfp fixed-rate): maps the N-D slice to flat byte ranges and decodes only those ranges via decode_range().
- For large slices or stream compressors: falls back to a full decode_object() and slices in memory.
The BackendArray stores only the file path (no open handles), making it
pickle-safe for dask multiprocessing and distributed execution.
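Why path-only state matters can be shown with a toy stand-in (the class name and fields here are hypothetical, not the backend's BackendArray): an object holding only a path and indices round-trips through pickle, whereas one holding an open file handle would not.

```python
import pickle

class LazyTensorStub:
    """Toy stand-in for a lazy backend array.

    Holding only a path and indices -- never an open file handle --
    keeps instances trivially picklable, which is what lets dask
    ship them to worker processes.
    """
    def __init__(self, path, message_index, object_index):
        self.path = path
        self.message_index = message_index
        self.object_index = object_index

clone = pickle.loads(pickle.dumps(LazyTensorStub("/data/fc.tgm", 0, 2)))
assert (clone.path, clone.message_index, clone.object_index) == ("/data/fc.tgm", 0, 2)
```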
flowchart LR
A["xr.open_dataset<br/>chunks={}"] --> B["BackendArray<br/>(lazy, pickle-safe)"]
B --> C["dask.array.Array"]
C -->|".compute()"| D["decode_range()<br/>or decode_object()"]
D --> E["numpy.ndarray"]
Chunking Strategies
| chunks value | Behaviour |
|---|---|
| {} | Automatic: one chunk per tensor object (most common) |
| {"latitude": 100} | Split along latitude every 100 elements |
| {"latitude": 100, "longitude": 200} | Split along both axes |
For tensogram files, chunks={} is usually the right choice because
each data object is already a self-contained tensor. Finer chunking
adds overhead from repeated file opens.
Complete Example: Distributed Statistics over 4-D Tensors
This walkthrough corresponds to examples/python/09_dask_distributed.py.
It creates 4 .tgm files representing a 4-D temperature field
(time x level x latitude x longitude), then computes statistics
entirely through dask’s lazy execution.
Step 1: Create the Data Files
Each file contains 10 data objects (one per pressure level) plus latitude and longitude coordinate arrays:
import numpy as np
import tensogram
def _desc(shape, dtype="float32", **extra):
return {
"type": "ntensor", "shape": list(shape), "dtype": dtype,
"byte_order": "little", "encoding": "none",
"filter": "none", "compression": "none", **extra,
}
LEVEL_VALUES = [1000, 925, 850, 700, 500, 400, 300, 200, 100, 50]
NLAT, NLON = 36, 72
with tensogram.TensogramFile.create("temperature_20260401.tgm") as f:
lat = np.linspace(-87.5, 87.5, NLAT, dtype=np.float64)
lon = np.linspace(0, 355, NLON, dtype=np.float64)
objects = [
(_desc([NLAT], dtype="float64", name="latitude"), lat),
(_desc([NLON], dtype="float64", name="longitude"), lon),
]
rng = np.random.default_rng(42)
for level_hpa in LEVEL_VALUES:
field = rng.random((NLAT, NLON)).astype(np.float32)
desc = _desc([NLAT, NLON], name=f"temperature_{level_hpa}hPa")
objects.append((desc, field))
f.append({"version": 2}, objects)
Step 2: Open with Dask Lazy Loading
The critical parameters are engine="tensogram" and chunks={}:
import xarray as xr
import tensogram_xarray # registers the engine
ds = xr.open_dataset(
"temperature_20260401.tgm",
engine="tensogram",
variable_key="name", # name variables from descriptor "name" field
chunks={}, # enable dask lazy loading
)
At this point:
- No tensor data has been decoded. Only CBOR metadata was read.
- Each variable is a
dask.array.Array:
>>> type(ds["temperature_1000hPa"].data)
<class 'dask.array.core.Array'>
>>> ds["temperature_1000hPa"].shape
(36, 72)
>>> ds["temperature_1000hPa"].chunks
((36,), (72,))
Step 3: Build a 4-D Tensor from Multiple Files
Stack variables across levels within each file, then stack files across time:
import dask
import dask.array as da
# Open all 4 files
paths = ["temperature_20260401.tgm", "temperature_20260402.tgm",
"temperature_20260403.tgm", "temperature_20260404.tgm"]
datasets = [
xr.open_dataset(p, engine="tensogram", variable_key="name", chunks={})
for p in paths
]
# Stack levels within each file, then stack across time
# Build in LEVEL_VALUES order (not alphabetical) so axis matches labels
temp_vars = [f"temperature_{lev}hPa" for lev in LEVEL_VALUES]
all_timesteps = []
for ds in datasets:
    level_arrays = [ds[v].data for v in temp_vars]
    all_timesteps.append(da.stack(level_arrays, axis=0))
full_4d = da.stack(all_timesteps, axis=0)
# Shape: (4, 10, 36, 72) -- (time, level, lat, lon)
# Still lazy -- zero I/O
Step 4: Compute Statistics with Dask
Schedule multiple computations, then execute them in a single
dask.compute() call:
# Schedule (lazy -- no computation yet)
global_mean = full_4d.mean()
global_std = full_4d.std()
global_min = full_4d.min()
global_max = full_4d.max()
# Execute all at once (data decoded from .tgm files here)
mean_val, std_val, min_val, max_val = dask.compute(
    global_mean, global_std, global_min, global_max
)
print(f"Mean: {mean_val:.2f} K")
print(f"Std: {std_val:.2f} K")
print(f"Min: {min_val:.2f} K")
print(f"Max: {max_val:.2f} K")
Step 5: Selective Lazy Loading
Only the data you touch is decoded. Slicing the 4-D array triggers decoding of just the relevant chunks:
# Single point: backend uses decode_range() for the tiny slice
# (1 element out of 2592 = 0.04%, well below the 50% threshold)
point = full_4d[0, 0, 18, 0].compute()
# One pressure level across all times: touches 4 backing arrays
level_400 = full_4d[:, 5, :, :].mean().compute()
# Equatorial band: partial range decode for the selected rows
equatorial = full_4d[0, 0, 9:27, :].mean().compute()
Performance Tuning
The range_threshold Parameter
When dask requests a slice, the backend decides between partial decode
(decode_range()) and full decode (decode_object()) based on the
fraction of requested elements:
Rule: partial reads are used when
requested_elements / total_elements <= range_threshold
| range_threshold | Behaviour |
|---|---|
| 0.3 | Aggressive partial reads (good for uncompressed data) |
| 0.5 (default) | Balanced: partial below 50%, full above |
| 0.9 | Almost always full decode (good for fast decompressors) |
# More aggressive partial reads
ds = xr.open_dataset("file.tgm", engine="tensogram",
                     chunks={}, range_threshold=0.3)
# Almost always full decode
ds = xr.open_dataset("file.tgm", engine="tensogram",
                     chunks={}, range_threshold=0.9)
Which Compressors Support Partial Reads?
| Compression | Partial Read? | Notes |
|---|---|---|
| none | Yes | Direct byte offset |
| szip | Yes | RSI block seeking |
| blosc2 | Yes | Independent chunk decompression |
| zfp (fixed_rate) | Yes | Fixed-size blocks |
| zfp (other modes) | No | Variable-size blocks |
| zstd | No | Stream compressor |
| lz4 | No | Stream compressor |
| sz3 | No | Stream compressor |
The shuffle filter also disables partial reads (byte rearrangement
breaks contiguous ranges). The fallback is always transparent: the
full object is decoded and sliced in memory.
Dask Scheduler Choice
Tensogram’s backend is thread-safe (uses a threading.Lock per array).
All three dask schedulers work:
# Synchronous (debugging)
dask.config.set(scheduler="synchronous")
# Threaded (default, good for I/O-bound work)
dask.config.set(scheduler="threads")
# Multiprocessing (BackendArray is pickle-safe)
dask.config.set(scheduler="processes")
For large-scale work, dask.distributed also works because the
BackendArray stores only the file path (no unpicklable state).
Thread Safety
The TensogramBackendArray uses a per-array threading.Lock to
serialise file I/O. This means:
- Multiple dask tasks can read different variables concurrently.
- Reads to the same variable are serialised (no concurrent file opens for the same array).
- The lock is excluded from pickle state and recreated on deserialise.
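The lock-handling pattern can be sketched in plain Python. LockedArray below is a hypothetical stand-in for TensogramBackendArray, showing how a non-picklable lock is dropped from pickle state and recreated on deserialisation:

```python
import pickle
import threading

class LockedArray:
    """Illustrative stand-in, not the actual TensogramBackendArray."""

    def __init__(self, path):
        self.path = path
        self._lock = threading.Lock()

    def __getstate__(self):
        state = self.__dict__.copy()
        del state["_lock"]          # locks are not picklable
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._lock = threading.Lock()  # fresh lock in the new process

arr = LockedArray("data.tgm")
clone = pickle.loads(pickle.dumps(arr))
assert clone.path == "data.tgm" and clone._lock is not arr._lock
```

Because only the file path travels through pickle, the same object works under the threaded, multiprocessing, and distributed schedulers.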
Installation
For dask support, install the optional dependency:
uv venv .venv && source .venv/bin/activate # if not already in a virtualenv
uv pip install "tensogram-xarray[dask]"
This pulls in dask[array] alongside tensogram and xarray.
Debugging
Enable debug logging to see when partial reads are used vs full decodes:
import logging
logging.getLogger("tensogram_xarray").setLevel(logging.DEBUG)
You will see messages like:
DEBUG:tensogram_xarray.array:decode_range failed for forecast.tgm msg=0 obj=2,
falling back to full decode: RangeNotSupported
This is expected for stream compressors and is not an error.
Error Handling
When Errors Are Raised
| When | What | Error type |
|---|---|---|
| open_dataset() | File not found | OSError with file path |
| open_dataset() | message_index negative | ValueError with index |
| open_dataset() | message_index out of range | ValueError with index and count |
| open_dataset() | dim_names length mismatch | ValueError with actual vs expected |
| open_dataset() | Unsupported dtype | TypeError with dtype name |
| .compute() | Decode failure | ValueError or RuntimeError from tensogram |
| .compute() | Hash mismatch (with verify_hash=True) | ValueError with object index |
| .compute() | File moved/deleted after open | OSError from OS |
Key design point: errors in metadata (file not found, bad index, wrong
dim_names) surface immediately at open_dataset() time. Errors in data
decoding surface at .compute() time because payloads are lazy-loaded.
Partial Read Fallback
When decode_range() fails (e.g. unsupported compressor for partial reads),
the backend catches the error and falls back to full decode_object():
except (ValueError, RuntimeError, OSError) as exc:
    logger.debug("decode_range failed ... falling back to full decode: %s", exc)
This fallback is transparent — the user gets correct data regardless. Enable
DEBUG logging to see when fallbacks occur.
Dask Worker Errors
File paths are automatically resolved to absolute paths when the dataset is opened. This prevents “file not found” errors when dask sends work to processes with a different working directory.
If a dask worker encounters a decode error, it propagates through dask’s error handling. The traceback will show the tensogram error with file path, message index, and object index for diagnosis.
Edge Cases
Ambiguous Dimension Matching
When coordinate arrays have the same size (e.g. both latitude and
longitude have 360 elements), the backend cannot distinguish them by
shape alone. The first match gets the coordinate name; the second falls
back to a generic dim_N.
Workaround: pass explicit dim_names to disambiguate:
ds = xr.open_dataset("file.tgm", engine="tensogram",
dim_names=["latitude", "longitude"], chunks={})
Stacking Files with Different Variables
When stacking multiple .tgm files into a single dask array, verify
that every dataset contains the expected variables before stacking:
temp_vars = [f"temperature_{lev}hPa" for lev in LEVEL_VALUES]
for i, ds in enumerate(datasets):
    missing = [v for v in temp_vars if v not in ds.data_vars]
    if missing:
        raise KeyError(f"Dataset {i} missing: {missing}")
Otherwise da.stack() will fail with a confusing KeyError from
a deep dask callback.
Zero-Object Messages
A .tgm file containing only metadata frames (no data objects) returns
an empty xr.Dataset with no variables. This is valid and does not
raise an error.
Scalar (0-D) Tensors
Data objects with shape=() (zero dimensions) are supported. They
become scalar xr.Variable objects in the dataset.
Hash Verification with Partial Reads
When verify_hash=True is set, hash verification only runs on full
object reads (via decode_object()). Partial reads via
decode_range() skip verification because only a subset of the payload
is decoded. This means:
- Large slices (above range_threshold) trigger full decode with hash verification.
- Small slices use decode_range() without hash verification.
This is by design. If you need guaranteed hash verification on every
access, set range_threshold=0.0 to force full decodes.
Zarr v3 Backend
The tensogram-zarr package implements a Zarr v3 Store backed by .tgm files. This lets you read and write Tensogram data through the standard Zarr Python API.
Installation
uv venv .venv && source .venv/bin/activate # if not already in a virtualenv
uv pip install tensogram-zarr
Requires zarr >= 3.0, tensogram, and numpy.
Reading a .tgm file through Zarr
import zarr
from tensogram_zarr import TensogramStore
# Open existing .tgm file as a read-only Zarr store
store = TensogramStore.open_tgm("data.tgm")
root = zarr.open_group(store=store, mode="r")
# Browse available arrays
for name, arr in root.members():
    print(f"{name}: shape={arr.shape}, dtype={arr.dtype}")
# Read an array (decoded eagerly at store open, served from memory)
temperature = root["2t"][:]
print(temperature.shape, temperature.mean())
# Access group-level metadata (from GlobalMetadata _extra_)
# The example below shows a MARS namespace; the attributes dict reflects
# whatever namespaces the producer put in the message's GlobalMetadata.
print(root.attrs["mars"]) # {'class': 'od', 'type': 'fc', ...}
How the mapping works
Each .tgm message maps to a Zarr group:
zarr.json # root group ← GlobalMetadata
temperature/zarr.json # array metadata ← DataObjectDescriptor
temperature/c/0/0 # chunk data ← decoded object payload
pressure/zarr.json # another array
pressure/c/0/0 # its chunk data
graph LR
TGM[".tgm file"] --> GM["GlobalMetadata"]
TGM --> OBJ1["Object 0: temperature"]
TGM --> OBJ2["Object 1: pressure"]
GM --> GZJ["zarr.json (group)"]
OBJ1 --> AZJ1["temperature/zarr.json"]
OBJ1 --> CHK1["temperature/c/0/0"]
OBJ2 --> AZJ2["pressure/zarr.json"]
OBJ2 --> CHK2["pressure/c/0/0"]
Key design decisions:
- Each TGM data object becomes one Zarr array with a single chunk (chunk shape = array shape)
- Variable names are resolved from metadata via a default lookup path (name, mars.param, param, mars.shortName, shortName), or a custom dot-path you supply
- TGM encoding metadata is preserved in Zarr array attributes under _tensogram_* keys
- Duplicate variable names get a numeric suffix (field, field_1)
Variable naming
By default, the store tries these metadata paths, in order, to name arrays:
1. name
2. mars.param
3. param
4. mars.shortName
5. shortName
6. Falls back to object_<index>
You can override with any dot-path, including non-MARS vocabularies:
# Weather pipeline using MARS
store = TensogramStore.open_tgm("weather.tgm", variable_key="mars.param")
# Neuroimaging pipeline using BIDS
store = TensogramStore.open_tgm("scans.tgm", variable_key="bids.task")
# Custom vocabulary
store = TensogramStore.open_tgm("data.tgm", variable_key="product.name")
Multi-message files
By default the store reads message 0. Select a different message with message_index:
store = TensogramStore.open_tgm("multi.tgm", message_index=2)
Writing a .tgm file through Zarr
import numpy as np
import zarr
from tensogram_zarr import TensogramStore
store = TensogramStore("output.tgm", mode="w")
root = zarr.open_group(store=store, mode="w")
# Create arrays — data is buffered in memory
root.create_array("temperature", data=np.random.rand(100, 200).astype(np.float32))
root.create_array("pressure", data=np.array([1000, 925, 850, 700], dtype=np.float64))
# Close flushes to .tgm
store.close()
The write path assembles all arrays into a single TGM message when the store is closed.
Context manager
with TensogramStore("data.tgm", mode="r") as store:
    root = zarr.open_group(store=store, mode="r")
    data = root["temperature"][:]
# Store automatically closed
Supported data types
| Tensogram dtype | Zarr data_type | NumPy dtype |
|---|---|---|
| float16 | float16 | float16 |
| float32 | float32 | float32 |
| float64 | float64 | float64 |
| int8 | int8 | int8 |
| int16 | int16 | int16 |
| int32 | int32 | int32 |
| int64 | int64 | int64 |
| uint8 | uint8 | uint8 |
| uint16 | uint16 | uint16 |
| uint32 | uint32 | uint32 |
| uint64 | uint64 | uint64 |
| complex64 | complex64 | complex64 |
| complex128 | complex128 | complex128 |
| bitmask | uint8 | uint8 |
Byte range support
The store supports Zarr’s ByteRequest types for efficient partial reads:
- RangeByteRequest(start, end) — read a byte range
- OffsetByteRequest(offset) — read from offset to end
- SuffixByteRequest(suffix) — read the last N bytes
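The semantics of the three request types can be illustrated against a plain bytes buffer. The dataclasses below are local stand-ins for Zarr's types, defined here so the sketch is self-contained:

```python
from dataclasses import dataclass

# Local stand-ins for zarr's ByteRequest variants (illustration only).
@dataclass
class RangeByteRequest:
    start: int
    end: int

@dataclass
class OffsetByteRequest:
    offset: int

@dataclass
class SuffixByteRequest:
    suffix: int

def apply_byte_request(chunk: bytes, req) -> bytes:
    """Serve a byte request from a chunk's raw bytes."""
    if isinstance(req, RangeByteRequest):
        return chunk[req.start:req.end]
    if isinstance(req, OffsetByteRequest):
        return chunk[req.offset:]
    if isinstance(req, SuffixByteRequest):
        return chunk[-req.suffix:]
    raise TypeError(type(req).__name__)  # mirrors the store's TypeError

payload = b"0123456789"
print(apply_byte_request(payload, RangeByteRequest(2, 5)))  # b"234"
print(apply_byte_request(payload, OffsetByteRequest(7)))    # b"789"
print(apply_byte_request(payload, SuffixByteRequest(3)))    # b"789"
```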
Comparison with tensogram-xarray
| Feature | tensogram-zarr | tensogram-xarray |
|---|---|---|
| API level | Low-level (Zarr Store) | High-level (xarray engine) |
| Dimensions | Generic (dim_0, dim_1) | Named (lat, lon, time) |
| Coordinates | Not interpreted | Auto-detected from metadata |
| Multi-message | One message per store | Auto-merge into hypercubes |
| Write support | Yes | No |
| Data loading | Eager (all at open) | Lazy (on-demand decode_range) |
Use tensogram-zarr when you need direct Zarr API access or write support. Use tensogram-xarray when you want automatic coordinate detection and multi-message merging.
Edge cases and limitations
Variable name sanitization
If a metadata value used as a variable name contains / or \, those characters are replaced with _ to prevent spurious directory nesting in the virtual key space. Empty names become _.
mars.param = "temperature/surface" → variable name "temperature_surface"
Duplicate variable names
When multiple objects resolve to the same name, suffixes are appended: field, field_1, field_2, etc.
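Both naming rules — separator sanitization and duplicate suffixing — are easy to sketch in plain Python (illustrative helpers, not the store's actual functions):

```python
def sanitize(name: str) -> str:
    """Replace path separators with "_" and map empty names to "_"."""
    name = name.replace("/", "_").replace("\\", "_")
    return name or "_"

def dedupe(names):
    """Append _1, _2, ... to repeated names."""
    seen, out = {}, []
    for n in names:
        if n in seen:
            seen[n] += 1
            out.append(f"{n}_{seen[n]}")
        else:
            seen[n] = 0
            out.append(n)
    return out

print(sanitize("temperature/surface"))       # temperature_surface
print(dedupe(["field", "field", "field"]))   # ['field', 'field_1', 'field_2']
```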
Zero-object messages
A message with no data objects is valid (metadata-only). The store produces a root group with attributes but no arrays.
Single chunk per array
Each TGM data object maps to a Zarr array with chunk_shape == array_shape (one chunk). There is no sub-chunking; partial reads within the array are handled by Zarr’s byte-range support against the single chunk. If a Zarr writer attempts to store multiple chunks for the same variable, a ValueError is raised — TensogramStore does not silently drop extra chunks.
Out-of-range message index
If message_index exceeds the number of messages in the file, an IndexError is raised. Negative indices are rejected with ValueError.
bfloat16 dtype
bfloat16 maps to Zarr data type "bfloat16" but is stored as raw 2-byte values (<V2 numpy dtype) since numpy has no native bfloat16 type. Use ml_dtypes.bfloat16 for interpretation.
Byte order handling
The read path normalises all chunk data to little-endian (matching the Zarr bytes codec default). The write path respects byte_order from the Zarr codecs metadata — if a big-endian bytes codec is specified, the data is byte-swapped before encoding to TGM.
JSON serialization (RFC 8259)
serialize_zarr_json() converts non-finite float values to their Zarr v3 string sentinels ("NaN", "Infinity", "-Infinity") so the output is valid RFC 8259 JSON.
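A minimal sketch of that conversion rule, assuming a simple per-value mapping (not the actual serialize_zarr_json implementation):

```python
import json
import math

def jsonify_float(x: float):
    """Map non-finite floats to the Zarr v3 string sentinels so the
    result stays valid RFC 8259 JSON."""
    if math.isnan(x):
        return "NaN"
    if math.isinf(x):
        return "Infinity" if x > 0 else "-Infinity"
    return x

values = [1.5, float("nan"), float("inf"), float("-inf")]
print(json.dumps([jsonify_float(v) for v in values]))
# [1.5, "NaN", "Infinity", "-Infinity"]
```

Note that plain json.dumps would emit bare NaN/Infinity tokens, which RFC 8259 forbids; the sentinel strings avoid that.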
Write path byte-count validation
When flushing to .tgm, the store validates that chunk byte count matches product(shape) * dtype_size. A mismatch raises ValueError with the expected and actual counts.
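A sketch of the check, using a hypothetical validate_chunk helper:

```python
import math
import numpy as np

def validate_chunk(name: str, shape, dtype: str, chunk: bytes):
    """Flush-time check: chunk bytes must equal product(shape) *
    dtype_size, else ValueError reporting both counts."""
    expected = math.prod(shape) * np.dtype(dtype).itemsize
    if len(chunk) != expected:
        raise ValueError(f"{name}: expected {expected} bytes, got {len(chunk)}")

data = np.zeros((100, 200), dtype=np.float32)
validate_chunk("temperature", data.shape, "float32", data.tobytes())  # ok
try:
    validate_chunk("temperature", data.shape, "float32", data.tobytes()[:-4])
except ValueError as exc:
    print(exc)  # temperature: expected 80000 bytes, got 79996
```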
close() exception safety
If _flush_to_tgm() fails during close(), the store is still marked as closed (_is_open = False). The exception propagates normally — partial writes do not corrupt the file since TGM messages are written atomically.
When used as a context manager and an exception is already in flight, flush errors are logged at WARNING level instead of replacing the original exception.
Error handling
All errors surface with enough context for debugging:
| Scenario | Exception | Message includes |
|---|---|---|
| File not found / unreadable | OSError | File path |
| Invalid TGM message | ValueError | File path + message index |
| Object decode failure | ValueError | File path + message index + object index + variable name |
| Out-of-range message index | IndexError | Requested index + available count |
| Negative message index | ValueError | The invalid index value |
| Invalid mode | ValueError | The invalid mode string |
| Empty path | ValueError | The value passed |
| Chunk byte-count mismatch | ValueError | Variable name + expected vs actual byte count |
| Unsupported dtype on write | ValueError | Variable name + dtype |
| Invalid JSON in zarr.json | ValueError | Byte count + hex preview |
| Unknown ByteRequest type | TypeError | The type name |
| Array without chunk data | WARNING log | Variable name (array skipped) |
| No arrays to flush | WARNING log | File path |
Errors from the underlying Rust tensogram library are wrapped with Python-level context so users see which file, message, and variable caused the problem.
anemoi-inference Integration
The tensogram-anemoi package provides a plug-and-play output for
anemoi-inference, the ECMWF framework
for running AI-based weather forecast models. Once installed, anemoi-inference
automatically discovers the plugin via Python entry points — no code changes to
anemoi-inference are required.
Installation
pip install tensogram-anemoi
Or from source:
pip install -e python/tensogram-anemoi/
Usage
In an anemoi-inference run config, specify tensogram as the output:
output:
tensogram:
path: forecast.tgm
All forecast steps are written to a single .tgm file as they are produced.
Remote destinations (S3, GCS, Azure, …) are supported via fsspec:
output:
tensogram:
path: s3://my-bucket/forecast.tgm
storage_options:
key: ...
secret: ...
Configuration options
All options after path must be supplied as keyword arguments.
| Option | Type | Default | Description |
|---|---|---|---|
| path | str | — | Destination file path or remote URL |
| encoding | str | "none" | "none" or "simple_packing" |
| bits | int | None | Bits per value (required when encoding="simple_packing") |
| compression | str | "zstd" | "none", "zstd", "lz4", "szip", "blosc2" |
| dtype | str | "float32" | Field array dtype: "float32" or "float64" |
| storage_options | dict | {} | Forwarded to fsspec for remote paths |
| stack_pressure_levels | bool | False | Stack pressure-level fields into 2-D objects |
| variables | list[str] | None | Restrict output to a subset of variables |
| output_frequency | int | None | Write every N steps |
| write_initial_state | bool | None | Whether to write step 0 |
Pressure-level stacking
When stack_pressure_levels=True, all fields sharing the same GRIB param
are merged into a single 2-D object of shape (n_grid, n_levels), sorted by
level ascending. The "mars" namespace carries "levelist": [500, 850, ...]
instead of a scalar "level" (following standard MARS convention).
Non-pressure-level fields are always written as individual 1-D objects.
output:
tensogram:
path: forecast.tgm
stack_pressure_levels: true
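The grouping can be sketched with plain numpy (synthetic fields below; this is an illustration of the layout, not the plugin's actual code):

```python
import numpy as np

n_grid = 8
fields = {  # (param, level) -> 1-D field, as produced per forecast step
    ("t", 850): np.full(n_grid, 850.0),
    ("t", 500): np.full(n_grid, 500.0),
    ("t", 1000): np.full(n_grid, 1000.0),
}

# Same param, levels sorted ascending -> one (n_grid, n_levels) object
levels = sorted(lev for (_, lev) in fields)            # [500, 850, 1000]
stacked = np.stack([fields[("t", lev)] for lev in levels], axis=1)
print(stacked.shape)        # (8, 3) -- (n_grid, n_levels)
print(levels)               # goes into mars["levelist"]
```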
Simple packing
For compact storage, use simple_packing with a bits value:
output:
tensogram:
path: forecast.tgm
encoding: simple_packing
bits: 16
compression: zstd
Coordinate arrays (lat/lon) are never lossy-encoded; only field arrays are packed.
Metadata reference
Each .tgm file produced by tensogram-anemoi contains one message per forecast
step. This section documents exactly what is stored in each message and how to
read it with the raw tensogram Python API.
Opening a file
import tensogram
tgm = tensogram.TensogramFile.open("forecast.tgm")
print(len(tgm), "steps")
meta, objects = tgm[0] # first step
meta is the decoded message metadata. objects is a list of
(descriptor, array) pairs, one entry per object in the message.
Object layout
Every message has the following fixed layout:
| Index | base[i]["name"] | Content |
|---|---|---|
| 0 | "grid_latitude" | Latitude coordinates, float64, shape (n_grid,) |
| 1 | "grid_longitude" | Longitude coordinates, float64, shape (n_grid,) |
| 2 … N | variable name or param name | Field data |
meta, objects = tgm[0]
lat_desc, lat_arr = objects[0] # latitudes
lon_desc, lon_arr = objects[1] # longitudes
fld_desc, fld_arr = objects[2] # first field
The coordinate names "grid_latitude" and "grid_longitude" are intentionally
distinct from the standard "latitude" / "longitude" names so that all objects
in a message share a single flat grid dimension rather than each coordinate
spawning its own dimension.
base[i] — per-object metadata
Each object has a corresponding entry in meta.base:
for i, entry in enumerate(meta.base):
    print(i, entry)
Every entry contains:
| Key | Type | Present on | Description |
|---|---|---|---|
"name" | str | all objects | Variable or coordinate name |
"anemoi" | dict | all objects | anemoi-specific metadata (see below) |
"mars" | dict | field objects only | MARS metadata (see below) |
"anemoi" namespace
| Key | Type | Present on | Description |
|---|---|---|---|
"variable" | str | all objects | Internal anemoi-inference variable name |
For coordinates, "variable" is "latitude" or "longitude" (the canonical
name, not the "grid_*" name stored in "name"):
assert meta.base[0]["name"] == "grid_latitude"
assert meta.base[0]["anemoi"]["variable"] == "latitude"
assert meta.base[1]["name"] == "grid_longitude"
assert meta.base[1]["anemoi"]["variable"] == "longitude"
For fields, "variable" is the internal anemoi-inference name (e.g. "t500"
for 500 hPa temperature, "2t" for 2 m temperature):
assert meta.base[2]["anemoi"]["variable"] == "2t"
"mars" namespace
Coordinate objects carry no "mars" key. Every field object carries a "mars"
dict combining keys from the anemoi-inference checkpoint with the temporal keys
derived from the forecast state:
Temporal keys (present on every field object):
| Key | Type | Description | Example |
|---|---|---|---|
"date" | str | Analysis/base date (YYYYMMDD) | "20240101" |
"time" | str | Analysis/base time (HHMM) | "0000" |
"step" | int or float | Forecast lead time in hours | 6, 1.5 |
Checkpoint keys (present when available in the model checkpoint):
| Key | Type | Description | Example |
|---|---|---|---|
"param" | str | GRIB parameter short name | "2t", "t", "u" |
"levtype" | str | Level type | "sfc", "pl", "ml" |
"level" | int | Pressure level (unstacked fields only) | 500 |
"levelist" | list[int] | Pressure levels (stacked fields only) | [500, 850, 1000] |
Reading field metadata:
meta, objects = tgm[0]
# Surface field (e.g. 2 m temperature)
entry = meta.base[2]
print(entry["name"]) # "2t"
print(entry["anemoi"]["variable"]) # "2t"
print(entry["mars"]["param"]) # "2t"
print(entry["mars"]["date"]) # "20240101"
print(entry["mars"]["time"]) # "0000"
print(entry["mars"]["step"]) # 6
# Pressure-level field (unstacked)
entry = meta.base[3]
print(entry["mars"]["param"]) # "t"
print(entry["mars"]["levtype"]) # "pl"
print(entry["mars"]["level"]) # 500
With stack_pressure_levels=True, the pressure-level group has "levelist"
instead of "level", and the array is 2-D:
entry = meta.base[2] # stacked t group
print(entry["mars"]["levelist"]) # [500, 850, 1000]
print(entry["mars"]["param"]) # "t"
desc, arr = objects[2]
print(arr.shape) # (n_grid, 3) — columns sorted by level
meta.extra — message-level metadata
meta.extra carries metadata that applies to the whole message rather than
individual objects.
"dim_names" — axis-size hints
dim_names = meta.extra["dim_names"]
# e.g. {"21600": "values"}
# or {"21600": "values", "3": "level"} (with stack_pressure_levels=True)
dim_names maps the string representation of an axis length to a semantic
name. It exists to allow downstream tools to assign meaningful axis names
without requiring any anemoi-specific knowledge. The grid axis is always
labelled "values"; when pressure-level stacking is enabled, each unique
level-axis size is labelled "level".
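A downstream tool might apply dim_names like this (illustrative helper, not part of tensogram):

```python
def axis_names(shape, dim_names: dict) -> list:
    """Map each axis length through dim_names (keys are the string form
    of the length), falling back to a generic dim_<i>."""
    return [dim_names.get(str(n), f"dim_{i}") for i, n in enumerate(shape)]

hints = {"21600": "values", "3": "level"}
print(axis_names((21600,), hints))      # ['values']
print(axis_names((21600, 3), hints))    # ['values', 'level']
```

Because the mapping is keyed by size, it works for any object in the message without per-object bookkeeping.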
Object descriptors
Each (descriptor, array) pair returned by objects[i] gives low-level
encoding detail:
desc, arr = objects[2]
print(desc.dtype) # "float32" or "float64"
print(desc.shape) # [n_grid] for flat, [n_grid, n_levels] for stacked
print(desc.encoding) # "none" or "simple_packing"
print(desc.compression) # "zstd", "lz4", etc.
Coordinate arrays are always float64 regardless of the dtype setting.
Field arrays use the configured dtype ("float32" by default), promoted to
float64 automatically when encoding="simple_packing".
Full inspection example
import tensogram
tgm = tensogram.TensogramFile.open("forecast.tgm")
for step_idx, (meta, objects) in enumerate(tgm):
    print(f"\n--- step {step_idx} ---")
    # Dimension hints
    print("dim_names:", meta.extra.get("dim_names", {}))
    for i, entry in enumerate(meta.base):
        desc, arr = objects[i]
        anemoi = entry.get("anemoi", {})
        mars = entry.get("mars", {})
        print(
            f"  [{i}] name={entry['name']!r:20s}"
            f" variable={anemoi.get('variable')!r:10s}"
            f" shape={arr.shape}"
            f" dtype={desc.dtype}"
            + (f" step={mars.get('step')}" if mars else "")
        )
Example output for a single step with surface fields and stacked pressure levels:
--- step 0 ---
dim_names: {'21600': 'values', '3': 'level'}
[0] name='grid_latitude' variable='latitude' shape=(21600,) dtype=float64
[1] name='grid_longitude' variable='longitude' shape=(21600,) dtype=float64
[2] name='2t' variable='2t' shape=(21600,) dtype=float32 step=6
[3] name='t' variable='t' shape=(21600, 3) dtype=float32 step=6
[4] name='u' variable='u' shape=(21600, 3) dtype=float32 step=6
Free-Threaded Python
Tensogram supports free-threaded Python (CPython 3.13t / 3.14t), which removes the Global Interpreter Lock (GIL) and allows true multi-threaded parallelism from Python.
What This Means
On standard CPython, the GIL serializes access to the interpreter — only one thread runs Python code at a time. Tensogram already releases the GIL during Rust computation (py.detach()), which helps, but the GIL is still re-acquired for numpy array construction and Python object creation.
On free-threaded CPython (3.13t / 3.14t), there is no GIL at all. Multiple threads can call tensogram.encode() and tensogram.decode() in true parallel. Use the included benchmark (rust/benchmarks/python/bench_threading.py) to measure scaling on your hardware.
Building for Free-Threaded Python
Install a free-threaded Python build:
# uv (recommended)
uv python install cpython-3.14+freethreaded
# Or via pyenv
pyenv install 3.14t
Build tensogram:
uv venv .venv --python python3.14t
source .venv/bin/activate
uv pip install maturin "numpy>=2.1"
cd python/bindings && maturin develop --release
Verify the GIL is disabled:
import sys
print(sys._is_gil_enabled()) # False
Thread-Safe API
All tensogram read operations are safe to call from multiple threads simultaneously:
import threading
import numpy as np
import tensogram
data = np.random.randn(1_000_000).astype(np.float32)
meta = {"version": 2, "base": [{}]}
desc = {"type": "ntensor", "shape": [1_000_000], "dtype": "float32"}
msg = tensogram.encode(meta, [(desc, data)])
def decode_worker():
    for _ in range(100):
        result = tensogram.decode(msg)

threads = [threading.Thread(target=decode_worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
Each thread can independently:
- Encode and decode messages
- Scan buffers
- Validate messages and files
- Read from TensogramFile instances (same handle or separate handles)
- Use StreamingEncoder (separate instances per thread)
TensogramFile Thread Safety
All read methods on TensogramFile (decode_message, read_message, decode_metadata, decode_descriptors, decode_object, decode_range, __getitem__, __len__, __iter__) use &self and support concurrent access from multiple threads on the same handle:
f = tensogram.TensogramFile.open("data.tgm")
def worker(thread_id):
    # Multiple threads can read from the same handle concurrently
    msg = f.decode_message(thread_id % len(f))

threads = [threading.Thread(target=worker, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
Only append() requires exclusive access — calling it while other threads are reading will raise RuntimeError (PyO3 runtime borrow check).
Benchmark Results
Measured on Linux x86_64 (20 cores), NumPy 2.4.4, release build. Same-version paired comparisons to isolate the GIL effect.
All scaling below comes from Python-level threading (threading.Thread). Each call into Rust is single-threaded — there is no rayon or internal parallelism within a single encode/decode. The speedups reflect multiple Python threads entering Rust concurrently via py.detach(). A future Rust-level parallel pipeline would multiply on top of these numbers.
Headline: Decode Throughput (1M float32, no codec)
| Threads | 3.13 (GIL) | 3.13t (free) | 3.14 (GIL) | 3.14t (free) |
|---|---|---|---|---|
| 1 | 416 op/s | 391 op/s | 408 op/s | 396 op/s |
| 2 | 432 (1.04x) | 775 (1.98x) | 432 (1.06x) | 776 (1.96x) |
| 4 | 427 (1.03x) | 1,356 (3.47x) | 425 (1.04x) | 1,352 (3.41x) |
| 8 | 309 (0.74x) | 1,507 (3.85x) | 293 (0.72x) | 1,841 (4.65x) |
Headline: Encode Throughput (1M float32, no codec)
| Threads | 3.13 (GIL) | 3.13t (free) | 3.14 (GIL) | 3.14t (free) |
|---|---|---|---|---|
| 1 | 608 op/s | 572 op/s | 504 op/s | 595 op/s |
| 2 | 761 (1.25x) | 709 (1.24x) | 664 (1.32x) | 702 (1.18x) |
| 4 | 659 (1.08x) | 726 (1.27x) | 468 (0.93x) | 725 (1.22x) |
| 8 | 520 (0.86x) | 706 (1.23x) | 351 (0.70x) | 717 (1.20x) |
Small Messages (16K float32, no codec)
| Threads | 3.13 (GIL) | 3.13t (free) | 3.14 (GIL) | 3.14t (free) |
|---|---|---|---|---|
| 1 | 20,765 op/s | 17,085 op/s | 20,174 op/s | 12,951 op/s |
| 2 | 23,689 (1.14x) | 35,642 (2.09x) | 23,093 (1.14x) | 35,176 (2.72x) |
| 4 | 22,629 (1.09x) | 36,483 (2.14x) | 22,839 (1.13x) | 61,583 (4.75x) |
| 8 | 23,664 (1.14x) | 79,539 (4.66x) | 22,487 (1.11x) | 73,549 (5.68x) |
| 16 | 23,418 (1.13x) | 93,627 (5.48x) | 23,369 (1.16x) | 168,786 (13.03x) |
Other Operations (1M float32)
Scan (message boundary detection — ~0.2µs/call, GIL overhead dominates):
| Threads | 3.14 (GIL) | 3.14t (free) |
|---|---|---|
| 1 | 312,930 op/s | 79,431 op/s |
| 2 | 421,701 (1.35x) | 266,103 (3.35x) |
| 4 | 629,505 (2.01x) | 811,096 (10.21x) |
| 8 | 522,940 (1.67x) | 389,106 (4.90x) |
| 16 | 516,342 (1.65x) | 1,231,777 (15.51x) |
Validate (full message validation — CPU-bound, scales well on both):
| Threads | 3.14 (GIL) | 3.14t (free) |
|---|---|---|
| 1 | 5,457 op/s | 4,347 op/s |
| 2 | 10,860 (1.99x) | 9,440 (2.17x) |
| 4 | 20,249 (3.71x) | 18,752 (4.31x) |
| 8 | 39,766 (7.29x) | 23,048 (5.30x) |
| 16 | 48,560 (8.90x) | 45,455 (10.46x) |
Decode-range (sub-array extraction, 2x1K slices from 1M):
| Threads | 3.14 (GIL) | 3.14t (free) |
|---|---|---|
| 1 | 66,488 op/s | 40,265 op/s |
| 2 | 111,544 (1.68x) | 98,319 (2.44x) |
| 4 | 103,191 (1.55x) | 167,786 (4.17x) |
| 8 | 104,752 (1.58x) | 325,101 (8.07x) |
| 16 | 103,236 (1.55x) | 475,755 (11.82x) |
Iter-messages (3 messages, 100K f32 each):
| Threads | 3.14 (GIL) | 3.14t (free) |
|---|---|---|
| 1 | 1,214 op/s | 1,195 op/s |
| 2 | 1,291 (1.06x) | 2,327 (1.95x) |
| 4 | 1,211 (1.00x) | 4,548 (3.81x) |
| 8 | 1,194 (0.98x) | 5,589 (4.68x) |
| 16 | 1,106 (0.91x) | 4,432 (3.71x) |
Key Takeaways
Methodology: 5 runs per configuration, median reported. 200–500 warmup iterations for fast operations.
- Validate scales near-linearly on both GIL and free-threaded — 8.9x (GIL) and 10.5x (free-threaded) at 16 threads. This is the most CPU-bound operation and benefits fully from py.detach() regardless of GIL.
- Free-threaded decode scales to 4.7x at 8 threads for the headline workload (1M f32, no codec). GIL-enabled stays near 1.0x because numpy array construction dominates and serializes under the GIL.
- GIL-enabled decode-range plateaus at ~1.7x — py.detach() allows 2 threads of overlap but the lightweight result construction can’t overlap further. Free-threaded reaches 11.8x at 16 threads.
- Scan shows dramatic free-threaded scaling — free-threaded reaches 15.5x at 16 threads. GIL-enabled scales to 2.0x at 4 threads but drops back at higher thread counts due to contention.
- Small messages (16K) reach 13.0x at 16 threads on free-threaded (3.14t) vs 1.2x on GIL-enabled.
- iter_messages scales to 4.7x at 8 threads on free-threaded, then drops due to contention. GIL-enabled stays flat (~1.0x).
- Single-thread trade-off — free-threaded single-thread performance varies by workload: decode is within ~5% of GIL-enabled (396 vs 408 op/s on 3.14), encode varies by version (3.14t is 18% faster than 3.14, while 3.13t is 6% slower than 3.13). Validate is ~20% slower (4,347 vs 5,457 op/s) and scan ~4x slower due to reference counting overhead on returned Python objects — both recover by 2 threads.
These numbers are machine-specific. Run the benchmark on your hardware:
python rust/benchmarks/python/bench_threading.py            # full suite
python rust/benchmarks/python/bench_threading.py --headline # quick comparison
python rust/benchmarks/python/bench_threading.py --quick    # CI smoke test
Reference Comparison: Tensogram (Python) vs ecCodes (C)
This section measures Tensogram’s Python throughput against ecCodes’ native C performance on the same pipeline — 10 million float64 values (80 MiB), 24-bit simple packing + szip compression — as a concrete reference point. The pipeline is common in operational weather forecasting and is representative of scientific-quantisation workloads more broadly.
What we measured
Both sides are measured end-to-end: from a float64 array to serialized compressed bytes (encode), and back to a float64 array (decode). Both include metadata serialization, framing, and integrity overhead — not just the raw packing step.
ecCodes (C, single-threaded): The Rust benchmark (rust/benchmarks/src/bin/grib_comparison.rs) calls ecCodes’ C library directly via FFI. Encode: allocate a GRIB handle, configure the grid (10M regular lat/lon), set packing type to CCSDS at 24 bits, write the values array, serialize to GRIB bytes. Decode: load the GRIB message from bytes, extract the values array. No Python involved. Median of 10 iterations, 3 warmup.
Tensogram (Python, multi-threaded): The same 10M float64 values, same 24-bit quantization, same szip compression. Encode: pass a numpy array + CBOR metadata dict to tensogram.encode(), which crosses the PyO3 boundary, quantizes, compresses, frames, computes the integrity hash, and returns Python bytes. Decode: pass bytes to tensogram.decode(), which deframes, decompresses, dequantizes, and returns a numpy array. Each Python thread makes independent encode/decode calls. The GIL is released during the Rust computation.
Why scaling depends on the codec
Threading helps most when the Rust computation (compression, quantization) is the dominant cost. With simple packing + szip, each encode/decode spends ~170 ms in Rust and ~20 ms in Python/numpy — so ~89% of the time runs with the GIL released and threads scale well. Without compression, the Rust work is trivial (~1 ms) and the Python overhead limits parallelism.
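A rough Amdahl's-law reading of those numbers (a sketch using the round figures quoted above, not a measurement) gives an upper bound on the expected scaling: ~170 ms of Rust work can overlap across threads once the GIL is released, while ~20 ms of Python/numpy work stays serialized.

```python
# Amdahl's-law estimate for the pipeline described above: ~170 ms of Rust
# work runs with the GIL released (parallelizable), ~20 ms of Python/numpy
# work stays serialized. This is an idealized upper bound that ignores
# memory bandwidth and pool overhead.
rust_ms, python_ms = 170.0, 20.0
serial = python_ms / (rust_ms + python_ms)   # serialized fraction, ~0.105

def ideal_speedup(n: int) -> float:
    # Serialized part runs once; parallel part divides across n threads.
    return 1.0 / (serial + (1.0 - serial) / n)

for n in (1, 2, 4, 8):
    print(n, round(ideal_speedup(n), 2))
```

At 8 threads this predicts roughly 4.6×, consistent in magnitude with the measured decode scaling reported earlier.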
The tables above measure uncompressed data to isolate the threading mechanism. The results below use the production pipeline (24-bit packing + szip) and show what real workloads achieve.
Results
ecCodes CCSDS (Rust FFI, single-threaded): 870 MB/s encode, 531 MB/s decode.
Tensogram from Python (free-threaded 3.14t, 5-run median, 10M float64 24-bit packing+szip):
Decode:
| Threads | Throughput | vs ecCodes C |
|---|---|---|
| 1 | 446 MB/s | 0.84x |
| 2 | 858 MB/s | 1.62x |
| 4 | 1,596 MB/s | 3.01x |
| 8 | 2,602 MB/s | 4.90x |
Encode:
| Threads | Throughput | vs ecCodes C |
|---|---|---|
| 1 | 435 MB/s | 0.50x |
| 2 | 833 MB/s | 0.96x |
| 4 | 1,516 MB/s | 1.74x |
| 8 | 2,353 MB/s | 2.71x |
Single-threaded Tensogram from Python is slower than ecCodes from C (the PyO3 boundary costs ~10-15% on decode, ~50% on encode due to numpy data extraction for 80 MiB). But at 2 threads, decode already surpasses ecCodes. At 4 threads, both encode and decode exceed ecCodes. At 8 threads, decode reaches 4.9x ecCodes throughput — from Python.
Requirements
- Python >= 3.13t for free-threaded mode (3.12/3.13 GIL-enabled also works)
- NumPy >= 2.1 (free-threaded support)
- maturin >= 1.8 (free-threaded wheel building)
Known Limitations
Inherent:
- Shared mutable numpy arrays across threads can cause data races (same as any Python threading)
- xarray and zarr backends have their own threading models (dask, zarr locking)
By design:
- `TensogramFile` read methods (`decode_message`, `read_message`, `__getitem__`, etc.) support concurrent access from multiple threads on the same handle. Only `append()` requires exclusive access.
- `bytes` inputs to decode/scan/validate are zero-copy across the GIL release. `bytearray` inputs are copied once internally by PyO3.
- `iter_messages` / `PyBufferIter` own a full buffer copy (the buffer must outlive iteration).
Multi-Threaded Coding Pipeline
Since v0.13.0 Tensogram exposes a caller-controlled thread budget that spreads encoding and decoding work across a scoped pool of workers. The feature is off by default — existing code paths produce byte-identical output to previous releases until the caller opts in.
This page covers:
- The `threads` option
- Cross-language parity
- Axis-A vs axis-B dispatch
- Determinism contract
- Environment variable override
- Interaction with free-threaded Python
- Benchmarks and tuning
The threads option
All four bindings expose a `threads: u32` option on encode and decode entry points:
#![allow(unused)]
fn main() {
use tensogram::{encode, decode, EncodeOptions, DecodeOptions};
// Encode with a 4-thread pool:
let msg = encode(&meta, &descriptors, &EncodeOptions {
threads: 4,
..Default::default()
})?;
// Decode with an 8-thread pool:
let (meta, objs) = decode(&msg, &DecodeOptions {
threads: 8,
..Default::default()
})?;
}
import tensogram
msg = tensogram.encode(meta, descriptors, threads=4)
decoded = tensogram.decode(msg, threads=8)
tensogram::encode_options enc{};
enc.threads = 4;
auto bytes = tensogram::encode(meta_json, objects, enc);
tensogram::decode_options dec{};
dec.threads = 8;
auto msg = tensogram::decode(buf, len, dec);
tgm_encode(meta_json, data_ptrs, data_lens, num_objects,
"xxh3", /* threads= */ 4, &out);
tgm_decode(buf, len, /* verify_hash */ 0, /* native_byte_order */ 1,
/* threads= */ 8, &msg);
tensogram --threads 8 merge -o merged.tgm a.tgm b.tgm
TENSOGRAM_THREADS=4 tensogram split -o 'part_[index].tgm' input.tgm
Value semantics
| `threads` | Behaviour |
|---|---|
| `0` (default) | Sequential, single-threaded. Falls back to the `TENSOGRAM_THREADS` env var if set and non-zero. |
| `1` | Build a scoped 1-worker rayon pool. Useful for testing — everything flows through the parallel code paths but runs deterministically. |
| `N ≥ 2` | Build a scoped N-worker rayon pool for the duration of the call. The pool is dropped when the call returns. |
Cross-language parity
Every language binding exposes the same threads option on every
encode/decode entry point that does CPU work. Metadata-only commands
(scan, describe, list) never accept it because they never decode
payloads.
| Entry point | Rust | Python | C FFI | C++ wrapper | CLI |
|---|---|---|---|---|---|
| `encode` / `encode_pre_encoded` | ✅ | ✅ | ✅ | ✅ | — (via subcommand) |
| `decode` / `decode_object` / `decode_range` | ✅ | ✅ | ✅ | ✅ | — (via subcommand) |
| `TensogramFile::append` | ✅ | ✅ | ✅ | ✅ | — |
| `TensogramFile::decode_message` | ✅ | ✅ | ✅ | ✅ | — |
| `TensogramFile::decode_range` | ✅ | ✅ | ✅ | ✅ | — |
| Batch decode (object/range) | ✅ | ✅ | — (not exposed in FFI) | — | — |
| `AsyncTensogramFile::*` | — (async feature, trait) | ✅ | — | — | — |
| `StreamingEncoder::new` | ✅ | ✅ | ✅ | ✅ | — |
| `tensogram merge` | — | — | — | — | ✅ (`--threads`) |
| `tensogram split` | — | — | — | — | ✅ |
| `tensogram reshuffle` | — | — | — | — | ✅ |
| `tensogram convert-grib` / `convert-netcdf` | — | — | — | — | ✅ |
| `tensogram validate` | — | — | — | — | ⚠ (flag accepted but not plumbed — IDEAS) |
| `tensogram copy` / `merge` | — | — | — | — | ✅ |
| `TENSOGRAM_THREADS` env var fallback | ✅ | ✅ | ✅ | ✅ | ✅ |
Legend: ✅ = full support, ⚠ = flag accepted but currently a no-op (tracked in IDEAS), — = not applicable at this layer.
Threshold behaviour
For very small payloads the pool-build cost (~10–100 µs) outweighs any parallelism gain. The library transparently skips the pool when the total payload bytes are below a threshold (default 64 KiB). The threshold is tunable:
#![allow(unused)]
fn main() {
EncodeOptions {
threads: 8,
parallel_threshold_bytes: Some(0), // always parallel
// parallel_threshold_bytes: Some(usize::MAX), // never parallel
..Default::default()
}
}
Axis-A vs axis-B dispatch
The threads budget is spent along one of two axes:
- Axis A — across objects. When a message carries multiple data objects and none of them uses an axis-B-friendly codec, rayon `par_iter()` runs the encode/decode pipeline for each object on a worker in parallel. Output order is preserved exactly.
- Axis B — inside one codec. When any stage is axis-B-friendly (`simple_packing` encoding, `shuffle` filter, `blosc2` or `zstd` compression), the budget flows into the codec's internal parallelism:

| Stage | How it uses the budget |
|---|---|
| `simple_packing` encode/decode | Chunked `par_iter` with byte-aligned chunk sizes — output bytes remain identical. |
| `shuffle` / `unshuffle` | Parallelise the outer `byte_idx` loop (shuffle) or output-chunk scatter (unshuffle). |
| `blosc2` | `CParams::nthreads` / `DParams::nthreads` — decompress path stays single-threaded in v0.13.0. |
| `zstd` FFI | `NbWorkers` libzstd parameter on compress; decompress is inherently sequential. |
Policy
Tensogram messages tend to carry a small number of very large objects, so the library prefers axis B when any codec can use it:
| Object count | Any object axis-B friendly? | Behaviour |
|---|---|---|
| 1 | — | Axis B (codec gets the full budget). |
| N ≥ 2 | yes | Axis B on each object sequentially. Avoids N × N thread over-subscription. |
| N ≥ 2 | no | Axis A (par_iter across objects), each codec single-threaded. |
This decision happens once per encode/decode call based on the
descriptors. Nothing is configurable beyond threads and
parallel_threshold_bytes — the policy is deterministic.
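The decision table above can be condensed into a few lines. The following is an illustrative sketch of the policy as documented, not the library's actual internals; the function name and string labels are invented for clarity:

```python
# Sketch of the documented dispatch policy. `choose_dispatch` and its
# return labels are illustrative names, not part of the Tensogram API.
def choose_dispatch(num_objects: int, any_axis_b_friendly: bool,
                    threads: int, total_bytes: int,
                    threshold: int = 64 * 1024) -> str:
    if threads == 0 or total_bytes < threshold:
        return "sequential"      # default path, or below the payload threshold
    if num_objects == 1 or any_axis_b_friendly:
        return "axis-B"          # codec-internal parallelism, objects in order
    return "axis-A"              # par_iter across objects, codecs sequential
```

The key property is that the outcome depends only on the descriptors and the two options, so the same call always dispatches the same way.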
Determinism contract
v0.13.0 makes two different promises depending on which codecs you use.
Transparent codecs — byte-identical across thread counts
These stages produce the same encoded bytes regardless of
threads:
- `encoding = "none"`
- `encoding = "simple_packing"` (at any bits-per-value)
- `filter = "none"`
- `filter = "shuffle"`
- `compression ∈ {none, lz4, szip, zfp, sz3}`
Encoded payload bytes are bit-identical for `threads ∈ {0, 1, 2, 4, 8, 16, ...}`. This is exercised by the `rust/tensogram/tests/threads_determinism.rs` integration suite.
Opaque codecs — lossless round-trip, may differ
compression ∈ {blosc2, zstd} hand off work to third-party C
libraries. When their internal thread pool is asked to run in
parallel, blocks land in the output frame in worker completion
order. The compressed bytes may therefore differ from the
sequential path — but every variant round-trips losslessly:
- Encode with `threads=8`, decode with `threads=0` → same decoded values as a pure sequential round-trip.
- Golden files (produced with `threads=0`) are still byte-for-byte stable across releases because the default path is unchanged.
Why this matters
Determinism across thread counts is the core property that lets
Tensogram users turn threads on in production without worrying
about cache keys, deduplication hashes, or reproducible builds
breaking. The invariant is tested at every layer — Rust, Python,
C FFI, C++ wrapper — with a sweep over {0, 1, 2, 4, 8}.
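The pattern that makes byte-identical output possible is worth spelling out: each chunk of work gets a fixed output slot up front, so the final bytes never depend on which worker finishes first. Here is a self-contained sketch of that order-preserving structure (illustrative only; the library uses rayon, not Python threads, and a real codec instead of the toy byte transform):

```python
from concurrent.futures import ThreadPoolExecutor

# Order-preserving chunked processing: chunk i always lands in output
# slot i, so the result is identical at any worker count. The byte
# transform here is a toy stand-in for a real per-chunk codec stage.
def process_chunked(data: bytes, workers: int, chunk: int = 4096) -> bytes:
    chunks = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, not completion order.
        parts = pool.map(lambda c: bytes((b * 3) & 0xFF for b in c), chunks)
    return b"".join(parts)
```

Opaque codecs break this property precisely because their internal block layout is decided by completion order inside the third-party library, outside Tensogram's control.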
Interaction with integrity hashing
The xxh3-64 integrity hash attached to every data object
(EncodeOptions.hash_algorithm = Some(Xxh3), on by default) is a
pure function of the final encoded bytes. Hashing runs in the
calling thread after any intra-codec parallelism has joined;
each object owns its own Xxh3Default hasher on the stack and the
hasher is never shared across threads.
As a consequence the hash follows the same contract as the encoded bytes:
| Codec class | Encoded bytes across thread counts | Hash across thread counts |
|---|---|---|
| Transparent | Byte-identical | Byte-identical |
| Opaque | May reorder compressed blocks | May differ per-run |
For opaque codecs the hash is still internally consistent —
descriptor.hash == xxh3_64(encoded_payload) always holds for the
bytes that were actually written — it just may not match a hash
computed at a different thread count. verify_hash on decode
always succeeds regardless of the threads value used at encode
time.
Since the hash is folded into the codec output in lockstep (see
plans/DONE.md → Hash-while-encoding), turning on threads has
no additional hash-computation cost beyond what threading already
does to the encoded bytes themselves.
Environment variable override
TENSOGRAM_THREADS is consulted only when the caller-provided
threads is 0. This matches the existing
TENSOGRAM_COMPRESSION_BACKEND pattern:
# One-shot invocation — every library call inherits the budget.
TENSOGRAM_THREADS=4 python my_pipeline.py
# Explicit option still wins.
tensogram.encode(meta, descs, threads=0) # sequential (env honoured)
tensogram.encode(meta, descs, threads=1) # single-threaded (env ignored)
tensogram.encode(meta, descs, threads=16) # 16 workers (env ignored)
The env var is parsed once per process (OnceLock), so changing it
mid-run has no effect.
Interaction with free-threaded Python
threads is orthogonal to Python threading. For CPython 3.13+ built
with --disable-gil, you can combine:
- Python threads — run multiple Tensogram calls concurrently.
- Tensogram threads — each call uses rayon internally.
The PyO3 bindings always release the GIL around encode/decode, so the two dimensions compose cleanly. Be careful about total thread count: N Python threads × M Tensogram threads creates N×M workers. The safest starting point is one dimension at a time.
Benchmarks and tuning
The threads-scaling benchmark measures encode/decode throughput
for 7 representative codec combinations across a sweep of thread
counts:
cargo build --release -p tensogram-benchmarks
./target/release/threads-scaling \
--num-points 16000000 \
--iterations 5 \
--warmup 2 \
--threads 0,1,2,4,8,16
Output columns (per case × thread count):
- `enc (ms)`, `dec (ms)` — median wall time over `iterations`.
- `enc MB/s`, `dec MB/s` — throughput based on the original byte size.
- `ratio` — compressed size as a percentage of original.
- `size (MiB)` — compressed size.
- `enc x`, `dec x` — speedup relative to the `threads=0` baseline.
See the Benchmark Results page for numbers on a reference machine.
Tuning recommendations
- Start with `threads=0`. The default is deterministic, well tested, and fast for small-to-medium payloads.
- Turn it on globally via env. `TENSOGRAM_THREADS=$(nproc)` is a reasonable starting point for CPU-bound data-movement pipelines. Leave the in-process tensogram calls at `threads=0` unless you need finer control per call.
- Measure before tuning. On small payloads the threshold keeps you safe, but the sweet spot for large tensors varies by codec. For simple_packing + szip, 2–4 threads already reaches diminishing returns; for blosc2 it can scale further.
- Do not stack Python threads × Tensogram threads unless you know the total fits your CPU budget. Over-subscription destroys throughput.
Benchmarks
Tensogram ships with a benchmark suite that measures all encoding and compression combinations on synthetic data. It produces tabular comparisons of speed, compressed size, and decode fidelity. The benchmarks can be re-run at any time to measure the effect of changes.
Codec Matrix Benchmark
Tests all valid encoder × compressor × bit-width combinations on 16 million synthetic float64 values.
Quick start
cargo run --release -p tensogram-benchmarks --bin codec-matrix
Override parameters with CLI flags:
cargo run --release -p tensogram-benchmarks --bin codec-matrix -- \
--num-points 16000000 \
--iterations 10 \
--warmup 3 \
--seed 42
| Flag | Default | Description |
|---|---|---|
| `--num-points` | 16 000 000 | Number of float64 values to encode |
| `--iterations` | 10 | Timed iterations per combination (median reported) |
| `--warmup` | 3 | Warm-up iterations (discarded) |
| `--seed` | 42 | PRNG seed for deterministic data generation |
Combinations measured
| Group | Description | Count |
|---|---|---|
| Baseline | No encoding, no compression | 1 |
| Lossless compressors | Raw floats compressed with zstd, LZ4, Blosc2, or szip | 4 |
| SimplePacking + lossless | Quantized to 16, 24, or 32 bits, then compressed with each of the above (or no compressor) | 15 |
| Lossy codecs | ZFP (fixed rate 16/24/32) and SZ3 (absolute error 0.01) | 4 |
| Total | 24 |
For actual results, see Benchmark Results.
How to read the results
The results page splits each benchmark into a performance table (timing, throughput, compressed size) and a fidelity table (error norms for lossy codecs).
| Column | Meaning | Better is |
|---|---|---|
| Method | Encoder + compressor. E.g. “24-bit + szip” means values are quantized to 24 bits then compressed with szip. [REF] marks the baseline. | — |
| Enc / Dec (ms) | Median encode / decode time. | Lower |
| Enc / Dec MB/s | Throughput: uncompressed size ÷ median time. | Higher |
| Ratio | Compressed size as percentage of original. 25% = compressed to ¼. Above 100% means the codec expanded the data. | Lower |
| Size (MiB) | Compressed output size. | — |
| Linf | Max absolute error (worst single value). | Smaller |
| L1 | Mean absolute error (average drift). | Smaller |
| L2 | Root mean square error (penalizes outliers). | Smaller |
For lossless codecs all three error norms are zero. Errors are absolute, in the same units as the input data.
Quick rules of thumb:
- If you need exact data back, use one of the lossless codecs.
- If you can tolerate some loss, compare Ratio vs error norms for your use case.
- Throughput (MB/s) is the most useful speed metric — it accounts for data size and lets you compare across different payload sizes.
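As a worked example of the derived columns, take the "zstd level 3" row from the Benchmark Results page (122.1 MiB input, 128.5 ms encode, 110.2 MiB output). Note that the throughput columns appear to divide MiB by seconds, matching the reported figures exactly:

```python
# Derived columns from one results row: 122.1 MiB input, 128.5 ms encode,
# 110.2 MiB compressed output (zstd level 3 on raw floats).
size_mib, enc_ms, out_mib = 122.1, 128.5, 110.2

enc_throughput = size_mib / (enc_ms / 1000)   # throughput, ~950 in the table
ratio_pct = out_mib / size_mib * 100          # compression ratio, ~90.3 %
```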
Reference Comparison: ecCodes GRIB Encoding
Scientific codecs are easiest to understand alongside an established reference. ecCodes is a widely-deployed GRIB encoder used throughout operational weather forecasting. This benchmark compares Tensogram’s 24-bit SimplePacking + szip pipeline against ecCodes’ built-in packing methods on 10 million float64 values. Both sides are timed symmetrically: encoding measures the full path from a float64 array to compressed bytes, and decoding measures the reverse.
Requirements
- ecCodes C library installed (`brew install eccodes` on macOS, `apt install libeccodes-dev` on Debian/Ubuntu)
- Build with `--features eccodes`
Quick start
cargo run --release -p tensogram-benchmarks --bin grib-comparison --features eccodes
cargo run --release -p tensogram-benchmarks --bin grib-comparison --features eccodes -- \
--num-points 10000000 \
--iterations 10 \
--warmup 3 \
--seed 42
Methods compared
| Method | Description |
|---|---|
| ecCodes CCSDS (reference) | CCSDS packing via ecCodes — a widely-deployed operational reference |
| ecCodes simple packing | Basic fixed-bit-width packing without entropy coding |
| Tensogram 24-bit + szip | Tensogram’s SimplePacking at 24 bits followed by szip entropy coding |
For actual results, see Benchmark Results.
Benchmark pipeline flow
flowchart TD
G[Generate synthetic field] --> W[Warm-up iterations]
W --> T[Timed iterations]
T --> E[Encode]
E --> D[Decode]
D --> T
T --> F[Fidelity check]
F --> R[Print report]
style G fill:#388e3c,stroke:#2e7d32,color:#fff
style T fill:#1565c0,stroke:#0d47a1,color:#fff
style F fill:#c62828,stroke:#b71c1c,color:#fff
Each timed iteration runs a full encode → decode cycle. After all iterations complete, the last decoded output is compared against the original to produce the fidelity metrics.
Things to know
Compression expansion
Some compressors (especially LZ4 on raw 64-bit floats) may produce output larger than the input (Ratio > 100%). This is normal — high-entropy data can’t always be compressed. The baseline row is a raw copy and always shows 100%.
Szip alignment
The codec matrix may round num_points up by 1–3 values for szip block alignment.
This only matters for very small inputs.
Small data sizes
With --num-points 1, timing is dominated by per-call overhead rather than
compression throughput. Use ≥ 10 000 points for meaningful comparisons.
GRIB grid shape
For prime num_points, the GRIB benchmark creates a 1 × N grid (not a realistic
near-square grid). Use composite sizes for representative results
(e.g. --num-points 10000000).
Reproducibility
The data generator is deterministic for a given --seed, so repeated runs on the
same machine produce comparable timing. Compression ratios, sizes, and fidelity
are reproducible across machines. Timing and throughput are not.
Error handling
If a single codec fails, the benchmark logs the error and continues with the remaining combinations. The summary line reports how many succeeded and failed. The CLI exits with code 1 if any combination failed.
Running in CI
For fast CI validation, pass --num-points 10000 --iterations 1 --warmup 1:
cargo run -p tensogram-benchmarks --bin codec-matrix -- \
--num-points 10000 --iterations 1 --warmup 1
The smoke test suite (cargo test -p tensogram-benchmarks) uses 500–1000 points
and completes in under 5 seconds.
Benchmark Results
This page is a snapshot of benchmark results recorded on a specific machine. For methodology, flags, and how to re-run, see Benchmarks.
Note: Timing and throughput are machine-specific. Compression ratios, sizes, and fidelity metrics are determined by the codec and are reproducible.
Run metadata
| Field | Value |
|---|---|
| Date | 2026-04-16 |
| Tensogram version | 0.13.0 |
| CPU | Apple M4, 10 cores / 10 threads |
| OS | macOS 26.3 (Darwin 25.3.0) |
| Rust | rustc 1.94.1 |
| ecCodes | 2.46.0 |
| Methodology | 10 timed iterations, 3 warmup, median reported |
Codec Matrix
16 million float64 values (122 MiB). The test data is a synthetic smooth scientific-like field with values in the range 250–310 (a profile that also matches real temperature grids and other bounded-range physical measurements).
How fidelity is measured
After each encode→decode round-trip, the decoded values are compared to the original. Three error norms are reported, all absolute in the same units as the input:
- Linf — the largest error for any single value. Answers: “what is the worst case?”
- L1 — the average error across all values. Answers: “how far off are values on average?”
- L2 (RMSE) — root mean square error. Like L1 but penalizes large outliers more heavily. Answers: “how large are the typical errors, weighted toward the worst ones?”
For lossless codecs all three are zero.
Lossless compressors on raw floats
No encoding step — raw 64-bit floats compressed directly. Decoded values are bit-identical to the original.
| Method | Enc (ms) | Dec (ms) | Enc MB/s | Dec MB/s | Ratio | Size (MiB) |
|---|---|---|---|---|---|---|
| no compression [REF] | 3.7 | 3.7 | 32818 | 33226 | 100.0% | 122.1 |
| zstd level 3 | 128.5 | 114.5 | 950 | 1066 | 90.3% | 110.2 |
| LZ4 | 8.5 | 7.4 | 14328 | 16535 | 100.4% | 122.6 |
| Blosc2 | 51.9 | 26.6 | 2350 | 4584 | 75.2% | 91.8 |
| szip | 69.7 | 206.8 | 1753 | 590 | 100.9% | 123.2 |
Raw 64-bit floats have high entropy, so most lossless compressors cannot reduce their size. LZ4 and szip slightly expand the data. Blosc2 is the exception — its byte-shuffle step exposes compressible patterns (75%).
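The byte-shuffle effect can be demonstrated with the standard library alone. The sketch below (illustrative; Blosc2's real shuffle is SIMD-accelerated and block-based) packs a smooth field as float64, then groups byte plane k of every value into a contiguous run, so the near-constant sign/exponent bytes compress well even though the raw stream does not:

```python
import math
import struct
import zlib

# Smooth temperature-like field in the 250-310 range, like the benchmark data.
values = [280.0 + 30.0 * math.sin(i / 50.0) for i in range(20_000)]
raw = struct.pack(f"<{len(values)}d", *values)

# Byte shuffle: plane k holds byte k of every little-endian float64.
# Planes 6-7 (exponent + high mantissa) vary slowly; low planes look random.
shuffled = b"".join(raw[k::8] for k in range(8))

plain_ratio = len(zlib.compress(raw, 6)) / len(raw)
shuffled_ratio = len(zlib.compress(shuffled, 6)) / len(raw)
```

On smooth data the shuffled stream compresses noticeably better than the raw one, which is the same mechanism behind Blosc2's 75% ratio in the table above.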
SimplePacking (quantization) + lossless compressors
Values are quantized to N bits, then compressed. Fidelity depends only on the bit width, not on the compressor — see the fidelity table below.
| Method | Enc (ms) | Dec (ms) | Enc MB/s | Dec MB/s | Ratio | Size (MiB) |
|---|---|---|---|---|---|---|
| 16-bit only | 17.3 | 15.1 | 7039 | 8078 | 25.0% | 30.5 |
| 16-bit + zstd | 54.2 | 36.2 | 2254 | 3375 | 24.4% | 29.7 |
| 16-bit + LZ4 | 19.7 | 22.2 | 6204 | 5493 | 25.1% | 30.6 |
| 16-bit + Blosc2 | 115.2 | 31.5 | 1060 | 3873 | 20.3% | 24.8 |
| 16-bit + szip | 53.9 | 99.3 | 2263 | 1229 | 14.6% | 17.8 |
| 24-bit only | 19.2 | 17.1 | 6347 | 7135 | 37.5% | 45.8 |
| 24-bit + zstd | 67.3 | 41.1 | 1813 | 2969 | 37.2% | 45.4 |
| 24-bit + LZ4 | 31.5 | 23.5 | 3871 | 5188 | 37.6% | 46.0 |
| 24-bit + Blosc2 | 124.9 | 40.0 | 978 | 3052 | 32.8% | 40.0 |
| 24-bit + szip | 63.3 | 133.5 | 1928 | 914 | 27.2% | 33.2 |
| 32-bit only | 21.2 | 25.3 | 5771 | 4825 | 50.0% | 61.0 |
| 32-bit + zstd | 97.8 | 37.0 | 1248 | 3299 | 49.8% | 60.8 |
| 32-bit + LZ4 | 37.1 | 45.1 | 3287 | 2706 | 50.2% | 61.3 |
| 32-bit + Blosc2 | 141.0 | 38.3 | 866 | 3183 | 45.3% | 55.3 |
| 32-bit + szip | 69.8 | 157.4 | 1748 | 775 | 39.7% | 48.4 |
Fidelity by bit width
| Bit width | Linf (max abs) | L1 (mean abs) | L2 (RMSE) |
|---|---|---|---|
| 16 bits | 4.9 × 10⁻⁴ | 2.4 × 10⁻⁴ | 2.8 × 10⁻⁴ |
| 24 bits | 1.9 × 10⁻⁶ | 9.5 × 10⁻⁷ | 1.1 × 10⁻⁶ |
| 32 bits | 7.5 × 10⁻⁹ | 3.7 × 10⁻⁹ | 4.3 × 10⁻⁹ |
For context: with input values around 280, a Linf of 1.9 × 10⁻⁶ means the worst-case relative error at 24 bits is roughly 7 parts per billion.
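The "7 parts per billion" figure is a one-line calculation from the table:

```python
# Worst-case 24-bit error relative to a typical field value of ~280.
linf_24 = 1.9e-6
rel = linf_24 / 280.0   # ~6.8e-9, i.e. roughly 7 parts per billion
```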
Lossy floating-point compressors
These operate directly on raw f64 bytes without quantization.
| Method | Enc (ms) | Dec (ms) | Enc MB/s | Dec MB/s | Ratio | Size (MiB) |
|---|---|---|---|---|---|---|
| ZFP rate 16 | 220.1 | 304.2 | 555 | 401 | 25.0% | 30.5 |
| ZFP rate 24 | 248.0 | 468.5 | 492 | 261 | 37.5% | 45.8 |
| ZFP rate 32 | 288.0 | 581.0 | 424 | 210 | 50.0% | 61.0 |
| SZ3 abs 0.01 | 131.4 | 141.0 | 929 | 865 | 6.5% | 7.9 |
Fidelity by lossy codec
| Method | Linf (max abs) | L1 (mean abs) | L2 (RMSE) |
|---|---|---|---|
| ZFP rate 16 | 1.3 × 10⁻² | 1.6 × 10⁻³ | 2.0 × 10⁻³ |
| ZFP rate 24 | 5.6 × 10⁻⁵ | 6.1 × 10⁻⁶ | 7.9 × 10⁻⁶ |
| ZFP rate 32 | 1.9 × 10⁻⁷ | 2.4 × 10⁻⁸ | 3.1 × 10⁻⁸ |
| SZ3 abs 0.01 | 1.0 × 10⁻² | 5.0 × 10⁻³ | 5.8 × 10⁻³ |
Notable observations
- 16-bit + szip achieves the best compression ratio (14.6%) among the SimplePacking combinations.
- SZ3 achieves the smallest output overall (6.5%) with a max error of 0.01. If your application tolerates that error bound, this gives the best compression in this benchmark.
- In this benchmark, higher ZFP rates gave proportionally smaller errors. ZFP fixed-rate modes always hit their target ratio exactly (25% / 37.5% / 50%).
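The exact ZFP ratios follow directly from the fixed-rate definition: at rate R, each float64 value is stored in R bits, so the ratio is R / 64.

```python
# Fixed-rate ZFP stores `rate` bits per value; input values are 64-bit.
ratios = {rate: rate / 64 * 100 for rate in (16, 24, 32)}
# 16 -> 25.0 %, 24 -> 37.5 %, 32 -> 50.0 %, matching the table exactly.
```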
Reference Comparison: ecCodes GRIB Encoding
GRIB is a binary format widely used in operational weather forecasting, and ecCodes (from ECMWF) is a common implementation. Comparing against it gives a concrete, reproducible reference point for Tensogram’s quantisation + entropy-coding pipeline.
This benchmark runs Tensogram’s 24-bit SimplePacking + szip and ecCodes’ built-in packing methods on the same input. Both sides are timed end-to-end: from a float64 array to serialised compressed bytes (encode), and back (decode).
10 million float64 values (76 MiB), 24-bit packing. Different dataset size from the codec matrix above.
| Method | Enc (ms) | Dec (ms) | Enc MB/s | Dec MB/s | Ratio | Size (MiB) |
|---|---|---|---|---|---|---|
| ecCodes CCSDS [REF] | 47.9 | 84.8 | 1594 | 900 | 27.2% | 20.8 |
| ecCodes simple packing | 32.6 | 7.9 | 2339 | 9660 | 37.5% | 28.6 |
| Tensogram 24-bit + szip | 43.7 | 80.4 | 1745 | 950 | 27.4% | 20.9 |
All three methods produce identical fidelity: Linf = 1.9 × 10⁻⁶, L1 = 9.5 × 10⁻⁷, L2 = 1.1 × 10⁻⁶.
Notable observations
- Tensogram and ecCodes CCSDS achieve nearly identical compression (27.4% vs 27.2%) and identical fidelity at 24 bits.
- Tensogram encode is now slightly faster than ecCodes CCSDS (43.7 vs 47.9 ms) on this machine; decode is comparable (80.4 vs 84.8 ms).
- ecCodes simple packing decodes fastest (7.9 ms) but produces a larger file (37.5% vs 27%).
Threading Scaling
The v0.13.0 multi-threaded coding pipeline lets callers spend a
threads budget on encode/decode work. Results here show the effect
of sweeping threads ∈ {0, 1, 2, 4, 8} on 16M f64 values (122 MiB)
for seven representative codec combinations. threads=0 is the
sequential baseline; speedups are measured against it.
Reminder: Transparent codecs (no codec, simple_packing, szip, lz4, zfp, sz3, shuffle) produce byte-identical encoded payloads across thread counts. Opaque codecs (blosc2, zstd with
nb_workers > 0) may produce different compressed bytes while always round-tripping losslessly.
Lossless (no encoding)
| Method | Metric | threads=0 | threads=1 | threads=2 | threads=4 | threads=8 |
|---|---|---|---|---|---|---|
| none+none | enc MB/s | 32818 | 35929 | 36801 | 35173 | 35520 |
| none+none | speedup | 1.00x | 1.09x | 1.12x | 1.07x | 1.08x |
| none+lz4 | enc MB/s | 7733 | 3619 | 3559 | 2029 | 2513 |
| none+lz4 | speedup | 1.00x | 0.47x | 0.46x | 0.26x | 0.32x |
| none+zstd(3) | enc MB/s | 942 | 1163 | 2075 | 2259 | 1839 |
| none+zstd(3) | speedup | 1.00x | 1.23x | 2.20x | 2.40x | 1.95x |
| none+blosc2(lz4) | enc MB/s | 3150 | 3140 | 5030 | 7458 | 8906 |
| none+blosc2(lz4) | speedup | 1.00x | 1.00x | 1.60x | 2.37x | 2.83x |
SimplePacking + compression
| Method | Metric | threads=0 | threads=1 | threads=2 | threads=4 | threads=8 |
|---|---|---|---|---|---|---|
| sp(16)+none | enc MB/s | 12964 | 13268 | 15584 | 15643 | 14612 |
| sp(16)+none | enc speedup | 1.00x | 1.02x | 1.20x | 1.21x | 1.13x |
| sp(16)+none | dec speedup | 1.00x | 1.14x | 2.37x | 2.34x | 2.18x |
| sp(24)+szip | enc MB/s | 2273 | 2263 | 2351 | 2389 | 2427 |
| sp(24)+szip | speedup | 1.00x | 1.00x | 1.03x | 1.05x | 1.07x |
| sp(24)+blosc2(lz4) | enc MB/s | 2371 | 2350 | 3965 | 5554 | 6388 |
| sp(24)+blosc2(lz4) | enc speedup | 1.00x | 0.99x | 1.67x | 2.34x | 2.69x |
Notable observations
- Memory-bound baselines (none+none, none+lz4) do not scale. The parallel dispatch overhead outweighs any gain when the work per task is already at memory bandwidth. `none+lz4` actually regresses — leave `threads=0` for lz4-only workloads.
- blosc2 scales best. Encoding with blosc2+lz4 reaches 2.8× on 8 threads; the sp(24)+blosc2 combination reaches 2.7× on encode and 1.3× on decode.
- zstd scales ~2.4× on encode at 4 threads via libzstd's `NbWorkers`. Beyond 4 threads the benefit plateaus on this CPU.
- simple_packing decode is 2.3× faster at 2+ threads — the internal chunk-parallel scatter saturates memory bandwidth quickly.
- szip is single-threaded. The marginal gains shown for `sp(24)+szip` come from parallelising the `simple_packing` stage only; szip itself runs sequentially in v0.13.0.
The raw numbers above were produced by the threads-scaling binary
in rust/benchmarks. Re-run locally with:
cargo build --release -p tensogram-benchmarks
./target/release/threads-scaling \
--num-points 16000000 \
--iterations 5 \
--warmup 2 \
--threads 0,1,2,4,8
Simple Packing
Simple packing is a lossy quantisation technique derived from GRIB’s simple-packing method. It quantises a range of floating-point values into N-bit integers, dramatically reducing payload size at the cost of precision.
A 16-bit simple_packing payload is 4× smaller than the equivalent float64 and 2× smaller than float32, with precision loss typically below instrument noise for most bounded-range scientific measurements (temperatures, voltages, pressures, intensity counts).
How It Works
Given a set of float64 values V[i]:
- Find the minimum value `R` (the reference value).
- Scale all values relative to `R`: `Y[i] = (V[i] - R) × 10^D × 2^-E`
- Round `Y[i]` to the nearest integer and pack it into `B` bits (MSB first).
The parameters D (decimal scale factor), E (binary scale factor), and B (bits per value) are chosen automatically by compute_params().
flowchart TD
A["Input: V = [250.0, 251.3, 252.7]"]
B["Find reference value
R = min(V) = 250.0"]
C["Scale relative to R
[0, 1.3, 2.7] × 10^D × 2^−E"]
D["Round to integers
[0, 17369, 36044]"]
E["Pack as 16-bit MSB
00 00 43 99 8C 8C"]
A --> B --> C --> D --> E
style A fill:#388e3c,stroke:#2e7d32,color:#fff
style E fill:#1565c0,stroke:#0d47a1,color:#fff
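The three steps above fit in a few lines of pure Python. This is a minimal reference sketch, not the library's implementation: the real `compute_params()` also optimizes the decimal scale factor and handles the edge cases described below, while here `D` is fixed and `E` is simply chosen so the largest scaled value fits in the bit budget.

```python
import math

def pack(values, bits, D=0):
    """Quantize floats to `bits`-bit integers (simplified simple packing)."""
    R = min(values)                       # reference value
    span = (max(values) - R) * 10**D
    # Choose E so the largest scaled value fits in `bits` bits.
    E = 0 if span == 0 else math.ceil(math.log2(span / (2**bits - 1)))
    ints = [round((v - R) * 10**D / 2**E) for v in values]
    return R, D, E, ints

def unpack(R, D, E, ints):
    """Invert the quantization: V[i] ≈ n * 2^E / 10^D + R."""
    return [n * 2**E / 10**D + R for n in ints]
```

With the flowchart's input `[250.0, 251.3, 252.7]` at 16 bits, the round-trip error is bounded by half a quantization step (`2^(E-1)`); a constant field degenerates to all-zero integers and reconstructs exactly.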
Limitations and Edge Cases
NaN and ±Infinity are Rejected
compute_params() and encode() return an error if the data
contains any NaN or ±Infinity values. Simple packing has no
representation for non-finite numbers (unlike IEEE 754 floats), and
feeding Inf through the range / scale-factor derivation would
produce an i32::MAX-saturated binary_scale_factor that silently
decodes to NaN everywhere. Both are errors at the codec entry:
- NaN → `PackingError::NanValue(index)`
- +Inf / -Inf → `PackingError::InfiniteValue(index)`
Remove or replace non-finite values before encoding. If you want
to preserve them, switch to encoding="none" and opt in to the NaN
/ Inf bitmask companion via allow_nan=true / allow_inf=true —
see NaN / Inf Handling for the full
semantics. Simple packing cannot represent non-finite values at
all, so the mask companion is only available on the pass-through
encoding path.
#![allow(unused)]
fn main() {
use tensogram_encodings::simple_packing::compute_params;
// Both rejected:
let with_nan = vec![1.0_f64, 2.0, f64::NAN, 4.0];
let with_inf = vec![1.0_f64, 2.0, f64::INFINITY, 4.0];
assert!(compute_params(&with_nan, 16, 0).is_err());
assert!(compute_params(&with_inf, 16, 0).is_err());
}
Params Safety Net
Beyond input-value validation, encode() also checks the
SimplePackingParams it receives:
- `reference_value` must be finite (NaN/±Inf → error).
- `|binary_scale_factor| ≤ 256`. The threshold catches the `i32::MAX`-saturation fingerprint from feeding `Inf` through `compute_params` indirectly; real-world data (`|bsf| ≤ 60`) fits comfortably. The constant `MAX_REASONABLE_BINARY_SCALE = 256` is exported from `tensogram_encodings::simple_packing`.
This closes the standalone-API footgun where a caller constructs or
mutates SimplePackingParams directly rather than deriving them
from compute_params. Both failures surface as
PackingError::InvalidParams { field, reason } with a clear message
naming the offending field.
Constant Fields
If all values are identical (range = 0), compute_params() succeeds and stores everything in the reference value. All packed integers are 0. Decoding reconstructs the constant correctly.
bits_per_value Range
Valid range: 0 to 64. More than 64 bits is rejected. Zero bits is accepted — compute_params stores the first value as the reference value (not the minimum) and encode produces an empty byte buffer. Decode reconstructs the reference value for every element, so this is only lossless for constant fields. Typical range for scientific floating-point data is 8–24 bits.
| bits_per_value | Packed values | Precision vs float64 |
|---|---|---|
| 8 | 256 levels | Coarse (rough categories) |
| 16 | 65,536 levels | Good for temperature, wind |
| 24 | 16,777,216 levels | Near-float32 precision |
| 32 | ~4 billion levels | Near-float64 for most ranges |
API
compute_params
#![allow(unused)]
fn main() {
pub fn compute_params(
values: &[f64],
bits_per_value: u32,
decimal_scale_factor: i32,
) -> Result<SimplePackingParams, PackingError>
}
Computes the optimal packing parameters for the given data. Call this once before encoding.
#![allow(unused)]
fn main() {
let values: Vec<f64> = (0..1000).map(|i| 250.0 + i as f64 * 0.01).collect();
let params = compute_params(&values, 16, 0)?;
println!("reference_value: {}", params.reference_value);
println!("binary_scale_factor: {}", params.binary_scale_factor);
println!("bits_per_value: {}", params.bits_per_value);
}
encode
#![allow(unused)]
fn main() {
pub fn encode(
values: &[f64],
params: &SimplePackingParams,
) -> Result<Vec<u8>, PackingError>
}
Encodes f64 values to a packed byte buffer using the given parameters.
decode
#![allow(unused)]
fn main() {
pub fn decode(
packed: &[u8],
num_values: usize,
params: &SimplePackingParams,
) -> Result<Vec<f64>, PackingError>
}
Decodes a packed buffer back to f64 values. The num_values parameter is required because the byte length alone is not enough to determine the element count (bits per value may not divide evenly into bytes).
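The ambiguity is easy to see with the length arithmetic (`packed_len` is a hypothetical helper for illustration, not part of the API):

```rust
// Packed byte length is ceil(num_values * bits_per_value / 8).
// The mapping is not invertible, so decode() must be told num_values.
fn packed_len(num_values: usize, bits_per_value: usize) -> usize {
    (num_values * bits_per_value + 7) / 8
}

fn main() {
    // 7 and 8 values at 5 bits each both pack into 5 bytes:
    assert_eq!(packed_len(7, 5), 5);
    assert_eq!(packed_len(8, 5), 5);
}
```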
Precision Example
Consider a bounded-range scalar field spanning 90 units (e.g. a temperature field 220–310 K, a pressure field 950–1040 hPa, or any analogous bounded scientific quantity):
| bits_per_value | Step size | Max error |
|---|---|---|
| 8 | 0.353 units | ±0.18 units |
| 12 | 0.022 units | ±0.011 units |
| 16 | 0.00137 units | ±0.00069 units |
At 16 bits, the error is smaller than most practical sensor precisions. The same analysis applies to any physical quantity with a bounded dynamic range.
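The table's arithmetic is a one-liner (a quick sketch, not a library function):

```rust
// Quantisation step for a bounded range split across 2^bits - 1 intervals;
// the worst-case round-trip error is half a step.
fn step_size(range: f64, bits: u32) -> f64 {
    range / ((1u64 << bits) - 1) as f64
}

fn main() {
    for bits in [8u32, 12, 16] {
        let step = step_size(90.0, bits);
        println!("{bits:2} bits: step {step:.5}, max error ±{:.5}", step / 2.0);
    }
}
```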
Full Integration Example
#![allow(unused)]
fn main() {
use tensogram::{encode, decode, GlobalMetadata, DataObjectDescriptor,
ByteOrder, Dtype, EncodeOptions, DecodeOptions};
use tensogram_encodings::simple_packing;
use ciborium::Value;
use std::collections::BTreeMap;
// Source data: 1000 temperature values
let values: Vec<f64> = (0..1000).map(|i| 273.0 + i as f64 * 0.05).collect();
let raw: Vec<u8> = values.iter().flat_map(|v| v.to_ne_bytes()).collect();
// Compute packing parameters
let params = simple_packing::compute_params(&values, 16, 0).unwrap();
// Build descriptor with packing params
let mut p = BTreeMap::new();
p.insert("reference_value".into(), Value::Float(params.reference_value));
p.insert("binary_scale_factor".into(),
Value::Integer((params.binary_scale_factor as i64).into()));
p.insert("decimal_scale_factor".into(),
Value::Integer((params.decimal_scale_factor as i64).into()));
p.insert("bits_per_value".into(),
Value::Integer((params.bits_per_value as i64).into()));
let desc = DataObjectDescriptor {
obj_type: "ntensor".into(),
ndim: 1,
shape: vec![1000],
strides: vec![1],
dtype: Dtype::Float64,
byte_order: ByteOrder::Big,
encoding: "simple_packing".into(),
filter: "none".into(),
compression: "none".into(),
params: p,
hash: None,
};
let global = GlobalMetadata { version: 2, ..Default::default() };
let msg = encode(&global, &[(&desc, &raw)], &EncodeOptions::default()).unwrap();
println!("Packed size: {} bytes (was {} bytes)", msg.len(), raw.len());
let (_, objects) = decode(&msg, &DecodeOptions::default()).unwrap();
let decoded: Vec<f64> = objects[0].1.chunks_exact(8)
.map(|c| f64::from_ne_bytes(c.try_into().unwrap()))
.collect();
// Check precision
for (orig, dec) in values.iter().zip(decoded.iter()) {
assert!((orig - dec).abs() < 0.001);
}
}
Byte Shuffle Filter
The shuffle filter rearranges the bytes of a multi-byte array to improve compression. It is the same algorithm used by HDF5 and NetCDF4.
Why Shuffle Helps
For float32 data, each value occupies 4 bytes. The bytes within a float are not independent — nearby values tend to share their most-significant bytes (exponent + high mantissa) while the least-significant bytes are more random.
Without shuffle, the bytes are interleaved:
[B0 B1 B2 B3][B0 B1 B2 B3][B0 B1 B2 B3]...
A compressor sees B0 B1 B2 B3 B0 B1 B2 B3 B0 B1 B2 B3 ... — not very compressible because the predictable (B0, B1) bytes are mixed with the random (B3) bytes.
After shuffle, all byte-0s come first, then all byte-1s, etc.:
[B0 B0 B0 ...][B1 B1 B1 ...][B2 B2 B2 ...][B3 B3 B3 ...]
Now the B0 run and B1 run are highly compressible (long runs of similar values). The B3 run is still noisy, but it’s isolated. Overall compression improves significantly.
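The rearrangement is a plain byte transpose. A minimal sketch (the library's `shuffle()` performs the same permutation, with error handling instead of an assert):

```rust
// Transpose an n × element_size byte matrix: gather byte position b of
// every element into one contiguous run.
fn shuffle_bytes(data: &[u8], element_size: usize) -> Vec<u8> {
    assert_eq!(data.len() % element_size, 0);
    let n = data.len() / element_size;
    let mut out = vec![0u8; data.len()];
    for i in 0..n {
        for b in 0..element_size {
            out[b * n + i] = data[i * element_size + b];
        }
    }
    out
}

fn main() {
    // Two 2-byte elements [A0 A1][B0 B1] become [A0 B0][A1 B1]
    assert_eq!(shuffle_bytes(&[0xA0, 0xA1, 0xB0, 0xB1], 2),
               vec![0xA0, 0xB0, 0xA1, 0xB1]);
}
```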
API
shuffle
#![allow(unused)]
fn main() {
pub fn shuffle(data: &[u8], element_size: usize) -> Result<Vec<u8>, ShuffleError>
}
Rearranges bytes. element_size is the byte width of each element (e.g. 4 for float32, 8 for float64).
#![allow(unused)]
fn main() {
let floats: Vec<f32> = vec![1.0, 2.0, 3.0, 4.0];
let raw: Vec<u8> = floats.iter().flat_map(|f| f.to_ne_bytes()).collect();
let shuffled = shuffle(&raw, 4)?;
// shuffled is ready for compression
}
unshuffle
#![allow(unused)]
fn main() {
pub fn unshuffle(data: &[u8], element_size: usize) -> Result<Vec<u8>, ShuffleError>
}
Reverses the shuffle. Applied automatically by the decode pipeline.
Using Shuffle in a Message
Set filter: "shuffle" in the DataObjectDescriptor and provide shuffle_element_size:
#![allow(unused)]
fn main() {
use ciborium::Value;
let mut params = BTreeMap::new();
params.insert(
"shuffle_element_size".to_string(),
Value::Integer(4.into()), // 4 bytes per float32
);
let desc = DataObjectDescriptor {
obj_type: "ntensor".to_string(),
ndim: 1,
shape: vec![100],
strides: vec![1],
dtype: Dtype::Float32,
byte_order: ByteOrder::Big,
encoding: "none".to_string(),
filter: "shuffle".to_string(),
compression: "none".to_string(),
params,
hash: None,
};
}
Edge Cases
Element Size Must Divide the Buffer
The shuffle operation requires data.len() % element_size == 0. If this is not true, the function returns Err(ShuffleError::Misaligned). Ensure your data buffer is a whole number of elements.
Shuffle Alone Does Not Compress
Shuffle rearranges bytes but does not reduce the total byte count. It only helps when followed by a compression stage (e.g. szip, zstd, lz4, blosc2). Set compression in the descriptor to apply compression after the shuffle step.
Combining with simple_packing
When using both encoding: "simple_packing" and filter: "shuffle", the pipeline applies them in order: encode first, then shuffle. The simple_packing output is an MSB-first bitstream with no multi-byte element structure, so shuffle_element_size should be 1 in this case (there is no benefit from shuffling already-packed data). In practice, the combination is unusual — either use simple_packing alone (when quantising float values) or shuffle alone (before a lossless compressor).
Compression
Compression is the third stage of the encoding pipeline. It reduces the total byte count of the already-encoded and filtered payload.
Supported Compressors
| Compressor | Type | Random Access | Notes |
|---|---|---|---|
| `none` | Pass-through | Yes (trivial) | No compression |
| `szip` | Lossless | Yes (RSI blocks) | CCSDS 121.0-B-3 via libaec. Best for integer/packed data |
| `zstd` | Lossless | No | Zstandard. Excellent ratio/speed tradeoff |
| `lz4` | Lossless | No | Fastest decompression. Good for real-time pipelines |
| `blosc2` | Lossless | Yes (chunks) | Multi-codec meta-compressor with chunk-level access |
| `zfp` | Lossy | Yes (fixed-rate) | Purpose-built for floating-point arrays |
| `sz3` | Lossy | No | Error-bounded lossy compression for scientific data |
The Compressor Trait
All compressors implement a common interface with three operations:
#![allow(unused)]
fn main() {
pub trait Compressor {
fn compress(&self, data: &[u8]) -> Result<CompressResult, CompressionError>;
fn decompress(&self, data: &[u8], expected_size: usize) -> Result<Vec<u8>, CompressionError>;
fn decompress_range(
&self,
data: &[u8],
block_offsets: &[u64],
byte_pos: usize,
byte_size: usize,
) -> Result<Vec<u8>, CompressionError>;
}
}
decompress_range enables partial decode without decompressing the entire payload. Compressors that don’t support it return CompressionError::RangeNotSupported.
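To make the contract concrete, here is a hypothetical pass-through implementation. The `CompressResult` and `CompressionError` definitions below are simplified stand-ins for the real tensogram_encodings types, included only so the sketch is self-contained:

```rust
// Simplified stand-ins for the library's types (illustrative only).
pub struct CompressResult { pub bytes: Vec<u8>, pub block_offsets: Option<Vec<u64>> }
#[derive(Debug)]
pub enum CompressionError { RangeNotSupported }

pub trait Compressor {
    fn compress(&self, data: &[u8]) -> Result<CompressResult, CompressionError>;
    fn decompress(&self, data: &[u8], expected_size: usize) -> Result<Vec<u8>, CompressionError>;
    fn decompress_range(&self, data: &[u8], block_offsets: &[u64],
                        byte_pos: usize, byte_size: usize) -> Result<Vec<u8>, CompressionError>;
}

struct NoopCompressor;

impl Compressor for NoopCompressor {
    fn compress(&self, data: &[u8]) -> Result<CompressResult, CompressionError> {
        Ok(CompressResult { bytes: data.to_vec(), block_offsets: None })
    }
    fn decompress(&self, data: &[u8], _expected_size: usize) -> Result<Vec<u8>, CompressionError> {
        Ok(data.to_vec())
    }
    fn decompress_range(&self, data: &[u8], _block_offsets: &[u64],
                        byte_pos: usize, byte_size: usize) -> Result<Vec<u8>, CompressionError> {
        // Pass-through is trivially random-access: slice the requested range.
        Ok(data[byte_pos..byte_pos + byte_size].to_vec())
    }
}

fn main() {
    let c = NoopCompressor;
    let out = c.decompress_range(b"hello world", &[], 6, 5).unwrap();
    assert_eq!(out, b"world".to_vec());
}
```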
Lossless Compressors
Szip (libaec)
Szip implements CCSDS 121.0-B-3, a lossless compressor designed for scientific data. It works on integer data and exploits the block structure of packed values.
Random access: Szip records RSI (Reference Sample Interval) block boundaries during encoding. These offsets are stored in metadata as szip_block_offsets, enabling seek-to-block partial decode via decompress_range. When using encode_pre_encoded, the caller must provide these bit-precise block offsets themselves to enable random access (see Pre-encoded Payloads).
| Parameter | Type | Description |
|---|---|---|
| `szip_rsi` | uint | Reference sample interval (samples per RSI block) |
| `szip_block_size` | uint | Block size (typically 8 or 16) |
| `szip_flags` | uint | AEC encoding flags (e.g., `AEC_DATA_PREPROCESS`) |
| `szip_block_offsets` | array of uint | Bit offsets of RSI block boundaries (computed during encoding) |
Important: libaec encodes integers only. For floating-point data, use either:
- `simple_packing` → `szip` (lossy quantization to integers, then compress)
- `shuffle` → `szip` (byte rearrangement, then compress as uint8)
Zstd (Zstandard)
General-purpose lossless compression with excellent ratio/speed tradeoff. Widely used and well-optimized.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `zstd_level` | int | 3 | Compression level (1-22). Higher = better ratio, slower |
No random access — decode_range is not supported with zstd.
LZ4
Fastest decompression of any compressor in the library. Slightly lower compression ratio than Zstd, but 3-5x faster to decompress.
No configurable parameters. No random access.
Blosc2
A meta-compressor that splits data into independently-compressed chunks, then stores them in a frame. Supports multiple internal codecs.
Random access: Because each chunk is independent, Blosc2 can decompress only the chunks covering the requested byte range. decompress_range works by mapping byte offsets to chunk indices.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `blosc2_codec` | string | "lz4" | Internal codec: blosclz, lz4, lz4hc, zlib, zstd |
| `blosc2_clevel` | int | 5 | Compression level (0-9) |
| `blosc2_typesize` | uint | (auto) | Element byte width for shuffle optimization |
`blosc2_typesize` is automatically computed from the preceding pipeline stage: dtype byte width for unencoded data, 1 for shuffled bytes, or packed byte width for simple_packing output.
Lossy Compressors
ZFP
Purpose-built compression for floating-point arrays. ZFP compresses data in blocks of 4 elements (1D) and supports three modes:
| Mode | Parameter | Description |
|---|---|---|
| `fixed_rate` | `zfp_rate` (float) | Fixed bits per value. Enables O(1) random access |
| `fixed_precision` | `zfp_precision` (uint) | Fixed number of uncompressed bit planes |
| `fixed_accuracy` | `zfp_tolerance` (float) | Maximum absolute error bound |
Random access: In fixed-rate mode, every block compresses to exactly the same number of bits. This means the byte offset of any block is computable from its index, enabling decompress_range without stored block offsets.
| Parameter | Type | Description |
|---|---|---|
| `zfp_mode` | string | One of "fixed_rate", "fixed_precision", "fixed_accuracy" |
| `zfp_rate` | float | Bits per value (only for fixed_rate) |
| `zfp_precision` | uint | Bit planes to keep (only for fixed_precision) |
| `zfp_tolerance` | float | Max absolute error (only for fixed_accuracy) |
Important: ZFP operates directly on floating-point data. Use `encoding: "none"` and `filter: "none"` — ZFP replaces both encoding and compression.
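The fixed-rate offset arithmetic can be sketched as follows (illustrative reasoning, not the ZFP API; assumes 1-D blocks of 4 values, as described above):

```rust
// In fixed-rate mode every 4-value block occupies exactly 4 * rate bits,
// so a block's position is pure arithmetic on its index — no seek table.
fn block_bit_offset(element_index: usize, rate_bits_per_value: usize) -> usize {
    (element_index / 4) * 4 * rate_bits_per_value
}

fn main() {
    assert_eq!(block_bit_offset(0, 8), 0);   // element 0 is in block 0
    assert_eq!(block_bit_offset(5, 8), 32);  // element 5 lives in block 1
}
```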
SZ3
Error-bounded lossy compression for scientific data. SZ3 uses prediction-based methods (interpolation, Lorenzo, regression) to achieve high compression ratios within strict error bounds.
| Parameter | Type | Description |
|---|---|---|
| `sz3_error_bound_mode` | string | One of "abs", "rel", "psnr" |
| `sz3_error_bound` | float | Error bound value (meaning depends on mode) |
Error bound modes:
- `abs` — Absolute error: `|original - decompressed| <= bound` for every element
- `rel` — Relative error: `|original - decompressed| / value_range <= bound`
- `psnr` — Peak signal-to-noise ratio lower bound
No random access — decode_range is not supported with SZ3.
Important: Like ZFP, SZ3 operates on floating-point data. Use `encoding: "none"` and `filter: "none"`.
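As a sketch of what the abs guarantee means per element (an illustrative check, not the SZ3 API):

```rust
// sz3's "abs" mode promises every element's round-trip error stays within
// the bound; this is the check a consumer could apply after decompression.
fn within_abs_bound(original: &[f64], decoded: &[f64], bound: f64) -> bool {
    original.iter().zip(decoded).all(|(a, b)| (a - b).abs() <= bound)
}

fn main() {
    let orig = [1.00, 2.00, 3.00];
    let lossy = [1.01, 1.99, 3.02]; // per-element errors: 0.01, 0.01, 0.02
    assert!(within_abs_bound(&orig, &lossy, 0.05));
    assert!(!within_abs_bound(&orig, &lossy, 0.01));
}
```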
Choosing a Compressor
flowchart TD
A{"Data type?"}
A -->|"Integer / packed"| B{"Need random access?"}
A -->|"Float, lossy OK"| C{"Need random access?"}
A -->|"Float, lossless"| D{"Speed priority?"}
B -->|Yes| E["szip"]
B -->|No| F{"Speed or ratio?"}
F -->|Speed| G["lz4"]
F -->|Ratio| H["zstd"]
C -->|Yes| I["zfp (fixed_rate)"]
C -->|No| J{"Error bound type?"}
J -->|"Bits/precision"| K["zfp"]
J -->|"Absolute/relative"| L["sz3"]
D -->|"Fastest decompress"| M["lz4"]
D -->|"Best ratio"| N["blosc2 or zstd"]
D -->|"Need random access"| O["blosc2"]
style E fill:#388e3c,stroke:#2e7d32,color:#fff
style I fill:#388e3c,stroke:#2e7d32,color:#fff
style O fill:#388e3c,stroke:#2e7d32,color:#fff
| Use case | Recommended | Why |
|---|---|---|
| Quantised floats with partial-access support | simple_packing + szip | RSI-block random access; interoperable with GRIB 2 CCSDS packing |
| Real-time streaming | lz4 | Fastest decompression, low latency |
| Archival storage | zstd (level 9-15) | Best lossless ratio |
| ML model weights | blosc2 | Chunk random access, good for large tensors |
| Float fields, lossy OK | zfp (fixed_rate) | Best lossy ratio with random access |
| Error-bounded science | sz3 (abs) | Guaranteed error bounds per element |
| Exact integers | none or lz4 | No information loss |
Invalid Combinations
Some pipeline combinations are rejected at configuration time:
| Combination | Rejected? | Reason |
|---|---|---|
| `zfp` + `shuffle` | Yes | ZFP operates on typed floats; shuffle rearranges bytes |
| `zfp` + `simple_packing` | Yes | ZFP is itself the encoding for floats |
| `sz3` + `shuffle` | Yes | SZ3 operates on typed data |
| `sz3` + `simple_packing` | Yes | SZ3 is itself a lossy encoding for floats |
| `shuffle` + `decode_range` | Yes | Byte rearrangement breaks contiguous sample ranges |
| `zstd`/`lz4`/`sz3` + `decode_range` | Yes | Stream compressors don't support partial decode |
tensogram info
Displays a summary of a Tensogram file: number of messages, total file size, and format version.
Usage
tensogram info [FILES]...
Options
| Option | Description |
|---|---|
| `-h, --help` | Print help |
Example
$ tensogram info forecast.tgm
Messages : 48
File size: 1.2 GB
Version : 1
What it Shows
| Field | Description |
|---|---|
| Messages | Total number of valid messages found by scanning the file |
| File size | Raw byte count of the file on disk |
| Version | Format version from the first message’s metadata |
Notes
- The scan counts only valid messages (those with a matching `TENSOGRM` header and `39277777` terminator). Corrupted regions are skipped.
- If the file is empty, `Messages: 0` is shown.
- Version is read from the first message. If messages have different versions, only the first is shown.
tensogram ls
Lists messages in a Tensogram file, showing metadata in tabular or JSON format.
Usage
tensogram ls [OPTIONS] [FILES]...
Options
| Option | Description |
|---|---|
| `-w <WHERE_CLAUSE>` | Where-clause filter (e.g., `mars.param=2t/10u`) |
| `-p <KEYS>` | Comma-separated keys to display |
| `-j` | JSON output |
| `-h, --help` | Print help |
Examples
# List all messages with default columns
tensogram ls forecast.tgm
# Only temperature fields
tensogram ls forecast.tgm -w "mars.param=2t"
# Temperature or wind
tensogram ls forecast.tgm -w "mars.param=2t/10u/10v"
# Exclude ensemble members
tensogram ls forecast.tgm -w "mars.type!=em"
# Show only date and step columns
tensogram ls forecast.tgm -p "mars.date,mars.step"
# JSON output (one object per line, good for jq)
tensogram ls forecast.tgm -j | jq '.["mars.param"]'
Where Clause Syntax
The -w flag accepts a single expression:
key=value # exact match
key=v1/v2/v3 # OR — matches any of v1, v2, v3
key!=value # not equal
key!=v1/v2 # not any of v1, v2
Key format: namespace.field for namespaced keys (e.g. mars.param) or just field for top-level keys (e.g. version).
Missing key: For key=value, a missing key is treated as non-matching. For key!=value, a missing key passes the filter.
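The matching rules above can be sketched as predicates (hypothetical helpers, not the CLI's parser):

```rust
// `=` with a missing key does not match; `!=` with a missing key passes.
fn matches_eq(value: Option<&str>, wanted: &[&str]) -> bool {
    value.map_or(false, |v| wanted.contains(&v))
}

fn matches_ne(value: Option<&str>, unwanted: &[&str]) -> bool {
    value.map_or(true, |v| !unwanted.contains(&v))
}

fn main() {
    assert!(matches_eq(Some("2t"), &["2t", "10u"])); // key=v1/v2 is an OR
    assert!(!matches_eq(None, &["2t"]));             // missing key: `=` fails
    assert!(matches_ne(None, &["em"]));              // missing key: `!=` passes
}
```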
Only one -w expression can be specified per command. To apply multiple filters, pipe commands:
tensogram ls forecast.tgm -w "mars.type=fc" | grep "2t"
Pick Keys
The -p flag selects which metadata columns to display. Keys use the same dot-notation as -w:
tensogram ls forecast.tgm -p "mars.date,mars.step,mars.param"
Without -p, all available metadata keys are shown.
Default Table Output
mars.date mars.step mars.param mars.type shape
20260401 0 2t fc [721, 1440]
20260401 0 10u fc [721, 1440]
20260401 0 10v fc [721, 1440]
20260401 6 2t fc [721, 1440]
...
JSON Output
With -j, each matching message is printed as a JSON object on its own line:
{"mars.date": "20260401", "mars.step": "0", "mars.param": "2t", "shape": "[721, 1440]"}
{"mars.date": "20260401", "mars.step": "0", "mars.param": "10u", "shape": "[721, 1440]"}
This is compatible with jq, grep, and any tool that processes newline-delimited JSON.
tensogram dump
Prints the full contents of every message in a Tensogram file — metadata keys and optionally the raw data values.
Usage
tensogram dump [OPTIONS] [FILES]...
Options
| Option | Description |
|---|---|
| `-w <WHERE_CLAUSE>` | Filter messages (e.g. `mars.param=2t`, same syntax as ls) |
| `-p <KEYS>` | Comma-separated keys to display |
| `-j` | JSON output |
| `-h, --help` | Print help |
Example
$ tensogram dump forecast.tgm
─── Message 0 ───
version : 1
mars.class : od
mars.type : fc
mars.date : 20260401
mars.step : 0
Object 0
type : ntensor
ndim : 2
shape : [721, 1440]
strides : [1440, 1]
dtype : float32
mars.param: 2t
encoding : none
filter : none
compression: none
hash : xxh3:a3f0123456789abc
─── Message 1 ───
...
Filtering
Use -w to limit the dump to specific messages:
# Dump only wave spectra
tensogram dump forecast.tgm -w "mars.param=wave_spectra"
JSON Output
With -j, each message is a JSON object:
{
"message": 0,
"metadata": {
"version": 2,
"base": [
{
"mars": {"class": "od", "type": "fc", "date": "20260401", "step": 0, "param": "2t"},
"_reserved_": {"tensor": {"ndim": 2, "shape": [721, 1440], "strides": [1440, 1], "dtype": "float32"}}
}
]
},
"objects": [
{"type": "ntensor", "ndim": 2, "shape": [721, 1440], "dtype": "float32",
"encoding": "none", "hash": {"type": "xxh3", "value": "a3f0..."}}
]
}
When to Use dump vs ls
- Use `ls` for a quick overview of many messages (one line per message)
- Use `dump` when you need to see all keys for a specific message, or check encoding parameters
tensogram get
Extracts a single metadata value from messages in a file. Returns an error if the key is missing.
Usage
tensogram get [OPTIONS] -p <KEYS> [FILES]...
Options
| Option | Description |
|---|---|
| `-w <WHERE_CLAUSE>` | Filter messages (e.g. `mars.param=2t`, same syntax as ls) |
| `-p <KEYS>` | Comma-separated keys to extract (required) |
| `-h, --help` | Print help |
Examples
# Get the mars.param value from all messages
tensogram get -p mars.param forecast.tgm
# Get the date from messages where param is 2t
tensogram get -p mars.date -w "mars.param=2t" forecast.tgm
# Get the shape of object 0
tensogram get -p shape forecast.tgm
Strict Key Lookup
Unlike ls which shows a blank for missing keys, get exits with a non-zero status if any matching message does not have the requested key:
$ tensogram get -p mars.nonexistent forecast.tgm
Error: key not found: mars.nonexistent
This makes get safe to use in shell scripts where missing data should fail fast.
Multi-Object Messages
For messages with multiple objects, get returns the first matching value found. Lookup checks top-level metadata first and then scans objects in order until it finds a match.
tensogram set
Modifies metadata keys in messages and writes the result to a new file. Matching messages are decoded, their metadata is updated, and they are re-encoded with the original payload bytes and pipeline settings.
Usage
tensogram set [OPTIONS] -s <SET_VALUES> <INPUT> <OUTPUT>
Options
| Option | Description |
|---|---|
| `-s <SET_VALUES>` | Key=value pairs to set (comma-separated) |
| `-w <WHERE_CLAUSE>` | Only modify messages matching this filter (e.g. `mars.param=2t`) |
| `-h, --help` | Print help |
Examples
# Change mars.date to 20260402 in all messages
tensogram set -s mars.date=20260402 input.tgm output.tgm
# Set multiple keys at once
tensogram set -s mars.date=20260402,mars.step=12 input.tgm output.tgm
# Only modify temperature fields
tensogram set -s mars.class=rd -w "mars.param=2t" input.tgm output.tgm
Key=Value Syntax
Multiple mutations can be specified as a comma-separated list:
tensogram set -s key1=val1,key2=val2,key3=val3 in.tgm out.tgm
Keys use dot-notation: mars.param sets the param field inside the mars namespace. A top-level key like experiment sets a top-level metadata field.
Object-level metadata can be updated with objects.<index>.<path>:
# Add object-specific metadata to the first object
tensogram set -s objects.0.processing.version=2 input.tgm output.tgm
Structural/Integrity Keys
The following keys cannot be modified because they describe the physical structure of the payload. Changing them would make the metadata inconsistent with the actual bytes on disk:
| Key | Reason |
|---|---|
| `shape` | Tensor dimensions |
| `strides` | Memory layout |
| `dtype` | Element type |
| `ndim` | Number of dimensions |
| `type` | Object type |
| `encoding` | Encoding algorithm |
| `filter` | Filter algorithm |
| `compression` | Compression algorithm |
| `hash` | Payload integrity hash |
| `szip_rsi` | Szip compression block parameter |
| `szip_block_size` | Szip compression block parameter |
| `szip_flags` | Szip compression flags |
| `szip_block_offsets` | Szip block seek table |
| `reference_value` | Simple packing quantization parameter |
| `binary_scale_factor` | Simple packing quantization parameter |
| `decimal_scale_factor` | Simple packing quantization parameter |
| `bits_per_value` | Simple packing quantization parameter |
| `shuffle_element_size` | Shuffle filter parameter |
Attempting to modify any of these returns an error before any output is written.
Pass-Through for Non-Matching Messages
Messages that do not match the -w filter are copied verbatim to the output file. Their bytes are not re-encoded or re-hashed.
Note: Messages that are modified are re-encoded after the metadata mutation. Because the decoded payload bytes are unchanged, `set` preserves the original payload hash instead of recomputing it.
Workflow
flowchart TD
A[Read message] --> B{Matches -w?}
B -- No --> C[Write raw bytes to output]
B -- Yes --> D[Decode metadata]
D --> E[Apply mutations]
E --> F[Re-encode message\npreserve payload hash]
F --> G[Write to output]
C --> H[Next message]
G --> H
tensogram copy
Copies messages from one file to one or more output files. The output filename can include placeholders that expand to metadata values, allowing a single file to be split by parameter, date, step, or any other key.
Usage
tensogram copy [OPTIONS] <INPUT> <OUTPUT>
Options
| Option | Description |
|---|---|
| `-w <WHERE_CLAUSE>` | Only copy messages that match this filter |
| `-h, --help` | Print help |
Basic Copy
# Copy all messages from one file to another
tensogram copy input.tgm output.tgm
Filename Placeholders
Wrap any metadata key in square brackets to expand it in the output filename:
# One file per parameter
tensogram copy forecast.tgm "by_param/[mars.param].tgm"
# Produces: by_param/2t.tgm, by_param/10u.tgm, by_param/msl.tgm, ...
# One file per date+step combination
tensogram copy forecast.tgm "archive/[mars.date]_[mars.step].tgm"
# Produces: archive/20260401_0.tgm, archive/20260401_6.tgm, ...
# Split by type and param
tensogram copy forecast.tgm "split/[mars.type]/[mars.param].tgm"
# Produces: split/fc/2t.tgm, split/an/2t.tgm, etc.
Multiple messages with the same expanded filename are appended to the same output file. This is how you split-then-concatenate: a 1000-message file with 4 unique mars.param values produces 4 output files with ~250 messages each.
Filtering During Copy
Combine -w with placeholders for targeted extraction:
# Copy only forecasts, split by step
tensogram copy forecast.tgm "steps/[mars.step].tgm" -w "mars.type=fc"
Edge Cases
Missing Placeholder Key
If a message does not have the key referenced by a placeholder, that placeholder expands to unknown:
# If mars.param is missing, the message is written to by_param/unknown.tgm
tensogram copy forecast.tgm "by_param/[mars.param].tgm"
Output Directory
The output directory must exist before running copy. The command does not create directories. Use mkdir -p beforehand:
mkdir -p by_param
tensogram copy forecast.tgm "by_param/[mars.param].tgm"
Overwriting
If the expanded output filename already exists before the copy starts, it is truncated once and matching messages are then appended in order. This means running copy twice will duplicate messages. To avoid this, delete or rename existing outputs first.
Placeholder Syntax Conflicts
If a metadata value contains /, \, or other characters that are invalid in filenames on your OS, the resulting filename will be invalid. Choose placeholder keys whose values are filesystem-safe (e.g. dates, step numbers, short codes).
tensogram merge
Merge messages from one or more files into a single message.
Usage
tensogram merge [OPTIONS] --output <OUTPUT> [INPUTS]...
Options
| Option | Description |
|---|---|
| `-o, --output <OUTPUT>` | Output file |
| `-s, --strategy <STRATEGY>` | Merge strategy for conflicting metadata keys: `first` — first value wins, `last` — last value wins, `error` — fail on conflict [default: first] |
| `-h, --help` | Print help |
Description
All data objects from all input messages are collected into a single Tensogram message. Global metadata is merged according to --strategy: first (default) keeps the first value, last keeps the last, and error fails on conflict.
Examples
# Merge two files into one
tensogram merge file1.tgm file2.tgm -o merged.tgm
# Merge all messages in a single multi-message file
tensogram merge multi.tgm -o single.tgm
tensogram split
Split multi-object messages into separate single-object files.
Usage
tensogram split --output <OUTPUT> <INPUT>
Options
| Option | Description |
|---|---|
| `-o, --output <OUTPUT>` | Output template (use `[index]` for numbering) |
| `-h, --help` | Print help |
Description
Each data object from each message in the input file becomes its own Tensogram message, inheriting the global metadata.
Output files are named using the template:
- Use `[index]` for zero-padded numbering: `split_[index].tgm` → `split_0000.tgm`, `split_0001.tgm`, …
- Without `[index]`, the index is appended before the extension: `out.tgm` → `out_0000.tgm`, `out_0001.tgm`, …
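The naming rule can be sketched as follows (`output_name` is a hypothetical helper for illustration, not the CLI's implementation):

```rust
// Substitute [index] if the template contains it; otherwise insert the
// zero-padded index just before the file extension.
fn output_name(template: &str, index: usize) -> String {
    let idx = format!("{:04}", index);
    if template.contains("[index]") {
        template.replace("[index]", &idx)
    } else if let Some(dot) = template.rfind('.') {
        format!("{}_{}{}", &template[..dot], idx, &template[dot..])
    } else {
        format!("{}_{}", template, idx)
    }
}

fn main() {
    assert_eq!(output_name("split_[index].tgm", 1), "split_0001.tgm");
    assert_eq!(output_name("out.tgm", 0), "out_0000.tgm");
}
```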
Examples
# Split with index template
tensogram split multi_object.tgm -o 'field_[index].tgm'
# Split with auto-numbered names
tensogram split multi_object.tgm -o output.tgm
tensogram reshuffle
Reshuffle frames: move footer frames to header position.
Usage
tensogram reshuffle --output <OUTPUT> <INPUT>
Options
| Option | Description |
|---|---|
| `-o, --output <OUTPUT>` | Output file |
| `-h, --help` | Print help |
Description
Converts streaming-mode messages (footer-based index and hash frames) into random-access-mode messages (header-based index and hash frames).
This is a decode → re-encode operation. The data is not modified; only the frame layout changes so that index and hash information appears before the data objects, enabling efficient random access.
Examples
tensogram reshuffle streamed.tgm -o random_access.tgm
tensogram validate
Check whether .tgm files are well-formed and intact. Analogous to grib_check or h5check.
Usage
tensogram validate [OPTIONS] <FILES>...
Validation Levels
The command runs up to four validation levels:
| Level | Name | What it checks |
|---|---|---|
| 1 | Structure | Magic bytes, frame headers, ENDF markers, total_length, postamble, frame ordering, preceder legality, preamble flags vs observed frames |
| 2 | Metadata | CBOR parses correctly, required keys present (_reserved_.tensor, dtype, shape, strides), encoding/filter/compression types recognized, object count consistency, shape/strides/ndim consistency |
| 3 | Integrity | xxh3 hash in descriptor/hash-frame matches recomputed hash, compressed payloads decompress without error |
| 4 | Fidelity | Full decode succeeds, decoded size matches shape/dtype, NaN/Inf in float arrays are errors |
Modes
| Mode | Levels | Description |
|---|---|---|
| default | 1–3 | Structure + metadata + integrity |
| quick | 1 | Structure only, no payloads |
| checksum | 3 | Hash verification only (structural errors still reported, no decompression) |
| full | 1–4 | All levels including fidelity (NaN/Inf check) |
Level selectors (--quick, --checksum, --full) are mutually exclusive. --canonical is independent and can be combined with any level selector.
All flags
| Flag | Description |
|---|---|
| `--quick` | Quick mode: structure only (level 1) |
| `--checksum` | Checksum only: hash verification (structural errors still reported, but metadata/decompression/fidelity checks skipped) |
| `--full` | Full mode: all levels including fidelity (levels 1-4) |
| `--canonical` | Check RFC 8949 canonical CBOR key ordering (combinable with any level) |
| `--json` | Machine-parseable JSON output |
| `-h, --help` | Print help |
Output
Human-readable (default)
file.tgm: OK (3 messages, 47 objects, hash verified)
On failure:
bad.tgm: FAILED — message 2, object 5: hash mismatch (expected a3f7..., got 91c2...)
bad.tgm: FAILED (1 error, 1 message, 3 objects)
JSON (--json)
[
{
"file": "file.tgm",
"status": "ok",
"messages": 1,
"objects": 3,
"hash_verified": true,
"file_issues": [],
"message_reports": [
{
"issues": [],
"object_count": 3,
"hash_verified": true
}
]
}
]
On failure, issues within message_reports[i].issues contain (note: object_index is 0-based in JSON; absent fields are omitted, not null):
{
"code": "hash_mismatch",
"level": "integrity",
"severity": "error",
"object_index": 4,
"description": "hash mismatch (expected a3f7..., got 91c2...)"
}
Issue codes are stable snake_case strings (e.g. hash_mismatch, invalid_magic, buffer_too_short) suitable for machine parsing.
Exit Code
- `0` — all files pass validation
- `1` — one or more files have errors or file-level issues
Batch Mode
tensogram validate data/*.tgm
Validates all files. Reports per-file. Exits 1 if any file fails.
File-level Checks
When validating a file with multiple messages, the command also detects:
- Unrecognized bytes between messages (garbage or padding)
- Truncated messages at end of file
- Trailing bytes after the last valid message
These are reported as file-level issues and cause validation to fail (exit code 1).
Library API
The same validation is available programmatically:
#![allow(unused)]
fn main() -> Result<(), Box<dyn std::error::Error>> {
    use std::path::Path;
    use tensogram::{validate_message, validate_file, ValidateOptions};

    // Validate a single message buffer
    let bytes: Vec<u8> = std::fs::read("data.tgm")?; // any complete message buffer
    let report = validate_message(&bytes, &ValidateOptions::default());
    assert!(report.is_ok());

    // Validate a file
    let file_report = validate_file(Path::new("data.tgm"), &ValidateOptions::default())?;
    println!("{} messages, {} objects",
             file_report.messages.len(), file_report.total_objects());
    Ok(())
}
Examples
# Default validation (levels 1-3)
tensogram validate measurements.tgm
# Quick structural check
tensogram validate --quick *.tgm
# Verify checksums only
tensogram validate --checksum archive/*.tgm
# Full validation including NaN/Inf detection (levels 1-4)
tensogram validate --full output.tgm
# Full validation with canonical CBOR check
tensogram validate --full --canonical output.tgm
# Check canonical CBOR encoding
tensogram validate --canonical output.tgm
# JSON output for CI pipelines
tensogram validate --json data/*.tgm
GRIB Import
Tensogram provides tensogram-grib, a dedicated crate for importing GRIB
(GRIdded Binary) messages into Tensogram format. GRIB is widely used in
operational weather forecasting; this importer lets you bring existing GRIB
data into Tensogram pipelines while preserving the full MARS namespace
metadata. Conversion is one-way: GRIB → Tensogram.
System Requirement
The ecCodes C library must be installed:
brew install eccodes # macOS
apt install libeccodes-dev # Debian/Ubuntu
Building
The tensogram-grib crate is excluded from the default workspace build to
avoid requiring ecCodes on machines that do not need GRIB import.
# Build the library
cd rust/tensogram-grib && cargo build
# Build CLI with GRIB support
cargo build -p tensogram-cli --features grib
Conversion Modes
Merge All (default)
All GRIB messages are combined into a single Tensogram message with N data objects. ALL MARS keys for each GRIB message are placed into the corresponding base[i] entry independently — there is no common/varying partitioning in the output.
tensogram convert-grib forecast.grib -o forecast.tgm
One-to-One (split)
Each GRIB message becomes a separate Tensogram message with one data object. All MARS keys go into base[0].
tensogram convert-grib forecast.grib -o forecast.tgm --split
Rust API
#![allow(unused)]
fn main() -> Result<(), Box<dyn std::error::Error>> {
    use std::path::Path;
    use tensogram_grib::{convert_grib_file, ConvertOptions, Grouping};

    let options = ConvertOptions {
        grouping: Grouping::MergeAll,
        ..Default::default()
    };
    let messages = convert_grib_file(Path::new("forecast.grib"), &options)?;
    // messages is Vec<Vec<u8>> — each element is a complete Tensogram wire-format message
    Ok(())
}
Data Mapping
| Source (GRIB) | Target (Tensogram) |
|---|---|
| Grid values (values key) | Data object payload (float64, little-endian) |
| Grid dimensions (Ni, Nj) | DataObjectDescriptor.shape as [Nj, Ni] |
| Reduced Gaussian grids (Ni=0) | Shape [numberOfPoints] (1D) |
| MARS keys (all, per message) | GlobalMetadata.base[i]["mars"] (each entry independent) |
Scope
Only GRIB → Tensogram import is supported. Tensogram → GRIB is out of scope because Tensogram’s N-tensor data model is a superset of GRIB’s 2-D-field model; a faithful down-conversion is often impossible.
See also
- NetCDF Import — sister importer for NetCDF files; shares the --encoding/--bits/--filter/--compression pipeline flags with convert-grib.
- Vocabularies — other application vocabularies that can coexist with MARS in the same message.
MARS Key Mapping
The importer reads the following MARS namespace keys from each GRIB message using ecCodes’ read_key_dynamic API.
Keys Extracted
Identification
| GRIB Key | Description | Example |
|---|---|---|
| class | MARS class | "od" (operational) |
| type | Data type | "an" (analysis), "fc" (forecast) |
| stream | Data stream | "oper", "enfo" |
| expver | Experiment version | "0001" |
Parameter
| GRIB Key | Description | Example |
|---|---|---|
| param | Parameter ID | "2t" (2m temperature) |
| shortName | Short name | "2t" |
| name | Full name | "2 metre temperature" |
| paramId | Numeric ID | 167 |
| discipline | WMO discipline | 0 |
| parameterCategory | WMO category | 0 |
| parameterNumber | WMO number | 0 |
Vertical
| GRIB Key | Description | Example |
|---|---|---|
| level | Level value | 500 |
| typeOfLevel | Level type | "isobaricInhPa" |
| levtype | MARS level type | "pl" (pressure level) |
Temporal
| GRIB Key | Description | Example |
|---|---|---|
| date / dataDate | Reference date | 20260404 |
| time / dataTime | Reference time | 1200 |
| stepRange / step | Forecast step | "0", "6", "0-6" |
| stepUnits | Step units | 1 (hours) |
Spatial
| GRIB Key | Description | Example |
|---|---|---|
| gridType | Grid type | "regular_ll" |
| Ni, Nj | Grid dimensions | 360, 181 |
| numberOfPoints | Total grid points | 65160 |
| latitudeOfFirstGridPointInDegrees | First latitude | 90.0 |
| longitudeOfFirstGridPointInDegrees | First longitude | 0.0 |
| latitudeOfLastGridPointInDegrees | Last latitude | -90.0 |
| longitudeOfLastGridPointInDegrees | Last longitude | 359.0 |
| iDirectionIncrementInDegrees | Longitude step | 1.0 |
| jDirectionIncrementInDegrees | Latitude step | 1.0 |
Other
| GRIB Key | Description | Example |
|---|---|---|
| bitsPerValue | Packing precision | 16 |
| packingType | GRIB packing | "grid_simple" |
| centre | Originating centre | "ecmf" |
| subCentre | Sub-centre | 0 |
| generatingProcessIdentifier | Process ID | 148 |
Storage in Tensogram
Given N GRIB messages in merge-all mode:
- Extract all MARS keys from each message using read_key_dynamic
- Store ALL keys for each GRIB message in the corresponding base[i]["mars"] entry independently
- There is no common/varying partitioning in the output — each base[i] entry is self-contained
graph TD
A[N GRIB messages] --> B[Extract MARS keys from each]
B --> C["Store in base[i] independently"]
C --> D["base[0]: all keys from GRIB msg 0"]
C --> E["base[1]: all keys from GRIB msg 1"]
C --> F["base[N-1]: all keys from GRIB msg N-1"]
If you need to extract commonalities after decoding (e.g. for display), compute them in software with the compute_common() utility.
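Since each base[i] entry is self-contained, recovering a common/varying view is a post-decode operation. The text above mentions a compute_common() utility; the sketch below shows what such a split might look like in plain Python — the function name matches the utility mentioned, but this signature and implementation are illustrative assumptions, not the library's API.

```python
def compute_common(bases: list[dict]) -> tuple[dict, list[dict]]:
    """Split per-object metadata dicts into a (common, varying) pair.

    A key goes into `common` only if every entry carries the same value;
    everything else stays in the per-entry `varying` dicts."""
    if not bases:
        return {}, []
    common = {k: v for k, v in bases[0].items()
              if all(b.get(k) == v for b in bases[1:])}
    varying = [{k: v for k, v in b.items() if k not in common}
               for b in bases]
    return common, varying

# Two GRIB messages that differ only in forecast step
bases = [
    {"class": "od", "param": "2t", "step": "0"},
    {"class": "od", "param": "2t", "step": "6"},
]
common, varying = compute_common(bases)
print(common)   # {'class': 'od', 'param': '2t'}
print(varying)  # [{'step': '0'}, {'step': '6'}]
```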
Sentinel Handling
ecCodes uses sentinel values for missing keys:
- String: "MISSING" or "not_found" → skipped
- Integer: 2147483647 or -2147483647 → skipped
- Float: NaN or Inf → skipped
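The skip rule above can be expressed as a small predicate. This is a sketch of the filtering logic as described, not the importer's actual code:

```python
import math

# ecCodes sentinel values for missing keys, per the list above
STRING_SENTINELS = {"MISSING", "not_found"}
INT_SENTINELS = {2147483647, -2147483647}

def keep_key(value) -> bool:
    """Return False for sentinel values that mark a missing GRIB key."""
    if isinstance(value, str):
        return value not in STRING_SENTINELS
    if isinstance(value, bool):   # bool is an int subclass; check it first
        return True
    if isinstance(value, int):
        return value not in INT_SENTINELS
    if isinstance(value, float):
        return math.isfinite(value)  # drops NaN and ±Inf
    return True

raw = {"class": "od", "level": 2147483647, "step": float("nan"), "param": "2t"}
print({k: v for k, v in raw.items() if keep_key(v)})  # {'class': 'od', 'param': '2t'}
```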
NetCDF Import
Tensogram ships tensogram-netcdf, a dedicated crate for importing NetCDF
(both Classic and NetCDF-4) files into Tensogram messages. NetCDF is widely
used in climate, ocean, atmospheric, and Earth-observation science, but the
importer treats any NetCDF file the same way — the mapping is structural, not
domain-specific.
The crate is exposed through the CLI as tensogram convert-netcdf and through
a thin Rust library API. Conversion is one-way: NetCDF → Tensogram. There is
no Tensogram → NetCDF writer.
System requirement
The NetCDF C library must be installed on your system:
brew install netcdf # macOS
apt install libnetcdf-dev # Debian/Ubuntu
The crate transitively pulls in HDF5 (used internally by NetCDF-4 files), so
on Debian-family distros you also want libhdf5-dev.
Building
The tensogram-netcdf crate is excluded from the default workspace build to
avoid forcing libnetcdf on every contributor. Build it explicitly:
# Library
cargo build --manifest-path rust/tensogram-netcdf/Cargo.toml
# CLI with NetCDF support
cargo build -p tensogram-cli --features netcdf
The binary then exposes the new subcommand:
tensogram convert-netcdf --help
Quick example
# Convert one file
tensogram convert-netcdf input.nc -o output.tgm
# Convert multiple files into a single output
tensogram convert-netcdf jan.nc feb.nc mar.nc -o q1.tgm
# Stream to stdout (useful for piping)
tensogram convert-netcdf input.nc | tensogram info /dev/stdin
Command-line options
| Flag | Default | Description |
|---|---|---|
-o, --output PATH | stdout | Where to write the Tensogram file. |
--split-by MODE | file | Grouping mode: file, variable, or record. See Splitting modes. |
--cf | off | Extract the CF attribute allow-list into base[i]["cf"]. See CF metadata mapping. |
--encoding ENC | none | none or simple_packing. |
--bits N | auto (16) | Bits per value for simple_packing (1–64). |
--filter FILTER | none | none or shuffle. |
--compression CODEC | none | none, zstd, lz4, blosc2, or szip. |
--compression-level N | codec default | Level for zstd (1–22) and blosc2 (0–9). |
The --encoding/--bits/--filter/--compression/--compression-level
flags are the same set used by tensogram convert-grib. Both importers share
a PipelineArgs struct so the two commands stay symmetric.
How variables become objects
Each numeric NetCDF variable in the root group is mapped 1:1 to a Tensogram
data object. The variable’s name is stored under base[i]["name"], the dtype
and shape come from the NetCDF type and dimension list, and the raw bytes
become the object payload (always little-endian).
Dtype matrix
| NetCDF type | Tensogram Dtype |
|---|---|
| byte | Int8 |
| ubyte | Uint8 |
| short | Int16 |
| ushort | Uint16 |
| int | Int32 |
| uint | Uint32 |
| int64 | Int64 |
| uint64 | Uint64 |
| float | Float32 |
| double | Float64 |
char and string variables, as well as the NetCDF-4 enhanced types
(compound, vlen, enum, opaque), are skipped with a warning. They have
no clean tensor representation.
Scalar variables
A NetCDF scalar (zero dimensions) becomes an object with ndim = 0,
shape = [], and a single value in the payload.
Packed data
Variables with scale_factor or add_offset attributes are unpacked during
conversion: the raw integer values are read, multiplied by the scale factor,
the offset is added, and the result is stored as Float64 regardless of the
on-disk dtype. This matches the convention used by xarray and most NetCDF
tooling.
The fill value (_FillValue or missing_value) is replaced with NaN in the
unpacked output. The original sentinel is preserved under
base[i]["netcdf"]["_FillValue"] so consumers can recover it.
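The unpacking rule is the standard CF one: unpacked = raw × scale_factor + add_offset, with fill-value sentinels mapped to NaN. A minimal sketch of that transform (plain Python, not the importer's code):

```python
import math

def unpack(raw, scale_factor=1.0, add_offset=0.0, fill_value=None):
    """CF unpacking: unpacked = raw * scale_factor + add_offset.
    Fill-value sentinels become NaN in the Float64 output; the original
    sentinel is kept in metadata so consumers can recover it."""
    out = []
    for v in raw:
        if fill_value is not None and v == fill_value:
            out.append(math.nan)
        else:
            out.append(v * scale_factor + add_offset)
    return out

# int16-packed temperatures with scale_factor=0.01, add_offset=273.15,
# _FillValue=-32768 (matching the attribute example below)
vals = unpack([0, 100, -32768],
              scale_factor=0.01, add_offset=273.15, fill_value=-32768)
print([round(v, 2) for v in vals[:2]])  # [273.15, 274.15]
print(math.isnan(vals[2]))              # True
```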
Time coordinates
Time coordinate variables are stored as numeric values (typically Float64)
exactly as they appear in the file — Tensogram does not convert them to
calendar dates. The CF units string ("days since 1970-01-01") and
calendar ("gregorian", "noleap", etc.) are preserved under
base[i]["netcdf"] so a consumer can decode them on demand.
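Decoding those preserved units/calendar strings on demand is the consumer's job. The sketch below handles only the standard (gregorian) calendar with the stdlib — anything like "noleap" or "360_day" needs a CF-aware library such as cftime; the function and its unit table are assumptions for illustration:

```python
from datetime import datetime, timedelta

def decode_cf_time(values, units: str, calendar: str = "gregorian"):
    """Decode CF '<unit> since <epoch>' time values (standard calendar only)."""
    if calendar not in ("gregorian", "standard", "proleptic_gregorian"):
        raise ValueError(f"calendar {calendar!r} needs a CF-aware library")
    unit, _, epoch = units.partition(" since ")
    origin = datetime.fromisoformat(epoch.strip())
    step = {"days": timedelta(days=1),
            "hours": timedelta(hours=1),
            "seconds": timedelta(seconds=1)}[unit]
    return [origin + v * step for v in values]

print(decode_cf_time([0.0, 1.5], "days since 1970-01-01"))
# 1970-01-01 00:00 and 1970-01-02 12:00
```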
NetCDF-4 groups
Tensogram extracts only the root group of a NetCDF-4 file. If sub-groups are detected the importer prints a warning to stderr and continues with the root variables. Sub-group support is intentionally out of scope for v1 — most operational datasets keep their data variables at the root anyway.
Splitting modes
The --split-by flag controls how variables are grouped into Tensogram
messages.
--split-by=file (default)
All variables from one input file are bundled into a single Tensogram message containing N data objects. This is the most compact representation and is the right choice when you want to keep a NetCDF file as a single logical unit.
tensogram convert-netcdf forecast.nc -o forecast.tgm
# 1 message with N objects
--split-by=variable
Each variable becomes its own one-object Tensogram message. Useful when downstream consumers want to fetch individual variables without decoding the whole file.
tensogram convert-netcdf forecast.nc -o forecast.tgm --split-by variable
# N messages with 1 object each
--split-by=record
Splits along the unlimited (record) dimension. Each step along the unlimited
dimension produces a separate message. The unlimited dimension is detected
automatically; passing this mode against a file without one is a hard error
(NoUnlimitedDimension).
Variables that don’t depend on the unlimited dimension (e.g. a static mask
variable) are still included in every output message — that way each
record is fully self-describing.
tensogram convert-netcdf timeseries.nc -o timeseries.tgm --split-by record
# 1 message per record
Encoding pipeline flags
The pipeline flags are applied per data object before encoding into the
wire format. They use the same names and semantics as convert-grib:
| Stage | Flag | Notes |
|---|---|---|
| Encoding | --encoding simple_packing --bits N | Lossy quantization. Float64 only — non-f64 variables in the same file are skipped (with a warning) and pass through unencoded so mixed files convert cleanly. |
| Filter | --filter shuffle | Byte-shuffle filter, sets shuffle_element_size to the post-encoding byte width. |
| Compression | --compression zstd --compression-level 3 | zstd_level defaults to 3. |
| Compression | --compression lz4 | No params. |
| Compression | --compression blosc2 --compression-level 9 | Uses blosc2_codec=lz4 by default. |
| Compression | --compression szip | Sets szip_rsi=128, szip_block_size=16, szip_flags=8. Requires preceding simple_packing or shuffle because libaec szip caps at 32 bits per sample (raw f64 is 64 bits). |
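To see why the shuffle stage helps, here is the byte-shuffle transform in miniature. The real filter lives in tensogram-encodings; this pure-Python version is only an illustration of the reordering (element_size plays the role of the shuffle_element_size noted in the table):

```python
def shuffle(data: bytes, element_size: int) -> bytes:
    """Group byte 0 of every element, then byte 1, and so on.
    Slowly-varying high bytes end up adjacent, which gives a
    downstream compressor long runs to exploit."""
    n = len(data) // element_size
    return bytes(data[e * element_size + b]
                 for b in range(element_size)
                 for e in range(n))

def unshuffle(data: bytes, element_size: int) -> bytes:
    """Inverse transform: scatter the byte planes back per element."""
    n = len(data) // element_size
    return bytes(data[b * n + e]
                 for e in range(n)
                 for b in range(element_size))

# Three little-endian u16 values that differ only in the low byte
payload = b"".join(v.to_bytes(2, "little") for v in (1000, 1001, 1002))
assert unshuffle(shuffle(payload, 2), 2) == payload  # lossless round trip
```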
Variables that contain NaN or ±Inf (typically from unpacked
_FillValue / missing_value substitution or degenerate arithmetic
upstream) cannot be represented by simple_packing — the algorithm’s
range / scale-factor derivation has no slot for non-finite values.
The importer hard-fails when --encoding simple_packing is
requested on data containing NaN or Inf. The error names the
offending variable and suggests recovery options:
error: simple_packing failed for forecast_temperature: NaN value
encountered at index 42. The variable contains NaN or Inf which
cannot be represented by simple_packing. Pre-process the data or
choose a different encoding (e.g. encoding="none").
Recovery options, in order of effort:
- Drop the --encoding simple_packing flag AND pass --allow-nan. The default pipeline (encoding="none") combined with the NaN bitmask companion frame round-trips NaN values losslessly. See NaN / Inf Handling.
- Substitute non-finite values with an in-band sentinel before conversion if you need simple_packing throughout.
- Split the conversion with --split-by variable and re-run per-variable, using --encoding simple_packing only for the variables you know are NaN-free.
Prior behaviour (pre-0.17). The importer used to soft-downgrade NaN-bearing variables to encoding="none" with a stderr warning. That silently hid data-quality problems from automated pipelines; 0.17 surfaces them as hard errors and pairs the fix with the --allow-nan bitmask opt-in (preferred over pre-processing). The non-f64-payload branch (a structural mismatch rather than a data-quality problem) keeps its stderr-warning + fallback behaviour unchanged.
# Pack temperature to 24-bit + zstd
tensogram convert-netcdf --encoding simple_packing --bits 24 \
--compression zstd --compression-level 3 \
era5_t2m.nc -o era5_t2m.tgm
# Shuffle + szip on a multi-variable file
tensogram convert-netcdf --filter shuffle --compression szip \
forecast.nc -o forecast.tgm
CF metadata mapping
NetCDF attributes are always extracted into a netcdf sub-map under each
base entry:
base[0]:
  name: "temperature"
  netcdf:
    units: "K"
    long_name: "Air Temperature"
    standard_name: "air_temperature"
    _FillValue: -32768
    add_offset: 273.15
    scale_factor: 0.01
    _global:
      Conventions: "CF-1.10"
      title: "..."
      institution: "..."
When --cf is set, an additional cf sub-map is added containing only the
16 CF allow-list attributes. This duplicate
copy makes CF-aware tooling cheaper because it can ignore the verbose
netcdf map and rely on a stable, standardised key set.
Limitations
- No NetCDF writer. Conversion is one-way only.
- No string or char variables. They are skipped with a warning.
- No NetCDF-4 enhanced types (compound, vlen, enum, opaque).
- Root group only. Sub-groups are skipped with a warning.
- No tensogram-python bindings. The Python ecosystem talks to convert-netcdf through subprocess. The library API is Rust-only in v1.
- simple_packing is f64-only. Mixed-dtype files convert cleanly but only f64 variables get packed.
Library API
If you’d rather call the importer directly from Rust:
#![allow(unused)]
fn main() -> Result<(), Box<dyn std::error::Error>> {
    use std::path::Path;
    use tensogram_netcdf::{convert_netcdf_file, ConvertOptions, DataPipeline, SplitBy};

    let options = ConvertOptions {
        split_by: SplitBy::Variable,
        cf: true,
        pipeline: DataPipeline {
            encoding: "simple_packing".to_string(),
            bits: Some(24),
            compression: "zstd".to_string(),
            compression_level: Some(3),
            ..Default::default()
        },
        ..Default::default()
    };
    let messages = convert_netcdf_file(Path::new("forecast.nc"), &options)?;
    // messages: Vec<Vec<u8>> — each element is a complete wire-format message
    Ok(())
}
Note: DataPipeline is defined in tensogram::pipeline and
re-exported from both tensogram_netcdf and tensogram_grib. The
underlying apply_pipeline helper is the same for both importers,
guaranteeing that convert-grib and convert-netcdf produce
byte-identical descriptor fields for equivalent flag combinations.
See also
- GRIB Import — sister importer with the same pipeline-flag semantics.
- Simple Packing, Shuffle, Compression — the encoding stages applied to each object.
- CF Metadata Mapping — full table of the 16 attributes lifted by --cf.
NetCDF CF Metadata Mapping
When tensogram convert-netcdf --cf is set, the importer walks each
NetCDF variable and lifts a fixed set of 16 CF Conventions
v1.10
attributes into a cf sub-map under the corresponding base[i] entry. The
attributes are also still present in the verbose netcdf map alongside
every other variable attribute — the cf map is a curated, schema-stable
view that CF-aware tooling can rely on.
The allow-list lives in rust/tensogram-netcdf/src/metadata.rs
as the constant CF_ATTRIBUTES. If you change the list, update this page
to match.
Attributes lifted by --cf
| CF Attribute | Tensogram Key | Notes |
|---|---|---|
| standard_name | base[i]["cf"]["standard_name"] | CF standard name from the CF Standard Name Table, e.g. "air_temperature", "eastward_wind". |
| long_name | base[i]["cf"]["long_name"] | Free-form descriptive label, e.g. "2 metre temperature". |
| units | base[i]["cf"]["units"] | UDUNITS-compliant string, e.g. "K", "m s-1", "days since 1970-01-01". |
| calendar | base[i]["cf"]["calendar"] | Calendar for time coordinate variables, e.g. "gregorian", "noleap", "360_day". |
| cell_methods | base[i]["cf"]["cell_methods"] | Aggregation description, e.g. "time: mean", "area: sum". |
| coordinates | base[i]["cf"]["coordinates"] | Space-separated list of auxiliary coordinate variable names, e.g. "lon lat". |
| axis | base[i]["cf"]["axis"] | Dimension role flag: "X", "Y", "Z", or "T". |
| positive | base[i]["cf"]["positive"] | Direction of vertical coordinate: "up" (altitude) or "down" (depth/pressure). |
| valid_min | base[i]["cf"]["valid_min"] | Minimum valid value for QA/range checks. |
| valid_max | base[i]["cf"]["valid_max"] | Maximum valid value for QA/range checks. |
| valid_range | base[i]["cf"]["valid_range"] | Two-element array [min, max] — alternative to valid_min/valid_max. |
| bounds | base[i]["cf"]["bounds"] | Name of an associated cell-bounds variable (irregular grids). |
| grid_mapping | base[i]["cf"]["grid_mapping"] | Name of an associated coordinate reference system variable. |
| ancillary_variables | base[i]["cf"]["ancillary_variables"] | Space-separated list of related ancillary variable names (uncertainty, QA flags, etc.). |
| flag_values | base[i]["cf"]["flag_values"] | Array of integer flag values for categorical variables. |
| flag_meanings | base[i]["cf"]["flag_meanings"] | Space-separated list of meanings, paired with flag_values. |
That’s 16 attributes — the full CF allow-list as of v0.7.0.
Storage layout
For a CF-compliant temperature variable, the --cf flag produces:
base[0]:
  name: "temperature"
  netcdf:
    units: "K"
    long_name: "2 metre temperature"
    standard_name: "air_temperature"
    _FillValue: -32768
    add_offset: 273.15
    scale_factor: 0.01
    cell_methods: "time: mean"
    _global:
      Conventions: "CF-1.10"
      title: "ERA5 reanalysis"
  cf:
    units: "K"
    long_name: "2 metre temperature"
    standard_name: "air_temperature"
    cell_methods: "time: mean"
The netcdf map is a verbatim dump of every variable attribute (the
_global sub-map carries the file-level attributes). The cf map is a
filtered slice containing only the allow-listed keys, in the order they
appear on the variable.
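Conceptually, the cf map is nothing more than an allow-list filter over the verbose attribute map. The sketch below reimplements that filter in Python; the attribute names are the 16 from the table above, but the CF_ATTRIBUTES constant itself lives in rust/tensogram-netcdf/src/metadata.rs and this helper is illustrative:

```python
# The 16 allow-listed CF attributes (mirrors the table above).
CF_ATTRIBUTES = frozenset({
    "standard_name", "long_name", "units", "calendar", "cell_methods",
    "coordinates", "axis", "positive", "valid_min", "valid_max",
    "valid_range", "bounds", "grid_mapping", "ancillary_variables",
    "flag_values", "flag_meanings",
})

def cf_view(netcdf_attrs: dict) -> dict:
    """Filtered slice of the verbose attribute map, keeping only the
    allow-listed CF keys in the order they appear on the variable."""
    return {k: v for k, v in netcdf_attrs.items() if k in CF_ATTRIBUTES}

attrs = {"units": "K", "standard_name": "air_temperature",
         "_FillValue": -32768, "cell_methods": "time: mean"}
print(cf_view(attrs))
# {'units': 'K', 'standard_name': 'air_temperature', 'cell_methods': 'time: mean'}
```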
What is not extracted
The allow-list is intentionally narrow. The following CF concepts are out
of scope for v0.7.0 — they are accessible via the verbose netcdf map but
not surfaced under cf:
- Grid mapping variable contents — only the grid_mapping reference is lifted, not the projection parameters of the referenced variable.
- Coordinate variable contents — coordinate variables are converted to their own data objects, not inlined into other variables' metadata.
- Bounds variable contents — only the bounds reference is lifted.
- Cell measures — cell_measures is not in the allow-list.
- Climatology bounds — climatology is not lifted.
- Geometry containers — CF 1.8+ geometries are out of scope.
- Labels and string-valued auxiliary coordinates — not in the allow-list.
- Compound coordinates / compress — ragged-array support is out of scope.
If you need these, read the raw NetCDF metadata from base[i]["netcdf"]
instead — every original attribute is preserved there, byte-for-byte.
Why a curated allow-list?
Two reasons:
- Schema stability. Downstream tooling (xarray engines, dashboards, indexers) wants to rely on a small, fixed key set without having to inspect every NetCDF file's variable-attribute zoo. The cf map gives them that contract.
- Interop friendliness. The 16 allow-listed attributes are the ones that show up in essentially every CF-compliant climate or weather dataset. They are the lingua franca that makes CF data interoperable.
If you have a strong case for adding an attribute, file an issue on the GitHub project and we’ll evaluate it.
Related
- CF Conventions §3 — variable attributes.
- CF Conventions §8 — packed data, scale_factor / add_offset.
- CF Standard Name Table — the controlled vocabulary referenced by standard_name.
- NetCDF Import — main user guide for tensogram convert-netcdf.
Error Handling
Tensogram uses typed errors across all language bindings. Every fallible
operation returns a Result (Rust), raises an exception (Python / C++ /
TypeScript), or returns an error code (C). No library code panics.
Error Categories
| Category | Trigger | Rust | Python | C++ | TypeScript | C Code |
|---|---|---|---|---|---|---|
| Framing | Invalid magic bytes, truncated message, bad terminator | TensogramError::Framing | ValueError | framing_error | FramingError | TGM_ERROR_FRAMING (1) |
| Metadata | CBOR parse failure, missing required field, schema violation | TensogramError::Metadata | ValueError | metadata_error | MetadataError | TGM_ERROR_METADATA (2) |
| Encoding | Encoding pipeline failure (e.g. NaN in simple_packing) | TensogramError::Encoding | ValueError | encoding_error | EncodingError | TGM_ERROR_ENCODING (3) |
| Compression | Decompression failure, unknown codec | TensogramError::Compression | ValueError | compression_error | CompressionError | TGM_ERROR_COMPRESSION (4) |
| Object | Invalid descriptor, object index out of range | TensogramError::Object | ValueError | object_error | ObjectError | TGM_ERROR_OBJECT (5) |
| I/O | File not found, permission denied, disk full | TensogramError::Io | OSError | io_error | IoError | TGM_ERROR_IO (6) |
| Hash Mismatch | Payload integrity check fails on verify_hash=True | TensogramError::HashMismatch | RuntimeError | hash_mismatch_error | HashMismatchError | TGM_ERROR_HASH_MISMATCH (7) |
| Invalid Arg | NULL pointer or invalid argument at the API boundary | — | ValueError | invalid_arg_error | InvalidArgumentError | TGM_ERROR_INVALID_ARG (8) |
| Remote | S3 / GCS / Azure / HTTP(S) object-store failure | TensogramError::Remote | OSError | remote_error | RemoteError | TGM_ERROR_REMOTE (10) |
| Streaming Limit | decodeStream internal buffer exceeded the configured maximum | — | — | — | StreamingLimitError | — |
Notes on the TypeScript column:
- All TypeScript errors extend the abstract TensogramError base class, so a single catch (err) { if (err instanceof TensogramError) … } handles every library-raised error.
- HashMismatchError in TypeScript additionally carries parsed expected and actual hex digests when the underlying Rust message is in the canonical "hash mismatch: expected X, got Y" form.
- StreamingLimitError is TS-specific and is raised only from decodeStream when the internal buffer would grow past maxBufferBytes (default 256 MiB).
Error Paths by Operation
Encoding
Input data + metadata dict
│
├─ Missing 'version' ──────────► Metadata error
├─ Missing 'type'/'shape'/'dtype' ► Metadata error
├─ Unknown dtype string ────────► Metadata error
├─ Unknown byte_order ──────────► Metadata error
├─ Data size ≠ shape × dtype ───► Metadata error
├─ Shape product overflow ──────► Metadata error
├─ NaN in simple_packing ───────► Encoding error
├─ Inf reference_value ─────────► Metadata error
├─ Client wrote _reserved_ ─────► Metadata error (message or base[i])
├─ base.len() > descriptors ────► Metadata error (extra entries would be lost)
├─ emit_preceders in buffered ──► Encoding error (use StreamingEncoder)
├─ Param out of range (i32/u32) ► Metadata error (zstd_level, szip_rsi, etc.)
├─ Unknown compression codec ───► Encoding error
├─ Compression codec failure ───► Compression error
└─ File I/O failure ────────────► I/O error
Decoding
Raw bytes
│
├─ No magic bytes / truncated ──► Framing error
├─ Bad frame type codes ────────► Framing error
├─ Frame total_length overflow ─► Framing error
├─ Frame ordering violation ────► Framing error (header→data→footer)
├─ cbor_offset out of range ────► Framing error
├─ CBOR parse failure ──────────► Metadata error
├─ Preceder base ≠ 1 entry ─────► Metadata error
├─ Dangling preceder (no obj) ──► Framing error
├─ Consecutive preceders ────────► Framing error
├─ base.len() > object count ───► Metadata error
├─ Object index out of range ───► Object error
├─ Shape product overflow ──────► Metadata error
├─ Decompression failure ───────► Compression error
├─ Decoding pipeline failure ───► Encoding error
└─ Hash verification mismatch ──► HashMismatch error
File Operations
TensogramFile.open(path)
│
├─ File not found ──────────────► I/O error
├─ Permission denied ───────────► I/O error
└─ Invalid file content ────────► Framing error
TensogramFile.decode_message(index)
│
├─ Index out of range ──────────► Object error / IndexError
└─ Corrupt message at offset ───► Framing error
Streaming Encoder
StreamingEncoder
│
├─ write_preceder(_reserved_) ──► Metadata error
├─ write_preceder twice ─────────► Framing error (no intervening write_object)
├─ finish() with pending prec ──► Framing error (dangling preceder)
├─ write_object invalid shape ──► Metadata error
├─ Encoding pipeline failure ───► Encoding error
├─ Variable-length hash algo ───► Framing error (see below)
└─ I/O write failure ───────────► I/O error
The streaming path writes the frame header before the payload has been
hashed, so it needs to know the final CBOR descriptor length up front.
This works only when the configured HashAlgorithm produces a digest
whose hex representation has a fixed length — currently only Xxh3
(always 16 hex chars). If a future hash algorithm with variable-length
output is used, StreamingEncoder::write_object returns
TensogramError::Framing before writing any bytes, so the caller’s
sink is never corrupted. Use the buffered encode() API for such
algorithms.
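The fixed-digest-length constraint can be made concrete with a toy frame layout. This sketch is not the Tensogram wire format — it just shows why a header written before the payload is hashed forces the digest field to have a known width up front (a 64-bit hash always formats to 16 hex characters, as with Xxh3):

```python
import struct

def encode_streaming(payload: bytes, digest_hex_len: int = 16) -> bytes:
    """Toy length-prefixed frame: the total length is committed *before*
    the digest is computed, so the digest slot must be fixed-width."""
    total = 4 + digest_hex_len + len(payload)
    header = struct.pack("<I", total)  # written first; hash still unknown
    # A 64-bit value zero-padded to 16 hex chars is always 16 bytes of ASCII.
    digest = format(hash(payload) & 0xFFFFFFFFFFFFFFFF, "016x").encode()
    assert len(digest) == digest_hex_len
    return header + digest + payload

frame = encode_streaming(b"tensor-bytes")
# The pre-committed length matches the bytes actually emitted
assert struct.unpack("<I", frame[:4])[0] == len(frame)
```

A variable-length digest would make `total` unknowable at header-write time, which is exactly the case StreamingEncoder rejects with a Framing error before touching the sink.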
CLI Operations
set command
│
├─ Immutable key (shape, dtype) ► Error (cannot modify structural key)
├─ _reserved_ namespace ────────► Error (library-managed)
└─ Invalid object index ────────► Error (out of range)
merge command
│
├─ No input files ──────────────► Error
├─ Invalid strategy name ───────► Error
└─ Conflicting keys (error mode) ► Error (use first/last to resolve)
split command
│
└─ Single-object: pass through; multi-object: split per-object base metadata
Importer Operations (convert-grib / convert-netcdf)
Both importer crates (tensogram-grib, tensogram-netcdf) use typed
error enums and never panic on invalid or exotic input. Anything the
importer can’t represent cleanly is either surfaced as a typed error
or skipped with a warning: … line on stderr so the operator can see
what was dropped.
tensogram-netcdf errors (rust/tensogram-netcdf/src/error.rs)
│
├─ NetcdfError::Netcdf(netcdf::Error)
│ Low-level failure from libnetcdf — file missing, permission
│ denied, format error, truncated file, HDF5 error.
│
├─ NetcdfError::NoVariables
│ Input file has zero supported numeric variables after skipping
│ char/string/compound/vlen. Empty files also hit this.
│
├─ NetcdfError::NoUnlimitedDimension { file }
│ --split-by=record requested but the file has no unlimited
│ dimension. Contains the file path for diagnostics.
│
├─ NetcdfError::UnsupportedType { name, reason }
│ Variable has a type we can't represent (e.g. compound,
│ enum, opaque, vlen). Currently only the char / string
│ variants hit this path — the other complex types are
│ downgraded to a stderr warning and skipped because they
│ frequently coexist with valid numeric variables.
│
├─ NetcdfError::InvalidData(String)
│ Catch-all for:
│ - low-level read errors on a specific variable
│ - unknown --encoding / --filter / --compression names
│ - simple_packing compute_params failures on edge-case data
│ - extract_variable_record invariant violations (should be
│ unreachable; if it fires the importer is buggy)
│
├─ NetcdfError::Encode(String)
│ tensogram rejected the pipeline. Common cause:
│ szip on raw f64 (bits_per_sample=64 exceeds libaec's
│ 32-bit cap). Fix: add --filter shuffle or --encoding
│ simple_packing first.
│
└─ NetcdfError::Io(std::io::Error)
Reserved for future use — the current importer reads
through libnetcdf and writes through the CLI wrapper, so
stdlib I/O errors don't currently reach this variant.
Soft warnings (stderr, exit 0):
warning: {file}: sub-groups found; only root-group variables are converted
warning: skipping variable '{name}': Char variables are not supported
warning: skipping variable '{name}': complex type Compound(_) is not supported
warning: skipping simple_packing for variable '{name}' (not a float64 payload)
warning: variable '{name}': failed to read attribute '{attr}': {cause}
warning: failed to read global attribute '{name}': {cause}
Note: NaN/Inf in a variable that targets simple_packing now
hard-fails the conversion (see
NetCDF Importer — simple_packing on Mixed-dtype Files
below). The previous “warning: skipping simple_packing … NaN value
encountered” line no longer fires; that case is an error rather than
a warning.
The last two lines above are rare — they only fire on corrupt attribute values or unsupported upstream AttributeValue variants — but they surface instead of dropping data silently so operators can trace unexpected missing metadata.
tensogram-grib errors (rust/tensogram-grib/src/error.rs)
│
├─ GribError::Eccodes(String) — ecCodes C library error
├─ GribError::NoMessages — empty GRIB file
├─ GribError::MissingKey — required ecCodes/MARS namespace key absent
├─ GribError::InvalidShape — grid dimension mismatch
└─ GribError::Encode — tensogram encode failure
Language-Specific Patterns
Rust
#![allow(unused)]
fn main() {
    use tensogram::{decode, DecodeOptions, TensogramError};

    match decode(&buffer, &DecodeOptions::default()) {
        Ok((meta, objects)) => { /* use data */ }
        Err(TensogramError::Framing(msg)) => eprintln!("bad format: {msg}"),
        Err(TensogramError::HashMismatch { expected, actual }) =>
            eprintln!("integrity: {expected} ≠ {actual}"),
        Err(e) => eprintln!("error: {e}"),
    }
}
Python
import tensogram

# Decode errors
try:
    msg = tensogram.decode(buf, verify_hash=True)
except ValueError as e:
    # Framing, Metadata, Encoding, Compression, Object errors
    print(f"decode failed: {e}")
except RuntimeError as e:
    # Hash verification mismatch
    print(f"integrity error: {e}")
except OSError as e:
    # File I/O and Remote (S3/GCS/Azure/HTTP) errors
    print(f"I/O error: {e}")

# File errors
try:
    f = tensogram.TensogramFile.open("missing.tgm")
except OSError:
    print("file not found")

# Index errors
with tensogram.TensogramFile.open("data.tgm") as f:
    try:
        msg = f[999]
    except IndexError:
        print("message index out of range")

# Packing errors
try:
    tensogram.compute_packing_params(nan_array, 16, 0)
except ValueError as e:
    print(f"NaN rejected: {e}")
C++
#include <tensogram.hpp>
try {
auto msg = tensogram::decode(buf, len);
} catch (const tensogram::framing_error& e) {
// Invalid message structure
std::cerr << "framing: " << e.what() << " (code " << e.code() << ")\n";
} catch (const tensogram::hash_mismatch_error& e) {
// Payload integrity failure
std::cerr << "hash: " << e.what() << "\n";
} catch (const tensogram::error& e) {
// Any Tensogram error (base class)
std::cerr << "error: " << e.what() << "\n";
}
C
#include "tensogram.h"
tgm_message* msg = tgm_decode(buf, len, 0);
if (!msg) {
tgm_error code = tgm_last_error_code();
const char* message = tgm_last_error();
fprintf(stderr, "%s (%d): %s\n",
tgm_error_string(code), code, message);
}
Note: tgm_last_error() returns a thread-local string valid until the next FFI call on the same thread. Copy it if you need to keep it.
TypeScript
Every error thrown by @ecmwf/tensogram is an instance of the abstract
TensogramError base class. The concrete subclasses match the Rust
variants one-to-one, plus a TS-specific InvalidArgumentError and
StreamingLimitError.
import {
decode,
TensogramError,
FramingError,
HashMismatchError,
ObjectError,
StreamingLimitError,
} from '@ecmwf/tensogram';
try {
const { metadata, objects } = decode(buf, { verifyHash: true });
// ...
} catch (err) {
if (err instanceof HashMismatchError) {
// Structured fields are parsed from the Rust-side message.
console.error('integrity failure:', err.expected, err.actual);
} else if (err instanceof FramingError) {
console.error('bad wire format:', err.message);
} else if (err instanceof ObjectError) {
console.error('object index error:', err.message);
} else if (err instanceof TensogramError) {
console.error('tensogram error:', err.name, err.message);
} else {
throw err;
}
}
All concrete classes expose:
- err.rawMessage — the untruncated string from the WASM / Rust side, including any error-variant prefix ("framing error: ...").
- err.message — the human-readable message with the prefix stripped.
- err.name — stable string name ("FramingError", etc.).
HashMismatchError additionally exposes parsed expected and actual
hex digests when the underlying message follows the canonical
"hash mismatch: expected X, got Y" form.
Streaming decode does not throw on a single corrupt message — the
iterator skips and continues. Register an onError callback to observe
the skips:
import { decodeStream, StreamingLimitError } from '@ecmwf/tensogram';
try {
for await (const frame of decodeStream(res.body!, {
maxBufferBytes: 64 * 1024 * 1024,
onError: ({ message, skippedCount }) => {
console.warn(`skipped corrupt message (#${skippedCount}): ${message}`);
},
})) {
render(frame.descriptor.shape, frame.data());
frame.close();
}
} catch (err) {
if (err instanceof StreamingLimitError) {
// Stream exceeded maxBufferBytes; configure a larger limit or split.
} else {
throw err;
}
}
Note: decodeStream does throw for infrastructure-level failures (buffer limit exceeded, AbortSignal fired, non-ReadableStream input). Only per-message corruption is routed through onError.
Common Error Scenarios
Garbage or Truncated Input
Any non-Tensogram bytes passed to decode() produce a Framing error.
The decoder looks for the 8-byte magic TENSOGRM and a matching terminator.
Hash Mismatch After Corruption
v3 note. Frame-level integrity moved from the decoder to the
validator. verify_hash=True (Python DecodeOptions) or
TGM_DECODE_VERIFY_HASH (C) is retained for source compatibility
but is a no-op on the decode path in v3.
To detect corruption in a v3 message, run the message through
tensogram validate --checksum (CLI), validate_message (Rust),
tgm_validate (C), or the equivalent Python / TypeScript helpers.
The validator:
- Walks every frame and recomputes the xxh3-64 of its body (payload + masks + CBOR; cbor_offset, the hash slot, and ENDF are excluded — see plans/WIRE_FORMAT.md §2.4).
- Compares the recomputed digest to the inline hash slot at frame_end − 12. A mismatch emits a HashMismatch validation issue carrying the expected and actual hex values plus the frame offset.
- When both a HeaderHash and a FooterHash aggregate frame are present, cross-checks them against each other and against the inline slots. Disagreement also surfaces as a HashMismatch.
- An UnknownHashAlgorithm warning fires when the aggregate HashFrame.algorithm is not "xxh3" — the inline slots are still verified (they’re authoritative); only the aggregate’s algorithm identifier is advisory.
Messages encoded with hash_algorithm=None clear the
HASHES_PRESENT preamble flag and leave every inline slot at
0x00…00. On such messages, validate --checksum emits
NoHashAvailable at warning level and cannot detect corruption
beyond structural errors — re-encode with hash_algorithm = Some(Xxh3) to enable integrity checking.
Object Index Out of Range
Accessing decode_object(buf, index=N) where N ≥ number of objects
produces an Object error (Rust/C/C++) or ValueError (Python).
File indexing file[N] raises IndexError for out-of-range N.
NaN / Inf in Simple Packing
compute_packing_params() rejects both NaN and ±Inf values
with a ValueError that includes the index of the first offending
sample. simple_packing’s scale-factor derivation has no meaningful
value for non-finite input — rejecting them up front prevents the
silent corruption path where an i32::MAX-saturated
binary_scale_factor decodes to NaN everywhere.
0.17+ extends this contract to every pipeline: encoding="none"
(and every compressor) rejects NaN / ±Inf input by default. The
NaN / Inf Handling guide covers the
allow_nan / allow_inf opt-in that substitutes non-finite values
with 0.0 and records their positions in a bitmask companion
section.
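The up-front rejection can be sketched in plain Python. This is a simplified model of the finite check, not the library's actual compute_packing_params implementation; the error-message wording here is illustrative.

```python
import math

def check_finite(values):
    """Reject NaN and +/-Inf before deriving packing parameters.

    Sketch of the up-front finite check; like the real library,
    it reports the index of the first offending sample.
    """
    for i, v in enumerate(values):
        if not math.isfinite(v):
            kind = "NaN" if math.isnan(v) else "Inf"
            raise ValueError(f"non-finite value ({kind}) at index {i}")

check_finite([1.0, 2.0, 3.0])              # fine, returns None
try:
    check_finite([1.0, float("nan")])
except ValueError as e:
    print(e)                               # non-finite value (NaN) at index 1
```

Rejecting early keeps the quantization step from ever seeing an infinite range.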
File Not Found / Permission Denied
TensogramFile.open() raises OSError (Python), io_error (C++),
or returns TGM_ERROR_IO (C) for any file system failure.
NetCDF Importer — --split-by=record on Files Without Unlimited Dim
tensogram convert-netcdf --split-by record foo.nc where foo.nc has
no unlimited dimension hard-errors with
NetcdfError::NoUnlimitedDimension { file } (exit code 1). The error
message includes the path so the caller can identify which file in a
multi-input batch triggered it.
NetCDF Importer — simple_packing on Mixed-dtype Files
--encoding simple_packing is f64-only by design. Mixed files (a
typical CF temperature file has f32 lat/lon coordinates alongside
f64 data) are handled gracefully: non-f64 variables emit a stderr
warning and pass through with encoding="none", and the conversion
overall succeeds.
NaN or Inf in a targeted f64 variable is now a hard error (0.17+).
The importer fails with
NetcdfError::InvalidData("simple_packing failed for {var}: ...")
and a recovery hint, rather than silently downgrading the variable
to encoding="none". Pre-0.17 soft-downgrade hid data-quality
problems; the new behaviour surfaces them at conversion time.
Callers relying on the old fallback should either pick a
non-simple_packing encoding up front, opt into the NaN / Inf
bitmask companion via --allow-nan / --allow-inf (see
NaN / Inf Handling), pre-process NaN / Inf
out of the data, or use --split-by variable and choose
per-variable encodings.
NetCDF Importer — Unknown Codec Name
--encoding foo, --filter bar, --compression baz all hard-error
with NetcdfError::InvalidData listing the expected values. The
pre-validation fires inside apply_pipeline so the error surfaces
immediately, before any data is read from disk.
NetCDF Importer — szip on Raw f64
libaec szip caps at 32 bits per sample, but raw f64 gives
bits_per_sample = 64, so --compression szip on unencoded f64
produces a low-level aec_encode_init failed error from
tensogram wrapped in NetcdfError::Encode. Fix:
- Combine with --encoding simple_packing --bits N (N ≤ 32), or
- Combine with --filter shuffle (which makes the element size 8 bits).
Unknown Hash Algorithm (Forward Compatibility)
When the decoder encounters a hash algorithm string it doesn’t recognize
(e.g. a future "sha256" hash), it logs a warning via tracing::warn!
and skips verification rather than failing. This ensures forward
compatibility: older decoders can still read messages produced by newer
encoders that use new hash algorithms.
No-Panic Guarantee
All Rust library code in tensogram, tensogram-encodings, and
tensogram-ffi is free from panic!(), unwrap(), expect(), todo!(),
and unimplemented!() in non-test code paths. The library guarantees:
- All fallible operations return Result<T, TensogramError>.
- Integer arithmetic uses checked operations (checked_mul, try_from) to prevent overflow and truncation.
- u64 → usize conversions use usize::try_from() to prevent truncation on 32-bit platforms.
- Array indexing is guarded by prior bounds checks.
- FFI boundary code returns error codes instead of panicking, and uses unwrap_or_default() only for CString::new() (interior null fallback).
- The scan functions (scan, scan_file) tolerate truncation of total_length as usize because the subsequent bounds check catches it.
- The hash-while-encoding pipeline (PipelineConfig.compute_hash = true plus the streaming encoder’s inline-hash path) verifies its CBOR-length invariant before writing any bytes and surfaces a TensogramError::Framing if a variable-length hash algorithm is ever configured — the caller’s sink is never left in a partial-write state on that specific failure mode. Internal debug assertions guard against non-deterministic CBOR serialisation during development.
Edge Cases
A collection of non-obvious situations and how the library handles them.
Corrupted Messages
What happens: The scanner (scan()) searches for TENSOGRM magic bytes and validates the postamble (last 8 bytes should be 39277777). If total_length is set, the scanner checks for the end magic at the expected position.
Recovery: If a message fails validation, the scanner skips one byte and resumes searching. A single corrupted message in a multi-message file does not prevent reading the others.
#![allow(unused)]
fn main() {
let offsets = scan(&file_bytes);
// offsets only contains valid (start, length) pairs
// Corrupted regions are silently skipped
}
Edge case within edge case: If a random byte sequence inside a valid payload happens to match TENSOGRM, the scanner might try to parse a “message” starting mid-payload. The postamble cross-check catches this: the false start’s postamble won’t contain the expected 39277777 end magic.
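The skip-one-byte recovery loop can be modeled with a toy layout. The magic and end-magic strings are taken from the text above, but the 4-byte big-endian total_length field and the overall layout are invented for this sketch; the real wire format differs.

```python
MAGIC = b"TENSOGRM"
END = b"39277777"

def scan(buf: bytes):
    """Toy scanner returning (offset, length) pairs for valid messages.

    Assumed toy layout: MAGIC + 4-byte big-endian total_length +
    payload + END, where total_length counts the whole message.
    On any validation failure, skip one byte and resume searching.
    """
    out, pos = [], 0
    while True:
        start = buf.find(MAGIC, pos)
        if start < 0:
            return out
        hdr = buf[start + 8:start + 12]
        if len(hdr) == 4:
            total = int.from_bytes(hdr, "big")
            end = start + total
            # cross-check: the end magic must sit exactly where
            # total_length says the message ends
            if total >= 20 and end <= len(buf) and buf[end - 8:end] == END:
                out.append((start, total))
                pos = end          # valid message: jump past it
                continue
        pos = start + 1            # false start or corruption: skip one byte
```

A false TENSOGRM match inside a payload fails the end-magic cross-check, so the scanner resumes one byte later, exactly the recovery behaviour described above.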
NaN in Simple Packing
Simple packing cannot represent NaN. The quantization formula maps the range [min, max] onto integers, and NaN has no defined place in this range.
What happens: compute_params() returns PackingError::NanValue(index) if any value is NaN. The encode() function also rejects NaN inputs before packing.
Solution: Replace NaN values with a sentinel (e.g. the minimum representable value, or a separate bitmask object) before encoding.
Inf in Simple Packing — Silent Corruption
Subtle gotcha — simple_packing’s compute_params scans for NaN but not for Inf. Passing [1.0, +Inf, 3.0]:
- range = max - min = +Inf, which produces binary_scale_factor = i32::MAX (saturating cast from Inf as i32).
- Encoding yields all-zero packed integers.
- Decoding reconstructs NaN at every position (because Inf × 0 = NaN in IEEE 754).
Net effect: every decoded value silently becomes NaN.
Mitigation: turn on strict-finite encoding (see docs). It catches Inf upstream of the simple_packing encoder and fails with a clean EncodingError before the corruption path runs.
Also: extract_simple_packing_params catches a non-finite reference_value in the descriptor, so callers going through the high-level encode() API are protected when the computed reference happens to be ±Inf (e.g. data like [1.0, -Inf]). But for data like [1.0, +Inf, 3.0] the reference is 1.0 (finite) and only binary_scale_factor overflows — that’s not caught without the strict flag.
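The corruption path is plain IEEE 754 arithmetic and can be reproduced without the library:

```python
import math

mn, mx = 1.0, float("inf")     # data like [1.0, +Inf, 3.0]
rng = mx - mn                  # range = +Inf
print(rng)                     # inf

# In Rust, a saturating float-to-int cast of Inf yields i32::MAX,
# and the resulting 2**binary_scale_factor step overflows to +Inf.
step = float("inf")

packed = 0                     # every packed integer is zero
decoded = mn + packed * step   # 0 x Inf is NaN in IEEE 754
print(math.isnan(decoded))     # True
```

Every decoded sample is NaN, with no error raised anywhere, which is why the strict-finite flag exists.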
Decode Range on Compressed Data
decode_range() supports partial range decode for compressors that have random access capability: szip (via RSI block offsets), blosc2 (via chunk-based access), and zfp fixed-rate mode. Stream compressors (zstd, lz4, sz3) return CompressionError::RangeNotSupported.
Workaround for stream compressors: Decode the full object with decode_object() and slice the result in memory.
Bitmask Byte Width
Dtype::Bitmask returns 0 from byte_width(). This is a sentinel, not a real byte width.
Why: A bitmask of N elements occupies ceil(N / 8) bytes. The library cannot infer N from the byte width alone, so the “element size” concept doesn’t apply. Callers that need the payload size must compute it from the element count.
#![allow(unused)]
fn main() {
let num_elements: u64 = descriptor.shape.iter().product();
let payload_bytes = if descriptor.dtype == Dtype::Bitmask {
let n = usize::try_from(num_elements)?;
(n + 7) / 8
} else {
let n = usize::try_from(num_elements)?;
n * descriptor.dtype.byte_width()
};
}
verify_hash on Messages Without Hashes
If a message was encoded with hash_algorithm: None (no hash), and you decode it with verify_hash: true, the decoder silently skips hash verification for that object. No error is returned.
Rationale: The absence of a hash is not an error. The decoder cannot verify what was never stored. If you need to enforce that all messages have hashes, check descriptor.hash.is_some() after decoding.
Constant-Value Fields with simple_packing
If all values in a field are identical (range = 0), compute_params() sets binary_scale_factor such that all packed integers are 0, and the full value is recovered from reference_value alone. This is correct and handled without special cases.
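A toy model of the round trip shows why range = 0 needs no special case. This is a simplified quantizer, not the library's exact formula (real simple_packing also involves binary and decimal scale factors):

```python
def quantize(values, bits=16):
    """Toy linear quantizer: map [min, max] onto integers."""
    ref = min(values)
    rng = max(values) - ref
    # range 0: any step works, since every (v - ref) is 0
    step = 1.0 if rng == 0.0 else rng / (2**bits - 1)
    packed = [round((v - ref) / step) for v in values]
    return ref, step, packed

def dequantize(ref, step, packed):
    return [ref + p * step for p in packed]

ref, step, packed = quantize([273.15] * 4)   # constant field
assert packed == [0, 0, 0, 0]                # all packed integers are 0
assert dequantize(ref, step, packed) == [273.15] * 4
```

The constant value is carried entirely by the reference, and the packed integers contribute nothing, so the ordinary code path recovers it exactly.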
Very Short Buffers
Passing a buffer shorter than the preamble size (24 bytes) to any decode function returns TensogramError::Framing("buffer too short ..."). No panic.
Object Index Out of Range
decode_object(&message, 99, &options) when the message has fewer than 100 objects returns TensogramError::Object("object index N out of range").
Empty Files
TensogramFile::message_count() returns 0. read_message(0) returns an error.
CBOR Key Ordering
The library uses canonical CBOR key ordering (RFC 8949 §4.2). If you construct a GlobalMetadata struct with keys in one order and then check the CBOR bytes, the bytes may not match your insertion order. This is intentional and correct — it ensures deterministic output.
If you need to compare metadata across languages or implementations, always compare the decoded values, not the raw CBOR bytes from different encoders.
You can verify that any CBOR output is canonical using the verify_canonical_cbor() utility:
#![allow(unused)]
fn main() {
use tensogram::verify_canonical_cbor;
let cbor_bytes = /* ... */;
verify_canonical_cbor(&cbor_bytes)?; // Returns Ok(()) if canonical, Err if not
}
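For context, RFC 8949 §4.2 orders map keys by the bytewise lexicographic order of their encoded form. Because a text string's length lives in its head byte, shorter keys sort before longer ones, and same-length keys sort bytewise. A minimal sketch for short text keys (this is not the library's encoder, just an illustration of the ordering rule):

```python
def encode_text_key(s: str) -> bytes:
    """Minimal CBOR encoding of a text string (major type 3)."""
    b = s.encode("utf-8")
    n = len(b)
    if n < 24:
        return bytes([0x60 | n]) + b     # length packed into the head byte
    if n < 256:
        return bytes([0x78, n]) + b      # one-byte length argument
    raise ValueError("sketch handles keys under 256 bytes only")

def canonical_order(keys):
    # RFC 8949 section 4.2.1: sort by bytewise order of the encoded key
    return sorted(keys, key=encode_text_key)

print(canonical_order(["version", "base", "_extra_"]))
# ['base', '_extra_', 'version']  (shorter first, then bytewise)
```

This is why the serialized bytes may not match your insertion order, and why deterministic output falls out of the sort.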
Frame Ordering Violations
The decoder validates that frames appear in the expected order: header frames first, then data object frames, then footer frames. A message with frames out of order (e.g. a header metadata frame appearing after a data object frame) is rejected with TensogramError::Framing.
This catches malformed or tampered messages. Valid messages produced by the encoder always have correct ordering.
Streaming Mode (total_length = 0)
When encoding for a non-seekable output (e.g. TCP socket), the preamble’s total_length is set to 0. In this mode:
- Header index and header hash frames are omitted (the encoder doesn’t know the data object count or offsets upfront).
- The footer must contain at least the metadata frame.
- The first_footer_offset in the postamble points to the first footer frame.
Decoders that encounter total_length = 0 should read from the postamble backward to find the footer frames, then use the footer index (if present) for random access to data objects.
first_footer_offset is Never Zero
The postamble’s first_footer_offset field always points to a valid position:
- If footer frames exist: it points to the start of the first footer frame.
- If no footer frames exist: it points to the start of the postamble itself.
This invariant means decoders can always seek to first_footer_offset and determine whether they’ve landed on a footer frame or the postamble.
Inter-Frame Padding
The encoder may insert padding bytes between frames for memory alignment (e.g. 64-bit alignment). Padding appears between the ENDF marker of one frame and the FR marker of the next. Decoders should scan for the FR marker rather than assuming frames are contiguous.
Zero-Element Tensors
Shapes containing zero dimensions are valid: shape: [0], shape: [3, 0, 5]. This matches numpy and PyTorch semantics where zero-element tensors are legitimate objects (e.g. an empty batch). The encoded payload for a zero-element tensor is zero bytes.
Scalar Tensors
shape: [] (empty shape, ndim: 0) represents a scalar tensor containing exactly one element. The payload size equals dtype.byte_width() bytes.
Metadata-Only Messages
A message with zero data objects is valid. This can be used to transmit metadata without any tensor data (e.g. coordination signals, timestamps, provenance records). Both encode() with an empty descriptors slice and StreamingEncoder with no write_object() calls produce valid messages.
Mixed Dtypes in One Message
Multiple data objects in the same message may have different dtypes. For example, a Float32 tensor paired with a Bitmask object used as a missing-data mask. Each object’s pipeline (encoding, filter, compression) is configured independently.
Bitmask with Encoding/Compression
Bitmask data is internally packed into uint8 bytes. Any encoding or compression pipeline that supports uint8 should work with bitmask data. The total bit count must be stored separately (in the shape) since the byte count ceil(N / 8) may not equal N exactly.
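The byte packing can be sketched as follows. The LSB-first bit order within each byte is an assumption of this sketch; the actual wire format may specify a different order.

```python
def pack_bitmask(bits):
    """Pack a sequence of booleans into ceil(N / 8) uint8 bytes."""
    out = bytearray((len(bits) + 7) // 8)
    for i, bit in enumerate(bits):
        if bit:
            out[i // 8] |= 1 << (i % 8)   # assumption: LSB-first per byte
    return bytes(out)

mask = pack_bitmask([True, False, True] * 3)   # 9 bits
assert len(mask) == 2                          # ceil(9 / 8) = 2 bytes
```

The 9-bit example shows why the element count must live in the shape: two bytes could equally hold 10, 12, or 16 bits.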
Strides Validation
Strides are validated for length: strides.len() must match shape.len(). Non-contiguous strides (e.g. shape: [4, 4], strides: [8, 1]) are accepted — they indicate a view into a larger array and are semantically valid.
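For reference, row-major contiguous strides (counted in elements, matching the [20, 1] example for shape [10, 20] used elsewhere in this document) can be computed as:

```python
def contiguous_strides(shape):
    """Row-major strides in elements: the last axis varies fastest."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides

assert contiguous_strides([10, 20]) == [20, 1]
assert contiguous_strides([4, 4]) == [4, 1]   # contiguous, unlike [8, 1]
assert contiguous_strides([]) == []           # scalar: empty strides
```

Comparing against this reference is one way to tell a contiguous descriptor from a strided view.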
Version Constraints
- version: 0 and version: 1 are deprecated and must be rejected by the decoder.
- version: 2 is the current version.
- Higher versions (3+) are reserved for future use and will be valid once defined.
NaN/Infinity in Simple Packing Parameters
If reference_value is NaN or Infinity, encoding fails immediately with a clear error. This value is used in the quantization formula and would produce corrupt output. (binary_scale_factor and decimal_scale_factor are integers and cannot be NaN/Infinity.)
Duplicate CBOR Keys
Duplicate keys at the same level in a CBOR map are never accepted. The library uses canonical CBOR (RFC 8949 §4.2) which inherently rejects duplicate keys. Same-name keys at different nesting levels are acceptable: base[0]["foo"] and _extra_["foo"] are distinct keys.
Unknown Hash Algorithm on Decode
If a message contains a hash with an algorithm the decoder doesn’t recognize (e.g. "sha256" when only xxh3 is implemented), verify_hash: true issues a warning and skips verification rather than returning an error. This ensures forward compatibility when new hash algorithms are added.
decode_range with Empty Ranges
Calling decode_range() with an empty ranges slice (&[]) returns (descriptor, vec![]) — the parts vector is empty. This is not an error.
Preceder Metadata Error Paths
The decoder validates PrecederMetadata frames strictly:
| Condition | Error type | Message |
|---|---|---|
| Consecutive preceders without DataObject | Framing | “PrecederMetadata must be followed by a DataObject frame, got {type}” |
| Dangling preceder (no DataObject follows) | Framing | “dangling PrecederMetadata: no DataObject frame followed” |
| Base has 0 or 2+ entries | Metadata | “PrecederMetadata base must have exactly 1 entry, got {n}” |
| Metadata base entries > data objects | Metadata | “metadata base has {n} entries but message contains {m} objects” |
On the encoder side:
- StreamingEncoder::write_preceder() errors if called twice without an intervening write_object().
- StreamingEncoder::finish() errors if a preceder was written without a following write_object().
- encode() (buffered mode) errors if emit_preceders: true — use StreamingEncoder::write_preceder() instead.
File Concatenation
Tensogram is a message format, not a file format. Multiple .tgm files can be concatenated:
cat 1.tgm 2.tgm > all.tgm
The resulting file is valid. scan() and TensogramFile will find all messages from both source files.
xarray Layer Edge Cases
meta.base Out-of-Range
If a message has more data objects than meta.base entries (e.g. 3 objects but base has only 1 entry), the xarray layer logs a warning and treats the missing base entries as empty dicts. The objects are still decoded — they just have no per-object metadata attributes.
This can happen when a message is encoded with an incomplete base array, or when objects are appended to a message without updating base. The warning helps diagnose silent metadata loss:
WARNING: meta.base has 1 entries but object index 2 requested;
per-object metadata will be empty for this object
Empty or Missing base Attribute
A message with base: [] or no base key at all is valid. All objects get empty per-object metadata and are named object_0, object_1, etc. The _reserved_ key (auto-populated by the encoder in each base entry) is always filtered out — it never appears in user-facing variable attributes.
Variable Naming with Dot Paths
When variable_key="mars.param" is used, the resolve_variable_name() function traverses the nested dict path. If any segment is missing, the function falls back to the generic object_<index> name. The obj_index used is the object’s position in the message (not its position among data variables), so a file with objects 0 (coord), 1 (data), 2 (data) would produce names like "object_1" and "object_2" for the data variables.
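The fallback behaviour can be sketched as a dot-path walk over plain dicts (a hypothetical simplified model of resolve_variable_name, not the xarray layer's actual code):

```python
def resolve_variable_name(per_object_meta: dict, variable_key: str,
                          obj_index: int) -> str:
    """Walk a dot path through nested dicts; fall back to object_<index>."""
    node = per_object_meta
    for segment in variable_key.split("."):
        if not isinstance(node, dict) or segment not in node:
            return f"object_{obj_index}"   # any missing segment falls back
        node = node[segment]
    return str(node)

assert resolve_variable_name({"mars": {"param": "2t"}}, "mars.param", 1) == "2t"
assert resolve_variable_name({"mars": {}}, "mars.param", 2) == "object_2"
```

Note that obj_index is the object's position in the message, which is why data variables can be named "object_1" and "object_2" even when they are the only data variables.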
Coordinate Name Case Insensitivity
Coordinate detection (detect_coords) is case-insensitive: "LATITUDE", "Lat", and "latitude" all match the known coordinate name "latitude". The canonical dimension name is always lowercase (e.g. "latitude", not "LATITUDE").
Ambiguous Dimension Size Matching
When two coordinate arrays have the same size (e.g. latitude with 5 points and depth with 5 points), the dimension resolution assigns the first matching coord to the first axis that matches the size, and the second to the next axis. If the data variable is 2D [5, 5], one axis gets "latitude" and the other gets "depth". When no coord has the matching size, the axis gets a generic "dim_N" name.
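The size-matching rule can be sketched as a greedy first-match pass over the axes. This is a simplified model (it relies on the insertion order of the coords dict to define "first"), not the library's actual resolver:

```python
def assign_dims(shape, coords):
    """coords maps coordinate name -> length; each coord is used at most once."""
    remaining = dict(coords)
    dims = []
    for axis, size in enumerate(shape):
        match = next((name for name, n in remaining.items() if n == size), None)
        if match is None:
            dims.append(f"dim_{axis}")   # no coord of this size: generic name
        else:
            dims.append(match)
            del remaining[match]         # consume the coord
    return dims

assert assign_dims([5, 5], {"latitude": 5, "depth": 5}) == ["latitude", "depth"]
assert assign_dims([3, 7], {"latitude": 5}) == ["dim_0", "dim_1"]
```

Consuming each coordinate after its first match is what resolves the [5, 5] ambiguity deterministically.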
Multi-Message Merge with Different Keys
When open_datasets() merges multiple messages, objects whose base entries have different key sets are handled as follows:
- Keys present in all objects with identical values become Dataset attributes (constant).
- Keys present in all objects with varying values become outer dimensions (if they form a hypercube) or separate variables.
- Keys present in some objects but not others are treated as varying with None for missing entries.
_reserved_ Filtering Consistency
The _reserved_ key is filtered at every access point:
- TensogramDataStore._get_per_object_meta() (store.py)
- _base_entry_from_meta() (scanner.py)
- _filter_reserved() (zarr store.py)
This ensures the encoder’s auto-populated tensor info (ndim, shape, strides, dtype) never leaks into user-facing metadata.
Zarr Layer Edge Cases
Group Attributes from meta.extra
Group-level attributes in the root zarr.json come from meta.extra (message-level annotations). If meta.extra is empty or absent, the group zarr.json only contains internal attributes (_tensogram_version, _tensogram_variables).
Per-Array Attributes from meta.base[i]
Per-array attributes come from meta.base[i] with the _reserved_ key filtered out. Descriptor encoding params are stored under _tensogram_params to avoid namespace collisions.
Variable Name Resolution — No Extra Fallback
Variable names are resolved exclusively from per_object_meta (from meta.base[i]). The common_meta (from meta.extra) is not searched for variable naming. This prevents all objects in a message from sharing the same name when a name key exists only at the message level.
This is consistent across both xarray and zarr layers.
Zarr Metadata Key Collision
If a base entry has keys like "zarr", "chunks", or "shape", they go into the Zarr array’s attributes dict — not the top-level metadata. There is no collision with Zarr’s own shape, chunk_grid, etc. fields.
Write Path: _reserved_ Filtering
When writing through TensogramStore, user-set array attributes are written into base[i] entries. The _reserved_ key is explicitly filtered from these entries to prevent collision with the encoder’s auto-populated _reserved_.tensor info.
Write Path: Group Attributes
Group attributes set via Zarr become unknown top-level keys in GlobalMetadata, which the encoder preserves as _extra_. On re-read, they appear in meta.extra. Internal keys (starting with _tensogram_) and reserved structural keys (version, base, _extra_, _reserved_) are excluded.
Empty TGM File
A .tgm file with zero messages produces a root group zarr.json with no arrays. A message with zero data objects produces a root group with the message’s extra metadata but no arrays.
Variable Name Deduplication
When multiple objects resolve to the same name, suffixes _1, _2, etc. are appended. For example, three objects named "x" become "x", "x_1", "x_2".
Variable Name Sanitization
Slashes and backslashes in resolved variable names are replaced with underscores to prevent spurious directory nesting in the Zarr virtual key space. Empty names are replaced with "_".
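Both rules together, as a sketch (this mirrors the described behaviour but is not the zarr layer's actual code; it also ignores the corner case where a literal "x_1" already exists among the inputs):

```python
def sanitize(name: str) -> str:
    """Replace path separators; map empty names to '_'."""
    name = name.replace("/", "_").replace("\\", "_")
    return name if name else "_"

def dedupe(names):
    """Append _1, _2, ... to repeated names, first occurrence unchanged."""
    counts, out = {}, []
    for name in names:
        if name in counts:
            counts[name] += 1
            out.append(f"{name}_{counts[name]}")
        else:
            counts[name] = 0
            out.append(name)
    return out

assert sanitize("a/b\\c") == "a_b_c"
assert sanitize("") == "_"
assert dedupe(["x", "x", "x"]) == ["x", "x_1", "x_2"]
```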
GRIB Importer Edge Cases
This section covers behaviour specific to the tensogram-grib importer and
the tensogram convert-grib CLI — these notes apply when you are bringing
GRIB data into Tensogram, not to Tensogram itself.
Single GRIB to base[0] Has ALL MARS Keys
In OneToOne mode, each GRIB message becomes one Tensogram message. All MARS namespace keys (plus gridType as "grid") go into base[0]["mars"]. When --all-keys is enabled, non-MARS namespace keys (geography, time, vertical, parameter, statistics) go into base[0]["grib"].
MergeAll with N Fields
In MergeAll mode, N GRIB fields become one Tensogram message with N data objects. Each base[i] holds ALL metadata for that object independently — there is no common/varying partitioning at encode time. This means metadata keys are duplicated across base entries.
Performance note: With 1000 GRIB fields, this means 1000 copies of common keys (class, type, stream, expver, date, time, etc.). This is by design — the wire format prioritizes simplicity and independent object access over byte savings. Use tensogram::compute_common() at display/merge time to extract shared keys.
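A flat-key sketch of the shared-key extraction (the real tensogram::compute_common presumably handles nested maps; this only illustrates the idea):

```python
def compute_common(base):
    """Return keys present in every base entry with identical values."""
    if not base:
        return {}
    common = dict(base[0])
    for entry in base[1:]:
        # keep a key only if this entry agrees on it
        common = {k: v for k, v in common.items()
                  if k in entry and entry[k] == v}
    return common

entries = [
    {"class": "od", "param": "2t"},
    {"class": "od", "param": "msl"},
]
assert compute_common(entries) == {"class": "od"}
```

Extracting shared keys at display or merge time is the intended counterpart to the duplicate-everything encoding strategy.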
Different Grid Types in MergeAll
GRIB fields with different grid types (e.g. regular_ll and reduced_gg) can be merged into the same Tensogram message. Each base[i]["mars"]["grid"] independently records its grid type. Downstream consumers (xarray, zarr) must handle the structural differences (e.g. different shapes).
GRIB Shape from Ni/Nj
The shape is derived from ecCodes Ni and Nj keys (row-major: [Nj, Ni]). If either is zero or missing (e.g. reduced Gaussian grids), the shape falls back to [numberOfPoints] (1-D).
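The derivation reduces to a simple rule, sketched here with hypothetical example values (treating 0 or a missing key as "unusable"):

```python
def grib_shape(ni, nj, number_of_points):
    """Row-major [Nj, Ni] when both are usable; 1-D fallback otherwise."""
    if ni and nj:                 # 0 or None counts as missing
        return [nj, ni]
    return [number_of_points]     # e.g. reduced Gaussian: Ni varies per row

assert grib_shape(360, 181, 65160) == [181, 360]   # regular lat-lon grid
assert grib_shape(0, 181, 108160) == [108160]      # reduced grid: 1-D fallback
```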
Empty params in DataObjectDescriptor
GRIB-converted data objects have empty desc.params — all metadata lives in base[i]["mars"] and base[i]["grib"], not in the per-object descriptor. This is by design: the descriptor carries only what’s needed to decode the payload (shape, dtype, encoding pipeline).
Metadata Model Edge Cases (base / _reserved_ / _extra_)
The v2 metadata model has three sections: base (per-object), _reserved_ (library internals), and _extra_ (client annotations). These create several non-obvious edge cases.
_reserved_ is Protected
Client code must not set _reserved_ in any context:
- Python: tensogram.encode({"version": 2, "_reserved_": {...}}) raises ValueError.
- Python: encode({"version": 2, "base": [{"_reserved_": {...}}]}) raises ValueError.
- FFI: JSON with "base": [{"_reserved_": {...}}] returns TgmError::Metadata.
- CLI: set -s _reserved_.tensor.ndim=5 returns an error.
The encoder auto-populates _reserved_.tensor in each base entry (ndim, shape, strides, dtype) and _reserved_ at the message level (encoder, time, uuid).
Metadata Lookup Semantics (base first-match)
All lookup functions (__getitem__ in Python, tgm_metadata_get_string in FFI, lookup_key in CLI) use first-match semantics:
- Search base[0], then base[1], …, skipping the _reserved_ key within each entry.
- If not found in any base entry, search _extra_.
- If not found → None (FFI/CLI) or KeyError (Python).
Implication: If base[0] has product.name="temperature" and base[1] has product.name="pressure", lookups return "temperature" (the first match). This is message-level lookup, not per-object. The same applies to any namespace (MARS, BIDS, DICOM, etc.).
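First-match lookup can be modeled like so. This is a sketch over plain dicts with dot paths, not the library's CBOR-backed implementation:

```python
def lookup(meta: dict, path: str):
    """Message-level first-match lookup: base entries in order, then _extra_."""
    def walk(node, segments):
        for seg in segments:
            if not isinstance(node, dict) or seg not in node:
                return None
            node = node[seg]
        return node

    segments = path.split(".")
    if not path or segments[0] == "_reserved_":
        return None                        # empty keys and _reserved_ never match
    for entry in meta.get("base", []):     # base[0], base[1], ... first match wins
        visible = {k: v for k, v in entry.items() if k != "_reserved_"}
        hit = walk(visible, segments)
        if hit is not None:
            return hit
    return walk(meta.get("_extra_", {}), segments)

meta = {
    "base": [
        {"product": {"name": "temperature"}, "_reserved_": {"tensor": {}}},
        {"product": {"name": "pressure"}},
    ],
    "_extra_": {"custom": "value"},
}
assert lookup(meta, "product.name") == "temperature"   # first match, message-level
assert lookup(meta, "custom") == "value"               # falls through to _extra_
assert lookup(meta, "_reserved_.tensor") is None       # blocked
```

The same walk also covers the deeply nested paths described later: hitting a non-dict value before the path is exhausted returns None.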
_reserved_ is Hidden from Dict Access
- meta["_reserved_"] → KeyError (Python). The key is skipped during base entry iteration. "_reserved_" in meta → False.
- tgm_metadata_get_string(meta, "_reserved_.tensor") → NULL (FFI). The path is blocked.
- To read _reserved_ data, use meta.reserved (Python) or read the base entry directly via meta.base[i]["_reserved_"].
Explicit _extra_ / extra Prefix
The CLI and FFI support explicit _extra_.key or extra.key prefixes to target the _extra_ map directly, bypassing the base search:
# CLI: write to _extra_ map
tensogram set -s "extra.custom=value" input.tgm output.tgm
tensogram set -s "_extra_.custom=value" input.tgm output.tgm
# CLI: read from _extra_ map
tensogram get -p "_extra_.custom" input.tgm
Without the prefix, set writes to all base entries. With the prefix, it writes to _extra_ specifically.
Empty Key String
An empty key "" returns None (FFI/CLI) or raises KeyError (Python). This is not an error — it simply finds no match.
base vs Descriptor Count
The base array length should match the number of data objects. The encoder auto-extends base entries (adding _reserved_.tensor) for each object. If the user provides fewer base entries than objects, the encoder creates entries for the missing ones. If the user provides more base entries than objects, the encoder returns an error.
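The alignment rule, sketched (error wording borrowed from the table later in this section; this is a model, not the encoder's code):

```python
def align_base(base, num_objects):
    """Auto-extend base to one entry per object; excess entries are an error."""
    if len(base) > num_objects:
        raise ValueError(
            f"metadata base has {len(base)} entries but only "
            f"{num_objects} descriptors provided; extra base entries "
            f"would be discarded"
        )
    # missing entries are created empty; the encoder then populates
    # _reserved_.tensor in each
    return base + [{} for _ in range(num_objects - len(base))]

assert align_base([{"a": 1}], 3) == [{"a": 1}, {}, {}]
```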
tgm_metadata_num_objects (FFI)
tgm_metadata_num_objects() returns base.len(), which is the number of per-object metadata entries. After encoding, this matches the actual data object count because the encoder populates one base entry per object.
set Command on Zero-Object Messages
The CLI set command redirects mutations to _extra_ when the message has zero data objects. This is because base entries must align 1:1 with descriptors, and a zero-object message has no descriptors.
Both _extra_ and extra in Python Dict
When both "_extra_" and "extra" are present in a Python metadata dict, _extra_ takes precedence (it’s the wire-format name). The "extra" key is treated as a convenience alias and only used if "_extra_" is absent.
Filter Matching with Multi-Object Messages
CLI where-clause filters (-w mars.param=2t) match at the message level. If base[0] has mars.param=2t and base[1] has mars.param=msl, the filter matches "2t" (first base entry match). To filter by per-object values, split the message first.
Split Preserves Per-Object Metadata
When splitting a multi-object message, the CLI split command assigns each object its own base entry from the original message. The _reserved_ key is stripped from each entry (the encoder regenerates it). Extra metadata is copied to all split messages.
Merge Concatenates Base Arrays
When merging messages, the CLI merge command concatenates all base arrays. The merge strategy (first/last/error) only applies to _extra_ key conflicts. The _reserved_ section is cleared and regenerated by the encoder.
Deeply Nested Paths
Dot-notation paths support arbitrary nesting depth: grib.geography.Ni, a.b.c.d.e. The recursive resolver walks through CBOR Map values at each level. If a non-Map value is encountered before the path is fully resolved, the lookup returns None.
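The lookup logic can be sketched in Python over plain dicts standing in for CBOR Maps (the function name resolve_path is illustrative, not part of the bindings):

```python
def resolve_path(node, path):
    """Walk a dot-notation path through nested dicts; None if unresolvable."""
    for part in path.split("."):
        if not isinstance(node, dict):  # non-Map hit before path is exhausted
            return None
        if part not in node:
            return None
        node = node[part]
    return node

meta = {"grib": {"geography": {"Ni": 360}}}
print(resolve_path(meta, "grib.geography.Ni"))    # → 360
print(resolve_path(meta, "grib.geography.Ni.x"))  # → None (360 is not a Map)
```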
JSON Output Structure
CLI dump -j and ls -j output uses the wire-format structure:
{
"version": 2,
"base": [{"mars": {"param": "2t"}, "_reserved_": {"tensor": {"ndim": 1}}}],
"extra": {"custom": "value"}
}
The _reserved_ keys within base entries are included in JSON output for transparency.
Metadata Refactor: Detailed Edge Cases
The following edge cases were identified during systematic review of the Rust core crate (tensogram) after the metadata refactor.
base Array Count Validation
| Scenario | Behaviour |
|---|---|
| base.len() < descriptors.len() | Auto-extended with empty entries. _reserved_.tensor is inserted in each. |
| base.len() == descriptors.len() | Normal path. Pre-existing application keys preserved. |
| base.len() > descriptors.len() | Error: "metadata base has N entries but only M descriptors provided; extra base entries would be discarded". |
Rationale: Silently truncating excess base entries would lose user data. Auto-extending is safe because the library adds _reserved_.tensor to each new entry.
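A minimal sketch of this rule, using plain dicts for base entries (the helper name validate_base is hypothetical):

```python
def validate_base(base, num_descriptors):
    """Auto-extend base to match the descriptor count, or error if too long."""
    if len(base) > num_descriptors:
        raise ValueError(
            f"metadata base has {len(base)} entries but only "
            f"{num_descriptors} descriptors provided; extra base entries "
            f"would be discarded"
        )
    # Auto-extend with empty entries; the encoder later fills in
    # _reserved_.tensor for each one.
    return base + [{} for _ in range(num_descriptors - len(base))]

print(len(validate_base([{"mars": {"param": "2t"}}], 3)))  # → 3
```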
_reserved_.tensor After Encode
After encoding, each base[i]["_reserved_"]["tensor"] always contains exactly four keys:
| Key | Value | Example |
|---|---|---|
| ndim | CBOR integer | 0 for scalar, 2 for matrix |
| shape | CBOR array of integers | [] for scalar, [10, 20] for matrix |
| strides | CBOR array of integers | [] for scalar, [20, 1] for matrix |
| dtype | CBOR text | "float32", "int64", etc. |
For scalar tensors (ndim: 0), shape and strides are empty arrays [].
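The strides column follows from the shape alone for contiguous tensors: the example [20, 1] for shape [10, 20] is row-major element strides. A sketch of that derivation (assuming row-major layout, as the example suggests):

```python
def contiguous_strides(shape):
    """Row-major element strides: the last axis has stride 1, and each
    earlier axis strides by the product of the dimensions after it."""
    strides = []
    acc = 1
    for dim in reversed(shape):
        strides.append(acc)
        acc *= dim
    return list(reversed(strides))

print(contiguous_strides([10, 20]))  # → [20, 1]
print(contiguous_strides([]))        # → [] (scalar)
```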
Preceder _reserved_ Protection
Encoder side: StreamingEncoder::write_preceder() rejects any metadata map containing a _reserved_ key. Error: “client code must not write ‘reserved’ in preceder metadata”.
Decoder side: When the decoder encounters a _reserved_ key in a preceder’s base[0], it strips the key rather than rejecting the message. This is permissive — the data may come from a non-standard producer. The encoder-populated _reserved_.tensor from the footer metadata is preserved.
Merge order in finish(): Footer metadata is populated first (_reserved_.tensor), then preceder payloads are merged on top. Since the decoder strips _reserved_ from preceders, there is no risk of preceder _reserved_ clobbering the encoder’s _reserved_.tensor.
Backward Compatibility with Old CBOR Keys
| Old key | Behaviour on decode |
|---|---|
| "common" (v2 pre-refactor) | Silently ignored (unknown CBOR key). |
| "payload" (v2 pre-refactor) | Silently ignored. |
| "reserved" (old name) | Silently ignored — only "_reserved_" is recognized. |
| Both "reserved" and "_reserved_" | Only "_reserved_" is captured; "reserved" is ignored. |
GlobalMetadata does not use #[serde(deny_unknown_fields)], so serde drops unrecognized keys.
compute_common() Key Selection
compute_common() only examines keys from the first base entry as candidates for common keys. Keys present in later entries but absent from the first entry are never promoted to common.
Example: if entry 0 has keys {a, b} and entry 1 has {b, c}, only b is a candidate (and becomes common if values match). Key c appears only in entry 1’s remaining set.
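The selection rule can be sketched with plain dicts, using Python == as a stand-in for cbor_values_equal():

```python
def compute_common(entries):
    """Keys of the first entry whose values match in every other entry."""
    if not entries:
        return {}
    return {
        k: v
        for k, v in entries[0].items()
        # Keys absent from entry 0 are never candidates, even if later
        # entries agree on them.
        if all(k in e and e[k] == v for e in entries[1:])
    }

entries = [{"a": 1, "b": 2}, {"b": 2, "c": 3}]
print(compute_common(entries))  # → {'b': 2}
```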
compute_common() NaN Handling
CBOR Float(NaN) values with identical bit patterns are treated as equal by cbor_values_equal(), using f64::to_bits() comparison. This means NaN values are classified as common when all entries share the same NaN bit pattern. Standard CBOR equality (PartialEq) would fail because NaN != NaN.
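The bit-pattern comparison can be reproduced in Python with the stdlib struct module (a stand-in for f64::to_bits()):

```python
import struct

def f64_bits(x):
    """Reinterpret a float's 8 IEEE-754 bytes as an unsigned 64-bit integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

nan = float("nan")
print(nan == nan)                      # → False (IEEE NaN semantics)
print(f64_bits(nan) == f64_bits(nan))  # → True (same bit pattern)
```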
compute_common() CBOR Map Ordering
cbor_values_equal() compares CBOR maps positionally (entry-by-entry). Two maps with the same keys and values in different order are NOT equal. This is correct because canonical CBOR encoding ensures all maps are always sorted — different-order maps can only arise from non-canonical input.
Shape Product Overflow
All shape-product computations use checked_mul to detect overflow. This applies to encode(), decode(), ObjectIter::next(), and decode_range(). If the product overflows u64, a TensogramError::Metadata("shape product overflow") is returned. No silent wraparound.
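A Python sketch of the checked product (Python integers are unbounded, so the u64 limit is tested explicitly; the function name is illustrative):

```python
U64_MAX = 2**64 - 1

def checked_shape_product(shape):
    """Product of the dimensions, erroring instead of wrapping past u64."""
    product = 1
    for dim in shape:
        product *= dim
        if product > U64_MAX:
            raise ValueError("shape product overflow")
    return product

print(checked_shape_product([10, 20, 30]))  # → 6000
```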
_extra_ Scope Independence
_extra_ is message-level, while base[i] entries are per-object. Keys with the same name can exist in both:
meta.base[0].insert("mars".into(), ...); // per-object
meta.extra.insert("mars".into(), ...);   // message-level
// Both preserved after encode/decode round-trip
Empty _extra_ in CBOR
An empty _extra_ map is omitted from CBOR output via skip_serializing_if = "BTreeMap::is_empty". On decode, a missing _extra_ key is deserialized as an empty BTreeMap. Round-trips correctly.
Deeply Nested _reserved_ in base Entries
Only the top-level _reserved_ key in base[i] is rejected by the encoder. Deeply nested _reserved_ keys (like {"foo": {"_reserved_": ...}}) are allowed and preserved. The encoder only checks entry.contains_key("_reserved_").
CLI set on Zero-Object Messages
When tensogram set modifies a zero-object message, keys that would normally go into base are redirected to _extra_ instead (since base entries must align 1:1 with data objects, and there are none).
Error Handling Reference
This section documents all error types, how they propagate across languages, and what messages users can expect.
TensogramError Variants (Rust)
The core library defines seven error variants in TensogramError:
| Variant | When it occurs | Example message |
|---|---|---|
| Framing(String) | Invalid wire format — magic bytes, postamble, frame ordering | "buffer too short (12 bytes, need >= 24)" |
| Metadata(String) | Metadata validation failures — version, base count, CBOR parse | "metadata base has 3 entries but only 2 descriptors provided" |
| Encoding(String) | Encoding pipeline errors — simple_packing NaN, bit-width | "NaN value at index 42" |
| Compression(String) | Compression/decompression failures — codec errors, range access | "RangeNotSupported: zstd does not support partial decode" |
| Object(String) | Per-object errors — index out of range, shape overflow | "object index 99 out of range (num_objects=2)" |
| Io(io::Error) | File system errors — open, read, write, seek | "data.tgm: No such file or directory" |
| HashMismatch { expected, actual } | Integrity check failure | "hash mismatch: expected=abc123, actual=def456" |
Python Exception Mapping
The Python bindings convert TensogramError to Python exceptions:
| Rust variant | Python exception | Prefix in message |
|---|---|---|
| Framing | ValueError | FramingError: |
| Metadata | ValueError | MetadataError: |
| Encoding | ValueError | EncodingError: |
| Compression | ValueError | CompressionError: |
| Object | ValueError | ObjectError: |
| Io | IOError | (raw io message) |
| HashMismatch | RuntimeError | HashMismatch: |
Additional Python-side exceptions:
| Function | Exception | Condition |
|---|---|---|
| encode() | ValueError | Missing version key, _reserved_ in dict, unknown dtype |
| decode() | ValueError | Corrupted buffer, invalid CBOR |
| Metadata.__getitem__() | KeyError | Key not found in base or extra |
| Metadata.__getitem__("_reserved_") | KeyError | _reserved_ is always hidden from dict access |
| TensogramFile.__getitem__() | IndexError | Message index out of range |
| TensogramFile.__getitem__() | TypeError | Non-integer, non-slice index |
| compute_packing_params() | ValueError | NaN in input array |
| encode(hash="sha256") | ValueError | "unknown hash: sha256" |
Example: handling errors in Python:
import tensogram
# File not found
try:
with tensogram.TensogramFile.open("missing.tgm") as f:
pass
except IOError as e:
print(f"File error: {e}")
# → "File error: file not found: missing.tgm"
# Corrupted buffer
try:
tensogram.decode(b"garbage")
except ValueError as e:
print(f"Decode error: {e}")
# → "Decode error: FramingError: buffer too short ..."
# Hash verification failure
try:
meta, objects = tensogram.decode(buf, verify_hash=True)
except RuntimeError as e:
print(f"Integrity error: {e}")
# → "Integrity error: HashMismatch: expected=..., actual=..."
# Missing metadata key
meta, objects = tensogram.decode(buf)
try:
val = meta["nonexistent"]
except KeyError:
print("Key not found")
# Index out of range
with tensogram.TensogramFile.open("data.tgm") as f:
try:
msg = f[999]
except IndexError as e:
print(f"Index error: {e}")
# → "message index 999 out of range for file with 2 messages"
CLI Error Handling
All CLI commands:
- Print errors to stderr with an error: prefix
- Show the full error chain (nested causes)
- Exit with code 1 on any error
- Exit with code 0 on success
Common CLI error scenarios:
# File not found
$ tensogram ls nonexistent.tgm
error: file not found: nonexistent.tgm
# Invalid where clause
$ tensogram ls -w "bad-clause" data.tgm
error: invalid where clause: invalid where-clause: bad-clause (expected key=value or key!=value)
# Missing key in strict get
$ tensogram get -p "nonexistent" data.tgm
error: key not found: nonexistent
# Protected namespace
$ tensogram set -s "_reserved_.tensor.ndim=5" input.tgm output.tgm
error: cannot modify '_reserved_' — this namespace is managed by the library
# Immutable descriptor key
$ tensogram set -s "shape=broken" input.tgm output.tgm
error: cannot modify immutable key: shape
# Merge conflict with error strategy
$ tensogram merge --strategy error a.tgm b.tgm -o merged.tgm
error: conflicting values for key 'param' (use --strategy first or last to resolve)
# Invalid merge strategy
$ tensogram merge --strategy unknown a.tgm b.tgm -o merged.tgm
error: unknown merge strategy 'unknown': expected first, last, or error
# Corrupt file
$ tensogram dump corrupt.tgm
error: framing error: buffer too short ...
xarray Backend Error Handling
| Scenario | Behaviour |
|---|---|
| File not found | IOError from tensogram.TensogramFile.open() |
| Corrupt file | ValueError from tensogram.decode_descriptors() |
| message_index out of range | ValueError from TensogramFile.read_message() |
| message_index < 0 | ValueError("message_index must be >= 0, got -1") |
| meta.base shorter than objects | Warning logged; missing entries treated as empty dicts |
| Unsupported dtype | TypeError("unsupported tensogram dtype ...") |
| dim_names count mismatch | ValueError("dim_names has N entries but tensor has M dimensions") |
| decode_range failure | Warning logged; falls back to full decode_object() |
| File with zero messages + merge_objects=True | Returns empty xr.Dataset() |
Zarr Store Error Handling
| Scenario | Behaviour |
|---|---|
| File not found | OSError("failed to open TGM file ...") wrapping the original error |
| Corrupt message | ValueError("failed to decode message ...") wrapping the original error |
| Failed object decode | ValueError("failed to decode object N ...") wrapping the original error |
| message_index out of range | IndexError("message_index N out of range (file has M message(s))") |
| message_index < 0 | ValueError("message_index must be >= 0, got -1") |
| Invalid mode | ValueError("invalid mode 'x'; expected 'r', 'w', or 'a'") |
| Empty path | ValueError("path must be a non-empty string, got ''") |
| Store already open | ValueError("store is already open") |
| Write to read-only store | Raises from Zarr base class |
| Flush failure during exception | Warning logged; original exception preserved |
| Unsupported dtype on write | ValueError("unsupported dtype for variable ...") |
| Chunk size mismatch on write | ValueError("chunk data for 'var': expected N bytes ... got M") |
| Multiple chunks per variable | ValueError("variable 'var' has N chunk keys; TensogramStore only supports single-chunk arrays") |
| Unsupported ByteRequest type | TypeError("unsupported ByteRequest type: ...") |
| Zero messages in file | Root group zarr.json with empty attributes; no arrays |
IO Error Path Context
All file I/O errors include the file path in the error message. This applies to:
- TensogramFile::open() — "file not found: /path/to/file.tgm"
- TensogramFile::create() — "cannot create /path/to/file.tgm: Permission denied"
- Internal re-opens (scan, read, append) — "/path/to/file.tgm: No such file or directory"
This ensures that when errors propagate through multiple layers (e.g. Rust → Python → xarray), the original file path is always visible in the error message.
Internals
This page explains implementation decisions that are not obvious from the public API. Useful if you’re contributing to the library or implementing a compatible reader in another language.
Deterministic CBOR Canonicalization
The library encodes all CBOR structures (global metadata, data object descriptors, index frames, hash frames) using a three-step process:
1. Serialize the struct to a ciborium::Value tree using serde.
2. Recursively sort all map keys by their CBOR byte encoding.
3. Write the sorted Value tree to bytes.
Standard serde serialization into ciborium does not guarantee key order (it depends on the HashMap/BTreeMap iteration order of the struct). Even though the library uses BTreeMap throughout (which gives alphabetical iteration order for string keys), relying on that would be fragile. The explicit canonicalization step ensures the output matches RFC 8949 §4.2 regardless of how the keys were stored.
GlobalMetadata / DataObjectDescriptor struct
↓ serde serialization
ciborium::Value::Map (arbitrary key order)
↓ canonicalize() — sort all maps recursively by CBOR-encoded key bytes
ciborium::Value::Map (canonical order)
↓ write to bytes
CBOR bytes (deterministic)
Note: canonicalize() returns Result<()> and propagates errors rather than panicking.
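As a rough illustration, the recursive sort can be sketched over nested dicts. For short UTF-8 text keys, sorting by encoded length and then bytes approximates the CBOR-encoded-key order described above; the real implementation sorts by the full CBOR encoding of each key:

```python
def canonicalize(value):
    """Recursively sort dict keys by (encoded length, bytes), a simplified
    stand-in for RFC 8949 deterministic order on short text keys."""
    if isinstance(value, dict):
        items = sorted(
            value.items(),
            key=lambda kv: (len(kv[0].encode()), kv[0].encode()),
        )
        # dicts preserve insertion order, so the result iterates canonically
        return {k: canonicalize(v) for k, v in items}
    if isinstance(value, list):
        return [canonicalize(v) for v in value]
    return value

print(list(canonicalize({"bb": 1, "a": {"zz": 2, "y": 3}}).keys()))  # → ['a', 'bb']
```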
BTreeMap Throughout
The extra (serialized as _extra_), reserved (serialized as _reserved_), and base entry fields in GlobalMetadata, as well as the params field in DataObjectDescriptor, are BTreeMap<String, ciborium::Value>. This:
- Gives alphabetical iteration order for string keys (which matches CBOR canonical order for short strings).
- Avoids the non-determinism of HashMap.
- Makes it easy to read and write keys without worrying about order.
Frame-Based Wire Format (v2)
The v2 wire format uses a frame-based structure instead of the v1 monolithic binary header.
Preamble (24 bytes)
MAGIC "TENSOGRM" (8) + version u16 (2) + flags u16 (2) + reserved u32 (4) + total_length u64 (8)
The preamble flags indicate which optional frames are present (header/footer metadata, index, hashes). total_length = 0 signals streaming mode.
Frame Header (16 bytes)
Every frame (metadata, index, hash, data object) starts with:
"FR" (2) + frame_type u16 (2) + version u16 (2) + flags u16 (2) + total_length u64 (8)
And ends with "ENDF" (4 bytes). Frame versions are independent of message version.
Data Object Frame Layout
Each data object is a self-contained frame:
Frame header (16B) + [CBOR descriptor] + payload bytes + [CBOR descriptor] + cbor_offset u64 (8B) + "ENDF" (4B)
The cbor_offset is the byte offset from the frame start to the CBOR descriptor. A flag bit controls whether the CBOR descriptor appears before or after the payload (default: after, since encoding parameters like hash are only known after encoding completes).
Postamble (16 bytes)
first_footer_offset u64 (8) + END_MAGIC "39277777" (8)
first_footer_offset is never zero. It points to the first footer frame, or to the postamble itself when no footer frames are present.
Two-Pass Index Construction
When encoding a non-streaming message, the index frame contains byte offsets of each data object. But the index frame’s own size affects those offsets (circular dependency). The encoder solves this with a two-pass approach:
- First pass: compute index CBOR with placeholder offsets to determine the index frame size.
- Second pass: compute final offsets using the known index frame size, re-encode the index CBOR.
If the re-encoded CBOR changes size (edge case), the encoder returns an error rather than silently producing incorrect offsets.
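The two passes can be sketched as follows, using fixed-width u64 offsets in place of CBOR (real CBOR integers vary in size, which is exactly why the pass-2 size check exists; all names here are illustrative):

```python
import struct

def encode_index(offsets):
    """Fixed-width u64 offsets; real code emits CBOR, whose integer
    encoding is size-dependent."""
    return b"".join(struct.pack("<Q", o) for o in offsets)

def build_index(object_sizes, header_len):
    # Pass 1: placeholder offsets, just to measure the index frame's size.
    index_len = len(encode_index([0] * len(object_sizes)))
    # Pass 2: real offsets, computed with the known index frame size.
    offsets, pos = [], header_len + index_len
    for size in object_sizes:
        offsets.append(pos)
        pos += size
    final = encode_index(offsets)
    if len(final) != index_len:
        # The documented edge case: error out rather than emit bad offsets.
        raise ValueError("index frame size changed between passes")
    return offsets, final

offsets, _ = build_index([100, 200], header_len=24)
print(offsets)  # → [40, 140]
```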
Encoder Structure
The encode_message() function delegates to five focused helpers:
- build_hash_frame_cbor() — collects hashes from objects and serializes the HashFrame
- build_index_frame() — runs the two-pass index construction described above
- compute_object_offsets() — calculates byte offsets with 8-byte alignment
- compute_message_flags() — sets preamble flags from optional frame presence
- assemble_message() — writes preamble, frames, and postamble into the final buffer
simple_packing Bit Layout
Values are packed MSB-first (most significant bit first), following the same bit layout as the GRIB 2 simple_packing specification so that quantised payloads are interoperable with existing GRIB tooling:
Element 0: bits [0 .. B-1]
Element 1: bits [B .. 2B-1]
Element 2: bits [2B .. 3B-1]
...
The last byte is zero-padded on the right if N × B is not a multiple of 8.
The decode formula is:
V[i] = R + (packed[i] × 2^E) / 10^D
Where:
- R = reference_value (minimum of original data)
- E = binary_scale_factor
- D = decimal_scale_factor
- packed[i] = the integer read from the packed bits
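A Python sketch of the MSB-first packing and the decode formula above (illustrative helper names, not the library's codec):

```python
def pack_msb_first(values, bits):
    """Pack unsigned ints MSB-first, zero-padding the last byte on the right."""
    out, acc, nbits = bytearray(), 0, 0
    for v in values:
        acc = (acc << bits) | v
        nbits += bits
        while nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
    if nbits:
        out.append((acc << (8 - nbits)) & 0xFF)  # right-pad with zeros
    return bytes(out)

def decode_value(packed, R, E, D):
    """V[i] = R + (packed[i] * 2**E) / 10**D, per the formula above."""
    return R + (packed * 2**E) / 10**D

# Three 3-bit values occupy 9 bits, so two bytes with 7 padding bits.
print(pack_msb_first([0b101, 0b011, 0b110], bits=3).hex())  # → 'af00'
print(decode_value(1000, R=250.0, E=-2, D=1))               # → 275.0
```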
Lazy File Scanning
TensogramFile::open() does not read the file. The first call that needs the message list (e.g. message_count(), read_message()) triggers a streaming scan using scan_file(). The scanner reads only preamble-sized chunks and seeks forward, so it never loads the entire file into memory. After that, the list of (offset, length) pairs is cached in memory for the lifetime of the TensogramFile object.
// No I/O here
let mut file = TensogramFile::open("huge.tgm")?;
// Streaming scan happens here (once) — reads preamble chunks, seeks forward
let count = file.message_count()?;
// O(1) seek + read
let msg = file.read_message(999)?;
Error Hierarchy
TensogramError
├── Framing — invalid magic, truncated preamble, bad frame markers, missing postamble
├── Metadata — CBOR serialization/deserialization failure
├── Encoding — invalid encoding params, NaN in simple_packing
├── Compression — compressor error (szip, zstd, lz4, blosc2, zfp, sz3)
├── Object — index out of range
├── Io — filesystem errors (wraps std::io::Error)
└── HashMismatch { expected, actual } — payload integrity failure
All public functions return Result<T> where the error is TensogramError. The Io variant wraps std::io::Error via the From impl, so ? on any std::io::Result produces a TensogramError::Io automatically.
Memory-Mapped I/O (mmap feature)
The mmap feature gate enables memory-mapped file access via memmap2. When you open a file with TensogramFile::open_mmap(), the file is mapped into virtual memory and the existing scan() function runs directly on the mapped buffer. Subsequent read_message() calls return copies from the mapped region without additional seeks.
// Requires: cargo build --features mmap
let mut file = TensogramFile::open_mmap("huge.tgm")?;
let count = file.message_count()?; // already scanned during open_mmap
let msg = file.read_message(42)?; // copies from mmap, no seek
The regular open() path still works without the feature and uses streaming seek-based scanning.
Async I/O (async feature)
The async feature gate adds tokio-based async variants: open_async(), read_message_async(), and decode_message_async(). All CPU-intensive work (scanning, decoding, FFI calls to libaec/zfp/blosc2) runs via spawn_blocking to avoid blocking the async runtime.
// Requires: cargo build --features async
let mut file = TensogramFile::open_async("forecast.tgm").await?;
let (meta, objects) = file.decode_message_async(0, &opts).await?;
Frame Ordering Validation
The decoder enforces that frames appear in the expected order within a message: header frames first, then data object frames, then footer frames. A DecodePhase state machine tracks the current phase and returns TensogramError::Framing if a frame type appears out of order.
This catches malformed messages where, for example, a header metadata frame appears after a data object frame.
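The phase check can be sketched as a small state machine (a Python stand-in with illustrative frame-type names):

```python
from enum import IntEnum

class Phase(IntEnum):
    HEADER = 0
    DATA = 1
    FOOTER = 2

# Which phase each frame type belongs to (illustrative mapping).
FRAME_PHASE = {
    "header_meta": Phase.HEADER,
    "data_object": Phase.DATA,
    "footer_meta": Phase.FOOTER,
}

def validate_order(frame_types):
    """Raise if any frame belongs to an earlier phase than one already seen."""
    phase = Phase.HEADER
    for ft in frame_types:
        p = FRAME_PHASE[ft]
        if p < phase:
            raise ValueError(f"framing error: {ft} frame after {phase.name} phase")
        phase = p

validate_order(["header_meta", "data_object", "footer_meta"])  # ok
# validate_order(["data_object", "header_meta"])  # would raise
```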
Canonical CBOR Verification
The library provides verify_canonical_cbor() to check that a CBOR byte slice is in RFC 8949 §4.2.1 canonical form. This is used internally by tests to verify that all CBOR output (metadata, descriptors, index frames, hash frames) is deterministic. It can also be used by external tools that need to validate Tensogram CBOR output against the spec.