Zarr v3 Backend

The tensogram-zarr package implements a Zarr v3 Store backed by .tgm files. This lets you read and write Tensogram data through the standard Zarr Python API.

Installation

uv venv .venv && source .venv/bin/activate   # if not already in a virtualenv
uv pip install tensogram-zarr

Requires zarr >= 3.0, tensogram, and numpy.

Reading a .tgm file through Zarr

import zarr
from tensogram_zarr import TensogramStore

# Open existing .tgm file as a read-only Zarr store
store = TensogramStore.open_tgm("data.tgm")
root = zarr.open_group(store=store, mode="r")

# Browse available arrays
for name, arr in root.members():
    print(f"{name}: shape={arr.shape}, dtype={arr.dtype}")

# Read an array (decoded eagerly at store open, served from memory)
temperature = root["2t"][:]
print(temperature.shape, temperature.mean())

# Access group-level metadata (from GlobalMetadata _extra_)
# The example below shows a MARS namespace; the attributes dict reflects
# whatever namespaces the producer put in the message's GlobalMetadata.
print(root.attrs["mars"])  # {'class': 'od', 'type': 'fc', ...}

How the mapping works

Each .tgm message maps to a Zarr group:

zarr.json                     # root group ← GlobalMetadata
temperature/zarr.json         # array metadata ← DataObjectDescriptor
temperature/c/0/0             # chunk data ← decoded object payload
pressure/zarr.json            # another array
pressure/c/0/0                # its chunk data

graph LR
    TGM[".tgm file"] --> GM["GlobalMetadata"]
    TGM --> OBJ1["Object 0: temperature"]
    TGM --> OBJ2["Object 1: pressure"]
    
    GM --> GZJ["zarr.json (group)"]
    OBJ1 --> AZJ1["temperature/zarr.json"]
    OBJ1 --> CHK1["temperature/c/0/0"]
    OBJ2 --> AZJ2["pressure/zarr.json"]
    OBJ2 --> CHK2["pressure/c/0/0"]
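
The key layout above can be sketched in a few lines. This is an illustration only, not the store's actual code: `virtual_keys` is a hypothetical helper, and it assumes the default Zarr v3 chunk key encoding (`c` followed by one `/`-separated index per dimension) with every array held in a single chunk.

```python
def virtual_keys(arrays: dict[str, tuple[int, ...]]) -> list[str]:
    """List the Zarr keys one message would expose (hypothetical helper).

    `arrays` maps a variable name to its shape.
    """
    keys = ["zarr.json"]  # root group metadata
    for name, shape in arrays.items():
        keys.append(f"{name}/zarr.json")  # array metadata
        # Single chunk per array: every chunk index is 0, one per dimension.
        chunk_key = "/".join(["c"] + ["0"] * len(shape))
        keys.append(f"{name}/{chunk_key}")
    return keys


for key in virtual_keys({"temperature": (721, 1440), "pressure": (721, 1440)}):
    print(key)
```

With two 2-D arrays this prints the same five keys as the listing above. A 1-D array would get a chunk key of `c/0` under the same assumption.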

Key design decisions:

  • Each TGM data object becomes one Zarr array with a single chunk (chunk shape = array shape)
  • Variable names are resolved from metadata via a default lookup path (name, mars.param, param, mars.shortName, shortName), or a custom dot-path you supply
  • TGM encoding metadata is preserved in Zarr array attributes under _tensogram_* keys
  • Duplicate variable names get a numeric suffix (field, field_1)

Variable naming

By default, the store tries these metadata paths to name arrays:

  1. name
  2. mars.param
  3. param
  4. mars.shortName
  5. shortName
  6. Falls back to object_<index>

The lookup runs against a shallow, top-level merge of meta.base[i] and desc.params (matching the xarray backend’s long-standing behaviour): for each top-level key, meta.base[i] wins if present; otherwise desc.params fills in. Nested dicts are not deep-merged. For example, a base[i] entry of {"mars": {"levtype": "sfc"}} entirely shadows a desc.params["mars"] of {"param": "2t"}, so no mars.param from the descriptor survives. The priority chain above is then applied to the merged dict, so a higher-priority key from either source beats a lower-priority one.
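
The merge and lookup rules can be made concrete with a small pure-Python sketch. The function names (`shallow_merge`, `lookup`, `resolve_name`) are hypothetical and chosen for illustration; only the priority list and the merge semantics come from the description above.

```python
DEFAULT_PATHS = ("name", "mars.param", "param", "mars.shortName", "shortName")


def shallow_merge(base_entry: dict, desc_params: dict) -> dict:
    """Merge top-level keys only; base_entry wins on conflicts."""
    return {**desc_params, **base_entry}


def lookup(meta: dict, dot_path: str):
    """Walk a dot-path such as 'mars.param'; None if any step is missing."""
    node = meta
    for part in dot_path.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def resolve_name(base_entry: dict, desc_params: dict, index: int) -> str:
    """Apply the priority chain to the merged dict, then fall back."""
    merged = shallow_merge(base_entry, desc_params)
    for path in DEFAULT_PATHS:
        value = lookup(merged, path)
        if value is not None:
            return str(value)
    return f"object_{index}"


# base["mars"] shadows the whole descriptor "mars" dict, so mars.param is lost.
print(resolve_name({"mars": {"levtype": "sfc"}}, {"mars": {"param": "2t"}}, 0))
# A higher-priority key from the descriptor beats a lower-priority base key.
print(resolve_name({"param": "t"}, {"name": "temperature"}, 1))
```

The first call prints `object_0` and the second prints `temperature`, which matches the shadowing and priority behaviour described above.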

You can override with any dot-path, including non-MARS vocabularies:

# Weather pipeline using MARS
store = TensogramStore.open_tgm("weather.tgm", variable_key="mars.param")

# Neuroimaging pipeline using BIDS
store = TensogramStore.open_tgm("scans.tgm", variable_key="bids.task")

# Custom vocabulary
store = TensogramStore.open_tgm("data.tgm", variable_key="product.name")

Common pitfall: name in the descriptor dict

Application metadata belongs in meta["base"][i], not in the descriptor:

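import numpy as np
import tensogram

data = np.zeros((10, 8), dtype=np.float32)  # placeholder payload matching the descriptor

# ✗ Avoid — triggers a UserWarning; works via fallback but is not canonical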
desc = {"type": "ntensor", "shape": [10, 8], "dtype": "float32",
        "name": "temperature"}  # ← goes into desc.params, flagged
tensogram.encode({}, [(desc, data)])

# ✓ Canonical — the simplest form
meta = { "base": [{"name": "temperature"}]}
desc = {"type": "ntensor", "shape": [10, 8], "dtype": "float32"}
tensogram.encode(meta, [(desc, data)])

The descriptor fallback exists so files produced by the ✗ form still surface the correct names in zarr (and xarray); the write-side warning exists so the mistake is visible at the time it’s made. See issue #67.

Multi-message files

By default the store reads message 0. Select a different message with message_index:

store = TensogramStore.open_tgm("multi.tgm", message_index=2)

Writing a .tgm file through Zarr

import numpy as np
import zarr
from tensogram_zarr import TensogramStore

store = TensogramStore("output.tgm", mode="w")
root = zarr.open_group(store=store, mode="w")

# Create arrays — data is buffered in memory
root.create_array("temperature", data=np.random.rand(100, 200).astype(np.float32))
root.create_array("pressure", data=np.array([1000, 925, 850, 700], dtype=np.float64))

# Close flushes to .tgm
store.close()

The write path assembles all arrays into a single TGM message when the store is closed.

Context manager

with TensogramStore("data.tgm", mode="r") as store:
    root = zarr.open_group(store=store, mode="r")
    data = root["temperature"][:]
# Store automatically closed

Supported data types

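| Tensogram dtype | Zarr data_type | NumPy dtype |
|-----------------|----------------|-------------|
| float16         | float16        | float16     |
| float32         | float32        | float32     |
| float64         | float64        | float64     |
| int8            | int8           | int8        |
| int16           | int16          | int16       |
| int32           | int32          | int32       |
| int64           | int64          | int64       |
| uint8           | uint8          | uint8       |
| uint16          | uint16         | uint16      |
| uint32          | uint32         | uint32      |
| uint64          | uint64         | uint64      |
| complex64       | complex64      | complex64   |
| complex128      | complex128     | complex128  |
| bitmask         | uint8          | uint8       |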

Byte range support

The store supports Zarr’s ByteRequest types for efficient partial reads:

  • RangeByteRequest(start, end) — read a byte range
  • OffsetByteRequest(offset) — read from offset to end
  • SuffixByteRequest(suffix) — read last N bytes
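
To make the three request shapes concrete, the sketch below applies them to an in-memory chunk. The dataclasses are local stand-ins that mirror the names above so the snippet runs without zarr installed, `apply_byte_request` is a hypothetical helper, and treating `end` as exclusive is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeByteRequest:  # stand-in for the Zarr type of the same name
    start: int
    end: int  # assumed exclusive


@dataclass(frozen=True)
class OffsetByteRequest:  # stand-in
    offset: int


@dataclass(frozen=True)
class SuffixByteRequest:  # stand-in
    suffix: int


def apply_byte_request(chunk: bytes, request=None) -> bytes:
    """Slice a fully decoded chunk according to the request (illustrative)."""
    if request is None:
        return chunk
    if isinstance(request, RangeByteRequest):
        return chunk[request.start:request.end]
    if isinstance(request, OffsetByteRequest):
        return chunk[request.offset:]
    if isinstance(request, SuffixByteRequest):
        # chunk[-0:] would return everything, so handle zero explicitly.
        return chunk[-request.suffix:] if request.suffix else b""
    raise TypeError(f"Unknown ByteRequest type: {type(request).__name__}")


chunk = bytes(range(10))
print(list(apply_byte_request(chunk, RangeByteRequest(2, 5))))  # [2, 3, 4]
print(list(apply_byte_request(chunk, OffsetByteRequest(7))))    # [7, 8, 9]
print(list(apply_byte_request(chunk, SuffixByteRequest(3))))    # [7, 8, 9]
```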

Comparison with tensogram-xarray

| Feature       | tensogram-zarr          | tensogram-xarray              |
|---------------|-------------------------|-------------------------------|
| API level     | Low-level (Zarr Store)  | High-level (xarray engine)    |
| Dimensions    | Generic (dim_0, dim_1)  | Named (lat, lon, time)        |
| Coordinates   | Not interpreted         | Auto-detected from metadata   |
| Multi-message | One message per store   | Auto-merge into hypercubes    |
| Write support | Yes                     | No                            |
| Data loading  | Eager (all at open)     | Lazy (on-demand decode_range) |

Use tensogram-zarr when you need direct Zarr API access or write support. Use tensogram-xarray when you want automatic coordinate detection and multi-message merging.

Edge cases and limitations

Variable name sanitization

If a metadata value used as a variable name contains / or \, those characters are replaced with _ to prevent spurious directory nesting in the virtual key space. Empty names become _.

mars.param = "temperature/surface"  →  variable name "temperature_surface"

Duplicate variable names

When multiple objects resolve to the same name, suffixes are appended: field, field_1, field_2, etc.
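
The two naming rules in this section (sanitization and duplicate suffixes) can be sketched as follows. Both functions are hypothetical illustrations of the stated behaviour. The sketch does not show the order in which the store applies them, and it ignores the corner case where a generated suffix collides with a name that already exists.

```python
def sanitize_variable_name(raw: str) -> str:
    """Replace path separators with '_' and map the empty name to '_'."""
    cleaned = raw.replace("/", "_").replace("\\", "_")
    return cleaned or "_"


def deduplicate(names: list[str]) -> list[str]:
    """Append _1, _2, ... to the second and later occurrences of a name."""
    seen: dict[str, int] = {}
    result = []
    for name in names:
        count = seen.get(name, 0)
        result.append(name if count == 0 else f"{name}_{count}")
        seen[name] = count + 1
    return result


print(sanitize_variable_name("temperature/surface"))  # temperature_surface
print(deduplicate(["field", "field", "other", "field"]))
# ['field', 'field_1', 'other', 'field_2']
```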

Zero-object messages

A message with no data objects is valid (metadata-only). The store produces a root group with attributes but no arrays.

Single chunk per array

Each TGM data object maps to a Zarr array with chunk_shape == array_shape (one chunk). There is no sub-chunking; partial reads within the array are handled by Zarr’s byte-range support against the single chunk. If a Zarr writer attempts to store multiple chunks for the same variable, a ValueError is raised — TensogramStore does not silently drop extra chunks.

Out-of-range message index

If message_index is greater than or equal to the number of messages in the file (indices are zero-based), an IndexError is raised. Negative indices are rejected with ValueError.
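
A minimal sketch of that validation, assuming zero-based indices. The helper name and the message wording are illustrative, not the library's own.

```python
def check_message_index(index: int, message_count: int) -> int:
    """Validate a zero-based message index (hypothetical helper)."""
    if index < 0:
        raise ValueError(f"message_index must be non-negative, got {index}")
    if index >= message_count:
        raise IndexError(
            f"message_index {index} out of range; file has {message_count} message(s)"
        )
    return index


print(check_message_index(2, 3))  # 2, the last valid index of a 3-message file
```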

bfloat16 dtype

bfloat16 maps to the Zarr data type "bfloat16" but is stored as raw 2-byte values (NumPy dtype <V2), since NumPy has no native bfloat16 type. Use ml_dtypes.bfloat16 to interpret the values.
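
For readers without ml_dtypes at hand, the raw values can also be interpreted with the standard library, because a bfloat16 is the upper 16 bits of an IEEE 754 float32. The sketch below assumes the 2-byte values are little-endian; both helper names are hypothetical.

```python
import struct


def bfloat16_bits_to_float(bits: int) -> float:
    """Widen 16 bfloat16 bits to float32 by appending 16 zero bits."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]


def decode_bfloat16_le(raw: bytes) -> list[float]:
    """Decode a buffer of little-endian 2-byte bfloat16 values."""
    if len(raw) % 2:
        raise ValueError("bfloat16 buffer length must be a multiple of 2")
    bit_patterns = struct.unpack(f"<{len(raw) // 2}H", raw)
    return [bfloat16_bits_to_float(bits) for bits in bit_patterns]


# 0x3F80 is 1.0 and 0xC000 is -2.0 in bfloat16.
print(decode_bfloat16_le(bytes([0x80, 0x3F, 0x00, 0xC0])))  # [1.0, -2.0]
```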

Byte order handling

The read path normalises all chunk data to little-endian (matching the Zarr bytes codec default). The write path respects byte_order from the Zarr codecs metadata — if a big-endian bytes codec is specified, the data is byte-swapped before encoding to TGM.
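
Byte swapping itself is simple: reverse the bytes of each element. The sketch below shows the operation on a raw buffer using only the standard library. `swap_byte_order` is a hypothetical helper, not part of the store's API.

```python
import struct


def swap_byte_order(raw: bytes, itemsize: int) -> bytes:
    """Reverse the bytes of every itemsize-wide element in the buffer."""
    if len(raw) % itemsize:
        raise ValueError("buffer length is not a multiple of itemsize")
    return b"".join(
        raw[i:i + itemsize][::-1] for i in range(0, len(raw), itemsize)
    )


big_endian = struct.pack(">2f", 1.5, -2.0)          # two big-endian float32 values
little_endian = swap_byte_order(big_endian, itemsize=4)
print(struct.unpack("<2f", little_endian))           # (1.5, -2.0)
```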

JSON serialization (RFC 8259)

serialize_zarr_json() converts non-finite float values to their Zarr v3 string sentinels ("NaN", "Infinity", "-Infinity") so the output is valid RFC 8259 JSON.
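
The substitution can be illustrated with a short recursive walk. This is a sketch of the idea and not the body of serialize_zarr_json(); `replace_non_finite` is a hypothetical name. Passing `allow_nan=False` to `json.dumps` makes the standard library raise if any non-finite float slips through.

```python
import json
import math


def replace_non_finite(value):
    """Recursively swap NaN and infinities for their string sentinels."""
    if isinstance(value, float):
        if math.isnan(value):
            return "NaN"
        if math.isinf(value):
            return "Infinity" if value > 0 else "-Infinity"
        return value
    if isinstance(value, dict):
        return {key: replace_non_finite(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [replace_non_finite(item) for item in value]
    return value


doc = {"fill_value": float("nan"), "attributes": {"valid_max": float("inf")}}
print(json.dumps(replace_non_finite(doc), allow_nan=False))
# {"fill_value": "NaN", "attributes": {"valid_max": "Infinity"}}
```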

Write path byte-count validation

When flushing to .tgm, the store validates that chunk byte count matches product(shape) * dtype_size. A mismatch raises ValueError with the expected and actual counts.
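
The check amounts to one multiplication and one comparison. A minimal sketch follows, with a hypothetical helper name and a deliberately partial dtype-size table.

```python
import math

# Item sizes in bytes for a few dtypes (subset, for illustration only).
DTYPE_SIZES = {"uint8": 1, "int16": 2, "float32": 4, "float64": 8}


def validate_chunk_size(name: str, shape: tuple[int, ...], dtype: str, chunk: bytes) -> None:
    """Raise ValueError unless len(chunk) == product(shape) * dtype_size."""
    expected = math.prod(shape) * DTYPE_SIZES[dtype]
    if len(chunk) != expected:
        raise ValueError(
            f"{name}: expected {expected} bytes for shape {shape} "
            f"and dtype {dtype}, got {len(chunk)}"
        )


validate_chunk_size("temperature", (100, 200), "float32", bytes(80_000))  # passes
```

A 100 x 200 float32 array needs 100 * 200 * 4 = 80,000 bytes; any other length raises.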

close() exception safety

If _flush_to_tgm() fails during close(), the store is still marked as closed (_is_open = False). The exception propagates normally — partial writes do not corrupt the file since TGM messages are written atomically.

When used as a context manager and an exception is already in flight, flush errors are logged at WARNING level instead of replacing the original exception.
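
The ordering described above (always mark closed, propagate flush errors unless another exception is already in flight) can be sketched with a stand-in class. `SketchStore` and its `fail_on_flush` switch are hypothetical and exist only to demonstrate the control flow; they are not the real TensogramStore.

```python
import logging

logger = logging.getLogger("sketch_store")


class SketchStore:
    """Hypothetical stand-in showing only the close() / __exit__ ordering."""

    def __init__(self, fail_on_flush: bool = False):
        self._is_open = True
        self._fail_on_flush = fail_on_flush

    def _flush(self) -> None:
        if self._fail_on_flush:
            raise OSError("simulated flush failure")

    def close(self) -> None:
        if not self._is_open:
            return
        try:
            self._flush()
        finally:
            self._is_open = False  # marked closed even if the flush raised

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        try:
            self.close()
        except Exception:
            if exc_type is None:
                raise  # nothing else in flight: surface the flush error
            logger.warning("flush failed while handling %s",
                           exc_type.__name__, exc_info=True)
        return False  # never suppress the original exception
```

With this shape, an exception raised inside the `with` body is the one the caller sees, and the flush failure appears only in the log.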

Error handling

All errors surface with enough context for debugging:

| Scenario                    | Exception   | Message includes                                            |
|-----------------------------|-------------|-------------------------------------------------------------|
| File not found / unreadable | OSError     | File path                                                   |
| Invalid TGM message         | ValueError  | File path + message index                                   |
| Object decode failure       | ValueError  | File path + message index + object index + variable name    |
| Out-of-range message index  | IndexError  | Requested index + available count                           |
| Negative message index      | ValueError  | The invalid index value                                     |
| Invalid mode                | ValueError  | The invalid mode string                                     |
| Empty path                  | ValueError  | The value passed                                            |
| Chunk byte-count mismatch   | ValueError  | Variable name + expected vs actual byte count               |
| Unsupported dtype on write  | ValueError  | Variable name + dtype                                       |
| Invalid JSON in zarr.json   | ValueError  | Byte count + hex preview                                    |
| Unknown ByteRequest type    | TypeError   | The type name                                               |
| Array without chunk data    | WARNING log | Variable name (array skipped)                               |
| No arrays to flush          | WARNING log | File path                                                   |

Errors from the underlying Rust tensogram library are wrapped with Python-level context so users see which file, message, and variable caused the problem.