Zarr v3 Backend
The tensogram-zarr package implements a Zarr v3 Store backed by .tgm files. This lets you read and write Tensogram data through the standard Zarr Python API.
Installation
uv venv .venv && source .venv/bin/activate # if not already in a virtualenv
uv pip install tensogram-zarr
Requires zarr >= 3.0, tensogram, and numpy.
Reading a .tgm file through Zarr
import zarr
from tensogram_zarr import TensogramStore
# Open existing .tgm file as a read-only Zarr store
store = TensogramStore.open_tgm("data.tgm")
root = zarr.open_group(store=store, mode="r")
# Browse available arrays
for name, arr in root.members():
    print(f"{name}: shape={arr.shape}, dtype={arr.dtype}")
# Read an array (decoded eagerly at store open, served from memory)
temperature = root["2t"][:]
print(temperature.shape, temperature.mean())
# Access group-level metadata (from GlobalMetadata _extra_)
# The example below shows a MARS namespace; the attributes dict reflects
# whatever namespaces the producer put in the message's GlobalMetadata.
print(root.attrs["mars"]) # {'class': 'od', 'type': 'fc', ...}
How the mapping works
Each .tgm message maps to a Zarr group:
zarr.json # root group ← GlobalMetadata
temperature/zarr.json # array metadata ← DataObjectDescriptor
temperature/c/0/0 # chunk data ← decoded object payload
pressure/zarr.json # another array
pressure/c/0/0 # its chunk data
graph LR
TGM[".tgm file"] --> GM["GlobalMetadata"]
TGM --> OBJ1["Object 0: temperature"]
TGM --> OBJ2["Object 1: pressure"]
GM --> GZJ["zarr.json (group)"]
OBJ1 --> AZJ1["temperature/zarr.json"]
OBJ1 --> CHK1["temperature/c/0/0"]
OBJ2 --> AZJ2["pressure/zarr.json"]
OBJ2 --> CHK2["pressure/c/0/0"]
Key design decisions:
- Each TGM data object becomes one Zarr array with a single chunk (chunk shape = array shape)
- Variable names are resolved from metadata via a default lookup path (name, mars.param, param, mars.shortName, shortName), or a custom dot-path you supply
- TGM encoding metadata is preserved in Zarr array attributes under _tensogram_* keys
- Duplicate variable names get a numeric suffix (field, field_1)
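The key layout above can be sketched in a few lines. This is an illustration only, not the package's code: the helper name virtual_keys is hypothetical, and the chunk keys assume the Zarr v3 default chunk key encoding with a "/" separator.

```python
def virtual_keys(arrays: dict[str, tuple[int, ...]]) -> list[str]:
    """List the keys a single-chunk-per-array store would expose."""
    keys = ["zarr.json"]  # root group metadata
    for name, shape in arrays.items():
        keys.append(f"{name}/zarr.json")  # array metadata
        # One chunk per array, so every chunk-grid coordinate is 0.
        coords = "/".join("0" for _ in shape)
        # Under the default encoding a zero-dimensional array uses the bare key "c".
        keys.append(f"{name}/c/{coords}" if coords else f"{name}/c")
    return keys


keys = virtual_keys({"temperature": (100, 200), "pressure": (4,)})
```

A 2-D array therefore has exactly one chunk key ending in c/0/0, and a 1-D array has one ending in c/0.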
Variable naming
By default, the store tries these metadata paths to name arrays:
- name
- mars.param
- param
- mars.shortName
- shortName
- Falls back to object_<index>
The lookup runs against a shallow root-key merge of meta.base[i]
and desc.params (matching the xarray backend’s long-standing
behaviour): for each top-level key, meta.base[i] wins if present,
otherwise desc.params fills in. Nested dicts are not deep-merged,
so a base[i] entry with {"mars": {"levtype": "sfc"}} entirely
shadows a desc.params["mars"] with {"param": "2t"}; no mars.param
from the descriptor survives that case. After merging, the priority
chain above is applied to the combined dict, so a higher-priority key
from either source beats a lower-priority one.
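The merge-then-lookup rule described above can be sketched as follows. The function names (lookup, resolve_name) and the paths parameter are hypothetical, and the real store may treat empty or non-string values differently. The sketch only shows the shallow merge and the priority chain.

```python
DEFAULT_PATHS = ("name", "mars.param", "param", "mars.shortName", "shortName")


def lookup(mapping, dot_path):
    """Follow a dot-path through nested dicts; return None if any step is missing."""
    current = mapping
    for part in dot_path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def resolve_name(base_entry, desc_params, index, paths=DEFAULT_PATHS):
    """Shallow-merge the two sources, then apply the priority chain."""
    # Shallow merge: for each top-level key, base_entry wins over desc_params.
    merged = {**desc_params, **base_entry}
    for path in paths:
        value = lookup(merged, path)
        if value is not None:
            return str(value)
    return f"object_{index}"


# base's "mars" dict shadows the descriptor's entirely, so mars.param is lost.
shadowed = resolve_name({"mars": {"levtype": "sfc"}}, {"mars": {"param": "2t"}}, 0)

# A higher-priority key from the descriptor beats a lower-priority key from base.
mixed = resolve_name({"param": "t"}, {"name": "temperature"}, 1)
```

Here shadowed falls back to the object_<index> form, while mixed resolves through name even though base only supplied param.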
You can override with any dot-path, including non-MARS vocabularies:
# Weather pipeline using MARS
store = TensogramStore.open_tgm("weather.tgm", variable_key="mars.param")
# Neuroimaging pipeline using BIDS
store = TensogramStore.open_tgm("scans.tgm", variable_key="bids.task")
# Custom vocabulary
store = TensogramStore.open_tgm("data.tgm", variable_key="product.name")
Common pitfall: name in the descriptor dict
Application metadata belongs in meta["base"][i], not in the descriptor:
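import numpy as np
import tensogram
data = np.zeros((10, 8), dtype=np.float32)  # placeholder payload matching the descriptor below
# ✗ Avoid: triggers a UserWarning; works via fallback but is not canonical
desc = {"type": "ntensor", "shape": [10, 8], "dtype": "float32",
        "name": "temperature"}  # ← goes into desc.params, flagged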
tensogram.encode({}, [(desc, data)])
# ✓ Canonical — the simplest form
meta = { "base": [{"name": "temperature"}]}
desc = {"type": "ntensor", "shape": [10, 8], "dtype": "float32"}
tensogram.encode(meta, [(desc, data)])
The descriptor fallback exists so files produced by the ✗ form still surface the correct names in zarr (and xarray); the write-side warning exists so the mistake is visible at the time it’s made. See issue #67.
Multi-message files
By default the store reads message 0. Select a different message with message_index:
store = TensogramStore.open_tgm("multi.tgm", message_index=2)
Writing a .tgm file through Zarr
import numpy as np
import zarr
from tensogram_zarr import TensogramStore
store = TensogramStore("output.tgm", mode="w")
root = zarr.open_group(store=store, mode="w")
# Create arrays — data is buffered in memory
root.create_array("temperature", data=np.random.rand(100, 200).astype(np.float32))
root.create_array("pressure", data=np.array([1000, 925, 850, 700], dtype=np.float64))
# Close flushes to .tgm
store.close()
The write path assembles all arrays into a single TGM message when the store is closed.
Context manager
with TensogramStore("data.tgm", mode="r") as store:
    root = zarr.open_group(store=store, mode="r")
    data = root["temperature"][:]
# Store automatically closed
Supported data types
| Tensogram dtype | Zarr data_type | NumPy dtype |
|---|---|---|
| float16 | float16 | float16 |
| float32 | float32 | float32 |
| float64 | float64 | float64 |
| int8 | int8 | int8 |
| int16 | int16 | int16 |
| int32 | int32 | int32 |
| int64 | int64 | int64 |
| uint8 | uint8 | uint8 |
| uint16 | uint16 | uint16 |
| uint32 | uint32 | uint32 |
| uint64 | uint64 | uint64 |
| complex64 | complex64 | complex64 |
| complex128 | complex128 | complex128 |
| bitmask | uint8 | uint8 |
Byte range support
The store supports Zarr’s ByteRequest types for efficient partial reads:
- RangeByteRequest(start, end): read a byte range
- OffsetByteRequest(offset): read from offset to end
- SuffixByteRequest(suffix): read last N bytes
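To make the three request types concrete, the sketch below applies each one to an in-memory chunk. The dataclasses are local stand-ins that mirror the field names of Zarr's request types so the example runs without zarr installed. The helper apply_byte_request is hypothetical, and treating end as exclusive is an assumption based on Zarr's documented semantics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeByteRequest:  # stand-in for the Zarr type of the same name
    start: int
    end: int  # assumed exclusive


@dataclass(frozen=True)
class OffsetByteRequest:  # stand-in
    offset: int


@dataclass(frozen=True)
class SuffixByteRequest:  # stand-in
    suffix: int


def apply_byte_request(chunk: bytes, request=None) -> bytes:
    """Return the part of an in-memory chunk selected by a byte request."""
    if request is None:
        return chunk
    if isinstance(request, RangeByteRequest):
        return chunk[request.start:request.end]
    if isinstance(request, OffsetByteRequest):
        return chunk[request.offset:]
    if isinstance(request, SuffixByteRequest):
        # Clamp so a suffix longer than the chunk returns the whole chunk.
        return chunk[max(0, len(chunk) - request.suffix):]
    raise TypeError(f"unknown ByteRequest type: {type(request).__name__}")


chunk = bytes(range(10))
middle = apply_byte_request(chunk, RangeByteRequest(2, 5))
tail = apply_byte_request(chunk, SuffixByteRequest(2))
```

Because every array is decoded into memory when the store opens, serving a partial read amounts to slicing a buffer like this rather than seeking in the file.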
Comparison with tensogram-xarray
| Feature | tensogram-zarr | tensogram-xarray |
|---|---|---|
| API level | Low-level (Zarr Store) | High-level (xarray engine) |
| Dimensions | Generic (dim_0, dim_1) | Named (lat, lon, time) |
| Coordinates | Not interpreted | Auto-detected from metadata |
| Multi-message | One message per store | Auto-merge into hypercubes |
| Write support | Yes | No |
| Data loading | Eager (all at open) | Lazy (on-demand decode_range) |
Use tensogram-zarr when you need direct Zarr API access or write support. Use tensogram-xarray when you want automatic coordinate detection and multi-message merging.
Edge cases and limitations
Variable name sanitization
If a metadata value used as a variable name contains / or \, those characters are replaced with _ to prevent spurious directory nesting in the virtual key space. Empty names become _.
mars.param = "temperature/surface" → variable name "temperature_surface"
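A minimal sketch of this rule, using a hypothetical helper name (sanitize_name) rather than the package's own function:

```python
def sanitize_name(name: str) -> str:
    """Replace path separators so a name cannot create nested keys."""
    cleaned = name.replace("/", "_").replace("\\", "_")
    # An empty name would produce keys like "/zarr.json", so substitute "_".
    return cleaned or "_"


surface = sanitize_name("temperature/surface")
```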
Duplicate variable names
When multiple objects resolve to the same name, suffixes are appended: field, field_1, field_2, etc.
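The suffixing scheme can be sketched as below. The helper deduplicate is hypothetical. It also does not guard against a generated suffix colliding with a name that already exists in the input (for example a literal field_1), and how the real store handles that case is not covered here.

```python
def deduplicate(names: list[str]) -> list[str]:
    """Append _1, _2, ... to repeated names, keeping the first occurrence as-is."""
    seen: dict[str, int] = {}
    result = []
    for name in names:
        count = seen.get(name, 0)
        result.append(name if count == 0 else f"{name}_{count}")
        seen[name] = count + 1
    return result


unique = deduplicate(["field", "field", "other", "field"])
```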
Zero-object messages
A message with no data objects is valid (metadata-only). The store produces a root group with attributes but no arrays.
Single chunk per array
Each TGM data object maps to a Zarr array with chunk_shape == array_shape (one chunk). There is no sub-chunking; partial reads within the array are handled by Zarr’s byte-range support against the single chunk. If a Zarr writer attempts to store multiple chunks for the same variable, a ValueError is raised — TensogramStore does not silently drop extra chunks.
Out-of-range message index
If message_index exceeds the number of messages in the file, an IndexError is raised. Negative indices are rejected with ValueError.
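The two checks can be sketched as a small validator. The function name and message wording are hypothetical. Only the exception types and the reported values follow the behaviour described above.

```python
def check_message_index(index: int, available: int) -> int:
    """Validate a message index against the number of messages in a file."""
    if index < 0:
        # Negative indices are rejected rather than wrapped around.
        raise ValueError(f"message_index must be non-negative, got {index}")
    if index >= available:
        raise IndexError(
            f"message_index {index} out of range: file has {available} message(s)"
        )
    return index
```

Callers that want the last message therefore need to know the message count up front instead of passing -1.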
bfloat16 dtype
bfloat16 maps to Zarr data type "bfloat16" but is stored as raw 2-byte values (<V2 numpy dtype) since numpy has no native bfloat16 type. Use ml_dtypes.bfloat16 for interpretation.
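For readers without ml_dtypes at hand, the raw 2-byte values can also be decoded manually, because a bfloat16 bit pattern is the upper 16 bits of an IEEE 754 float32. The sketch below uses only the standard library and assumes little-endian input, matching the read path's normalisation. The helper name is hypothetical.

```python
import struct


def bfloat16_bytes_to_floats(raw: bytes) -> list[float]:
    """Decode little-endian bfloat16 values by widening each to float32."""
    if len(raw) % 2:
        raise ValueError(f"expected an even number of bytes, got {len(raw)}")
    values = []
    for (bits,) in struct.iter_unpack("<H", raw):
        # Shift the 16 stored bits into the high half of a 32-bit word,
        # then reinterpret that word as a float32.
        widened = struct.pack("<I", bits << 16)
        values.append(struct.unpack("<f", widened)[0])
    return values


# 0x3F80 is 1.0 and 0xC000 is -2.0 in bfloat16 (stored little-endian here).
decoded = bfloat16_bytes_to_floats(bytes([0x80, 0x3F, 0x00, 0xC0]))
```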
Byte order handling
The read path normalises all chunk data to little-endian (matching the Zarr bytes codec default). The write path respects byte_order from the Zarr codecs metadata — if a big-endian bytes codec is specified, the data is byte-swapped before encoding to TGM.
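Byte-swapping a raw chunk amounts to reversing the bytes of each element. The sketch below is a standard-library illustration with a hypothetical helper name, not the store's implementation. Note that for complex dtypes each real and imaginary component has to be swapped separately, so itemsize would be the component size rather than the full element size.

```python
import struct


def swap_byte_order(raw: bytes, itemsize: int) -> bytes:
    """Reverse the bytes of every itemsize-wide element in a buffer."""
    if itemsize <= 0 or len(raw) % itemsize:
        raise ValueError(
            f"buffer of {len(raw)} bytes is not a multiple of itemsize {itemsize}"
        )
    return b"".join(
        raw[i:i + itemsize][::-1] for i in range(0, len(raw), itemsize)
    )


big_endian = struct.pack(">2f", 1.0, 2.0)
little_endian = swap_byte_order(big_endian, itemsize=4)
```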
JSON serialization (RFC 8259)
serialize_zarr_json() converts non-finite float values to their Zarr v3 string sentinels ("NaN", "Infinity", "-Infinity") so the output is valid RFC 8259 JSON.
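The idea can be illustrated with the standard json module. The helpers below (replace_non_finite, dumps_strict) are hypothetical stand-ins and not the source of serialize_zarr_json(). Passing allow_nan=False makes json.dumps raise if any non-finite float slips through, which guards the RFC 8259 guarantee.

```python
import json
import math


def replace_non_finite(value):
    """Recursively swap NaN and infinities for their string sentinels."""
    if isinstance(value, float):
        if math.isnan(value):
            return "NaN"
        if math.isinf(value):
            return "Infinity" if value > 0 else "-Infinity"
        return value
    if isinstance(value, dict):
        return {key: replace_non_finite(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [replace_non_finite(item) for item in value]
    return value


def dumps_strict(document) -> bytes:
    """Serialize to UTF-8 JSON, refusing any remaining non-finite float."""
    return json.dumps(replace_non_finite(document), allow_nan=False).encode("utf-8")


encoded = dumps_strict({"fill_value": float("nan"), "attributes": {"max": float("inf")}})
```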
Write path byte-count validation
When flushing to .tgm, the store validates that chunk byte count matches product(shape) * dtype_size. A mismatch raises ValueError with the expected and actual counts.
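The check itself is simple arithmetic. A sketch with a hypothetical helper name and message format:

```python
import math


def check_chunk_size(name: str, shape: tuple[int, ...], dtype_size: int, chunk: bytes) -> None:
    """Raise ValueError if the chunk length does not match shape and dtype."""
    # math.prod(()) is 1, so a zero-dimensional array expects one element.
    expected = math.prod(shape) * dtype_size
    if len(chunk) != expected:
        raise ValueError(
            f"{name}: expected {expected} bytes for shape {tuple(shape)}, "
            f"got {len(chunk)}"
        )


# A (100, 200) float32 array must arrive as 100 * 200 * 4 = 80000 bytes.
check_chunk_size("temperature", (100, 200), 4, bytes(80000))
```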
close() exception safety
If _flush_to_tgm() fails during close(), the store is still marked as closed (_is_open = False). The exception propagates normally — partial writes do not corrupt the file since TGM messages are written atomically.
When used as a context manager and an exception is already in flight, flush errors are logged at WARNING level instead of replacing the original exception.
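The pattern described in this section can be sketched with a toy class. SketchStore, its fail_on_flush flag, and the logger name are all hypothetical and exist only to simulate a failing flush. The structure of close() and __exit__ is the part that mirrors the behaviour above.

```python
import logging

logger = logging.getLogger("sketch_store")


class SketchStore:
    """Toy store showing close() and context-manager exception handling."""

    def __init__(self, fail_on_flush: bool = False):
        self._is_open = True
        self._fail_on_flush = fail_on_flush

    def _flush(self) -> None:
        if self._fail_on_flush:
            raise OSError("simulated flush failure")

    def close(self) -> None:
        if not self._is_open:
            return
        try:
            self._flush()
        finally:
            # Mark closed even when the flush raised.
            self._is_open = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        if exc_type is None:
            self.close()  # no error in flight: let flush errors propagate
            return False
        try:
            self.close()
        except Exception:
            # Keep the original exception; only log the flush failure.
            logger.warning("flush failed while handling %s", exc_type.__name__)
        return False  # never suppress the in-flight exception
```

With this structure, a KeyError raised inside the with block still reaches the caller as a KeyError even if the flush also fails.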
Error handling
All errors surface with enough context for debugging:
| Scenario | Exception | Message includes |
|---|---|---|
| File not found / unreadable | OSError | File path |
| Invalid TGM message | ValueError | File path + message index |
| Object decode failure | ValueError | File path + message index + object index + variable name |
| Out-of-range message index | IndexError | Requested index + available count |
| Negative message index | ValueError | The invalid index value |
| Invalid mode | ValueError | The invalid mode string |
| Empty path | ValueError | The value passed |
| Chunk byte-count mismatch | ValueError | Variable name + expected vs actual byte count |
| Unsupported dtype on write | ValueError | Variable name + dtype |
| Invalid JSON in zarr.json | ValueError | Byte count + hex preview |
| Unknown ByteRequest type | TypeError | The type name |
| Array without chunk data | WARNING log | Variable name (array skipped) |
| No arrays to flush | WARNING log | File path |
Errors from the underlying Rust tensogram library are wrapped with Python-level context so users see which file, message, and variable caused the problem.
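One way such wrapping can be written is with exception chaining, sketched below. The helper decode_with_context and its message format are hypothetical and not taken from the package. The point is that raise ... from keeps the low-level error attached as __cause__ while the message names the file, message, and variable.

```python
def decode_with_context(decode, path: str, message_index: int, object_index: int, name: str):
    """Call decode() and re-raise any failure with file and variable context."""
    try:
        return decode()
    except Exception as exc:
        raise ValueError(
            f"{path}: failed to decode object {object_index} ({name!r}) "
            f"in message {message_index}: {exc}"
        ) from exc


def failing_decode():
    # Stand-in for an error raised by the underlying library.
    raise RuntimeError("unexpected end of payload")


try:
    decode_with_context(failing_decode, "data.tgm", 0, 1, "pressure")
except ValueError as error:
    wrapped = error
```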