The Encoding Pipeline
Every object payload passes through a three-stage pipeline on the way in (encoding) and out (decoding). The stages always run in the same order:
```mermaid
flowchart TD
    subgraph Encode["Encode Path"]
        direction TB
        A["Raw bytes"]
        B["Stage 1: Encoding<br/>(lossy quantization)"]
        C["Stage 2: Filter<br/>(byte shuffle)"]
        D["Stage 3: Compression<br/>(szip / zstd / lz4 / blosc2 / zfp / sz3)"]
        A --> B --> C --> D
    end
    S[("Stored bytes")]
    subgraph Decode["Decode Path"]
        direction TB
        F["Stage 3: Decompress"]
        G["Stage 2: Unshuffle"]
        H["Stage 1: Dequantize"]
        I["Raw bytes"]
        F --> G --> H --> I
    end
    D --> S --> F
    style A fill:#e8f5e9,stroke:#388e3c
    style S fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    style I fill:#e8f5e9,stroke:#388e3c
    style Encode fill:#e3f2fd,stroke:#1565c0,color:#1565c0
    style Decode fill:#fce4ec,stroke:#c62828,color:#c62828
```
Each stage is independently configurable per object via fields in the DataObjectDescriptor. Set a stage to "none" to skip it. For callers with already-encoded payloads, a pipeline-bypass option exists via encode_pre_encoded (see Pre-encoded Payloads).
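The ordering rule (stages run 1, 2, 3 on encode and 3, 2, 1 on decode) can be illustrated with a small sketch. The `Stage` trait, the `Pipeline` struct, and the toy `Tag` and `Reverse` stages are hypothetical names invented for this illustration; they are not part of the Tensogram API. `PassThrough` stands in for a stage set to "none".

```rust
/// One reversible pipeline stage: `forward` runs on encode, `inverse` on decode.
trait Stage {
    fn forward(&self, data: Vec<u8>) -> Vec<u8>;
    fn inverse(&self, data: Vec<u8>) -> Vec<u8>;
}

/// Stand-in for a stage set to "none": bytes pass through unchanged.
struct PassThrough;
impl Stage for PassThrough {
    fn forward(&self, data: Vec<u8>) -> Vec<u8> { data }
    fn inverse(&self, data: Vec<u8>) -> Vec<u8> { data }
}

/// Toy stage: prepends a tag byte on encode and strips it on decode.
struct Tag(u8);
impl Stage for Tag {
    fn forward(&self, mut data: Vec<u8>) -> Vec<u8> {
        data.insert(0, self.0);
        data
    }
    fn inverse(&self, mut data: Vec<u8>) -> Vec<u8> {
        assert_eq!(data.first(), Some(&self.0), "tag byte missing");
        data.remove(0);
        data
    }
}

/// Toy stage: reverses the buffer (it is its own inverse).
struct Reverse;
impl Stage for Reverse {
    fn forward(&self, mut data: Vec<u8>) -> Vec<u8> {
        data.reverse();
        data
    }
    fn inverse(&self, data: Vec<u8>) -> Vec<u8> { self.forward(data) }
}

struct Pipeline {
    stages: Vec<Box<dyn Stage>>,
}

impl Pipeline {
    /// Encode: apply stages first to last.
    fn encode(&self, raw: Vec<u8>) -> Vec<u8> {
        self.stages.iter().fold(raw, |data, s| s.forward(data))
    }
    /// Decode: undo stages last to first.
    fn decode(&self, stored: Vec<u8>) -> Vec<u8> {
        self.stages.iter().rev().fold(stored, |data, s| s.inverse(data))
    }
}
```

With the stages `Tag(0xAA)`, `PassThrough`, `Reverse`, encoding `[1, 2, 3]` yields `[3, 2, 1, 0xAA]`. Decoding only recovers the input because it undoes `Reverse` before stripping the tag, which is why the decode path mirrors the encode path.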
Stage 1: Encoding
Encoding transforms values to reduce the number of bits needed to represent them. The only encoding currently supported is simple_packing, a lossy quantization that maps a bounded range of floating-point values onto N-bit integers. The bit layout matches GRIB 2 simple_packing, so quantized payloads are interoperable with existing GRIB tooling.
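To make the quantization idea concrete, here is a minimal sketch of simple-packing-style arithmetic. It is not the Tensogram implementation: the `Packed` struct and the `pack` and `unpack` functions are hypothetical, the decimal scale factor that GRIB 2 also defines is assumed to be zero, and the codes are kept as `u64` values rather than bit-packed. NaN handling is out of scope.

```rust
/// Quantized block: value ≈ reference_value + code * 2^binary_scale.
struct Packed {
    reference_value: f64,
    binary_scale: i32,
    bits_per_value: u32,
    codes: Vec<u64>,
}

fn pack(values: &[f64], bits_per_value: u32) -> Packed {
    assert!(!values.is_empty(), "need at least one value");
    assert!((1..=32).contains(&bits_per_value), "sketch supports 1..=32 bits");

    // The reference value is the minimum, so every code is non-negative.
    let min = values.iter().copied().fold(f64::INFINITY, f64::min);
    let max = values.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    let max_code = ((1u64 << bits_per_value) - 1) as f64;

    // Smallest power-of-two step for which the full range fits in max_code.
    let range = max - min;
    let binary_scale = if range > 0.0 {
        (range / max_code).log2().ceil() as i32
    } else {
        0
    };
    let step = 2f64.powi(binary_scale);

    let codes = values
        .iter()
        .map(|v| ((v - min) / step).round().min(max_code) as u64)
        .collect();

    Packed { reference_value: min, binary_scale, bits_per_value, codes }
}

fn unpack(p: &Packed) -> Vec<f64> {
    let step = 2f64.powi(p.binary_scale);
    p.codes
        .iter()
        .map(|&c| p.reference_value + c as f64 * step)
        .collect()
}
```

Under these assumptions the round-trip error per value is at most half a step, that is `2^binary_scale / 2`. This is the sense in which the encoding is lossy but bounded.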
| Value | Meaning |
|---|---|
| "none" | Pass through unchanged |
| "simple_packing" | Lossy quantization (see Simple Packing) |
Stage 2: Filter
Filters rearrange bytes to improve compression ratios. The shuffle filter reorders bytes by their significance level (all most-significant bytes first, then all second-most-significant bytes, etc.), which makes float data much more compressible because nearby values have similar high bytes.
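The byte reordering described above can be sketched as a transpose over byte positions. The `shuffle` and `unshuffle` functions are illustrative helpers, not the library's API, and they assume the buffer length is a multiple of the element size.

```rust
/// Groups byte 0 of every element first, then byte 1 of every element, and so on.
/// For big-endian data, byte 0 is the most significant byte.
fn shuffle(data: &[u8], elem_size: usize) -> Vec<u8> {
    assert!(elem_size > 0 && data.len() % elem_size == 0);
    let n = data.len() / elem_size; // number of elements
    let mut out = vec![0u8; data.len()];
    for i in 0..n {
        for j in 0..elem_size {
            out[j * n + i] = data[i * elem_size + j];
        }
    }
    out
}

/// Inverse of `shuffle`: restores the original element-by-element layout.
fn unshuffle(data: &[u8], elem_size: usize) -> Vec<u8> {
    assert!(elem_size > 0 && data.len() % elem_size == 0);
    let n = data.len() / elem_size;
    let mut out = vec![0u8; data.len()];
    for i in 0..n {
        for j in 0..elem_size {
            out[i * elem_size + j] = data[j * n + i];
        }
    }
    out
}
```

For two 4-byte elements `[1, 2, 3, 4]` and `[5, 6, 7, 8]`, `shuffle` produces `[1, 5, 2, 6, 3, 7, 4, 8]`. With real float data the leading byte planes tend to contain long runs of similar values, which is what a downstream compressor exploits.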
| Value | Meaning |
|---|---|
| "none" | Pass through unchanged |
| "shuffle" | Byte-level shuffle (see Byte Shuffle Filter) |
Stage 3: Compression
Compression reduces the total byte count. The compression field accepts seven values: six compressors plus a "none" pass-through:
| Value | Type | Random Access | Notes |
|---|---|---|---|
| "none" | Pass-through | Yes | No compression |
| "szip" | Lossless | Yes | CCSDS 121.0-B-3 via libaec |
| "zstd" | Lossless | No | Excellent ratio/speed tradeoff |
| "lz4" | Lossless | No | Fastest decompression |
| "blosc2" | Lossless | Yes | Multi-codec, chunk-level access |
| "zfp" | Lossy | Yes (fixed-rate) | Floating-point arrays |
| "sz3" | Lossy | No | Error-bounded scientific data |
See Compression for full details on each compressor, including parameters and random access support.
Note: ZFP and SZ3 operate directly on typed floating-point data. Use them with encoding: "none" and filter: "none"; they replace both encoding and compression.
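The allowed values and the ZFP/SZ3 rule can be checked up front before building a descriptor. The `validate_pipeline` helper below is hypothetical and only encodes what the tables and the note in this section state. Treating the ZFP/SZ3 guidance as a hard error is an assumption of this sketch; the library itself may behave differently.

```rust
/// Checks a (encoding, filter, compression) triple against the values listed
/// in this section. Hypothetical helper, not part of the Tensogram API.
fn validate_pipeline(encoding: &str, filter: &str, compression: &str) -> Result<(), String> {
    const ENCODINGS: [&str; 2] = ["none", "simple_packing"];
    const FILTERS: [&str; 2] = ["none", "shuffle"];
    const COMPRESSORS: [&str; 7] = ["none", "szip", "zstd", "lz4", "blosc2", "zfp", "sz3"];

    if !ENCODINGS.contains(&encoding) {
        return Err(format!("unknown encoding {encoding:?}"));
    }
    if !FILTERS.contains(&filter) {
        return Err(format!("unknown filter {filter:?}"));
    }
    if !COMPRESSORS.contains(&compression) {
        return Err(format!("unknown compression {compression:?}"));
    }

    // ZFP and SZ3 work on typed floats, so earlier stages must be skipped.
    let typed_float_codec = matches!(compression, "zfp" | "sz3");
    if typed_float_codec && (encoding != "none" || filter != "none") {
        return Err(format!(
            "{compression} expects encoding and filter to be \"none\""
        ));
    }
    Ok(())
}
```

For example, `("simple_packing", "none", "szip")` passes, while `("simple_packing", "none", "zfp")` is rejected because ZFP would receive quantized integers instead of floats.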
Typical Combinations
| Use case | encoding | filter | compression |
|---|---|---|---|
| Exact integers (e.g. a mask) | none | none | none |
| Lossy bounded-range floats | simple_packing | none | szip |
| Best lossless (floats) | none | shuffle | szip or blosc2 |
| GRIB 2 CCSDS-interoperable | simple_packing | none | szip |
| Real-time streaming | none | none | lz4 |
| Archival storage | none | shuffle | zstd |
| ML model weights | none | none | blosc2 |
| Lossy float w/ random access | none | none | zfp (fixed_rate) |
| Error-bounded science | none | none | sz3 |
How It Looks in Code
The entire pipeline is configured through the DataObjectDescriptor:
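```rust
// Fragment: assumes DataObjectDescriptor, Dtype, ByteOrder, and Value are in scope.
use std::collections::BTreeMap;

let descriptor = DataObjectDescriptor {
    obj_type: "ntensor".into(),
    ndim: 2,
    shape: vec![721, 1440],
    strides: vec![1440, 1], // row-major, counted in elements
    dtype: Dtype::Float32,
    byte_order: ByteOrder::Big,
    // Pipeline stages: quantize, skip the filter, then compress.
    encoding: "simple_packing".into(),
    filter: "none".into(),
    compression: "szip".into(),
    masks: None,
    // Stage parameters; the encoder adds more entries during encoding.
    params: BTreeMap::from([
        ("reference_value".into(), Value::Float(230.5)),
        ("bits_per_value".into(), Value::Integer(16.into())),
    ]),
};
```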
All encoding parameters (reference_value, bits_per_value, szip_block_offsets, etc.) go into the params map. The encoder populates additional params during encoding (like block offsets for szip), and the decoder reads them back.
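The write-then-read-back flow of the params map can be sketched as follows. Everything here is a stand-in: `ParamValue` is a hypothetical substitute for the library's `Value` type, the two helper functions are invented for illustration, and storing szip_block_offsets as an array of integers is an assumption, since this section does not specify the value's layout.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the library's `Value` type.
#[derive(Debug, Clone, PartialEq)]
enum ParamValue {
    Integer(i64),
    Float(f64),
    Array(Vec<ParamValue>),
}

type Params = BTreeMap<String, ParamValue>;

/// Encoder side: record offsets produced during compression.
fn record_block_offsets(params: &mut Params, offsets: &[u64]) {
    let items = offsets
        .iter()
        .map(|&o| ParamValue::Integer(o as i64))
        .collect();
    params.insert("szip_block_offsets".to_string(), ParamValue::Array(items));
}

/// Decoder side: read the offsets back, or None if absent or malformed.
fn read_block_offsets(params: &Params) -> Option<Vec<u64>> {
    match params.get("szip_block_offsets")? {
        ParamValue::Array(items) => items
            .iter()
            .map(|v| match v {
                ParamValue::Integer(n) => u64::try_from(*n).ok(),
                _ => None,
            })
            .collect(),
        _ => None,
    }
}
```

The caller-supplied entries (reference_value, bits_per_value) and the encoder-populated ones live side by side in the same map, so the decoder needs only the descriptor to reverse every stage.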
Integrity Hashing
Every frame ends with an inline 8-byte hash slot followed by the ENDF marker. For data object frames, the slot lives at frame_end − 12, and the hash covers the frame body (payload + any mask blobs + CBOR descriptor). Populating the slot is controlled message-wide via the HASHES_PRESENT preamble flag, set by EncodeOptions.hash_algorithm = Some(HashAlgorithm::Xxh3) (the default).
To verify integrity after decoding, run tensogram validate --checksum. The validator walks every frame and recomputes the xxh3-64 digest against the stored slot without parsing CBOR on the fast path.
| Algorithm | Hash length | Notes |
|---|---|---|
| xxh3 | 8-byte raw / 16 hex chars (64-bit) | Default. Fast, non-cryptographic |
Edge case: The hash covers the frame body only. The header, cbor_offset, the hash slot itself, and the ENDF marker are not part of the hashed region.
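Locating the hash slot from the end of a frame can be sketched as below. The `hash_slot` and `to_hex` helpers are hypothetical. The sketch assumes the ENDF marker is the four ASCII bytes "ENDF", which is consistent with the slot sitting at frame_end − 12 (8 hash bytes plus a 4-byte marker) but is not confirmed by this section. Computing the xxh3-64 digest itself would need an external crate and is left out.

```rust
/// Assumed marker bytes; the section names the marker but not its encoding.
const ENDF: &[u8; 4] = b"ENDF";
const HASH_LEN: usize = 8;

/// Returns the raw 8-byte hash slot located 12 bytes before the end of the
/// frame, or None if the frame is too short or does not end with the marker.
fn hash_slot(frame: &[u8]) -> Option<[u8; 8]> {
    let trailer = HASH_LEN + ENDF.len(); // 8 + 4 = 12
    if frame.len() < trailer || !frame.ends_with(ENDF) {
        return None;
    }
    let start = frame.len() - trailer;
    frame[start..start + HASH_LEN].try_into().ok()
}

/// Renders the raw slot as the 16-character hex form.
fn to_hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{b:02x}")).collect()
}
```

A verifier would compare this slot with a digest recomputed over the frame body only, excluding the header, cbor_offset, the slot, and the marker.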