Pre-Encoded Data API (Advanced)
When to use this API
The encode_pre_encoded API is for advanced callers whose data is already
encoded by an external pipeline (e.g., a GPU kernel that emits packed bytes,
or a streaming receiver passing payloads through). It bypasses Tensogram’s
internal encoding pipeline and uses the supplied bytes verbatim.
Do NOT use this API for ordinary encoding. Use encode() instead.
⚠️ The bit-vs-byte trap
WARNING: When using
compression="szip", theszip_block_offsetsparameter contains bit offsets, not byte offsets. The first offset must be 0 and every offset must satisfyoffset <= encoded_bytes_len * 8. This matches the libaec/szip wire format. See cbor-metadata.md for the format reference.Getting this wrong is the #1 caller mistake. Tensogram validates the offsets structurally (monotonicity, bounds) but cannot detect a byte-instead-of-bit mistake until decode_range fails.
API surface
Rust
#![allow(unused)]
fn main() {
pub fn encode_pre_encoded(
metadata: &GlobalMetadata,
descriptors_and_data: &[(&DataObjectDescriptor, &[u8])],
options: &EncodeOptions,
) -> Result<Vec<u8>, TensogramError>
}
Python
import tensogram
msg: bytes = tensogram.encode_pre_encoded(
global_meta_dict={"version": 2},
descriptors_and_data=[(descriptor_dict, raw_bytes)],
hash="xxh3",
)
C
tgm_error tgm_encode_pre_encoded(
const char *metadata_json,
const uint8_t *const *data_ptrs,
const size_t *data_lens,
size_t num_objects,
const char *hash_algo,
tgm_bytes_t *out
);
C++
std::vector<std::uint8_t> tensogram::encode_pre_encoded(
const std::string& metadata_json,
const std::vector<std::pair<const std::uint8_t*, std::size_t>>& objects,
const encode_options& opts = {}
);
Hash semantics
The library always recomputes the hash of the pre-encoded bytes using
the algorithm specified in EncodeOptions.hash_algorithm (default xxh3). Any hash
the caller stored on the descriptor is silently overwritten. This guarantees
the wire format invariant descriptor.hash == hash_algo(bytes) always holds.
Provenance semantics
The encoded message is byte-format-indistinguishable from one produced by
encode(). The decoder cannot tell which API produced it. The provenance
fields _reserved_.encoder.name, _reserved_.time, and _reserved_.uuid
are populated identically.
Self-consistency checks
Before encoding, the library validates:
- Caller has not set
EncodeOptions.emit_preceders(rejected). - Caller has not put
_reserved_in their metadata (rejected). - Each descriptor passes the standard
validate_objectchecks. - If
compression="szip"andszip_block_offsetsis supplied:- It’s a CBOR Array of u64.
- First offset is 0.
- Strictly monotonically increasing.
- All bit offsets
<= bytes_len * 8.
- If
szip_block_offsetsis supplied butcompression != "szip", rejected.
These are structural checks only. The library does NOT trial-decode the bytes to verify they actually decode correctly.
Limitation: encoding=“none” size check
When encoding="none", the validate_object check enforces
payload_len == shape_product * dtype_byte_width. This means you cannot pass
compression-only payloads (e.g., zstd-compressed raw bytes) with
encoding="none" because the compressed size will not match the expected raw
size. Wrap such payloads in at least simple_packing or another encoding.
Worked example: simple_packing + szip with decode_range
#![allow(unused)]
fn main() {
use tensogram::{
encode_pre_encoded, DataObjectDescriptor, EncodeOptions,
GlobalMetadata, ByteOrder, Dtype,
};
use std::collections::BTreeMap;
use ciborium::Value;
// Pre-encoded bytes from a GPU kernel + szip block offsets in BITS
let pre_encoded_bytes: Vec<u8> = /* from GPU */;
let szip_offsets_bits: Vec<u64> = vec![0, 8192, 16384, /* ... */];
let mut params: BTreeMap<String, ciborium::Value> = BTreeMap::new();
params.insert("bits_per_value".into(), Value::Integer(24u64.into()));
params.insert("reference_value".into(), Value::Float(0.0));
params.insert("binary_scale_factor".into(), Value::Integer((-10i64).into()));
params.insert("decimal_scale_factor".into(), Value::Integer(0i64.into()));
params.insert("szip_rsi".into(), Value::Integer(128i64.into()));
params.insert("szip_block_size".into(), Value::Integer(16i64.into()));
params.insert("szip_flags".into(), Value::Integer(8i64.into()));
params.insert("szip_block_offsets".into(),
Value::Array(szip_offsets_bits.into_iter()
.map(|o| Value::Integer(o.into()))
.collect()));
let desc = DataObjectDescriptor {
obj_type: "ntensor".into(),
ndim: 2,
shape: vec![1024, 1024],
strides: vec![1024, 1],
dtype: Dtype::Float32,
byte_order: ByteOrder::Big,
encoding: "simple_packing".into(),
filter: "none".into(),
compression: "szip".into(),
params,
hash: None,
};
let msg = encode_pre_encoded(
&GlobalMetadata::default(),
&[(&desc, &pre_encoded_bytes)],
&EncodeOptions::default(),
)?;
// decode_range works because szip_block_offsets is present.
}
How it works
flowchart TD
subgraph pre["encode_pre_encoded path"]
A[Caller bytes] --> B[validate_object]
B --> C[validate_szip_block_offsets]
C --> D[Recompute hash]
end
subgraph normal["encode path"]
G[Caller bytes] --> H[Run encoding pipeline]
H --> D
end
D --> E[Wrap in CBOR framing]
E --> F[Wire message]
The pre-encoded path skips the pipeline entirely. The wire format is identical.
Byte order
When using encoding="none", the caller’s bytes are stored verbatim — the
library does NOT validate or flip byte order on encode. The bytes must be in
the byte order declared in the descriptor’s byte_order field.
For example, if byte_order="big" and encoding="none", the caller must
provide big-endian bytes.
On decode, the library automatically converts to native byte order by
default (native_byte_order=true). Callers can use from_ne_bytes() or
data_as<T>() directly without worrying about which byte order was used on
the wire. Set native_byte_order=false to get the raw wire-order bytes.
Streaming API
StreamingEncoder::write_object_pre_encoded() is the streaming counterpart of
encode_pre_encoded(). It writes a single pre-encoded object to the stream.
It can be interleaved freely with write_object() (normal encode) calls.
Rust
#![allow(unused)]
fn main() {
let mut enc = StreamingEncoder::new(output, &metadata, &options)?;
enc.write_object_pre_encoded(&descriptor, &pre_encoded_bytes)?;
enc.finish()?;
}
Python
enc = tensogram.StreamingEncoder({"version": 2})
enc.write_object_pre_encoded(descriptor_dict, raw_bytes)
msg = enc.finish()
C++
tensogram::streaming_encoder enc(path, metadata_json);
enc.write_object_pre_encoded(descriptor_json, data_ptr, data_len);
enc.finish();
Error reference
encode_pre_encoded can raise the following errors:
| Error condition | Message contains |
|---|---|
obj_type is empty | "obj_type must not be empty" |
ndim doesn’t match shape.len() | "ndim … does not match shape.len()" |
strides.len() doesn’t match shape.len() | "strides.len() … does not match shape.len()" |
encoding="none" and data size wrong | "data_len … does not match expected … bytes" |
emit_preceders=true in buffered mode | "emit_preceders is not supported" |
Caller set _reserved_ in metadata | "_reserved_" |
szip_block_offsets not starting at 0 | "first offset must be 0" |
szip_block_offsets not strictly increasing | "strictly increasing" |
szip_block_offsets exceeds bit bound | "exceeds … bit bound" |
szip_block_offsets with non-szip compression | "szip_block_offsets provided but compression" |
| Unknown encoding string | "encoding" |
| Unknown dtype | "unknown dtype" |
Strides convention
The library treats strides as opaque metadata — it only validates that
strides.len() == shape.len(). The convention differs between language bindings:
- Rust tests use element strides (e.g.,
[1]for 1D,[5, 1]for shape[4, 5]) - C++ tests use byte strides (e.g.,
[4]for float32,[12, 4]for shape[2, 3]float32)
Both conventions work correctly since the library does not interpret stride values.
Cross-references
- Encoding — the normal
encode()API - Decoding —
decode_rangerequirements for partial reads - Compression — szip details
- CBOR metadata — wire format reference