NetCDF Import

Tensogram ships tensogram-netcdf, a dedicated crate for importing NetCDF (both Classic and NetCDF-4) files into Tensogram messages. NetCDF is widely used in climate, ocean, atmospheric, and Earth-observation science, but the importer treats any NetCDF file the same way — the mapping is structural, not domain-specific.

The crate is exposed through the CLI as tensogram convert-netcdf and through a thin Rust library API. Conversion is one-way: NetCDF → Tensogram. There is no Tensogram → NetCDF writer.

System requirement

The NetCDF C library must be installed on your system:

brew install netcdf            # macOS
apt install libnetcdf-dev      # Debian/Ubuntu

The crate transitively pulls in HDF5 (used internally by NetCDF-4 files), so on Debian-family distros you also want libhdf5-dev.

Building

The tensogram-netcdf crate is excluded from the default workspace build to avoid forcing libnetcdf on every contributor. Build it explicitly:

# Library
cargo build --manifest-path rust/tensogram-netcdf/Cargo.toml

# CLI with NetCDF support
cargo build -p tensogram-cli --features netcdf

The binary then exposes the new subcommand:

tensogram convert-netcdf --help

Quick example

# Convert one file
tensogram convert-netcdf input.nc -o output.tgm

# Convert multiple files into a single output
tensogram convert-netcdf jan.nc feb.nc mar.nc -o q1.tgm

# Stream to stdout (useful for piping)
tensogram convert-netcdf input.nc | tensogram info /dev/stdin

Command-line options

Flag	Default	Description
`-o`, `--output PATH`	stdout	Where to write the Tensogram file.
`--split-by MODE`	`file`	Grouping mode: `file`, `variable`, or `record`. See Splitting modes.
`--cf`	off	Extract the CF attribute allow-list into `base[i]["cf"]`. See CF metadata mapping.
`--encoding ENC`	`none`	`none` or `simple_packing`.
`--bits N`	auto (16)	Bits per value for `simple_packing` (1–64).
`--filter FILTER`	`none`	`none` or `shuffle`.
`--compression CODEC`	`none`	`none`, `zstd`, `lz4`, `blosc2`, or `szip`.
`--compression-level N`	codec default	Level for `zstd` (1–22) and `blosc2` (0–9).

The --encoding/--bits/--filter/--compression/--compression-level flags are the same set used by tensogram convert-grib. Both importers share a PipelineArgs struct so the two commands stay symmetric.

How variables become objects

Each numeric NetCDF variable in the root group is mapped 1:1 to a Tensogram data object. The variable’s name is stored under base[i]["name"], the dtype and shape come from the NetCDF type and dimension list, and the raw bytes become the object payload (always little-endian).
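
As a rough illustration of what "always little-endian" means for the payload, the sketch below serializes `Float32` values with the standard library's `to_le_bytes`. The function name `f32_payload_le` is hypothetical and is not part of the tensogram-netcdf API; it only shows the byte layout a consumer should expect.

```rust
/// Serialize f32 values into a little-endian byte payload.
/// Hypothetical helper for illustration; not part of tensogram-netcdf.
fn f32_payload_le(values: &[f32]) -> Vec<u8> {
    let mut out = Vec::with_capacity(values.len() * 4);
    for v in values {
        // to_le_bytes yields the least-significant byte first on every host.
        out.extend_from_slice(&v.to_le_bytes());
    }
    out
}
```

Because the byte order is fixed explicitly, the payload is the same whether the conversion runs on a little-endian or a big-endian machine.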

Dtype matrix

NetCDF type	Tensogram `Dtype`
`byte`	`Int8`
`ubyte`	`Uint8`
`short`	`Int16`
`ushort`	`Uint16`
`int`	`Int32`
`uint`	`Uint32`
`int64`	`Int64`
`uint64`	`Uint64`
`float`	`Float32`
`double`	`Float64`

char and string variables, as well as the NetCDF-4 enhanced types (compound, vlen, enum, opaque), are skipped with a warning. They have no clean tensor representation.
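
The matrix above can be read as a simple lookup. The sketch below uses a local stand-in `Dtype` enum (an assumption for illustration, not the real Tensogram type) and returns `None` for the types that the importer skips.

```rust
/// Local stand-in for the Tensogram dtype enum; illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Dtype {
    Int8, Uint8, Int16, Uint16, Int32, Uint32, Int64, Uint64, Float32, Float64,
}

/// Map a NetCDF type name to a dtype, or None for unsupported types
/// (char, string, compound, vlen, enum, opaque).
fn dtype_for(nc_type: &str) -> Option<Dtype> {
    Some(match nc_type {
        "byte" => Dtype::Int8,
        "ubyte" => Dtype::Uint8,
        "short" => Dtype::Int16,
        "ushort" => Dtype::Uint16,
        "int" => Dtype::Int32,
        "uint" => Dtype::Uint32,
        "int64" => Dtype::Int64,
        "uint64" => Dtype::Uint64,
        "float" => Dtype::Float32,
        "double" => Dtype::Float64,
        _ => return None,
    })
}

/// Width in bytes of one element of the given dtype.
fn byte_width(dtype: Dtype) -> usize {
    match dtype {
        Dtype::Int8 | Dtype::Uint8 => 1,
        Dtype::Int16 | Dtype::Uint16 => 2,
        Dtype::Int32 | Dtype::Uint32 | Dtype::Float32 => 4,
        Dtype::Int64 | Dtype::Uint64 | Dtype::Float64 => 8,
    }
}
```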

Scalar variables

A NetCDF scalar (zero dimensions) becomes an object with ndim = 0, shape = [], and a single value in the payload.
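
The scalar case falls out of the usual element-count rule: the product over an empty shape is 1. A minimal sketch, with `payload_len` as a hypothetical helper name:

```rust
/// Payload size in bytes for a dense array with the given shape.
/// Hypothetical helper; an empty shape (ndim = 0) yields one element
/// because the product of an empty sequence is 1.
fn payload_len(shape: &[usize], elem_size: usize) -> usize {
    shape.iter().product::<usize>() * elem_size
}
```

For example, a `Float64` scalar has `shape = []` and therefore an 8-byte payload.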

Packed data

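Variables with scale_factor or add_offset attributes are unpacked during conversion: the raw integer values are read and converted as unpacked = raw × scale_factor + add_offset, and the result is stored as Float64 regardless of the on-disk dtype. When only one of the two attributes is present, the CF convention treats the other as its identity value (scale_factor = 1, add_offset = 0). This matches the convention used by xarray and most netCDF tooling.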

The fill value (_FillValue or missing_value) is replaced with NaN in the unpacked output. The original sentinel is preserved under base[i]["netcdf"]["_FillValue"] so consumers can recover it.
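
The unpacking rule can be sketched as follows for a variable stored as `short` on disk. This is an illustration of the convention described above, not the importer's actual code; the function name `unpack` and its signature are assumptions.

```rust
/// Unpack raw i16 values: fill sentinel becomes NaN, everything else
/// becomes raw * scale_factor + add_offset as f64.
/// Illustrative sketch; not the tensogram-netcdf implementation.
fn unpack(raw: &[i16], scale_factor: f64, add_offset: f64, fill_value: Option<i16>) -> Vec<f64> {
    raw.iter()
        .map(|&r| {
            if Some(r) == fill_value {
                // The sentinel marks missing data, so it is not scaled.
                f64::NAN
            } else {
                f64::from(r) * scale_factor + add_offset
            }
        })
        .collect()
}
```

With `scale_factor = 0.01`, `add_offset = 273.15`, and `_FillValue = -32768`, a raw value of 0 unpacks to 273.15 and a raw value of -32768 unpacks to NaN.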

Time coordinates

Time coordinate variables are stored as numeric values (typically Float64) exactly as they appear in the file — Tensogram does not convert them to calendar dates. The CF units string ("days since 1970-01-01") and calendar ("gregorian", "noleap", etc.) are preserved under base[i]["netcdf"] so a consumer can decode them on demand.
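
A consumer that wants to decode a time value typically starts by splitting the preserved units string into a unit and a reference instant. The sketch below handles only a few fixed-length units and deliberately ignores the calendar attribute, so it is a starting point rather than a full CF time decoder; both function names are hypothetical.

```rust
/// Split a CF time units string such as "days since 1970-01-01" into
/// (seconds per unit, reference string). Returns None for anything else.
/// Sketch only: calendar handling (noleap, 360_day, ...) is out of scope.
fn parse_time_units(units: &str) -> Option<(f64, &str)> {
    let (unit, reference) = units.split_once(" since ")?;
    let seconds_per_unit = match unit.trim() {
        "seconds" | "second" => 1.0,
        "minutes" | "minute" => 60.0,
        "hours" | "hour" => 3600.0,
        "days" | "day" => 86400.0,
        _ => return None,
    };
    Some((seconds_per_unit, reference.trim()))
}

/// Convert a stored time value to seconds elapsed since the reference instant.
fn offset_seconds(value: f64, units: &str) -> Option<f64> {
    let (seconds_per_unit, _reference) = parse_time_units(units)?;
    Some(value * seconds_per_unit)
}
```

Turning the offset into a calendar date still requires interpreting the reference string under the stored calendar, which is best left to a dedicated date library.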

NetCDF-4 groups

Tensogram extracts only the root group of a NetCDF-4 file. If sub-groups are detected the importer prints a warning to stderr and continues with the root variables. Sub-group support is intentionally out of scope for this release — most operational datasets keep their data variables at the root anyway.

Splitting modes

The --split-by flag controls how variables are grouped into Tensogram messages.

`--split-by=file` (default)

All variables from one input file are bundled into a single Tensogram message containing N data objects. This is the most compact representation and is the right choice when you want to keep a NetCDF file as a single logical unit.

tensogram convert-netcdf forecast.nc -o forecast.tgm
# 1 message with N objects

`--split-by=variable`

Each variable becomes its own one-object Tensogram message. Useful when downstream consumers want to fetch individual variables without decoding the whole file.

tensogram convert-netcdf forecast.nc -o forecast.tgm --split-by variable
# N messages with 1 object each

`--split-by=record`

Splits along the unlimited (record) dimension: each index along that dimension produces a separate message. The unlimited dimension is detected automatically; using this mode on a file without one is a hard error (NoUnlimitedDimension).

Variables that don’t depend on the unlimited dimension (e.g. a static mask variable) are still included in every output message — that way each record is fully self-describing.

tensogram convert-netcdf timeseries.nc -o timeseries.tgm --split-by record
# 1 message per record
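
Conceptually, record splitting slices each record-dependent variable along its record axis. The sketch below assumes row-major data with the unlimited dimension as the leading axis (which is how NetCDF Classic lays out record variables); `split_records` is a hypothetical name, and the real importer may organise this differently.

```rust
/// Split a row-major array into one chunk per record, assuming the
/// unlimited dimension is the first entry of `shape`.
/// Illustrative sketch; not the tensogram-netcdf implementation.
fn split_records(data: &[f64], shape: &[usize]) -> Vec<Vec<f64>> {
    // A scalar has no record axis, so there is nothing to split.
    let Some((&n_records, rest)) = shape.split_first() else {
        return Vec::new();
    };
    let record_len: usize = rest.iter().product();
    assert_eq!(data.len(), n_records * record_len, "shape does not match data length");
    if record_len == 0 {
        // chunks(0) would panic; every record is simply empty.
        return vec![Vec::new(); n_records];
    }
    data.chunks(record_len).map(|chunk| chunk.to_vec()).collect()
}
```

A variable of shape `[3, 2]` therefore yields three records of two values each, and each record would land in its own message alongside the static variables.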

Encoding pipeline flags

The pipeline flags are applied per data object before encoding into the wire format. They use the same names and semantics as convert-grib:

Stage	Flag	Notes
Encoding	`--encoding simple_packing --bits N`	Lossy quantization. Float64 only: non-`f64` variables in the same file are not packed; they pass through unencoded (with a warning) so mixed files convert cleanly.
Filter	`--filter shuffle`	Byte-shuffle filter, sets `shuffle_element_size` to the post-encoding byte width.
Compression	`--compression zstd --compression-level 3`	`zstd_level` defaults to 3.
Compression	`--compression lz4`	No params.
Compression	`--compression blosc2 --compression-level 9`	Uses `blosc2_codec=lz4` by default.
Compression	`--compression szip`	Sets `szip_rsi=128`, `szip_block_size=16`, `szip_flags=8`. Requires preceding `simple_packing` or `shuffle` because libaec szip caps at 32 bits per sample (raw `f64` is 64 bits).
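
For readers unfamiliar with the byte-shuffle filter named in the table, the sketch below shows the general technique: bytes are regrouped so that the first byte of every element comes first, then the second byte of every element, and so on. This is a generic illustration, not Tensogram's filter code; `elem_size` plays the role of `shuffle_element_size`.

```rust
/// Byte-shuffle: group byte 0 of every element, then byte 1, and so on.
/// Generic illustration of the technique; not Tensogram's implementation.
fn shuffle(data: &[u8], elem_size: usize) -> Vec<u8> {
    assert!(elem_size > 0 && data.len() % elem_size == 0);
    let n = data.len() / elem_size;
    let mut out = vec![0u8; data.len()];
    for i in 0..n {
        for b in 0..elem_size {
            out[b * n + i] = data[i * elem_size + b];
        }
    }
    out
}

/// Inverse of `shuffle`: restore the original element-major byte order.
fn unshuffle(data: &[u8], elem_size: usize) -> Vec<u8> {
    assert!(elem_size > 0 && data.len() % elem_size == 0);
    let n = data.len() / elem_size;
    let mut out = vec![0u8; data.len()];
    for i in 0..n {
        for b in 0..elem_size {
            out[i * elem_size + b] = data[b * n + i];
        }
    }
    out
}
```

Shuffling tends to help the compressor because the high-order bytes of neighbouring values are often similar and end up adjacent.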

Variables that contain NaN or ±Inf (typically from unpacked _FillValue / missing_value substitution or degenerate arithmetic upstream) cannot be represented by simple_packing — the algorithm’s range / scale-factor derivation has no slot for non-finite values.

The importer hard-fails when --encoding simple_packing is requested on data containing NaN or Inf. The error names the offending variable and suggests recovery options:

error: simple_packing failed for forecast_temperature: NaN value
encountered at index 42. The variable contains NaN or Inf which
cannot be represented by simple_packing. Pre-process the data or
choose a different encoding (e.g. encoding="none").
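
To see why non-finite values are a problem, consider a simplified linear quantizer. It is not Tensogram's simple_packing implementation (which may derive its reference value and scale factors differently); it is an assumption-laden sketch, limited to 1 to 32 bits, that shows how the range derivation depends on every value being finite.

```rust
/// Simplified linear quantizer: returns (reference, scale, codes).
/// Sketch only; not the actual simple_packing algorithm. Limited to 1..=32 bits.
fn quantize(values: &[f64], bits: u32) -> Result<(f64, f64, Vec<u32>), String> {
    assert!((1..=32).contains(&bits), "sketch supports 1..=32 bits");
    // A NaN or Inf would poison the min/max range below, so reject it up front.
    if let Some(i) = values.iter().position(|v| !v.is_finite()) {
        return Err(format!("non-finite value encountered at index {i}"));
    }
    let min = values.iter().copied().fold(f64::INFINITY, f64::min);
    let max = values.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    let max_code = ((1u64 << bits) - 1) as f64;
    // A constant (or empty) field has zero range; any non-zero scale works.
    let scale = if max > min { (max - min) / max_code } else { 1.0 };
    let codes = values
        .iter()
        .map(|v| ((v - min) / scale).round() as u32)
        .collect();
    Ok((min, scale, codes))
}
```

Every code is an integer offset from the minimum, so there is no spare code left to mean "missing" unless the caller reserves one as an in-band sentinel.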

Recovery options, in order of effort:

1. Drop the --encoding simple_packing flag and pass --allow-nan. The default pipeline (encoding="none") combined with the NaN bitmask companion frame preserves NaN positions; decode restores a canonical quiet-NaN at each position (specific NaN payloads are not preserved; see NaN / Inf Handling).
2. Substitute non-finite values with an in-band sentinel before conversion if you need simple_packing throughout.
3. Split the conversion with --split-by variable and re-run per-variable, using --encoding simple_packing only for the variables you know are NaN-free.

Prior behaviour (pre-0.17). The importer used to soft-downgrade NaN-bearing variables to encoding="none" with a stderr warning. That silently hid data-quality problems from automated pipelines; 0.17 surfaces them as hard errors and pairs the fix with the --allow-nan bitmask opt-in (preferred over pre-processing). The non-f64-payload branch (a structural mismatch rather than a data-quality problem) keeps its stderr-warning + fallback behaviour unchanged.

# Pack temperature to 24-bit + zstd
tensogram convert-netcdf --encoding simple_packing --bits 24 \
  --compression zstd --compression-level 3 \
  era5_t2m.nc -o era5_t2m.tgm

# Shuffle + szip on a multi-variable file
tensogram convert-netcdf --filter shuffle --compression szip \
  forecast.nc -o forecast.tgm

CF metadata mapping

NetCDF attributes are always extracted into a netcdf sub-map under each base entry:

base[0]:
  name: "temperature"
  netcdf:
    units: "K"
    long_name: "Air Temperature"
    standard_name: "air_temperature"
    _FillValue: -32768
    add_offset: 273.15
    scale_factor: 0.01
    _global:
      Conventions: "CF-1.10"
      title: "..."
      institution: "..."

When --cf is set, an additional cf sub-map is added containing only the 16 CF allow-list attributes. This duplicate copy makes CF-aware tooling cheaper because it can ignore the verbose netcdf map and rely on a stable, standardised key set.
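
The allow-list behaviour amounts to filtering the attribute map by key. In the sketch below, the three-entry `CF_ALLOW_LIST` is a hypothetical stand-in (the actual 16 entries are not enumerated on this page), attribute values are simplified to strings, and `cf_subset` is an illustrative name rather than a real API.

```rust
use std::collections::BTreeMap;

/// Hypothetical subset of the allow-list, for illustration only.
/// The real list has 16 entries that are not reproduced here.
const CF_ALLOW_LIST: &[&str] = &["units", "long_name", "standard_name"];

/// Copy only allow-listed attributes out of the full attribute map.
fn cf_subset(attrs: &BTreeMap<String, String>) -> BTreeMap<String, String> {
    attrs
        .iter()
        .filter(|(key, _)| CF_ALLOW_LIST.contains(&key.as_str()))
        .map(|(key, value)| (key.clone(), value.clone()))
        .collect()
}
```

Attributes outside the list, such as `scale_factor` in this sketch, stay available in the netcdf sub-map but are not copied into cf.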

Limitations

No NetCDF writer. Conversion is one-way only.
No string or char variables. They are skipped with a warning.
No NetCDF-4 enhanced types (compound, vlen, enum, opaque).
Root group only. Sub-groups are skipped with a warning.
simple_packing is f64-only. Mixed-dtype files convert cleanly but only f64 variables get packed.

The importer is also available from Python via tensogram.convert_netcdf() when the wheel is built with the netcdf feature.

Library API

If you’d rather call the importer directly from Rust:

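use std::path::Path;
use tensogram_netcdf::{convert_netcdf_file, ConvertOptions, DataPipeline, SplitBy};

// Returning a Result lets `?` propagate conversion errors out of main.
// Box<dyn Error> assumes the crate's error type implements std::error::Error.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // One message per variable, with the CF sub-map enabled.
    let options = ConvertOptions {
        split_by: SplitBy::Variable,
        cf: true,
        // 24-bit simple_packing followed by zstd level 3.
        pipeline: DataPipeline {
            encoding: "simple_packing".to_string(),
            bits: Some(24),
            compression: "zstd".to_string(),
            compression_level: Some(3),
            ..Default::default()
        },
        ..Default::default()
    };

    let messages = convert_netcdf_file(Path::new("forecast.nc"), &options)?;
    // messages: Vec<Vec<u8>>; each element is a complete wire-format message
    println!("converted {} message(s)", messages.len());
    Ok(())
}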

Note: DataPipeline is defined in tensogram::pipeline and re-exported from both tensogram_netcdf and tensogram_grib. The underlying apply_pipeline helper is the same for both importers, guaranteeing that convert-grib and convert-netcdf produce byte-identical descriptor fields for equivalent flag combinations.
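
Since the library returns the messages as byte vectors, persisting them is left to the caller. The sketch below writes them back to back into any `Write` sink; it assumes that a multi-message output is a plain concatenation of wire-format messages, which this page implies but does not state outright. `write_messages` is a hypothetical helper name.

```rust
use std::io::{self, Write};

/// Write messages back to back into `out` and return the byte count.
/// Assumption: a multi-message stream is a plain concatenation of messages.
fn write_messages<W: Write>(mut out: W, messages: &[Vec<u8>]) -> io::Result<usize> {
    let mut total = 0;
    for message in messages {
        out.write_all(message)?;
        total += message.len();
    }
    out.flush()?;
    Ok(total)
}
```

Passing a `std::fs::File` writes to disk, while `std::io::stdout().lock()` gives the same streaming behaviour as running the CLI without `-o`.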
