Arrow schema

openproteo-core's arrow feature exposes a single record-batch schema that is identical across all vendors. One row = one spectrum; peak arrays live in two LargeList<Float> columns alongside scalar metadata columns.

Schema (flat)

Column	Arrow type	Notes
`index`	`UInt64`	0-based, strictly increasing.
`native_id`	`Utf8`	Vendor native id (e.g. `controllerType=0 ...`).
`ms_level`	`UInt32`	1, 2, ...
`polarity`	`UInt8`	0 = positive, 1 = negative, 255 = unknown.
`retention_time_sec`	`Float64`	Seconds.
`scan_window_low_mz`	`Float64`	Optional, null if unknown.
`scan_window_high_mz`	`Float64`
`precursor_mz`	`Float64`	Null for MS1.
`precursor_charge`	`Int32`	Null when not assigned.
`precursor_isolation_lo`	`Float64`
`precursor_isolation_hi`	`Float64`
`activation`	`Utf8`	`HCD`, `CID`, `ETD`, ...
`analyzer`	`Utf8`	`Orbitrap`, `TOF`, ...
`scan_mode`	`Utf8`
`mz`	`LargeList<Float64>`	Ascending peaks.
`intensity`	`LargeList<Float32>`	Same length as `mz`.
`inv_mobility_per_peak`	`LargeList<Float32>` or null	Present on Bruker TDF when mobility is enabled.

SpectrumBatchBuilder::new(Option<MobilityArrayKind>) toggles the final column. Pass None for instruments without ion mobility (every column is materialized but stays null).

Building a batch

use openproteo_core::arrow::{spectrum_record_schema, SpectrumBatchBuilder};
use openproteo_core::SpectrumSource;

let mut src = opentimstdf::mzml::TdfSource::open("sample.d")?;
let mut b = SpectrumBatchBuilder::new(Some(openproteo_core::MobilityArrayKind::InverseK0));
for s in src.iter_spectra() {
    b.push(&s);
}
let batch = b.finish()?;
assert_eq!(batch.schema(), spectrum_record_schema());
# Ok::<(), Box<dyn std::error::Error>>(())

Why `LargeList`?

Peak arrays for a single TDF MS1 frame routinely cross the 2^31 byte boundary when stored back-to-back, especially in 32-bit float intensities. LargeList (64-bit offsets) avoids the silent truncation that List (32-bit offsets) would otherwise introduce.

Schema (flat)​

Building a batch​

Why LargeList?​

Schema (flat)

Building a batch

Why `LargeList`?