Arrow schema

openproteo-core's arrow feature exposes a single record-batch schema that is identical across all vendors. Each row is one spectrum. Peak arrays live in LargeList columns (Float64 for mz, Float32 for intensity and for the optional per-peak mobility column) alongside the scalar metadata columns.

Schema (flat)

| Column | Arrow type | Notes |
|---|---|---|
| index | UInt64 | 0-based, strictly increasing. |
| native_id | Utf8 | Vendor native id (e.g. controllerType=0 ...). |
| ms_level | UInt32 | 1, 2, ... |
| polarity | UInt8 | 0 = positive, 1 = negative, 255 = unknown. |
| retention_time_sec | Float64 | Seconds. |
| scan_window_low_mz | Float64 | Optional, null if unknown. |
| scan_window_high_mz | Float64 | |
| precursor_mz | Float64 | Null for MS1. |
| precursor_charge | Int32 | Null when not assigned. |
| precursor_isolation_lo | Float64 | |
| precursor_isolation_hi | Float64 | |
| activation | Utf8 | HCD, CID, ETD, ... |
| analyzer | Utf8 | Orbitrap, TOF, ... |
| scan_mode | Utf8 | |
| mz | LargeList<Float64> | Ascending peaks. |
| intensity | LargeList<Float32> | Same length as mz. |
| inv_mobility_per_peak | LargeList<Float32> or null | Present on Bruker TDF when mobility is enabled. |
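The per-row rules in the table can be checked by a consumer before it trusts a batch. The following is a minimal std-only sketch, not part of openproteo-core: `Polarity`, `decode_polarity`, and `check_peaks` are hypothetical names introduced here. It assumes that polarity codes other than 0, 1, and 255 are invalid and that "ascending" permits equal neighbouring m/z values.

```rust
/// Polarity as encoded in the `polarity` UInt8 column.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Polarity {
    Positive,
    Negative,
    Unknown,
}

/// Decode a raw `polarity` value.
/// Assumption: any code other than 0, 1, or 255 is rejected with `None`.
pub fn decode_polarity(raw: u8) -> Option<Polarity> {
    match raw {
        0 => Some(Polarity::Positive),
        1 => Some(Polarity::Negative),
        255 => Some(Polarity::Unknown),
        _ => None,
    }
}

/// Check the two invariants stated for the peak columns of one row:
/// `intensity` has the same length as `mz`, and `mz` is ascending.
pub fn check_peaks(mz: &[f64], intensity: &[f32]) -> Result<(), String> {
    if mz.len() != intensity.len() {
        return Err(format!(
            "length mismatch: {} mz values vs {} intensities",
            mz.len(),
            intensity.len()
        ));
    }
    // Assumption: "ascending" is read as non-decreasing. A NaN also fails
    // this comparison and is reported as out of order.
    if let Some(i) = mz.windows(2).position(|w| !(w[0] <= w[1])) {
        return Err(format!("mz not ascending at peak {}", i + 1));
    }
    Ok(())
}
```

Calling `check_peaks` on the `mz` and `intensity` slices of each row returns `Ok(())` for a well-formed row and a descriptive error otherwise.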

SpectrumBatchBuilder::new(Option<MobilityArrayKind>) toggles the final column. Pass None for instruments without ion mobility: inv_mobility_per_peak is still materialized, so the schema does not change, but every value in it is null.
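To illustrate what "materialized but null" means for a list column, here is a small std-only model. It is a sketch of the general offsets-plus-validity layout, not the builder's actual implementation; `ListColumn` and its methods are hypothetical names.

```rust
/// Minimal stand-in for a nullable list column. Values are stored
/// back-to-back, `offsets[i]..offsets[i + 1]` delimits row `i`, and
/// `valid[i]` is false when row `i` is null.
#[derive(Debug)]
pub struct ListColumn {
    pub values: Vec<f32>,
    pub offsets: Vec<i64>,
    pub valid: Vec<bool>,
}

impl ListColumn {
    pub fn new() -> Self {
        // The offsets vector always starts with a leading 0.
        Self { values: Vec::new(), offsets: vec![0], valid: Vec::new() }
    }

    /// Append one row. `None` records a null row: the column grows by one
    /// entry, but no values are stored for it.
    pub fn push(&mut self, row: Option<&[f32]>) {
        if let Some(peaks) = row {
            self.values.extend_from_slice(peaks);
        }
        self.offsets.push(self.values.len() as i64);
        self.valid.push(row.is_some());
    }

    /// Number of rows, null or not.
    pub fn len(&self) -> usize {
        self.valid.len()
    }

    /// Number of null rows.
    pub fn null_count(&self) -> usize {
        self.valid.iter().filter(|v| !**v).count()
    }

    /// Read row `i` back, or `None` if it is null.
    pub fn get(&self, i: usize) -> Option<&[f32]> {
        if !self.valid[i] {
            return None;
        }
        let lo = self.offsets[i] as usize;
        let hi = self.offsets[i + 1] as usize;
        Some(&self.values[lo..hi])
    }
}
```

Pushing `None` for every spectrum yields a column with the same row count as the others and an empty values buffer, which is why a consumer can rely on the column being present for any vendor.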

Building a batch

use openproteo_core::arrow::{spectrum_record_schema, SpectrumBatchBuilder};
use openproteo_core::SpectrumSource;

// Open a Bruker .d directory as a spectrum source.
let mut src = opentimstdf::mzml::TdfSource::open("sample.d")?;
// Some(..) enables the per-peak mobility column; pass None to leave it null.
let mut b = SpectrumBatchBuilder::new(Some(openproteo_core::MobilityArrayKind::InverseK0));
for s in src.iter_spectra() {
    b.push(&s);
}
let batch = b.finish()?;
// The finished batch carries the shared, vendor-independent schema.
assert_eq!(batch.schema(), spectrum_record_schema());
# Ok::<(), Box<dyn std::error::Error>>(())

Why LargeList?

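Arrow's List type addresses its child array with signed 32-bit offsets, and those offsets accumulate across every row of a column, so one List column can reference at most 2^31 - 1 values. Peak arrays from dense TDF MS1 frames stored back-to-back can cross that boundary. LargeList uses 64-bit offsets and so avoids the offset overflow, and the silent truncation a careless narrowing cast would cause, that List would otherwise introduce.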
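The offset-width argument can be made concrete with a short std-only sketch. Arrow list offsets index into the child array, so the sketch counts values rather than bytes. `list_offsets` and `large_list_offsets` are hypothetical helpers written for this illustration, and the peak counts in the usage note are made-up round numbers, not measurements.

```rust
/// Build the cumulative offsets a `List` column needs (signed 32-bit).
/// Returns `None` as soon as the running total of values no longer fits.
pub fn list_offsets(row_lens: &[u64]) -> Option<Vec<i32>> {
    let mut offsets = Vec::with_capacity(row_lens.len() + 1);
    offsets.push(0i32);
    let mut total: u64 = 0;
    for &n in row_lens {
        total = total.checked_add(n)?;
        // try_from fails instead of wrapping, unlike an `as i32` cast.
        offsets.push(i32::try_from(total).ok()?);
    }
    Some(offsets)
}

/// Same computation with the signed 64-bit offsets of `LargeList`.
pub fn large_list_offsets(row_lens: &[u64]) -> Option<Vec<i64>> {
    let mut offsets = Vec::with_capacity(row_lens.len() + 1);
    offsets.push(0i64);
    let mut total: u64 = 0;
    for &n in row_lens {
        total = total.checked_add(n)?;
        offsets.push(i64::try_from(total).ok()?);
    }
    Some(offsets)
}
```

With two rows of 1,500,000,000 values each, `list_offsets` returns `None` because the second offset would be 3,000,000,000, while `large_list_offsets` succeeds. A writer that narrowed the total with a plain `as i32` cast instead of a checked conversion would wrap to a negative offset without reporting an error.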