Skip to main content

Quickstart: Python

Install

The openproteo metapackage is the single pip install surface for the stack. The base install brings the vendor-agnostic reader (openproteo-io); each per-vendor extra layers on a native binding for direct vendor access.

pip install openproteo # openproteo_io reader
pip install openproteo[thermo] # + opentfraw
pip install openproteo[bruker] # + opentimstdf
pip install openproteo[waters] # + openwraw
pip install openproteo[all] # all vendor extensions

You can also install openproteo-io directly if you only want the unified reader without the metapackage shim:

pip install 'openproteo-io[arrow]'

Detect and convert

import openproteo as op

det = op.detect_format("sample.raw")
print(det.vendor, det.path) # 'thermo' /path/to/sample.raw

op.to_mzml("sample.raw", "sample.mzML", indexed=True)

The functions detect_format, to_mzml, iter_spectra, and the Spectrum class are re-exported from openproteo_io. You can also import them from openproteo_io directly; the two paths refer to the same objects.

Vendor dispatch

For code paths that prefer the native vendor bindings (because they expose vendor-specific surfaces beyond mzML / Arrow), the metapackage ships a structural format detector and a dispatcher:

import openproteo as op

kind = op.detect("sample.raw") # 'thermo' / 'bruker' / 'waters' / None
reader = op.open_run("sample.raw") # opentfraw.RawFile / opentimstdf.Reader / ...

open_run requires the matching extra to be installed; otherwise it raises ImportError.

Iterate spectra (numpy)

iter_spectra yields Spectrum objects whose peak arrays are zero-copy numpy views over Rust-owned buffers. Each peak array can be read exactly once (the buffer is moved into numpy on first access).

import openproteo_io as op

for s in op.iter_spectra("sample.raw"):
if s.ms_level == 1:
mz = s.mz # numpy.ndarray, dtype=float64
intensity = s.intensity # numpy.ndarray, dtype=float32
print(s.index, s.retention_time_sec, mz.size)

For Bruker TDF, ion-mobility scans expose a per-peak inverse-mobility array via s.inv_mobility_per_peak.

Arrow record batches

import openproteo_io as op
import pyarrow as pa

reader = op.read_arrow("sample.d") # pa.RecordBatchReader
schema = reader.schema
table = reader.read_all()
print(table.num_rows, table.schema.names)

The schema is identical across vendors, which makes it easy to land multi-vendor data sets into the same Arrow / Parquet / DuckDB table.

Pandas

import openproteo_io as op
import pandas as pd

table = op.read_arrow("sample.raw").read_all()
df = table.to_pandas()
print(df.head())