Polars integration

Importing openyxdb automatically monkey-patches Polars with YXDB support when Polars is installed. No additional setup is needed.

import polars as pl
import openyxdb # registers everything on import

Top-level read functions

pl.read_yxdb(path)

Eagerly reads a YXDB file and returns a DataFrame:

df = pl.read_yxdb("data.yxdb")

pl.scan_yxdb(path)

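Returns a LazyFrame. Projection and row-limit pushdown are performed at the C++ reader level, so only the columns and rows you request are decoded from disk. In the example below, the select is pushed down to the reader, while the filter is evaluated on each batch after it is decoded (see Pushdown support):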

lf = pl.scan_yxdb("data.yxdb")
df = lf.select("id", "score").filter(pl.col("score") > 90).collect()

Pushdown support

| Optimization | Supported | Notes |
| --- | --- | --- |
| Projection pushdown | Yes | Only requested columns are decoded from disk. |
| Row-limit pushdown | Yes | head(n) / fetch(n) stops decoding once n rows are produced. |
| Batched streaming | Yes | Default 65,536 rows; honours Polars' batch_size hint. |
| Predicate pushdown | Partial | Predicates evaluate per batch after decode. YXDB has no per-block statistics, so genuine file-level predicate skipping is not possible. Combined with head, predicates still short-circuit once enough rows are collected. |
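
To make the table concrete, here is a minimal pure-Python model of the behaviour it describes. It uses neither openyxdb nor Polars, and it is not how the C++ reader is implemented; `decode_batches` and `scan` are hypothetical names, and the tiny batch size is chosen only to keep the example readable. It shows projection applied at decode time, the predicate applied to each batch after decode, and the row limit stopping decoding early.

```python
from typing import Callable, Iterable, Iterator

Row = dict[str, object]


def decode_batches(
    rows: Iterable[Row], columns: list[str], batch_size: int
) -> Iterator[list[Row]]:
    """Yield batches of rows, keeping only the requested columns (projection)."""
    batch: list[Row] = []
    for row in rows:
        batch.append({name: row[name] for name in columns})
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def scan(
    rows: Iterable[Row],
    columns: list[str],
    predicate: Callable[[Row], bool],
    limit: int,
    batch_size: int = 4,
) -> tuple[list[Row], int]:
    """Return up to `limit` matching rows and the number of rows decoded."""
    matched: list[Row] = []
    decoded = 0
    for batch in decode_batches(rows, columns, batch_size):
        decoded += len(batch)
        # The predicate runs per batch, after the batch has been decoded.
        matched.extend(row for row in batch if predicate(row))
        # Stop pulling batches as soon as enough rows have been collected.
        if len(matched) >= limit:
            break
    return matched[:limit], decoded


# A lazy source of 20 rows; "note" is never requested, so it is never kept.
source = ({"id": i, "score": i * 10, "note": "unused"} for i in range(20))
result, decoded = scan(source, ["id", "score"], lambda r: r["score"] > 90, limit=3)
print([r["id"] for r in result], decoded)
```

In this model the scan stops after the fourth batch, so 16 of the 20 rows are decoded even though the predicate itself cannot skip any batch.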

DataFrame and LazyFrame namespace plugins

df.yxdb.write(path) -- write a DataFrame to a YXDB file:

df.yxdb.write("output.yxdb")
df.yxdb.write("output.yxdb", chunk_size=65_536)

lf.yxdb.sink(path) -- execute a lazy plan and write to YXDB in chunks:

lf.yxdb.sink("output.yxdb")
lf.yxdb.sink("output.yxdb", chunk_size=65_536)
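
As a rough illustration of what a chunk size means for a chunked writer, the sketch below computes the row ranges that would be processed one at a time. It is a plain-Python model under the assumption that rows are written in consecutive fixed-size slices, not openyxdb's actual implementation; `chunk_bounds` is a hypothetical helper.

```python
from typing import Iterator


def chunk_bounds(n_rows: int, chunk_size: int = 65_536) -> Iterator[tuple[int, int]]:
    """Yield half-open (start, stop) row ranges of at most chunk_size rows."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, n_rows, chunk_size):
        yield start, min(start + chunk_size, n_rows)


# 150,000 rows split into chunks of 65,536 rows; the last chunk is shorter.
bounds = list(chunk_bounds(150_000))
print(bounds)
```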

openyxdb.scan_yxdb(path)

The standalone openyxdb.scan_yxdb is the same as pl.scan_yxdb and is available before Polars monkey-patching runs:

import openyxdb
lf = openyxdb.scan_yxdb("data.yxdb")

openyxdb.sink_yxdb(lf, path, chunk_size=None, engine="streaming")

The standalone sink function accepts either a LazyFrame or a DataFrame:

import openyxdb, polars as pl

lf = pl.scan_csv("source.csv").select("id", "value")
openyxdb.sink_yxdb(lf, "output.yxdb", chunk_size=65_536)

Manual registration

If openyxdb was imported before Polars was installed, or if you want to re-register the integration, call register_polars() explicitly:

openyxdb.register_polars()
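
A defensive variant, sketched under the assumption that the presence of `pl.scan_yxdb` is a reasonable signal that registration has already happened. The helper name `ensure_yxdb_registered` is hypothetical; only `openyxdb.register_polars()` and `pl.scan_yxdb` come from the documentation above.

```python
import importlib
import importlib.util


def ensure_yxdb_registered() -> bool:
    """Return True if pl.scan_yxdb is available, registering it if needed."""
    # Bail out early when either package is missing from the environment.
    for name in ("polars", "openyxdb"):
        if importlib.util.find_spec(name) is None:
            return False
    pl = importlib.import_module("polars")
    openyxdb = importlib.import_module("openyxdb")
    # Assumption: a missing pl.scan_yxdb means the monkey-patch has not run.
    if not hasattr(pl, "scan_yxdb"):
        openyxdb.register_polars()
    return hasattr(pl, "scan_yxdb")


registered = ensure_yxdb_registered()
print("YXDB support registered:", registered)
```

The helper returns False instead of raising when a package is absent, which may be convenient in scripts that treat YXDB support as optional.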

End-to-end example

import polars as pl
import openyxdb

# Lazy scan -> filter -> project -> sink
(
    pl.scan_yxdb("sales.yxdb")
    .filter(pl.col("region") == "US")
    .select("order_id", "amount", "date")
    .yxdb.sink("us_sales.yxdb")
)

# Verify the output
print(pl.read_yxdb("us_sales.yxdb").head(5))