Skip to main content

DuckDB integration

OpenYXDB provides helpers to use YXDB files as data sources in DuckDB queries and to write DuckDB query results back to YXDB.

The integration works through Arrow interop: YXDB files are loaded into Arrow tables and registered on the DuckDB connection. A native DuckDB table function (SELECT * FROM read_yxdb(...)) would require a C++ DuckDB extension, which is out of scope for this package.

Setup

pip install openyxdb[duckdb]

Register a file as a SQL view

register_duckdb(con, name, path) registers a YXDB file as a named view on an existing connection:

import duckdb, openyxdb

con = duckdb.connect()
openyxdb.register_duckdb(con, "sales", "sales.yxdb")

result = con.execute("SELECT region, SUM(amount) FROM sales GROUP BY region").df()

Get a DuckDB relation

to_duckdb(path, con=None, table_name=None) loads the file into a DuckDB connection and returns a relation. If table_name is provided, the relation is also registered as a view:

import duckdb, openyxdb

con = duckdb.connect()
rel = openyxdb.to_duckdb("data.yxdb", con=con, table_name="yx")

# Use the relation directly
print(rel.query("yx", "SELECT COUNT(*) FROM yx").fetchone())

# Or use SQL now that it's registered
con.execute("SELECT * FROM yx LIMIT 5").fetchdf()

Write a DuckDB query to YXDB

from_duckdb(source, path, con=None, chunk_size=None) accepts a SQL string or a DuckDBPyRelation and writes the result to a YXDB file:

import duckdb, openyxdb

con = duckdb.connect()
openyxdb.register_duckdb(con, "yx", "data.yxdb")

# Write a filtered query to a new YXDB file
openyxdb.from_duckdb(
"SELECT id, score FROM yx WHERE score > 90",
"top_scores.yxdb",
con=con,
)

You can also pass a relation object:

rel = con.execute("SELECT id, score FROM yx WHERE score > 90")
openyxdb.from_duckdb(rel, "top_scores.yxdb")

In-memory workflow

Connect without a file to work entirely in memory:

import duckdb, openyxdb

con = duckdb.connect(":memory:")
openyxdb.register_duckdb(con, "orders", "orders.yxdb")
openyxdb.register_duckdb(con, "customers", "customers.yxdb")

openyxdb.from_duckdb(
"SELECT o.*, c.name FROM orders o JOIN customers c ON o.customer_id = c.id",
"enriched_orders.yxdb",
con=con,
)