DuckDB: Importing Polars dataframes the easy and hard way using Rust
Table of Contents

- Easy - Roundtrip through the filesystem
- Hard - Appending Arrow RecordBatches to DuckDB tables
- FFI:ing Arrow data
- Conclusion
- Terminology
Recently I have been working on a project where I did data manipulation through Polars DataFrames and then wanted to query the result with DuckDB for that OLAP goodness. I wanted to build on the Rust Arrow ecosystem. On the surface, Arrow looks like a nice ecosystem with good interoperability: Polars abstracts over Arrow, and DuckDB advertises support for both Polars and Arrow. That turned out to be a mirage.

Diving in, this became the first time I had to use the C FFI to call one Rust library from another, because, well, near-zero interoperability exists if you stick to pure Rust as the Arrow ecosystem has evolved.
Easy - Roundtrip through the filesystem

Let's begin by implementing the trivial solution. Nearly all DuckDB examples start with Parquet files, and Polars makes it trivial to write Parquet to disk. When querying the table, DuckDB kindly returns the underlying table as DataFrames split into batches to reduce the overall memory footprint.
```rust
let mut dataframe = df!(
    "id" => [1i64, 2, 3],
    "name" => ["alpha", "beta", "gamma"]
)?;
// Serialize the DataFrame to Parquet on disk and let DuckDB ingest it into a table.
ParquetWriter::new(File::create("dataframe.parquet")?).finish(&mut dataframe)?;
let conn = Connection::open_in_memory()?;
conn.execute("CREATE TABLE tbl AS SELECT * FROM 'dataframe.parquet'", [])?;
// Query the table back out as Polars DataFrames, batch by batch,
// and stitch the batches into a single DataFrame.
let mut stmt = conn.prepare("SELECT * FROM tbl")?;
let mut result = stmt.query_polars([])?;
let next = result.next().ok_or("the query returned no rows")?;
let result_df = result
    .into_iter()
    .try_fold(next, |acc, df| acc.vstack(&df))?;
assert_eq!(dataframe, result_df); // Returns true. They are equal!
```

This works, with the added overhead of serializing everything to Parquet and temporarily storing all the data on disk in an intermediate state. In a production setting, creating a library of Parquet files for DuckDB or a cloud database to query is the reasonable choice. But this is not a production setting; this is a hobby project where I want to send data from an API directly to my database without managing files on disk. No intermediary saving to the file system. No serialization overhead. No deserialization overhead. No management of temporary files to ensure I do not run out of space on my 1 vCPU instance.
Hard - Appending Arrow RecordBatches to DuckDB tables

Searching through the DuckDB library, I quickly found the appender API for efficiently importing large datasets, which conveniently accepts Arrow record batches! That should work, right? Polars is Arrow. DuckDB wants Arrow. A match made in heaven!

Let's go to Polars and find the equivalent method to get some record batches out of the DataFrame. I find the slightly more ambiguous and cumbersome function `rechunk_to_record_batch`. Let's try using that function!
```rust
let dataframe = df!(
    "id" => [1i64, 2, 3],
    "name" => ["alpha", "beta", "gamma"]
)?;
let conn = Connection::open_in_memory()?;
// We need to manually create the table because the
// appender appends to an existing table.
conn.execute("CREATE TABLE tbl (id BIGINT, name VARCHAR)", [])?;
// Convert the DataFrame to the underlying RecordBatch.
let record_batch = dataframe.rechunk_to_record_batch(CompatLevel::newest());
// Stuff it in the table!
let mut appender = conn.appender("tbl")?;
appender.append_record_batch(record_batch)?;
appender.flush()?;
```
Happily, I try compiling!

```
error: mismatched types
   --> .../lib.rs:122:38
    |
122 |     appender.append_record_batch(record_batch)?;
    |              |
    |              arguments to this method are incorrect
    |
    = note: expected struct `RecordBatch`
               found struct `RecordBatchT<Box<dyn Array>>`
```

Did you manage to spot the error? Look at the types. A `RecordBatchT` is very much not a `RecordBatch`, no matter how you finagle with the types or the underlying data. This is where our naive view of the Arrow ecosystem ends. Apparently everything wasn't the same Arrow. Not anymore. Polars ripped out the few remaining interoperability functions in October 2024. We now face two options:

1. Manually convert the Polars Arrow schema and array types into their official Arrow counterparts.
2. Round-trip through the Arrow C data interface that both crates expose.

Like a true Rustacean, I first attempted number 1. Splitting up the RecordBatchT into its constituent parts is easy enough. But then I face converting an enum containing 41 variants, several of them enums themselves, into another enum of the same size. Once for the schema fields and once for the data arrays. With subtle incompatibilities. Noooope!
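To give a flavor of what option 1 entails, here is a sketch of my own (not code from the project) of the start of such a hand-written mapping, using the data type enums the two crates expose:

```rust
use arrow::datatypes::DataType;
use polars_arrow::datatypes::ArrowDataType;

// A hand-written mapping from the polars-arrow data type enum to the official
// Arrow one. The flat variants are trivial; the nested ones (timestamps,
// decimals, lists, structs, dictionaries, ...) each need recursive treatment,
// and the arrays themselves would still need converting afterwards.
fn map_data_type(dtype: &ArrowDataType) -> DataType {
    match dtype {
        ArrowDataType::Boolean => DataType::Boolean,
        ArrowDataType::Int8 => DataType::Int8,
        ArrowDataType::Int16 => DataType::Int16,
        ArrowDataType::Int32 => DataType::Int32,
        ArrowDataType::Int64 => DataType::Int64,
        ArrowDataType::Float32 => DataType::Float32,
        ArrowDataType::Float64 => DataType::Float64,
        ArrowDataType::Utf8 => DataType::Utf8,
        ArrowDataType::LargeUtf8 => DataType::LargeUtf8,
        // ...and roughly thirty more variants to go.
        _ => todo!("left as an exercise"),
    }
}
```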
I settled on the C FFI interface both crates expose. That is meant to be a stable interface. The documentation for the Arrow FFI module is nice, with good examples, while the Polars Arrow FFI module's documentation is sparse, to say the least. There is also an issue on the ConnectorX project where someone previously ran into this problem of incompatible Arrow types, with the now deprecated Arrow2 crate in the mix.

FFI:ing Arrow data

This led me down the path of exporting the Polars Arrow data to its C interface types and then transmuting them into the official Arrow C interface types. This is a safe operation since both sides implement the ABI-compatible C data interface as per the standard. (This assumes both implementations are bug-free. I set up quite a large test framework and did not find anything surprising.)
Both crates expose their own ABI-compatible `#[repr(C)]` types used for the C data interface. The path forward is to export the data to the `polars-arrow` FFI types, transmute them into the equivalent official Arrow FFI types, and import them on the Arrow side.

I begin by finding the export functionality in the `polars_arrow::ffi` module:
```rust
// Export one column (Box<dyn Array>) and its Field through the polars-arrow C data interface.
let exported_array = export_array_to_c(polars_array);
let exported_schema = export_field_to_c(polars_field);
```
Transmute to the equivalent Arrow types:

```rust
// SAFETY: both sides are #[repr(C)] structs implementing the Arrow C data
// interface, so the layouts are ABI-compatible.
let array: FFI_ArrowArray = unsafe { std::mem::transmute(exported_array) };
let schema: FFI_ArrowSchema = unsafe { std::mem::transmute(exported_schema) };
```
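Since the whole argument rests on the two sets of structs having the same layout, a cheap compile-time sanity check does not hurt. This is my own addition rather than something from the project, and it assumes the FFI structs are exported at these paths:

```rust
// Sanity check (my addition): refuse to compile if the #[repr(C)] structs on
// the two sides do not even agree on size. Equal size is necessary, not
// sufficient, but it catches a mismatched pair of crate versions early.
const _: () = assert!(
    std::mem::size_of::<polars_arrow::ffi::ArrowArray>()
        == std::mem::size_of::<arrow::ffi::FFI_ArrowArray>()
);
const _: () = assert!(
    std::mem::size_of::<polars_arrow::ffi::ArrowSchema>()
        == std::mem::size_of::<arrow::ffi::FFI_ArrowSchema>()
);
```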
Constructing an Arrow array from the C FFI data with `from_ffi` and `make_array`. The conversion handles errors to manage some incompatibilities: for example, `polars-arrow` supports an `int128` type which does not comply with the spec and will cause an error when you attempt to represent it using the official Arrow library.

```rust
// Re-import the transmuted structs on the official Arrow side.
let data = unsafe { from_ffi(array, &schema)? };
let array = make_array(data);
```
Create the companion `Field` type, utilizing the C data interface to map the datatype from Polars Arrow to official Arrow:

```rust
// The data type travels through the C schema; name and nullability come from the Polars field.
let field = Field::new(
    polars_field.name.to_string(),
    DataType::try_from(&schema)?,
    polars_field.is_nullable,
);
```
Tying everything together: the last step is to, in a loop, call the `polars_to_arrow_rs_array` function for each `Array` and `Field` combo in the `RecordBatchT` we got an eon ago from `rechunk_to_record_batch`, and assemble the results into an official Arrow `RecordBatch`, as sketched below.
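What follows is a sketch of that glue under a few assumptions of mine: `polars_to_arrow_rs_array` here simply bundles the export/transmute/import steps from the snippets above, the helper name `arrow_record_batch_from_polars_parts` and its (field, array) input representation are my own, and the `polars_dataframe_to_arrow_record_batch` used in the test below presumably splits the `RecordBatchT` into exactly such pairs before looping:

```rust
use std::sync::Arc;

use arrow::array::{make_array, ArrayRef, RecordBatch};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::ffi::{from_ffi, FFI_ArrowArray, FFI_ArrowSchema};
use polars_arrow::ffi::{export_array_to_c, export_field_to_c};

type AnyError = Box<dyn std::error::Error>;

// One polars-arrow column and field in, one official Arrow column and field out.
fn polars_to_arrow_rs_array(
    polars_array: Box<dyn polars_arrow::array::Array>,
    polars_field: &polars_arrow::datatypes::Field,
) -> Result<(Field, ArrayRef), AnyError> {
    // Export through the Arrow C data interface on the Polars side.
    let exported_array = export_array_to_c(polars_array);
    let exported_schema = export_field_to_c(polars_field);
    // SAFETY: both sides are #[repr(C)] implementations of the same C data interface.
    let array: FFI_ArrowArray = unsafe { std::mem::transmute(exported_array) };
    let schema: FFI_ArrowSchema = unsafe { std::mem::transmute(exported_schema) };
    // Re-import on the official Arrow side.
    let data = unsafe { from_ffi(array, &schema)? };
    let field = Field::new(
        polars_field.name.to_string(),
        DataType::try_from(&schema)?,
        polars_field.is_nullable,
    );
    Ok((field, make_array(data)))
}

// Loop over the (field, array) pairs pulled out of the RecordBatchT and build
// an official Arrow RecordBatch from the converted columns.
fn arrow_record_batch_from_polars_parts(
    parts: Vec<(polars_arrow::datatypes::Field, Box<dyn polars_arrow::array::Array>)>,
) -> Result<RecordBatch, AnyError> {
    let mut fields = Vec::with_capacity(parts.len());
    let mut columns: Vec<ArrayRef> = Vec::with_capacity(parts.len());
    for (polars_field, polars_array) in parts {
        let (field, column) = polars_to_arrow_rs_array(polars_array, &polars_field)?;
        fields.push(field);
        columns.push(column);
    }
    Ok(RecordBatch::try_new(Arc::new(Schema::new(fields)), columns)?)
}
```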
Nice! Now let's see if it works:

```rust
let dataframe = df!(
    "id" => [1i64, 2, 3],
    "name" => ["alpha", "beta", "gamma"]
)?;
let record_batch = polars_dataframe_to_arrow_record_batch(dataframe.clone())?;

let conn = Connection::open_in_memory()?;
conn.execute("CREATE TABLE tbl (id BIGINT, name VARCHAR)", [])?;
let mut appender = conn.appender("tbl")?;
appender.append_record_batch(record_batch)?;
appender.flush()?;
```
And testing if I truly get out what I send in:

```rust
let mut stmt = conn.prepare("SELECT * FROM tbl")?;
let mut result = stmt.query_polars([])?;
let next = result.next().ok_or("the query returned no rows")?;
let result_df = result
    .into_iter()
    .try_fold(next, |acc, df| acc.vstack(&df))?;
assert_eq!(dataframe, result_df);
```
A test run later:

```
running 1 test
test appender ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 1 filtered out; finished in 0.05s
```

Finally. A straight shot of data into my DuckDB database without passing through serialization to the file system.
Conclusion

Was this detour worth it? Maybe not, but I had fun working with a C FFI and utilizing what Arrow is all about: a columnar format for data analytics that you can send between any language in memory. While we may never truly have left Rust, we at least bounced around the C data interface transmuting ABI-compatible types. What remains is checking whether the entire conversion from Polars Arrow to official Arrow is zero-copy. But the result satisfies me. I don't need to write a routine to safely delete temporary data to prevent my microscopic 1 vCPU instance from running out of space when I deploy it to the real world.

Personally, I would not recommend working with the Rust API for Polars yet. Suddenly, they rip out an interoperability function, like the support for the official Arrow implementation, and you scramble to fix it. The Polars Rust library serves as a work in progress for the Python project, and the rough edges feel endless. The documentation tells a similar story: the deeper you go, the worse it gets. This truly rears its ugly head when you must use the internal Polars crates like `polars-arrow`.

What path would I take if repeating this project? I would not use the Rust Polars library and would instead work with Arrow RecordBatches directly. The Rust Arrow ecosystem is a stable, fairly mature platform with good documentation. By cutting out Polars and DataFrames, I would do the DataFrame-esque work in the database or directly on the Rust types when querying the API. DataFrames are easier to debug than SQL, but somehow SQL tends to just keep chugging along if acceptably designed.

On the point of databases: the Rust DuckDB library is both surprisingly solid, thanks to its history as a fork of rusqlite, and very immature when it comes to what DuckDB layered on top of that base. It can spit out Polars DataFrames but cannot load them. It provides only a handful of the more powerful parameterized types and data loaders, and large parts of the documentation consist of non-descriptive single sentences describing what instead of why and how. Asking an LLM how to solve anything complex devolves into a Rustified version of what is available in the Python library, using hallucinated functions. But that will likely change as DuckDB matures together with the wider Rust data ecosystem.

Terminology