DuckDB: Importing Polars dataframes the easy and hard way using Rust

Recently I have been working on a project where I do data manipulation through Polars DataFrames and then want to query the result with DuckDB for that OLAP goodness.

I wanted to build on the Rust Arrow ecosystem. On the surface, Arrow looks like a nice ecosystem with good interoperability. Polars abstracts over Arrow. DuckDB advertises support for both Polars and Arrow.

Diving in, this turned out to be the first time I had to use the C FFI to pass data between two Rust libraries because, well, as the Arrow ecosystem has evolved, near-zero interoperability remains if you stick to pure Rust.

Terminology

  • Arrow - A columnar format for working with structured data in memory. Many modern data processing solutions use it as their backing format.
  • FFI - Foreign Function Interface. For example, calling a Rust function from Python.
  • DuckDB - The SQLite for data analysis. A local-first database that enables fast analysis on datasets larger than memory. Skip dealing with the cloud data warehouse until you run out of fast storage!
  • Polars - The modern dataframe library, a Rust-built equivalent to good old Pandas. It exposes a clean API with lazy computation for blisteringly fast analyses in Python. The Rust library feels like an unstable WIP for the Python frontend, but can be quite fun to use!
  • RecordBatch - A table-like structure comprising multiple Arrow arrays, each representing a column. A schema describes the layout allowing us to know what each column contains! No type guessing à la CSV.
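
To make that concrete, here is a minimal arrow-rs sketch of a RecordBatch (column names and values are invented for illustration): a schema declaring each column's name and type, plus one Arrow array per column.

use std::sync::Arc;

use arrow::array::{ArrayRef, Int32Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

// The schema declares every column's name and type up front.
let schema = Arc::new(Schema::new(vec![
    Field::new("id", DataType::Int32, false),
    Field::new("label", DataType::Utf8, false),
]));

// One Arrow array per column, matching the schema.
let batch = RecordBatch::try_new(
    schema,
    vec![
        Arc::new(Int32Array::from(vec![1, 2, 3])) as ArrayRef,
        Arc::new(StringArray::from(vec!["x", "y", "z"])) as ArrayRef,
    ],
)?;

assert_eq!(batch.num_rows(), 3);
assert_eq!(batch.num_columns(), 2);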

Easy - Roundtrip through the filesystem

Let’s begin by implementing the trivial solution. Nearly all DuckDB examples start with Parquet files, and Polars makes it trivial to write Parquet to disk.

let mut dataframe = df!{
    "a" => [1, 2, 3],
    "b" => [4, 4, 5],
    "c" => ["hey", "what", "now"],
    "d" => [3.44, 5.88, 1231f64],
}?;

ParquetWriter::new(&mut std::fs::File::create("./tmp.parquet")?)
    .finish(&mut dataframe)?;

let conn = duckdb::Connection::open_in_memory()?;
conn.execute(
    "CREATE TABLE tmp_table AS
        SELECT * FROM './tmp.parquet'",
    [],
)?;

When querying the table, DuckDB kindly returns the underlying table as DataFrames split into batches to reduce the overall memory footprint.

let mut stmt = conn.prepare("SELECT * FROM tmp_table;")?;
let mut result = stmt.query_polars([])?;

let next = result.next().ok_or(anyhow::anyhow!("Empty result"))?;
let result_df = result
    .into_iter()
    .try_fold(next, |acc, df| acc.vstack(&df))?;

assert_eq!(dataframe, result_df); // Passes. They are equal!

This works, with the added overhead of serializing everything to Parquet and temporarily storing all data on disk in an intermediate state.

In a production setting, building a library of Parquet files for DuckDB or a cloud data warehouse to query is the reasonable choice.

But this is not a production setting; this is a hobby project where I want to send data from an API directly to my database without managing files on disk.

No intermediary saving to the file system. No serialization overhead. No deserialization overhead. No management of temporary files to ensure I do not run out of space on my 1 vCPU instance.


Hard - Appending Arrow RecordBatches to DuckDB tables

Searching through the DuckDB library, I quickly found the appender API for efficiently importing large datasets, which conveniently utilizes Arrow record batches!

impl Appender<'_> {
    pub fn append_record_batch(
        &mut self,
        record_batch: RecordBatch
    ) -> Result<()>
}

That should work, right? Polars is Arrow. DuckDB wants Arrow. A match made in heaven! Let’s go to Polars and find the equivalent method to get some record batches from the dataframe.

I find the slightly more ambiguous and cumbersome function rechunk_to_record_batch.

impl DataFrame {
    pub fn rechunk_to_record_batch(
        self,
        compat_level: CompatLevel,
    ) -> RecordBatchT<Box<dyn Array>>
}

Let’s try using that function!

let dataframe = df!{
    "a" => [1, 2, 3],
    "b" => [4, 4, 5],
    "c" => ["hey", "what", "now"],
    "d" => [3.44, 5.88, 1231f64],
}?;
let conn = duckdb::Connection::open_in_memory()?;

// We need to manually create the table because the  
// appender appends to an existing table. 
conn.execute(
    "CREATE TABLE tmp_table (
        a INTEGER,
        b INTEGER,
        c VARCHAR,
        d DOUBLE)",
    [],
)?;

// Convert the Dataframe to the underlying RecordBatch.
let record_batch = dataframe.rechunk_to_record_batch(
    CompatLevel::newest()
);

// Stuff it in the table!
let mut appender = conn.appender("tmp_table")?;
appender.append_record_batch(record_batch)?;
appender.flush()?;

Happily, I try compiling!

error[E0308]: mismatched types
   --> src/lib.rs:122:38
    |
122 | appender.append_record_batch(record_batch)?;
    |          ------------------- ^^^^^^^^^^^^ expected `RecordBatch`, 
    |          |                    found `RecordBatchT<Box<dyn Array>>`
    |          |
    |          arguments to this method are incorrect
    |
    = note: expected struct `arrow::array::RecordBatch`
               found struct `RecordBatchT<Box<dyn polars_arrow::array::Array>>`

Did you manage to spot the error? Look at the types. A RecordBatchT is very much not a RecordBatch. No matter how you finagle with the types or underlying data.

This is where our naive view of the Arrow ecosystem ends. Apparently everything wasn’t the same Arrow. Not anymore. Polars ripped out the few remaining interoperability functions in October 2024.

We now face two options:

  1. Manually write a conversion from Polars Arrow to official Arrow.
  2. Use the Arrow C data interface through FFI, which is most commonly used to send data between languages, for example from Python to the Polars Rust backend.

Like a true Rustacean, I first attempted number 1. Splitting up the RecordBatchT into its constituent parts is easy enough.

pub fn into_schema_and_arrays(self) -> (ArrowSchemaRef, Vec<A>)

But then I face converting an enum containing 41 variants, several of which are enums themselves, into another enum of the same size. Once for the schema fields and once for the data arrays. With subtle incompatibilities. Noooope!
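
For a taste of what option 1 would involve, here is an illustrative and deliberately incomplete sketch of the datatype half of such a conversion. The function name is made up, only a few variants are shown, and the array data itself would still need its own conversion on top.

pub fn convert_datatype(
    dt: &polars_arrow::datatypes::ArrowDataType,
) -> arrow::datatypes::DataType {
    use arrow::datatypes::DataType as ArrowDT;
    use polars_arrow::datatypes::ArrowDataType as PolarsDT;

    match dt {
        // A few of the easy, one-to-one variants...
        PolarsDT::Int32 => ArrowDT::Int32,
        PolarsDT::Float64 => ArrowDT::Float64,
        PolarsDT::Utf8 => ArrowDT::Utf8,
        // ...and dozens more, several of them recursive
        // (List, Struct, Dictionary, ...), each needing the same
        // treatment for the array data, not just the datatype.
        _ => unimplemented!("the part I refused to write"),
    }
}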

I settled on option 2: the C data interface both crates expose through FFI, which is meant to be stable.

The documentation for the Arrow FFI module is nice, with good examples, while the Polars Arrow FFI module's documentation is sparse, to say the least.

There is also an issue in the ConnectorX project where someone previously ran into the same problem of incompatible Arrow types, with the now-deprecated Arrow2 crate in the mix.

This led me down the path of exporting Polars Arrow data to its C interface types and then transmuting them to the Arrow C interface types. This is a safe operation since both crates implement the ABI-compatible C data interface as per the standard. (This assumes both implementations are bug-free. I set up quite a large test framework and did not find anything surprising.)

FFI:ing Arrow data

Both crates expose their own ABI-compatible #[repr(C)] types used for the C data interface. The path forward is:

  1. Convert data to the C data interface types as per the polars-arrow types.
  2. Transmute the types to the Arrow variety.
  3. Load the data into the Rust representation.
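
For reference, the transmute in step 2 is only reasonable because both crates' FFI structs follow the same C definitions from the Arrow C data interface. A rough Rust rendering of the spec's ArrowSchema struct looks like this; the field layout follows the published spec, and the struct name is mine, not either crate's.

use std::ffi::c_void;
use std::os::raw::c_char;

// Approximate #[repr(C)] layout of the Arrow C data interface's
// ArrowSchema, per the Arrow spec. Both polars-arrow's and arrow-rs's
// FFI schema types mirror this layout, which is what makes the
// transmute sound.
#[repr(C)]
pub struct ArrowSchemaAbi {
    format: *const c_char,
    name: *const c_char,
    metadata: *const c_char,
    flags: i64,
    n_children: i64,
    children: *mut *mut ArrowSchemaAbi,
    dictionary: *mut ArrowSchemaAbi,
    release: Option<unsafe extern "C" fn(*mut ArrowSchemaAbi)>,
    private_data: *mut c_void,
}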

I begin by finding the export functionality in the polars_arrow::ffi module:

let exported_array = polars_arrow::ffi::export_array_to_c(array);
let exported_schema = polars_arrow::ffi::export_field_to_c(field);

Transmute to the equivalent Arrow types.


let array: arrow::ffi::FFI_ArrowArray =
    unsafe { transmute(exported_array) };
let schema: arrow::ffi::FFI_ArrowSchema =
    unsafe { transmute(exported_schema) };

Next, I construct an Arrow array from the C FFI data. This step is fallible, which surfaces some incompatibilities: for example, polars-arrow supports an int128 type which does not comply with the spec and will cause an error when you attempt to represent it using the official Arrow library.

let data = unsafe { arrow::ffi::from_ffi(array, &schema)? };
let array = arrow::array::make_array(data);

Create the companion Field type, reusing the datatype that the C data interface mapped from Polars Arrow to official Arrow.

let field = arrow::datatypes::Field::new(
    field.name.as_str(),
    array.data_type().clone(),
    field.is_nullable,
)

Tying everything together:

pub fn polars_to_arrow_rs_array(
    field: &polars_arrow::datatypes::Field,
    array: Box<dyn polars_arrow::array::Array>,
) -> Result<(arrow::datatypes::Field, arrow::array::ArrayRef),
        arrow::error::ArrowError> {

    // Export the C ABI compatible types using the FFI module.
    let exported_array = polars_arrow::ffi::export_array_to_c(array);
    let exported_schema = polars_arrow::ffi::export_field_to_c(field);

    // Transmute the Polars Arrow FFI array to the Arrow-rs
    // FFI array.
    //
    // SAFETY: Per the Arrow spec this is ABI compatible.
    let array: arrow::ffi::FFI_ArrowArray =
        unsafe { std::mem::transmute(exported_array) };
    let schema: arrow::ffi::FFI_ArrowSchema =
        unsafe { std::mem::transmute(exported_schema) };


    // SAFETY: Per the Arrow spec this is ABI compatible.
    let data = unsafe { arrow::ffi::from_ffi(array, &schema)? };
    let array = arrow::array::make_array(data);

    Ok((
        arrow::datatypes::Field::new(
            field.name.as_str(),
            array.data_type().clone(),
            field.is_nullable,
        ),
        array,
    ))
}

The last step is to loop over the RecordBatchT we got an eon ago from rechunk_to_record_batch and call polars_to_arrow_rs_array for each Array and Field combo.

pub fn polars_dataframe_to_arrow_record_batch(
    df: polars::prelude::DataFrame,
) -> Result<arrow::record_batch::RecordBatch, arrow::error::ArrowError> {
    
    let (schema, arrays) = df
        .rechunk_to_record_batch(polars::prelude::CompatLevel::newest())
        .into_schema_and_arrays();

    let (fields, arrays): (Vec<_>, Vec<_>) = arrays
        .into_iter()
        .zip(schema.iter_values())
        // Convert to Arrow using the function we built earlier.
        .map(|(array, field)| polars_to_arrow_rs_array(field, array))
        .collect::<Result<Vec<_>, _>>()?
        .into_iter()
        .unzip();

    let schema = arrow::datatypes::SchemaRef::new(
            arrow::datatypes::Schema::new(fields));

    let record_batch = arrow::record_batch::RecordBatch::try_new(
        schema,
        arrays,
    )?;

    Ok(record_batch)
}

Nice! Now let’s see if it works:

let dataframe = df!{
    "a" => [1, 2, 3],
    "b" => [4, 4, 5],
    "c" => ["hey", "what", "now"],
    "d" => [3.44, 5.88, 1231f64],
}?;
let record_batch = polars_dataframe_to_arrow_record_batch(
    dataframe.clone()
)?;

let conn = duckdb::Connection::open_in_memory()?;

// The appender needs an existing table to append into.
conn.execute(
    "CREATE TABLE tmp_table (
        a INTEGER,
        b INTEGER,
        c VARCHAR,
        d DOUBLE)",
    [],
)?;

let mut appender = conn.appender("tmp_table")?;
appender.append_record_batch(record_batch)?;
appender.flush()?;

And testing if I truly get out what I send in:

let mut stmt = conn.prepare("SELECT * FROM tmp_table;")?;
let mut result = stmt.query_polars([])?;

let next = result.next().ok_or(anyhow::anyhow!("Empty result"))?;
let result_df = result
    .into_iter()
    .try_fold(next, |acc, df| acc.vstack(&df))?;

assert_eq!(dataframe, result_df);

A test run later:

running 1 test
test tests::appender ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 
1 filtered out; finished in 0.05s

Finally. A straight shot of data into my DuckDB database without passing through serialization to the file system.

Was this detour worth it? Maybe not, but I had fun working with a C FFI and utilizing what Arrow is all about: A columnar format for data analytics you can send between any language in memory. While we may never truly have left Rust, we at least bounced around the C data interface transmuting ABI-compatible types.

What remains is checking whether the entire conversion from Polars Arrow to official Arrow is zero-copy. But the result satisfies me. I don’t need to write a routine to safely delete temporary data to prevent my microscopic 1 vCPU instance from running out of space when I deploy it to the real world.
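
If I ever get around to it, one rough way to probe the zero-copy question is to compare buffer addresses before and after the conversion. The sketch below assumes the polars_to_arrow_rs_array helper from earlier and a single Int32 column; matching addresses strongly suggest the values buffer was shared rather than copied, though this is a heuristic, not a proof.

use polars::prelude::*;
use polars_arrow::array::PrimitiveArray;

let dataframe = df!{ "a" => [1i32, 2, 3] }?;
let (schema, arrays) = dataframe
    .rechunk_to_record_batch(CompatLevel::newest())
    .into_schema_and_arrays();

// Address of the Int32 values buffer on the polars-arrow side.
let polars_array = arrays.into_iter().next().unwrap();
let before = polars_array
    .as_any()
    .downcast_ref::<PrimitiveArray<i32>>()
    .unwrap()
    .values()
    .as_slice()
    .as_ptr() as usize;

// Run it through the FFI conversion from earlier.
let field = schema.iter_values().next().unwrap();
let (_field, arrow_array) = polars_to_arrow_rs_array(field, polars_array)?;

// Address of the first data buffer on the arrow-rs side.
let after = arrow_array.to_data().buffers()[0].as_ptr() as usize;

// Equal addresses mean the values buffer was shared, not copied.
println!("same values buffer? {}", before == after);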


Conclusion

Personally, I would not recommend working with the Rust API for Polars yet. Suddenly they rip out an interoperability feature, like the support for the official Arrow implementation, and you are left scrambling to fix it. The Polars Rust library serves as a WIP for the Python project, and the rough edges feel endless.

The documentation tells a similar story. The deeper you go, the worse the documentation gets. This truly rears its ugly head when you must use the internal Polars crates like polars-arrow.

What path would I take if repeating this project?

I would not use the Rust Polars library and instead work with Arrow RecordBatches directly.

The Rust Arrow ecosystem is a stable, fairly mature platform with good documentation. By cutting out Polars and DataFrames I would do the DataFrame-esque work in the database or directly with the Rust types when querying the API.
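
As a sketch of what that would look like, building the RecordBatch straight from plain Rust vectors and handing it to the appender skips both Polars and the FFI detour. The table and column names here are invented, and it assumes the arrow crate version lines up with the one the duckdb crate re-exports (or that you use duckdb's re-export directly).

use std::sync::Arc;

use arrow::array::{ArrayRef, Float64Array, Int64Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

// Pretend these vectors came straight out of an API response.
let ids = vec![1i64, 2, 3];
let names = vec!["hey", "what", "now"];
let values = vec![3.44, 5.88, 1231.0];

let schema = Arc::new(Schema::new(vec![
    Field::new("id", DataType::Int64, false),
    Field::new("name", DataType::Utf8, false),
    Field::new("value", DataType::Float64, false),
]));

let batch = RecordBatch::try_new(
    schema,
    vec![
        Arc::new(Int64Array::from(ids)) as ArrayRef,
        Arc::new(StringArray::from(names)) as ArrayRef,
        Arc::new(Float64Array::from(values)) as ArrayRef,
    ],
)?;

// Same appender path as before, no Polars and no FFI detour.
let conn = duckdb::Connection::open_in_memory()?;
conn.execute(
    "CREATE TABLE api_data (id BIGINT, name VARCHAR, value DOUBLE)",
    [],
)?;
let mut appender = conn.appender("api_data")?;
appender.append_record_batch(batch)?;
appender.flush()?;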

DataFrames are easier to debug than SQL. But somehow SQL tends to just keep chugging along if acceptably designed.

On the point of databases: the Rust DuckDB library is both surprisingly solid, since its history as a fork of rusqlite gives it a sturdy base, and very immature when it comes to what DuckDB has layered on top.

The Rust DuckDB library can spit out Polars DataFrames but cannot load them. It provides only a handful of the more powerful parameterized types and data loaders, and large parts of the documentation consist of non-descriptive single sentences describing the what instead of the why and how. Asking an LLM how to solve anything complex devolves into a Rustified version of the Python library's API, built from hallucinated functions.

But that will likely change as DuckDB matures together with the wider Rust data ecosystem.