A native Rust library for Apache Hudi, with bindings to Python
The hudi-rs project aims to broaden the use of Apache Hudi for a diverse range of users and projects.
| Source | Installation Command |
|---|---|
| PyPI | `pip install hudi` |
| Crates.io | `cargo add hudi` |
> **Note**
> These examples expect a Hudi table to exist at `/tmp/trips_table`, created using the quick start guide.
Read a Hudi table into a PyArrow table.
```python
from hudi import HudiTable

hudi_table = HudiTable("/tmp/trips_table")
records = hudi_table.read_snapshot()

import pyarrow as pa
import pyarrow.compute as pc

arrow_table = pa.Table.from_batches(records)
result = arrow_table.select(["rider", "ts", "fare"]).filter(pc.field("fare") > 20.0)
print(result)
```
Add the `hudi` crate with the `datafusion` feature to your application to query a Hudi table.
```shell
cargo new my_project --bin && cd my_project
cargo add tokio@1 datafusion@39
cargo add hudi --features datafusion
```
Update `src/main.rs` with the code snippet below, then run `cargo run`.
```rust
use std::sync::Arc;

use datafusion::error::Result;
use datafusion::prelude::{DataFrame, SessionContext};
use hudi::HudiDataSource;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    let hudi = HudiDataSource::new("/tmp/trips_table").await?;
    ctx.register_table("trips_table", Arc::new(hudi))?;
    let df: DataFrame = ctx.sql("SELECT * FROM trips_table WHERE fare > 20.0").await?;
    df.show().await?;
    Ok(())
}
```
To work with tables on cloud storage, set the relevant credentials as environment variables, e.g., `AWS_*`, `AZURE_*`, or `GOOGLE_*`. These variables are picked up automatically, and the target table's base URI is resolved according to its scheme, such as `s3://`, `az://`, or `gs://`.
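As a minimal sketch of the AWS case: the credentials can be exported in the shell or set from Python before the table is opened. The placeholder values, region, and the `s3://` bucket path below are all hypothetical, substitute your own:

```python
import os

# Illustrative placeholders only; replace with real credentials,
# or rely on your environment's existing credential provider instead.
os.environ["AWS_ACCESS_KEY_ID"] = "<access-key-id>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<secret-access-key>"
os.environ["AWS_REGION"] = "us-west-2"

# With credentials in place, a table can be addressed by its s3:// base URI:
# hudi_table = HudiTable("s3://my-bucket/trips_table")  # hypothetical bucket/path
print(os.environ["AWS_REGION"])
```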
Check out the contributing guide for all the details about making contributions to the project.