Experimenting with arrow2 #68

Closed · wants to merge 37 commits

Commits (37)
099398e  Wip. (jorgecarleitao, Jun 8, 2021)
a5b2557  resolve merge conflicts and bump to latest arrow2 (houqp, Sep 4, 2021)
a0c9669  use lexicographical_partition_ranges from arrow2 (houqp, Sep 4, 2021)
3218759  Merge remote-tracking branch 'upstream/master' into arrow22 (houqp, Sep 4, 2021)
a035200  Fix build errors (houqp, Sep 6, 2021)
843fbe6  Fix DataFusion test and try to make ballista compile (#4) (yjshen, Sep 18, 2021)
fccbddb  pin arrow-flight to 0.1 in arrow2 repo (houqp, Sep 18, 2021)
77c69cf  turn on io_parquet_compression feature for arrow2 (houqp, Sep 18, 2021)
2d2e379  estimate array memory usage with estimated_bytes_size (houqp, Sep 18, 2021)
cb187a6  Merge remote-tracking branch 'upstream/master' into arrow2-merge (houqp, Sep 18, 2021)
25363d2  fix compile and tests (houqp, Sep 19, 2021)
7a5294b  Make ballista compile (#6) (yjshen, Sep 24, 2021)
4030615  Make `cargo test` compile (#7) (yjshen, Sep 25, 2021)
fde82cf  fix str to timestamp ScalarValue casting (houqp, Sep 25, 2021)
b585f3b  fixing datafusion tests (#8) (yjshen, Sep 25, 2021)
99907fd  fix crypto expression tests (houqp, Sep 26, 2021)
b2f709d  fix floating point precision (houqp, Sep 26, 2021)
ed5281c  fix list scalar to_array method for timestamps (houqp, Sep 26, 2021)
f9504e7  Fix tests (#9) (yjshen, Sep 26, 2021)
33b6931  Ignore last test, fix `cargo clippy`, format and pass integration tes… (yjshen, Sep 28, 2021)
ca53b64  bump to latest arrow2, remove ord for interval type (houqp, Sep 29, 2021)
8702e12  add back case insensitive regex support (houqp, Sep 30, 2021)
41153dc  support type cast failure message (houqp, Oct 2, 2021)
ba57aa8  bump to arrow2 and parquet2 0.7, replace arrow-flight with arrow-format (houqp, Nov 23, 2021)
387fdf6  chore: arrow2 to 0.8, parquet to 0.8, prost to 0.9, tonic to 0.6 (yjshen, Nov 30, 2021)
0d504e6  Merge remote-tracking branch 'upstream/master' into arrow22 (houqp, Dec 19, 2021)
ea6d7fa  Fix build and tests (houqp, Dec 20, 2021)
44db376  Merge remote-tracking branch 'origin/master' into arrow2_merge (Igosuki, Jan 11, 2022)
ca9b485  merge latest datafusion (Igosuki, Jan 11, 2022)
b9125bc  start migrating avro to arrow2 (Igosuki, Jan 11, 2022)
99fdac3  lints (Igosuki, Jan 11, 2022)
1b916aa  merge latest datafusion (Igosuki, Jan 12, 2022)
d611d4d  Fix hash utils (Igosuki, Jan 12, 2022)
171332f  missing import in hash_utils test with no_collision (Igosuki, Jan 12, 2022)
4344454  address clippies in root workspace (Igosuki, Jan 12, 2022)
257a7c5  fix tests #1 (Igosuki, Jan 12, 2022)
b5cb938  fix decimal tests (houqp, Jan 13, 2022)

Files changed
3 changes: 1 addition & 2 deletions .github/workflows/rust.yml
@@ -318,8 +318,7 @@ jobs:
run: |
cargo miri setup
cargo clean
-# Ignore MIRI errors until we can get a clean run
-cargo miri test || true
+cargo miri test
Contributor commented: ❤️

# Coverage job was failing. https://github.com/apache/arrow-datafusion/issues/590 tracks re-instating it

2 changes: 1 addition & 1 deletion ballista/rust/core/Cargo.toml
@@ -40,7 +40,7 @@ tokio = "1.0"
tonic = "0.4"
uuid = { version = "0.8", features = ["v4"] }

-arrow-flight = { version = "4.0" }
+arrow-flight = { git = "https://github.com/jorgecarleitao/arrow2", rev = "43d8cf5c54805aa437a1c7ee48f80e90f07bc553" }

datafusion = { path = "../../../datafusion" }

9 changes: 4 additions & 5 deletions ballista/rust/core/src/client.rs
@@ -35,7 +35,7 @@ use arrow_flight::utils::flight_data_to_arrow_batch;
use arrow_flight::Ticket;
use arrow_flight::{flight_service_client::FlightServiceClient, FlightData};
use datafusion::arrow::{
-array::{StringArray, StructArray},
+array::{StructArray, Utf8Array},
datatypes::{Schema, SchemaRef},
error::{ArrowError, Result as ArrowResult},
record_batch::RecordBatch,
@@ -104,10 +104,8 @@ impl BallistaClient {
let path = batch
.column(0)
.as_any()
-.downcast_ref::<StringArray>()
-.expect(
-"execute_partition expected column 0 to be a StringArray",
-);
+.downcast_ref::<Utf8Array<i32>>()
+.expect("execute_partition expected column 0 to be a Utf8Array");

let stats = batch
.column(1)
@@ -206,6 +204,7 @@ impl Stream for FlightDataStream {
flight_data_to_arrow_batch(
&flight_data_chunk,
self.schema.clone(),
+true,
&[],
)
});
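In the hunks above, `datafusion::arrow` re-exports arrow2, so the sketches below use `arrow2` paths directly. arrow2 parameterizes string arrays over their offset type: arrow-rs's `StringArray` becomes `Utf8Array<i32>` and `LargeStringArray` becomes `Utf8Array<i64>`. A minimal sketch of the downcast pattern this hunk adopts, assuming an arrow2 release from this era (~0.7/0.8); the `first_string` helper is illustrative, not from the PR:

```rust
use std::sync::Arc;

use arrow2::array::{Array, Utf8Array};

/// Returns the first value of a string column, or None if the column is not
/// a Utf8Array<i32> (arrow2's counterpart to arrow-rs's StringArray).
fn first_string(column: &Arc<dyn Array>) -> Option<&str> {
    let strings = column.as_any().downcast_ref::<Utf8Array<i32>>()?;
    // value(0) panics on an empty array; acceptable for a sketch.
    Some(strings.value(0))
}

fn main() {
    let col: Arc<dyn Array> = Arc::new(Utf8Array::<i32>::from_slice(&["hello"]));
    assert_eq!(first_string(&col), Some("hello"));
}
```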
18 changes: 7 additions & 11 deletions ballista/rust/core/src/execution_plans/query_stage.rs
@@ -30,7 +30,7 @@ use crate::memory_stream::MemoryStream;
use crate::utils;

use async_trait::async_trait;
-use datafusion::arrow::array::{ArrayRef, StringBuilder};
+use datafusion::arrow::array::{ArrayRef, Utf8Array};
use datafusion::arrow::datatypes::{DataType, Field, Schema, SchemaRef};
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::error::{DataFusionError, Result};
@@ -156,9 +156,7 @@ impl ExecutionPlan for QueryStageExec {
]));

// build result set with summary of the partition execution status
-let mut c0 = StringBuilder::new(1);
-c0.append_value(&path).unwrap();
-let path: ArrayRef = Arc::new(c0.finish());
+let path: ArrayRef = Arc::new(Utf8Array::<i32>::from_slice(&[path]));

let stats: ArrayRef = stats
.to_arrow_arrayref()
@@ -188,7 +186,7 @@
#[cfg(test)]
mod tests {
use super::*;
-use datafusion::arrow::array::{StringArray, StructArray, UInt32Array, UInt64Array};
+use datafusion::arrow::array::*;
use datafusion::physical_plan::memory::MemoryExec;
use tempfile::TempDir;

@@ -213,17 +211,15 @@
assert_eq!(1, batch.num_rows());
let path = batch.columns()[0]
.as_any()
-.downcast_ref::<StringArray>()
+.downcast_ref::<Utf8Array<i32>>()
.unwrap();
let file = path.value(0);
assert!(file.ends_with("data.arrow"));
let stats = batch.columns()[1]
.as_any()
.downcast_ref::<StructArray>()
.unwrap();
-let num_rows = stats
-.column_by_name("num_rows")
-.unwrap()
+let num_rows = stats.values()[0]
.as_any()
.downcast_ref::<UInt64Array>()
.unwrap();
@@ -241,8 +237,8 @@
let batch = RecordBatch::try_new(
schema.clone(),
vec![
-Arc::new(UInt32Array::from(vec![Some(1), Some(2)])),
-Arc::new(StringArray::from(vec![Some("hello"), Some("world")])),
+Arc::new(UInt32Array::from(&[Some(1), Some(2)])),
+Arc::new(Utf8Array::<i32>::from(&[Some("hello"), Some("world")])),
],
)?;
let partition = vec![batch.clone(), batch];
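The builder-to-constructor rewrite above recurs throughout this migration: arrow2 arrays are immutable and are typically built in one shot with `from_slice` (all values valid) or `from` (per-slot `Option`s) rather than with arrow-rs's mutable builders. A short sketch built around the constructors this diff itself uses; the values are illustrative:

```rust
use std::sync::Arc;

use arrow2::array::{Array, Utf8Array};

fn main() {
    // Replaces StringBuilder::new + append_value + finish.
    let path: Arc<dyn Array> = Arc::new(Utf8Array::<i32>::from_slice(&["/tmp/data.arrow"]));

    // Mirrors StringArray::from(vec![Some("hello"), None]) in arrow-rs.
    let names = Utf8Array::<i32>::from(&[Some("hello"), None, Some("world")]);

    assert_eq!(path.len(), 1);
    assert_eq!(names.null_count(), 1);
}
```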
1 change: 1 addition & 0 deletions ballista/rust/core/src/serde/logical_plan/from_proto.rs
@@ -742,6 +742,7 @@ impl TryInto<datafusion::scalar::ScalarValue> for &protobuf::ScalarValue {
let pb_scalar_type = opt_scalar_type
.as_ref()
.ok_or_else(|| proto_error("Protobuf deserialization err: ScalaListValue missing required field 'datatype'"))?;

let typechecked_values: Vec<ScalarValue> = values
.iter()
.map(|val| val.try_into())
19 changes: 7 additions & 12 deletions ballista/rust/core/src/serde/physical_plan/from_proto.rs
@@ -132,7 +132,6 @@ impl TryInto<Arc<dyn ExecutionPlan>> for &protobuf::PhysicalPlanNode {
&filenames,
Some(projection),
None,
-scan.batch_size as usize,
scan.num_partitions as usize,
None,
)?))
@@ -199,17 +198,13 @@
PhysicalPlanType::Window(window_agg) => {
let input: Arc<dyn ExecutionPlan> =
convert_box_required!(window_agg.input)?;
-let input_schema = window_agg
-.input_schema
-.as_ref()
-.ok_or_else(|| {
-BallistaError::General(
-"input_schema in WindowAggrNode is missing.".to_owned(),
-)
-})?
-.clone();
-let physical_schema: SchemaRef =
-SchemaRef::new((&input_schema).try_into()?);
+let input_schema = window_agg.input_schema.ok_or_else(|| {
+BallistaError::General(
+"input_schema in WindowAggrNode is missing.".to_owned(),
+)
+})?;
+
+let physical_schema = Arc::new(input_schema);

let physical_window_expr: Vec<Arc<dyn WindowExpr>> = window_agg
.window_expr
2 changes: 1 addition & 1 deletion ballista/rust/core/src/serde/physical_plan/mod.rs
@@ -24,7 +24,7 @@ mod roundtrip_tests {

use datafusion::{
arrow::{
-compute::kernels::sort::SortOptions,
+compute::sort::SortOptions,
datatypes::{DataType, Field, Schema},
},
logical_plan::Operator,
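The import move above reflects arrow2 flattening arrow-rs's `compute::kernels::*` modules into `compute::*`. The struct itself appears shape-compatible across the two crates; a tiny sketch under that assumption:

```rust
use arrow2::compute::sort::SortOptions; // was arrow::compute::kernels::sort::SortOptions

fn main() {
    // Same two public fields as in arrow-rs.
    let opts = SortOptions {
        descending: false,
        nulls_first: true,
    };
    assert!(!opts.descending && opts.nulls_first);
}
```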
2 changes: 1 addition & 1 deletion ballista/rust/core/src/serde/physical_plan/to_proto.rs
@@ -257,7 +257,7 @@ impl TryInto<protobuf::PhysicalPlanNode> for Arc<dyn ExecutionPlan> {
let filenames = exec
.partitions()
.iter()
-.flat_map(|part| part.filenames().to_owned())
+.map(|part| part.filename.clone())
.collect();
Ok(protobuf::PhysicalPlanNode {
physical_plan_type: Some(PhysicalPlanType::ParquetScan(
55 changes: 15 additions & 40 deletions ballista/rust/core/src/serde/scheduler/mod.rs
@@ -17,9 +17,7 @@

use std::{collections::HashMap, sync::Arc};

-use datafusion::arrow::array::{
-ArrayBuilder, ArrayRef, StructArray, StructBuilder, UInt64Array, UInt64Builder,
-};
+use datafusion::arrow::array::*;
use datafusion::arrow::datatypes::{DataType, Field, Schema, SchemaRef};
use datafusion::logical_plan::LogicalPlan;
use datafusion::physical_plan::ExecutionPlan;
@@ -142,52 +140,29 @@ impl PartitionStats {
]
}

-pub fn to_arrow_arrayref(self) -> Result<Arc<StructArray>, BallistaError> {
-let mut field_builders = Vec::new();
-
-let mut num_rows_builder = UInt64Builder::new(1);
-match self.num_rows {
-Some(n) => num_rows_builder.append_value(n)?,
-None => num_rows_builder.append_null()?,
-}
-field_builders.push(Box::new(num_rows_builder) as Box<dyn ArrayBuilder>);
-
-let mut num_batches_builder = UInt64Builder::new(1);
-match self.num_batches {
-Some(n) => num_batches_builder.append_value(n)?,
-None => num_batches_builder.append_null()?,
-}
-field_builders.push(Box::new(num_batches_builder) as Box<dyn ArrayBuilder>);
-
-let mut num_bytes_builder = UInt64Builder::new(1);
-match self.num_bytes {
-Some(n) => num_bytes_builder.append_value(n)?,
-None => num_bytes_builder.append_null()?,
-}
-field_builders.push(Box::new(num_bytes_builder) as Box<dyn ArrayBuilder>);
-
-let mut struct_builder =
-StructBuilder::new(self.arrow_struct_fields(), field_builders);
-struct_builder.append(true)?;
-Ok(Arc::new(struct_builder.finish()))
+pub fn to_arrow_arrayref(&self) -> Result<Arc<StructArray>, BallistaError> {
+let num_rows = Arc::new(UInt64Array::from(&[self.num_rows])) as ArrayRef;
+let num_batches = Arc::new(UInt64Array::from(&[self.num_batches])) as ArrayRef;
+let num_bytes = Arc::new(UInt64Array::from(&[self.num_bytes])) as ArrayRef;
+let values = vec![num_rows, num_batches, num_bytes];
+
+Ok(Arc::new(StructArray::from_data(
+self.arrow_struct_fields(),
+values,
+None,
+)))
}

pub fn from_arrow_struct_array(struct_array: &StructArray) -> PartitionStats {
-let num_rows = struct_array
-.column_by_name("num_rows")
-.expect("from_arrow_struct_array expected a field num_rows")
+let num_rows = struct_array.values()[0]
.as_any()
.downcast_ref::<UInt64Array>()
.expect("from_arrow_struct_array expected num_rows to be a UInt64Array");
-let num_batches = struct_array
-.column_by_name("num_batches")
-.expect("from_arrow_struct_array expected a field num_batches")
+let num_batches = struct_array.values()[1]
.as_any()
.downcast_ref::<UInt64Array>()
.expect("from_arrow_struct_array expected num_batches to be a UInt64Array");
-let num_bytes = struct_array
-.column_by_name("num_bytes")
-.expect("from_arrow_struct_array expected a field num_bytes")
+let num_bytes = struct_array.values()[2]
.as_any()
.downcast_ref::<UInt64Array>()
.expect("from_arrow_struct_array expected num_bytes to be a UInt64Array");
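The `PartitionStats` rewrite above packages both halves of the struct-array change: arrow2 assembles a `StructArray` from already-finished child arrays via `from_data(fields, values, validity)`, and reads children back positionally with `values()` instead of arrow-rs's `column_by_name`. A minimal round-trip sketch under the same ~0.8 API assumption; the two-field layout and helper names are illustrative, not from the PR:

```rust
use std::sync::Arc;

use arrow2::array::{Array, StructArray, UInt64Array};
use arrow2::datatypes::{DataType, Field};

fn stats_array(num_rows: Option<u64>, num_batches: Option<u64>) -> StructArray {
    let fields = vec![
        Field::new("num_rows", DataType::UInt64, true),
        Field::new("num_batches", DataType::UInt64, true),
    ];
    let values: Vec<Arc<dyn Array>> = vec![
        Arc::new(UInt64Array::from(&[num_rows])),
        Arc::new(UInt64Array::from(&[num_batches])),
    ];
    StructArray::from_data(fields, values, None) // None: no struct-level validity
}

fn read_num_rows(stats: &StructArray) -> Option<u64> {
    // Positional access: values()[0] must match the field order above.
    let num_rows = stats.values()[0]
        .as_any()
        .downcast_ref::<UInt64Array>()
        .expect("field 0 should be a UInt64Array");
    if num_rows.is_null(0) {
        None
    } else {
        Some(num_rows.value(0))
    }
}

fn main() {
    let stats = stats_array(Some(42), None);
    assert_eq!(read_num_rows(&stats), Some(42));
}
```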
14 changes: 6 additions & 8 deletions ballista/rust/core/src/utils.rs
@@ -29,12 +29,10 @@ use crate::serde::scheduler::PartitionStats;

use datafusion::arrow::error::Result as ArrowResult;
use datafusion::arrow::{
-array::{
-ArrayBuilder, ArrayRef, StructArray, StructBuilder, UInt64Array, UInt64Builder,
-},
-datatypes::{DataType, Field, SchemaRef},
-ipc::reader::FileReader,
-ipc::writer::FileWriter,
+array::*,
+datatypes::{DataType, Field},
+io::ipc::read::FileReader,
+io::ipc::write::FileWriter,
record_batch::RecordBatch,
};
use datafusion::execution::context::{ExecutionConfig, ExecutionContext};
@@ -63,7 +61,7 @@ pub async fn write_stream_to_disk(
stream: &mut Pin<Box<dyn RecordBatchStream + Send + Sync>>,
path: &str,
) -> Result<PartitionStats> {
-let file = File::create(&path).map_err(|e| {
+let mut file = File::create(&path).map_err(|e| {
BallistaError::General(format!(
"Failed to create partition file at {}: {:?}",
path, e
@@ -73,7 +71,7 @@
let mut num_rows = 0;
let mut num_batches = 0;
let mut num_bytes = 0;
-let mut writer = FileWriter::try_new(file, stream.schema().as_ref())?;
+let mut writer = FileWriter::try_new(&mut file, stream.schema().as_ref())?;

while let Some(result) = stream.next().await {
let batch = result?;
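`write_stream_to_disk` above keeps the file handle mutable because arrow2's IPC writer (relocated from `ipc::writer` to `io::ipc::write`) borrows its underlying writer instead of consuming it. A hedged sketch of the write path under the same ~0.8 assumption; `write_batches` is illustrative and error handling is simplified to `Box<dyn Error>`:

```rust
use std::error::Error;
use std::fs::File;

use arrow2::datatypes::Schema;
use arrow2::io::ipc::write::FileWriter;
use arrow2::record_batch::RecordBatch;

fn write_batches(path: &str, schema: &Schema, batches: &[RecordBatch]) -> Result<(), Box<dyn Error>> {
    // arrow-rs's FileWriter took the file by value; here we keep it and lend it out.
    let mut file = File::create(path)?;
    let mut writer = FileWriter::try_new(&mut file, schema)?;
    for batch in batches {
        writer.write(batch)?;
    }
    // finish() writes the IPC footer; without it the file is unreadable.
    writer.finish()?;
    Ok(())
}
```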
4 changes: 2 additions & 2 deletions benchmarks/src/bin/nyctaxi.rs
@@ -23,7 +23,7 @@ use std::process;
use std::time::Instant;

use datafusion::arrow::datatypes::{DataType, Field, Schema};
-use datafusion::arrow::util::pretty;
+use datafusion::arrow::io::print;

use datafusion::error::Result;
use datafusion::execution::context::{ExecutionConfig, ExecutionContext};
@@ -124,7 +124,7 @@ async fn execute_sql(ctx: &mut ExecutionContext, sql: &str, debug: bool) -> Resu
let physical_plan = ctx.create_physical_plan(&plan)?;
let result = collect(physical_plan).await?;
if debug {
-pretty::print_batches(&result)?;
+print::print(&result)?;
}
Ok(())
}
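The same module flattening applies to pretty-printing: arrow-rs's `util::pretty::print_batches` becomes arrow2's `io::print::print`, with the same slice-of-batches shape as used in the hunk above. A one-function sketch:

```rust
use arrow2::error::Result;
use arrow2::io::print;
use arrow2::record_batch::RecordBatch;

/// Dumps batches as an ASCII table; was pretty::print_batches in arrow-rs.
fn show(batches: &[RecordBatch]) -> Result<()> {
    print::print(batches)
}
```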