-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[arrow2] Merge arrow2 and datafusion latest #1697
Conversation
…ch` (apache#1542) * Clarify docs about `Accumulator::update` and `Accumulator::update_batch` * Update datafusion/src/physical_plan/mod.rs Co-authored-by: QP Hou <[email protected]> * Improve docs, add comments on merge_batch Co-authored-by: QP Hou <[email protected]>
* Initial implementation of variance * get simple f64 type tests working * add math functions to ScalarValue, some tests * add to expressions and tests * add more tests * add test for ScalarValue add * add tests for scalar arithmetic * add test, finish variance * fix warnings * add more sql tests * add stddev and tests * add the hooks and expression * add more tests * fix lint and clipy * address comments and fix test errors * address comments * add population and sample for variance and stddev * address more comments * fmt * add test for less than 2 values * fix inconsistency in the merge logic * fix lint and clipy * use batch operations * remove unused code * lint and clipy * fix s typo * clipy fix * fix lint
* Fix 'new_without_default' clippy warnings * Fix 'from_over_into' clippy warnings * Push 'allow(module_inception)' down from lib.rs * Fix 'complex_type' clippy warning with type alias * Remove unused yet allowed clippy warnings * Fix 'new_without_default' clippy warnings for tests
* Update tonic/prost deps * Update to arrow 7.0.0-SNAPSHOT * Update datafusion and tests for arrow changes * fix doc tests * Update avro support * Use released arrow 7.0.0
…d supporting `ScalarValue` arithmetic (apache#1550) * Remove Scalar::{add,mul,div} * fixups
…ache#1537) * Make call SchedulerServer::new once in ballista-scheduler process * cargo fmt * fix wrong annotation
* initial change * fix some logic * adding basic tests * mark update and merge unimplemented * all tests pass * add doc and tests * lint * clipy * add more merge tests * add merge tests for stddev * lint * clippy
…rnal Sort implementation (apache#1526) * Simplified memory management * External sorter as an example * document more * use std::sync::Mutex instead of futures::lock::Mutex * resolve comments * Adding tests for ExternalSorter as well as DiskManager * Adding test for memory manager * Prevent allocate more memory than we actually have * unnecessary unsafe * fix lint * Fix requesters_total update non-atomic * Fix lint * resolve comments
…1442) * support cast/try_cast for decimal: signed numeric to decimal * add test case * change test case * change dependency * remove useless code * restore dependency
…cion rule (apache#1483) * support comparison for decimal data type * add more test for decimal comparison; refactor binary test * change cargo temporarily * add license for binary rule file * fix clippy
* add correlation function * add readme * add sql test * fix divide by 0
* support decimal to numeric * fix error test
* implement Hash for various types * change more partialOrd => hash * update unit tests
* add from_slice trait to ease arrow2 migration * update more
…Config` and `PhysicalPlanConfig` (apache#1562) * Remove batch_size from PhysicalPlanConfig first * Fix newly emerged lint problems * Remove batch_size from `ExecutionConfig`, rename `PhysicalPlanConfig` to `FileScanConfig` to make it clearer * Revert "Fix newly emerged lint problems" This reverts commit 7be7ccb.
…e#1589) * refactor to support str and bin * incorporate comments
* support comparison for decimal data type * add more test for decimal comparison; refactor binary test * change cargo temporarily * support decimal to arithmetic operation * minor fix * add doc/comment for coercion rule * address comments
* add support show tables and show columns for ballista * format code style * fmt code style * fmt code style * fmt code * fmt code * add ballista.with_information_schema to BallistaConfig * add ballista.with_information_schema to BallistaConfig * fix test clippy * fix test clippy * fix test clippy
* Support dictionary / dictionary comparisons by unpacking them * Comment rationale for not directly comparing Dictionary types
…1541) * Rebase * Update tests * Add GenericError * Cleanup * Remove object store trait updates * Fix clippy * Cleanup code review comments
…le (apache#1607) * Consolidate coercion rules * Make more functions not pub * restore comments * Update datafusion/src/physical_plan/coercion_rule/binary_rule.rs Co-authored-by: xudong.w <[email protected]> Co-authored-by: xudong.w <[email protected]>
…nager` (apache#1668) * Improve configuration and resource use of `MemoryManager` and `DiskManager` * fmt
* Use NamedTempFile rather than `String` in DiskManager * fix ballista
…che#1682) * Add gauge * fix doc * fix fmt
…ester (apache#1678) * skip empty batch while inserting * `SortPreservingMerge` fuzz testing Co-authored-by: Yijie Shen <[email protected]>
… that cannot be returned
I changed the base to arrow2 branch rather than master. |
Woops 😆
Le ven. 28 janv. 2022 à 17:04, Andrew Lamb ***@***.***> a
écrit :
… I changed the base to arrow2 branch rather than master.
Epic work @Igosuki <https://github.com/Igosuki> L
—
Reply to this email directly, view it on GitHub
<#1697 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AADDFBRRG6CVTWP63DXKXU3UYK5AZANCNFSM5NBDF3ZA>
.
Triage notifications on the go with GitHub Mobile for iOS
<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
or Android
<https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took a brief walk through this code @Igosuki -- specifically the commits you added. I must say it is a very cool piece of work (especially with the compatibility traits, and RecordBatch
)
👍
// FIXME: return recods_read from infer_schema | ||
schemas.push(schema.clone()); | ||
records_to_read -= records_to_read; | ||
if records_read == 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@@ -109,3 +111,321 @@ pub fn struct_array_from(pairs: Vec<(Field, ArrayRef)>) -> StructArray { | |||
let values = pairs.iter().map(|v| v.1.clone()).collect(); | |||
StructArray::from_data(DataType::Struct(fields), values, None) | |||
} | |||
|
|||
/// Imitate arrow-rs Schema behavior by extending arrow2 Schema | |||
pub trait SchemaExt { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is cool
Although the extension traits help, I think the subtle differences between arrow-rs and arrow2 such as in Array creation will make merging always a bit painful, especially when merging unit tests... Not sure this way will be sustainable in the long run. |
Really nice work on both the extension trait and recordbatch struct 👍 With regards to unit tests merging, @jimexist did some prior work to help reduce conflicts at: #1588. But I agree with you that merging is a lot of work. I think the goal is not to maintain the in the long run, but rather close the gaps in the arrow2 milestones as fast as we could, then make a final decision on whether we want to merge this into master or not as a community. |
I believe we should not use squash merge to maintain long running branches, so I am going to manually push these changes into the arrow2 branch. |
@Igosuki if you are planning on fixing some of the tests in the short term, then let me know so I won't be stepping on your toes. Otherwise, I will help with the tests this weekend. |
@houqp Some fixes here https://github.com/Igosuki/arrow-datafusion/tree/arrow_testfix_20220129 remain a parquet projection issue, the decimal type issue I mentioned here and a test that fails because it's not spilling any memory when it's expected that it will. |
Which issue does this PR close?
None
Rationale for this change
Updates the arrow2 branch with the latest of arrow2 and datafusion master
What changes are included in this PR?
Extension traits to keep the API consistent and make future datafusion master merging easier.
Internally, some methods start using
Chunk
, I haven't done any perf tests.Are there any user-facing changes?
RecordBatch is
datafusion::record_batch::RecordBatch
instead ofdatafusion::arrow::record_batch::RecordBatch
Issues
Some tests fail, I haven't fixed everything.
Notably, checking schema validity between columns and schemas in record batch fails for decimal tests because arrow creates a Decimal(32, 32) column for Int128Array : https://github.com/jorgecarleitao/arrow2/blob/main/src/datatypes/mod.rs#L295