forked from apache/arrow
Dremio 11 #21
Closed
Conversation
…14881) * apache@730e9c5 updates `ts-jest` configuration to remove deprecation warnings * apache@e4d83f2 updates `print-buffer-alignment.js` debug utility for latest APIs * apache@3b9d18c updates `arrow2csv` to print zero-based rowIds * apache@b6c42f3 fixes apache#14791 * Closes: apache#14791 Authored-by: ptaylor <[email protected]> Signed-off-by: Dominik Moritz <[email protected]>
…pache#14892) * Closes: apache#14883 Authored-by: Matt Topol <[email protected]> Signed-off-by: Matt Topol <[email protected]>
* Closes: apache#14883 Authored-by: Matt Topol <[email protected]> Signed-off-by: Matt Topol <[email protected]>
…s in the R package directory (apache#14678) A first stab at documenting the post-Arrow-release/pre-CRAN submission process! Builds on excellent documentation on the Confluence page ( https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-UpdatingRpackages ) but in a more "checklisty" form to make sure we don't miss steps. Lead-authored-by: Dewey Dunnington <[email protected]> Co-authored-by: Dewey Dunnington <[email protected]> Co-authored-by: Nic Crane <[email protected]> Co-authored-by: Neal Richardson <[email protected]> Signed-off-by: Dewey Dunnington <[email protected]>
…ead of Jiras (apache#14903) Authored-by: Dominik Moritz <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
…b>.pc.in. (apache#14900) * Closes: apache#14869 Authored-by: Luke Elliott <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
…can decode without DictionaryHashTable (apache#14902) * Closes: apache#14901 Authored-by: 郭峰 <[email protected]> Signed-off-by: David Li <[email protected]>
…ption thrown (apache#14891) * Closes: apache#14890 Authored-by: 郭峰 <[email protected]> Signed-off-by: David Li <[email protected]>
Currently you get an error like "ArrowInvalid: Failed to parse string: '2021-01-02 00:00:00+01:00' as a scalar of type timestamp[s]expected no zone offset" Authored-by: Joris Van den Bossche <[email protected]> Signed-off-by: David Li <[email protected]>
…cache times out (apache#14850) Explicitly starting the sccache server prior to the compilation has removed the flakiness in my testing. * Closes: apache#14849 Authored-by: Jacob Wujciak-Jens <[email protected]> Signed-off-by: Nic Crane <[email protected]>
…4587) This fixes a bug which happens when a Vector has been created with multiple data blobs and at least one of them has been padded. See https://observablehq.com/d/14488c116b338560 for a reproduction of the error and more details. Lead-authored-by: Thomas Sarlandie <[email protected]> Co-authored-by: Dominik Moritz <[email protected]> Co-authored-by: Paul Taylor <[email protected]> Signed-off-by: Dominik Moritz <[email protected]>
…che#14887) Lead-authored-by: Nic Crane <[email protected]> Co-authored-by: Sutou Kouhei <[email protected]> Co-authored-by: Sutou Kouhei <[email protected]> Signed-off-by: Nic Crane <[email protected]>
Expose a `Dataset.filter` method that applies a filter to the dataset without actually loading it in memory. Addresses what was discussed in apache#13155 (comment) - [x] Update documentation - [x] Ensure the filtered dataset preserves the filter when writing it back - [x] Ensure the filtered dataset preserves the filter when joining - [x] Ensure the filtered dataset preserves the filter when applying standard `Dataset.something` methods. - [x] Allow to extend the filter by adding more conditions subsequently `dataset(filter=X).filter(filter=Y).scanner(filter=Z)` (related to apache#13409 (comment)) - [x] Refactor to use only `Dataset` class instead of `FilteredDataset` as discussed with @ jorisvandenbossche - [x] Add support in replace_schema - [x] Error in get_fragments in case a filter is set. - [x] Verify support in UnionDataset Lead-authored-by: Alessandro Molina <[email protected]> Co-authored-by: Joris Van den Bossche <[email protected]> Co-authored-by: Weston Pace <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Alessandro Molina <[email protected]>
This PR fixes some broken links and runs `devtools::document()` with the newest roxygen (7.2.2). * Closes: apache#14884 Authored-by: Dewey Dunnington <[email protected]> Signed-off-by: Dewey Dunnington <[email protected]>
* Closes: apache#14876 Authored-by: Matt Topol <[email protected]> Signed-off-by: Matt Topol <[email protected]>
…ffers" (apache#14915) The API is deprecated since v0.17 and the implementation has been removed in apache@ab93ba1 * Closes: apache#14916 Authored-by: Tao He <[email protected]> Signed-off-by: David Li <[email protected]>
Similarly as apache#14719, @ milesgranger has been contributing regularly over the last few months, both in PRs (https://github.com/apache/arrow/commits?author=milesgranger) and in issue triage. Adding him to the collaborators (triage role) enables him to do that on GitHub as well (disclaimer: Miles is a colleague of mine). Authored-by: Joris Van den Bossche <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
…apache#14806) See https://issues.apache.org/jira/browse/ARROW-18421 Lead-authored-by: LouisClt <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
…adata API (apache#13041) In ARROW-16131, C++ APIs were added so that users can read/write record batch custom metadata for IPC file. In this PR, pyarrow APIs are added so that python users can take advantage of these APIs to address ARROW-16430. Lead-authored-by: Yue Ni <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
Closes apache#14855. * Closes: apache#14855 Authored-by: David Li <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
`ArrowBuf#setOne` should have int64 params Authored-by: 郭峰 <[email protected]> Signed-off-by: David Li <[email protected]>
…ncoder and StructSubfieldEncoder (apache#14910) * Closes: apache#14909 Authored-by: 郭峰 <[email protected]> Signed-off-by: David Li <[email protected]>
Authored-by: Miles Granger <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
…pache#14832) Syncing after conda-forge/arrow-cpp-feedstock#875, which does quite a lot of things, see this [summary](conda-forge/arrow-cpp-feedstock#875 (review)). I'm not keeping the commit history here, but it might be instructive to check the commits there to see why certain changes came about. It also fixes the CI that was broken by apache@a3ef64b (undoing the changes of apache#14102 in `tasks.yml`). Finally, it adapts to conda making a long-planned [switch](conda-forge/conda-forge.github.io#1586) w.r.t. the format / extension of the artefacts it produces. I'm very likely going to need some help (or at least pointers) for the R-stuff. CC @ xhochy (for context, I never got a response to conda-forge/r-arrow-feedstock#55, but I'll open a PR to build against libarrow 10). Once this is done, I can open issues to tackle the tests that shouldn't be failing, resp. the segfaults on PPC resp. in conjunction with `sparse`. * Closes: apache#14828 Authored-by: H. Vetinari <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
…e group_by/summarise statements are used (apache#14905) Reprex using CRAN arrow: ``` r library(arrow, warn.conflicts = FALSE) library(dplyr, warn.conflicts = FALSE) mtcars |> arrow_table() |> select(mpg, cyl) |> group_by(mpg, cyl) |> group_by(cyl, value = "foo") |> collect() #> # A tibble: 32 × 4 #> # Groups: cyl, value [3] #> mpg cyl value `"foo"` #> <dbl> <dbl> <dbl> <chr> #> 1 21 6 6 foo #> 2 21 6 6 foo #> 3 22.8 4 4 foo #> 4 21.4 6 6 foo #> 5 18.7 8 8 foo #> 6 18.1 6 6 foo #> 7 14.3 8 8 foo #> 8 24.4 4 4 foo #> 9 22.8 4 4 foo #> 10 19.2 6 6 foo #> # … with 22 more rows ``` <sup>Created on 2022-12-09 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup> After this PR: ``` r library(arrow, warn.conflicts = FALSE) #> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information. library(dplyr, warn.conflicts = FALSE) mtcars |> arrow_table() |> select(mpg, cyl) |> group_by(mpg, cyl) |> group_by(cyl, value = "foo") |> collect() #> # A tibble: 32 × 3 #> # Groups: cyl, value [3] #> mpg cyl value #> <dbl> <dbl> <chr> #> 1 21 6 foo #> 2 21 6 foo #> 3 22.8 4 foo #> 4 21.4 6 foo #> 5 18.7 8 foo #> 6 18.1 6 foo #> 7 14.3 8 foo #> 8 24.4 4 foo #> 9 22.8 4 foo #> 10 19.2 6 foo #> # … with 22 more rows ``` <sup>Created on 2022-12-09 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup> * Closes: apache#14872 Authored-by: Dewey Dunnington <[email protected]> Signed-off-by: Dewey Dunnington <[email protected]>
…pache#14746) Authored-by: aandres <[email protected]> Signed-off-by: Joris Van den Bossche <[email protected]>
…d PATs (apache#14928) The old variant, with the token passed as the name, only works for classic PATs; passing the token as the password works for both classic and fine-grained PATs. * Closes: apache#14927 Authored-by: Jacob Wujciak-Jens <[email protected]> Signed-off-by: Raúl Cumplido <[email protected]>
…ache#14355) See: [ARROW-17932](https://issues.apache.org/jira/browse/ARROW-17932). Adds a `json::StreamingReader` class (modeled after `csv::StreamingReader`) with an async-reentrant interface and support for parallel block decoding. Some parts of the existing `TableReader` implementation have been refactored to utilize the new facilities. Lead-authored-by: benibus <[email protected]> Co-authored-by: Ben Harkins <[email protected]> Co-authored-by: Will Jones <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
…he#14803) Basically this patch has defined the interface of `ColumnIndex` and `OffsetIndex`. Implementation classes are also provided to deserialize byte stream, wrap thrift message and provide access to their attributes. BTW, the naming style throughout the code base looks confusing to me. I have tried to follow what I have understood from the parquet sub-directory. Please correct me if anything is incorrect. Lead-authored-by: Gang Wu <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
…#14921) With the added benchmark runs, the Go workflows occasionally take longer than the 30-minute limit, so we need to increase the workflow timeout. Authored-by: Matt Topol <[email protected]> Signed-off-by: Matt Topol <[email protected]>
…pache#33691) * Regex for removing HTML comments was pathologically slow because of greedy pattern matching * Output of regex replacement was ignored (!) * Collapse extraneous newlines in generated commit message * Improve debugging output * Closes: apache#33687 Authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
…apache#33700) * Closes: apache#15243 Authored-by: Weston Pace <[email protected]> Signed-off-by: Weston Pace <[email protected]>
Fixes a link and removes a reference to "feather" that was sitting front and centre. * Closes: apache#33705 Lead-authored-by: Dewey Dunnington <[email protected]> Co-authored-by: Dewey Dunnington <[email protected]> Co-authored-by: Nic Crane <[email protected]> Signed-off-by: Dewey Dunnington <[email protected]>
…3693) This PR removes the `keep` argument from the test for `semi_join()`, which was causing the unit tests to fail. It also removes the `suffix` argument (which is not part of the dplyr function signature) from the function signature here. Closes: apache#33666 Authored-by: Nic Crane <[email protected]> Signed-off-by: Dewey Dunnington <[email protected]>
) ### Rationale for this change If we have Homebrew's RE2, we may mix re2.h from Homebrew's RE2 and the bundled RE2. If we mix re2.h and libre2.a, we may generate wrong re2::RE2::Options, which may crash our program. ### What changes are included in this PR? Ensure removing Homebrew's RE2. ### Are these changes tested? Yes. ### Are there any user-facing changes? No. * Closes: apache#25633 Authored-by: Sutou Kouhei <[email protected]> Signed-off-by: Jacob Wujciak-Jens <[email protected]>
This closes apache#15265 * Closes: apache#15265 Authored-by: Dongjoon Hyun <[email protected]> Signed-off-by: Jacob Wujciak-Jens <[email protected]>
…-null (apache#14814) The C data interface may expose null data pointers for zero-sized buffers. Make sure that all buffer pointers remain non-null internally. Followup to apacheGH-14805 * Closes: apache#14875 Authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
…ay offset (apache#15210) * Closes: apache#20512 Lead-authored-by: Will Jones <[email protected]> Co-authored-by: Joris Van den Bossche <[email protected]> Signed-off-by: Jacob Wujciak-Jens <[email protected]>
…ature more closely matching read_csv_arrow (apache#33614) This PR implements a wrapper around `open_dataset()` specifically for value-delimited files. It takes the parameters from `open_dataset()` and appends the parameters of `read_csv_arrow()` that are compatible with `open_dataset()`. This should make it easier for users to switch between the two, e.g.: ``` r library(arrow) library(dplyr) # Set up directory for examples tf <- tempfile() dir.create(tf) on.exit(unlink(tf)) df <- data.frame(x = c("1", "2", "NULL")) file_path <- file.path(tf, "file1.txt") write.table(df, file_path, sep = ",", row.names = FALSE) read_csv_arrow(file_path, na = c("", "NA", "NULL"), col_names = "y", skip = 1) #> # A tibble: 3 × 1 #> y #> <int> #> 1 1 #> 2 2 #> 3 NA open_csv_dataset(file_path, na = c("", "NA", "NULL"), col_names = "y", skip = 1) %>% collect() #> # A tibble: 3 × 1 #> y #> <int> #> 1 1 #> 2 2 #> 3 NA ``` This PR also hooks up the "na" (readr-style) parameter to "null_values" (i.e. the CSVConvertOptions parameter). In the process of making this PR, I also refactored `CsvFileFormat$create()`. Unfortunately, many changes needed to be made at once, which considerably increased the size/complexity of this PR. Authored-by: Nic Crane <[email protected]> Signed-off-by: Nic Crane <[email protected]>
…h new style GitHub issues and old style JIRA issues (apache#33615) I've decided to do all the archery release tasks on a single PR: * Closes: apache#14997 * Closes: apache#14999 * Closes: apache#15002 Authored-by: Raúl Cumplido <[email protected]> Signed-off-by: Raúl Cumplido <[email protected]>
I am not updating the flatbuffers dependency yet since it requires rebuilding the protobufs. Authored-by: Dominik Moritz <[email protected]> Signed-off-by: Neal Richardson <[email protected]>
lriggs pushed a commit to lriggs/arrow that referenced this pull request on Dec 27, 2024
…n timezone (apache#45051) ### Rationale for this change If the timezone database is present on the system, but does not contain a timezone referenced in an ORC file, the ORC reader will crash with an uncaught C++ exception. This can happen, for example, on Ubuntu 24.04, where some timezone aliases have been moved from the main `tzdata` package to a `tzdata-legacy` package. If `tzdata-legacy` is not installed, trying to read an ORC file that references e.g. the "US/Pacific" timezone would crash. Here is a backtrace excerpt: ``` #12 0x00007f1a3ce23a55 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6 #13 0x00007f1a3ce39391 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6 #14 0x00007f1a3f4accc4 in orc::loadTZDB(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900 #15 0x00007f1a3f4ad392 in std::call_once<orc::LazyTimezone::getImpl() const::{lambda()#1}>(std::once_flag&, orc::LazyTimezone::getImpl() const::{lambda()#1}&&)::{lambda()#2}::_FUN() () from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900 #16 0x00007f1a4298bec3 in __pthread_once_slow (once_control=0xa5ca7c8, init_routine=0x7f1a3ce69420 <__once_proxy>) at ./nptl/pthread_once.c:116 #17 0x00007f1a3f4a9ad0 in orc::LazyTimezone::getEpoch() const () from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900 #18 0x00007f1a3f4e76b1 in orc::TimestampColumnReader::TimestampColumnReader(orc::Type const&, orc::StripeStreams&, bool) () from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900 #19 0x00007f1a3f4e84ad in orc::buildReader(orc::Type const&, orc::StripeStreams&, bool, bool, bool) () from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900 #20 0x00007f1a3f4e8dd7 in orc::StructColumnReader::StructColumnReader(orc::Type const&, orc::StripeStreams&, bool, bool) () from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900 #21 0x00007f1a3f4e8532 in orc::buildReader(orc::Type const&, orc::StripeStreams&, bool, bool, bool) () from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900 #22 0x00007f1a3f4925e9 in orc::RowReaderImpl::startNextStripe() () from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900 #23 0x00007f1a3f492c9d in orc::RowReaderImpl::next(orc::ColumnVectorBatch&) () from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900 #24 0x00007f1a3e6b251f in arrow::adapters::orc::ORCFileReader::Impl::ReadBatch(orc::RowReaderOptions const&, std::shared_ptr<arrow::Schema> const&, long) () from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900 ``` ### What changes are included in this PR? Catch C++ exceptions when iterating ORC batches instead of letting them slip through. ### Are these changes tested? Yes. ### Are there any user-facing changes? No. * GitHub Issue: apache#40633 Authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
Pushed a new branch based on rel-2300 that was then merged with the apache/arrow 'apache-arrow-11.0.0' tag.