
Dremio 11 #21

Closed
wants to merge 958 commits into from

Conversation


@lriggs lriggs commented May 11, 2023

Pushing a new branch based on rel-2300 that was then merged with the apache/arrow 'apache-arrow-11.0.0' tag.

trxcllnt and others added 30 commits December 9, 2022 10:20
…14881)

* apache@730e9c5 updates `ts-jest` configuration to remove deprecation warnings
* apache@e4d83f2 updates `print-buffer-alignment.js`  debug utility for latest APIs
* apache@3b9d18c updates `arrow2csv` to print zero-based rowIds
* apache@b6c42f3 fixes apache#14791


* Closes: apache#14791

Authored-by: ptaylor <[email protected]>
Signed-off-by: Dominik Moritz <[email protected]>
…s in the R package directory (apache#14678)

A first stab at documenting the post-Arrow-release/pre-CRAN submission process! Builds on excellent documentation on the Confluence page ( https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-UpdatingRpackages ) but in a more "checklisty" form to make sure we don't miss steps.

Lead-authored-by: Dewey Dunnington <[email protected]>
Co-authored-by: Dewey Dunnington <[email protected]>
Co-authored-by: Nic Crane <[email protected]>
Co-authored-by: Neal Richardson <[email protected]>
Signed-off-by: Dewey Dunnington <[email protected]>
…ead of Jiras (apache#14903)

Authored-by: Dominik Moritz <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
…can decode without DictionaryHashTable (apache#14902)

* Closes: apache#14901

Authored-by: 郭峰 <[email protected]>
Signed-off-by: David Li <[email protected]>
Currently you get an error like "ArrowInvalid: Failed to parse string: '2021-01-02 00:00:00+01:00' as a scalar of type timestamp[s]expected no zone offset"

Authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: David Li <[email protected]>
…cache times out (apache#14850)

Explicitly starting the sccache server prior to the compilation has removed the flakiness in my testing. 
* Closes: apache#14849

Authored-by: Jacob Wujciak-Jens <[email protected]>
Signed-off-by: Nic Crane <[email protected]>
…4587)

This fixes a bug which happens when a Vector has been created with multiple data blobs and at least one of them has been padded.

See https://observablehq.com/d/14488c116b338560 for a reproduction of the error and more details.

Lead-authored-by: Thomas Sarlandie <[email protected]>
Co-authored-by: Dominik Moritz <[email protected]>
Co-authored-by: Paul Taylor <[email protected]>
Signed-off-by: Dominik Moritz <[email protected]>
…che#14887)

Lead-authored-by: Nic Crane <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Nic Crane <[email protected]>
Expose a `Dataset.filter` method that applies a filter to the dataset without actually loading it in memory.

Addresses what was discussed in apache#13155 (comment)

- [x] Update documentation
- [x] Ensure the filtered dataset preserves the filter when writing it back
- [x] Ensure the filtered dataset preserves the filter when joining
- [x] Ensure the filtered dataset preserves the filter when applying standard `Dataset.something` methods.
- [x] Allow extending the filter by adding more conditions subsequently: `dataset(filter=X).filter(filter=Y).scanner(filter=Z)` (related to apache#13409 (comment))
- [x]  Refactor to use only `Dataset` class instead of `FilteredDataset` as discussed with @ jorisvandenbossche 
- [x] Add support in replace_schema
- [x] Error in get_fragments in case a filter is set.
- [x] Verify support in UnionDataset
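
The chained-filter semantics in the checklist above (dataset, `.filter()`, and scanner conditions combining with AND) can be sketched in plain Python. This is an illustrative model only, not the pyarrow implementation; the class and method names are hypothetical:

```python
# Minimal sketch (hypothetical class, not the pyarrow implementation) of the
# filter-composition semantics described above: each .filter() call returns a
# new dataset whose filter is the AND of the old and new conditions.

class FilteredDatasetSketch:
    def __init__(self, rows, predicate=None):
        self.rows = rows
        self.predicate = predicate  # None means "no filter set"

    def filter(self, new_pred):
        if self.predicate is None:
            combined = new_pred
        else:
            old = self.predicate
            combined = lambda row: old(row) and new_pred(row)
        # Return a new object; the original dataset is left unchanged.
        return FilteredDatasetSketch(self.rows, combined)

    def to_table(self):
        # Rows are only materialized here, after all filters are combined.
        if self.predicate is None:
            return list(self.rows)
        return [r for r in self.rows if self.predicate(r)]

ds = FilteredDatasetSketch([{"x": i} for i in range(10)])
subset = ds.filter(lambda r: r["x"] > 2).filter(lambda r: r["x"] < 7)
print([r["x"] for r in subset.to_table()])  # [3, 4, 5, 6]
```

Each `.filter()` call returns a new object and leaves the original untouched, which mirrors the lazy, non-destructive behavior the checklist describes.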


Lead-authored-by: Alessandro Molina <[email protected]>
Co-authored-by: Joris Van den Bossche <[email protected]>
Co-authored-by: Weston Pace <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Alessandro Molina <[email protected]>
This PR fixes some broken links and runs `devtools::document()` with the newest roxygen (7.2.2).
* Closes: apache#14884

Authored-by: Dewey Dunnington <[email protected]>
Signed-off-by: Dewey Dunnington <[email protected]>
…ffers" (apache#14915)

The API has been deprecated since v0.17 and the implementation was removed in apache@ab93ba1

* Closes: apache#14916

Authored-by: Tao He <[email protected]>
Signed-off-by: David Li <[email protected]>
Similarly as apache#14719

@ milesgranger has been contributing regularly over the last few months, both in PRs (https://github.com/apache/arrow/commits?author=milesgranger) and in issue triage. Adding him to the collaborators (triage role) enables him to do that on GitHub as well (disclaimer: Miles is a colleague of mine).

Authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
…adata API (apache#13041)

In ARROW-16131, C++ APIs were added so that users can read/write record batch custom metadata for the IPC file format. In this PR, pyarrow APIs are added so that Python users can take advantage of them, addressing ARROW-16430.

Lead-authored-by: Yue Ni <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
`ArrowBuf#setOne` should have int64 params

Authored-by: 郭峰 <[email protected]>
Signed-off-by: David Li <[email protected]>
…ncoder and StructSubfieldEncoder (apache#14910)

* Closes: apache#14909

Authored-by: 郭峰 <[email protected]>
Signed-off-by: David Li <[email protected]>
Authored-by: Miles Granger <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
…pache#14832)

Syncing after conda-forge/arrow-cpp-feedstock#875, which does quite a lot of things; see this [summary](conda-forge/arrow-cpp-feedstock#875 (review)). I'm not keeping the commit history here, but it might be instructive to check the commits there to see why certain changes came about.

It also fixes the CI that was broken by apache@a3ef64b (undoing the changes of apache#14102 in `tasks.yml`).

Finally, it adapts to conda making a long-planned [switch](conda-forge/conda-forge.github.io#1586) w.r.t. the format / extension of the artefacts it produces.

I'm very likely going to need some help (or at least pointers) for the R-stuff. CC @ xhochy
(for context, I never got a response to conda-forge/r-arrow-feedstock#55, but I'll open a PR to build against libarrow 10).

Once this is done, I can open issues to tackle the tests that shouldn't be failing, as well as the segfaults on PPC and those in conjunction with `sparse`.
* Closes: apache#14828

Authored-by: H. Vetinari <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
…e group_by/summarise statements are used (apache#14905)

Reprex using CRAN arrow:

``` r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

mtcars |>
  arrow_table() |>
  select(mpg, cyl) |> 
  group_by(mpg, cyl) |>
  group_by(cyl, value = "foo") |>
  collect()
#> # A tibble: 32 × 4
#> # Groups:   cyl, value [3]
#>      mpg   cyl value `"foo"`
#>    <dbl> <dbl> <dbl> <chr>  
#>  1  21       6     6 foo    
#>  2  21       6     6 foo    
#>  3  22.8     4     4 foo    
#>  4  21.4     6     6 foo    
#>  5  18.7     8     8 foo    
#>  6  18.1     6     6 foo    
#>  7  14.3     8     8 foo    
#>  8  24.4     4     4 foo    
#>  9  22.8     4     4 foo    
#> 10  19.2     6     6 foo    
#> # … with 22 more rows
```

<sup>Created on 2022-12-09 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>

After this PR:

``` r
library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
library(dplyr, warn.conflicts = FALSE)

mtcars |>
  arrow_table() |>
  select(mpg, cyl) |> 
  group_by(mpg, cyl) |>
  group_by(cyl, value = "foo") |>
  collect()
#> # A tibble: 32 × 3
#> # Groups:   cyl, value [3]
#>      mpg   cyl value
#>    <dbl> <dbl> <chr>
#>  1  21       6 foo  
#>  2  21       6 foo  
#>  3  22.8     4 foo  
#>  4  21.4     6 foo  
#>  5  18.7     8 foo  
#>  6  18.1     6 foo  
#>  7  14.3     8 foo  
#>  8  24.4     4 foo  
#>  9  22.8     4 foo  
#> 10  19.2     6 foo  
#> # … with 22 more rows
```

<sup>Created on 2022-12-09 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>
* Closes: apache#14872

Authored-by: Dewey Dunnington <[email protected]>
Signed-off-by: Dewey Dunnington <[email protected]>
…d PATs (apache#14928)

The old variant with the token passed as the username only works for classic PATs; passing the token as the password works for both classic and fine-grained PATs.
* Closes: apache#14927

Authored-by: Jacob Wujciak-Jens <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>
…ache#14355)

See: [ARROW-17932](https://issues.apache.org/jira/browse/ARROW-17932).

Adds a `json::StreamingReader` class (modeled after `csv::StreamingReader`) with an async-reentrant interface and support for parallel block decoding.

Some parts of the existing `TableReader` implementation have been refactored to utilize the new facilities.

Lead-authored-by: benibus <[email protected]>
Co-authored-by: Ben Harkins <[email protected]>
Co-authored-by: Will Jones <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
…he#14803)

This patch defines the interfaces of `ColumnIndex` and `OffsetIndex`. Implementation classes are also provided to deserialize the byte stream, wrap the thrift messages, and provide access to their attributes.

BTW, the naming style throughout the code base looks confusing to me. I have tried to follow what I have understood from the parquet sub-directory. Please correct me if anything is incorrect.
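
As a rough illustration of what a `ColumnIndex` enables: per-page min/max statistics let a reader skip pages that cannot satisfy a predicate, without decoding them. The sketch below is hypothetical Python, not the C++ interface being added here:

```python
# Hypothetical sketch of page skipping via a Parquet-style column index:
# each page carries min/max statistics, so a range predicate only needs to
# read pages whose [min, max] interval overlaps the query range.
from dataclasses import dataclass

@dataclass
class PageStats:
    min_value: int
    max_value: int

def pages_to_read(column_index, lower, upper):
    """Return indices of pages whose [min, max] range overlaps [lower, upper]."""
    return [
        i for i, p in enumerate(column_index)
        if p.max_value >= lower and p.min_value <= upper
    ]

index = [PageStats(0, 9), PageStats(10, 19), PageStats(20, 29)]
print(pages_to_read(index, 12, 25))  # [1, 2] -- the first page is skipped
```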

Lead-authored-by: Gang Wu <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
…#14921)

With the added benchmark runs, the Go workflows occasionally take longer than 30 minutes, so we need to increase the workflow timeout.

Authored-by: Matt Topol <[email protected]>
Signed-off-by: Matt Topol <[email protected]>
pitrou and others added 23 commits January 18, 2023 09:32
…pache#33691)

* Regex for removing HTML comments was pathologically slow because of greedy pattern matching
* Output of regex replacement was ignored (!)
* Collapse extraneous newlines in generated commit message
* Improve debugging output
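
The greedy-matching pitfall in the first bullet is easy to demonstrate: a greedy `<!--.*-->` spans from the first comment opener to the last closer (and on large inputs can also force far more backtracking work), while the lazy `<!--.*?-->` stops at the nearest closer. A minimal Python sketch, not the actual merge-script code:

```python
# Greedy vs. lazy HTML-comment stripping. The greedy pattern swallows
# everything between the first "<!--" and the LAST "-->", deleting real
# content in between; the lazy pattern removes each comment separately.
import re

text = "a <!-- one --> b <!-- two --> c"
greedy = re.sub(r"<!--.*-->", "", text, flags=re.DOTALL)
lazy = re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)
print(greedy)  # 'a  c'      -- "b" was eaten along with both comments
print(lazy)    # 'a  b  c'   -- only the comments are removed
```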

* Closes: apache#33687

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
Fixes a link and removes a reference to "feather" that was sitting front and centre.
* Closes: apache#33705

Lead-authored-by: Dewey Dunnington <[email protected]>
Co-authored-by: Dewey Dunnington <[email protected]>
Co-authored-by: Nic Crane <[email protected]>
Signed-off-by: Dewey Dunnington <[email protected]>
…3693)

This PR removes the `keep` argument from the test for `semi_join()`, which was causing the unit tests to fail. It also removes the `suffix` argument (which is not part of the dplyr function signature) from the function signature here.

Closes: apache#33666

Authored-by: Nic Crane <[email protected]>
Signed-off-by: Dewey Dunnington <[email protected]>
)

### Rationale for this change

If Homebrew's RE2 is installed, we may mix re2.h from Homebrew's RE2 with the bundled RE2.
If we mix re2.h and libre2.a, we may generate a wrong re2::RE2::Options, which may crash our program.

### What changes are included in this PR?

Ensure removing Homebrew's RE2.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* Closes: apache#25633

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Jacob Wujciak-Jens <[email protected]>
This closes apache#15265
* Closes: apache#15265

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Jacob Wujciak-Jens <[email protected]>
…-null (apache#14814)

The C data interface may expose null data pointers for zero-sized buffers.
Make sure that all buffer pointers remain non-null internally.

Followup to apacheGH-14805

* Closes: apache#14875

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
…ay offset (apache#15210)

* Closes: apache#20512

Lead-authored-by: Will Jones <[email protected]>
Co-authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Jacob Wujciak-Jens <[email protected]>
…ature more closely matching read_csv_arrow (apache#33614)

This PR implements a wrapper around `open_dataset()` specifically for value-delimited files. It takes the parameters from `open_dataset()` and appends those parameters of `read_csv_arrow()` that are compatible with `open_dataset()`. This should make it easier for users to switch between the two, e.g.:

``` r
library(arrow)
library(dplyr)

# Set up directory for examples
tf <- tempfile()
dir.create(tf)
on.exit(unlink(tf))
df <- data.frame(x = c("1", "2", "NULL"))

file_path <- file.path(tf, "file1.txt")
write.table(df, file_path, sep = ",", row.names = FALSE)

read_csv_arrow(file_path, na = c("", "NA", "NULL"), col_names = "y", skip = 1)
#> # A tibble: 3 × 1
#>       y
#>   <int>
#> 1     1
#> 2     2
#> 3    NA

open_csv_dataset(file_path, na = c("", "NA", "NULL"), col_names = "y", skip = 1) %>% collect()
#> # A tibble: 3 × 1
#>       y
#>   <int>
#> 1     1
#> 2     2
#> 3    NA
```

This PR also hooks up the "na" (readr-style) parameter to "null_values" (i.e. CSVConvertOptions parameter).

In the process of making this PR, I also refactored `CsvFileFormat$create()`. Unfortunately, many changes needed to be made at once, which has considerably increased the size/complexity of this PR.

Authored-by: Nic Crane <[email protected]>
Signed-off-by: Nic Crane <[email protected]>
…h new style GitHub issues and old style JIRA issues (apache#33615)

I've decided to do all the archery release tasks on a single PR:
* Closes: apache#14997
* Closes: apache#14999
* Closes: apache#15002

Authored-by: Raúl Cumplido <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>
@lriggs lriggs closed this May 11, 2023
@lriggs lriggs reopened this May 11, 2023
@lriggs lriggs changed the base branch from dremio to dremio-11 May 11, 2023 23:33
lriggs and others added 2 commits May 11, 2023 16:33
I am not updating the flatbuffers dependency yet since it requires rebuilding the protobufs.

Authored-by: Dominik Moritz <[email protected]>
Signed-off-by: Neal Richardson <[email protected]>
@lriggs lriggs closed this Jul 28, 2023
lriggs pushed a commit to lriggs/arrow that referenced this pull request Dec 27, 2024
…n timezone (apache#45051)

### Rationale for this change

If the timezone database is present on the system but does not contain a timezone referenced in an ORC file, the ORC reader will crash with an uncaught C++ exception.

This can happen, for example, on Ubuntu 24.04, where some timezone aliases have been moved from the main `tzdata` package to a `tzdata-legacy` package. If `tzdata-legacy` is not installed, trying to read an ORC file that references e.g. the "US/Pacific" timezone would crash.

Here is a backtrace excerpt:
```
#12 0x00007f1a3ce23a55 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#13 0x00007f1a3ce39391 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#14 0x00007f1a3f4accc4 in orc::loadTZDB(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
   from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900
#15 0x00007f1a3f4ad392 in std::call_once<orc::LazyTimezone::getImpl() const::{lambda()#1}>(std::once_flag&, orc::LazyTimezone::getImpl() const::{lambda()#1}&&)::{lambda()#2}::_FUN() () from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900
#16 0x00007f1a4298bec3 in __pthread_once_slow (once_control=0xa5ca7c8, init_routine=0x7f1a3ce69420 <__once_proxy>) at ./nptl/pthread_once.c:116
#17 0x00007f1a3f4a9ad0 in orc::LazyTimezone::getEpoch() const ()
   from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900
#18 0x00007f1a3f4e76b1 in orc::TimestampColumnReader::TimestampColumnReader(orc::Type const&, orc::StripeStreams&, bool) ()
   from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900
#19 0x00007f1a3f4e84ad in orc::buildReader(orc::Type const&, orc::StripeStreams&, bool, bool, bool) ()
   from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900
#20 0x00007f1a3f4e8dd7 in orc::StructColumnReader::StructColumnReader(orc::Type const&, orc::StripeStreams&, bool, bool) ()
   from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900
#21 0x00007f1a3f4e8532 in orc::buildReader(orc::Type const&, orc::StripeStreams&, bool, bool, bool) ()
   from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900
#22 0x00007f1a3f4925e9 in orc::RowReaderImpl::startNextStripe() ()
   from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900
#23 0x00007f1a3f492c9d in orc::RowReaderImpl::next(orc::ColumnVectorBatch&) ()
   from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900
#24 0x00007f1a3e6b251f in arrow::adapters::orc::ORCFileReader::Impl::ReadBatch(orc::RowReaderOptions const&, std::shared_ptr<arrow::Schema> const&, long) ()
   from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900
```

### What changes are included in this PR?

Catch C++ exceptions when iterating ORC batches instead of letting them slip through.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* GitHub Issue: apache#40633

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>