
Improved table functions #13627

Merged: 20 commits into apache:master from paul-rogers:210103-table-fns on Jan 17, 2023

Conversation

paul-rogers (Contributor)

Description

Another step toward the Druid catalog, this one providing extended functionality to the table functions introduced in an earlier PR.

  • Revises properties and definitions in the catalog
  • Adds a "table function" abstraction to model such functions
  • Adds specific functions for HTTP, inline, local, and S3 input sources
  • Adds extended SQL types in the catalog, primarily for later use, but also used to handle types for external tables
  • Restructures external table definitions to use table functions
  • Adds EXTEND syntax for Druid's extern table function (see the sketch after this list)
  • Adds support for array-valued table function parameters
  • Adds support for array-valued SQL query parameters
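
For instance, with the EXTEND syntax an extern call can carry its schema inline. A minimal sketch, assuming the local input source and CSV input format JSON specs; see reference.md in this PR for the authoritative syntax:

    -- Ad-hoc extern: input source JSON, input format JSON,
    -- and the SQL schema supplied via EXTEND.
    SELECT x, y
    FROM TABLE(extern(
        '{"type": "local", "files": ["/tmp/data.csv"]}',
        '{"type": "csv", "columns": ["x", "y"]}'))
      EXTEND (x VARCHAR, y BIGINT)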

Prior to this PR, MSQ did not support query parameters. This PR starts to provide that support; some of the required functionality will follow in the planner-focused catalog PR.

Release note

See the reference.md documentation page in this PR for the revised functionality. Of note are the new table functions and the array-valued table function parameters. This functionality allows the user to, say, reuse the same SQL for every ingest, passing in a different set of input files as query parameters on each run.
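
As an illustrative sketch of that reuse (the localfiles function name and its files parameter are assumptions patterned on the functions listed above, not verbatim from reference.md):

    -- The same ingest SQL is reused on every run; only the
    -- array-valued query parameter (?) changes between runs.
    INSERT INTO myTable
    SELECT *
    FROM TABLE(localfiles(files => ?))
      EXTEND (x VARCHAR, y BIGINT)
    PARTITIONED BY ALL TIME

Here ? would be bound to an ARRAY of VARCHAR (the input file paths) when the query is submitted.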


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@paul-rogers (Contributor, Author)

To anyone who wants to review: here is a summary of this PR, adapted from an overview provided via a private channel.

Let's assume you are already familiar with the model layer from last time. The key changes here make table functions more generic. First, there is a table function abstraction: something that has a set of parameter definitions, accepts a set of arguments, and returns an external table definition. The table function replaces the property annotations from the prior PR, so the annotation code was removed. Before, there was a definition for each kind of external table, which was too restrictive. Now there is a generic external table definition that has two "parts": one is a definition of the input source; the other is a list of possible input formats.
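A sketch of how the two parts surface in SQL, using the inline function this PR adds (the data and format parameter names are assumptions): the input source part supplies the rows, and the format part says how to parse them.

    SELECT *
    FROM TABLE(inline(data => 'a,b,1', format => 'csv'))
      EXTEND (x VARCHAR, y VARCHAR, z BIGINT)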

Each input source has its own metadata class which says which properties are available (as before), but also says which table functions are available. There are three kinds of functions: (1) an "ad-hoc" (from scratch) function that is a fancy form of the existing extern function; (2) a "partial" function, where some of the data resides in the catalog and the rest is given by the query, using the table name as a function name; and (3) a "complete" function, where the catalog table name can be used either as a table or as a zero-argument table function. The input source metadata specifies the various properties, how function arguments are mapped to properties, and how to merge catalog and argument values. Formats are similar, but are shared by all input sources.
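In SQL terms, the three kinds might look like this. A hedged sketch: the ext schema, the catalog table names, and the http parameter names are hypothetical.

    -- 1) Ad-hoc: everything is spelled out in the query itself.
    SELECT *
    FROM TABLE(http(uris => ARRAY['https://example.com/a.csv'],
                    format => 'csv'))
      EXTEND (x VARCHAR, y BIGINT)

    -- 2) Partial: the catalog holds, say, connection details and the
    --    schema; the query supplies the rest by using the table name
    --    as a function name.
    SELECT *
    FROM TABLE(ext.httpInput(uris => ARRAY['https://example.com/a.csv']))

    -- 3) Complete: the catalog entry is fully specified, so the name
    --    works as a plain table (or as a zero-argument table function).
    SELECT * FROM ext.fullInput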

Given this, the second part is the SQL layer: the table functions are redone to use the table function abstraction and the input source definition. Basically, the HTTP input source function holds onto the HTTP input source definition and asks that definition for the table function definition, which then provides the list of parameters and processes the actual call. Given that, the related code is essentially one big piece: it won't work to use, say, the old Calcite code with the new model code or vice versa.

You can start here: server/src/main/java/org/apache/druid/catalog/model/table. That's the core. Then, spiral outwards. server/src/main/java/org/apache/druid/catalog/model has a bunch of supporting changes. sql/src/main/java/org/apache/druid/sql/calcite/external is the Calcite integration for the table functions. Most of this stuff is either a complete mystery (because you don't know the bizarre ways that Calcite handles table macros) or trivial (because it builds on the code mentioned above). Another big wad of files is the corresponding tests.

Oh, and if you want to know what all this code actually does, look at the reference.md documentation file.

A follow-on PR will add catalog integration to the Calcite planner so that we incorporate catalog information when planning an MSQ query. There are some bits toward that goal in this PR (where they overlap with external tables). However, to keep this PR from growing even bigger, the planner work, which uses table specs from the catalog, comes later.

@zachjsh self-requested a review January 5, 2023 18:12
* `bucket` (`VARCHAR`) - Corresponds to the `bucket` field of the `objects` JSON field. SQL
does not have syntax for an array of objects. Instead, this function takes a single bucket,
and one or more objects within that bucket.
* `paths` (`ARRAY` of `VARCHAR`) - Corresponds to the `path` fields of the `object` JSON field.
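
To make the bucket-plus-paths split concrete, a hedged sketch (bucket and paths as described above; the S3 function's format argument and the EXTEND clause are assumptions):

    SELECT *
    FROM TABLE(S3(bucket => 'my-bucket',
                  paths => ARRAY['dir/a.csv', 'dir/b.csv'],
                  format => 'csv'))
      EXTEND (x VARCHAR, y BIGINT)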
Contributor:

Should we add the accessKeyId, secretAccessKey, and assumeRoleArn here as some of the examples below use them?

Contributor (Author):

They already appear? Check down two paragraphs. The intro to this paragraph says, "The function accepts the following parameters to specify the objects to read" while the later one says "also accepts the following security properties".

The string constant can also include any of the keywords mentioned above:

- `HOUR` - Same as `'PT1H'`
- `DAY` - Same as `'P1D'`
Contributor:

is WEEK an option here too?

Contributor (Author):

Removed this change because it will come in a later PR. Any of Druid's valid strings will work. However, WEEK isn't a valid SQL keyword, so it can't be used in the syntax shown earlier.

Contributor:

Please don't make WEEK an option. Kill WEEK, it's not commutable with any other granularity larger than it (MONTH, YEAR, etc.) and generates oddities when used.

+ "Please use an equivalent of these granularities: %s.",
Arrays.stream(GranularityType.values())
.filter(granularityType -> !granularityType.equals(GranularityType.NONE))
.map(Enum::name)
@zachjsh (Contributor) commented Jan 10, 2023:

Should this return the enum name as-is, or the Period string (e.g., would this return MINUTE, or PT1M)? If the user provides the granularity enum name on the next go, it seems that it will fail the validation here, as it won't be a valid Period?

Contributor (Author):

This is an error message, copied from somewhere in MSQ. The error message shouldn't get copied into the spec (unless I've done something very wrong).

There was a comment in an early PR that someone didn't like two ways to specify granularities, and we wanted only the ISO 8601 format. So, the user should use that. Easy enough to fix if we want to do something else.

@abhishekrb19 (Contributor) left a comment:

@paul-rogers, thanks for the detailed write-up and unit tests. This feature looks pretty good! Checkpointing a first pass that mostly contains nits.

@paul-rogers merged commit 22630b0 into apache:master Jan 17, 2023
@paul-rogers deleted the 210103-table-fns branch January 17, 2023 16:42
@maytasm (Contributor) commented Jan 17, 2023:

@paul-rogers
Is this PR supposed to be merged?
I don't see any approval and just wanted to make sure that this is not a mistake.

@paul-rogers (Contributor, Author):

@maytasm, thanks for checking. @zachjsh had given his approval, but I see it wasn't actually recorded here. @zachjsh, can you do the honors? Thanks!

@paul-rogers changed the title from "Much improved table functions" to "Improved table functions" Jan 18, 2023
@zachjsh (Contributor) commented Jan 18, 2023:

@paul-rogers, hmm, it doesn't seem to give me the ability to approve the PR anymore 😢. But I do approve of the change.

@clintropolis added this to the 26.0 milestone Apr 10, 2023