Unnest changes for moving the filter on right side of correlate to inside the unnest datasource #13934

Merged
merged 19 commits into apache:master on Mar 23, 2023

Conversation

@somu-imply (Contributor) commented Mar 14, 2023

Addresses the following:

  1. Removed the right filter transform.
  2. Since Calcite has already figured out the filter on top of the uncollect, that filter is now pushed inside the unnest data source.
  3. The rest of the filters on the left behave as before.
  4. Slight change in the filter rewrite logic, as we no longer have to iterate over the filter list to find filters on the unnested column.
  5. Fix for selector filters by removing the intermediate project above the DruidUnnestRel only when it is doing a CAST or LITERAL.
  6. Handling the special case of OR filters involving both the unnested column and regular left data source columns, which are planned on top of the Correlate.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

…inside the unnest cursors. Added tests for scenarios such as

1. filter on unnested column which involves a left filter rewrite
2. filter on unnested virtual column which pushes the filter to the right only and involves no rewrite
3. not filters
4. SQL functions applied on top of unnested column
5. null present in first row of the column to be unnested
…of project before the filter on top of uncollect
@imply-cheddar (Contributor) left a comment

I think there are some logic bugs still. Likely, we need a data set that has more rows and matches/skips various different rows based on different conditions.

{
  if (filter == null) {
    return;
  }

  final Set<String> requiredColumns = filter.getRequiredColumns();

  // Run filter post-correlate if it refers to any virtual columns.
  // Run filter post-unnest if it refers to any virtual columns.
Contributor:

Make this

// Run filter post-unnest if it refers to any virtual columns. This is a conservative judgement call
// that perhaps forces the code to use a ValueMatcher where an index would've been available,
// which can have real performance implications. This is an interim choice made to value correctness
// over performance. When we need to optimize this performance, we should be able to
// create a VirtualColumnDatasource that contains all of the virtual columns, in which case the query
// itself would stop carrying them and everything should be able to be pushed down.

Comment on lines 323 to 329
if (unnestFilter instanceof AndFilter) {
  for (Filter filter : ((AndFilter) unnestFilter).getFilters()) {
    filterSplitter.addPostFilterWithPreFilterIfRewritePossible(filter);
  }
} else {
  filterSplitter.add(queryFilter);
  filterSplitter.addPostFilterWithPreFilterIfRewritePossible(unnestFilter);
}
Contributor:

Is there a specific reason that we are breaking up the unnestFilter here? Or is this a copy-and-paste of the logic for the normal filters?

Contributor Author:

We break it up in the case of AND so that we can find out whether any of the individual filters can be added to the pre-filters through a rewrite. Adding to the pre-filters takes place inside the addPostFilterWithPreFilterIfRewritePossible method.
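
For illustration, a minimal sketch of that idea (the method names come from the snippets in this review; the preFilters/postFilters fields stand in for whatever lists the splitter actually accumulates):

  // Sketch only: for an AND, each conjunct can be handled on its own, since every
  // conjunct must hold anyway. A conjunct that can be rewritten onto the unnest input
  // column is added as a pre-filter (pushed into the base cursor); the original
  // conjunct is still kept as a post-filter so correctness never depends on the rewrite.
  void addPostFilterWithPreFilterIfRewritePossible(Filter filter)
  {
    final Filter rewritten = rewriteFilterOnUnnestColumnIfPossible(filter, inputColumn, inputColumnCapabilites);
    if (rewritten != null) {
      preFilters.add(rewritten);
    }
    postFilters.add(filter);
  }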

Contributor Author:

This can be made without breaking up, refactored

* @see CorrelateFilterRTransposeRule similar, but for right-hand side filters
* @see DruidFilterUnnestRule similar, but for right-hand side filters
Contributor:

Is this comment legitimate?

Contributor Author:

The filter unnest rule does handle the filters on the right, which sit on top of the uncollect, though.

Contributor:

Sorry, is this comment telling me that I should expect the DruidFilterUnnestRule to be the same code as this? I'm not sure why this is telling me to see that other one?

Contributor Author:

I think I removed it

Comment on lines 194 to 199
if (unnestFilter != null) {
  return virtualColumn.equals(that.virtualColumn) && unnestFilter.equals(that.unnestFilter)
         && base.equals(that.base);
} else {
  return virtualColumn.equals(that.virtualColumn) && base.equals(that.base);
}
Contributor:

This is an odd equals implementation. If the current object's unnestFilter is null, then the that object can have any unnest filter it wants?

Generally speaking, I suggest letting IntelliJ generate your equals and hashCode. (I'm assuming that perhaps you adjusted it yourself?)
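
For reference, a symmetric version along the lines of what IntelliJ would generate — assuming the enclosing class (written here as UnnestDataSource as a stand-in) holds the base, virtualColumn, and unnestFilter fields from the snippet above, and that java.util.Objects is imported:

  @Override
  public boolean equals(Object o)
  {
    if (this == o) {
      return true;
    }
    if (o == null || getClass() != o.getClass()) {
      return false;
    }
    // Objects.equals handles null on either side, so a null unnestFilter only
    // matches another null unnestFilter.
    final UnnestDataSource that = (UnnestDataSource) o;
    return Objects.equals(base, that.base)
           && Objects.equals(virtualColumn, that.virtualColumn)
           && Objects.equals(unnestFilter, that.unnestFilter);
  }

  @Override
  public int hashCode()
  {
    return Objects.hash(base, virtualColumn, unnestFilter);
  }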

Contributor Author:

yup, fixed

Comment on lines 323 to 324
List<Filter> preFilterList = new ArrayList<>();
List<Filter> postFilterList = new ArrayList<>();
Contributor:

The preFilterList and postFilterList here are very confusing and easy to mix up with the filterSplitter's filters... I'm not sure how to separate them exactly, but I almost thought that these were getting AND'd together because I confused them with the FilterSplitter filters.

List<Filter> postFilterList = new ArrayList<>();
if (queryFilter instanceof OrFilter) {
  for (Filter filter : ((OrFilter) queryFilter).getFilters()) {
    if (filter.getRequiredColumns().contains(outputColumnName)) {
      final Filter newFilter = rewriteFilterOnUnnestColumnIfPossible(filter, inputColumn, inputColumnCapabilites);
      if (newFilter != null) {
Contributor:

You are doing an if/else with a negative condition. Please invert it. When writing an if/else, always make the condition carry as few negations as possible.
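
A generic illustration of the inversion being asked for (hypothetical names, not the PR's code):

  // Before: the negated condition forces the reader to mentally flip both branches.
  if (!rewrittenFilters.isEmpty()) {
    pushDown(rewrittenFilters);
  } else {
    keepAsPostFilter();
  }

  // After: same behavior, positive condition first.
  if (rewrittenFilters.isEmpty()) {
    keepAsPostFilter();
  } else {
    pushDown(rewrittenFilters);
  }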

// any rows that do not match this filter at all.
preFilters.add(newFilter);
preFilterList.add(newFilter);
postFilterList.add(newFilter);
Contributor:

postFilter should be the original filter, I think?

Contributor Author:

This has been refactored with comments

preFilterList.add(newFilter);
postFilterList.add(newFilter);
} else {
postFilterList.add(filter);
Contributor:

If it's not rewritable, then I don't think the OR can be pushed down at all. If you only push down the parts that don't refer to the output of the unnest, you run the risk of not including some rows that should have been included. It should be relatively easy to test for this: with a query like MV_TO_STRING(unnested_dim, 'a') = 'b' OR other_dim = 'a', I don't think it will believe that it can remap it. If it pushes down other_dim = 'a' without attempting to push down the first part of the OR, it will not include a row where nested_dim is ["b"] and other_dim is "c".
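
For illustration, the all-or-nothing rule this comment describes could look roughly like this — a sketch, not the PR's final code; the variable and method names follow the snippet above, and the OrFilter list constructor mirrors the AndFilter usage later in this diff:

  // Sketch only: an OR can be pushed below the unnest only if every disjunct is
  // pushable, i.e. it either refers solely to base columns or can be rewritten from
  // the unnested output column onto the input column. Pushing a partial OR would
  // drop rows that match only the non-pushed disjunct.
  final List<Filter> pushableDisjuncts = new ArrayList<>();
  boolean allDisjunctsPushable = true;
  for (Filter filter : ((OrFilter) queryFilter).getFilters()) {
    if (filter.getRequiredColumns().contains(outputColumnName)) {
      final Filter newFilter = rewriteFilterOnUnnestColumnIfPossible(filter, inputColumn, inputColumnCapabilites);
      if (newFilter == null) {
        allDisjunctsPushable = false;
        break;
      }
      pushableDisjuncts.add(newFilter);
    } else {
      pushableDisjuncts.add(filter); // only references base columns, pushable as-is
    }
  }
  if (allDisjunctsPushable) {
    preFilters.add(new OrFilter(pushableDisjuncts)); // push the whole OR down
  }
  // The original OR always runs post-unnest so no incorrect rows slip through.
  postFilterList.add(queryFilter);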

Contributor Author:

Added test cases to mimic all the different cases

}
}
}
if (!preFilterList.isEmpty()) {
Contributor:

Another if/else on negative logic. Keep it positive please.

}

if (!postFilterList.isEmpty()) {
AndFilter andFilter = new AndFilter(postFilterList);
Contributor:

Why are these AND'd? Didn't they come from an OR?

Contributor Author:

refactored

@somu-imply (Contributor Author) commented Mar 15, 2023

Simplified the code for adding filters, added comments on how the rewriting is done, and changed some variable names to make things easier to read.

@somu-imply marked this pull request as ready for review March 16, 2023 01:45
Comment on lines 265 to 314
public void test_unnest_adapters_with_no_base_filter_active_unnest_filter()
{

  Sequence<Cursor> cursorSequence = UNNEST_STORAGE_ADAPTER2.makeCursors(
      null,
      UNNEST_STORAGE_ADAPTER2.getInterval(),
      VirtualColumns.EMPTY,
      Granularities.ALL,
      false,
      null
  );

  cursorSequence.accumulate(null, (accumulated, cursor) -> {
    ColumnSelectorFactory factory = cursor.getColumnSelectorFactory();

    DimensionSelector dimSelector = factory.makeDimensionSelector(DefaultDimensionSpec.of(OUTPUT_COLUMN_NAME));
    int count = 0;
    while (!cursor.isDone()) {
      Object dimSelectorVal = dimSelector.getObject();
      if (dimSelectorVal == null) {
        Assert.assertNull(dimSelectorVal);
      }
      cursor.advance();
      count++;
    }
    Assert.assertEquals(1, count);
    Filter unnestFilter = new SelectorDimFilter(OUTPUT_COLUMN_NAME, "1", null).toFilter();
    VirtualColumn vc = new ExpressionVirtualColumn(
        OUTPUT_COLUMN_NAME,
        "\"" + COLUMNNAME + "\"",
        null,
        ExprMacroTable.nil()
    );
    final String inputColumn = UNNEST_STORAGE_ADAPTER2.getUnnestInputIfDirectAccess(vc);
    Pair<Filter, Filter> filterPair = UNNEST_STORAGE_ADAPTER2.computeBaseAndPostUnnestFilters(
        null,
        unnestFilter,
        VirtualColumns.EMPTY,
        inputColumn,
        INCREMENTAL_INDEX_STORAGE_ADAPTER.getColumnCapabilities(inputColumn)
    );
    SelectorFilter left = ((SelectorFilter) filterPair.lhs);
    SelectorFilter right = ((SelectorFilter) filterPair.rhs);
    Assert.assertEquals(inputColumn, left.getDimension());
    Assert.assertEquals(OUTPUT_COLUMN_NAME, right.getDimension());
    Assert.assertEquals(right.getValue(), left.getValue());
    return null;
  });
}

Contributor:

The structuring of this test for the purposes of validating that the filters are being managed properly seems a bit off. All of the decisioning of what should be pushed down and what shouldn't will have been done when you call makeCursors and you should be able to see it from there.

Specifically, we want to validate that it chose to push down specific filters and that it attached other filters to a PostJoinCursor.

You can have a TestStorageAdapter that exists just to have makeCursors called on it. You can store the arguments it was called with and then return a Sequence<Cursor> where the Cursor objects are also just empty implementations.

Then, in the test, you can ask your TestStorageAdapter for which filters it was called with and validate that it's the ones you expect to be pushed down. You then take the cursors and validate that they are PostJoinCursor objects, then reach in and grab their filters (or it might need to be the ValueMatchers) and ensure that those filters are what you expect.
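
A rough sketch of such a test double (the class name is hypothetical, the makeCursors signature matches the call in the test above, and imports plus the remaining StorageAdapter overrides — which would simply delegate — are omitted):

  // Records what the unnest adapter pushes down, so the test can assert on the
  // pushed-down filter directly instead of re-driving internal helper methods.
  class CapturingStorageAdapter implements StorageAdapter
  {
    private final StorageAdapter delegate;
    @Nullable
    private Filter capturedFilter;

    CapturingStorageAdapter(StorageAdapter delegate)
    {
      this.delegate = delegate;
    }

    @Override
    public Sequence<Cursor> makeCursors(
        @Nullable Filter filter,
        Interval interval,
        VirtualColumns virtualColumns,
        Granularity gran,
        boolean descending,
        @Nullable QueryMetrics<?> queryMetrics
    )
    {
      this.capturedFilter = filter; // remember exactly what was pushed down
      return delegate.makeCursors(filter, interval, virtualColumns, gran, descending, queryMetrics);
    }

    @Nullable
    Filter getCapturedFilter()
    {
      return capturedFilter;
    }

    // ... all other StorageAdapter methods delegate to `delegate` ...
  }

In a test, the unnest adapter would be built on top of this capturing adapter; after calling makeCursors, the test asserts on getCapturedFilter() for the pushed-down part and inspects the returned PostJoinCursor for the rest.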

@somu-imply (Contributor Author) commented Mar 16, 2023

My goal is to see whether the lhs and rhs of the filter splitter are correct, because those are the parts that get pushed down or attached to the PostJoinCursor, and this kind of does that. I can also validate that the cursor I get after makeCursors is actually a PostJoinCursor. At the moment we do not have a method on the cursor to get the filter passed to it, so I was calling the method that modifies the filters in order to validate. Thanks for the feedback, I'll try to refactor the tests based on the suggestions.

@imply-cheddar (Contributor) commented Mar 16, 2023

Yeah, this test is calling the methods that the implementation should call. Essentially trying to mimic the actual implementation and making sure that it does the right thing. That's a fine test, if you assume that the base implementation will never change outside of these methods. But as soon as someone changes the implementation outside of one of these methods, the tests will no longer validate what we want to validate.

It's better to focus on testing the contract of the object. The contract in this case is that it is given some filters on makeCursors and some of those should be rewritten and passed down (to a delegate makeCursors call) and some of them should be placed on the return cursor for filtering. If you write the test to validate that contract, then it will be a cleaner test that is more helpful in the face of code changes. Fwiw, there are two reasons to have tests

  1. To give you confidence that the changes you are making work
  2. To give everybody on the project confidence that their changes work

Goal 2 is infinitely more important than number 1 and is best served by testing to contract and not implementation as much as possible.

Contributor Author:

Makes sense, I'm working on the refactoring

Comment on lines 96 to 102
final Project rightP = call.rel(0);
final SqlKind rightProjectKind = rightP.getChildExps().get(0).getKind();
final DruidUnnestRel unnestDatasourceRel = call.rel(1);

if (rightP.getProjects().size() == 1 && (rightProjectKind == SqlKind.CAST || rightProjectKind == SqlKind.LITERAL)) {
  call.transformTo(unnestDatasourceRel);
}
Contributor:

Should we do the check of whether or not this actually does a rewrite in the matches call instead of the onMatch call?
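
For illustration, moving that check into matches() might look roughly like this (a sketch of the shape being suggested, not the PR's final rule):

  @Override
  public boolean matches(RelOptRuleCall call)
  {
    // Only fire when the project is a single bare CAST or LITERAL over the unnest rel.
    final Project rightP = call.rel(0);
    if (rightP.getProjects().size() != 1) {
      return false;
    }
    final SqlKind kind = rightP.getProjects().get(0).getKind();
    return kind == SqlKind.CAST || kind == SqlKind.LITERAL;
  }

  @Override
  public void onMatch(RelOptRuleCall call)
  {
    // matches() has already vetted the project, so the transform is unconditional here.
    call.transformTo(call.rel(1));
  }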

Contributor Author:

Yes, I will follow up on this and add the matches code.

Contributor Author:

Added the matches part along with new examples

Comment on lines +4100 to +4133
public void testUnnestWithMultipleOrFiltersOnUnnestedColumnsAndOnOriginalColumn()
{
  skipVectorize();
  cannotVectorize();
  testQuery(
      "SELECT d3 FROM druid.numfoo, UNNEST(MV_TO_ARRAY(dim3)) as unnested (d3) where d3='b' or dim3='d' ",
      QUERY_CONTEXT_UNNEST,
      ImmutableList.of(
          Druids.newScanQueryBuilder()
                .dataSource(UnnestDataSource.create(
                    new TableDataSource(CalciteTests.DATASOURCE3),
                    expressionVirtualColumn("j0.unnest", "\"dim3\"", ColumnType.STRING),
                    null
                ))
                .intervals(querySegmentSpec(Filtration.eternity()))
                .resultFormat(ScanQuery.ResultFormat.RESULT_FORMAT_COMPACTED_LIST)
                .filters(
                    or(
                        selector("j0.unnest", "b", null),
                        selector("dim3", "d", null)
                    )
                )
                .legacy(false)
                .context(QUERY_CONTEXT_UNNEST)
                .columns(ImmutableList.of("j0.unnest"))
                .build()
      ),
      ImmutableList.of(
          new Object[]{"b"},
          new Object[]{"b"},
          new Object[]{"d"}
      )
  );
}
Contributor:

This test has got me wondering what will happen if you run

"SELECT d3 FROM druid.numfoo, UNNEST(MV_TO_ARRAY(dim3)) as unnested (d3) where d3='b' or dim3='a' "

instead. I expect that it should match all of the ["a", "b"] rows and thus be 4 results: a, b, a, b. But I'm curious.

Contributor Author:

It should be the rows that have a and b. The table has 3 such rows, so it would be 3 results: a, b, b.
(Screenshot of the query results from 2023-03-15 omitted.)

Contributor:

Oh, I thought there were 2 rows with ["a", "b"], my bad... Does it work the same if you invert it? If you say dim3='b' OR d3 = 'a'?

Btw, you might as well add these as tests as well. They might not be testing any new corners, but anything that's like "I wonder what happens if" is worth including.

Contributor Author:

Added :)

@cheddar merged commit 2ad133c into apache:master on Mar 23, 2023.
@somu-imply deleted the unnest_changes1 branch on March 23, 2023 at 03:22.