Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DISTINCT was added unexpectly #428

Closed
Oceania2018 opened this issue Jan 27, 2022 · 50 comments
Closed

DISTINCT was added unexpectly #428

Oceania2018 opened this issue Jan 27, 2022 · 50 comments

Comments

@Oceania2018
Copy link

Oceania2018 commented Jan 27, 2022

Seriously degraded query performance due to the addition of the Distinct keyword.
There are 2M records in table ContentItemIndex.

It's shown as false

image

image

After ran var pageOfContentItems = await query.Skip(pager.GetStartIndex()).Take(pager.PageSize).ListAsync(_contentManager);
It's changed to true.

image

Might be same issue with #407

@sebastienros
Copy link
Owner

How many items in the page?
What makes you think DISTINCT is the issue? Have you tried the query in SQL Server without it, and if so what were the results with and without it?

@Oceania2018
Copy link
Author

@sebastienros Below is the comparison:

image

@sebastienros
Copy link
Owner

Before I dive into it, I remember the distinct was added because some indexes might results in the same content item multiple times, and it should count as one in pages.

@sebastienros
Copy link
Owner

Is there a way to do a distinct on a single column, like ID, can you check?

@sebastienros
Copy link
Owner

sebastienros commented Jan 27, 2022

SELECT [Document].* WHERE Document.Id IN
(
-- whole query with filters
SELECT DISTINCT [Document].[Id] ... FROM v1001_Document WHERE ...
)

@Oceania2018
Copy link
Author

Before I dive into it, I remember the distinct was added because some indexes might results in the same content item multiple times, and it should count as one in pages.

There is Latest = 1 added, so we don't need distinct anymore.

@Oceania2018
Copy link
Author

Is there a way to do a distinct on a single column, like ID, can you check?

This is original SQL

-- 2.3 million rows
-- With DISTINCT
SELECT DISTINCT [v1001_Document].*, [ContentItemIndex_a1].[ModifiedUtc] 
FROM [v1001_Document] 
INNER JOIN [v1001_ContentItemIndex] AS [ContentItemIndex_a1] 
	ON [ContentItemIndex_a1].[DocumentId] = [v1001_Document].[Id] 
WHERE [v1001_Document].[Type] = 'OrchardCore.ContentManagement.ContentItem, OrchardCore.ContentManagement.Abstractions' 
	AND  (([ContentItemIndex_a1].[Latest] = 1) AND [ContentItemIndex_a1].[ContentType] IN ('Test') ) 
ORDER BY [ContentItemIndex_a1].[ModifiedUtc] DESC 
OFFSET 30 ROWS FETCH NEXT 10 ROWS ONLY 

-- Without DISTINCT
SELECT [v1001_Document].*, [ContentItemIndex_a1].[ModifiedUtc] 
FROM [v1001_Document] 
INNER JOIN [v1001_ContentItemIndex] AS [ContentItemIndex_a1] 
	ON [ContentItemIndex_a1].[DocumentId] = [v1001_Document].[Id] 
WHERE [v1001_Document].[Type] = 'OrchardCore.ContentManagement.ContentItem, OrchardCore.ContentManagement.Abstractions' 
	AND  (([ContentItemIndex_a1].[Latest] = 1) AND [ContentItemIndex_a1].[ContentType] IN ('Test') ) 
ORDER BY [ContentItemIndex_a1].[ModifiedUtc] DESC 
OFFSET 30 ROWS FETCH NEXT 10 ROWS ONLY 

@sebastienros
Copy link
Owner

There is Latest = 1

This is specific to OC and a special query that doesn't have this issue. The DISTINCT here will work for many other queries. Might be an option to allow a caller to disable it too. But first I want to see if a better query can be built Can you try this one?

SELECT [v1001_Document].* FROM [v1001_Document] WHERE [v1001_Document].[Id] IN 
(
SELECT DISTINCT [v1001_Document].[Id]
FROM [v1001_Document] 
INNER JOIN [v1001_ContentItemIndex] AS [ContentItemIndex_a1] 
	ON [ContentItemIndex_a1].[DocumentId] = [v1001_Document].[Id] 
WHERE [v1001_Document].[Type] = 'OrchardCore.ContentManagement.ContentItem, OrchardCore.ContentManagement.Abstractions' 
	AND  (([ContentItemIndex_a1].[Latest] = 1) AND [ContentItemIndex_a1].[ContentType] IN ('Test') ) 
ORDER BY [ContentItemIndex_a1].[ModifiedUtc] DESC 
OFFSET 30 ROWS FETCH NEXT 10 ROWS ONLY 
)

@Oceania2018
Copy link
Author

image

@sebastienros
Copy link
Owner

Can you change to SELECT DISTINCT [v1001_Document].[Id], [ContentItemIndex_a1].[ModifiedUtc] then ?

@Oceania2018
Copy link
Author

image

@sebastienros
Copy link
Owner

Last attempt

SELECT [v1001_Document].* FROM [v1001_Document] WHERE [v1001_Document].[Id] IN (
  SELECT [v1001_Document].[Id] FROM (  
    SELECT DISTINCT [v1001_Document].[Id], [ContentItemIndex_a1].[ModifiedUtc]
    FROM [v1001_Document] 
    INNER JOIN [v1001_ContentItemIndex] AS [ContentItemIndex_a1] 
	    ON [ContentItemIndex_a1].[DocumentId] = [v1001_Document].[Id] 
    WHERE [v1001_Document].[Type] = 'OrchardCore.ContentManagement.ContentItem, OrchardCore.ContentManagement.Abstractions' 
	    AND  (([ContentItemIndex_a1].[Latest] = 1) AND [ContentItemIndex_a1].[ContentType] IN ('Test') ) 
    ORDER BY [ContentItemIndex_a1].[ModifiedUtc] DESC 
    OFFSET 30 ROWS FETCH NEXT 10 ROWS ONLY 
  )
)

Was thinking about doing other round-trips to the database automatically when duplicates are found. Will try that too, it would just be perfect in cases where there aren't any duplicates, and involves subsequent queries when there are too many of them. Could also switch to a Distinct query if it detects duplicates.

@Oceania2018
Copy link
Author

It's even slower:

image

@sebastienros
Copy link
Owner

Will do the optimistic approach then, with round-trips to the db when there are collisions. We could also use DISTINCT ultimately after too many round-trips. But maybe DISTINCT should be opt-in, or based on a configurable number of round-trips (1 meaning that any duplicate triggers a Distinct). Would be an argument of paging.

@Oceania2018
Copy link
Author

Oceania2018 commented Jan 27, 2022

Will do the optimistic approach then, with round-trips to the db when there are collisions. We could also use DISTINCT ultimately after too many round-trips. But maybe DISTINCT should be opt-in, or based on a configurable number of round-trips (1 meaning that any duplicate triggers a Distinct). Would be an argument of paging.

It's good to have an option argument.

@sebastienros
Copy link
Owner

Problem is that without distinct, something like Skip(10) will count the duplicates so it won't work, even with roundtrips, since these are not read so we can't account for collisions.

Right now I can only think about being able to opt-out of Distinct with a custom method on the query, we knowledge that the indexes are unique.

@Oceania2018
Copy link
Author

Oceania2018 commented Jan 27, 2022

There won't have duplicates if there are coresponding filters.
For example in OrchardCore:
WHERE Latest = 1 and Published = 1
It should be under controlled by outside caller.

@sebastienros
Copy link
Owner

Don't just think about Orchard and this specific query. YesSql has no idea that you set Latest=1 and Published=1 or that it means there can't be duplicates because of it. You as the developer is the only one to know. So either the Indexes can be marked "unique" and then YesSql would be able to infer the use of DISTINCT, or you call a specific method that will inform YesSql it can optimize the query because it's unique. Or call the opposite if all the queries are unique by default. PR almost ready.

@Oceania2018
Copy link
Author

I assume that yessql always runs together with other software/ system, so the caller can generally ensure the correctness of the data, yessql does not need to consider whether to use distinct most probably.
Of course I'm not an expert of yessql, in the end it's up to you to decide the best way to implement it.

@BenedekFarkas
Copy link

BenedekFarkas commented Aug 16, 2022

Hey, we are seeing this issue on a live site when listing only ~8000 (10 per page) items on the Admin Dashboard, with the site hosted on Azure with Azure SQL. An S0 tier DB can take as much as 18 seconds to load a page of 10 items above the first page. An S2 DB does the same in 4-6.5 seconds.

Running the same query on the same DB directly, but without distinct dramatically improves performance. Furthermore, loading all 8k items at once (without paging) is less than 1 second.

Here's the query:

SELECT DISTINCT
    [Document].*,
    [ContentItemIndex_a1].[ModifiedUtc]
FROM [Document]
INNER JOIN [ContentItemIndex] AS [ContentItemIndex_a1] ON [ContentItemIndex_a1].[DocumentId] = [Document].[Id]
WHERE
    [Document].[Type] = 'OrchardCore.ContentManagement.ContentItem, OrchardCore.ContentManagement.Abstractions'
    AND (([ContentItemIndex_a1].[ContentType] = 'CustomContentTypeName') AND ([ContentItemIndex_a1].[Latest] = 1))
ORDER BY [ContentItemIndex_a1].[ModifiedUtc] DESC OFFSET 490 ROWS FETCH NEXT 10 ROWS ONLY

What would you recommend for next steps?

@BenedekFarkas
Copy link

Maybe @deanmarcussen or @hishamco can also chime in? Please see my comment above.

@hishamco
Copy link
Contributor

Before I dig into the details, why we added Distinct implicitly? We can have as LINQ syntactic suger method, when it's needed

@deanmarcussen
Copy link
Collaborator

@sebastienros was working on the issue here #432

It needs resurrecting.

@lahma
Copy link

lahma commented Aug 23, 2022

Before I dig into the details, why we added Distinct implicitly? We can have as LINQ syntactic suger method, when it's needed

We should not confuse the in-memory distinct with database server's query level distinct - idea is to minimize payload on wire. So ideally nothing more than necessary is transferred to server client - latency and network hops kill performance (when done excessively).

I'm the famous dog flying a helicopter here as I'm not familiar with the insides, but by the looks of it - should always drop SQL level distinct handling if the select contains only columns from one table AND the column set contains row's unique identifier, id column ensure that all is unique.

@BenedekFarkas
Copy link

There original issue that induced this change is #202.

@hishamco and @lahma: I don't 100% understand the use case, but I think the duplicate entries mentioned in those index table might be a side effect of content versions.

@deanmarcussen thanks for pointing this out, I'll check that PR!

@BenedekFarkas
Copy link

BenedekFarkas commented Sep 2, 2022

I've done some testing locally and in Azure with the changes included in #432 and results are very promising! The obvious effect is that the DISTINCT clause is gone and navigating on admin listing pages (the one that lists 8000 content items) no longer suffers from excessive load times.
Instead, it remains low (few hundred ms depending one the DB tier) and remains mostly consistent with some natural variance.

We'll continue testing and let you know if we find anything else!

@BenedekFarkas
Copy link

So, elders of YesSQL, what do you think? Can anyone else verify those changes?

@BenedekFarkas
Copy link

@sebastienros any news about the PR? See my comment above with our testing results.

@MikeAlhayek
Copy link
Collaborator

@sebastienros I think the reason distinct is added is because sometimes we using With<SomeIndex>(x => x.ColumnName.IsIn(...) in that case the index SomeIndex could have multiple rows with the same documentId which will return the same document multiple time.

One way to solve this is by doing something like this instead

SELECT d.* FROM (
   SELECT d.Id AS DocumentId, COUNT(1) AS CountColumn
   FROM Document AS d 
   INNER JOIN -- whole query with filters
   GROUP BY d.Id
) AS s
INNER JOIN Document AS d ON d.Id = s.DocumentId

@BenedekFarkas
Copy link

BenedekFarkas commented Nov 30, 2022

Our investigation was performance-focused and mostly conducted by @wAsnk - he found that:

SELECT [Document].*, [ContentItemIndex_a1].[ModifiedUtc] 
FROM [Document] INNER JOIN [ContentItemIndex] AS [ContentItemIndex_a1] ON [ContentItemIndex_a1].[DocumentId] = [Document].[Id] 
WHERE [Document].[Type] = 'OrchardCore.ContentManagement.ContentItem, OrchardCore.ContentManagement.Abstractions'
	AND (([ContentItemIndex_a1].[ContentType] = 'Profile') 
	AND ([ContentItemIndex_a1].[Latest] = 1))
ORDER BY [ContentItemIndex_a1].[ModifiedUtc] DESC
OFFSET 500 ROWS FETCH NEXT 10 ROWS ONLY

has really bad performance without cache, which can be emulated with:

DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS

The optimized query, which performs as expected even with many content items is:

SELECT [Document].*, a1.[ModifiedUtc] FROM [Document] INNER JOIN (
	SELECT * FROM [ContentItemIndex] AS [ContentItemIndex_a1]
	WHERE [ContentItemIndex_a1].[ContentType] = 'Profile'
		AND [ContentItemIndex_a1].[Latest] = 1
	ORDER BY [ContentItemIndex_a1].[ModifiedUtc] DESC
	OFFSET 500 ROWS FETCH NEXT 10 ROWS ONLY)
AS a1 ON a1.[DocumentId] = [Document].[Id] 

But we need to check how this would behave regarding the duplicate results issue - thanks for @MikeAlhayek for chiming in about that!
Also, YesSQL experts, do you have an idea how much of the overall changes fall on YesSQL itself and if there's room for improvement in how OC uses YesSQL?

@sebastienros
Copy link
Owner

@BenedekFarkas your last comment is great, gives me an idea if what is best in terms of perfs.

I will try to summarize the problem and inherent issues, the solutions I have tried and what's left.

@MikeAlhayek is correct, when an index returns multiple values for a single document, an inner join with this index will return two records for the same document.

The problem is that if you iterate through the results, you will render the same document (a content item for instance) twice. But you just want is to display the list of documents that have the constraint on the index. One solution could be to do it in memory and do a "distinct". But it's bad AND wrong. Bad because you still download the content item N times, so perf is not good. Wrong beause if you paged your query for 10 items per page, you might get 2 times the same content item and your page with "distinct in memory" will only have 9 documents. If the same content item was duplicated 10 times you would have a page with 1 item ... This is very important because whatever solution is used needs also to work with COUNT, OFFSET and LIMIT.

So we want the "distinct" to happen on the db side.

The current performance issue is that a DISTINCT(Document.*) is used, which will do it for all fields of the paged result. Assuming the json document is also compared it might get very slow because there is a lot of data to compare, and because the field is not indexed, so everything needs to be loaded. Why not just do DISTINCT(Document.Id)? Because we can only return Id then. So we tried in a sub-select, but it's still slow: #428 (comment)

What works well is when using GroupBy like this

SELECT [tpDocument].* 
FROM [tpDocument] 
INNER JOIN [tpEmailByAttachment] AS [EmailByAttachment_a1] ON [EmailByAttachment_a1].[DocumentId] = [BobaFett].[tpDocument].[Id] 
WHERE ([EmailByAttachment_a1].[AttachmentName] like @p0) 
GROUP BY [tpDocument].[Id]

But it doesn't work for SQL Server (all others are fine) because MSSQL wants every field from the SELECT clause to be in an aggregate function when a GROUP BY is used.

In Orchard we don't have that many indexes that return multiple results in an index. We do have some that return nothing, and this is fine here. So we can't say "this works fine in Orchard" because it's just that we haven't triggered the problem yet, or maybe we'll find out eventually... But it's rare at best, so I was thinking of making this behavior optional. If you look at the PR I was working on you'll see a new GroupByDocument method that is used in the tests to verify that everything is working. This means that now there is no DISTINCT, but a GROUP BY, and that it's off by default. So it's super fast for most Orchard scenarios, and if/when we thing an index can have multiple results then we can call the method on the query to deduplicate at the db level. I think that's a good compromise (if only it worked in SQL SERVER already). And even better solution IMO would be to introduce a new type of IIndex. Like we have MapIndex and ReduceIndex we could have MultiMapIndex. Currently ReduceIndex maps one index to multiple documents, an MapIndex maps multiple indexes to one document. I think MapIndex should actually return one index per document, and this new MultiMapIndex could return multiple indexes. This way we could detect when a query joins a multi-map index, and inject the GROUP BY behavior. The refactoring would be simple in Orchard, just return an object instead of a list. And actually, this is already the case since Map() has two overloads, one for a single document and one for a list.

For those who want to see if they can help with the implementation, I'd suggest to start from my PR branch. The main code to update is

_query._queryState._sqlBuilder.GroupBy(_query._queryState._sqlBuilder.FormatColumn(_query._queryState._documentTable, "Id"));
which is invoked by GroupByDocument but this could be replace by anything. It should convert the current query to make deduplicate it.

And this should work when Take/Skip/Count are used. So even Count should be able to count "distinct" documents (for paging purposes) when this method has been called (or even if this methods has been called, right now it will clear the grouping, and the test is only passing because it does deduplication in memory for a test).

The two tests to look into are

  • ShouldReturnDistinctDocuments
  • ShouldReturnNonUniqueResultsWithDistinctOption

NB: Found some indexes in OC that return multiple results per document ...

In the meantime I'll continue working on it, and will be more that happy if someone implements a solution before that. I'd like to try to implement the suggested query too, maybe as a fist step (without deduplication).

@sebastienros
Copy link
Owner

I think a first step would be to enable the deduplicating behavior in all cases, and not have the GroupByDocument option already. This comes at a cost of performance, hopefully slight. After all this is what the current state is doing (with a DISTINCT clause). Then later on we can introduce the MultiMapIndex and only enable grouping when such an index is used in the query. But his behavior is only added when the query returns documents, if indexes are returned we don't need to deduplicate anything.

@sebastienros
Copy link
Owner

@BenedekFarkas your optimized query doesn't work for me on MSSQL, with The ORDER BY clause is invalid in views, inline functions, derived tables, subqueries, and common table expressions, unless TOP, OFFSET or FOR XML is also specified.

Unless we force an offset/top by default :/

@wAsnk
Copy link

wAsnk commented Dec 2, 2022

@sebastienros If you don't have offset in the subquery, then you can move the ORDER BY outside into the main query:

SELECT [Document].*, ci.[ModifiedUtc] FROM [Document] INNER JOIN (
	SELECT * FROM ContentItemIndex]
	WHERE [ContentType] = 'Profile'
		AND [Latest] = 1)
AS ci ON ci.[DocumentId] = [Document].[Id] 
ORDER BY ci.[ModifiedUtc] DESC

@sebastienros
Copy link
Owner

@wAsnk you should try to add GROUP BY clauses too
Here is something that is working for me right now, and shows the complex cases that need to be handled (ordering, filtering, deduplicating)

Based on your example it would look something like this. Maybe you could try to check the performance on your system.

SELECT [Document].* FROM [Document] INNER JOIN
(
    SELECT [Document].Id as [Id], MAX([ContentItemIndex_a1].[ModifiedUtc]) [ModifiedUtc]
    FROM [Document] 
    INNER JOIN [ContentItemIndex] AS [ContentItemIndex_a1] ON [ContentItemIndex_a1].[DocumentId] = [Document].[Id] 
    WHERE [ContentItemIndex_a1].[ContentType] = 'Profile' AND [ContentItemIndex_a1].[Latest] = 1
    GROUP BY [Document].[Id] 
    ORDER BY [ModifiedUtc] DESC
    OFFSET 490 ROWS FETCH NEXT 10 ROWS ONLY
) AS [IndexQuery] ON [IndexQuery].[Id] = [Document].[Id]

@wAsnk
Copy link

wAsnk commented Dec 2, 2022

@sebastienros This looks great! This is the result with a much bigger offset:
This new SQL, first without cache ~1656ms, then with cache ~15ms

Current SQL in OC, without cache ~29312ms, then with cache ~141ms

@MikeAlhayek
Copy link
Collaborator

MikeAlhayek commented Dec 2, 2022

@sebastienros I would use Count() over Max() I think you’ll have a slight better performance gain since the max check won’t be applied

SELECT [Document].* 
FROM [Document] 
INNER JOIN
(
    SELECT [Document].Id as [Id], COUNT(1) AS [Counter]
    FROM [Document] 
    INNER JOIN [ContentItemIndex] AS [ContentItemIndex_a1] ON [ContentItemIndex_a1].[DocumentId] = [Document].[Id] 
    WHERE [ContentItemIndex_a1].[ContentType] = 'Profile' AND [ContentItemIndex_a1].[Latest] = 1
    GROUP BY [Document].[Id] 
    ORDER BY [ModifiedUtc] DESC
    OFFSET 490 ROWS FETCH NEXT 10 ROWS ONLY
) AS [IndexQuery] ON [IndexQuery].[Id] = [Document].[Id]

@sebastienros
Copy link
Owner

sebastienros commented Dec 2, 2022

@MikeAlhayek you can't do ORDER BY [ModifiedUtc] DESC, it has to be on the aggregate that is in the SELECT clause, [Counter] in your example. And doing it fails any test that checks for ordering.

@MikeAlhayek
Copy link
Collaborator

I did not notice the offset and order in the sub query

    ORDER BY [ModifiedUtc] DESC
    OFFSET 490 ROWS FETCH NEXT 10 ROWS ONLY

Why would you add that logic into the subquery? How about this query instead?

SELECT [Document].* 
FROM [Document] 
INNER JOIN
(
	-- this is only needed if With<> is used
    SELECT [ContentItemIndex_a1].[DocumentId], COUNT(1) AS [Counter]
    FROM [ContentItemIndex] AS [ContentItemIndex_a1]
    WHERE [ContentItemIndex_a1].[ContentType] = 'Profile' AND [ContentItemIndex_a1].[Latest] = 1
    GROUP BY [ContentItemIndex_a1].[DocumentId]
) AS [IndexQuery] ON [IndexQuery].[DocumentId] = [Document].[Id]
ORDER BY [Document].[ModifiedUtc] DESC
OFFSET 490 ROWS FETCH NEXT 10 ROWS ONLY

@wAsnk
Copy link

wAsnk commented Dec 2, 2022

Why would you add that logic into the subquery? How about this query instead?

Because that gives a solid performance boost. It joins only the taken 10 rows to the document table where Document.* is selected, otherwise, it takes a significantly larger amount of time/resources.

Also ORDER BY [Document].[ModifiedUtc] DESC won't work in the main query, you don't have ModifiedUtc in the Document table.

@sebastienros
Copy link
Owner

😢 PostgresQL doesn't perserve the order from the Document table inner join. So is this working in MSSQL by chance or is it keeping the order by design.

@sebastienros
Copy link
Owner

All tests are passing now. I added a method to opt-out of deduplication, NoDuplicates(), which will completely remove the GROUP BY too when we know that the query and its indexes are not returning multiple documents. Later we can do some breaking changes that will allow us to define this at the index level and the query will be able to optimize this by itself.

@sebastienros
Copy link
Owner

I will merge it in main, but before releasing on NuGet can someone test the package from MyGet? At least in Orchard, or to check that the perf is better for these cases?

@BenedekFarkas
Copy link

Thanks, we'll test it tomorrow (in Azure too)!

@BenedekFarkas
Copy link

BenedekFarkas commented Dec 17, 2022

Sorry for delay, I just got around to focus on this again. Is OC 1.4.0 compatible with these changes or we need to upgrade 1.5.0 to be able to use the latest versions from MyGet (and the upcoming final release)?

@BenedekFarkas
Copy link

BenedekFarkas commented Dec 17, 2022

As far as I could tell 1.5.0 is required for this, so I did the upgrade.
its-working-star-wars

Performance is neat, as expected, even with DB cache clears! Awesome stuff!

@MikeAlhayek
Copy link
Collaborator

@BenedekFarkas this was not part of OC 1.5

@BenedekFarkas
Copy link

Yep, I know, what I meant is that OC 1.4 doesn't work with (it seems) with the latest YesSQL code, but 1.5 does. So not specifically in this release, but somewhere between the two there were breaking changes.

@BenedekFarkas
Copy link

@sebastienros thanks for the update, we're now using the NuGet version! There's no corresponding OC issue for upgrading YesSQL - should I create one?

@MikeAlhayek
Copy link
Collaborator

@BenedekFarkas OC 1.6-preview already has this upgrade OrchardCMS/OrchardCore#12959

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants