DISTINCT was added unexpectedly #428
Comments
How many items in the page? |
@sebastienros Below is the comparison: |
Before I dive into it, I remember the distinct was added because some indexes might return the same content item multiple times, and it should count as one in pages. |
Is there a way to do a distinct on a single column, like ID, can you check? |
|
There is |
This is original SQL -- 2.3 million rows
-- With DISTINCT
SELECT DISTINCT [v1001_Document].*, [ContentItemIndex_a1].[ModifiedUtc]
FROM [v1001_Document]
INNER JOIN [v1001_ContentItemIndex] AS [ContentItemIndex_a1]
ON [ContentItemIndex_a1].[DocumentId] = [v1001_Document].[Id]
WHERE [v1001_Document].[Type] = 'OrchardCore.ContentManagement.ContentItem, OrchardCore.ContentManagement.Abstractions'
AND (([ContentItemIndex_a1].[Latest] = 1) AND [ContentItemIndex_a1].[ContentType] IN ('Test') )
ORDER BY [ContentItemIndex_a1].[ModifiedUtc] DESC
OFFSET 30 ROWS FETCH NEXT 10 ROWS ONLY
-- Without DISTINCT
SELECT [v1001_Document].*, [ContentItemIndex_a1].[ModifiedUtc]
FROM [v1001_Document]
INNER JOIN [v1001_ContentItemIndex] AS [ContentItemIndex_a1]
ON [ContentItemIndex_a1].[DocumentId] = [v1001_Document].[Id]
WHERE [v1001_Document].[Type] = 'OrchardCore.ContentManagement.ContentItem, OrchardCore.ContentManagement.Abstractions'
AND (([ContentItemIndex_a1].[Latest] = 1) AND [ContentItemIndex_a1].[ContentType] IN ('Test') )
ORDER BY [ContentItemIndex_a1].[ModifiedUtc] DESC
OFFSET 30 ROWS FETCH NEXT 10 ROWS ONLY |
This is specific to OC and a special query that doesn't have this issue. The DISTINCT here will work for many other queries. Might be an option to allow a caller to disable it too. But first I want to see if a better query can be built. Can you try this one?
SELECT [v1001_Document].*
FROM [v1001_Document]
WHERE [v1001_Document].[Id] IN
(
SELECT DISTINCT [v1001_Document].[Id]
FROM [v1001_Document]
INNER JOIN [v1001_ContentItemIndex] AS [ContentItemIndex_a1]
ON [ContentItemIndex_a1].[DocumentId] = [v1001_Document].[Id]
WHERE [v1001_Document].[Type] = 'OrchardCore.ContentManagement.ContentItem, OrchardCore.ContentManagement.Abstractions'
AND (([ContentItemIndex_a1].[Latest] = 1) AND [ContentItemIndex_a1].[ContentType] IN ('Test') )
ORDER BY [ContentItemIndex_a1].[ModifiedUtc] DESC
OFFSET 30 ROWS FETCH NEXT 10 ROWS ONLY
) |
Can you change to |
Last attempt
Was thinking about doing other round-trips to the database automatically when duplicates are found. Will try that too, it would just be perfect in cases where there aren't any duplicates, and involves subsequent queries when there are too many of them. Could also switch to a Distinct query if it detects duplicates. |
Will do the optimistic approach then, with round-trips to the db when there are collisions. We could also use DISTINCT ultimately after too many round-trips. But maybe DISTINCT should be opt-in, or based on a configurable number of round-trips (1 meaning that any duplicate triggers a Distinct). Would be an argument of paging. |
It's good to have an option argument. |
Problem is that without distinct, something like Skip(10) will count the duplicates so it won't work, even with round-trips, since these are not read so we can't account for collisions. Right now I can only think about being able to opt out of Distinct with a custom method on the query, when we know that the indexes are unique. |
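To make the Skip(10) problem concrete, here is a minimal sketch using Python's built-in sqlite3 as a stand-in for the real database. The schema (Document, TagIndex) and the helper names are hypothetical and are not YesSql's. Every document has two index rows, so an OFFSET applied to the joined rows skips only five distinct documents instead of ten.

```python
import sqlite3

def build_db():
    """Create a tiny in-memory database where every document has two index rows."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE Document (Id INTEGER PRIMARY KEY);
        CREATE TABLE TagIndex (DocumentId INTEGER, Tag TEXT);
    """)
    for doc_id in range(1, 21):
        con.execute("INSERT INTO Document (Id) VALUES (?)", (doc_id,))
        con.executemany(
            "INSERT INTO TagIndex VALUES (?, ?)",
            [(doc_id, "a"), (doc_id, "b")],
        )
    return con

def page_ids(con, distinct, skip, take):
    """Return the document ids of one page, with or without DISTINCT."""
    keyword = "DISTINCT" if distinct else ""
    sql = f"""
        SELECT {keyword} d.Id
        FROM Document AS d
        INNER JOIN TagIndex AS t ON t.DocumentId = d.Id
        ORDER BY d.Id
        LIMIT ? OFFSET ?
    """
    return [row[0] for row in con.execute(sql, (take, skip))]

con = build_db()
# Skip 10, take 10: the equivalent of Skip(10).Take(10) on the joined rows.
without_distinct = page_ids(con, distinct=False, skip=10, take=10)
with_distinct = page_ids(con, distinct=True, skip=10, take=10)
print(without_distinct)  # documents 6..10, each listed twice
print(with_distinct)     # documents 11..20, each listed once
```

Reading extra rows on the client cannot repair this, because the skipped rows are never returned and so their duplicates cannot be counted.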
There won't be any duplicates if there are corresponding filters. |
Don't just think about Orchard and this specific query. YesSql has no idea that you set Latest=1 and Published=1 or that it means there can't be duplicates because of it. You, as the developer, are the only one who knows. So either the indexes can be marked "unique" and then YesSql would be able to infer the use of DISTINCT, or you call a specific method that will inform YesSql it can optimize the query because it's unique. Or call the opposite if all the queries are unique by default. PR almost ready. |
I assume that |
Hey, we are seeing this issue on a live site when listing only ~8000 (10 per page) items on the Admin Dashboard, with the site hosted on Azure with Azure SQL. An S0 tier DB can take as much as 18 seconds to load a page of 10 items above the first page. An S2 DB does the same in 4-6.5 seconds. Running the same query on the same DB directly, but without distinct dramatically improves performance. Furthermore, loading all 8k items at once (without paging) is less than 1 second. Here's the query:
What would you recommend for next steps? |
Maybe @deanmarcussen or @hishamco can also chime in? Please see my comment above. |
Before I dig into the details, why we added |
@sebastienros was working on the issue here #432 It needs resurrecting. |
We should not confuse the in-memory distinct with the database server's query-level distinct. The idea is to minimize the payload on the wire, so ideally nothing more than necessary is transferred between server and client; latency and network hops kill performance (when done excessively). I'm the famous dog flying a helicopter here as I'm not familiar with the insides, but by the looks of it, SQL-level distinct handling should always be dropped if the select contains only columns from one table AND the column set contains the row's unique identifier: the id column ensures that every row is unique. |
The original issue that induced this change is #202. @hishamco and @lahma: I don't 100% understand the use case, but I think the duplicate entries mentioned in those index tables might be a side effect of content versions. @deanmarcussen thanks for pointing this out, I'll check that PR! |
I've done some testing locally and in Azure with the changes included in #432 and results are very promising! The obvious effect is that the DISTINCT clause is gone and navigating on admin listing pages (the one that lists 8000 content items) no longer suffers from excessive load times. We'll continue testing and let you know if we find anything else! |
So, elders of YesSQL, what do you think? Can anyone else verify those changes? |
@sebastienros any news about the PR? See my comment above with our testing results. |
@sebastienros I think the reason One way to solve this is by doing something like this instead
|
Our investigation was performance-focused and mostly conducted by @wAsnk - he found that:
has really bad performance without cache, which can be emulated with:
The optimized query, which performs as expected even with many content items is:
But we need to check how this would behave regarding the duplicate results issue - thanks for @MikeAlhayek for chiming in about that! |
@BenedekFarkas your last comment is great, it gives me an idea of what is best in terms of perf. I will try to summarize the problem and inherent issues, the solutions I have tried and what's left.
@MikeAlhayek is correct: when an index returns multiple values for a single document, an inner join with this index will return several records for the same document. The problem is that if you iterate through the results, you will render the same document (a content item for instance) twice. But what you want is to display the list of documents that satisfy the constraint on the index.
One solution could be to do it in memory and do a "distinct". But it's bad AND wrong. Bad because you still download the content item N times, so perf is not good. Wrong because if you paged your query for 10 items per page, you might get the same content item twice and your page with "distinct in memory" will only have 9 documents. If the same content item was duplicated 10 times you would have a page with 1 item ... This is very important because whatever solution is used needs also to work with
So we want the "distinct" to happen on the db side. The current performance issue is that a
What works well is when using
SELECT [tpDocument].*
FROM [tpDocument]
INNER JOIN [tpEmailByAttachment] AS [EmailByAttachment_a1] ON [EmailByAttachment_a1].[DocumentId] = [BobaFett].[tpDocument].[Id]
WHERE ([EmailByAttachment_a1].[AttachmentName] like @p0)
GROUP BY [tpDocument].[Id]
But it doesn't work for SQL Server (all others are fine) because MSSQL wants every field from the SELECT clause to be either in the GROUP BY clause or in an aggregate function when a GROUP BY is used.
In Orchard we don't have that many indexes that return multiple results per document. We do have some that return nothing, and this is fine here. So we can't say "this works fine in Orchard" because it's just that we haven't triggered the problem yet, or maybe we'll find out eventually... But it's rare at best, so I was thinking of making this behavior optional. If you look at the PR I was working on you'll see a new
For those who want to see if they can help with the implementation, I'd suggest starting from my PR branch. The main code to update is
yessql/src/YesSql.Core/Services/DefaultQuery.cs Line 1626 in 8a0af86
GroupByDocument but this could be replaced by anything. It should convert the current query to deduplicate it.
And this should work when Take/Skip/Count are used. So even
The two tests to look into are
NB: Found some indexes in OC that return multiple results per document ...
In the meantime I'll continue working on it, and will be more than happy if someone implements a solution before that. I'd like to try to implement the suggested query too, maybe as a first step (without deduplication). |
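The "page with 1 item" case described above can be reproduced with a small sketch. It uses Python's built-in sqlite3 as a stand-in database, and the schema (Document, AttachmentIndex) and function names are hypothetical. Document 1 has ten matching index rows, so the first page of ten joined rows collapses to a single document when deduplicated in memory, while a GROUP BY on the database side still fills the page.

```python
import sqlite3

PAGE_SIZE = 10

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Document (Id INTEGER PRIMARY KEY, Content TEXT);
    CREATE TABLE AttachmentIndex (DocumentId INTEGER, AttachmentName TEXT);
""")
for doc_id in range(1, 31):
    con.execute("INSERT INTO Document VALUES (?, ?)", (doc_id, f"doc {doc_id}"))
    con.execute("INSERT INTO AttachmentIndex VALUES (?, ?)", (doc_id, "report.pdf"))
# Document 1 gets nine more matching attachments, ten index rows in total.
con.executemany(
    "INSERT INTO AttachmentIndex VALUES (1, ?)",
    [(f"report-{i}.pdf",) for i in range(9)],
)

def page_in_memory_distinct(con, page):
    """Page the joined rows in SQL, then drop duplicates on the client."""
    rows = con.execute(
        """
        SELECT d.Id
        FROM Document AS d
        INNER JOIN AttachmentIndex AS a ON a.DocumentId = d.Id
        WHERE a.AttachmentName LIKE ?
        ORDER BY d.Id
        LIMIT ? OFFSET ?
        """,
        ("report%", PAGE_SIZE, page * PAGE_SIZE),
    ).fetchall()
    # dict.fromkeys keeps the first occurrence of each id, in order.
    return list(dict.fromkeys(row[0] for row in rows))

def page_db_dedup(con, page):
    """Deduplicate in SQL (GROUP BY the document id) before paging."""
    rows = con.execute(
        """
        SELECT d.Id
        FROM Document AS d
        INNER JOIN AttachmentIndex AS a ON a.DocumentId = d.Id
        WHERE a.AttachmentName LIKE ?
        GROUP BY d.Id
        ORDER BY d.Id
        LIMIT ? OFFSET ?
        """,
        ("report%", PAGE_SIZE, page * PAGE_SIZE),
    ).fetchall()
    return [row[0] for row in rows]

first_page_in_memory = page_in_memory_distinct(con, 0)
first_page_db = page_db_dedup(con, 0)
print(first_page_in_memory)  # a single document
print(first_page_db)         # a full page of ten documents
```

The sketch selects only the Id column so that the GROUP BY form stays valid on engines that reject non-aggregated columns; it says nothing about the relative cost of the two queries on a large table.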
I think a first step would be to enable the deduplicating behavior in all cases, and not have the |
@BenedekFarkas your optimized query doesn't work for me on MSSQL, with Unless we force an offset/top by default :/ |
@sebastienros If you don't have offset in the subquery, then you can move the ORDER BY outside into the main query:
SELECT [Document].*, ci.[ModifiedUtc]
FROM [Document]
INNER JOIN (
SELECT * FROM [ContentItemIndex]
WHERE [ContentType] = 'Profile'
AND [Latest] = 1)
AS ci ON ci.[DocumentId] = [Document].[Id]
ORDER BY ci.[ModifiedUtc] DESC |
@wAsnk you should try to add GROUP BY clauses too. Based on your example it would look something like this. Maybe you could try to check the performance on your system.
SELECT [Document].*
FROM [Document]
INNER JOIN
(
SELECT [Document].Id as [Id], MAX([ContentItemIndex_a1].[ModifiedUtc]) [ModifiedUtc]
FROM [Document]
INNER JOIN [ContentItemIndex] AS [ContentItemIndex_a1] ON [ContentItemIndex_a1].[DocumentId] = [Document].[Id]
WHERE [ContentItemIndex_a1].[ContentType] = 'Profile' AND [ContentItemIndex_a1].[Latest] = 1
GROUP BY [Document].[Id]
ORDER BY [ModifiedUtc] DESC
OFFSET 490 ROWS FETCH NEXT 10 ROWS ONLY
) AS [IndexQuery] ON [IndexQuery].[Id] = [Document].[Id] |
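The shape of the query above (deduplicate and page inside a derived table, then join only that page back to the document table) can be tried out with the following sketch. It runs on Python's built-in sqlite3, so it uses LIMIT/OFFSET instead of OFFSET ... FETCH, groups directly on the index table's DocumentId as a simplification, and uses made-up data with an integer counter standing in for ModifiedUtc. The outer ORDER BY is an addition that is not in the quoted query; it is included here on the assumption that the final row order should be deterministic, since SQL only guarantees ordering from the outermost ORDER BY.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Document (Id INTEGER PRIMARY KEY, Content TEXT);
    CREATE TABLE ContentItemIndex (
        DocumentId INTEGER, ContentType TEXT, Latest INTEGER, ModifiedUtc INTEGER
    );
""")
for doc_id in range(1, 26):
    con.execute("INSERT INTO Document VALUES (?, ?)", (doc_id, f"doc {doc_id}"))
    # Two index rows per document simulate an index that yields duplicates.
    con.executemany(
        "INSERT INTO ContentItemIndex VALUES (?, 'Profile', 1, ?)",
        [(doc_id, doc_id * 10), (doc_id, doc_id * 10 + 1)],
    )

PAGED_QUERY = """
    SELECT d.Id, d.Content
    FROM Document AS d
    INNER JOIN (
        -- One row per document, paged before touching the Document table.
        SELECT i.DocumentId AS Id, MAX(i.ModifiedUtc) AS LastModified
        FROM ContentItemIndex AS i
        WHERE i.ContentType = 'Profile' AND i.Latest = 1
        GROUP BY i.DocumentId
        ORDER BY LastModified DESC
        LIMIT ? OFFSET ?
    ) AS IndexQuery ON IndexQuery.Id = d.Id
    -- Assumption: re-apply the ordering on the outer query.
    ORDER BY IndexQuery.LastModified DESC
"""

def page(con, skip, take):
    """Return the document ids of one deduplicated page, newest first."""
    return [row[0] for row in con.execute(PAGED_QUERY, (take, skip))]

second_page = page(con, skip=10, take=10)
print(second_page)  # ten distinct documents, most recently modified first
```

Because the derived table returns at most one row per document, Skip and Take count documents rather than joined rows, and only the ten selected rows are joined to Document.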
@sebastienros This looks great! This is the result with a much bigger offset: Current SQL in OC, without cache ~29312ms, then with cache ~141ms |
@sebastienros I would use Count() over Max(). I think you’ll get slightly better performance since the max check won’t be applied
|
@MikeAlhayek you can't do |
I did not notice the offset and order in the sub query
Why would you add that logic into the subquery? How about this query instead?
|
Because that gives a solid performance boost. It joins only the taken 10 rows to the document table where Also |
😢 PostgreSQL doesn't preserve the order from the Document table inner join. So is this working in MSSQL by chance or is it keeping the order by design? |
All tests are passing now. I added a method to opt-out of deduplication, |
I will merge it in main, but before releasing on NuGet can someone test the package from MyGet? At least in Orchard, or to check that the perf is better for these cases? |
Thanks, we'll test it tomorrow (in Azure too)! |
Sorry for the delay, I just got around to focusing on this again. Is OC 1.4.0 compatible with these changes or do we need to upgrade to 1.5.0 to be able to use the latest versions from MyGet (and the upcoming final release)? |
@BenedekFarkas this was not part of OC 1.5 |
Yep, I know, what I meant is that OC 1.4 doesn't work (it seems) with the latest YesSQL code, but 1.5 does. So not specifically in this release, but somewhere between the two there were breaking changes. |
@sebastienros thanks for the update, we're now using the NuGet version! There's no corresponding OC issue for upgrading YesSQL - should I create one? |
@BenedekFarkas OC 1.6-preview already has this upgrade OrchardCMS/OrchardCore#12959 |
Seriously degraded query performance due to the addition of the Distinct keyword.
There are 2M records in table ContentItemIndex.
It's shown as false. After running
var pageOfContentItems = await query.Skip(pager.GetStartIndex()).Take(pager.PageSize).ListAsync(_contentManager);
it's changed to true.
Might be the same issue as #407.