
Achieve the most efficient data import #13071

Open
hyzx86 opened this issue Jan 11, 2023 · 10 comments

@hyzx86
Contributor

hyzx86 commented Jan 11, 2023

The current import feature triggers every handler, one by one, on a per-content-item basis.

For example, if I have an Excel sheet with 20,000 rows, importing them all can take hours.

But most of the time we know the data is up to date and don't need to consider historical version information. We could resolve any issues, such as duplicate versions, in a single batch before importing.

As mentioned in PR #11532, the parameter that the import event fires should be a collection type. I think that would improve the performance of custom import event hooks.
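To illustrate the idea, here is a minimal sketch of what a batch-aware import event could look like. The interface and handler names are assumptions for illustration only, not the actual OrchardCore API:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical batch-aware import event: the handler receives the whole
// collection once, instead of being invoked once per content item.
public interface IBatchImportEventHandler<TItem>
{
    Task ImportingAsync(IReadOnlyCollection<TItem> items);
    Task ImportedAsync(IReadOnlyCollection<TItem> items);
}

// Example handler: a single call can issue one bulk query or one bulk
// index write for all items, rather than 20,000 separate ones.
public class CountingBatchImportHandler : IBatchImportEventHandler<string>
{
    public int Invocations { get; private set; }

    public Task ImportingAsync(IReadOnlyCollection<string> items)
    {
        Invocations++; // one invocation regardless of items.Count
        return Task.CompletedTask;
    }

    public Task ImportedAsync(IReadOnlyCollection<string> items)
    {
        Invocations++;
        return Task.CompletedTask;
    }
}
```

A handler written against this shape can amortize its per-call overhead across the whole batch instead of paying it per item.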

@hyzx86
Contributor Author

hyzx86 commented Jan 11, 2023

Related: #10858

@hyzx86 hyzx86 changed the title Achieve the most efficient import Achieve the most efficient data import Jan 11, 2023
@sebastienros sebastienros added this to the 1.x milestone Jan 12, 2023
@hyzx86
Contributor Author

hyzx86 commented Jan 14, 2023

Hi @sebastienros,

I tweaked the code and ran new tests. The root cause of the import performance loss is the CreateContentItemVersionAsync function:

    var evictionVersions = versionsThatMaybeEvicted.Where(x => String.Equals(x.ContentItemId, importingItem.ContentItemId, StringComparison.OrdinalIgnoreCase));
    var result = await CreateContentItemVersionAsync(importingItem, evictionVersions);

If I override ContentManager, remove the CreateContentItemVersionAsync call, and use the following code instead:

    _session.Save(contentItem);
    _contentManagerSession.Store(contentItem);

then importing 20,000 records takes less than five minutes. (I also disabled the SQL index.)

The degradation comes from CreateContentItemVersionAsync issuing a SQL query for every single item being imported.

In my opinion, cleaning up or updating historical versions should be left to the user rather than handled automatically inside the import function. Alternatively, we could batch-update the historical versions before performing the import.
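As a sketch of that batch pre-pass idea (all types here are hypothetical, just to show the shape; the real import works on ContentItem objects): if each incoming record carries its content item id and a version number, duplicate versions can be resolved in memory in one pass, so the importer never needs a per-item eviction query:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical flat import record, e.g. one row of the Excel sheet.
public record ImportRecord(string ContentItemId, int Version, string Data);

public static class ImportPrePass
{
    // Keep only the newest version of each content item so no historical
    // versions need to be evicted one by one during the import.
    public static List<ImportRecord> KeepLatestVersions(IEnumerable<ImportRecord> records) =>
        records
            .GroupBy(r => r.ContentItemId, StringComparer.OrdinalIgnoreCase)
            .Select(g => g.OrderByDescending(r => r.Version).First())
            .ToList();
}
```

With duplicates removed up front, the per-item `versionsThatMaybeEvicted` lookup becomes unnecessary for data that is known to be current.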

@hyzx86
Contributor Author

hyzx86 commented Jan 15, 2023

I just created a content import service; it executes only the essentials and skips all the handlers that UpdateAsync(contentItem) would otherwise invoke:

        private string[] BatchImportHandlerNameFilter = new[]
        {
            "OrchardCore.Search.Lucene.Handlers.LuceneIndexingContentHandler",
            "OrchardCore.Search.Elasticsearch.Core.Handlers.ElasticIndexingContentHandler",
        };

...

        _session.Save(contentItem);
        _contentManagerSession.Store(contentItem);
      - UpdateContentContext updateContext = new UpdateContentContext(contentItem);
      - await ReversedImportHandlers.Where(x => BatchImportHandlerNameFilter.Contains(x.GetType().FullName)).InvokeAsync((handler, context) => handler.UpdatedAsync(context), updateContext, Logger);

        var result = await ContentManager.ValidateAsync(contentItem);
        if (!result.Succeeded)
        {
            await Notifier.ErrorAsync(result.Errors, H);
            await YesSession.CancelAsync();
            return 0;
        }

(The `-` lines are the indexing-handler invocation I removed.)

If the LuceneIndexingContentHandler runs at this point, the UI waits for a long time.

I checked the source to see where it processes the versions. But if I don't run LuceneIndexingContentHandler, the rebuild-index function is unable to rebuild the Lucene index.

[screenshot]

Why don't we do this in LuceneWorker?

[screenshot]

@sebastienros
Member

Importing does validation, for instance, so the events are required. You can't just ignore the events.

If you think there should be an option to skip validation, that could be exposed in the UI. We could also find a way to remove some of the extra queries (e.g. RemoveLatestVersion) without breaking the import, along with other optimizations like that.

@sebastienros
Member

I also mentioned in another issue that we should have a service that stops live indexing of items and delays it until after the data is imported.

@hyzx86
Contributor Author

hyzx86 commented Jan 20, 2023

> Importing does validation, for instance, so the events are required. You can't just ignore the events.

Yes, you're right, so I ended up blocking some handlers that weren't important to my scenario.

[screenshot]

@sebastienros
Member

What about having a scoped service that would allow us to disable specific handler events, for instance with a Flags enumeration (so we can pre-define groups)? These would be cross-handler, though. One advantage is that you could disable any validation when you are storing/importing a content item.
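A minimal sketch of that idea (all names here are invented for illustration; nothing like this exists in OrchardCore today): a scoped service holding a [Flags] enum that handlers would consult before doing their work:

```csharp
using System;

// Hypothetical groups of handler events that an import could disable.
[Flags]
public enum ContentHandlerEvents
{
    None       = 0,
    Validating = 1,
    Updating   = 2,
    Indexing   = 4,
    All        = Validating | Updating | Indexing,
}

// Registered as a scoped service; an importer disables groups up front,
// and each handler checks IsEnabled before running.
public class ContentEventFilter
{
    public ContentHandlerEvents Disabled { get; private set; } = ContentHandlerEvents.None;

    public void Disable(ContentHandlerEvents events) => Disabled |= events;

    public bool IsEnabled(ContentHandlerEvents @event) => (Disabled & @event) == 0;
}
```

An import that only needs raw storage could then call `filter.Disable(ContentHandlerEvents.All)` at the start of its scope, while a normal request leaves everything enabled.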

However, do you know which part of these handlers' code was causing the perf issue? I checked their code and they were optimized for batches, e.g. just tracking an object and scheduling a single task for the end of the request.

But thinking about it, the issue might be that they track all the content items from the import, so batching doesn't fix the problem when there are too many items. With 20K items in memory to process, that could be a problem. Disabling the event handlers like you did solves it because these items are never even tracked, so nothing is done at the end of the request. One option might be to delay these tasks to a background process, out of the request. Another option could be to work in smaller batches, but all the items would still be in memory at some point, so ideally we should just store the ids of these items and process them asynchronously (background task), like it's done for Lucene indexes. (This could be a new feature, but it looks a lot like the indexing module, with a cursor, but per task.)
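The "store just the ids and process them later" idea could look roughly like this. This is a self-contained sketch under assumed names; the real thing would sit behind a background task and the actual index writer:

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;

// The import enqueues only content item ids, so the request never holds
// 20K full content items for indexing; a background task drains the
// queue in small batches (playing the "cursor" role described above).
public class DeferredIndexingQueue
{
    private readonly ConcurrentQueue<string> _ids = new();

    public void Enqueue(string contentItemId) => _ids.Enqueue(contentItemId);

    // Called periodically from a background task; returns up to
    // batchSize ids, or fewer when the queue is nearly empty.
    public IReadOnlyList<string> DequeueBatch(int batchSize)
    {
        var batch = new List<string>(batchSize);
        while (batch.Count < batchSize && _ids.TryDequeue(out var id))
        {
            batch.Add(id);
        }
        return batch;
    }
}
```

Each drained batch would then be loaded and indexed outside the importing request, keeping the request-time memory footprint at a list of strings.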

@hyzx86
Contributor Author

hyzx86 commented May 20, 2023

Hi @sebastienros,

I tried modifying the LuceneIndexingContentHandler class and tested deleting and updating; it seems to work well.

#13721

But the code below will always delete the last document:

    await luceneIndexManager.DeleteDocumentsAsync(indexSettings.IndexName, new string[] { contentItem.ContentItemId });

This means that OC's Lucene module only keeps the latest version of a content item in the index, right?

There seems to be no way to get a specific version of the Lucene document and delete it.

@sebastienros
Member

> There seems to be no way to get a specific version

Don't we index only published content?

@Skrypt
Contributor

Skrypt commented Mar 22, 2024

> Don't we index only published content?

No, by configuration we allow indexing the latest (draft) version as well.
