
Achieve the most efficient data import #13071

Open
hyzx86 opened this issue Jan 11, 2023 · 10 comments

@hyzx86
Contributor

hyzx86 commented Jan 11, 2023

The current import feature triggers every handler, one by one, on a per-content-item basis.

For example, if I have an Excel sheet with 20,000 rows, importing them all can take hours.

But most of the time we know the data is up to date and don't need to consider historical version information. We could resolve any issues, such as duplicate versions, in a single batch before importing.

As mentioned in PR #11532, the parameter that the import event fires should be a collection type. I think that would improve the performance of custom import event hooks.
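To illustrate the idea, here is a minimal sketch of what a batch-aware import event could look like. The interface and handler names are assumptions for illustration only, not the actual OrchardCore API:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical batch-aware import event: the handler receives the whole
// collection once, instead of being invoked once per content item.
public interface IBatchImportEventHandler<TItem>
{
    Task ImportingAsync(IReadOnlyCollection<TItem> items);
    Task ImportedAsync(IReadOnlyCollection<TItem> items);
}

// Example handler: a single call can issue one bulk query or one bulk
// index write for all items, rather than 20,000 separate ones.
public class CountingBatchImportHandler : IBatchImportEventHandler<string>
{
    public int Invocations { get; private set; }

    public Task ImportingAsync(IReadOnlyCollection<string> items)
    {
        Invocations++; // one invocation regardless of items.Count
        return Task.CompletedTask;
    }

    public Task ImportedAsync(IReadOnlyCollection<string> items)
    {
        Invocations++;
        return Task.CompletedTask;
    }
}
```

A handler written against this shape can amortize its per-call overhead across the whole batch instead of paying it per item.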

@hyzx86
Contributor Author

hyzx86 commented Jan 11, 2023

Related: #10858

@hyzx86 hyzx86 changed the title Achieve the most efficient import Achieve the most efficient data import Jan 11, 2023
@sebastienros sebastienros added this to the 1.x milestone Jan 12, 2023
@hyzx86
Contributor Author

hyzx86 commented Jan 14, 2023

Hi @sebastienros,

I tweaked the code and ran new tests. The root cause of the import performance loss is the CreateContentItemVersionAsync function:

    var evictionVersions = versionsThatMaybeEvicted.Where(x => String.Equals(x.ContentItemId, importingItem.ContentItemId, StringComparison.OrdinalIgnoreCase));
    var result = await CreateContentItemVersionAsync(importingItem, evictionVersions);

If I override ContentManager, remove the CreateContentItemVersionAsync call, and use the following code instead:

    _session.Save(contentItem);
    _contentManagerSession.Store(contentItem);

then importing 20,000 records takes less than five minutes. (I also disabled the SQL index.)

The degradation comes from CreateContentItemVersionAsync issuing a SQL query for every single item being imported.

In my opinion, cleaning up or updating historical versions should be left to the user rather than handled automatically inside the import function. Alternatively, we could batch-update the historical versions before performing the import.
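As a sketch of that batch pre-pass idea (all types here are hypothetical, just to show the shape; the real import works on ContentItem objects): if each incoming record carries its content item id and a version number, duplicate versions can be resolved in memory in one pass, so the importer never needs a per-item eviction query:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical flat import record, e.g. one row of the Excel sheet.
public record ImportRecord(string ContentItemId, int Version, string Data);

public static class ImportPrePass
{
    // Keep only the newest version of each content item so no historical
    // versions need to be evicted one by one during the import.
    public static List<ImportRecord> KeepLatestVersions(IEnumerable<ImportRecord> records) =>
        records
            .GroupBy(r => r.ContentItemId, StringComparer.OrdinalIgnoreCase)
            .Select(g => g.OrderByDescending(r => r.Version).First())
            .ToList();
}
```

With duplicates removed up front, the per-item `versionsThatMaybeEvicted` lookup becomes unnecessary for data that is known to be current.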

@hyzx86
Contributor Author

hyzx86 commented Jan 15, 2023

I just created a content import service; it executes only the essentials and skips all the handlers that UpdateAsync(contentItem) would otherwise invoke:

        private string[] BatchImportHandlerNameFilter = new[]
        {
            "OrchardCore.Search.Lucene.Handlers.LuceneIndexingContentHandler",
            "OrchardCore.Search.Elasticsearch.Core.Handlers.ElasticIndexingContentHandler",
        };

...

        _session.Save(contentItem);
        _contentManagerSession.Store(contentItem);
      - UpdateContentContext updateContext = new UpdateContentContext(contentItem);
      - await ReversedImportHandlers.Where(x => BatchImportHandlerNameFilter.Contains(x.GetType().FullName)).InvokeAsync((handler, context) => handler.UpdatedAsync(context), updateContext, Logger);

        var result = await ContentManager.ValidateAsync(contentItem);
        if (!result.Succeeded)
        {
            await Notifier.ErrorAsync(result.Errors, H);
            await YesSession.CancelAsync();
            return 0;
        }

(The `-` lines are the indexing-handler invocation I removed.)

If the LuceneIndexingContentHandler runs at this point, the UI waits for a long time.

I checked the source to see where it processes the versions. But if I don't run LuceneIndexingContentHandler, the rebuild-index function is unable to rebuild the Lucene index.

[screenshot]

Why don't we do this in LuceneWorker?

[screenshot]

@sebastienros
Member

Importing does validation, for instance, so the events are required. You can't just ignore the events.

If you think there should be an option to skip validation, that could be exposed in the UI. We could also find a way to remove some of the extra queries (e.g. RemoveLatestVersion) without breaking the import, along with other optimizations like that.

@sebastienros
Member

I also mentioned in another issue that we should have a service that stops live indexing of items and delays it until after the data is imported.

@hyzx86
Contributor Author

hyzx86 commented Jan 20, 2023

> Importing does validation, for instance, so the events are required. You can't just ignore the events.

Yes, you're right, so I ended up blocking some handlers that weren't important to my scenario.

[screenshot]

@sebastienros
Member

What about having a scoped service that would allow us to disable specific handler events, for instance with a Flags enumeration (so we can pre-define groups)? These would be cross-handler, though. One advantage is that you could disable any validation when you are storing/importing a content item.
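A minimal sketch of that idea (all names here are invented for illustration; nothing like this exists in OrchardCore today): a scoped service holding a [Flags] enum that handlers would consult before doing their work:

```csharp
using System;

// Hypothetical groups of handler events that an import could disable.
[Flags]
public enum ContentHandlerEvents
{
    None       = 0,
    Validating = 1,
    Updating   = 2,
    Indexing   = 4,
    All        = Validating | Updating | Indexing,
}

// Registered as a scoped service; an importer disables groups up front,
// and each handler checks IsEnabled before running.
public class ContentEventFilter
{
    public ContentHandlerEvents Disabled { get; private set; } = ContentHandlerEvents.None;

    public void Disable(ContentHandlerEvents events) => Disabled |= events;

    public bool IsEnabled(ContentHandlerEvents @event) => (Disabled & @event) == 0;
}
```

An import that only needs raw storage could then call `filter.Disable(ContentHandlerEvents.All)` at the start of its scope, while a normal request leaves everything enabled.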

However, do you know which part of these handlers' code was causing the perf issue? I checked their code and they were optimized for batches, e.g. just tracking an object and scheduling a single task for the end of the request.

But thinking about it, the issue might be that they track all the content items from the import, so batching doesn't fix the problem when there are too many items. With 20K items in memory to process, that could be a problem. Disabling the event handlers like you did solves it because these items are never even tracked, so nothing is done at the end of the request. One option might be to delay these tasks to a background process, out of the request. Another option could be to work in smaller batches, but all the items would still be in memory at some point, so ideally we should just store the ids of these items and process them asynchronously (background task), like it's done for Lucene indexes. (This could be a new feature, but it looks a lot like the indexing module, with a cursor, but per task.)
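The "store just the ids and process them later" idea could look roughly like this. This is a self-contained sketch under assumed names; the real thing would sit behind a background task and the actual index writer:

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;

// The import enqueues only content item ids, so the request never holds
// 20K full content items for indexing; a background task drains the
// queue in small batches (playing the "cursor" role described above).
public class DeferredIndexingQueue
{
    private readonly ConcurrentQueue<string> _ids = new();

    public void Enqueue(string contentItemId) => _ids.Enqueue(contentItemId);

    // Called periodically from a background task; returns up to
    // batchSize ids, or fewer when the queue is nearly empty.
    public IReadOnlyList<string> DequeueBatch(int batchSize)
    {
        var batch = new List<string>(batchSize);
        while (batch.Count < batchSize && _ids.TryDequeue(out var id))
        {
            batch.Add(id);
        }
        return batch;
    }
}
```

Each drained batch would then be loaded and indexed outside the importing request, keeping the request-time memory footprint at a list of strings.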

@hyzx86
Contributor Author

hyzx86 commented May 20, 2023

Hi @sebastienros,

I tried modifying the LuceneIndexingContentHandler class and tested deleting and updating; it seems to work well.

#13721

But the code below will always delete the last document:

    await luceneIndexManager.DeleteDocumentsAsync(indexSettings.IndexName, new string[] { contentItem.ContentItemId });

This means that OC's Lucene module only keeps the latest version of a content item in the index, right?

There seems to be no way to get a specific version of the Lucene document and delete it.

@sebastienros
Member

> There seems to be no way to get a specific version

Don't we index only published content?

@Skrypt
Contributor

Skrypt commented Mar 22, 2024

> Don't we index only published content?

No, by configuration we allow indexing the latest (draft) version as well.
