Offers new ways to compute bulk import load plans. #4933

keith-turner · 2024-09-30T17:37:50Z

This change offers two new ways of computing bulk import load plans. First, the RFile API was modified to support computing a LoadPlan as the RFile is written. Second, a new LoadPlan.compute() method was added that creates a LoadPlan from an existing RFile. In addition, methods were added to LoadPlan that support serializing and deserializing load plans to and from JSON.

Together, these changes support computing load plans in a distributed manner. For example, given a bulk import directory with N files, the following use case is now supported.

1. For each file, a task is spun up on a remote server that calls the new LoadPlan.compute() API to determine which tablets the file overlaps. Then the new LoadPlan.toJson() method is called to serialize the load plan and send it to a central place.
2. All the load plans from the remote servers are deserialized by calling the new LoadPlan.fromJson() method and merged into a single load plan that is used to do the bulk import.
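
The per-file analysis in step 1 amounts to mapping each row in a file to the tablet that contains it. The sketch below is a simplified, stdlib-only model of that idea, not the Accumulo API: the class `TabletOverlapSketch`, its methods, and the string form of a tablet extent are all hypothetical names chosen for illustration, and rows are modeled as plain strings. In real code this work is done by LoadPlan.compute().

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical, simplified model; these are not Accumulo classes.
final class TabletOverlapSketch {

  // A tablet covers rows in (prevEndRow, endRow]; a missing bound is shown as -inf or +inf.
  static String tabletFor(String row, TreeSet<String> splits) {
    String prevEndRow = splits.lower(row); // greatest split strictly less than row
    String endRow = splits.ceiling(row); // smallest split greater than or equal to row
    return "(" + (prevEndRow == null ? "-inf" : prevEndRow) + ","
        + (endRow == null ? "+inf" : endRow) + "]";
  }

  // Map the rows of one file to the distinct tablets they fall in, in encounter order.
  static Set<String> overlappingTablets(List<String> rows, TreeSet<String> splits) {
    Set<String> extents = new LinkedHashSet<>();
    for (String row : rows) {
      extents.add(tabletFor(row, splits));
    }
    return extents;
  }
}
```

Because each file is analyzed independently against the same set of splits, this step can run on any number of remote tasks in parallel, which is what makes the distributed use case possible.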

Another use case these new APIs could support is running this new code in the map reduce job that generates bulk import data.

1. As each reducer writes to an RFile, it could also build a LoadPlan. The load plan can be obtained from the RFile writer after closing it, serialized using LoadPlan.toJson(), and saved to a file. After the map reduce job completes, each RFile would have a corresponding file containing its load plan.
2. Another process that runs after the map reduce job can read all the load plan files, deserialize them using the new LoadPlan.fromJson() method, and merge them. The merged LoadPlan can then be used to do the bulk import.
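
The two steps above can be modeled with a small stdlib-only sketch. It is not the Accumulo API: `PlanWhileWritingSketch`, `TrackingWriter`, and `merge` are hypothetical names, rows are plain strings, and a plan is modeled as a map from file name to tablet extents. The sketch assumes rows are appended in sorted order, as they are when writing an RFile, so the tablet lookup is only needed when a row passes the end of the current tablet. JSON serialization is left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical, simplified model of the reducer-side flow; not Accumulo classes.
final class PlanWhileWritingSketch {

  // Tracks which tablets a file touches while its rows are appended in sorted order.
  static final class TrackingWriter {
    private final TreeSet<String> splits;
    private final List<String> extents = new ArrayList<>();
    private String currentEndRow; // end row of the tablet holding the last row; null is +inf
    private boolean started = false;

    TrackingWriter(TreeSet<String> splits) {
      this.splits = splits;
    }

    void append(String row) {
      // Rows arrive sorted, so a new tablet starts only when the row passes the current end row.
      boolean inCurrent = started && (currentEndRow == null || row.compareTo(currentEndRow) <= 0);
      if (!inCurrent) {
        String prevEndRow = splits.lower(row);
        currentEndRow = splits.ceiling(row);
        extents.add("(" + (prevEndRow == null ? "-inf" : prevEndRow) + ","
            + (currentEndRow == null ? "+inf" : currentEndRow) + "]");
        started = true;
      }
    }

    // The tablets this file overlaps, in row order.
    List<String> plan() {
      return new ArrayList<>(extents);
    }
  }

  // Merge per-file plans (file name -> tablets) into one plan for the whole import.
  static Map<String,List<String>> merge(List<Map<String,List<String>>> perFilePlans) {
    Map<String,List<String>> merged = new LinkedHashMap<>();
    for (Map<String,List<String>> plan : perFilePlans) {
      for (Map.Entry<String,List<String>> e : plan.entrySet()) {
        merged.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).addAll(e.getValue());
      }
    }
    return merged;
  }
}
```

Building the plan during the write avoids a second pass over the file, at the cost of the writer needing the table's splits up front.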

Both of these use cases avoid analyzing every file on the single machine that does the bulk import. Bulk import v1 had distributed analysis: it would ask random tservers to do the file analysis, which could cause unexpected load on those tservers. Bulk v1 would also interleave analyzing files and adding them to tablets. This could lead to odd situations where a file was added to some tablets and then analysis failed, leaving the file partially imported. Bulk v2 does all analysis before any files are added to tablets, but it lacks this distributed analysis capability. These changes provide the building blocks for doing in bulk v2 the distributed analysis that bulk v1 did.

BulkNewIT.testComputeLoadPlan() simulates the map reduce use case by going through the steps in code that a map reduce job would. This tests the new APIs and shows what using them would look like.

keith-turner · 2024-09-30T18:01:32Z

This PR squashed the changes from #4898 into 3.1 and resolved conflicts.

keith-turner · 2024-09-30T18:04:22Z

core/src/main/java/org/apache/accumulo/core/client/rfile/RFileWriter.java

@@ -92,12 +93,17 @@ public class RFileWriter implements AutoCloseable {

  private final FileSKVWriter writer;
  private final LRUMap<ByteSequence,Boolean> validVisibilities;
+
+  // TODO should be able to completely remove this as lower level code is already doing some things


I will remove this todo and open an issue before this PR is merged.

keith-turner added this to the 3.1.0 milestone Sep 30, 2024

fix build and bug

e61521f

keith-turner and others added 4 commits September 30, 2024 18:32

fix build and add validation

618cd63

fix javadoc bug

16b351a

improve javadoc

8cc15ad

Add new prefix for bulk load working files

964188b
