Long load times on saved dataset #4565
Hi there! Apologies for the slow response. I don't think you're doing anything incorrectly; I think you're hitting some performance limits... but I think I might have a fix incoming to speed things up. Just to double check -- how are you initially loading in your data to get your initial yt dataset? Using ...?
@Ecskrabacz10 I just opened a PR that should speed things up for you when you're loading back in via ...
Apologies in advance for the long comment. Yes, absolutely! I would love some help installing the PR from source off of your branch.

I am unsure if this will help, but I was looking at the lengths of my saved datasets, and it seems that my saved datasets have about 32 times the number of particles of the original dataset. I believe this may be the main cause of the long load times.

Explanation of dataset

For some clarification about the dataset itself, it includes four types of species: ...

Before I save my dataset, I have also checked to make sure that all of the lengths of the particle lists are the same: ...

Issue

The main issue now comes from the saved datasets. As I've stated previously, the length of the dataset grew 64 times the original size. Instead of having 2*(128^3) particles, the ...

Currently, I save my dataset through the following procedure:
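(A rough sketch of what that procedure looks like, pieced together from the command in my original post -- the path and the per-type length check are placeholders, not my exact script:)

```python
# ds is the in-memory dataset I build from my simulation output
ad = ds.all_data()

# sanity check before saving: every particle type should have the same length
for ptype in ds.particle_types_raw:
    print(ptype, ad[(ptype, "particle_position_x")].shape)

# save everything in the field list alongside the .h5/.ewah output
fn = ad.save_as_dataset("/path/to/saved_dataset", fields=ds.field_list)
```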
My main guess would be when I perform ... Another issue could arise from my CPU structure. I have currently been running on one node with 16 cores, but I do not have parallelism enabled, so I'm not sure how that could cause an issue. I have also tested this with ...

Do you or anyone else know how this process creates a dataset that is 32 times larger than the original?
Here's how. Note that it'll take longer than a regular installation (a couple of minutes or so), because your computer will run extra compilation steps.

```
python -m pip install git+https://github.com/chrishavlin/yt@ytdata_check_for_all_data
```
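If it helps, a quick way to double-check that the branch install actually took effect (just a sketch -- the exact dev version string will vary):

```python
# confirm which yt is being imported after installing from the PR branch
import yt

print(yt.__version__)  # should report a dev version rather than a stable release
print(yt.__file__)     # shows which installation is actually being imported
```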
Thanks for the extra info, @Ecskrabacz10! After reading through, I don't actually think my PR will fully solve your problem -- that PR simply adds code to avoid yt's selection routines when using ...

One thing you could do that would help isolate the issue: when you call ...
You could also post the output of ...
Thank you both for the extra insight! Here's a snapshot of what the ... The length of each component looks correct. So, it seems that the duplication does not happen during the initial ...

Edit: Here is the output that I get when I load with ... The 1.678e+07 particles seem to be the correct amount, but when we check the shape of each field, they are 32 times the normal size. This factor of 32 seems to be reflected when initializing ...
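For reference, this is roughly the kind of check I'm running (a sketch -- the file name is a placeholder for my actual saved dataset, and I'm assuming particle_type_counts is available on the reloaded dataset):

```python
import yt

ds_saved = yt.load("saved_dataset.h5")  # placeholder for my actual saved file
ad_saved = ds_saved.all_data()

# compare the dataset-level particle counts with the per-field array shapes
print(ds_saved.particle_type_counts)
for ptype in ds_saved.particle_types_raw:
    print(ptype, ad_saved[(ptype, "particle_position_x")].shape)
```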
Interesting. It is suspicious that the counts are off by a factor of 32 and the dataset index gets initialized with 32 chunks... maybe each chunk is referencing the same index range and everything is getting loaded 32 times... Let me look a bit more at how all that works to see if I can reproduce this behavior with a smaller dataset.
I was curious and wanted to check if this problem was due to the number of CPUs in the node that I have been using. I tested two different nodes, one with 16 CPUs and one with 20. After saving and loading the same dataset from the two different nodes, both saved datasets encountered the same factor of 32 problem.
This really is very confusing. One thing that might help -- do any .ewah files get generated? If so, can you send them, and also send us the .ewah files from the original dataset (before save_as_dataset)?
OK, was able to reproduce this on main with a simple example:

```python
import yt
import numpy as np

n_particles = int(1e6)
ppx, ppy, ppz = np.random.random(size=[3, n_particles])
ppm = np.arange(0, n_particles)

data = {
    "particle_position_x": ppx,
    "particle_position_y": ppy,
    "particle_position_z": ppz,
    "particle_mass": ppm,
}

ds = yt.load_particles(data)
ad = ds.all_data()
fn = ad.save_as_dataset('/var/tmp/test_save', fields=ds.field_list)

ds1 = yt.load(fn)
ad1 = ds1.all_data()
n_particles_out = ad1[('all', 'particle_mass')].shape[0]
print(n_particles_out, n_particles_out == n_particles, n_particles_out / n_particles)
```

The final print statement prints out ...
So I ended up with 4 times more particles. Furthermore, because I gave the particles unique masses, print(np.unique(ad1[('all','particle_mass')]).shape) prints ... This only happens when there are enough particles to trigger the chunking -- a smaller initial n_particles doesn't show it. So, not sure what the problem is yet, but we now have a simpler toy problem to debug.
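To spell out the duplication check: since particle_mass was built with np.arange, every value is unique in the original data, so repeated values after reloading point to duplicated particles. A minimal sketch continuing the example above:

```python
import numpy as np

masses = ad1[("all", "particle_mass")]
unique_masses = np.unique(masses)

# if reloading duplicates particles, the unique count stays at n_particles
# while the total length is a multiple of it
print(masses.shape[0], unique_masses.shape[0], masses.shape[0] / unique_masses.shape[0])
```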
I'm gonna go ahead and label this a bug at this point ...
Oh -- it might be that the stream dataset always uses 1 chunk, but that info is not being passed on to the saved dataset, so that when it gets loaded back in the particle index re-builds with multiple chunks??
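One rough way to poke at that hypothesis, continuing the toy example above (a sketch only -- chunks() is yt's internal chunk iterator, so treat this purely as a debugging aid):

```python
# count the io chunks yt builds for the reloaded container (ad1) from
# the toy example; not a public-API guarantee, just a quick inspection
n_chunks = sum(1 for _ in ad1.chunks([], "io"))
print("io chunks on reload:", n_chunks)
```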
.ewah files get generated when I load the saved dataset, but there are no .ewah files for the previous steps. I just tried uploading one, and it seems like GitHub does not support .ewah files in comments/posts. Is there anywhere I could post my sample .ewah file?

I also found it very odd that the particle index re-builds in multiple chunks. I thought the number of chunks might be related to the number of fields present in the dataset, but I don't believe that's it, as each of my species has nine fields associated with it.
The number of chunks is related to the length of the arrays when re-loading. And it only happens on re-load, because when loading back in from a dataset that was created with ... I just found the spot in the code where this is happening, so I'm hoping to have a fix in today.
(and don't worry about uploading the .ewah files)
@Ecskrabacz10 would you be able to test out my fix in #4595 with your data? You can install from my PR branch with ...
Just installed the fix and it seems like it worked! It still loads the particles in 32 chunks; however, it now has the same number of particles as the original dataset. This may just be because it's a new fix, but I wanted to bring it up nonetheless: ...

Once again, not too big of an issue, but I wanted to bring it up just in case.
Great! I think the second error is already fixed by ...
Oh, and how's the load time? Much faster, I hope??
Ah okay, that makes sense, thank you so much! And yes, it is MUCH faster now: about 5 seconds to load each field instead of the 2-3 minutes it was taking before.
Hello! This is not necessarily a bug report, but rather a question on how I can improve saving and loading a dataset.
I have been working with a filetype from a simulation that is not currently supported by yt. I have been able to create a dataset, get a time series, and save a dataset. For some context, I have made my own dataset with ~4.2e6 particles and the following fields: ...
Each of the particle types has the same fields, e.g. [("stable matter", "particle_position_x")]. I have changed values in the dataset (such as current_time, current_redshift, hubble_constant, etc.) as well. Currently, I have been saving these datasets to mitigate a memory issue with the command ad.save_as_dataset(path, fields=ds.field_list), where ds is the dataset and ad = ds.all_data(). Naturally, this saves both the .h5 file and the .ewah file.

Now is where I run into an issue. I am able to load the saved file and run the same command to define ad with no issue, and it takes only around one second to load the particle index. However, it takes around 2.5 minutes to load some of the saved data through, say, pos_x = ad[("all", "particle_position_x")]. This would normally be a non-issue, but when trying to load every field in the field_list, this could easily take a little under half an hour for each of the particle types.

Could anyone give me any advice on how I can make the loading process quicker? Am I possibly saving my dataset incorrectly?
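For concreteness, this is roughly how I'm measuring the field reads (a sketch -- the path is a placeholder for my actual saved file):

```python
import time
import yt

ds_saved = yt.load("/path/to/saved_dataset.h5")  # placeholder path
ad = ds_saved.all_data()

t0 = time.time()
pos_x = ad[("all", "particle_position_x")]  # the first field read is the slow step
print(f"read {pos_x.shape[0]} values in {time.time() - t0:.1f} s")
```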