
Optimize loading ?fferent_edges. #298

Merged · 2 commits merged into master from 1uc/optimize_xfferent_edges on Nov 6, 2023

Conversation

@1uc (Collaborator) commented Oct 23, 2023

Optimized reading of edge IDs by aggregating ranges into larger (GPFS-friendly) ranges before creating the appropriate HDF5 selection, which reduces the number of individual reads; any unneeded data is then filtered out in memory (see the sketch after the list below). This is very similar to the work done in #183.

This PR introduces the following:

  • Extend the internal API for merging libsonata.Selection objects.
  • Add an internal API for reading in bulk and filtering.
  • Load ?fferent_edges in bulk.
  • A compile-time constant SONATA_PAGESIZE which controls how large the merged regions need to be.
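
To illustrate the aggregation step, here is a minimal C++ sketch of gap-aware range merging. The function name and signature are illustrative, not libsonata's actual internals; min_gap plays the role the PR assigns to SONATA_PAGESIZE:

#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

using Range = std::pair<std::size_t, std::size_t>;  // half-open [begin, end)

// Sort the requested ranges, then merge any two whose gap is smaller than
// `min_gap` (e.g. the filesystem block size). One large read then replaces
// many small ones; unwanted elements are discarded in memory afterwards.
std::vector<Range> mergeAcrossGaps(std::vector<Range> ranges, std::size_t min_gap) {
    std::sort(ranges.begin(), ranges.end());
    std::vector<Range> merged;
    for (const auto& r : ranges) {
        if (!merged.empty() && r.first <= merged.back().second + min_gap) {
            merged.back().second = std::max(merged.back().second, r.second);
        } else {
            merged.push_back(r);
        }
    }
    return merged;
}

Each bridged gap costs at most min_gap extra elements of reading, while every merge removes one read call, which is the right trade on a block-oriented filesystem.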

Review thread on src/edge_index.cpp (outdated, resolved)
@1uc 1uc force-pushed the 1uc/optimize_xfferent_edges branch 4 times, most recently from 2cad097 to be4cb91 on October 25, 2023 13:41
@1uc 1uc force-pushed the 1uc/optimize_xfferent_edges branch from be4cb91 to fd8adc1 on November 1, 2023 11:32
@1uc (Collaborator, Author) commented Nov 1, 2023

When reading edge IDs for a large file we get a 10'000x speedup. The benchmark is:

import libsonata
import numpy as np

def fair_chunk(n_ranks, k_rank, n):
    # Hypothetical reconstruction (the helper is not shown in the original
    # snippet): split [0, n) into n_ranks nearly equal [start, stop) chunks.
    chunk, rest = divmod(n, n_ranks)
    start = k_rank * chunk + min(k_rank, rest)
    stop = start + chunk + (1 if k_rank < rest else 0)
    return start, stop

edge_filename = "/gpfs/bbp.cscs.ch/data/scratch/proj134/matwolf/v4_all_pathways/edges_try2.h5"
edge_file = libsonata.EdgeStorage(edge_filename)
population_name = "root__neurons__root__neurons__chemical"
population = edge_file.open_population(population_name)

n_nodes = 100
n_gids = 10000000
n_ranks = n_nodes * 40
gid_stride = 2  # simulates some selection effect

all_edge_ids = []
for k_rank in range(n_ranks):
    # Each "rank" queries the afferent edge IDs of its own chunk of GIDs.
    gids = np.arange(*fair_chunk(n_ranks, k_rank, n_gids), gid_stride)
    edge_ids = population.afferent_edges(gids)
    all_edge_ids.append(edge_ids.ranges)

This computes the edge IDs used by each MPI rank for analysis purposes. With the optimization it takes about 2-4 s in total; without it, the first 10 ranks alone take 77 s, which extrapolates to roughly 8.5 h for all 4000 ranks.

@1uc 1uc requested a review from sergiorg-hpc November 1, 2023 11:39
@1uc (Collaborator, Author) commented Nov 1, 2023

The PR uses templates to hide the difference between Selection::Ranges, which is an std::vector<std::pair>, and RawIndex, which is an std::vector<std::array>. std::get<N> is used to access x.first and x[0] in a uniform manner.
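
For instance, a minimal sketch of such a template (flatSize is a hypothetical name, not a function from this PR):

#include <array>
#include <cstddef>
#include <utility>
#include <vector>

// std::get<0>/std::get<1> work on std::pair as well as std::array, so one
// template can traverse Selection::Ranges (a vector of pairs) and RawIndex
// (a vector of 2-element arrays) alike.
template <class Ranges>
std::size_t flatSize(const Ranges& ranges) {
    std::size_t total = 0;
    for (const auto& range : ranges) {
        total += std::get<1>(range) - std::get<0>(range);
    }
    return total;
}

// Both calls instantiate the same template:
//   flatSize(std::vector<std::pair<std::size_t, std::size_t>>{{0, 3}, {10, 12}});
//   flatSize(std::vector<std::array<std::size_t, 2>>{{0, 3}, {10, 12}});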

@1uc 1uc marked this pull request as ready for review November 1, 2023 12:17
@1uc (Collaborator, Author) commented Nov 1, 2023

The solution only works for afferent_edges, not for efferent_edges, due to locality: only the afferent queries touch ranges that lie close enough together to benefit from merging.

sergiorg-hpc previously approved these changes Nov 1, 2023
Reading from parallel filesystems, e.g. GPFS, requires issuing few but large reads. Reading the same block/page multiple times comes with a hefty performance penalty.

The commit implements the functionality for merging nearby reads by
adding or modifying:

  * `sortAndMerge` to allow merging ranges across gaps of a certain
    size.

  * `bulkRead` to read block-by-block and extract the requested slices
    in memory (sketched below).

  * `_readSelection` to always combine reads.

  * `?fferent_edges` to optimize reading of edge IDs.

It requires a compile-time constant `SONATA_PAGESIZE` to specify the
block/pagesize to be targeted.
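
A minimal sketch of the bulk-read-then-filter idea behind `bulkRead`; all names here are illustrative rather than libsonata's real internals, and readBlock stands in for the actual HDF5 read:

#include <cstddef>
#include <utility>
#include <vector>

using Range = std::pair<std::size_t, std::size_t>;  // half-open [begin, end)

// One contiguous read covers a merged block; the originally requested
// sub-ranges are then copied out of the in-memory buffer.
template <class T, class ReadBlockFn>
std::vector<T> bulkRead(const Range& block,
                        const std::vector<Range>& wanted,
                        ReadBlockFn readBlock) {
    std::vector<T> buffer = readBlock(block);  // single large read
    std::vector<T> result;
    for (const auto& r : wanted) {
        // Offsets are relative to the start of the merged block.
        result.insert(result.end(),
                      buffer.begin() + (r.first - block.first),
                      buffer.begin() + (r.second - block.first));
    }
    return result;
}

// Usage (hypothetical; readFromHDF5 is a stand-in for a dataset read):
//   auto values = bulkRead<float>(block, wanted, [&](const Range& b) {
//       return readFromHDF5(dataset, b);
//   });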
@joni-herttuainen joni-herttuainen force-pushed the 1uc/optimize_xfferent_edges branch from fd8adc1 to d2e189e on November 3, 2023 10:38
matz-e previously approved these changes Nov 3, 2023
Review thread on src/population.hpp (outdated, resolved)
@1uc 1uc dismissed stale reviews from matz-e and sergiorg-hpc via 98761b1 November 6, 2023 08:57
@joni-herttuainen joni-herttuainen merged commit 8366465 into master Nov 6, 2023
@joni-herttuainen joni-herttuainen deleted the 1uc/optimize_xfferent_edges branch November 6, 2023 10:11
WeinaJi pushed a commit to BlueBrain/neurodamus that referenced this pull request Jan 29, 2024
## Context
When using `WholeCell` load-balancing, the access pattern when reading
parameters during synapse creation is extremely poor and is the main
reason why we see long (10+ minutes) periods of severe performance
degradation of our parallel filesystem when running slightly larger
simulations on BB5.

Using Darshan and several PoCs we established that the time required to
read these parameters can be reduced by more than 8x and IOps can be
reduced by over 1000x when using collective MPI-IO. Moreover, the
"waiters" where reduced substantially as well. See BBPBGLIB-1070.

Following those findings, we concluded that neurodamus would need to use
collective MPI-IO in the future.
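
For reference, the standard HDF5 hooks for collective MPI-IO look roughly like this (a minimal sketch, not neurodamus or libsonata code; it assumes an MPI-enabled HDF5 build):

#include <hdf5.h>
#include <mpi.h>

// Open a file with the MPI-IO driver; individual reads can then be made
// collective via a dataset-transfer property list.
hid_t openParallel(const char* filename, MPI_Comm comm) {
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, comm, MPI_INFO_NULL);
    hid_t file = H5Fopen(filename, H5F_ACC_RDONLY, fapl);
    H5Pclose(fapl);
    return file;
}

// Per-read transfer property list requesting collective I/O:
//   hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
//   H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
//   H5Dread(dset, H5T_NATIVE_FLOAT, memspace, filespace, dxpl, buffer);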

We've implemented most of the required changes directly in libsonata
allowing others to benefit from the same optimizations should the need
arise. See:
BlueBrain/libsonata#309
BlueBrain/libsonata#307

and preparatory work:
BlueBrain/libsonata#315
BlueBrain/libsonata#314
BlueBrain/libsonata#298 

By instrumenting two simulations (SSCX and reduced MMB) we concluded
that neurodamus was almost collective. However, certain attributes were
read in a different order on different MPI ranks, possibly because hashes
are salted differently on different ranks.

## Scope
This PR enables neurodamus to use collective IO for the simulation
described above.

## Testing
We successfully ran the reduced MMB simulation, but since SSCX hasn't
been converted to SONATA, we can't run that simulation.

## Review
* [x] PR description is complete
* [x] Coding style (imports, function length, new functions, classes or files) is good
* [ ] Unit/Scientific test added
* [ ] Updated Readme, in-code, developer documentation

---------

Co-authored-by: Luc Grosheintz <[email protected]>
WeinaJi pushed a commit to BlueBrain/neurodamus that referenced this pull request Oct 14, 2024