Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove useless IR when fusing instructions #234

Merged
merged 1 commit into from
Oct 3, 2023

Conversation

qwe661234
Copy link
Collaborator

Originally, memory usage sharply increased during the frequent allocation and freeing of basic blocks and IR arrays. To reduce memory usage, we introduced a block memory pool and an IR array memory pool.

As shown in the analysis below, we experienced about a 10% performance loss due to the reduction in memory usage. However, we are currently utilizing KiB of memory instead of MiB.

  • Performance
Metric Original Proposed
Dhrystone 2359 DMIPS 2085.5 DMIPS
CoreMark 1967.382 iter/s 1823.333 iter/s
  • Memory usage
Metric Original Proposed
Dhrystone 99 MiB 169 KiB
CoreMark 2.4 MiB 176 KiB

@jserv jserv requested a review from RinHizakura September 27, 2023 11:57
@qwe661234
Copy link
Collaborator Author

Compared to PR #232, introducing a memory pool can effectively reduce memory usage, as valgrind only records the memory you actually use. For example, we allocate 2^32 bytes of memory using mmap for virtual data memory, but the total memory usage is less than 2^32.

Here, I compare the performance based on Benchmark / Performance regression check and memory usage based on host machine.

  • Performance
Metric PR #232 This PR
Dhrystone 1058.12 DMIPS 1084.62 DMIPS
CoreMark 922.773 iter/s 1010.928 iter/s
  • Memory usage
Metric Original Proposed
Dhrystone 382,269 B 169,197 B
CoreMark 835,597 B 176,605 B

@RinHizakura
Copy link
Collaborator

I believe that the benefit of memory reduction mostly comes from mmap. When we undefined the USE_MMAP, you can still see the terrible memory usage. The memory pool can do less since we only have to do mpool_free when block_map_clear happens.

If we think supporting a platform without mmap is not important, I would feel good to apply this, however.

@jserv
Copy link
Contributor

jserv commented Sep 27, 2023

I believe that the benefit of memory reduction mostly comes from mmap. When we undefined the USE_MMAP, you can still see the terrible memory usage. The memory pool can do less since we only have to do mpool_free when block_map_clear happens.
If we think supporting a platform without mmap is not important, I would feel good to apply this, however.

Good point.

In the context of utilizing the WebAssembly System Interface (WASI), it is apparent that there is no equivalent functionality to mmap. An issue has been raised on the WASI GitHub repository, highlighting the absence of mmap support within the WebAssembly environment. Consequently, if rv32emu were to be ported to WebAssembly, it should be noted that the mmap functionality would not be available.

@qwe661234, you should take into consideration the scenario where mmap may not be feasibly usable on certain platforms.

@qwe661234
Copy link
Collaborator Author

Even if we cannot use mmap, the total memory usage for blocks and IR arrays has been limited to sizeof(block) * block_capacity + sizeof(rv_insn_t) * block_capacity * 1024 by the memory pool. We should note that the total memory usage would exceed 2^32 bytes if we could not use mmap, due to virtual data memory. Therefore, I believe the memory usage of blocks is not a critical issue when your total memory usage has already exceeded 4 GiB.

I believe that using a linked list to implement IR is still a more flexible approach if we want to remove any IR nodes from this linked list during instruction fusion. However, it consumes more memory than qemu when running certain benchmarks.

src/riscv.h Outdated Show resolved Hide resolved
src/emulate.c Outdated Show resolved Hide resolved
@RinHizakura
Copy link
Collaborator

Even if we cannot use mmap, the total memory usage for blocks and IR arrays has been limited to sizeof(block) * block_capacity + sizeof(rv_insn_t) * block_capacity * 1024 by the memory pool. We should note that the total memory usage would exceed 2^32 bytes if we could not use mmap, due to virtual data memory. Therefore, I believe the memory usage of blocks is not a critical issue when your total memory usage has already exceeded 4 GiB.

I'm not sure if it is safe to say something like" Although we are not good, at least we aren't that bad at all compared to others". In other words, the 2^32 bytes requirement of data memory may not be something that isn't able to do better. So I am not sure if we have to be restricted for this.

I believe that using a linked list to implement IR is still a more flexible approach if we want to remove any IR nodes from this linked list during instruction fusion. However, it consumes more memory than qemu when running certain benchmarks.

If mmap is the key to reducing the consumption of memory, it is worth noting that we can also apply mmap for #232, which may save more memory. We should use mmap on #232 for fair comparison if qemu also make use of mmap .

@RinHizakura
Copy link
Collaborator

RinHizakura commented Sep 28, 2023

Another idea here. What if we try to combine both solutions' advantages? Specifically, we still use the memory pool but allocate the chunk of memory in a single sizeof(rv_insn_t) unit but not a 1 << 10 * sizeof(rv_insn_t) unit. This allows us to still get the benefit of the memory pool, and also the benefit of mmap (if enabled). If mmap is disabled for the platform unluckily, the small chunk of memory could still reduce the arena we should prepare for mpool, since we have less chance to do mpool_expand.

The possible issue for this solution could be the cost for mpool_alloc, since allocated with smaller chunks means frequent calls of this API. The mpool_free should not be a big problem because the allocation of rv_insn_t is special, which will free all of the allocated objects at once at block_map_clear, and this shall have room to be optimized.

@qwe661234
Copy link
Collaborator Author

qwe661234 commented Sep 28, 2023

Another idea here. What if we try to combine both solutions' advantages? Specifically, we still use the memory pool but allocate the chunk of memory in a single sizeof(rv_insn_t) unit but not a 1 << 10 * sizeof(rv_insn_t) unit. This allows us to still get the benefit of the memory pool, and also the benefit of mmap (if enabled). If mmap is disabled for the platform unluckily, the small chunk of memory could still reduce the arena we should prepare for mpool, since we have less chance to do mpool_expand.

The possible issue for this solution could be the cost for mpool_alloc, since allocated with smaller chunks means frequent calls of this API. The mpool_free should not be a big problem because the allocation of rv_insn_t is special, which will free all of the allocated objects at once at block_map_clear, and this shall have room to be optimized.

To maintain compactness, we must reduce our memory usage to be lower than that of QEMU. If this approach can achieve this goal while maintaining acceptable performance, it would be preferable for this PR.

@RinHizakura
Copy link
Collaborator

RinHizakura commented Sep 28, 2023

To maintain compactness, we must reduce our memory usage to be lower than that of QEMU. If this approach can achieve this goal while maintaining acceptable performance, it would be preferable for this PR.

Sure, 196cffbe shows the implementation of my new idea.

196cffbe with mmap:

$ ./build/rv32emu ./build/dhrystone.elf 
==11644==   total heap usage: 338 allocs, 338 frees, 15,204 bytes allocated

#234 with mmap

valgrind ./build/rv32emu ./build/dhrystone.elf 
==12367==   total heap usage: 336 allocs, 336 frees, 15,172 bytes allocated

196cffbe w/o mmap:

valgrind ./build/rv32emu ./build/dhrystone.elf 
==13356==   total heap usage: 342 allocs, 342 frees, 252,772 bytes allocated

#234 w/o mmap

valgrind ./build/rv32emu ./build/dhrystone.elf 
==14116==   total heap usage: 338 allocs, 338 frees, 58,768,196 bytes allocated

I haven't carefully experimented with the performance and other benchmarks yet. but it looks like some advantage could be brought for this new solution:

@qwe661234
Copy link
Collaborator Author

To maintain compactness, we must reduce our memory usage to be lower than that of QEMU. If this approach can achieve this goal while maintaining acceptable performance, it would be preferable for this PR.

Sure, 196cffbe shows the implementation of my new idea.

196cffbe with mmap:

$ ./build/rv32emu ./build/dhrystone.elf 
==11644==   total heap usage: 338 allocs, 338 frees, 15,204 bytes allocated

#234 with mmap

valgrind ./build/rv32emu ./build/dhrystone.elf 
==12367==   total heap usage: 336 allocs, 336 frees, 15,172 bytes allocated

196cffbe w/o mmap:

valgrind ./build/rv32emu ./build/dhrystone.elf 
==13356==   total heap usage: 342 allocs, 342 frees, 252,772 bytes allocated

#234 w/o mmap

valgrind ./build/rv32emu ./build/dhrystone.elf 
==14116==   total heap usage: 338 allocs, 338 frees, 58,768,196 bytes allocated

I haven't carefully experimented with the performance and other benchmarks yet. but it looks like some advantage could be brought for this new solution:

@jserv, I think this idea is better than using array to store IRs. With this modification, we can easily remove useless IR directly, instead of skipping it.

@jserv
Copy link
Contributor

jserv commented Sep 29, 2023

@jserv, I think @RinHizakura's proposal is better than using array to store IRs. With this modification, we can easily remove useless IR directly, instead of skipping it.

Based on the feedback, I would like to merge pull request #232 first and then request improvements in IR manipulation by effectively removing unneeded instructions.
@qwe661234, create new pull requests for IR manipulation and take action on this pull request, either by refining it or marking it as obsolete.

@RinHizakura
Copy link
Collaborator

The new implementation is merged to #232 now! We could discuss over there if we have a consensus to use that solution.

jserv

This comment was marked as outdated.

src/emulate.c Outdated Show resolved Hide resolved
@qwe661234 qwe661234 requested review from jserv and RinHizakura October 3, 2023 03:42
Copy link
Contributor

@jserv jserv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Squash the git commit and refine the commit message to address the benefit by use case.

Originally, we need to keep and handle useless IR carefully when fusing
instructions. Now, we can discard these useless IRs because of the
modification of IRs' data structure from array to singly-linked list.
@qwe661234 qwe661234 requested a review from jserv October 3, 2023 08:34
@jserv jserv changed the title Limit the memory usage of block and block IR Remove useless IR when fusing instructions Oct 3, 2023
@jserv
Copy link
Contributor

jserv commented Oct 3, 2023

I defer to @RinHizakura for confirmation.

Copy link
Collaborator

@RinHizakura RinHizakura left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jserv Now it looks great to me!

@jserv jserv merged commit 8b59e9d into sysprog21:master Oct 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants