
Pass a start-of-usable-memory to the module. #334

Closed
ghost opened this issue Sep 5, 2015 · 17 comments

@ghost

ghost commented Sep 5, 2015

For some runtimes there is the option of placing the linear memory at absolute zero in the address space. This helps on ARM in particular, as it frees up more addressing modes for use, and it also frees up the register holding the linear-memory base on both ARM and x64 (tested on Odin). However, some systems reserve the bottom pages for security purposes, so the low addresses could not be used with good performance (though they could perhaps be emulated in a signal handler). If a start-of-usable-memory value could be passed to a module, it could avoid allocating in this low region.

This might also be useful together with an option to protect the low pages (which was noted as a developer option elsewhere). Even when the linear memory is not placed at absolute zero, the start-of-usable-memory parameter could advise the application to avoid the protected low pages when this developer option is enabled. Since the page size varies between systems, the start should probably be a parameter; this would be more general than a simple flag indicating that the low page is protected.

@lukewagner
Member

Zero-based heaps are definitely an interesting optimization, but probably not something we want to provision for in the MVP. It's also possible that, with the multi-process future feature (where the basic idea is letting wasm ask for a fresh process, avoiding all the sync-call semantics that make this hard in general), we wouldn't even need to reserve the low pages, so no explicit spec provision would be necessary. So perhaps you could file a PR to add a bullet to FutureFeatures.md#multiprocess-support pointing out that it enables this optimization more cleanly. Also, with separate processes, you'd avoid the "there can be only one" pigeonhole problem.

@jfbastien
Member

I'm not sure I understand what the concrete proposal is. NaCl's sandbox uses this trick extensively, but I'd like to understand what you're proposing that wasm do concretely.

@ghost
Author

ghost commented Sep 9, 2015

The proposal is that the runtime define a start-of-usable-memory value and pass it to the wasm code. Ideally this would be a compile-time constant, in case the code wants a fixed low layout, but it might be fine as a runtime constant too. The wasm code would then avoid using memory before this start index - this might just be some default code in emscripten. That's it - no memory protection for low pages is being defined here.

What this would do is give a future runtime the option of using these low pages, and a few uses have been mentioned: zero-based linear memory on a system that protects the low pages; developer support for detecting zero-page accesses in wasm code; and faster bounds-check code paths (I think as used in Odin x86 now, which can take a slow path on low indexes).

It is just planning ahead so that deployed code has some flexibility that is anticipated to be needed.

@ghost
Author

ghost commented Sep 9, 2015

@lukewagner Even running in a separate process does not always allow use of the low pages. For example, ARM Linux kernels have a hard-coded protected low page, and the only way to explore this on ARM is to use a hacked kernel. Similarly, x86/x64 Linux kernels have a system-wide restriction which can be changed for testing, but it might not be practical to demand that change just for wasm.

If memory were not linear and were instead allocated from a larger process address space then I see this would not be an issue, but that seems a long way off, and it seems wasm will want to support in-process runtimes well going forward too?

@AndrewScheidecker

What I proposed in #306 would also provide this functionality. It's proposing that the WASM instance needs to go through the runtime to allocate pages from its address-space. In this case, you could make the runtime only allocate from pages above a certain address.

@ghost
Author

ghost commented Sep 9, 2015

@AndrewScheidecker Yes, requesting use of the linear memory at page granularity would also support this. I currently only see a need to restrict use of the starting pages, so that is all that is being requested.

@lukewagner
Member

@JSStats I see, makes sense. In that case, I think this feature should be opt-in and low addresses should fault. Otherwise we'd have to simulate the low pages with signal handlers, which is both unnecessary complexity and a hidden performance cliff for users. With opt-in + faulting, though, this gives applications the desired low-memory fault (and perhaps we can primarily pitch the feature as such; the zero-based optimization is just a perk on some platforms). While promising, I think this feature doesn't belong in the MVP; now I'm thinking it belongs in the finer-grained-control-over-memory bucket.

@jfbastien
Member

IIUC your proposal is akin to feature detection in that a developer can query for lowest_addressable_memory, and the wasm runtime would provide a value (potentially 0)?

I see the rationale, but am wondering what happens if a developer addresses below this value?

Our orientation so far had been to let the user protect the lower pages as they wished, because a wasm module sees "addresses" (really: heap offset) that aren't actually the process' virtual address. Your proposal assumes that we allow zero-mapping (i.e. wasm heap offset == virtual address), and different kernels forbid us from controlling the lower pages entirely.

I'd like a proposal on this topic to also address security concerns: this gives knowledge to an attacker that they wouldn't otherwise have. It's workable (NaCl does it) but needs to be designed properly.

@ghost
Author

ghost commented Sep 10, 2015

@jfbastien Yes, runtime feature detection would work fine for this. There is no proposal to do anything different if the wasm code accesses memory below this value - it's just to prepare deployed code to be able to support this restriction in the future. This is not intended to replace mprotect support, but rather to work around system or implementation issues with accessing low pages, where, even if the low pages are not user-protected, accessing them might take slow paths.

Is the 'security concern' that an attacker would have a good clue that the linear memory is located at zero in the address space and could depend on this in conjunction with other security issues?

If people want to bundle this in with runtime-enforced protection of the low pages then that would be fine with me too - an attacker would then not be able to distinguish whether the linear memory is at zero in the address space.

@ghost
Author

ghost commented Sep 10, 2015

@jfbastien Actually, this might also want to be a compile-time constant so that the offset can be baked into compiled code.

@jfbastien
Member

Is the 'security concern' that an attacker would have a good clue that the linear memory is located at zero in the address space and could depend on this in conjunction with other security issues?

Correct, it gives an attacker a lot of information that can be combined with ASLR leaks / non-PIC code / fixed-mapped code (e.g. kernel helpers). It's not insurmountable, but it's something that a design has to work through.

@ghost ghost mentioned this issue Sep 29, 2015
@ghost
Author

ghost commented Oct 3, 2015

Some preliminary performance results, using WAVM, LLVM 3.8, and the zlib benchmark. The pointer-masking variant of the benchmark is used, as it is significantly faster and probably has more opportunity to exploit the buffer being at zero.

x86: 97.5%
x64: 96.6%
ARMv7: 84.6%

So this might offer a useful performance improvement for ARM. Devices using ARM CPUs are likely to need the best performance and efficiency, and may well want to run a demanding wasm app 'all-in' and be able to allocate the buffer at zero.

Note that the lower pages are not used: the emscripten GLOBAL_BASE option is used to move the data segment higher, and the benchmark ran fine without access to the low pages, on an unmodified kernel.

@nmostafa

nmostafa commented Oct 8, 2015

@JSStats, can you elaborate more on the experiment and the results? Do we know where the improvement on ARM is coming from (use of faster addressing modes?), and why IA is not benefiting as much?

@ghost
Author

ghost commented Oct 9, 2015

@nmostafa Here's the ARM code for the hottest loop. You can see that the limited addressing modes are better used without having to add in the memory base on each access.

With memory at absolute zero:
0x75db3ca8: add r7, r1, r10
0x75db3cac: sxtb r5, r12
0x75db3cb0: add r4, r7, r0
0x75db3cb4: bfc r4, #28, #4
0x75db3cb8: ldrsb r4, [r4]
0x75db3cbc: cmp r4, r5
0x75db3cc0: bne 0x75db3e64
0x75db3cc4: add r5, r8, r1
0x75db3cc8: sxtb r4, r11
0x75db3ccc: add r5, r5, r0
0x75db3cd0: bfc r5, #28, #4
0x75db3cd4: ldrsb r5, [r5]
0x75db3cd8: cmp r5, r4
0x75db3cdc: mov r5, #0
0x75db3ce0: movweq r5, #1
0x75db3ce4: cmp r5, #0
0x75db3ce8: beq 0x75db3e64
0x75db3cec: bfc r7, #28, #4
0x75db3cf0: ldrb r5, [r9]
0x75db3cf4: ldrb r4, [r7]
0x75db3cf8: cmp r4, r5
0x75db3cfc: mov r5, #0
0x75db3d00: movweq r5, #1
0x75db3d04: cmp r5, #0
0x75db3d08: beq 0x75db3e64
0x75db3d0c: ldrb r5, [r9, #1]
0x75db3d10: ldrb r7, [r7, #1]
0x75db3d14: cmp r7, r5
0x75db3d18: mov r7, #0
0x75db3d1c: movweq r7, #1
0x75db3d20: cmp r7, #0
0x75db3d24: beq 0x75db3e64
0x75db3d28: ldr r7, [sp, #24]
0x75db3d2c: ldr r9, [sp, #20]
0x75db3d30: str r11, [sp, #44] ; 0x2c
0x75db3d34: add r11, r7, r1
0x75db3d38: mov r8, r9
0x75db3d3c: sub lr, r11, #8
0x75db3d40: bfc r8, #28, #4
0x75db3d44: bfc lr, #28, #4
0x75db3d48: mov r7, r8
0x75db3d4c: ldrb r4, [r7, #1]!
0x75db3d50: ldrb r5, [lr, #1]
0x75db3d54: cmp r4, r5
0x75db3d58: moveq r7, r8
0x75db3d5c: ldrbeq r4, [r7, #2]!
0x75db3d60: ldrbeq r5, [lr, #2]
0x75db3d64: cmpeq r4, r5
0x75db3d68: bne 0x75db3e08
0x75db3d6c: mov r7, r8
0x75db3d70: ldrb r4, [r7, #3]!
0x75db3d74: ldrb r5, [lr, #3]
0x75db3d78: cmp r4, r5
0x75db3d7c: moveq r7, r8
0x75db3d80: ldrbeq r4, [r7, #4]!
0x75db3d84: ldrbeq r5, [lr, #4]
0x75db3d88: cmpeq r4, r5
0x75db3d8c: bne 0x75db3e08
0x75db3d90: mov r7, r8
0x75db3d94: ldrb r4, [r7, #5]!
0x75db3d98: ldrb r5, [lr, #5]
0x75db3d9c: cmp r4, r5
0x75db3da0: moveq r7, r8
0x75db3da4: ldrbeq r4, [r7, #6]!
0x75db3da8: ldrbeq r5, [lr, #6]
0x75db3dac: cmpeq r4, r5
0x75db3db0: bne 0x75db3e08
0x75db3db4: ldrb r4, [r8, #7]!
0x75db3db8: ldrb r7, [lr, #7]
0x75db3dbc: cmp r4, r7
0x75db3dc0: mov r7, r8
0x75db3dc4: bne 0x75db3e08
0x75db3dc8: ldr r6, [sp, #48] ; 0x30
0x75db3dcc: add r7, r9, #8
0x75db3dd0: bfc r7, #28, #4
0x75db3dd4: cmp r7, r6
0x75db3dd8: bcs 0x75db3e08
0x75db3ddc: add r4, r11, #8
0x75db3de0: bfc r11, #28, #4
0x75db3de4: ldrb r5, [r7]
0x75db3de8: mov r9, r7
0x75db3dec: ldrb r6, [r11]
0x75db3df0: mov r11, r4
0x75db3df4: cmp r5, r6
0x75db3df8: mov r6, #0
0x75db3dfc: movweq r6, #1
0x75db3e00: cmp r6, #0
0x75db3e04: bne 0x75db3d38
0x75db3e08: ldr r6, [sp, #32]
0x75db3e0c: ldr lr, [sp, #40] ; 0x28
0x75db3e10: ldr r8, [sp, #28]
0x75db3e14: ldr r9, [sp, #16]
0x75db3e18: ldr r11, [sp, #44] ; 0x2c
0x75db3e1c: movw r5, #258 ; 0x102
0x75db3e20: add r4, r6, r7
0x75db3e24: ldr r6, [sp, #36] ; 0x24
0x75db3e28: add r7, r4, r5
0x75db3e2c: cmp r7, r0
0x75db3e30: ble 0x75db3e64
0x75db3e34: ldr r0, [sp, #8]
0x75db3e38: str r1, [r0, #112] ; 0x70
0x75db3e3c: ldr r0, [sp, #4]
0x75db3e40: cmp r7, r0
0x75db3e44: bge 0x75db3e88
0x75db3e48: ldr r0, [sp]
0x75db3e4c: add r0, r0, r4
0x75db3e50: movw r4, #257 ; 0x101
0x75db3e54: bfc r0, #28, #4
0x75db3e58: ldrsb r11, [r0, r4]
0x75db3e5c: ldrsb r12, [r0, r5]
0x75db3e60: mov r0, r7
0x75db3e64: and r1, r1, r2
0x75db3e68: add r1, lr, r1, lsl #1
0x75db3e6c: bic r1, r1, #-268435455 ; 0xf0000001
0x75db3e70: ldrh r1, [r1]
0x75db3e74: cmp r1, r6
0x75db3e78: bls 0x75db3e8c
0x75db3e7c: subs r3, r3, #1
0x75db3e80: bne 0x75db3ca8

Without buffer at zero:
0x75db23cc: ldr r3, [sp, #72] ; 0x48
0x75db23d0: ldr r7, [sp, #76] ; 0x4c
0x75db23d4: add r5, r1, r3
0x75db23d8: sxtb r7, r7
0x75db23dc: add r3, r5, r0
0x75db23e0: bfc r3, #28, #4
0x75db23e4: ldrsb r3, [r3, r9]
0x75db23e8: cmp r3, r7
0x75db23ec: bne 0x75db2600
0x75db23f0: add r3, r10, r1
0x75db23f4: ldr r7, [sp, #68] ; 0x44
0x75db23f8: add r3, r3, r0
0x75db23fc: bfc r3, #28, #4
0x75db2400: ldrsb r3, [r3, r9]
0x75db2404: sxtb r7, r7
0x75db2408: cmp r3, r7
0x75db240c: mov r3, #0
0x75db2410: movweq r3, #1
0x75db2414: cmp r3, #0
0x75db2418: beq 0x75db2600
0x75db241c: bfc r5, #28, #4
0x75db2420: ldrb r3, [lr, r9]
0x75db2424: ldrb r7, [r5, r9]
0x75db2428: cmp r7, r3
0x75db242c: mov r3, #0
0x75db2430: movweq r3, #1
0x75db2434: cmp r3, #0
0x75db2438: beq 0x75db2600
0x75db243c: ldr r3, [sp, #60] ; 0x3c
0x75db2440: ldrb r7, [r5, r8]
0x75db2444: ldrb r3, [r3, r9]
0x75db2448: cmp r7, r3
0x75db244c: mov r3, #0
0x75db2450: movweq r3, #1
0x75db2454: cmp r3, #0
0x75db2458: beq 0x75db2600
0x75db245c: ldr r3, [sp, #44] ; 0x2c
0x75db2460: ldr r11, [sp, #40] ; 0x28
0x75db2464: add r10, r3, r1
0x75db2468: mov r5, r11
0x75db246c: sub lr, r10, #8
0x75db2470: bfc r5, #28, #4
0x75db2474: bfc lr, #28, #4
0x75db2478: ldrb r3, [r5, r8]
0x75db247c: ldrb r7, [lr, r8]
0x75db2480: cmp r3, r7
0x75db2484: bne 0x75db2564
0x75db2488: ldrb r3, [lr, r12]
0x75db248c: ldrb r7, [r5, r12]
0x75db2490: cmp r7, r3
0x75db2494: bne 0x75db256c
0x75db2498: movw r3, #61443 ; 0xf003
0x75db249c: movt r3, #12287 ; 0x2fff
0x75db24a0: mov r7, r3
0x75db24a4: ldrb r3, [lr, r7]
0x75db24a8: ldrb r7, [r5, r7]
0x75db24ac: cmp r7, r3
0x75db24b0: bne 0x75db2574
0x75db24b4: movw r3, #61444 ; 0xf004
0x75db24b8: movt r3, #12287 ; 0x2fff
0x75db24bc: mov r7, r3
0x75db24c0: ldrb r3, [lr, r7]
0x75db24c4: ldrb r7, [r5, r7]
0x75db24c8: cmp r7, r3
0x75db24cc: bne 0x75db257c
0x75db24d0: movw r3, #61445 ; 0xf005
0x75db24d4: movt r3, #12287 ; 0x2fff
0x75db24d8: mov r7, r3
0x75db24dc: ldrb r3, [lr, r7]
0x75db24e0: ldrb r7, [r5, r7]
0x75db24e4: cmp r7, r3
0x75db24e8: bne 0x75db2584
0x75db24ec: movw r3, #61446 ; 0xf006
0x75db24f0: movt r3, #12287 ; 0x2fff
0x75db24f4: mov r7, r3
0x75db24f8: ldrb r3, [lr, r7]
0x75db24fc: ldrb r7, [r5, r7]
0x75db2500: cmp r7, r3
0x75db2504: bne 0x75db258c
0x75db2508: movw r3, #61447 ; 0xf007
0x75db250c: movt r3, #12287 ; 0x2fff
0x75db2510: mov r7, r3
0x75db2514: ldrb r3, [lr, r7]
0x75db2518: ldrb r7, [r5, r7]
0x75db251c: cmp r7, r3
0x75db2520: bne 0x75db2594
0x75db2524: ldr r3, [sp, #64] ; 0x40
0x75db2528: add r11, r11, #8
0x75db252c: bfc r11, #28, #4
0x75db2530: cmp r11, r3
0x75db2534: bcs 0x75db2598
0x75db2538: add r7, r10, #8
0x75db253c: bfc r10, #28, #4
0x75db2540: ldrb r3, [r11, r9]
0x75db2544: ldrb r5, [r10, r9]
0x75db2548: mov r10, r7
0x75db254c: cmp r3, r5
0x75db2550: mov r3, #0
0x75db2554: movweq r3, #1
0x75db2558: cmp r3, #0
0x75db255c: bne 0x75db2468
0x75db2560: b 0x75db2598
0x75db2564: add r11, r5, #1
0x75db2568: b 0x75db2598
0x75db256c: add r11, r5, #2
0x75db2570: b 0x75db2598
0x75db2574: add r11, r5, #3
0x75db2578: b 0x75db2598
0x75db257c: add r11, r5, #4
0x75db2580: b 0x75db2598
0x75db2584: add r11, r5, #5
0x75db2588: b 0x75db2598
0x75db258c: add r11, r5, #6
0x75db2590: b 0x75db2598
0x75db2594: add r11, r5, #7
0x75db2598: ldr r3, [sp, #56] ; 0x38
0x75db259c: ldr r10, [sp, #52] ; 0x34
0x75db25a0: ldr lr, [sp, #36] ; 0x24
0x75db25a4: movw r7, #258 ; 0x102
0x75db25a8: add r3, r3, r11
0x75db25ac: ldr r11, [sp, #48] ; 0x30
0x75db25b0: add r5, r3, r7
0x75db25b4: cmp r5, r0
0x75db25b8: ble 0x75db2600
0x75db25bc: ldr r0, [sp, #24]
0x75db25c0: str r1, [r0, r9]
0x75db25c4: ldr r0, [sp, #28]
0x75db25c8: cmp r5, r0
0x75db25cc: bge 0x75db2624
0x75db25d0: ldr r0, [sp, #20]
0x75db25d4: add r0, r0, r3
0x75db25d8: movw r3, #61697 ; 0xf101
0x75db25dc: bfc r0, #28, #4
0x75db25e0: movt r3, #12287 ; 0x2fff
0x75db25e4: ldrsb r3, [r0, r3]
0x75db25e8: str r3, [sp, #68] ; 0x44
0x75db25ec: movw r3, #61698 ; 0xf102
0x75db25f0: movt r3, #12287 ; 0x2fff
0x75db25f4: ldrsb r0, [r0, r3]
0x75db25f8: str r0, [sp, #76] ; 0x4c
0x75db25fc: mov r0, r5
0x75db2600: and r1, r1, r6
0x75db2604: add r1, r11, r1, lsl #1
0x75db2608: bic r1, r1, #-268435455 ; 0xf0000001
0x75db260c: ldrh r1, [r1, r9]
0x75db2610: cmp r1, r4
0x75db2614: bls 0x75db2628
0x75db2618: subs r2, r2, #1
0x75db261c: bne 0x75db23cc

LLVM does not always generate great code for asm.js-style patterns, so the above might not be optimal, but both versions are faster than Odin or V8 by a good margin.

Note there are no bounds checks in the above code; it uses pointer masking, and the masking is hoisted somewhat - see the 'bfc rX, #28, #4' instructions.

IA has more consistent addressing modes (base + index*scale + offset), and this code really only uses index*scale + offset, so adding the base is almost free. I thought it might help register pressure a little, but the initial results suggest it does not help much.

If you want an option to run wasm faster 'all-in' on ARM devices then vote for this feature :)

@nmostafa

nmostafa commented Oct 9, 2015

I see ~20% code reduction, which I am assuming is the cause of the speedup (lower ICache and/or ITLB misses?), unless there is some ARMv7 uArch detail that favors simpler addressing forms.

Are the IA numbers for Core or Atom? Atom has a smaller ICache, and if we are reducing register pressure and saving some register spills/fills, you might see higher gains.

@ghost
Author

ghost commented Oct 11, 2015

@nmostafa ARMv7 has fixed-size instructions and limited addressing modes.

Taking another look at the x86 code shows a potential improvement. If the WAVM implicit masking is disabled, then runtime improves to 92% with the memory at absolute zero. This might be realistic for wasm32 code running in a 64-bit runtime that uses memory protection for access safety, reserving a large amount of VM so that any base+index*scale+offset access is known to be safe. Btw, performance is then better than with explicit masking for safety. It's not clear whether this memory-protection scheme will scale to wasm64 code, though.

The LLVM x64 code generation is currently too poor to draw any conclusions. It looks like it's having problems casting the i32 index to the i64 values used in the addressing mode, and emits separate 32-bit lea operations to compute the index. I expect the wasm memory-access opcode offset to help a little with this issue. Btw, this problem is a challenge for V8 at present too.

@ghost
Author

ghost commented Feb 25, 2016

Suggest allowing the wasm app to declare that it is not using the low page, rather than being passed a start-of-usable-memory. This strategy seems a win in many ways, and is amenable to being declared in an externally defined section if there is no consensus among wasm implementations.

This issue was closed.