Summary: useConcurrentJIT=true makes JS2-wasm slower on higher core counts
Product: WebKit
Component: JavaScriptCore
Version: WebKit Nightly Build
Hardware: Unspecified
OS: Unspecified
Status: NEW
Severity: Normal
Priority: P2
Keywords: InRadar
Reporter: Angelos Oikonomopoulos <angelos>
Assignee: Nobody <webkit-unassigned>
CC: keith_miller, mark.lam, webkit-bug-importer, ysuzuki
Attachments: attachment 465195 [details] (plot of richards-wasm score vs number of worklist threads)
Another reason I'm reluctant to submit the batchSize patch is that, fundamentally, while batching would help (many of those wasm functions are only a few bytes long), I don't think we want to do batch scheduling of the compilation jobs. As core counts get higher, some kind of work-stealing worklist would probably make more sense.

> Running the wasm parts of JS2 (wasm-cli.js) with useConcurrentJIT=true on an 80-core ARM64 box results in a performance slowdown. Specifically, with --useBBQJIT=false --useOMGJIT=false, the total score for wasm-cli.js hovers around 10 with useConcurrentJIT=true and between 13-14 with useConcurrentJIT=false.
1. What does it mean to run on an 80-core ARM64 box?
1.1 How many automatic threads are we actually spinning up in response to this?
1.2 How many of these threads are actually active during the slowdown?
1.3 What are these threads working on at that time?
2. What is the nature of the workload?
2.1 Is the slowness due to more contention now?
2.2 Is the slowness due to memory bandwidth saturation?
Instead of jumping to a solution prematurely, I think you should profile the workload to see what has changed in the high core count profile vs the low core count profile. Since not everyone else has access to this 80-core ARM64 box, you'll have to do this profiling yourself to determine what has changed in the performance profile.
This is actually interesting, let me check a bit. Ah, so this only shows up on machines with large CPU core counts (more than ~15).
Created attachment 465195 [details]
Plot of richards-wasm score vs number of worklist threads, --useBBQJIT=false --useOMGJIT=false

Running the wasm parts of JS2 (wasm-cli.js) with useConcurrentJIT=true on an 80-core ARM64 box results in a performance slowdown. Specifically, with --useBBQJIT=false --useOMGJIT=false, the total score for wasm-cli.js hovers around 10 with useConcurrentJIT=true and between 13-14 with useConcurrentJIT=false.

As far as I can tell, this is because of contention when claiming functions from the Wasm::LLintPlan in the Wasm::Worklist. I can easily bring performance back to parity with the useConcurrentJIT=false case by changing the worklist code to spawn a single thread, or by notifying only one of the threads for the compilation phase.

The test most affected seems to be richards-wasm. Experimenting with richards-wasm alone, I see performance start to drop off when the number of threads exceeds 12 or so. This may not currently be a big problem on end-user devices, but it does affect benchmarking and is probably going to become more of an issue in the future.

I've experimented with simply batching the workload: each worklist thread claims consecutive functions (up to functionBytes/batchSize batches, where functionBytes is the sum of the function sizes in the module and batchSize is a tunable), and with a batchSize of 16KB I can finally get a modest (~10%) speedup in the useConcurrentJIT=true case (I didn't put a lot of effort into tuning). This probably applies to the other wasm tiers too, but I haven't verified that.

I can submit a PR with the batchSize changes, but the batch size would have to be tuned for each user of Wasm::Worklist (and probably for each arch too).