Bug 252981

Summary: useConcurrentJIT=true makes JS2-wasm slower on higher core counts
Product: WebKit    Reporter: Angelos Oikonomopoulos <angelos>
Component: JavaScriptCore    Assignee: Nobody <webkit-unassigned>
Status: NEW
Severity: Normal    CC: keith_miller, mark.lam, webkit-bug-importer, ysuzuki
Priority: P2    Keywords: InRadar
Version: WebKit Nightly Build
Hardware: Unspecified
OS: Unspecified
Attachments:
Plot of richards-wasm score vs number of worklist threads, --useBBQJIT=false --useOMGJIT=false

Description Angelos Oikonomopoulos 2023-02-27 02:58:49 PST
Created attachment 465195
Plot of richards-wasm score vs number of worklist threads, --useBBQJIT=false --useOMGJIT=false

Running the wasm parts of JS2 (wasm-cli.js) with useConcurrentJIT=true on an 80-core ARM64 box results in a performance slowdown. Specifically, with --useBBQJIT=false --useOMGJIT=false, the total score for wasm-cli.js hovers around 10 with useConcurrentJIT=true and between 13-14 for useConcurrentJIT=false.

As far as I can tell, this is because of contention when claiming functions from the Wasm::LLintPlan in the Wasm::Worklist. I can easily bring it back to parity with the useConcurrentJIT=false case by changing the worklist code to spawn a single thread or by only notifying one of the threads for the compilation phase.

The test most affected seems to be richards-wasm. Experimenting just with richards-wasm, I see performance start dropping off when the number of threads goes over 12 or so.

This may not currently be a big problem on end user devices, but it does affect benchmarking and is probably going to become more of an issue in the future.

I've experimented with simply batching the workload: have each worklist thread claim consecutive functions (up to functionBytes/batchSize, where functionBytes is the sum of the function sizes in the module and batchSize is a tunable). With a batchSize of 16KB, I can finally get a modest (~10%) speedup for the useConcurrentJIT=true case (I didn't put a lot of effort into tuning).

This probably applies to the other wasm tiers too, but I haven't verified that. I can submit a PR with the batchSize changes, but the batch size would have to be tuned for each user of Wasm::Worklist (and probably for each architecture too).
Comment 1 Angelos Oikonomopoulos 2023-02-27 03:02:48 PST
Another reason I'm reluctant to submit the batchSize patch is that, fundamentally, while batching would help (lots of these wasm functions are only a few bytes long), I don't think we want to do batch scheduling of the compilation jobs. As core counts get higher, I guess some kind of work-stealing worklist would make sense?
Comment 2 Radar WebKit Bug Importer 2023-03-06 02:59:14 PST
<rdar://problem/106280756>
Comment 3 Mark Lam 2023-03-06 08:22:54 PST
> Running the wasm parts of JS2 (wasm-cli.js) with useConcurrentJIT=true on an 80-core ARM64 box results in a performance slowdown. Specifically, with --useBBQJIT=false --useOMGJIT=false, the total score for wasm-cli.js hovers around 10 with useConcurrentJIT=true and between 13-14 for useConcurrentJIT=false.

1. What does it mean to run on an 80-core ARM64 box?
   1.1 How many automatic threads are we actually spinning up in response to this?
   1.2 How many of these threads are actually active during the slowdown?
   1.3 What are these threads working on at that time?
2. What is the nature of the workload?
   2.1 Is the slowness due to more contention now?
   2.2 Is the slowness due to memory bandwidth saturation?

Instead of jumping to some solution prematurely, I think you should profile the workload to see what's changed in the high core count profile vs the low core count profile.  Since not everyone else has access to this 80-core ARM64 box, you'll have to do this profiling to determine what's changed in the performance profile.
Comment 4 Yusuke Suzuki 2023-03-06 23:52:22 PST
This is actually interesting, let me check a bit.
Comment 5 Yusuke Suzuki 2023-03-06 23:53:24 PST
Ah, so this only affects machines with large CPU core counts (more than ~15)