Bug 261821

Summary: Resizable array buffer stress tests run out of memory on ppc64le
Product: WebKit
Reporter: Michael Catanzaro <mcatanzaro>
Component: JavaScriptCore
Assignee: Michael Catanzaro <mcatanzaro>
Status: NEW
Severity: Normal
CC: dan, mark.lam, mcatanzaro, webkit-bug-importer
Priority: P2
Keywords: InRadar
Version: WebKit Nightly Build
Hardware: PC
OS: Linux

Description Michael Catanzaro 2023-09-20 08:00:29 PDT
Using a cloop build on ppc64le, the following JSC stress tests crash due to running out of memory:

stress/resizable-byteoffset.js.default
stress/resizable-byteoffset.js.bytecode-cache
stress/resizable-byteoffset.js.mini-mode
stress/resizable-byteoffset.js.lockdown
stress/resizable-bytelength.js.default
stress/resizable-bytelength.js.bytecode-cache
stress/resizable-bytelength.js.mini-mode
stress/resizable-bytelength.js.lockdown
stress/resizable-length.js.default
stress/resizable-length.js.bytecode-cache
stress/resizable-length.js.mini-mode
stress/resizable-length.js.lockdown

All of the failures look like this:

Running stress/resizable-byteoffset.js.default
stress/resizable-byteoffset.js.default: Crashing because current footprint: 631242752 exceeds limit: 629145600
stress/resizable-byteoffset.js.default: test_script_452: line 2: 1830913 Aborted                 (core dumped) ( "$@" /home/jenkins/workspace/WebKit-JSC/label/ppc64le/WebKitBuild/Release/bin/jsc --useFTLJIT\=false --useFunctionDotArguments\=true --validateExceptionChecks\=true --useDollarVM\=true --maxPerThreadStackUsage\=1572864 --useFTLJIT\=true --useResizableArrayBuffer\=1 resizable-byteoffset.js )
stress/resizable-byteoffset.js.default: ERROR: Unexpected exit code: 134
FAIL: stress/resizable-byteoffset.js.default

My first instinct was to mark the tests as memoryLimited, since we commonly do this whenever a test requires a large amount of memory to run:

//@ skip if $memoryLimited

But then I realized that the tests do not look like they should require very much memory at all. E.g. the resizable-length test only creates one ArrayBuffer at a time, then resizes it up to 1024 bytes. It does this 10,000 times. I'm OK with skipping the tests, but I wonder if it's appropriate; perhaps something is wrong with garbage collection on this architecture, for example? Would any sort of debug info be helpful to decide?
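For reference, the shape of the test is roughly this (my paraphrase, not the exact test source; the real tests also assert byteLength/byteOffset values through typed array views):

function test() {
    // One resizable ArrayBuffer per call, grown in small steps up to 1 KiB.
    var buffer = new ArrayBuffer(0, { maxByteLength: 1024 });
    for (var size = 0; size <= 1024; size += 16)
        buffer.resize(size);
}

for (var i = 0; i < 1e4; ++i)
    test();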
Comment 1 Radar WebKit Bug Importer 2023-09-27 08:01:11 PDT
<rdar://problem/116121832>
Comment 2 Michael Catanzaro 2023-10-03 06:27:22 PDT
(In reply to Michael Catanzaro from comment #0)
> But then I realized that the tests do not look like they should require very
> much memory at all. E.g. the resizable-length test only creates one
> ArrayBuffer at a time, then resizes it up to 1024 bytes. It does this 10,000
> times. I'm OK with skipping the tests, but I wonder if it's appropriate;
> perhaps something is wrong with garbage collection on this architecture, for
> example? Would any sort of debug info be helpful to decide?

I've also asked for guidance in Slack and didn't receive any response, so I'm going to mark these tests as memory limited.
Comment 3 Michael Catanzaro 2023-10-03 06:30:46 PDT
Pull request: https://github.com/WebKit/WebKit/pull/18548
Comment 4 Michael Catanzaro 2023-10-04 09:08:46 PDT
I got access to a ppc64le machine to run the test there, but it has a huge amount of RAM and the test passes.

I made two modifications to the test:

diff --git a/JSTests/stress/resizable-byteoffset.js b/JSTests/stress/resizable-byteoffset.js
index c25d9cbf22bb..cf26153ed8fb 100644
--- a/JSTests/stress/resizable-byteoffset.js
+++ b/JSTests/stress/resizable-byteoffset.js
@@ -15,7 +15,8 @@ function test() {
         else
             shouldBe(array.byteOffset, 64);
     }
+    gc();
 }
 
-for (var i = 0; i < 1e4; ++i)
+for (var i = 0; i < 1e5; ++i)
     test();

The change to 1e5 is to make the test run longer. With this change, I see RAM usage climb to just above 5 GiB before garbage collection kicks in.

If I additionally add the gc() call to force a collection on each iteration, then RAM usage stays nice and low the whole time.

So my conclusion is clear: the garbage collector works, but it waits far too long before running, long enough that the test runs out of RAM on the builder that runs the CI.

I wonder how JSC decides when to run garbage collection?

Notably, bugs specific to ppc64le almost always turn out to be related to page size: ppc64le uses 64 KiB pages, much larger than the 4 KiB generally used on x86_64 and aarch64 or the 16 KiB used on Apple platforms. I wonder if that could be relevant here.
Comment 5 Mark Lam 2023-10-04 13:49:30 PDT
How does it behave on x86_64 in contrast?
Comment 6 Michael Catanzaro 2023-10-04 14:00:02 PDT
(In reply to Mark Lam from comment #5)
> How does it behave on x86_64 in contrast?

Good question. Memory use grows to about 400 MB, then it drops (presumably because garbage collection runs). So it looks like the threshold to trigger garbage collection is way lower on x86_64 than on ppc64le.
Comment 7 Michael Catanzaro 2023-10-04 14:05:50 PDT
Hm, it looks like JSC considers total system RAM when deciding how frequently to run GC. In this example, the ppc64le test system has >200 GB of RAM while the x86_64 system has 64 GB. So maybe this could be a red herring? :/
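Roughly, the growth heuristic appears to work like this (a simplified sketch in my own words, not the actual JSC source; presumably the real code uses tunable options for the fractions and growth factors):

#include <cstddef>

// Sketch only: the heap is allowed to grow by a factor that depends on how
// large it already is relative to total system RAM, so a machine with lots of
// RAM tolerates a much bigger heap before a full collection is scheduled.
static std::size_t proportionalHeapSize(std::size_t heapSize, std::size_t ramSize)
{
    constexpr double smallHeapRAMFraction = 0.25;   // illustrative values only
    constexpr double mediumHeapRAMFraction = 0.5;
    if (heapSize < ramSize * smallHeapRAMFraction)
        return heapSize * 2;                         // grow aggressively while the heap is small
    if (heapSize < ramSize * mediumHeapRAMFraction)
        return static_cast<std::size_t>(heapSize * 1.5);
    return static_cast<std::size_t>(heapSize * 1.25); // grow conservatively once it is large
}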
Comment 8 Mark Lam 2023-10-04 14:10:30 PDT
(In reply to Michael Catanzaro from comment #7)
> Hm, it looks like JSC considers total system RAM when deciding how
> frequently to run GC. In this example, the ppc64le test system has >200 GB
> of RAM while the x86_64 system has 64 GB. So maybe this could be a red
> herring? :/

Find the code that determines available system RAM.  Find out how it determines this for actual devices with limited memory e.g. ARMv7 and MIPS, etc.  Do they override with a lower limit somehow, or does the OS give the correct value to guide the heuristics?

Does PPC64's OS give a reasonable value here, or is it overly optimistic about available memory?  Does PPC64's port need to override with a lower limit here?  Is there already a mechanism to override the limit?
Comment 9 Michael Catanzaro 2023-10-05 13:02:18 PDT
Thanks for the tips. I found Heap::collectIfNecessaryOrDefer, and I'm guessing that is where the decision whether to garbage collect, based on the critical memory threshold, is made:

        size_t bytesAllowedThisCycle = m_maxEdenSize;

#if USE(BMALLOC_MEMORY_FOOTPRINT_API)
        if (overCriticalMemoryThreshold())
            bytesAllowedThisCycle = std::min(m_maxEdenSizeWhenCritical, bytesAllowedThisCycle);
#endif

        if (m_bytesAllocatedThisCycle <= bytesAllowedThisCycle)
            return;

Notably, the overCriticalMemoryThreshold() call depends on USE(BMALLOC_MEMORY_FOOTPRINT_API), and bmalloc doesn't work on ppc64le because of the 64 KiB page size (bug #209360). So I think that, on this configuration, there is no code looking at available system RAM.
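For context, my understanding of what overCriticalMemoryThreshold() does when it is compiled in is roughly this (a conceptual sketch, not the real implementation):

#include <cstddef>

// Conceptual sketch only (not the actual JSC code): with bmalloc's memory
// footprint API available, the heap can compare the process's resident
// footprint against a threshold derived from total system RAM and collect more
// eagerly when memory is tight. Without bmalloc, as on ppc64le, the whole
// check is compiled out.
static bool overCriticalMemoryThresholdSketch(std::size_t currentFootprint, std::size_t totalRAM)
{
    constexpr double criticalFraction = 0.75; // illustrative value only
    return currentFootprint > static_cast<std::size_t>(totalRAM * criticalFraction);
}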

Next, I changed Heap::updateAllocationLimits to enable the verbose debug prints, expecting it to print something useful. After changing the loop from 1e4 to 1e5 to make the test run for longer, the function is called only once:

bytesAllocatedThisCycle = 33554629
totalBytesVisited = 97776, currentHeapSize = 97776
extraMemorySize() = 12678, currentHeapSize = 110454
Full: maxHeapSize = 33554432
Full: maxEdenSize = 33443978
Full: sizeAfterLastFullCollect = 110454
Full: bytesAbandonedSinceLastFullCollect = 0
sizeAfterLastCollect = 110454

33554629 bytes is almost exactly 32 MiB, but in fact jsc is using over 5 GiB of resident memory, so I'm suspicious about m_bytesAllocatedThisCycle, which is updated by Heap::didAllocate. That's called from a variety of different locations, so that's as far as I've taken this for now.
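For what it's worth, my understanding is that didAllocate() is little more than a counter update, roughly like this (paraphrased, not the exact source):

#include <cstddef>

// Rough paraphrase (not the exact JSC source): callers report bytes they
// allocated on behalf of the GC heap, and the Heap accumulates them into the
// per-cycle counter that collectIfNecessaryOrDefer() compares against
// m_maxEdenSize. I believe the real function also notifies the eden activity
// callback, which is what eventually schedules a collection.
struct HeapSketch {
    std::size_t m_bytesAllocatedThisCycle { 0 };

    void didAllocate(std::size_t bytes) { m_bytesAllocatedThisCycle += bytes; }
};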

On x86_64, when a collection occurs, it looks like this:

bytesAllocatedThisCycle = 33554533
totalBytesVisited = 97824, currentHeapSize = 97824
extraMemorySize() = 12678, currentHeapSize = 110502
Full: maxHeapSize = 33554432
Full: maxEdenSize = 33443930
Full: sizeAfterLastFullCollect = 110502
Full: bytesAbandonedSinceLastFullCollect = 0
sizeAfterLastCollect = 110502

The numbers look almost identical, but on x86_64 the jsc process is using roughly 400 MB of RAM when the GC runs, rather than 5 GB on ppc64le. So it looks like the lack of BMALLOC_MEMORY_FOOTPRINT_API doesn't actually matter: JSC is reaching the bytesAllocatedThisCycle limit on both architectures, but this limit corresponds to only a small portion of the process memory. Yet all of that memory really is freed by the GC.

I wonder if there is a simple explanation for this that I am missing. It looks like m_bytesAllocatedThisCycle underestimates the process's actual memory use by about one order of magnitude on x86_64, and by two orders of magnitude on ppc64le.
Comment 10 Michael Catanzaro 2023-10-05 13:05:38 PDT
Hm, immediately after I posted that long comment, I realized there are multiple Heap objects and that the debug output only prints for the particular Heap that happens to reach its size limit. So that explains the size discrepancy.
Comment 11 Michael Catanzaro 2023-10-05 13:12:23 PDT
(In reply to Michael Catanzaro from comment #10)
> Hm, immediately after I posted that long comment, I realized there are
> multiple Heap objects and that the debug output only prints for the
> particular Heap that happens to reach its size limit. So that explains the
> size discrepancy.

Ah well, this seemed like a great theory, but I added a debug print to the Heap constructor and there is actually only one Heap. Sorry for the noise.
Comment 12 Mark Lam 2023-10-05 14:22:39 PDT
Michael, you're debugging on a ppc64le machine that is configured differently (in terms of RAM, at least) from the test machine where this issue manifested.  Can you check how much RAM is available on the test machine where this is failing?

FWIW, I don't think your approach of increasing the loop iteration count to 1e5 is necessarily valid.  You may be looking at one symptom that increases memory usage this way, but it may or may not be what the test machine is experiencing.

Do you have access to the test machine that is failing the test?  If so, can you just ssh in and run the test there and poke at it instead?  If not, is there some way to force the OS on the ppc64le to only use the same (lesser) amount of RAM as that on the test device?

Another tip: If you're running the jsc shell, you can pass it `--logGC=1` to see if GC is run or not.
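For example:

  jsc --useResizableArrayBuffer=1 --logGC=1 resizable-length.js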

Another tip: I see that ppc64le is using the cloop.  x86_64 is not.  Maybe try forcing the x86_64 build to use cloop also and see if there are any changes in memory use behavior?
Comment 13 Michael Catanzaro 2023-10-05 15:39:43 PDT
(In reply to Mark Lam from comment #12)
> Michael, you're debugging on a ppc64le machine that is configured
> differently (in terms of RAM, at least) from the test machine where this
> issue manifested.  Can you check how much RAM is available on the test
> machine where this is failing?

Yes. I will ask how much RAM is available on the machine where it's failing. (That said, based on what I've seen so far, I'm thinking available system RAM is not a consideration at all except when bmalloc is used.)
 
> Do you have access to the test machine that is failing the test?  If so, can
> you just ssh in and run the test there and poke at it instead?  If not, is
> there some way to force the OS on the ppc64le to only use the same (lesser)
> amount of RAM as that on the test device?

I can probably access the CI machine where the test is failing, yes. And yes, it's just a bunch of VMs, so we can also reduce the amount of memory available to them if needed.
 
> Another tip: If you're running the jsc shell, you can pass it `--logGC=1` to
> see if GC is run or not.

Neat. Didn't know this.

> Another tip: I see that ppc64le is using the cloop.  x86_64 is not.  Maybe
> try forcing the x86_64 build to use cloop also and see if there are any
> changes in memory use behavior?

Actually I built x86_64 also using cloop, so that any differences in test results should be due only to processor architecture. The CI also disables bmalloc, but I didn't disable bmalloc in my local testing, which could be relevant. I should disable it so that the only difference really is the architecture.
Comment 14 Michael Catanzaro 2023-10-06 06:50:05 PDT
(In reply to Michael Catanzaro from comment #13)
> Yes. I will ask how much RAM is available on the machine where it's failing.

32 GB. Since this test never seems to use more than 5 GB, I guess other tests are running at the same time.
Comment 15 Michael Catanzaro 2023-10-06 11:55:11 PDT
(In reply to Michael Catanzaro from comment #13)
> Actually I built x86_64 also using cloop, so that any differences in test
> results should be due only to processor architecture. The CI also disables
> bmalloc, but I didn't disable bmalloc in my local testing, which could be
> relevant. I should disable it so that the only difference really is the
> architecture.

Whoops, I really did have bmalloc disabled in the x86_64 build, so it was an apples-to-apples comparison.