How to investigate Scroll prover OOM
Currently, Scroll's zkevm prover periodically hits OOM, and sometimes all the provers trigger the OOM alert at the same time, so this is a serious issue.
This article shares the approach I took to investigate it; it may help you with similar issues.
First suspicion: rust side memory leak
Because the prover uses cgo to invoke the Rust FFI, the first suspect was a memory leak on the Rust side, so I wrote a pure Rust sample, modeled on the prover logic, to test it:
https://github.com/scroll-tech/scroll/blob/feat/prover_oom/common/libzkp/impl/tests/chunk_test.rs
```rust
#[test]
fn chunk_test() {
    println!("start chunk_test.");
    unsafe {
        let params = CString::new("/assets/test_params").expect("test_params conversion failed");
        let assets = CString::new("/assets/test_assets").expect("test_assets conversion failed");
        chunk::init_chunk_prover(params.as_ptr(), assets.as_ptr());

        let chunk_trace = load_batch_traces().1;
        let json_str = serde_json::to_string(&chunk_trace).expect("Serialization failed");
        let c_string = CString::new(json_str).expect("CString conversion failed");
        let c_str_ptr = c_string.as_ptr();
        let ptr_cstr = CStr::from_ptr(c_str_ptr)
            .to_str()
            .expect("Failed to convert C string to Rust string");
        println!("c_str_ptr len: {:?}", ptr_cstr.len());

        let mut count = 1;
        loop {
            count += 1;
            println!("count {:?}", count);
            let _ = chunk::gen_chunk_proof(c_str_ptr);
            // let ret_cstr = CStr::from_ptr(ret)
            //     .to_str()
            //     .expect("Failed to convert C string to Rust string");
            // println!("ret: {:?}", ret_cstr)
        }
    }
}
```
But the Rust side ran smoothly, so the first suspicion can be eliminated.
Second suspicion: cgo doesn't free the Rust memory
Ditto: referring to the prover's logic, I wrote a Go OOM test sample.
```go
var (
    paramsPath = flag.String("params", "/assets/test_params", "params dir")
    assetsPath = flag.String("assets", "/assets/test_assets", "assets dir")
    tracePath1 = flag.String("trace1", "/assets/traces/1_transfer.json", "chunk trace 1")
)

func initPyroscopse() {
    go func() {
        if runServerErr := http.ListenAndServe(":8089", nil); runServerErr != nil {
            panic(runServerErr)
        }
    }()
}

func TestFFI(t *testing.T) {
    initPyroscopse()
    chunkProverConfig := &config.ProverCoreConfig{
        ParamsPath: *paramsPath,
        AssetsPath: *assetsPath,
        ProofType:  message.ProofTypeChunk,
    }
    chunkProverCore, _ := core.NewProverCore(chunkProverConfig)
    chunkTrace1 := readChunkTrace(t, *tracePath1)
    for {
        chunkProverCore.ProveChunk("chunk_proof1", chunkTrace1)
        time.Sleep(time.Millisecond * 10)
    }
}
```
After running for a long duration, the memory leak appeared.
Unfortunately, although the RSS grew higher and higher, nothing showed up in Go's pprof analysis. This is expected, because memory and CPU used inside cgo can't be captured by pprof.
I needed to switch to another tool, so I used eBPF's memleak to analyze this.
```
[15:43:24] Top 10 stacks with outstanding allocations:
64 bytes in 2 allocations from stack
    0x00007ff6f35ddc52 _ZN5alloc7raw_vec11finish_grow17he34d0657b5f696b3E.llvm.14486341794806259619+0x52 [libzkp.so]
72 bytes in 1 allocations from stack
    0x00007ff6f328a511 _$LT$core..panic..unwind_safe..AssertUnwindSafe$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$$LP$$RP$$GT$$GT$::call_once::hce44e0133d28b7f7+0xa1 [libzkp.so]
728 bytes in 13 allocations from stack
    0x00007ff6f3289dad _$LT$core..panic..unwind_safe..AssertUnwindSafe$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$$LP$$RP$$GT$$GT$::call_once::hc702eaeb1eeaa7da+0x9d [libzkp.so]
1520 bytes in 1 allocations from stack
    0x00007ff6f3f7d934 crossbeam_deque::deque::Injector$LT$T$GT$::push::h47937e8e48ae466e+0xd4 [libzkp.so]
2032 bytes in 127 allocations from stack
    0x00007ff6f2fe74ae _ZN15crossbeam_deque5deque15Worker$LT$T$GT$6resize17h709349a839ddfdffE.llvm.4718866933469281345+0x14e [libzkp.so]
    0x0000000000000001 [unknown]
9696 bytes in 4 allocations from stack
    0x00007ff6f35ddc32 _ZN5alloc7raw_vec11finish_grow17he34d0657b5f696b3E.llvm.14486341794806259619+0x32 [libzkp.so]
130048 bytes in 127 allocations from stack
    0x00007ff6f2fe73ce _ZN15crossbeam_deque5deque15Worker$LT$T$GT$6resize17h709349a839ddfdffE.llvm.4718866933469281345+0x6e [libzkp.so]
524288 bytes in 1 allocations from stack
    0x00007ff6f35ddc32 _ZN5alloc7raw_vec11finish_grow17he34d0657b5f696b3E.llvm.14486341794806259619+0x32 [libzkp.so]
    0x064cf53f65dfb6f6 [unknown]
1166536 bytes in 563 allocations from stack
    0x00007ff6f3f8235b crossbeam_epoch::sync::queue::Queue$LT$T$GT$::push::ha5643fb280f67d3f+0x1b [libzkp.so]
    0x0000af820f20fa83 [unknown]
31037997056 bytes in 45 allocations from stack
    0x00007ff6f2c5e4b8 sysmalloc_mmap.constprop.0+0x48 [libc.so.6]
```
The memory leak seems to appear in:
Leaks related to libzkp.so:
- Functions such as alloc::raw_vec::finish_grow and crossbeam_deque::deque::Injector<T>::push are called within Rust libraries; they may not properly release memory after allocation.
- Functions related to crossbeam_deque and crossbeam_epoch also appear multiple times, indicating possible memory-management issues.
Leaks related to libc.so.6:
- The function sysmalloc_mmap.constprop.0, part of libc (the C standard library), may have leaked during memory mapping.
But from the first suspicion, we already know that the Rust side doesn't leak memory, so eBPF's analysis result makes no sense here.
I needed to read prover_oom_test.go very carefully and try to find a bug by eye. Still, unfortunately, the logic looked good.
Having investigated up to this point, I felt very perplexed and wished for a more suitable tool; but contrary to my hopes, I could only use a very primitive method to narrow down the scope of suspicion.
I made two assumptions:
- the input params leak memory
- the returned data is not freed, because the free method has a bug and doesn't completely free the Rust memory
Third suspicion: the input params leak memory
I changed the logic like this: cgo doesn't pass any parameter to Rust.
Call the Rust function to generate a proof on the Go side:
```go
cProof := C.gen_chunk_proof("")
defer C.free_c_chars(cProof)
```
On the Rust side, don't accept the params from cgo; load the trace data internally:
```rust
#[no_mangle]
pub unsafe extern "C" fn gen_chunk_proof(block_traces1: *const c_char) -> *const c_char {
    let chunk_trace = load_batch_traces().1;
    let json_str = serde_json::to_string(&chunk_trace).expect("Serialization failed");
    let c_string = CString::new(json_str).expect("CString conversion failed");
    let c_str_ptr = c_string.as_ptr();
    let proof_result: Result<Vec<u8>, String> = panic_catch(|| {
        let block_traces = c_char_to_vec(c_str_ptr);
        let block_traces = serde_json::from_slice::<Vec<BlockTrace>>(&block_traces)
            .map_err(|e| format!("failed to deserialize block traces: {e:?}"))?;
        let proof = PROVER
            .get_mut()
            .expect("failed to get mutable reference to PROVER.")
            .gen_chunk_proof(block_traces, None, None, OUTPUT_DIR.as_deref())
            .map_err(|e| format!("failed to generate proof: {e:?}"))?;
        serde_json::to_vec(&proof).map_err(|e| format!("failed to serialize the proof: {e:?}"))
    })
    .unwrap_or_else(|e| Err(format!("unwind error: {e:?}")));
    // ...
}
```
The memory leak still existed.
Fourth suspicion: free_c_chars has a bug and doesn't free the Rust memory
So I changed the code to this: cgo doesn't pass anything to Rust, and Rust also doesn't return anything to Go.
Call the Rust function to generate a proof on the Go side, passing no param and accepting no result:
```go
log.Info("Start to create chunk proof ...")
C.gen_chunk_proof("")
// defer C.free_c_chars(cProof)
log.Info("Finish creating chunk proof!")
```
On the Rust side, don't accept the params from cgo; load the trace data internally, and don't return data to cgo:
```rust
#[no_mangle]
pub unsafe extern "C" fn gen_chunk_proof(block_traces1: *const c_char) {
    let chunk_trace = load_batch_traces().1;
    let json_str = serde_json::to_string(&chunk_trace).expect("Serialization failed");
    let c_string = CString::new(json_str).expect("CString conversion failed");
    let c_str_ptr = c_string.as_ptr();
    let proof_result: Result<Vec<u8>, String> = panic_catch(|| {
        let block_traces = c_char_to_vec(c_str_ptr);
        let block_traces = serde_json::from_slice::<Vec<BlockTrace>>(&block_traces)
            .map_err(|e| format!("failed to deserialize block traces: {e:?}"))?;
        let proof = PROVER
            .get_mut()
            .expect("failed to get mutable reference to PROVER.")
            .gen_chunk_proof(block_traces, None, None, OUTPUT_DIR.as_deref())
            .map_err(|e| format!("failed to generate proof: {e:?}"))?;
        serde_json::to_vec(&proof).map_err(|e| format!("failed to serialize the proof: {e:?}"))
    })
    .unwrap_or_else(|e| Err(format!("unwind error: {e:?}")));
    let _ = match proof_result {
        Ok(proof_bytes) => ProofResult {
            message: Some(proof_bytes),
            error: None,
        },
        Err(err) => ProofResult {
            message: None,
            error: Some(err),
        },
    };
    // serde_json::to_vec(&r).map_or(std::ptr::null_mut(), vec_to_c_char)
}
```
An interesting symptom occurred: the memory still grew higher and higher.
Having investigated to this point, I finally knew the real reason.
Strictly speaking, this is not really a memory leak; it is the Go memory model's mechanism.
Let me explain it in simple terms. Go's memory allocator is based on tcmalloc, which maintains three cache layers so that Go doesn't have to request memory from the OS very frequently. Under this mechanism, the Go runtime doesn't return freed memory to the OS immediately, because it wants to reuse memory it has already allocated. This relies on the MADV_FREE functionality: memory marked this way is reclaimed by the OS only when the OS is under memory pressure, and exactly how much pressure is needed before Go's memory is actually reclaimed is opaque.
So gophers often find that a Go app's memory usage looks high and is released very slowly; if a VM monitor is configured, this frequently triggers memory alarms.
If you are interested in the MADV_FREE functionality, please search for this keyword.
How to resolve this issue?
In fact, this issue is very difficult to resolve, because it comes from Go's own mechanism. Go does support GODEBUG=madvdontneed=1 to switch the memory release mechanism, but changing it is not recommended.
In general, we can try three approaches.
1. sync.Pool
In general, Go suggests that gophers use sync.Pool to reuse allocated memory. But the prover can't do this for the memory allocated by the Rust lib.
2. GOMEMLIMIT

```shell
GOMEMLIMIT=100GiB GOGC=100 ./prover
```
3. Go ballast

```go
func main() {
    ballast := make([]byte, 100*1024*1024*1024) // 100 GiB
    // do something
    runtime.KeepAlive(ballast)
}
```
But I tried all three approaches, and none of them worked. The reason is that, from Go's perspective, the memory allocated by Rust appears as one huge block of memory. Whether we speed up GC (garbage collection) or limit memory, it makes no difference to Go: Go only sees the prover requesting a large block of memory every hour, and since Go releases memory relatively slowly, the memory occupied by the prover appears to grow gradually.
Conclusion
At present, the simplest way to resolve the prover OOM is to rewrite the prover to use only Rust, without cgo as a bridge.