SparkR collect method crashes with OutOfMemory on Java heap space

Updated: 2023-12-03 20:00:52

This does appear to be a combination of Java's in-memory object representation being inefficient and some apparently long-lived object references, which together cause some collections to fail to be garbage-collected in time for the new collect() call to overwrite the old one in place.

I experimented with some options, and for my sample 256MB file containing ~4M lines, I indeed reproduced your behavior where collect is fine the first time but OOMs the second time when using SPARK_MEM=1g. I then set SPARK_MEM=4g instead, and after that I was able to ctrl+c and re-run test <- collect(lines) as many times as I wanted.
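
Roughly, the reproduction looks like the following sketch. The function names follow the old amplab SparkR-pkg API, the master URL and file path are placeholders, and SPARK_MEM is assumed to be exported in the shell before the SparkRBackend JVM is launched:

library(SparkR)
sc <- sparkR.init(master = "local")            # starts the SparkRBackend JVM
lines <- textFile(sc, "/path/to/sample.txt")   # placeholder for the ~256MB, ~4M-line file
test <- collect(lines)                         # first collect succeeds
test <- collect(lines)                         # second collect OOMs under SPARK_MEM=1g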

For one thing, even if references didn't leak, note that after the first time you ran test <- collect(lines), the variable test is holding that gigantic array of lines. The second time you call it, collect(lines) executes before the result is finally assigned to the test variable, so under any straightforward instruction ordering there's no way to garbage-collect the old contents of test. This means the second run will make the SparkRBackend process hold two copies of the entire collection at the same time, leading to the OOM you saw.
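
To make that ordering concrete, here is an illustrative R sketch (not a fix; as the heap dumps below show, the backend keeps references of its own, so dropping the R-side binding is not enough by itself):

test <- collect(lines)   # 'test' holds copy #1, and the backend tracks it too
test <- NULL             # drop the R-side reference (same idea as test <- {} below)
gc()                     # ask R to garbage-collect before collecting again
test <- collect(lines)   # copy #2 is built; the JVM side may still retain copy #1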

To diagnose, on the master I started SparkR and first ran

dhuo@dhuo-sparkr-m:~$ jps | grep SparkRBackend
8709 SparkRBackend

I also checked top and it was using around 22MB of memory. I fetched a heap profile with jmap:

jmap -heap:format=b 8709   # dump the SparkRBackend heap (pid 8709) in binary format; writes heap.bin
mv heap.bin heap0.bin      # keep this first dump as the baseline

Then I ran the first round of test <- collect(lines) at which point running top showed it using ~1.7g of RES memory. I grabbed another heap dump. Finally, I also tried test <- {} to get rid of references to allow garbage-collection. After doing this, and printing out test and showing it to be empty, I grabbed another heap dump and noticed RES still showed 1.7g. I used jhat heap0.bin to analyze the original heap dump, and got:

Heap Histogram

All Classes (excluding platform)

Class   Instance Count  Total Size
class [B    25126   14174163
class [C    19183   1576884
class [<other>  11841   1067424
class [Lscala.concurrent.forkjoin.ForkJoinTask; 16  1048832
class [I    1524    769384
...

After running the collect, I had:

Heap Histogram

All Classes (excluding platform)

Class   Instance Count  Total Size
class [C    2784858 579458804
class [B    27768   70519801
class java.lang.String  2782732 44523712
class [Ljava.lang.Object;   2567    22380840
class [I    1538    8460152
class [Lscala.concurrent.forkjoin.ForkJoinTask; 27  1769904

Even after I nulled out test, it remained about the same. This shows us 2784858 instances of char[], for a total size of 579MB, and also 2782732 instances of String, presumably holding those char[]'s above it. I followed the reference graph all the way up, and got something like

char[] -> String -> String[] -> ... -> class scala.collection.mutable.DefaultEntry -> class [Lscala.collection.mutable.HashEntry; -> class scala.collection.mutable.HashMap -> class edu.berkeley.cs.amplab.sparkr.JVMObjectTracker$ -> java.util.Vector@0x785b48cd8 (36 bytes) -> sun.misc.Launcher$AppClassLoader@0x7855c31a8 (138 bytes)

And then AppClassLoader had something like thousands of inbound references. So somewhere along that chain, something should have been removing its reference but failed to do so, causing the entire collected array to sit in memory while we tried to fetch a second copy of it.
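
A blunt workaround consistent with this diagnosis is to tear down the backend JVM, which discards everything the tracker holds, and start a fresh one. This is only a sketch, assuming your SparkR build exposes sparkR.stop(); the path is a placeholder:

sparkR.stop()                         # shuts down the SparkRBackend JVM and everything it tracks
sc <- sparkR.init(master = "local")   # fresh backend, empty JVMObjectTracker
lines <- textFile(sc, "/path/to/sample.txt")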

Finally, to answer your question about hanging after the collect, it appears it has to do with the data not fitting in the R process's memory; here's a thread related to that issue: https://www.mail-archive.com/user@spark.apache.org/msg29155.html

I confirmed that with a smaller file containing only a handful of lines, running collect indeed does not hang.
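
Relatedly, if the full dataset is too big for the R process, pulling back only a slice avoids loading the whole collection into R in the first place. A sketch, assuming a take() RDD helper is available in your SparkR version:

sample_lines <- take(lines, 1000)   # fetch only the first 1000 lines into R
length(sample_lines)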