Updated: 2023-11-18 22:26:04
TL;DR If you want to split a DataFrame, use the randomSplit method:
ratings_sdf.randomSplit([0.6, 0.2, 0.2])
Your code is wrong on multiple levels, but two fundamental problems make it broken beyond repair:
Spark transformations can be evaluated an arbitrary number of times, so the functions you use should be referentially transparent and free of side effects. Your code evaluates split_sdf multiple times, and you use a stateful RNG (data_split), so the results are different each time.
This results in the behavior you describe, where each child sees a different state of the parent RDD.
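The effect can be reproduced without Spark. The sketch below is a plain-Python analogy, not the code from the question: evaluate stands in for Spark recomputing a lazy transformation on every action, and assign_split is a hypothetical labeling function backed by a stateful RNG.

```python
import random

rng = random.Random(42)  # stateful RNG shared by every evaluation


def assign_split(row):
    # Not referentially transparent: the label depends on hidden RNG state.
    return row, rng.choice(["train", "test"])


def evaluate(rows):
    # Stand-in for Spark recomputing the parent each time a child needs it.
    return [assign_split(r) for r in rows]


rows = list(range(1000))

# Each child triggers its own evaluation of the parent.
train = [r for r, label in evaluate(rows) if label == "train"]
test = [r for r, label in evaluate(rows) if label == "test"]

overlap = set(train) & set(test)                 # rows that ended up in both
missing = set(rows) - set(train) - set(test)     # rows that ended up in neither
print(len(overlap), len(missing))
```

Because the two children see differently labeled versions of the parent, some rows appear in both splits and others in none.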
You don't properly initialize the RNG, and as a consequence the random values you get are not independent.
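The answer does not say exactly how the RNG was mis-initialized, so the following is only an assumed illustration of one common failure mode: generators created with the same seed on different workers or partitions produce identical streams. The worker_values helper is hypothetical.

```python
import random


def worker_values(seed, n):
    # Hypothetical worker: builds its own generator from the given seed.
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


# Two "partitions" initialized with the same seed draw the same values,
# so their random numbers are perfectly correlated, not independent.
partition_a = worker_values(0, 5)
partition_b = worker_values(0, 5)
identical = partition_a == partition_b

# Distinct seeds give different streams.
partition_c = worker_values(1, 5)
print(identical, partition_a != partition_c)
```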