Updated: 2023-11-18 23:31:04
A DataFrame is well defined by a Google search for "DataFrame definition":
A data frame is a table, or two-dimensional array-like structure, in which each column contains measurements on one variable, and each row contains one case.
So, a DataFrame has additional metadata due to its tabular format, which allows Spark to run certain optimizations on the finalized query.
An RDD, on the other hand, is merely a Resilient Distributed Dataset: more of a black box of data that Spark cannot optimize, because the operations that can be performed against it are not as constrained.
However, you can go from a DataFrame to an RDD via its rdd method, and you can go from an RDD to a DataFrame (if the RDD is in a tabular format) via the toDF method.
In general, it is recommended to use a DataFrame where possible because of the built-in query optimization.