且构网

The difference between a DataFrame and an RDD in Spark

Updated: 2023-11-18 23:31:04

A DataFrame is defined well by a Google search for "DataFrame definition":

A data frame is a table, or two-dimensional array-like structure, in which each column contains measurements on one variable, and each row contains one case.

So, because of its tabular format, a DataFrame carries additional metadata (a schema of named, typed columns), which allows Spark to run certain optimizations on the finalized query.

An RDD, on the other hand, is merely a Resilient Distributed Dataset: more of a black box of data that cannot be optimized, because the operations that can be performed against it are not as constrained.
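
The contrast can be illustrated in plain Python. This is only a toy model under my own assumptions, not Spark's API: the names Predicate, columns_needed, opaque_filter, and declarative_filter are hypothetical. A filter passed as an arbitrary function is opaque to the engine, while a filter described as data can be inspected, for example to find out which columns it touches.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Set

Row = Dict[str, Any]


@dataclass(frozen=True)
class Predicate:
    """Hypothetical declarative filter: compare `column` to `value` using `op`."""
    column: str
    op: str  # one of ">", "<", "=="
    value: Any

    def matches(self, row: Row) -> bool:
        actual = row[self.column]
        if self.op == ">":
            return actual > self.value
        if self.op == "<":
            return actual < self.value
        if self.op == "==":
            return actual == self.value
        raise ValueError(f"unsupported operator: {self.op}")


def columns_needed(pred: Predicate) -> Set[str]:
    # The predicate is plain data, so an engine can read which columns
    # it uses and could skip loading the others.
    return {pred.column}


def opaque_filter(rows: List[Row], fn: Callable[[Row], bool]) -> List[Row]:
    # RDD-style: `fn` is a black box, so each row is handed over whole.
    return [row for row in rows if fn(row)]


def declarative_filter(rows: List[Row], pred: Predicate) -> List[Row]:
    # DataFrame-style: same result, but the filter itself stays inspectable.
    return [row for row in rows if pred.matches(row)]


rows = [
    {"name": "alice", "age": 34, "city": "Oslo"},
    {"name": "bob", "age": 45, "city": "Lima"},
]
older_than_40 = Predicate("age", ">", 40)
```

Both filters return the same rows; the difference is that only the declarative one tells the engine anything about what it is going to do.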

However, you can go from a DataFrame to an RDD via its rdd method, and you can go from an RDD to a DataFrame (if the RDD is in a tabular format) via the toDF method.
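
A minimal sketch of that round trip, assuming PySpark is installed and a local Spark session is acceptable. The sample rows, column names, and the function name round_trip are made up for illustration; the Scala API differs slightly (there, toDF comes from importing the session's implicits).

```python
# Illustrative sample data; not from the original answer.
ROWS = [("alice", 34), ("bob", 45)]
COLUMNS = ["name", "age"]


def round_trip():
    # Imported lazily so this file can be loaded without pyspark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[1]")
        .appName("rdd-dataframe-sketch")
        .getOrCreate()
    )
    try:
        # Start from a DataFrame with named columns.
        df = spark.createDataFrame(ROWS, COLUMNS)

        # DataFrame -> RDD: the `rdd` attribute yields an RDD of Row objects.
        rdd = df.rdd
        ages = rdd.map(lambda row: row["age"]).collect()

        # RDD -> DataFrame: works because every element is a Row with
        # the same fields, i.e. the RDD is tabular.
        df_again = rdd.toDF()

        # Print the plan Spark builds for a DataFrame query.
        df_again.filter(df_again["age"] > 40).explain()

        return ages, df_again.columns
    finally:
        spark.stop()
```

Calling round_trip() starts a local session, performs both conversions, and prints the plan for the filtered DataFrame.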

In general, it is recommended to use a DataFrame where possible because of the built-in query optimization.