Spark DataFrame: collect() vs select()

Updated: 2023-11-18 23:27:04

Actions vs. transformations

  • collect() (action) - Returns all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.

From the spark-sql documentation:

select(*cols) (transformation) - Projects a set of expressions and returns a new DataFrame.

Parameters: cols – list of column names (string) or expressions (Column). If one of the column names is '*', that column is expanded to include all columns in the current DataFrame.

>>> df.select('*').collect()
[Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
>>> df.select('name', 'age').collect()
[Row(name=u'Alice', age=2), Row(name=u'Bob', age=5)]
>>> df.select(df.name, (df.age + 10).alias('age')).collect()
[Row(name=u'Alice', age=12), Row(name=u'Bob', age=15)]

Calling select(column-name1, column-name2, etc) on a DataFrame returns a new DataFrame that holds only the columns named in the select() call.

For example, assume df has several columns, including "name" and "value" and some others:

df2 = df.select("name","value")

df2 will hold only the "name" and "value" columns of df.

df2, as the result of select(), stays on the executors and is not brought to the driver (unlike the result of collect()).
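The difference between a transformation that only describes a result and an action that materializes it can be sketched without a Spark installation. The toy below is an analogy in plain Python, not Spark code: LazyFrame is a hypothetical class invented for illustration, and a list of dicts stands in for data that would really be spread across executors.

```python
# Toy model only: LazyFrame is a hypothetical class, not part of Spark.
class LazyFrame:
    def __init__(self, rows, columns=None):
        self._rows = rows        # list of dicts standing in for distributed data
        self._columns = columns  # projection recorded by select(); None means all columns

    def select(self, *cols):
        # Transformation: record the projection and return a new frame.
        # No row is read or copied here.
        return LazyFrame(self._rows, cols)

    def collect(self):
        # Action: apply the recorded projection and return every row as a plain list.
        if self._columns is None:
            return [dict(row) for row in self._rows]
        return [{c: row[c] for c in self._columns} for row in self._rows]


df = LazyFrame([
    {"name": "Alice", "value": 2, "other": "x"},
    {"name": "Bob", "value": 5, "other": "y"},
])

df2 = df.select("name", "value")  # still a LazyFrame; nothing has been computed
rows = df2.collect()              # a plain Python list holding the projected rows

print(type(df2).__name__)
print(rows)
```

In this sketch select() returns another frame object and only collect() produces local data, which mirrors why collecting a large DataFrame puts memory pressure on the driver while select() alone does not.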

From the SQL programming guide:

df.printSchema()
# root
# |-- age: long (nullable = true)
# |-- name: string (nullable = true)

# Select only the "name" column
df.select("name").show()
# +-------+
# |   name|
# +-------+
# |Michael|
# |   Andy|
# | Justin|
# +-------+

You can run collect() on a DataFrame (Spark docs):

>>> l = [('Alice', 1)]
>>> spark.createDataFrame(l).collect()
[Row(_1=u'Alice', _2=1)]
>>> spark.createDataFrame(l, ['name', 'age']).collect()
[Row(name=u'Alice', age=1)]

From the Spark docs:

To print all elements on the driver, one can use the collect() method to first bring the RDD to the driver node, thus: rdd.collect().foreach(println). This can cause the driver to run out of memory, though, because collect() fetches the entire RDD to a single machine; if you only need to print a few elements of the RDD, a safer approach is to use take(): rdd.take(100).foreach(println).
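The reason take() is safer than collect() can be illustrated with a plain-Python analogy that needs no Spark installation. In this sketch, records() and take() are hypothetical helpers written for illustration, not Spark APIs: a generator stands in for a dataset too large to hold in memory, and take() pulls only the first few elements instead of materializing everything.

```python
from itertools import islice


def records(n):
    # Stand-in for a large dataset: yields values one at a time
    # instead of building them all in memory.
    for i in range(n):
        yield i


def take(source, k):
    # Pull at most k elements from the source and leave the rest unread.
    return list(islice(source, k))


# Only five values are ever produced, even though the source could yield ten million.
first = take(records(10_000_000), 5)
print(first)

# Asking for more elements than exist simply returns what is available.
short = take(records(3), 100)
print(short)
```

Calling list(records(10_000_000)) instead would build every element in one process, which is the plain-Python counterpart of collecting an entire RDD onto the driver.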