
Spark Streaming: appending DStream batches into a single output folder

Updated: 2022-01-14 19:53:03


We can do this using Spark SQL's new DataFrame saving API, which allows appending to an existing output. By default, saveAsTextFile won't be able to save to a directory with existing data (see https://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes). https://spark.apache.org/docs/latest/streaming-programming-guide.html#dataframe-and-sql-operations covers how to set up a Spark SQL context for use with Spark Streaming.


Assuming you copy the SQLContextSingleton part from the guide (a lazily initialized SQLContext shared across batches; a minimal sketch, essentially as it appears in the guide's Spark 1.x examples, is shown below):
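
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

// Lazily instantiated singleton SQLContext, per the Spark Streaming
// programming guide; reused across micro-batches instead of being
// recreated for each RDD.
object SQLContextSingleton {
  @transient private var instance: SQLContext = _

  def getInstance(sparkContext: SparkContext): SQLContext = {
    if (instance == null) {
      instance = new SQLContext(sparkContext)
    }
    instance
  }
}

With that in place, the resulting code would look something like: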

import org.apache.spark.sql.SaveMode

data.foreachRDD { rdd =>
  val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
  // Convert your data to a DataFrame; how depends on the structure of your data
  val df = ....
  // SaveMode.Append makes the write add to the existing output
  // instead of failing on a non-empty directory
  df.save("org.apache.spark.sql.json", SaveMode.Append, Map("path" -> path.toString))
}
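
The conversion step depends on your data. As one illustrative possibility (not part of the original answer), if the DStream carries instances of a case class, the SQLContext implicits provide toDF(); the Record class below is a hypothetical schema:

case class Record(key: String, value: Long) // hypothetical schema, for illustration only

data.foreachRDD { rdd =>
  val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
  import sqlContext.implicits._
  // rdd is an RDD[Record], so toDF() is available via the implicits
  val df = rdd.toDF()
  df.save("org.apache.spark.sql.json", SaveMode.Append, Map("path" -> path.toString))
}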


(Note that the above example uses JSON to save the result, but you can use other output formats too.)
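
For instance, switching to Parquet should only require changing the data source name passed to save; this sketch assumes the Spark 1.x Parquet source name org.apache.spark.sql.parquet, the analogue of the JSON source used above:

// Same append-style write, but with Parquet as the output format
df.save("org.apache.spark.sql.parquet", SaveMode.Append, Map("path" -> path.toString))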