
How to use the Spark Cassandra Connector API in Scala

Updated: 2023-11-18 16:28:40


You need to concentrate on how you process your data in your Spark application, not on how the data are read or written (that matters, of course, but only when you hit performance problems).


If you're using Spark, then you need to think in Spark terms as you process data in RDDs or DataFrames. In this case you need to use constructs like these (with DataFrames):

import org.apache.spark.sql.cassandra._ // provides cassandraFormat

val df = spark
  .read
  .cassandraFormat("words", "test")
  .load()
val newDf = df.select(...) // some operation on the source data
newDf.write
  .cassandraFormat("words_copy", "test")
  .save()
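The snippet above assumes a SparkSession already configured to reach Cassandra. A minimal setup sketch (the host address is a placeholder; the table and keyspace names match the example above), also showing the generic `format`/`options` form that is equivalent to `cassandraFormat`:

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: point the session at a Cassandra node.
// "127.0.0.1" is a placeholder for your contact point.
val spark = SparkSession.builder()
  .appName("cassandra-example")
  .config("spark.cassandra.connection.host", "127.0.0.1")
  .getOrCreate()

// Equivalent to .cassandraFormat("words", "test"):
val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "words", "keyspace" -> "test"))
  .load()
```

The `format`/`options` form is useful when the table and keyspace come from configuration rather than being hard-coded.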


And avoid direct use of session.prepare/session.execute, cluster.connect, etc. - the Spark connector will do the prepare, and other optimizations, under the hood.
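The same principle applies at the RDD level mentioned above. A sketch using the connector's RDD API, assuming the same local Cassandra node and the test.words table from the example (the column names are illustrative):

```scala
import com.datastax.spark.connector._ // cassandraTable, saveToCassandra
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: assumes a Cassandra node at the placeholder address.
val conf = new SparkConf()
  .setAppName("cassandra-rdd-example")
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(conf)

// Read a table as an RDD of CassandraRow and transform it in Spark terms.
val rdd = sc.cassandraTable("test", "words")
val pairs = rdd.map(row => (row.getString("word"), row.getInt("count")))

// Write back without ever calling session.execute yourself: the
// connector prepares and batches statements under the hood.
pairs.saveToCassandra("test", "words_copy", SomeColumns("word", "count"))
```

Note that all partitioning, statement preparation, and batching happen inside the connector, which is exactly why hand-rolled session calls are unnecessary here.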