
Understanding Spark Structured Streaming parallelism

Updated: 2023-11-18 18:18:40

For every batch (pulled from Kafka), will the pulled items be divided among the number of partitions given by spark.sql.shuffle.partitions?

They will be divided once they reach groupByKey, which is a shuffle boundary. When you first retrieve the data, the number of partitions will equal the number of Kafka partitions.
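A minimal sketch of that first point, assuming a hypothetical local broker at localhost:9092 and a hypothetical topic named events (and the spark-sql-kafka connector on the classpath): a batch read of the same topic exposes the one-to-one mapping between Kafka partitions and the initial Spark partitions that each streaming micro-batch also starts with.

```scala
import org.apache.spark.sql.SparkSession

object KafkaPartitionCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-partition-check")
      .master("local[*]")
      .getOrCreate()

    // A batch read of the topic: the resulting DataFrame has one Spark
    // partition per Kafka partition, mirroring what each streaming
    // micro-batch starts with before any shuffle.
    val df = spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // hypothetical broker
      .option("subscribe", "events")                        // hypothetical topic
      .load()

    println(s"Initial partitions: ${df.rdd.getNumPartitions}")
    spark.stop()
  }
}
```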

Considering the code snippet provided, do we still leverage Spark's parallelism, given that groupByKey is followed by mapGroups/mapGroupsWithState?

Generally yes, but it also depends on how you set up your Kafka topic. Although it is not visible in the code, Spark will internally split each stage into smaller tasks and distribute them among the available executors in the cluster. If your Kafka topic has only one partition, then prior to groupByKey your internal stream will contain a single partition, which won't be parallelized but executed on a single executor. As long as your Kafka partition count is greater than 1, your processing will be parallel. After the shuffle boundary, Spark will repartition the data to contain the number of partitions specified by spark.sql.shuffle.partitions.
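The code snippet the question refers to is not reproduced here. As a hedged sketch of that kind of pipeline (using the same hypothetical broker and topic as above, and a made-up per-key counter as the stateful logic), the following shows where the shuffle boundary sits and where spark.sql.shuffle.partitions takes effect:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

object StatefulKafkaStream {
  case class Event(key: String, weight: Long)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("stateful-kafka-stream")
      .master("local[*]")
      // Partition count used AFTER the groupByKey shuffle boundary.
      .config("spark.sql.shuffle.partitions", "8")
      .getOrCreate()
    import spark.implicits._

    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // hypothetical broker
      .option("subscribe", "events")                        // hypothetical topic
      .load()
      // Up to this point: one task per Kafka partition.
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
      .as[(String, String)]
      .map { case (k, _) => Event(Option(k).getOrElse("unknown"), 1L) }

    // groupByKey is the shuffle boundary: everything downstream runs
    // with spark.sql.shuffle.partitions (8 here) tasks per micro-batch.
    val counts = events
      .groupByKey(_.key)
      .mapGroupsWithState(GroupStateTimeout.NoTimeout()) {
        (key: String, batch: Iterator[Event], state: GroupState[Long]) =>
          val total = state.getOption.getOrElse(0L) + batch.size
          state.update(total)
          (key, total)
      }

    counts.writeStream
      .outputMode(OutputMode.Update()) // mapGroupsWithState requires Update mode
      .format("console")
      .start()
      .awaitTermination()
  }
}
```

With a single-partition topic, the map stage above runs as one task regardless of cluster size; only the stateful stage after groupByKey fans out to the configured eight partitions.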