
Why does the Mongo Spark connector return different and incorrect counts for a query?

Updated: 2023-11-18 17:04:28


I solved my issue. The inconsistent counts were caused by the MongoDefaultPartitioner, which wraps the MongoSamplePartitioner, which in turn uses random sampling. To be honest, this strikes me as an odd default; I would personally prefer a slow but consistent partitioner. Details on the partitioner options can be found in the official configuration options documentation.

Code:

val df = spark.read
  .format("com.mongodb.spark.sql.DefaultSource")
  .option("uri", "mongodb://127.0.0.1/enron_mail.messages")
  // Select a deterministic partitioner; the value is just the partitioner name
  // (the original string mistakenly used the config-key prefix and had a trailing space).
  .option("partitioner", "MongoPaginateBySizePartitioner")
  .load()
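
As a quick sanity check (a minimal sketch; the spark session and the df DataFrame from the snippet above are assumed), counting the same DataFrame twice should now return the same total, whereas with the sample-based default partitioner repeated runs could disagree:

// Run the same count twice; with a deterministic partitioner both results should match.
val firstCount = df.count()
val secondCount = df.count()
println(s"first=$firstCount, second=$secondCount, consistent=${firstCount == secondCount}")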