
Is there a way to change the replication factor of RDDs in Spark?

Updated: 2022-05-28 23:31:03

First, note that Spark does not automatically cache all your RDDs, simply because applications may create many RDDs, and not all of them are meant to be reused. You have to call .persist() or .cache() on them.

You can set the storage level with which you want to persist an RDD via myRDD.persist(StorageLevel.MEMORY_AND_DISK). .cache() is shorthand for .persist(StorageLevel.MEMORY_ONLY).
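As a minimal Scala sketch (the local master, app name, and RDD contents are placeholders for illustration, not from the original answer):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    // Hypothetical local context for illustration
    val sc = new SparkContext(
      new SparkConf().setAppName("persist-demo").setMaster("local[*]"))

    val myRDD = sc.parallelize(1 to 1000)

    // Mark the RDD for persistence with an explicit storage level;
    // nothing is materialized until an action runs
    myRDD.persist(StorageLevel.MEMORY_AND_DISK)

    // .cache() would be equivalent to .persist(StorageLevel.MEMORY_ONLY)

    // The first action computes the RDD and populates the cache
    println(myRDD.count())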

The default storage level for persist is indeed StorageLevel.MEMORY_ONLY for an RDD in Java or Scala, but it usually differs if you are creating a DStream (refer to the DStream constructor API doc). If you're using Python, it's StorageLevel.MEMORY_ONLY_SER.
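In Scala, a bare persist() call picks that default (a sketch continuing the session above; a fresh RDD is used because a storage level cannot be reassigned once set):

    // A fresh RDD: persist() with no argument defaults to MEMORY_ONLY
    val other = sc.parallelize(Seq("a", "b", "c"))
    other.persist()  // same as other.persist(StorageLevel.MEMORY_ONLY)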

The doc details a number of storage levels and what they mean, but fundamentally they are a configuration shorthand pointing Spark to an instance of the StorageLevel class. You can thus define your own, with a replication factor of up to 40.
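For example, using Spark's StorageLevel.apply factory (a sketch; the replication value of 3 and the RDD are illustrative placeholders):

    // A custom level: deserialized objects kept in memory, spilling to
    // disk, with each block replicated on 3 executors
    // (parameters: useDisk, useMemory, useOffHeap, deserialized, replication)
    val memAndDisk3 = StorageLevel(
      useDisk = true,
      useMemory = true,
      useOffHeap = false,
      deserialized = true,
      replication = 3)

    val replicated = sc.parallelize(1 to 1000)
    replicated.persist(memAndDisk3)

For the common case of two replicas, the predefined _2 variants (e.g. StorageLevel.MEMORY_AND_DISK_2) already provide this.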

Note that of the various predefined storage levels, some keep a single copy of the RDD. In fact, that is true of all those whose name is not suffixed with _2 (NONE excepted):

  • DISK_ONLY
  • MEMORY_ONLY
  • MEMORY_ONLY_SER
  • MEMORY_AND_DISK
  • MEMORY_AND_DISK_SER
  • OFF_HEAP

That's one copy per medium they employ, of course; if you want a single copy overall, you have to choose a single-medium storage level.
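A quick way to see the difference is to inspect the replication field of the predefined levels (a sketch, continuing the same session):

    // replication is 1 for single-copy levels, 2 for the _2 variants
    println(StorageLevel.MEMORY_ONLY.replication)    // 1
    println(StorageLevel.MEMORY_ONLY_2.replication)  // 2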