
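Spark Operators and Summary, Part 3

Updated: 2022-09-04 07:54:26

Study notes for the Developer Academy course "Big Data Real-Time Computing Framework: Spark Quick Start", lesson "Spark Operators and Summary, Part 3". The notes follow the course closely so that readers can pick up the material quickly.

Course URL: https://developer.aliyun.com/learning/course/100/detail/1693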




Contents:

I. JoinOperator code

II. Choosing a storage level


I. JoinOperator code


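// Simulated input collections.
// Needs java.util.Arrays, java.util.List, scala.Tuple2 and org.apache.spark.api.java.JavaPairRDD;
// sc is assumed to be a JavaSparkContext created earlier in the file.
List<Tuple2<Integer, String>> nameList = Arrays.asList(
        new Tuple2<Integer, String>(1, "xuruyun"),
        new Tuple2<Integer, String>(2, "liangyongqi"),
        new Tuple2<Integer, String>(3, "wangfei"),
        new Tuple2<Integer, String>(3, "annie"));

List<Tuple2<Integer, Integer>> scoreList = Arrays.asList(
        new Tuple2<Integer, Integer>(1, 150),
        new Tuple2<Integer, Integer>(2, 100),
        new Tuple2<Integer, Integer>(3, 80),
        new Tuple2<Integer, Integer>(3, 90));

// Turn both lists into key-value RDDs keyed by the integer ID.
JavaPairRDD<Integer, String> nameRDD = sc.parallelizePairs(nameList);
JavaPairRDD<Integer, Integer> scoreRDD = sc.parallelizePairs(scoreList);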
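The listing stops before the join itself. In Spark's Java API the next step would typically be `nameRDD.join(scoreRDD)`, which yields a `JavaPairRDD<Integer, Tuple2<String, Integer>>`. The sketch below does not use Spark at all: it is a plain-Java simulation of inner-join semantics on the same sample data, so that the effect of the duplicate key 3 is easy to see. The class `JoinSketch` and its helpers `pair`, `innerJoin` and `demo` are hypothetical names introduced only for this illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

class JoinSketch {
    // Hypothetical helper: builds one key-value pair.
    static <K, V> Map.Entry<K, V> pair(K key, V value) {
        return new SimpleEntry<>(key, value);
    }

    // Inner join: emit one row for every (name, score) combination that shares a key.
    static List<String> innerJoin(List<Map.Entry<Integer, String>> names,
                                  List<Map.Entry<Integer, Integer>> scores) {
        List<String> rows = new ArrayList<>();
        for (Map.Entry<Integer, String> n : names) {
            for (Map.Entry<Integer, Integer> s : scores) {
                if (n.getKey().equals(s.getKey())) {
                    rows.add(n.getKey() + ":" + n.getValue() + ":" + s.getValue());
                }
            }
        }
        return rows;
    }

    // Same sample data as the course listing.
    static List<String> demo() {
        List<Map.Entry<Integer, String>> names = Arrays.asList(
                pair(1, "xuruyun"), pair(2, "liangyongqi"),
                pair(3, "wangfei"), pair(3, "annie"));
        List<Map.Entry<Integer, Integer>> scores = Arrays.asList(
                pair(1, 150), pair(2, 100), pair(3, 80), pair(3, 90));
        return innerJoin(names, scores);
    }

    public static void main(String[] args) {
        // 6 rows: keys 1 and 2 match once each, key 3 matches 2 x 2 times.
        for (String row : demo()) {
            System.out.println(row);
        }
    }
}
```

Because key 3 appears twice on each side, the join produces four rows for it. A real Spark join behaves the same way with respect to duplicates, although the order of the resulting rows is not guaranteed there.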


II. Choosing a storage level


Which Storage Level to Choose?

Spark's storage levels are meant to provide different trade-offs between memory usage and CPU efficiency. We recommend going through the following process to select one:

If your RDDs fit comfortably with the default storage level (MEMORY_ONLY), leave them that way. This is the most CPU-efficient option, allowing operations on the RDDs to run as fast as possible.

If not, try using MEMORY_ONLY_SER and selecting a fast serialization library to make the objects much more space-efficient, but still reasonably fast to access.

Don't spill to disk unless the functions that computed your datasets are expensive, or they filter a large amount of the data. Otherwise, recomputing a partition may be as fast as reading it from disk.

Use the replicated storage levels if you want fast fault recovery (e.g. if using Spark to serve requests from a web application). All the storage levels provide full fault tolerance by recomputing lost data, but the replicated ones let you continue running tasks on the RDD without waiting to recompute a lost partition.

In environments with high amounts of memory or multiple applications, the experimental OFF_HEAP mode has several advantages:

It allows multiple executors to share the same pool of memory in Tachyon.

It significantly reduces garbage collection costs.

Cached data is not lost if individual executors crash.
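The selection process above can be read as a small decision procedure. The following is only a sketch in plain Java, not part of Spark: the class `StorageLevelSketch`, the method `choose` and its four boolean inputs are hypothetical, and the method returns the storage level's name as a string. Picking MEMORY_AND_DISK_SER as the disk-backed level is an assumption of this sketch, since the quoted text does not say which disk level to prefer. OFF_HEAP is left out because it is described as experimental.

```java
class StorageLevelSketch {
    // Returns the name of a storage level following the guideline's order of checks.
    static String choose(boolean fitsDeserialized,
                         boolean fitsSerialized,
                         boolean expensiveToRecompute,
                         boolean needsFastRecovery) {
        String level;
        if (fitsDeserialized) {
            // Default level: most CPU-efficient.
            level = "MEMORY_ONLY";
        } else if (fitsSerialized) {
            // Trade some CPU for a smaller memory footprint.
            level = "MEMORY_ONLY_SER";
        } else if (expensiveToRecompute) {
            // Assumption of this sketch: spill to disk in serialized form.
            level = "MEMORY_AND_DISK_SER";
        } else {
            // Cheap to recompute: let partitions that do not fit be recomputed.
            level = "MEMORY_ONLY_SER";
        }
        // Replicated variants carry the "_2" suffix.
        return needsFastRecovery ? level + "_2" : level;
    }

    public static void main(String[] args) {
        System.out.println(choose(false, true, false, false));
    }
}
```

In an actual Spark job the chosen level would be applied with a call such as `nameRDD.persist(StorageLevel.MEMORY_ONLY_SER())`, using `org.apache.spark.storage.StorageLevel`.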
