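Spark Operators: Usage and Summary, Part 3

Updated: 2022-09-04 07:54:26

Study notes for the Developer Academy course "Spark Quick Start: A Real-Time Big Data Computing Framework", lesson "Spark Operators: Usage and Summary, Part 3". The notes follow the course closely to help readers pick up the material quickly.

Course URL: https://developer.aliyun.com/learning/course/100/detail/1693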

Contents:

1. JoinOperator code

2. Choosing a storage level

1. JoinOperator code

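// Simulated collections (sample data).
// Requires java.util.Arrays, java.util.List, scala.Tuple2,
// and org.apache.spark.api.java.JavaPairRDD.
List<Tuple2<Integer, String>> nameList = Arrays.asList(
        new Tuple2<Integer, String>(1, "xuruyun"),
        new Tuple2<Integer, String>(2, "liangyongqi"),
        new Tuple2<Integer, String>(3, "wangfei"),
        new Tuple2<Integer, String>(3, "annie"));

List<Tuple2<Integer, Integer>> scoreList = Arrays.asList(
        new Tuple2<Integer, Integer>(1, 150),
        new Tuple2<Integer, Integer>(2, 100),
        new Tuple2<Integer, Integer>(3, 80),
        new Tuple2<Integer, Integer>(3, 90));

// sc is assumed to be a JavaSparkContext defined earlier (not shown in these notes).
JavaPairRDD<Integer, String> nameRDD = sc.parallelizePairs(nameList);
JavaPairRDD<Integer, Integer> scoreRDD = sc.parallelizePairs(scoreList);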
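To make the purpose of the two pair RDDs concrete, here is a minimal plain-Java sketch of what an inner join on these keys yields. It only illustrates the semantics and is not Spark's implementation: the `Pair` class and the names `JoinSketch`, `innerJoin`, `sampleNames`, and `sampleScores` are hypothetical stand-ins introduced for this example. In Spark itself the corresponding call would be `nameRDD.join(scoreRDD)`, which returns a `JavaPairRDD<Integer, Tuple2<String, Integer>>`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of inner-join semantics on (key, value) pairs.
// This is NOT Spark's implementation; it only mirrors the shape of the result.
class JoinSketch {

    // Minimal stand-in for scala.Tuple2 (hypothetical helper for this sketch).
    static final class Pair<K, V> {
        final K key;
        final V value;

        Pair(K key, V value) {
            this.key = key;
            this.value = value;
        }
    }

    // For every name pair and score pair that share a key,
    // emit one record formatted as "(key,(name,score))".
    static List<String> innerJoin(List<Pair<Integer, String>> names,
                                  List<Pair<Integer, Integer>> scores) {
        List<String> joined = new ArrayList<>();
        for (Pair<Integer, String> n : names) {
            for (Pair<Integer, Integer> s : scores) {
                if (n.key.equals(s.key)) {
                    joined.add("(" + n.key + ",(" + n.value + "," + s.value + "))");
                }
            }
        }
        return joined;
    }

    // Same sample data as nameList in the notes.
    static List<Pair<Integer, String>> sampleNames() {
        return Arrays.asList(
                new Pair<Integer, String>(1, "xuruyun"),
                new Pair<Integer, String>(2, "liangyongqi"),
                new Pair<Integer, String>(3, "wangfei"),
                new Pair<Integer, String>(3, "annie"));
    }

    // Same sample data as scoreList in the notes.
    static List<Pair<Integer, Integer>> sampleScores() {
        return Arrays.asList(
                new Pair<Integer, Integer>(1, 150),
                new Pair<Integer, Integer>(2, 100),
                new Pair<Integer, Integer>(3, 80),
                new Pair<Integer, Integer>(3, 90));
    }

    public static void main(String[] args) {
        for (String record : innerJoin(sampleNames(), sampleScores())) {
            System.out.println(record);
        }
    }
}
```

Because key 3 appears twice in each list, the join emits every combination for that key, so the sketch yields six records in total. Spark does not guarantee the order of joined records, so only the set of records should be compared with a real run, not their order.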

2. Choosing a storage level

Which Storage Level to Choose?

Spark's storage levels are meant to provide different trade-offs between memory usage and CPU efficiency. We recommend going through the following process to select one:

If your RDDs fit comfortably with the default storage level (MEMORY_ONLY), leave them that way. This is the most CPU-efficient option, allowing operations on the RDDs to run as fast as possible.

If not, try using MEMORY_ONLY_SER and selecting a fast serialization library to make the objects much more space-efficient, but still reasonably fast to access.

Don't spill to disk unless the functions that computed your datasets are expensive, or they filter a large amount of the data. Otherwise, recomputing a partition may be as fast as reading it from disk.

Use the replicated storage levels if you want fast fault recovery (e.g. if using Spark to serve requests from a web application). All the storage levels provide full fault tolerance by recomputing lost data, but the replicated ones let you continue running tasks on the RDD without waiting to recompute a lost partition.

In environments with high amounts of memory or multiple applications, the experimental OFF_HEAP mode has several advantages:

It allows multiple executors to share the same pool of memory in Tachyon.

It significantly reduces garbage collection costs.

Cached data is not lost if individual executors crash.
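The selection process above can be sketched as a small decision helper. This is a hedged illustration, not part of Spark: the class `StorageLevelChooser`, the method `choose`, and its boolean parameters are hypothetical, and picking `MEMORY_AND_DISK_SER` for the spill-to-disk case is an assumption of this sketch. The returned strings match the names of constants on Spark's `StorageLevel`. The experimental OFF_HEAP mode is left out.

```java
// Hypothetical helper that encodes the storage-level selection process as code.
// It returns the NAME of a Spark StorageLevel constant; it does not call Spark.
class StorageLevelChooser {

    static String choose(boolean fitsDeserialized,
                         boolean fitsSerialized,
                         boolean expensiveToRecompute,
                         boolean needsFastRecovery) {
        String level;
        if (fitsDeserialized) {
            // Default level: the most CPU-efficient option.
            level = "MEMORY_ONLY";
        } else if (fitsSerialized) {
            // More space-efficient, still reasonably fast to access.
            level = "MEMORY_ONLY_SER";
        } else if (expensiveToRecompute) {
            // Spill to disk only when recomputation is costly
            // (assumption of this sketch: use the serialized disk-backed level).
            level = "MEMORY_AND_DISK_SER";
        } else {
            // Cheap to recompute: partitions that do not fit are simply recomputed.
            level = "MEMORY_ONLY_SER";
        }
        if (needsFastRecovery) {
            // Replicated variants carry the "_2" suffix.
            level = level + "_2";
        }
        return level;
    }

    public static void main(String[] args) {
        System.out.println(choose(true, true, false, false));
        System.out.println(choose(false, true, false, false));
        System.out.println(choose(false, false, true, true));
    }
}
```

In a Spark Java program the chosen level would be passed to `persist`, for example `nameRDD.persist(StorageLevel.MEMORY_ONLY_SER())` with `org.apache.spark.storage.StorageLevel` imported.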

