
What is the difference between Apache Spark SQLContext and HiveContext?


Spark 2.0+

Spark 2.0 provides native window functions (SPARK-8641), features additional improvements in parsing, and offers much better SQL 2003 compliance, so it is significantly less dependent on Hive to achieve core functionality. Because of that, HiveContext (SparkSession with Hive support) seems to be slightly less important.
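As a minimal PySpark sketch of the Spark 2.0+ API, the Hive-backed session differs from a plain one by a single `enableHiveSupport()` call (the app names below are made up; Hive support also assumes Spark was built with the Hive classes on the classpath):

```python
from pyspark.sql import SparkSession

# Plain session: native window functions and the improved SQL parser
# are available with no Hive dependency at all.
spark = SparkSession.builder \
    .appName("no-hive") \
    .getOrCreate()

# Hive-enabled session: the Spark 2.0 successor to HiveContext.
spark_hive = SparkSession.builder \
    .appName("with-hive") \
    .enableHiveSupport() \
    .getOrCreate()
```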

Spark < 2.0

Obviously, if you want to work with Hive you have to use HiveContext. Beyond that, the biggest difference as of now (Spark 1.5) is support for window functions and the ability to access Hive UDFs.
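A minimal Spark 1.x sketch of creating the two contexts, assuming a plain local setup:

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext, HiveContext

sc = SparkContext(appName="context-demo")

# Plain Spark SQL context: no Hive dependency.
sqlContext = SQLContext(sc)

# Superset of SQLContext: adds the HiveQL parser, window functions,
# and access to Hive UDFs.
hiveContext = HiveContext(sc)
```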

Generally speaking, window functions are a pretty cool feature and can be used to solve quite complex problems in a concise way without going back and forth between RDDs and DataFrames. Performance is still far from optimal, especially without a PARTITION BY clause, but that is really nothing Spark-specific.
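For illustration, a short window-function sketch (assuming an existing SparkContext `sc`; in Spark 1.5 this needs a HiveContext, while on 2.x the same DataFrame code works on any SparkSession):

```python
from pyspark.sql import HiveContext
from pyspark.sql.window import Window
from pyspark.sql.functions import rank

hiveContext = HiveContext(sc)
df = hiveContext.createDataFrame(
    [("a", 1), ("a", 3), ("b", 2)], ["key", "value"])

# Rank rows within each key, staying in the DataFrame API the whole
# time; no round trip through RDDs.
w = Window.partitionBy("key").orderBy("value")
df.select("key", "value", rank().over(w).alias("rank")).show()
```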

Regarding Hive UDFs, it is not a serious issue now, but before Spark 1.5 many SQL functions were expressed using Hive UDFs and required HiveContext to work.
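As a sketch of what depends on HiveContext here: registering a function implemented as a Hive UDF class goes through the HiveQL parser, so it only works with a HiveContext, not a plain SQLContext. The class name `com.example.udf.MyUpper` below is hypothetical:

```python
from pyspark.sql import HiveContext

hiveContext = HiveContext(sc)  # assumes an existing SparkContext `sc`

# CREATE TEMPORARY FUNCTION is HiveQL; a plain SQLContext cannot
# register a Hive UDF this way.
hiveContext.sql(
    "CREATE TEMPORARY FUNCTION my_upper AS 'com.example.udf.MyUpper'")

hiveContext.createDataFrame([("spark",)], ["word"]) \
    .registerTempTable("words")
hiveContext.sql("SELECT my_upper(word) FROM words").show()
```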

HiveContext also provides a more robust SQL parser. See for example: py4j.protocol.Py4JJavaError when selecting nested column in dataframe using select statement

Finally, HiveContext is required to start the Thrift server.

The biggest problem with HiveContext is that it comes with a large set of dependencies.