What is the Scala case class equivalent in PySpark?

Updated: 2023-01-13 23:27:50

How would you go about employing and/or implementing a case class equivalent in PySpark?

As mentioned by Alex Hall, the real equivalent of a named product type is a namedtuple.

Unlike Row, which is suggested in the other answer, it has a number of useful properties:

  • Has a well-defined shape and can be reliably used for structural pattern matching:

    >>> from collections import namedtuple
    >>>
    >>> FooBar = namedtuple("FooBar", ["foo", "bar"])
    >>> foobar = FooBar(42, -42)
    >>> foo, bar = foobar
    >>> foo
    42
    >>> bar
    -42
    

    In contrast, Rows are not reliable when used with keyword arguments (the fields end up sorted by name):

    >>> from pyspark.sql import Row
    >>>
    >>> foobar = Row(foo=42, bar=-42)
    >>> foo, bar = foobar
    >>> foo
    -42
    >>> bar
    42
    

    although if defined with positional arguments:

    >>> FooBar = Row("foo", "bar")
    >>> foobar = FooBar(42, -42)
    >>> foo, bar = foobar
    >>> foo
    42
    >>> bar
    -42
    

    the order is preserved.
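
    Coming back to the namedtuple: on Python 3.10+, where namedtuple classes expose __match_args__, the same shape also works with the match statement. A minimal, illustrative sketch:

    >>> FooBar = namedtuple("FooBar", ["foo", "bar"])
    >>> def describe(p):
    ...     match p:
    ...         case FooBar(foo=0, bar=0):
    ...             return "origin"
    ...         case FooBar(foo, bar):
    ...             return f"foo={foo}, bar={bar}"
    ... 
    >>> describe(FooBar(42, -42))
    'foo=42, bar=-42'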

  • Defines proper types

    >>> from functools import singledispatch
    >>> 
    >>> FooBar = namedtuple("FooBar", ["foo", "bar"])
    >>> type(FooBar)
    <class 'type'>
    >>> isinstance(FooBar(42, -42), FooBar)
    True
    

    and can be used whenever type handling is required, especially with single dispatch:

    >>> Circle = namedtuple("Circle", ["x", "y", "r"])
    >>> Rectangle = namedtuple("Rectangle", ["x1", "y1", "x2", "y2"])
    >>> import math
    >>> @singledispatch
    ... def area(x):
    ...     raise NotImplementedError
    ... 
    ... 
    >>> @area.register(Rectangle)
    ... def _(x):
    ...     return abs(x.x1 - x.x2) * abs(x.y1 - x.y2)
    ... 
    ... 
    >>> @area.register(Circle)
    ... def _(x):
    ...     return math.pi * x.r ** 2
    ... 
    ... 
    >>>
    >>> area(Rectangle(0, 0, 4, 4))
    16
    >>> area(Circle(0, 0, 4))
    50.26548245743669
    

    and multiple dispatch:

    >>> from multipledispatch import dispatch
    >>> from numbers import Rational
    >>>
    >>> @dispatch(Rectangle, Rational)
    ... def scale(x, y):
    ...     return Rectangle(x.x1, x.y1, x.x2 * y, x.y2 * y)
    ... 
    ... 
    >>> @dispatch(Circle, Rational)
    ... def scale(x, y):
    ...     return Circle(x.x, x.y, x.r * y)
    ...
    ...
    >>> scale(Rectangle(0, 0, 4, 4), 2)
    Rectangle(x1=0, y1=0, x2=8, y2=8)
    >>> scale(Circle(0, 0, 11), 2)
    Circle(x=0, y=0, r=22)
    

    and, combined with the first property, they can be used in a wide range of pattern matching scenarios. namedtuples also support standard inheritance and type hints.
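
    For example, a minimal illustrative sketch of both features (the class names are made up for this example):

    >>> from collections import namedtuple
    >>>
    >>> FooBarBase = namedtuple("FooBarBase", ["foo", "bar"])
    >>> class FooBar(FooBarBase):        # standard inheritance from the generated class
    ...     def total(self) -> int:      # type hints work as on any other class
    ...         return self.foo + self.bar
    ... 
    >>> FooBar(42, -42).total()
    0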

    Rows don't:

    >>> FooBar = Row("foo", "bar")
    >>> type(FooBar)
    <class 'pyspark.sql.types.Row'>
    >>> isinstance(FooBar(42, -42), FooBar)  # Expected failure
    Traceback (most recent call last):
    ...
    TypeError: isinstance() arg 2 must be a type or tuple of types
    >>> BarFoo = Row("bar", "foo")
    >>> isinstance(FooBar(42, -42), type(BarFoo))
    True
    >>> isinstance(BarFoo(42, -42), type(FooBar))
    True
    

  • Provides a highly optimized representation. Unlike Row objects, tuples don't use __dict__ or carry field names with each instance (a quick check of this is sketched at the end of this point). As a result, they can be an order of magnitude faster to initialize:

    >>> FooBar = namedtuple("FooBar", ["foo", "bar"])
    >>> %timeit FooBar(42, -42)
    587 ns ± 5.28 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
    

    compared to different Row constructors:

    >>> %timeit Row(foo=42, bar=-42)
    3.91 µs ± 7.67 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
    >>> FooBar = Row("foo", "bar")
    >>> %timeit FooBar(42, -42)
    2 µs ± 25.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
    

    and are significantly more memory efficient (a very important property when working with large-scale data):

    >>> import sys
    >>> FooBar = namedtuple("FooBar", ["foo", "bar"])
    >>> sys.getsizeof(FooBar(42, -42))
    64
    

    compared to an equivalent Row:

    >>> sys.getsizeof(Row(foo=42, bar=-42))
    72
    

    Finally, attribute access is an order of magnitude faster with namedtuple:

    >>> FooBar = namedtuple("FooBar", ["foo", "bar"])
    >>> foobar = FooBar(42, -42)
    >>> %timeit foobar.foo
    102 ns ± 1.33 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
    

    compared to the equivalent operation on a Row object:

    >>> foobar = Row(foo=42, bar=-42)
    >>> %timeit foobar.foo
    2.58 µs ± 26.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
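
    A quick, illustrative way to confirm the __dict__ difference mentioned above:

    >>> from collections import namedtuple
    >>> from pyspark.sql import Row
    >>> FooBar = namedtuple("FooBar", ["foo", "bar"])
    >>> hasattr(FooBar(42, -42), "__dict__")       # namedtuple classes define __slots__ = ()
    False
    >>> hasattr(Row(foo=42, bar=-42), "__dict__")  # Row keeps __fields__ on each instance
    True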
    

  • Last but not least, namedtuples are properly supported in Spark SQL:

    >>> Record = namedtuple("Record", ["id", "name", "value"])
    >>> spark.createDataFrame([Record(1, "foo", 42)])
    DataFrame[id: bigint, name: string, value: bigint]
    

Summary:

It should be clear that Row is a very poor substitute for an actual product type, and should be avoided unless enforced by the Spark API.

It should also be clear that pyspark.sql.Row is not intended to be a replacement for a case class when you consider that it is the direct equivalent of org.apache.spark.sql.Row - a type which is pretty far from an actual product, and behaves like Seq[Any] (depending on the subclass, with names added). Both the Python and Scala implementations were introduced as a useful, albeit awkward, interface between external code and the internal Spark SQL representation.

See also:

  • It would be a shame not to mention the awesome MacroPy developed by Li Haoyi and its port (MacroPy3) by Alberto Berti:

    >>> import macropy.console
    0=[]=====> MacroPy Enabled <=====[]=0
    >>> from macropy.case_classes import macros, case
    >>> @case
    ... class FooBar(foo, bar): pass
    ... 
    >>> foobar = FooBar(42, -42)
    >>> foo, bar = foobar
    >>> foo
    42
    >>> bar
    -42
    

    which comes with a rich set of other features including, but not limited to, advanced pattern matching and neat lambda expression syntax.

  • Python dataclasses (Python 3.7+).
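
    As a minimal sketch, a frozen dataclass gives case-class-like value semantics together with type hints (the class name is illustrative):

    >>> from dataclasses import dataclass
    >>>
    >>> @dataclass(frozen=True)
    ... class FooBar:
    ...     foo: int
    ...     bar: int
    ... 
    >>> FooBar(42, -42) == FooBar(42, -42)   # structural equality, as with a case class
    True
    >>> FooBar(42, -42).foo
    42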