Apache Spark relies heavily on closure serialization to ship computations to other machines. Unfortunately, default JVM serialization is greedy and tends to pull irrelevant state into the closure. As a consequence, many Scala/Java frameworks need special adaptation to work with it.
(The following example uses ScalaTest; Scala 2 and Scala 3 exhibit identical behaviour.)
class ReduceClosureSerializingWithInline extends SparkUnitTest { // class is not serialisable

  val k = "1"

  it("will fail") {
    sc.parallelize(1 to 2).map { v =>
      k + v
    }.collect()
  }

  { // as a circumvention
    val k = "1"

    it("will succeed") {
      sc.parallelize(1 to 2).map { v =>
        k + v
      }.collect()
    }
  }
}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession
import org.scalatest.BeforeAndAfterAll
import org.scalatest.funspec.AnyFunSpec

abstract class SparkUnitTest extends AnyFunSpec, BeforeAndAfterAll:

  def appName: String = getClass.getSimpleName

  given spark: SparkSession =
    val sparkConf = new SparkConf(true).setAll(Map())
    SparkSession
      .builder()
      .master("local")
      .config(sparkConf)
      .appName(suiteName)
      .getOrCreate()

  def sc: SparkContext = spark.sparkContext

  override def afterAll(): Unit =
    spark.stop()

  export org.scalatest.matchers.should.Matchers.shouldEqual
The only difference between success and failure is that in the first case k is a member of the test class (so the closure captures the entire SparkUnitTest instance), while in the second case it is a free variable of a local block (so the closure carries only the string value itself).
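The same effect can be made explicit by copying the member into a local val just before the closure is built, so that only the copy is captured. A minimal sketch under the same setup (localK is an illustrative name, not part of the original code):

  it("will also succeed") {
    val localK = k // copy the member into a local val
    // the closure below captures only the String copy, not `this`
    sc.parallelize(1 to 2).map { v =>
      localK + v
    }.collect()
  }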
These circumventions work in the simple case above, but in many more complex cases they are either impossible or make the implementation much longer than necessary. Is there a way to make the first k behave like the second when it is serialised as part of a closure?
1 Answer
No. You should not use instance variables if you do not want the instance to be serialized. You have to carry all your serializable data in data classes (e.g. a case class, though it doesn't have to be one) that can safely be serialized. Your logic can live either in classes that have only serializable fields, or in singletons (object).
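As a minimal sketch of the singleton approach (Constants and ReduceClosureSerializingWithObject are hypothetical names, reusing the SparkUnitTest harness above), moving k into a top-level object keeps it out of the test instance entirely; each executor resolves the object through a static reference instead of deserialising it:

  object Constants { // top-level, so it carries no outer reference
    val k = "1"
  }

  class ReduceClosureSerializingWithObject extends SparkUnitTest {
    it("will succeed") {
      sc.parallelize(1 to 2).map { v =>
        Constants.k + v // resolved statically: `this` is never captured
      }.collect()
    }
  }

A small case class holding just the required fields works the same way: the closure then serialises a compact data holder rather than the whole suite.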