![Page 1: User Defined Aggregation In Apache Spark A Love Stor y · 2020. 7. 15. · Sketching Data with T-Digest In Apache Spark Smart Scalable Feature Reduction With Random Forests One-Pass](https://reader033.vdocuments.mx/reader033/viewer/2022060900/609d8abd053c4b1a4874f21e/html5/thumbnails/1.jpg)
User Defined Aggregation In Apache Spark A Love Story
Erik Erlandson, Principal Software Engineer
All Love Stories Are The Same
Hero Meets Aggregators
Hero Files Spark JIRA
Hero Merges Spark PR
Establish The Plot
Spark’s Scale-Out World
logical: 2 3 2 5 3 5 2 3 5
physical (three partitions):
2 3 2
5 3 5
2 3 5
Scale-Out Sum
Within the partition 2 3 5, start from s = 0 and add each element (running sum in parentheses):
s = 0
s = s + 2 (2)
s = s + 3 (5)
s = s + 5 (10)
Every partition yields a partial sum the same way:
2 + 3 + 5 = 10
5 + 3 + 5 = 13
2 + 3 + 2 = 7
The partial sums are then merged: 13 + 7 = 20, then 10 + 20 = 30.
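The scale-out sum can be sketched in plain Scala, without Spark, as a fold inside each partition followed by a merge of the partial results. This is only an illustration of the idea: the object name `ScaleOutSum` and the in-memory `partitions` value are assumptions made for the example, not part of the talk.

```scala
object ScaleOutSum {
  // The three physical partitions shown on the slides.
  val partitions: Seq[Seq[Int]] = Seq(Seq(2, 3, 5), Seq(5, 3, 5), Seq(2, 3, 2))

  // Update step: fold each partition, starting from the zero value s = 0.
  val partials: Seq[Int] = partitions.map(p => p.foldLeft(0)((s, x) => s + x))

  // Merge step: combine the per-partition partial sums.
  val total: Int = partials.reduce((a1, a2) => a1 + a2)

  def main(args: Array[String]): Unit = {
    println(partials.mkString(", ")) // 10, 13, 7
    println(total)                   // 30
  }
}
```

Because addition is associative and commutative, the order in which the partial sums are merged does not change the result.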
Spark Aggregators
| Operation | Data | Accumulator | Zero | Update | Merge | Present |
| --- | --- | --- | --- | --- | --- | --- |
| Sum | Numbers | Number | 0 | a + x | a1 + a2 | |
| Max | Numbers | Number | -∞ | max(a, x) | max(a1, a2) | |
| Average | Numbers | (sum, count) | (0, 0) | (sum + x, count + 1) | (s1 + s2, c1 + c2) | sum / count |
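The zero / update / merge / present pattern can be written down as a small trait. The sketch below is a hypothetical `SimpleAggregator` made up for illustration; it is not Spark's `Aggregator` API, and the `run` helper simply imitates "update within each partition, then merge" on in-memory sequences.

```scala
// Hypothetical trait for illustration only; not Spark's Aggregator API.
trait SimpleAggregator[In, Acc, Out] {
  def zero: Acc
  def update(a: Acc, x: In): Acc
  def merge(a1: Acc, a2: Acc): Acc
  def present(a: Acc): Out

  // Update within each partition, merge the partial accumulators, then present.
  def run(partitions: Seq[Seq[In]]): Out = {
    val partials = partitions.map(p => p.foldLeft(zero)((a, x) => update(a, x)))
    present(partials.foldLeft(zero)((a1, a2) => merge(a1, a2)))
  }
}

object SimpleAggregators {
  val sum: SimpleAggregator[Double, Double, Double] =
    new SimpleAggregator[Double, Double, Double] {
      def zero: Double = 0.0
      def update(a: Double, x: Double): Double = a + x
      def merge(a1: Double, a2: Double): Double = a1 + a2
      def present(a: Double): Double = a
    }

  val max: SimpleAggregator[Double, Double, Double] =
    new SimpleAggregator[Double, Double, Double] {
      def zero: Double = Double.NegativeInfinity
      def update(a: Double, x: Double): Double = math.max(a, x)
      def merge(a1: Double, a2: Double): Double = math.max(a1, a2)
      def present(a: Double): Double = a
    }

  // Average keeps a (sum, count) accumulator and divides only when presenting.
  val average: SimpleAggregator[Double, (Double, Long), Double] =
    new SimpleAggregator[Double, (Double, Long), Double] {
      def zero: (Double, Long) = (0.0, 0L)
      def update(a: (Double, Long), x: Double): (Double, Long) = (a._1 + x, a._2 + 1)
      def merge(a1: (Double, Long), a2: (Double, Long)): (Double, Long) =
        (a1._1 + a2._1, a1._2 + a2._2)
      def present(a: (Double, Long)): Double = a._1 / a._2
    }

  val partitions: Seq[Seq[Double]] =
    Seq(Seq(2.0, 3.0, 5.0), Seq(5.0, 3.0, 5.0), Seq(2.0, 3.0, 2.0))

  def main(args: Array[String]): Unit = {
    println(sum.run(partitions))     // 30.0
    println(max.run(partitions))     // 5.0
    println(average.run(partitions)) // 30 / 9, roughly 3.33
  }
}
```

Average shows why the accumulator type and the output type can differ: averages of partitions cannot be merged directly, but (sum, count) pairs can.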
Love Interest
Data Sketching: T-Digest
[Figure: a CDF sketched by a T-Digest, vertical axis from 0 to 1. The point (x, q) with q = 0.9 marks x as the 90th percentile.]
Is T-Digest an Aggregator?
Data Type Numeric
Accumulator Type T-Digest Sketch
Zero Empty T-Digest
Update tdigest + x
Merge tdigest1 + tdigest2
Present tdigest.cdfInverse(quantile)
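To make the shape of such an aggregator concrete, here is a toy stand-in written for this example. `ToySketch` is hypothetical and is not a T-Digest: it stores every value, so it is exact rather than a compact sketch. Only the operations (empty, `+`, `++`, `cdfInverse`) mirror the rows listed above.

```scala
// Toy stand-in for a T-Digest (hypothetical): it keeps all values sorted,
// so it is exact but not compact. Only the aggregator shape matches.
final case class ToySketch(sorted: Vector[Double]) {
  // Update: insert one data point, keeping the values sorted.
  def +(x: Double): ToySketch = {
    val (lo, hi) = sorted.span(_ <= x)
    ToySketch((lo :+ x) ++ hi)
  }

  // Merge: combine two sketches into one.
  def ++(that: ToySketch): ToySketch = ToySketch((sorted ++ that.sorted).sorted)

  // Present: smallest stored value whose empirical CDF is at least q.
  def cdfInverse(q: Double): Double = {
    require(sorted.nonEmpty && q > 0.0 && q <= 1.0)
    sorted(math.ceil(q * sorted.length).toInt - 1)
  }
}

object ToySketch {
  // Zero: the empty sketch.
  val empty: ToySketch = ToySketch(Vector.empty)
}

object ToySketchDemo {
  val partitions: Seq[Seq[Double]] =
    Seq(Seq(2.0, 3.0, 5.0), Seq(5.0, 3.0, 5.0), Seq(2.0, 3.0, 2.0))

  // Update within each partition, then merge the partial sketches.
  val merged: ToySketch =
    partitions
      .map(p => p.foldLeft(ToySketch.empty)((sk, x) => sk + x))
      .foldLeft(ToySketch.empty)((s1, s2) => s1 ++ s2)

  def main(args: Array[String]): Unit = {
    println(merged.cdfInverse(0.5)) // 3.0
    println(merged.cdfInverse(0.9)) // 5.0
  }
}
```

A real T-Digest replaces the full vector with a bounded set of clusters, which is what keeps the accumulator small enough to be worth shipping between partitions.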
Romantic Chemistry
val sketchCDF = tdigestUDAF[Double]
spark.udf.register("p50", (c:Any)=>c.asInstanceOf[TDigestSQL].tdigest.cdfInverse(0.5))
spark.udf.register("p90", (c:Any)=>c.asInstanceOf[TDigestSQL].tdigest.cdfInverse(0.9))
val query = records.writeStream //...
+---------+
|wordcount|
+---------+
|       12|
|        5|
|        9|
|       18|
|       12|
+---------+
val r = records.withColumn("time", current_timestamp())
  .groupBy(window($"time", "30 seconds"))
  .agg(sketchCDF($"wordcount").alias("CDF"))
  .select(callUDF("p50", $"CDF").alias("p50"), callUDF("p90", $"CDF").alias("p90"))
val query = r.writeStream //...
+----+----+
| p50| p90|
+----+----+
|15.6|31.0|
|16.0|30.8|
|15.8|30.0|
|15.7|31.0|
|16.0|31.0|
+----+----+
Romantic Montage
Sketching Data with T-Digest In Apache Spark
Smart Scalable Feature Reduction With Random Forests
One-Pass Data Science In Apache Spark With Generative T-Digests
Apache Spark for Library Developers
Extending Structured Streaming Made Easy with Algebra
Conflict!
UDAF Anatomy
class TDigestUDAF(deltaV: Double, maxDiscreteV: Int) extends UserDefinedAggregateFunction {
  def initialize(buf: MutableAggregationBuffer): Unit =
    buf(0) = TDigestSQL(TDigest.empty(deltaV, maxDiscreteV))
  def update(buf: MutableAggregationBuffer, input: Row): Unit =
    buf(0) = TDigestSQL(buf.getAs[TDigestSQL](0).tdigest + input.getDouble(0))
  def merge(buf1: MutableAggregationBuffer, buf2: Row): Unit =
    buf1(0) = TDigestSQL(buf1.getAs[TDigestSQL](0).tdigest ++ buf2.getAs[TDigestSQL](0).tdigest)
  def bufferSchema: StructType = StructType(StructField("tdigest", TDigestUDT) :: Nil)
  // yada yada yada ...
}
User Defined Type Anatomy
class TDigestUDT extends UserDefinedType[TDigestSQL] {
  def sqlType: DataType = StructType(
    StructField("delta", DoubleType, false) ::
    StructField("maxDiscrete", IntegerType, false) ::
    StructField("nclusters", IntegerType, false) ::
    StructField("clustX", ArrayType(DoubleType, false), false) ::
    StructField("clustM", ArrayType(DoubleType, false), false) ::
    Nil)
  def serialize(tdsql: TDigestSQL): Any = { /* pack T-Digest */ }
  def deserialize(datum: Any): TDigestSQL = { /* unpack T-Digest */ }
  // yada yada yada ...
}
Expensive: serialize and deserialize (packing and unpacking the T-Digest)
What Could Go Wrong?
class TDigestUDT extends UserDefinedType[TDigestSQL] {
  def serialize(tdsql: TDigestSQL): Any = {
    print("In serialize")
    // ...
  }
  def deserialize(datum: Any): TDigestSQL = {
    print("In deserialize")
    // ...
  }
  // yada yada yada ...
}
Partition 2 3 2: Init, Updates, Serialize
Partition 5 3 5: Init, Updates, Serialize
Partition 2 3 5: Init, Updates, Serialize
Merge
Wait What?
val sketchCDF = tdigestUDAF[Double]
val data = /* data frame with 1000 rows of data */
val sketch = data.agg(sketchCDF($"column").alias("sketch")).first
In deserialize
In serialize
In deserialize
In serialize
… 997 more times!
In deserialize
In serialize
Oh No
def update(buf: MutableAggregationBuffer, input: Row): Unit =
  buf(0) = TDigestSQL(buf.getAs[TDigestSQL](0).tdigest + input.getDouble(0))
// is equivalent to ...
def update(buf: MutableAggregationBuffer, input: Row): Unit = {
  val tdigest = buf.getAs[TDigestSQL](0).tdigest // deserialize
  val updated = tdigest + input.getDouble(0)     // do the actual update
  buf(0) = TDigestSQL(updated)                   // re-serialize
}
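The cost of that per-row round trip can be imitated without Spark. The sketch below is a simulation under stated assumptions: `CountingSerde` is a hypothetical stand-in for an expensive user-defined type, and the two functions only mimic "round-trip on every update" versus "serialize once after all updates". It is not how Spark is implemented internally.

```scala
object SerdeCost {
  // Hypothetical stand-in for an expensive UDT: counts how often it is called.
  final class CountingSerde {
    var serializeCalls = 0
    var deserializeCalls = 0
    def serialize(acc: Vector[Double]): Array[Double] = {
      serializeCalls += 1
      acc.toArray
    }
    def deserialize(stored: Array[Double]): Vector[Double] = {
      deserializeCalls += 1
      stored.toVector
    }
  }

  // UDAF-style: the buffer holds the serialized form, so each row round-trips.
  def perRowRoundTrip(rows: Seq[Double], serde: CountingSerde): Array[Double] = {
    var buf: Array[Double] = Array.empty
    for (x <- rows) {
      val acc = serde.deserialize(buf) // deserialize
      val updated = acc :+ x           // do the actual update
      buf = serde.serialize(updated)   // re-serialize
    }
    buf
  }

  // Aggregator-style: update an in-memory accumulator, serialize once at the end.
  def serializeOnce(rows: Seq[Double], serde: CountingSerde): Array[Double] = {
    val acc = rows.foldLeft(Vector.empty[Double])((a, x) => a :+ x)
    serde.serialize(acc)
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq.tabulate(1000)(i => i.toDouble)

    val slow = new CountingSerde
    perRowRoundTrip(rows, slow)
    println(s"${slow.deserializeCalls} deserialize, ${slow.serializeCalls} serialize") // 1000 and 1000

    val fast = new CountingSerde
    serializeOnce(rows, fast)
    println(s"${fast.deserializeCalls} deserialize, ${fast.serializeCalls} serialize") // 0 and 1
  }
}
```

Both functions produce the same final buffer; only the number of conversions differs, which is the overhead the "In deserialize / In serialize" trace exposes.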
SPARK-27296
Resolution
#25024
Aggregator Anatomy
class TDigestAggregator(deltaV: Double, maxDiscreteV: Int) extends Aggregator[Double, TDigestSQL, TDigestSQL] {
  def zero: TDigestSQL = TDigestSQL(TDigest.empty(deltaV, maxDiscreteV))
  def reduce(b: TDigestSQL, a: Double): TDigestSQL = TDigestSQL(b.tdigest + a)
  def merge(b1: TDigestSQL, b2: TDigestSQL): TDigestSQL =
    TDigestSQL(b1.tdigest ++ b2.tdigest)
  def finish(b: TDigestSQL): TDigestSQL = b
  val serde = ExpressionEncoder[TDigestSQL]()
  def bufferEncoder: Encoder[TDigestSQL] = serde
  def outputEncoder: Encoder[TDigestSQL] = serde
}
Intuitive Serialization
Partition 2 3 2: Init, Updates, Serialize
Partition 5 3 5: Init, Updates, Serialize
Partition 2 3 5: Init, Updates, Serialize
Merge
Custom Aggregation in Spark 3.0
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udaf
val sketchAgg = TDigestAggregator(0.5, 0)
val sketchCDF: UserDefinedFunction = udaf(sketchAgg)
val sketch = data.agg(sketchCDF($"column")).first
Performance
scala> val sketchOld = TDigestUDAF(0.5, 0)
sketchOld: org.apache.spark.tdigest.TDigestUDAF = // ...
scala> Benchmark.sample(5) { data.agg(sketchOld($"x1")).first }
res4: Array[Double] = Array(6.655, 7.044, 8.875, 9.425, 9.846)
scala> val sketchNew = udaf(TDigestAggregator(0.5, 0))
sketchNew: org.apache.spark.sql.expressions.UserDefinedFunction = // ...
scala> Benchmark.sample(5) { data.agg(sketchNew($"x1")).first }
res5: Array[Double] = Array(0.128, 0.12, 0.118, 0.12, 0.112)
70x Faster
Epilogue
Don’t Give Up
Patience
Respect