Lambda Processing for Near Real-Time Search Indexing at WalmartLabs: Spark Summit East Talk by...
TRANSCRIPT
WalmartLabs Use Case
Why Lambda Processing
NRT Architecture
Overview
Implementation
Monitoring
Spark Application Tuning
Lessons Learnt
Product
Categorization
Shipping
Logistics
Offers
Price
Adjustments
Use Case: Product Search Indexing
Suppliers / Merchants / Sellers
Item Setup
Ecommerce Search
Use Case: Near Real Time Indexing
• Improve customer experience
• Update product information
  o Index new products
  o Product attribute changes
  o Product offer (online availability) events
• 86 million product change events/day
• 1 product -> 5,000 stores
• Store availability change events ~ 20K events/sec
Motivation For Spark
• Offline/full indexing – integration with the Spark batch job
• Maintain the same code base/logic to ease debugging
• Potentially leverage the same technology stack for batch and streaming
Challenges
• Merge real-time data with historic signals data that is updated at a different frequency
• Keep the latest value of each attribute across updates from multiple pipelines
• Dynamic configuration updates in the streaming component
• Manage start/stop of Spark Streaming components
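The "latest value" challenge above can be sketched as a last-write-wins merge keyed on a per-attribute timestamp. This is a simplified illustration; the attribute names and event shape here are assumptions, not the actual WalmartLabs schema:

```python
# Last-write-wins merge of attribute updates arriving from multiple pipelines.
# Each update carries (attribute, value, event_time); we keep the newest value
# per attribute, so out-of-order or duplicate events cannot roll a field back.

def merge_updates(current, updates):
    """current: {attr: (value, ts)}; updates: iterable of (attr, value, ts)."""
    merged = dict(current)
    for attr, value, ts in updates:
        if attr not in merged or ts > merged[attr][1]:
            merged[attr] = (value, ts)
    return merged

doc = {"title": ("iPhone", 100)}
updates = [
    ("title", "iPhone 7", 200),        # newer timestamp: wins
    ("availability", "IN_STOCK", 150), # new attribute: added
    ("title", "iPhone 6", 120),        # older timestamp: ignored
]
doc = merge_updates(doc, updates)
```

In a Spark Streaming job the same comparison would run inside the per-key update step, with the timestamps carried on the Kafka events.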
Product Attributes
Real-time streaming attributes (60+):
• Availability
• Offers (lowest price)
• Product title
• Product reviews
• Product description
• …
Batch-computed attributes (20+):
• Item score
• Facets
Historic data computed by the batch pipeline is stored in Cassandra
Automatic management of the latest version of data fields
Merge real-time data with historic signals to compute the complete dataset
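A minimal sketch of that merge step: real-time attributes from the stream are overlaid on the batch-computed historic attributes read from Cassandra to produce the complete document. Field names are illustrative assumptions:

```python
def build_complete_doc(historic, realtime):
    """historic: batch-computed attributes (e.g. item score, facets) read
    from Cassandra; realtime: streaming attributes (e.g. availability, price).
    Real-time fields win on conflict because they are fresher."""
    doc = dict(historic)
    doc.update(realtime)
    return doc

historic = {"item_score": 0.87, "facets": ["Electronics"], "title": "Old title"}
realtime = {"title": "New title", "availability": "IN_STOCK"}
complete = build_complete_doc(historic, realtime)
```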
Lambda Architecture Processing Overview
Reprocessing?
Event Ordering?
Synchronization of Configuration Update?
Start/Stop Streaming Component?
Orchestration with Full Index Update?
Implementation
Streaming Component Interaction
Spark Streaming Receiver Approach
Multiple Kafka Streams processing
Store offsets in Zookeeper
Kafka Partitions by ID
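Partitioning Kafka by ID means routing every event for a given product to the same partition, so events for one product are consumed in order. A sketch of such a partitioner, using a stable hash (the function name and MD5 choice are illustrative assumptions):

```python
import hashlib

def partition_for(product_id: str, num_partitions: int) -> int:
    # Use a stable hash (not Python's randomized built-in hash()) so the
    # same product always maps to the same partition across restarts.
    digest = hashlib.md5(product_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# All events for one product land in one partition, preserving per-product order.
p1 = partition_for("product-123", 16)
p2 = partition_for("product-123", 16)
```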
Monitoring
Extended Spark Metrics Api
Register Custom Accumulators/Gauges for key metrics
Kafka Consumer Lag with Custom Scripts
Grafana Dashboard for Visualization
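Consumer lag, as monitored by the custom scripts mentioned above, is the gap per partition between the latest offset in the Kafka log and the offset the consumer group has committed (here, to ZooKeeper). A minimal sketch of the computation:

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Lag per partition = latest offset in the log minus the offset the
    consumer group has committed (e.g. as read back from ZooKeeper).
    A partition with no commit yet counts from offset 0."""
    return {p: log_end_offsets[p] - committed_offsets.get(p, 0)
            for p in log_end_offsets}

lag = consumer_lag({0: 1000, 1: 500}, {0: 990, 1: 500})
total_lag = sum(lag.values())  # a single gauge to feed into Grafana
```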
Tuning
• Scheduling delay = 0
• Partition RDDs effectively – in multiples of the number of Spark workers
• Prefer coalesce over repartition
• spark.streaming.backpressure.enabled
• spark.shuffle.consolidateFiles
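The tuning flags above as a spark-defaults style configuration sketch. The parallelism value is illustrative, and `spark.shuffle.consolidateFiles` applies to the older hash shuffle manager (pre-2.0 Spark):

```properties
spark.streaming.backpressure.enabled=true
spark.shuffle.consolidateFiles=true
# partition count chosen as a multiple of total executor cores (illustrative)
spark.default.parallelism=64
```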
Lessons
Querying Cassandra
Worst: filter on the Spark side
sc.cassandraTable().filter(partitionkey in keys)
Bad: filter on the C* side in a single operation
sc.cassandraTable().where(keys in productIds)
Similar to an "in" query clause
Query: SELECT * FROM my_keyspace.users WHERE id IN (1, 2, 3, 4)
Best: filter on the C* side in a distributed and concurrent fashion
kafkaRDD.joinWithCassandraTable()
A Little More About the IN Clause
Multiple Requests: “In” Clause Failure Scenario
Img src: https://lostechies.com/ryansvihla/2014/09/22/cassandra-query-patterns-not-using-the-in-query-for-multiple-partitions/
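The failure scenario can be illustrated with a toy session object: one IN query is a single request funnelled through one coordinator (so one slow or failed replica fails the whole query), while per-key queries are independent and concurrent, so a failure loses only one key. The `session` API here is a stand-in, not the actual Cassandra driver:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_in_clause(session, ids):
    # Single request: one coordinator must gather every partition itself.
    cql = "SELECT * FROM my_keyspace.users WHERE id IN (%s)" % \
        ", ".join(map(str, ids))
    return session.execute(cql)

def fetch_per_key(session, ids, workers=4):
    # One small request per partition key, issued concurrently; each can be
    # routed to the replica owning that key, and failures are isolated.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        cqls = ["SELECT * FROM my_keyspace.users WHERE id = %d" % i
                for i in ids]
        return list(pool.map(session.execute, cqls))

class RecordingSession:
    """Toy stand-in for a driver session: just records the queries issued."""
    def __init__(self):
        self.queries = []
    def execute(self, cql):
        self.queries.append(cql)
        return cql

in_session, key_session = RecordingSession(), RecordingSession()
fetch_in_clause(in_session, [1, 2, 3, 4])   # 1 request, 1 coordinator
fetch_per_key(key_session, [1, 2, 3, 4])    # 4 independent requests
```

`joinWithCassandraTable()` from the Spark–Cassandra connector achieves the distributed per-key pattern at scale, issuing the lookups from the executors that hold each key.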
Lessons
Spark Locality Wait
Avoid ANY
spark.locality.wait = 3s
Connection Keep Alive
spark.cassandra.connection.keep_alive_ms
Cache RDDs !!
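The two settings above as a configuration sketch (3s is Spark's default locality wait; the keep-alive value is illustrative):

```properties
# wait up to 3s for a data-local slot before degrading locality;
# avoids tasks being scheduled at locality level ANY
spark.locality.wait=3s
# keep Cassandra connections open between micro-batches (illustrative value)
spark.cassandra.connection.keep_alive_ms=30000
```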
Thank You ! Questions ?
- Snehal Nagmote, https://www.linkedin.com/in/snehal-nagmote-79651122
@ WalmartLabs