Lambda Processing for Near Real-Time Search Indexing at WalmartLabs: Spark Summit East Talk by...
TRANSCRIPT
WalmartLabs Use Case
Why Lambda Processing
NRT Architecture
Overview
Implementation
Monitoring
Spark Application Tuning
Lessons Learnt
Product
Categorization
Shipping
Logistics
Offers
Price
Adjustments
Use Case: Product Search Indexing
Suppliers / Merchants / Sellers
Item Setup
Ecommerce Search
Use Case: Near Real Time Indexing
• Improve customer experience
• Update product information
  o Index new products
  o Product attribute changes
  o Product offer (online availability) events
• 86 million product change events/day
• 1 product -> 5,000 stores
• Store availability change events ~ 20K events/sec
Motivation For Spark
• Offline/full indexing – integration with the Spark batch job
• Maintain the same code base/logic to ease debugging
• Potentially leverage the same technology stack for batch and streaming
Challenges
• Merge real-time data with historic signals data that is updated at a different frequency
• Keep the latest value of each attribute across updates from multiple pipelines
• Dynamic configuration updates in the streaming component
• Manage start/stop of Spark Streaming components
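The "latest value" challenge above can be sketched as a last-write-wins merge keyed on a per-attribute timestamp. This is a simplified illustration; the attribute names and event shape here are assumptions, not the actual WalmartLabs schema:

```python
# Last-write-wins merge of attribute updates arriving from multiple pipelines.
# Each update carries (attribute, value, event_time); we keep the newest value
# per attribute, so out-of-order or duplicate events cannot roll a field back.

def merge_updates(current, updates):
    """current: {attr: (value, ts)}; updates: iterable of (attr, value, ts)."""
    merged = dict(current)
    for attr, value, ts in updates:
        if attr not in merged or ts > merged[attr][1]:
            merged[attr] = (value, ts)
    return merged

doc = {"title": ("iPhone", 100)}
updates = [
    ("title", "iPhone 7", 200),        # newer timestamp: wins
    ("availability", "IN_STOCK", 150), # new attribute: added
    ("title", "iPhone 6", 120),        # older timestamp: ignored
]
doc = merge_updates(doc, updates)
```

In a Spark Streaming job the same comparison would run inside the per-key update step, with the timestamps carried on the Kafka events.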
Product Attributes
Real-time streaming attributes (60+):
• Availability
• Offers (lowest price)
• Product title
• Product reviews
• Product description
• …
Batch-computed attributes (20+):
• Item score
• Facets
Historic data computed by the batch pipeline is stored in Cassandra
Automatic management of the latest version of data fields
Merge real-time data with historic signals to compute the complete dataset
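A minimal sketch of that merge step: real-time attributes from the stream are overlaid on the batch-computed historic attributes read from Cassandra to produce the complete document. Field names are illustrative assumptions:

```python
def build_complete_doc(historic, realtime):
    """historic: batch-computed attributes (e.g. item score, facets) read
    from Cassandra; realtime: streaming attributes (e.g. availability, price).
    Real-time fields win on conflict because they are fresher."""
    doc = dict(historic)
    doc.update(realtime)
    return doc

historic = {"item_score": 0.87, "facets": ["Electronics"], "title": "Old title"}
realtime = {"title": "New title", "availability": "IN_STOCK"}
complete = build_complete_doc(historic, realtime)
```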
Lambda Architecture Processing Overview
Reprocessing?
Event Ordering?
Synchronization of Configuration Update?
Start/Stop Streaming Component?
Orchestration with Full Index Update?
Implementation
Streaming Component Interaction
Spark Streaming Receiver Approach
Multiple Kafka Streams processing
Store offsets in Zookeeper
Kafka Partitions by ID
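Partitioning Kafka by ID means routing every event for a given product to the same partition, so events for one product are consumed in order. A sketch of such a partitioner, using a stable hash (the function name and MD5 choice are illustrative assumptions):

```python
import hashlib

def partition_for(product_id: str, num_partitions: int) -> int:
    # Use a stable hash (not Python's randomized built-in hash()) so the
    # same product always maps to the same partition across restarts.
    digest = hashlib.md5(product_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# All events for one product land in one partition, preserving per-product order.
p1 = partition_for("product-123", 16)
p2 = partition_for("product-123", 16)
```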
Monitoring
Extended Spark Metrics Api
Register Custom Accumulators/Gauges for key metrics
Kafka Consumer Lag with Custom Scripts
Grafana Dashboard for Visualization
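Consumer lag, as monitored by the custom scripts mentioned above, is the gap per partition between the latest offset in the Kafka log and the offset the consumer group has committed (here, to ZooKeeper). A minimal sketch of the computation:

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Lag per partition = latest offset in the log minus the offset the
    consumer group has committed (e.g. as read back from ZooKeeper).
    A partition with no commit yet counts from offset 0."""
    return {p: log_end_offsets[p] - committed_offsets.get(p, 0)
            for p in log_end_offsets}

lag = consumer_lag({0: 1000, 1: 500}, {0: 990, 1: 500})
total_lag = sum(lag.values())  # a single gauge to feed into Grafana
```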
Tuning
• Scheduling delay = 0
• Partition RDDs effectively – in multiples of the number of Spark workers
• Prefer coalesce over repartition
• spark.streaming.backpressure.enabled
• spark.shuffle.consolidateFiles
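The tuning flags above as a spark-defaults style configuration sketch. The parallelism value is illustrative, and `spark.shuffle.consolidateFiles` applies to the older hash shuffle manager (pre-2.0 Spark):

```properties
spark.streaming.backpressure.enabled=true
spark.shuffle.consolidateFiles=true
# partition count chosen as a multiple of total executor cores (illustrative)
spark.default.parallelism=64
```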
Lessons
Querying Cassandra
Worst: filter on the Spark side
sc.cassandraTable().filter(partitionkey in keys)
Bad: filter on the C* side in a single operation
sc.cassandraTable().where(keys in productIds)
Similar to an "in" query clause
Query: SELECT * FROM my_keyspace.users WHERE id IN (1, 2, 3, 4)
Best: filter on the C* side in a distributed and concurrent fashion
kafkaRDD.joinWithCassandraTable()
A Little More About the IN Clause
Multiple Requests: “In” Clause Failure Scenario
Img src: https://lostechies.com/ryansvihla/2014/09/22/cassandra-query-patterns-not-using-the-in-query-for-multiple-partitions/
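The failure scenario can be illustrated with a toy session object: one IN query is a single request funnelled through one coordinator (so one slow or failed replica fails the whole query), while per-key queries are independent and concurrent, so a failure loses only one key. The `session` API here is a stand-in, not the actual Cassandra driver:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_in_clause(session, ids):
    # Single request: one coordinator must gather every partition itself.
    cql = "SELECT * FROM my_keyspace.users WHERE id IN (%s)" % \
        ", ".join(map(str, ids))
    return session.execute(cql)

def fetch_per_key(session, ids, workers=4):
    # One small request per partition key, issued concurrently; each can be
    # routed to the replica owning that key, and failures are isolated.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        cqls = ["SELECT * FROM my_keyspace.users WHERE id = %d" % i
                for i in ids]
        return list(pool.map(session.execute, cqls))

class RecordingSession:
    """Toy stand-in for a driver session: just records the queries issued."""
    def __init__(self):
        self.queries = []
    def execute(self, cql):
        self.queries.append(cql)
        return cql

in_session, key_session = RecordingSession(), RecordingSession()
fetch_in_clause(in_session, [1, 2, 3, 4])   # 1 request, 1 coordinator
fetch_per_key(key_session, [1, 2, 3, 4])    # 4 independent requests
```

`joinWithCassandraTable()` from the Spark–Cassandra connector achieves the distributed per-key pattern at scale, issuing the lookups from the executors that hold each key.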
Lessons
Spark Locality Wait
Avoid ANY
spark.locality.wait = 3s
Connection Keep Alive
spark.cassandra.connection.keep_alive_ms
Cache RDDs !!
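The two settings above as a configuration sketch (3s is Spark's default locality wait; the keep-alive value is illustrative):

```properties
# wait up to 3s for a data-local slot before degrading locality;
# avoids tasks being scheduled at locality level ANY
spark.locality.wait=3s
# keep Cassandra connections open between micro-batches (illustrative value)
spark.cassandra.connection.keep_alive_ms=30000
```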
Thank You ! Questions ?
- Snehal Nagmote, https://www.linkedin.com/in/snehal-nagmote-79651122
@ WalmartLabs