app engine batch data processing
TRANSCRIPT
![Page 1: App Engine Batch Data Processing](https://reader030.vdocuments.mx/reader030/viewer/2022021117/577d26e91a28ab4e1ea2871c/html5/thumbnails/1.jpg)
8/6/2019 App Engine Batch Data Processing
http://slidepdf.com/reader/full/app-engine-batch-data-processing 1/48
View live notes and ask questions aboutthis session on Google Wave
http://bit.ly/ds6t9F
![Page 2: App Engine Batch Data Processing](https://reader030.vdocuments.mx/reader030/viewer/2022021117/577d26e91a28ab4e1ea2871c/html5/thumbnails/2.jpg)
8/6/2019 App Engine Batch Data Processing
http://slidepdf.com/reader/full/app-engine-batch-data-processing 2/48
![Page 3: App Engine Batch Data Processing](https://reader030.vdocuments.mx/reader030/viewer/2022021117/577d26e91a28ab4e1ea2871c/html5/thumbnails/3.jpg)
8/6/2019 App Engine Batch Data Processing
http://slidepdf.com/reader/full/app-engine-batch-data-processing 3/48
Batch Data Processingwith Google App Engine
Mike Aizatsky20 May 2010
![Page 4: App Engine Batch Data Processing](https://reader030.vdocuments.mx/reader030/viewer/2022021117/577d26e91a28ab4e1ea2871c/html5/thumbnails/4.jpg)
8/6/2019 App Engine Batch Data Processing
http://slidepdf.com/reader/full/app-engine-batch-data-processing 4/48
View live notes and ask questions aboutthis session on Google Wave
http://bit.ly/ds6t9F
![Page 5: App Engine Batch Data Processing](https://reader030.vdocuments.mx/reader030/viewer/2022021117/577d26e91a28ab4e1ea2871c/html5/thumbnails/5.jpg)
8/6/2019 App Engine Batch Data Processing
http://slidepdf.com/reader/full/app-engine-batch-data-processing 5/48
Agenda
The challenge
Early batch processing in App Engine
Batch processing with Task Queues
Batch processing at Google
App Engine approach
![Page 6: App Engine Batch Data Processing](https://reader030.vdocuments.mx/reader030/viewer/2022021117/577d26e91a28ab4e1ea2871c/html5/thumbnails/6.jpg)
8/6/2019 App Engine Batch Data Processing
http://slidepdf.com/reader/full/app-engine-batch-data-processing 6/48
Batch Data Processing
Processing thousands of entities is hard on App Engine
Some examples:
schema migration
data export
report generation
![Page 7: App Engine Batch Data Processing](https://reader030.vdocuments.mx/reader030/viewer/2022021117/577d26e91a28ab4e1ea2871c/html5/thumbnails/7.jpg)
8/6/2019 App Engine Batch Data Processing
http://slidepdf.com/reader/full/app-engine-batch-data-processing 7/48
What Makes It Hard?
App Engine imposes certain restrictions
Restrictions guarantee automatic scalability
![Page 8: App Engine Batch Data Processing](https://reader030.vdocuments.mx/reader030/viewer/2022021117/577d26e91a28ab4e1ea2871c/html5/thumbnails/8.jpg)
8/6/2019 App Engine Batch Data Processing
http://slidepdf.com/reader/full/app-engine-batch-data-processing 8/48
What Makes It Hard?
30s request limit
Transient errors in the system
Datastore latency and timeouts
Changing dataset
![Page 9: App Engine Batch Data Processing](https://reader030.vdocuments.mx/reader030/viewer/2022021117/577d26e91a28ab4e1ea2871c/html5/thumbnails/9.jpg)
8/6/2019 App Engine Batch Data Processing
http://slidepdf.com/reader/full/app-engine-batch-data-processing 9/48
Early Batch Processing
![Page 10: App Engine Batch Data Processing](https://reader030.vdocuments.mx/reader030/viewer/2022021117/577d26e91a28ab4e1ea2871c/html5/thumbnails/10.jpg)
8/6/2019 App Engine Batch Data Processing
http://slidepdf.com/reader/full/app-engine-batch-data-processing 10/48
Early Batch Processing
Define batch web handler:
class BatchHandler(...):def get(self):
self.processNextBatch()
Use a web page with auto-refresh or use curl:
while true; curl http://batch_url; doneTrivia: Was actually done by the App Engine team onthe day of release
![Page 11: App Engine Batch Data Processing](https://reader030.vdocuments.mx/reader030/viewer/2022021117/577d26e91a28ab4e1ea2871c/html5/thumbnails/11.jpg)
8/6/2019 App Engine Batch Data Processing
http://slidepdf.com/reader/full/app-engine-batch-data-processing 11/48
Challenges
Need an external "driver" computer (may fail too)
Difficult error handling and recovery
Slow, inefficient
Complex state management
![Page 12: App Engine Batch Data Processing](https://reader030.vdocuments.mx/reader030/viewer/2022021117/577d26e91a28ab4e1ea2871c/html5/thumbnails/12.jpg)
8/6/2019 App Engine Batch Data Processing
http://slidepdf.com/reader/full/app-engine-batch-data-processing 12/48
Possible Improvements
Communicate state with the driver
Sharding/parallel execution
Use remote api for complex scenarios
Typical Example: bulkloader from App Engine SDK
![Page 13: App Engine Batch Data Processing](https://reader030.vdocuments.mx/reader030/viewer/2022021117/577d26e91a28ab4e1ea2871c/html5/thumbnails/13.jpg)
8/6/2019 App Engine Batch Data Processing
http://slidepdf.com/reader/full/app-engine-batch-data-processing 13/48
Batch Processing With Task Queues
![Page 14: App Engine Batch Data Processing](https://reader030.vdocuments.mx/reader030/viewer/2022021117/577d26e91a28ab4e1ea2871c/html5/thumbnails/14.jpg)
8/6/2019 App Engine Batch Data Processing
http://slidepdf.com/reader/full/app-engine-batch-data-processing 14/48
Batch Processing with Task Queues
Task Queues released a year ago
Can perform work outside of a user request
Reliable, high-performance system
Still 30s limit
![Page 15: App Engine Batch Data Processing](https://reader030.vdocuments.mx/reader030/viewer/2022021117/577d26e91a28ab4e1ea2871c/html5/thumbnails/15.jpg)
8/6/2019 App Engine Batch Data Processing
http://slidepdf.com/reader/full/app-engine-batch-data-processing 15/48
Task Chaining
Simple technique of overcoming 30s limit
Task enqueues its continuation when it's close to 30slimit
Task 1 Task 2 Task 3 Task 4
![Page 16: App Engine Batch Data Processing](https://reader030.vdocuments.mx/reader030/viewer/2022021117/577d26e91a28ab4e1ea2871c/html5/thumbnails/16.jpg)
8/6/2019 App Engine Batch Data Processing
http://slidepdf.com/reader/full/app-engine-batch-data-processing 16/48
Task Chaining
Define batch handler:
class BatchHandler(...):def get(self):
next_starting_point =
self.performNextBatch(
self.request["starting_point"])
taskqueue.Task("/batch", params ={"starting_point":
next_starting_point})
.add("batch_queue")
To start a process, simply enqueue first task
![Page 17: App Engine Batch Data Processing](https://reader030.vdocuments.mx/reader030/viewer/2022021117/577d26e91a28ab4e1ea2871c/html5/thumbnails/17.jpg)
8/6/2019 App Engine Batch Data Processing
http://slidepdf.com/reader/full/app-engine-batch-data-processing 17/48
Nice Task Queue Properties
Guarantees eventual task execution
No need for external drivers
Repeats task execution in case of unhandled failures
Can limit execution rate (both manually and
automatically)
![Page 18: App Engine Batch Data Processing](https://reader030.vdocuments.mx/reader030/viewer/2022021117/577d26e91a28ab4e1ea2871c/html5/thumbnails/18.jpg)
8/6/2019 App Engine Batch Data Processing
http://slidepdf.com/reader/full/app-engine-batch-data-processing 18/48
Batch Processing At Google
![Page 19: App Engine Batch Data Processing](https://reader030.vdocuments.mx/reader030/viewer/2022021117/577d26e91a28ab4e1ea2871c/html5/thumbnails/19.jpg)
8/6/2019 App Engine Batch Data Processing
http://slidepdf.com/reader/full/app-engine-batch-data-processing 19/48
Batch Processing At Google
MapReduce successfully used for years to do batch
processing at Google scale
Created to help developers work with unreliabledistributed systems
Widely adopted
![Page 20: App Engine Batch Data Processing](https://reader030.vdocuments.mx/reader030/viewer/2022021117/577d26e91a28ab4e1ea2871c/html5/thumbnails/20.jpg)
8/6/2019 App Engine Batch Data Processing
http://slidepdf.com/reader/full/app-engine-batch-data-processing 20/48
MapReduce
Developer defines only 2 functions:
map(entity) -> [(key, value)]reduce(key, [value]) -> [value]
![Page 21: App Engine Batch Data Processing](https://reader030.vdocuments.mx/reader030/viewer/2022021117/577d26e91a28ab4e1ea2871c/html5/thumbnails/21.jpg)
8/6/2019 App Engine Batch Data Processing
http://slidepdf.com/reader/full/app-engine-batch-data-processing 21/48
MapReduce Special Cases
Schema migration: empty reduce, update in map
Report generation: reduce generates new entities
![Page 22: App Engine Batch Data Processing](https://reader030.vdocuments.mx/reader030/viewer/2022021117/577d26e91a28ab4e1ea2871c/html5/thumbnails/22.jpg)
8/6/2019 App Engine Batch Data Processing
http://slidepdf.com/reader/full/app-engine-batch-data-processing 22/48
App Engine & Google's MapReduce
We want you to use MapReduce too!
There are some unique challenges
![Page 23: App Engine Batch Data Processing](https://reader030.vdocuments.mx/reader030/viewer/2022021117/577d26e91a28ab4e1ea2871c/html5/thumbnails/23.jpg)
8/6/2019 App Engine Batch Data Processing
http://slidepdf.com/reader/full/app-engine-batch-data-processing 23/48
App Engine & Google's MapReduce
Additional scaling dimension:
Lots and lots of applications
Many of them will run MapReduce at the same time
Isolation: application shouldn't influence performance of
the other
![Page 24: App Engine Batch Data Processing](https://reader030.vdocuments.mx/reader030/viewer/2022021117/577d26e91a28ab4e1ea2871c/html5/thumbnails/24.jpg)
8/6/2019 App Engine Batch Data Processing
http://slidepdf.com/reader/full/app-engine-batch-data-processing 24/48
App Engine & Google's MapReduce
Rate limiting: you don't want to burn all day's resourcesin 15min and kill your online traffic
Very slow execution: free apps want to go really slow,staying under their resource limint
Protection: from malicious App Engine users
![Page 25: App Engine Batch Data Processing](https://reader030.vdocuments.mx/reader030/viewer/2022021117/577d26e91a28ab4e1ea2871c/html5/thumbnails/25.jpg)
8/6/2019 App Engine Batch Data Processing
http://slidepdf.com/reader/full/app-engine-batch-data-processing 25/48
App Engine Approach
![Page 26: App Engine Batch Data Processing](https://reader030.vdocuments.mx/reader030/viewer/2022021117/577d26e91a28ab4e1ea2871c/html5/thumbnails/26.jpg)
8/6/2019 App Engine Batch Data Processing
http://slidepdf.com/reader/full/app-engine-batch-data-processing 26/48
App Engine Approach
We already have a system to solve (most) of these
problems: Task Queue
Decided to build MapReduce on top of Task Queue
Some additional services will have to be developed
![Page 27: App Engine Batch Data Processing](https://reader030.vdocuments.mx/reader030/viewer/2022021117/577d26e91a28ab4e1ea2871c/html5/thumbnails/27.jpg)
8/6/2019 App Engine Batch Data Processing
http://slidepdf.com/reader/full/app-engine-batch-data-processing 27/48
Mapper Library for App Engine
Early experimental release
Reliable, fast and efficient way to iterate over datastoreor blob files
Part 1 of the MapReduce story
You can start playing with it while we're working on thefull MapReduce
http://mapreduce.appspot.com/
![Page 28: App Engine Batch Data Processing](https://reader030.vdocuments.mx/reader030/viewer/2022021117/577d26e91a28ab4e1ea2871c/html5/thumbnails/28.jpg)
8/6/2019 App Engine Batch Data Processing
http://slidepdf.com/reader/full/app-engine-batch-data-processing 28/48
Mapper Library Features
Completely user-space. Just pull into your project.
OSS (Apache 2.0). Hack, modify, play around. Patcheswelcome!
Python today & Java soon
API is very familiar to Hadoop/Dumbo users
![Page 29: App Engine Batch Data Processing](https://reader030.vdocuments.mx/reader030/viewer/2022021117/577d26e91a28ab4e1ea2871c/html5/thumbnails/29.jpg)
8/6/2019 App Engine Batch Data Processing
http://slidepdf.com/reader/full/app-engine-batch-data-processing 29/48
Mapper Library Features
Automatic sharding for faster execution
Automatic rate limiting for slow execution
Status pages
Counters
Parameterized mappers
Batching datastore operations
Iterating over blob data
![Page 30: App Engine Batch Data Processing](https://reader030.vdocuments.mx/reader030/viewer/2022021117/577d26e91a28ab4e1ea2871c/html5/thumbnails/30.jpg)
8/6/2019 App Engine Batch Data Processing
http://slidepdf.com/reader/full/app-engine-batch-data-processing 30/48
Demo
![Page 31: App Engine Batch Data Processing](https://reader030.vdocuments.mx/reader030/viewer/2022021117/577d26e91a28ab4e1ea2871c/html5/thumbnails/31.jpg)
8/6/2019 App Engine Batch Data Processing
http://slidepdf.com/reader/full/app-engine-batch-data-processing 31/48
Adding Mapper Library To Your Project
Checkout library from svn into your project folder
Add 1 handler to app.yaml:
- url: mapreduce(/.*)?
script: <script_location>/main.py
admin: true
![Page 32: App Engine Batch Data Processing](https://reader030.vdocuments.mx/reader030/viewer/2022021117/577d26e91a28ab4e1ea2871c/html5/thumbnails/32.jpg)
8/6/2019 App Engine Batch Data Processing
http://slidepdf.com/reader/full/app-engine-batch-data-processing 32/48
Defining Mapper
Define map function:
def process(entity):doSomethingWithEntity(entity)
Register in mapreduce.yaml:
mapreduce:
- name: Test Mappermapper:
input_reader: mapreduce.DatastoreInputReader
handler: model.Entity
Open http://<your_app>/mapreduce and start your mapper
Python
![Page 33: App Engine Batch Data Processing](https://reader030.vdocuments.mx/reader030/viewer/2022021117/577d26e91a28ab4e1ea2871c/html5/thumbnails/33.jpg)
8/6/2019 App Engine Batch Data Processing
http://slidepdf.com/reader/full/app-engine-batch-data-processing 33/48
Defining Mapper
public class ExampleMapper extends
AppEngineMapper<Key, Entity> {@Override
public void map(
Key k, Entity e, Context ctx) {
processEntity(e);}
}
Java
![Page 34: App Engine Batch Data Processing](https://reader030.vdocuments.mx/reader030/viewer/2022021117/577d26e91a28ab4e1ea2871c/html5/thumbnails/34.jpg)
8/6/2019 App Engine Batch Data Processing
http://slidepdf.com/reader/full/app-engine-batch-data-processing 34/48
Example: Datastore Operations
def user_ages(user):
migrate_user(user)yield op.db.Put(user)
![Page 35: App Engine Batch Data Processing](https://reader030.vdocuments.mx/reader030/viewer/2022021117/577d26e91a28ab4e1ea2871c/html5/thumbnails/35.jpg)
8/6/2019 App Engine Batch Data Processing
http://slidepdf.com/reader/full/app-engine-batch-data-processing 35/48
Example: Counters
def user_ages(user):
yield op.counter.Increment("age-%d" % user.age)
![Page 36: App Engine Batch Data Processing](https://reader030.vdocuments.mx/reader030/viewer/2022021117/577d26e91a28ab4e1ea2871c/html5/thumbnails/36.jpg)
8/6/2019 App Engine Batch Data Processing
http://slidepdf.com/reader/full/app-engine-batch-data-processing 36/48
Example: More Complex Reports
def orders_total(customer):
orders = Order.by_customer(customer)total = sum_orders(orders)
report_line = ReportLine(
report_id, customer, total)
yield op.db.Put(report_line)
![Page 37: App Engine Batch Data Processing](https://reader030.vdocuments.mx/reader030/viewer/2022021117/577d26e91a28ab4e1ea2871c/html5/thumbnails/37.jpg)
8/6/2019 App Engine Batch Data Processing
http://slidepdf.com/reader/full/app-engine-batch-data-processing 37/48
Some Implementation Details
Uses task queue chaining
2 types of flows: controller flow & worker flow
Uses datastore for state storage and communication
![Page 38: App Engine Batch Data Processing](https://reader030.vdocuments.mx/reader030/viewer/2022021117/577d26e91a28ab4e1ea2871c/html5/thumbnails/38.jpg)
8/6/2019 App Engine Batch Data Processing
http://slidepdf.com/reader/full/app-engine-batch-data-processing 38/48
Important point
We handle most of the task chaining complexity
You should handle only one: idempotence
![Page 39: App Engine Batch Data Processing](https://reader030.vdocuments.mx/reader030/viewer/2022021117/577d26e91a28ab4e1ea2871c/html5/thumbnails/39.jpg)
8/6/2019 App Engine Batch Data Processing
http://slidepdf.com/reader/full/app-engine-batch-data-processing 39/48
Idempotence
f(f(x)) = f(x)
Means: your batch handler should be ready to processthe same entity twice
The most important property of batch operation
You should always think about idempotence
![Page 40: App Engine Batch Data Processing](https://reader030.vdocuments.mx/reader030/viewer/2022021117/577d26e91a28ab4e1ea2871c/html5/thumbnails/40.jpg)
8/6/2019 App Engine Batch Data Processing
http://slidepdf.com/reader/full/app-engine-batch-data-processing 40/48
Practical Idempotence: Data Migration
Not Idempotent:
def migrate(entity):
yield op.counters.Increment("updated")
entity.property =
compute_property(entity)
yield op.db.Put(entity)
![Page 41: App Engine Batch Data Processing](https://reader030.vdocuments.mx/reader030/viewer/2022021117/577d26e91a28ab4e1ea2871c/html5/thumbnails/41.jpg)
8/6/2019 App Engine Batch Data Processing
http://slidepdf.com/reader/full/app-engine-batch-data-processing 41/48
Practical Idempotence: Data Migration
Idempotent:
def migrate(entity):
if entity.property_updated:
return
yield op.counters.Increment("updated")
entity.property =compute_property(entity)
entity.property_updated = True
yield op.db.Put(entity)
![Page 42: App Engine Batch Data Processing](https://reader030.vdocuments.mx/reader030/viewer/2022021117/577d26e91a28ab4e1ea2871c/html5/thumbnails/42.jpg)
8/6/2019 App Engine Batch Data Processing
http://slidepdf.com/reader/full/app-engine-batch-data-processing 42/48
Practical Idempotence: Reports
Not Idempotent:
def report(customer):
# .....
report_line = ReportLine(
report_id, customer, total)
yield op.db.Put(report_line)
![Page 43: App Engine Batch Data Processing](https://reader030.vdocuments.mx/reader030/viewer/2022021117/577d26e91a28ab4e1ea2871c/html5/thumbnails/43.jpg)
8/6/2019 App Engine Batch Data Processing
http://slidepdf.com/reader/full/app-engine-batch-data-processing 43/48
Practical Idempotence: Reports
Idempotent:
def report(customer):
# .....
key_name = "%s-%s" %
(report_id, customer.id)
if ReportLine.get_by_key_name(key_name):return
report_line = ReportLine(
key_name=key_name,
report_id, customer, total)
yield op.db.Put(report_line)
![Page 44: App Engine Batch Data Processing](https://reader030.vdocuments.mx/reader030/viewer/2022021117/577d26e91a28ab4e1ea2871c/html5/thumbnails/44.jpg)
8/6/2019 App Engine Batch Data Processing
http://slidepdf.com/reader/full/app-engine-batch-data-processing 44/48
![Page 45: App Engine Batch Data Processing](https://reader030.vdocuments.mx/reader030/viewer/2022021117/577d26e91a28ab4e1ea2871c/html5/thumbnails/45.jpg)
8/6/2019 App Engine Batch Data Processing
http://slidepdf.com/reader/full/app-engine-batch-data-processing 45/48
Practical Idempotence: Counters
No easy way to achieve
Arguably not needed
![Page 46: App Engine Batch Data Processing](https://reader030.vdocuments.mx/reader030/viewer/2022021117/577d26e91a28ab4e1ea2871c/html5/thumbnails/46.jpg)
8/6/2019 App Engine Batch Data Processing
http://slidepdf.com/reader/full/app-engine-batch-data-processing 46/48
Idempotence And Changing Data
Most reports over live data are approximate
Approximate reports are OK for most cases
Margin of error should be quite small due to the waymapper library is implemented
![Page 47: App Engine Batch Data Processing](https://reader030.vdocuments.mx/reader030/viewer/2022021117/577d26e91a28ab4e1ea2871c/html5/thumbnails/47.jpg)
8/6/2019 App Engine Batch Data Processing
http://slidepdf.com/reader/full/app-engine-batch-data-processing 47/48
Summary
Mapper library available today
Reliable, fast and efficient way to iterate over datastoreor blob files
Java & Python
Fully OSS
http://mapreduce.appspot.com/
![Page 48: App Engine Batch Data Processing](https://reader030.vdocuments.mx/reader030/viewer/2022021117/577d26e91a28ab4e1ea2871c/html5/thumbnails/48.jpg)
8/6/2019 App Engine Batch Data Processing
http://slidepdf.com/reader/full/app-engine-batch-data-processing 48/48