Hadoop at Aadhaar
(Data Store, OLTP & OLAP)
github.com/regunathb RegunathB
Bangalore Hadoop Meetup
Enrolment Data
600 to 800 million UIDs in 4 years
1 million a day with transaction, durability guarantees
350+ trillion matches every day
~5 MB per resident
Maps to about 10-15 PB of raw data (2048-bit PKI encrypted)
About 30 TB I/O every day
Replication and backup across DCs of about 5+ TB of incremental data every day
Lifecycle updates and new enrolments will continue forever
Enrolment data moves from very hot to cold, needing multi-layered storage architecture
Additional process data
Several million events on average moving through async channels (some persistent, some transient)
Needing insert and update guarantees across data stores
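The storage figures above can be sanity-checked with simple arithmetic. A minimal sketch, assuming HDFS-style 3x replication (the replication factor is not stated on this slide):

```python
# Back-of-envelope check of the enrolment storage figures.
# Assumption (not stated in the slides): 3x HDFS-style replication.
residents = 800e6          # upper bound of the 600-800 million UID target
raw_per_resident = 5e6     # ~5 MB of packet data per resident
replication = 3            # assumed replication factor

raw_pb = residents * raw_per_resident / 1e15   # petabytes, single copy
replicated_pb = raw_pb * replication

print(f"single copy: {raw_pb:.1f} PB, replicated: {replicated_pb:.1f} PB")
```

A single copy works out to ~4 PB; with 3x replication that is ~12 PB, consistent with the 10-15 PB quoted above.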
Authentication Data
100+ million authentications per day (within a 10-hour window)
Possible high variance on peak and average
Sub second response
Guaranteed audits
Multi-DC architecture
All changes need to be propagated from enrolment data stores to all authentication sites
Authentication request is about 4 KB
100 million authentications a day
1 billion audit records in 10 days (30+ billion a year)
4 TB encrypted audit logs in 10 days
Audit write must be guaranteed
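The audit numbers above are mutually consistent, as a quick arithmetic check shows:

```python
# Cross-checking the audit figures quoted above.
auths_per_day = 100e6
days = 10
records = auths_per_day * days        # 1e9 -> "1 billion audit records in 10 days"
audit_bytes = 4e12                    # 4 TB of encrypted logs in the same window
per_record = audit_bytes / records    # ~4 KB, matching the ~4 KB request size

print(f"{records:.0e} records, ~{per_record / 1e3:.0f} KB per audit record")
```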
Aadhaar Data Stores
Mongo cluster (all enrolment records/documents: demographics + photo), 5 shards
Low latency indexed read (documents per sec); high latency random search (seconds per read)
MySQL: UID master (sharded) and Enrolment DB (all UID-generated records: demographics only, track & trace, enrolment status)
Low latency indexed read (milliseconds per read); high latency random search (seconds per read)
Solr cluster (all enrolment records/documents: selected demographics only), sharded (Shards 0, 2, 6, 9, a, d, f)
Low latency indexed read (documents per sec); low latency random search (documents per sec)
HDFS (all raw packets), 20 Data Nodes
High read throughput (MB per sec); high latency read (seconds per read)
HBase (all enrolment biometric templates), 20 Region Servers
High read throughput (MB per sec); low-to-medium latency read (milliseconds per read)
NFS (all archived raw packets), 4 LUNs
Moderate read throughput; high latency read (seconds per read)
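The hex shard labels on the Solr cluster (0, 2, 6, 9, a, d, f) suggest records are routed by position in a hex keyspace. A minimal sketch of one plausible routing scheme; the actual algorithm is not described in the slides, and `route` is a hypothetical helper:

```python
import hashlib

# Illustrative only: route a record to a shard by the first hex digit of
# a hash of its UID. Shard labels are taken from the slide; the real
# routing scheme used by Aadhaar is not specified here.
SHARDS = ["0", "2", "6", "9", "a", "d", "f"]  # ascending hex labels

def route(uid: str) -> str:
    """Map a UID to the nearest shard label at or below its hash digit."""
    digit = hashlib.md5(uid.encode()).hexdigest()[0]
    eligible = [s for s in SHARDS if s <= digit]
    return eligible[-1] if eligible else SHARDS[-1]
```

Because "0" is always eligible, every hash digit maps to exactly one shard, and the mapping is deterministic for a given UID.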
Systems Architecture
Work distribution using SEDA & Messaging
Ability to scale within JVM and across
Recovery through check-pointing
Synchronous HTTP-based Auth gateway
Protocol Buffers & XML payloads
Sharded clusters
Near Real-time data delivery to warehouse
Nightly data-sets used to build dashboards, data marts and reports
Real-time monitoring using Events
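The SEDA-style work distribution and check-pointed recovery listed above can be sketched minimally. Everything here is illustrative, assuming a bounded queue per stage and an in-memory checkpoint of completed item ids (the real system's framework is not shown):

```python
import queue
import threading

class Stage:
    """Minimal SEDA-style stage: bounded inbox, worker pool, checkpoint."""

    def __init__(self, handler, workers=2):
        self.inbox = queue.Queue(maxsize=100)   # bounded queue gives back-pressure
        self.checkpoint = set()                 # ids of completed work items
        self.handler = handler
        for _ in range(workers):
            threading.Thread(target=self._run, daemon=True).start()

    def submit(self, item_id, payload):
        if item_id in self.checkpoint:          # skip already-done work on replay
            return
        self.inbox.put((item_id, payload))

    def _run(self):
        while True:
            item_id, payload = self.inbox.get()
            self.handler(payload)
            self.checkpoint.add(item_id)        # record progress after processing
            self.inbox.task_done()
```

Replaying a crashed batch through `submit` is then safe: items whose ids are already in the checkpoint are dropped instead of being processed twice.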
Enrolment Biometric Middleware
Distributes and reconciles biometric data extraction and de-dup requests across multiple vendors (ABISs)
Biometric data de-referencing/read service (HTTP) over sharded HDFS and NFS
Serves the bulk of HDFS read requests (25 TB per day)
Locates data from multiple HDFS clusters
Sharded by read/write patterns: New, Archive, Purge
Calculates and maintains volume allocation and SLA breach thresholds of ABISs
Thresholds stored in ZooKeeper and pushed to middleware nodes
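Volume allocation across ABIS vendors, as described above, amounts to weighted routing. A minimal sketch with hypothetical vendor names and shares (in production the shares lived in ZooKeeper and were pushed to middleware nodes):

```python
import random

# Hypothetical volume shares; the real values were ZooKeeper-managed.
ALLOCATIONS = {"abis_a": 0.5, "abis_b": 0.3, "abis_c": 0.2}

def pick_abis(allocations, rng=random.random):
    """Pick a vendor with probability proportional to its volume share."""
    r = rng() * sum(allocations.values())
    for vendor, share in allocations.items():
        r -= share
        if r <= 0:
            return vendor
    return vendor  # fallback for floating-point edge cases
```

Over many requests each vendor receives roughly its allocated fraction of the de-dup traffic; adjusting a share in the config shifts load without code changes.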
Event Streams & Sinks
Event framework supporting different interaction/data durability patterns
P2P, Pub-Sub
Intra-JVM and Queue destinations - Durable / Non-Durable
Fire & Forget, Ack. after processing
Event Sinks
Ephemeral data consumed by counters, metrics (dashboard)
Rolling file appenders that push data to HDFS
Primary mechanism for delivering raw fact data from transactional systems to the warehouse staging area
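The sink patterns above can be sketched as a small pub-sub bus with two kinds of subscribers: an ephemeral in-memory counter (dashboard) and a batching appender standing in for the rolling files shipped to HDFS. All names are illustrative, not the deck's actual framework:

```python
from collections import defaultdict

class EventBus:
    """Toy pub-sub fabric: named topics, multiple sinks per topic."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, sink):
        self.subscribers[topic].append(sink)

    def publish(self, topic, event):
        for sink in self.subscribers[topic]:
            sink(event)

# Ephemeral sink: in-memory counters feeding a dashboard.
counters = defaultdict(int)
def count_sink(event):
    counters[event["type"]] += 1

# Durable-style sink: buffer events, flush as a batch (stand-in for a
# rolling file appender whose closed files are pushed to HDFS).
batch, flushed = [], []
def file_sink(event, batch_size=2):
    batch.append(event)
    if len(batch) >= batch_size:
        flushed.append(list(batch))
        batch.clear()

bus = EventBus()
bus.subscribe("enrolment", count_sink)
bus.subscribe("enrolment", file_sink)
```

One published event fans out to every sink, so a single stream can feed real-time counters and the warehouse staging path at once.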
Data Analysis
Statistical analysis from millions of events
View into quality of enrolments, e.g. by Enrolment Agency, Operator
Feature introduction, e.g. based on avg. time taken for biometric capture, demographic data input
Enrolment volumes, e.g. by Registrar, Agency, Operator, etc.
Useful in fraud detection
Goal to share anonymized data sets for use by industry and academia, for information transparency
Various reports: self-serve, canned, operational and/or aggregates
UID BI Platform: Data Analysis architecture
Data Access Framework over UIDAI systems: Events (RabbitMQ), Server DB (MySQL), Hadoop HDFS
Event CSVs, fact data and dimension data flow into the Data Warehouse (HDFS/Hive) via Pig and Pentaho Kettle
Hive generates datasets; on-demand datasets, raw data and dimension data (MySQL) feed datamarts (MySQL) via Pentaho Kettle
Outputs: canned reports, dashboards and self-service analytics, built on Pentaho BI and FusionCharts, delivered via e-mail/portal/others
Hadoop stack summary
CDH2 (Enrolment, Analysis), CDH3 (Authentication)
Data Store
HDFS : Enrolment, Events, Audit Logs, Warehouse
HBase : Biometric templates used in Authentication
Coordination/Config
ZooKeeper: Biometric middleware thresholds
Analysis
Pig: ETL for loading analysis data from staging to the atomic warehouse
Hive: Dataset generation framework
Learnings
Watch out for too many small files; HDFS is better suited to fewer, larger files
Data loss from HDFS in spite of having 3 replica copies (possibly fixed in releases after CDH2)
Give careful consideration to HBase table design, primarily the row key, to avoid region-server hot-spotting
Hive data (HDFS files) does not handle duplicate records; this can be an issue if data ingestion is replayed for data sets
Hive over HBase is a viable alternative
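The row-key advice above usually comes down to avoiding monotonically increasing keys (timestamps, sequence ids), which pile all writes onto one region server. A common remedy is salting: prefix each key with a short hash-derived bucket so consecutive keys spread across N key ranges. A minimal sketch, with an illustrative salt count:

```python
import hashlib

NUM_SALTS = 16  # illustrative; chosen to match the number of regions

def salted_key(natural_key: str) -> str:
    """Prefix a natural key with a deterministic 2-digit salt bucket."""
    salt = int(hashlib.md5(natural_key.encode()).hexdigest(), 16) % NUM_SALTS
    return f"{salt:02d}|{natural_key}"
```

Scans then have to fan out across all NUM_SALTS prefixes, so this trades some read complexity for even write distribution; the salt must be derivable from the key so point reads still work.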
References
Aadhaar Portal : https://portal.uidai.gov.in/uidwebportal/dashboard.do
Data Portal : https://data.uidai.gov.in/uiddatacatalog/dataCatalogHome.do
Analytics whitepaper : http://uidai.gov.in/images/FrontPageUpdates/uid_doc_30012012.pdf
(c) UIDAI, 2011