TRANSCRIPT
Successes, Challenges and Pitfalls Migrating a SaaS Business to Hadoop
Shaun Klopfenstein, CTO
Eric Kienle, Chief Architect
The Vision
Requirements
Marketo Proprietary and Confidential | © Marketo, Inc. 05/03/2023
Business Requirements
• Near real-time activity processing
• 1 billion activities per customer per day
• Improve cost efficiency of operations while scaling up
• Global enterprise-grade security and governance
Architecture Requirements
• Maximize utilization of hardware
• Multitenancy support with fairness
• Encryption, authorization, and authentication
• Applications must scale horizontally
Technology Bake Off
Bake Off
• Technology Selection
  • Storm/Spark Streaming
  • HBase/Cassandra
• Built a POC with each permutation + Kafka
• Load tested with one day of web traffic
The Winner Is… Our First Challenge
• We hoped to find a clear winner… we didn’t exactly
• Truth is, all the POCs worked at the scale we tested
• It’s possible that if we had scaled up the test, we would have found more differences
How We Chose
• Community
• Features
• Team Skillset
• History
• The winners: HBase/Kafka/Spark Streaming
Architecture & Design
High Level Architecture
• Enhanced Lambda Architecture
• Inbound activities written to Ingestion Processor
  • HBase and then Kafka
• High volume (e.g. web) activities
  • First written to Kafka, then enriched
• Spark Streaming applications consume events from Kafka
  • Solr Indexing
  • Email Reports
  • Campaign Processing
• HBase is used for simple historical queries, and is the system of record
Build It
Implementation
Building Expertise
• We had a few people with Hadoop and Spark experience
• We decided to grow knowledge in house
• Focus on training - Hortonworks boot camp for operations
• In-house courses and tech talks for engineering/QE
Building Expertise - Successes
• Critical to kick-start the project
• Built excitement
• Created foundation for the design process
Building Expertise – Context Challenge
Challenge
• Training packed a lot of information into a short period
• Teams that didn’t leverage the training right away lost context
Recommendation
• Create environments for hands-on experience early
• Hands-on experience across all teams right after training
Building Expertise – Experience Challenge
Challenge
• Hadoop technology is like playing a piano… knowing how to read music doesn’t mean you can play
• Many ways to design, configure, manage - only a few right ways, and the reasons can be subtle
Recommendation
• Find your experts!
• Partner and hire
Building Our First Cluster
• Initial sizing and capacity planning of first Hadoop clusters
• Perform load tests to get initial capacity plan
• Decided that disk I/O and storage would be the leading indicator
• Went with industry best practice on hardware and network configuration
Building Our First Cluster - Success
• Leading indicator ended up being compute
• But cluster sizing ended up being close enough to start
• Clusters can always be expanded… so don’t get too hung up
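One way to make the "leading indicator" idea concrete is to size the cluster separately for each resource and see which one demands the most nodes. The sketch below is only an illustration: the function name, the resource names, the headroom factor, and every number in the usage lines are hypothetical and do not come from the talk.

```python
import math
from typing import Dict, Tuple


def nodes_needed(
    demand: Dict[str, float],
    per_node: Dict[str, float],
    headroom: float = 0.25,
) -> Tuple[int, str, Dict[str, int]]:
    """Return (node count, limiting resource, node count per resource).

    `headroom` is the fraction of each node's capacity kept in reserve.
    """
    usable = 1.0 - headroom
    per_resource = {
        name: math.ceil(demand[name] / (per_node[name] * usable))
        for name in demand
    }
    # The resource that requires the most nodes is the leading indicator.
    limiting = max(per_resource, key=per_resource.get)
    return per_resource[limiting], limiting, per_resource


# Hypothetical load-test results and node specification.
demand = {"cpu_cores": 900, "disk_tb": 200, "disk_iops": 50000}
per_node = {"cpu_cores": 32, "disk_tb": 48, "disk_iops": 4000}
count, limiting, breakdown = nodes_needed(demand, per_node)
```

With these made-up inputs compute, not disk, drives the node count. Re-running the same calculation on fresh load-test data is a cheap way to notice when the limiting resource changes.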
Building Our First Cluster – ZooKeeper & VM
Challenge
• We started with ZooKeeper virtualized
• Didn’t perform properly (we think because of disk I/O)
• Caused random outages
Recommendation
• We ended up migrating ZooKeeper to physical boxes
• Don’t use VMs for ZooKeeper!
Security
• All data at rest must be encrypted
• Applications sharing Hadoop must be isolated from each other
• Applications must have hard quotas for both compute and disk resources
Security - Success
• Enabled Kerberos security for Hadoop cluster
• Kerberos allowed us to leverage HDFS native encryption
• Used encrypted disks for Kafka servers
• Created separate secure YARN queues to isolate applications
• Each application uses a separate Kerberos principal
Security – Kerberos Challenge
Challenge
• Kerberos can’t be added to a Hadoop cluster without prolonged downtime and patches
• Needed weeks of developer time to accommodate security changes
• Added several months to the overall rollout schedule
Recommendation
• Allow extra time for Kerberos
• Educate your team beforehand, find an expert to guide you
• Be prepared for different levels of Kerberos support across the Hadoop ecosystem
Security – Kafka and Spark Challenge
Challenge
• Kafka doesn’t support data encryption (and won’t)
• The HDP version we had didn’t fully support Kerberized Kafka and Spark clients
Recommendation
• Move Kafka and Spark out of Ambari
• Only encrypt Kafka data if you absolutely must, as it adds complexity
Test It
Validation
• Changing the engines on a plane while in flight is hard
• Required all components to implement “Passive mode”
  • The new code ran in the background and continuously compared results with the legacy system
• Automated functional tests kicked off from Jenkins
• Performance testing at AWS
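The passive-mode idea can be sketched in a few lines: the legacy code path keeps serving the real result, while the new code path runs alongside it and any difference is recorded for later analysis. This is a minimal illustration, not Marketo's implementation; `run_passive`, `legacy_score`, and `new_score` are hypothetical names, and the scoring logic is invented purely to produce a mismatch.

```python
import logging
from typing import Any, Callable, List, Tuple

log = logging.getLogger("passive_mode")


def run_passive(
    legacy: Callable[[Any], Any],
    candidate: Callable[[Any], Any],
    event: Any,
    mismatches: List[Tuple[Any, Any, Any]],
) -> Any:
    """Return the legacy result; run the candidate in the shadow and compare."""
    expected = legacy(event)
    try:
        actual = candidate(event)
        if actual != expected:
            mismatches.append((event, expected, actual))
    except Exception as exc:
        # A failure in the new code path must never reach the caller.
        log.warning("candidate failed for %r: %s", event, exc)
        mismatches.append((event, expected, exc))
    return expected


# Hypothetical stand-ins for the legacy and Hadoop-based code paths.
def legacy_score(activity: dict) -> int:
    return activity["clicks"] * 2


def new_score(activity: dict) -> int:
    # Deliberate logic bug for large inputs, to show what a mismatch looks like.
    return activity["clicks"] * 2 if activity["clicks"] < 100 else 0
```

Callers always receive the legacy answer, so a logic bug in the new path shows up only as an entry in `mismatches`, never as customer-visible behavior.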
Validation - Success
• Passive mode is one of the best moves we made!
• Allowed for testing of components with real-world data and load
• Found countless performance and logic issues with minimal operational impact
Validation – Passive Mode “Minimal Impact”
Challenge
• By design, passive mode wrote to both Legacy and Hadoop systems
• We impacted performance during an outage of our Hadoop cluster
Recommendation
• Use asynchronous writes or tight timeouts in passive mode
• Monitoring for the Hadoop cluster should be in place before passive testing
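The recommendation above, asynchronous writes or tight timeouts, might look roughly like the following. This is a sketch under stated assumptions: `dual_write`, the two writer callables, the pool size, and the 50 ms default timeout are all hypothetical choices, not details from the talk.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable

# Small shared pool for shadow writes; the size is an arbitrary assumption.
_shadow_pool = ThreadPoolExecutor(max_workers=4)


def dual_write(
    record: dict,
    write_legacy: Callable[[dict], None],
    write_hadoop: Callable[[dict], None],
    timeout_s: float = 0.05,
) -> bool:
    """Write to legacy synchronously; give the shadow write a bounded wait.

    Returns True only if the shadow write finished within `timeout_s`.
    """
    future = _shadow_pool.submit(write_hadoop, record)
    write_legacy(record)  # the production path always completes first
    try:
        future.result(timeout=timeout_s)
        return True
    except FutureTimeout:
        # The shadow write keeps running in the pool; the caller is not blocked.
        return False
    except Exception:
        # A failed shadow write is reported, never raised to the caller.
        return False
```

The caller waits at most `timeout_s` for the new system, so a slow or unavailable cluster costs a bounded delay per request. During a long outage the pool's queue can still grow, which is one reason to pair this with the monitoring recommended above.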
Deploying It
Migration and Management
• We are here!
• Migrate over 6,000 subscriptions with no service interruption or data loss
• Track and monitor migration and provide management tools for the new platform
• Achieve the end goal of removing the safety net
Migration and Management - Successes
• Created a new management console called Sirius
• Close architectural coordination of all teams during migration
• If problems arose, we had a quick, automated fallback path to the legacy system
• Daily cross-functional standup meetings to track the rollout
Migration and Management Challenges
Challenge
• Oozie workflows can be challenging to build and debug
• Capacity planning and resource management in the shared Hadoop cluster is very complex
Recommendation
• Only use Oozie workflows for automating complex or long-running processes, or use a different orchestration platform
• Constantly reevaluate your capacity plan based on current deployment
Running It
Monitoring
• Needed to monitor hundreds of new Hadoop and other infrastructure servers
• Our custom Spark Streaming applications required all-new metrics and monitors
• Capacity planning requires trend analysis of both the infrastructure and our applications
• Don’t overwhelm our already busy Cloud Platform Team
Monitoring - Successes
• Built a custom monitoring infrastructure using OpenTSDB and Grafana
• Added business SLA metrics to our Sirius console to provide real-time alerts
• Added comprehensive Hadoop monitors into our pre-existing production monitoring system
Monitoring - Challenges
Challenges
• Adding hundreds of servers and a dozen new applications makes for a huge monitoring task
• Nagios is a very general-purpose system and isn’t designed to monitor Hadoop out of the box
Recommendations
• Make sure that you have monitors and trend analysis in place and tested before migration
• Be prepared to constantly refine and improve your monitors and alerts
Patching and Upgrading
• We have a zero-downtime requirement for applications
• Patching and upgrading of either the infrastructure or our own applications is problematic
• Keeping up with the community requires frequent patching
• Eventually hundreds of Spark Streaming jobs will need to be constantly processing data with no interruption
Patching and Upgrading - Successes
• Use Sirius console to manage Spark Streaming jobs
• Marketo’s Kafka consumer allows streaming jobs to pick up where they left off after a restart
• Integrated existing Jenkins infrastructure with the Sirius console to provide painless automated patching/upgrades
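The talk does not describe how Marketo's Kafka consumer resumes after a restart. A common way to get that behavior is to persist the next offset only after a record has been processed, and to read it back on startup. The sketch below shows that general technique with a plain list standing in for a Kafka partition and a JSON file standing in for the offset store; `CheckpointedConsumer` and everything in it is hypothetical.

```python
import json
import os
import tempfile
from typing import Callable, List


class CheckpointedConsumer:
    """Reads a list-backed 'partition' and persists the next offset to a file."""

    def __init__(self, log: List[str], checkpoint_path: str) -> None:
        self.log = log
        self.checkpoint_path = checkpoint_path

    def _load_offset(self) -> int:
        if not os.path.exists(self.checkpoint_path):
            return 0  # first run: start from the beginning
        with open(self.checkpoint_path) as fh:
            return json.load(fh)["next_offset"]

    def _save_offset(self, offset: int) -> None:
        # Write to a temp file, then rename, so a crash never leaves a
        # half-written checkpoint behind.
        tmp = self.checkpoint_path + ".tmp"
        with open(tmp, "w") as fh:
            json.dump({"next_offset": offset}, fh)
        os.replace(tmp, self.checkpoint_path)

    def poll(self, handler: Callable[[str], None], max_records: int) -> int:
        """Process up to max_records from the saved offset; return the count."""
        start = self._load_offset()
        batch = self.log[start:start + max_records]
        for i, record in enumerate(batch):
            handler(record)
            self._save_offset(start + i + 1)  # commit only after success
        return len(batch)


# Usage: two consumer instances stand in for a job before and after a restart.
checkpoint = os.path.join(tempfile.mkdtemp(), "offsets.json")
events = ["a", "b", "c", "d", "e"]
seen: List[str] = []
CheckpointedConsumer(events, checkpoint).poll(seen.append, max_records=2)
CheckpointedConsumer(events, checkpoint).poll(seen.append, max_records=10)
```

Committing after the handler runs gives at-least-once delivery: a crash between processing and committing replays one record, so handlers should tolerate duplicates.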
Infrastructure Patching and Upgrading - Challenges
Challenges
• Patches/upgrades managed with Ambari – not perfect!
• We almost never get through an upgrade without one or more Hadoop components having downtime (so far)
Recommendations
• Test all infrastructure patches and upgrades in a loaded non-production environment
• Check out the start and stop scripts from the component-specific open source communities, rather than relying on Ambari
We’re Hiring! Http://Marketo.Jobs
Q & A