
DATA ENGINEERING GUIDE How to reduce cost while improving data pipeline performance and reliability

eBook


Starbucks’ challenge

With petabytes of data to be ingested for downstream machine learning and analytics, Starbucks’ architecture struggled to handle the scale. They also dealt with a variety of structured and unstructured data that was fast-changing and fragmented across various systems, making it difficult to gain a complete view of their customers and business.

With a huge variety of data sources and types, data reliability and governance was of utmost importance but difficult to achieve. They needed a way to build out their historical data and live aggregations together to ensure they were delivering real-time, accurate insights to their stores and partners.

Building an architecture to support petabyte-scale data and machine learning with Databricks

With a unified data analytics platform at the core of their architecture, Starbucks’ entire data strategy has been transformed. Data can now flow seamlessly through their pipelines and models, allowing for new ideas and solutions to flourish. The processing power of Databricks and Delta Lake built on top of their cloud data lake has increased performance 50-100x, giving data science and analytics teams the data they need faster.

Databricks provides a trusted, persistent storage layer that securely delivers quality data that enables downstream data analytics. This allows them to explore many analytics use cases across the board such as tour operations, quality of service analysis, demand forecasting and inventory management, personalized shopping experiences, and much more. “With Databricks, we can now take a strategic view into data analytics,” expressed Vishwanath Subramanian, the director of data engineering and analytics at Starbucks. “So much so, that our teams can now focus on business problems up the value chain rather than simply moving data from point A to point B.”

Read the full story of Starbucks’ data transformation on page 11.

1,000+ production data pipelines
50-100x faster data processing
15 mins to deploy ML models


Contents

Introduction 4

A unified approach optimized for performance and cost-efficiency 6

Simplifying production pipelines 6

Ensuring data reliability and consistency 7

Scalable data for actionable insights 8

Open sourced for ultimate flexibility and multi-cloud 9

Getting a custom value assessment to understand your gains for moving production pipelines to Databricks 10

How leading enterprises have built production pipelines with Databricks 11

Case Studies

Starbucks 11

Columbia 14

Comcast 16

Mars Petcare 18

Healthdirect Australia 20

Conclusion 22


In every industry today, data is the lifeblood that can make or break a business. Around the clock, your business is generating and storing millions of data points from your customers, your supply chain, and even from your internal teams. But it’s not just about collecting data — it’s about using it to drive operational efficiencies and increase revenues. Crucial to this are automated data pipelines that ingest, process and deliver data from source to query, whether for business intelligence, reporting or advanced data science use cases.

But performance and reliability can wane as new data sources are added, small files create bottlenecks, and rigid schemas can’t adjust to even minor changes. As a result, job duration — and compute costs — increase as your pipeline performance degrades. Fixing mistakes in the data becomes harder, and your data lake becomes populated by duplications, partial failures, or late and out of order data. Over time your data lake can become a data swamp.

The limits to scaling and maintaining data pipeline performance are apparent even when using best-in-breed data engines. The data engine is only one component of the total workflow and doesn’t address the compute costs or lack of data reliability from overly complex architectures requiring additional validation, reprocessing and manual update and merge steps. This increased complexity, particularly when dealing with batch and streaming data sources, adds compute costs while introducing additional latency and points of failure.

Introduction

We’ve seen major improvements in the speed we can make data available for analysis. We have a number of jobs that used to take 6 hours and now take only 6 seconds.

ALESSIO BASSO
Chief Architect, HSBC

Drop us a line at [email protected] to speak to our specialist data engineers and get a custom assessment based on data jobs in your environment. Find out how Databricks can help cut costs while improving your data pipeline performance.


Complex, redundant systems and operational challenges to process batch and streaming data

Unreliable data processing jobs that require manual cleanup and reprocessing after failed jobs

Long data processing times and increased infrastructure costs from inefficient data pipelines

Static infrastructure resources incurring expensive overhead costs and limited workload scalability

Unscalable processes, with tight dependencies, complex workflows, and system downtime


Operational costs can be impacted by a number of factors from static infrastructure to poor performance.

Across industries, this is a major challenge as enterprises constantly deal with a deluge of data sources streaming in real time, all while trying to eliminate costs without sacrificing long-term performance and capabilities. What most organizations miss when analyzing database costs is the full cost to run a job, not just the data engine list price. For example, when UK-based meal kit retailer Gousto moved their ETL pipelines to Databricks, they cut data latency by over 99% (from 2 hours to 15 seconds) without increasing costs, in part by moving to smaller instances. Furthermore, there is a false notion that performance and cost are linked; that is, that in order to increase performance you need to spend more. The key is to understand the underlying total workflow costs: the data engine costs, compute costs, and the effort required by your data teams to access reliable data.

Figure: The traditional way of building data pipelines is complex and unreliable. Events flow into separate streaming analytics and data lake paths, with additional validation, partitioning, reprocessing, and update-and-merge steps required before reporting.


A unified approach optimized for performance and cost-efficiency

Databricks can help unify data and analytics, ensuring your data pipelines are optimized for the speed, performance and reliability that you require — shortening time to analysis.

Normally, higher performance comes with higher costs. By using Delta Lake as a single platform to build automated data pipelines, data engineering teams can cut costs by eliminating steps that require additional compute power.

By handling batch and streaming data sources, along with schema enforcement and evolution, within a single workflow, Databricks makes it easier to add new data sources, simplifies cluster provisioning while decreasing operational costs, lowers failure rates and speeds up troubleshooting, and makes the end-to-end process easy with automation and self-service tools.

As a result, Databricks can lower the total cost of running your data jobs while meeting or exceeding even the most stringent service level agreements across a variety of data use cases.
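To make the schema behavior described above concrete, here is a minimal PySpark sketch of Delta Lake schema enforcement and opt-in schema evolution. It assumes an environment where Delta Lake is available (such as a Databricks cluster); the path and column names are illustrative, not taken from this guide.

```python
# Hedged sketch: Delta Lake schema enforcement vs. explicit schema evolution.
# Assumes an environment where Delta Lake is available (e.g., a Databricks cluster);
# the table path and column names are illustrative only.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
path = "/tmp/demo/orders_delta"

orders = spark.createDataFrame([(1, "latte", 4.50)], ["order_id", "item", "amount"])
orders.write.format("delta").mode("overwrite").save(path)

# Schema enforcement: appending a batch with an unexpected extra column fails
# loudly instead of silently polluting the table.
new_batch = orders.withColumn("store_id", F.lit(101))
# new_batch.write.format("delta").mode("append").save(path)  # raises a schema mismatch error

# Schema evolution: opt in deliberately when the new column is intentional.
(new_batch.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")  # evolve the table schema to include store_id
    .save(path))
```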

Simplifying data pipelines

In addition to fewer hops, features like ACID transactions and data quality guarantees mean there’s no need to create parallel processes for validation, reprocessing, updates and merges. What this means for you is less compute required, fewer points of failure, and easier support and troubleshooting.
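As an illustration of how a single ACID operation can stand in for separate validation, reprocessing, update and merge steps, the following hedged sketch upserts corrected and late-arriving records with a Delta Lake MERGE. The table path and columns are hypothetical.

```python
# Hedged sketch: one ACID MERGE upserting late or corrected records into a Delta
# table, in place of separate validation, reprocessing and update jobs.
# The table path and columns are hypothetical and assume the table already exists.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

updates = spark.createDataFrame(
    [(1, "latte", 4.75), (3, "espresso", 2.90)],  # one correction, one late arrival
    ["order_id", "item", "amount"],
)

target = DeltaTable.forPath(spark, "/tmp/demo/orders_delta")
(target.alias("t")
    .merge(updates.alias("u"), "t.order_id = u.order_id")
    .whenMatchedUpdateAll()      # correct existing rows in place
    .whenNotMatchedInsertAll()   # insert late-arriving rows
    .execute())
```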


Databricks ensures a clean flow of data from ingest to use by the business and data teams.

Unified and simplified architecture across batch and streaming to serve all use cases

Robust data pipelines that ensure data reliability with ACID transactions and data quality guarantees

Reduced compute times and costs with a scalable cloud runtime powered by highly optimized Apache Spark™ clusters

Elastic cloud resources intelligently auto-scale up with workloads and scale down for cost savings

Modern data engineering best practices for improved productivity, system stability and data reliability

Databricks unifies data and analytics to improve production pipeline performance and efficiency: ingesting data into the data lake, processing data to organize it within the data lake, enriching and optimizing data for downstream use, and feeding BI reports and data science teams.

Figure: Delta Lake medallion architecture, with Bronze (raw ingestion), Silver (filtered, cleaned, augmented) and Gold (business-level aggregates) tables feeding streaming analytics and reporting.
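The Bronze/Silver/Gold flow in the figure above might look like the following sketch in PySpark. The source paths, columns and business rules are illustrative only, not taken from any customer pipeline described in this guide.

```python
# Hedged sketch of the Bronze -> Silver -> Gold flow pictured above.
# Paths, columns and business rules are illustrative, not taken from this guide.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Bronze: raw ingestion, stored as-is.
raw = spark.read.json("/mnt/landing/orders/")  # illustrative source path
raw.write.format("delta").mode("append").save("/mnt/delta/bronze/orders")

# Silver: filtered, cleaned and augmented.
bronze = spark.read.format("delta").load("/mnt/delta/bronze/orders")
silver = (bronze
    .dropDuplicates(["order_id"])
    .filter(F.col("amount") > 0)
    .withColumn("order_date", F.to_date("order_ts")))
silver.write.format("delta").mode("overwrite").save("/mnt/delta/silver/orders")

# Gold: business-level aggregates served to BI and data science teams.
gold = silver.groupBy("store_id", "order_date").agg(
    F.sum("amount").alias("daily_revenue"),
    F.countDistinct("order_id").alias("order_count"))
gold.write.format("delta").mode("overwrite").save("/mnt/delta/gold/daily_store_revenue")
```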


Ensuring data reliability and consistency

Running on top of your existing data lake, Delta Lake is an open source transaction layer bringing reliability, performance and lifecycle management to data ingestion, processing and management.

With Databricks and Delta Lake, you can reliably ingest, process and update structured and unstructured data in real time with transactional guarantees and high-performance queries — for both batch and streaming. No need for separate workflows. Instead, you get a single, reliable source of data for BI and data science teams.
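For example, a single Delta table can be fed by a streaming ingest job while batch and BI workloads read consistent snapshots of the same table, with no separate workflow. The sketch below assumes Delta Lake and Structured Streaming are available; the source, checkpoint and table paths are placeholders.

```python
# Hedged sketch: one Delta table serving a streaming ingest and batch/BI reads.
# The source, checkpoint and table paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Continuously ingest arriving JSON files; the checkpoint plus Delta's
# transactional commits give exactly-once writes into the table.
events = (spark.readStream
    .format("json")
    .schema("event_id STRING, event_ts TIMESTAMP, payload STRING")
    .load("/mnt/landing/events/"))

query = (events.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/events")
    .outputMode("append")
    .start("/mnt/delta/bronze/events"))

# Meanwhile, batch and BI workloads read a consistent snapshot of the same table.
snapshot = spark.read.format("delta").load("/mnt/delta/bronze/events")
snapshot.groupBy("event_id").count().show()
```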

Going from 50% to zero

In the first few years of Databricks’ cloud service (2014–2016), around half the support escalations we received were a result of data corruption, consistency or performance issues due to cloud storage strategies (e.g., undoing the effect of a crashed update job, or improving the performance of a query that reads tens of thousands of objects). Delta Lake reduced the fraction of support escalations caused by these issues from half to nearly none. It also improved workload performance for most customers, with speedups as high as 100x in extreme cases where its data layout optimizations and fast access to statistics are used to query very high-dimensional data sets (e.g., network security and bioinformatics use cases).

Source: Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores


Data is one of the most critical assets we have to improve demand forecasting. At RB, we process over 2TB of data every day across 250+ data pipelines that are running 24/7.

HARISH KUMAR
Director of Data Engineering and Architecture, RB

Redash makes it easy to explore, query, visualize and share data.

Scalable data for actionable insights

With data centralized for easy access, data analysts can connect directly to the most complete and recent data at massive scale in the data lake, and use their preferred BI visualization and reporting tools for more timely business insights. For a completely seamless experience, you can leverage Redash to easily visualize and share your data via intuitive dashboards and queries.


Open sourced for ultimate flexibility and multi-cloud

While most data warehouses lock you in with proprietary formats, Databricks is built from the ground up for performance and speed using open standards for big data processing and management, making it ideally suited for multi-cloud enterprises.

The key tenets of Delta Lake’s design are openness and extensibility. Delta Lake stores all the data and metadata in cloud object stores, with an open protocol design that leverages existing open formats such as JSON and Apache Parquet.

This openness not only removes the risk of vendor lock-in, but also ensures your data remains your data, with no proprietary vendor formats. Openness also fosters collaboration across the community, which is critical to innovation. By tapping into the greater ecosystem of developers and engineers, new opportunities arise that enable a myriad of different use cases and capabilities powered by data science, machine learning and SQL, with the end goal of creating more value for the customer.
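Because the format is open, you can see this for yourself: a Delta table in your object store is just Parquet data files plus a _delta_log directory of JSON commit files. The sketch below is a minimal illustration under that assumption; the table path is hypothetical.

```python
# Hedged sketch: a Delta table is ordinary Parquet data files plus a _delta_log
# directory of JSON commit files in your own object store. The path is hypothetical.
import json
import os

table_path = "/mnt/delta/bronze/events"  # illustrative table location
log_path = os.path.join(table_path, "_delta_log")

# Data files: open-format Parquet, readable by any Parquet-aware engine.
parquet_files = [f for f in os.listdir(table_path) if f.endswith(".parquet")]
print(f"{len(parquet_files)} Parquet data files")

# Transaction log: plain JSON describing each commit (adds, removes, metadata).
with open(os.path.join(log_path, "00000000000000000000.json")) as fh:
    for line in fh:
        action = json.loads(line)
        print(list(action.keys()))  # e.g. ['commitInfo'], ['metaData'], ['add']
```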

To ensure the project’s long-term growth and community development, we’ve worked with the Linux Foundation to further this spirit of openness.

While our roots are in open source, our managed platform was built from the ground up for speed, performance and reliability. In benchmark comparisons, Databricks Runtime is 5x faster than vanilla Spark. Hotels.com, for example, accelerated ETL at scale with Databricks, increasing the volume of data processed 20x without impacting performance. And because it’s all integrated into the unified platform, you don’t have to spend expensive resources on DevOps work.


Getting a custom value assessment to understand your gains for moving data transformation pipelines to Databricks

Whether you’re going from on-prem Hadoop clusters to the cloud, or you’re already in the cloud and want to achieve more operational savings, we have several programs to demonstrate the tangible value Databricks brings, from infrastructure cost savings to increased productivity unlocking new use cases.

Along with fully funded PoCs and pricing incentives to migrate workloads, we start with a custom value assessment to help you understand the full costs to run your queries, along with projected savings and gains based on specific pipeline usage.

Reach out to our team today at [email protected] to see how we can reduce your costs, meet or exceed current SLAs, and have your data engineering teams work on higher-level problems instead of maintaining and fixing what are supposed to be automated pipelines.

Example: Business Value Analysis, input and output. An automated assessment combined with Databricks expertise produces recommendations and estimated savings.

Example: Optimization Assessment output, an action table of issues, recommendations and estimated savings:

Issue and recommendation | Estimated savings
Cluster utilization (unutilized core hours) | $$$
Small file write opportunity | $
Task size opportunity (3M tasks executed above 250 MB threshold) | $$
Unnecessary expensive operations (count_distinct, repartition) | $
Delta Z-order opportunities (detected common query patterns) | $
Total savings | $$$$


How leading enterprises have built data pipelines with Databricks

This eBook highlights how leading enterprises, including Starbucks, Columbia, Comcast, Healthdirect Australia and Mars Petcare, have leveraged the Databricks Unified Data Analytics Platform to build automated, performant and scalable data pipelines — allowing them to make lightning-fast decisions without worrying about their compute costs or pipeline performance degradation.

Brewing data and AI at scale
Starbucks serves up omnichannel experiences across 30,000+ stores with Databricks

Building an architecture to support petabyte-scale data and machine learning

Data is crucial at Starbucks. Across 30,000+ stores, they generate billions of transactional data points that can be used to fuel data-driven innovations and operational improvements. Their data strategy and guiding principles are built on three pillars: 1. a single version of the truth, 2. data and analytics enablement, and 3. trusted data. However, extracting value from their data was the first and foremost challenge.

With petabytes of data to be ingested for downstream machine learning and analytics, their architecture struggled to handle the scale. They also dealt with a variety of structured and unstructured data that was fast-changing and fragmented across various systems, making it difficult to gain a complete view of the customer and business.


With Databricks, we can now take a strategic view into data analytics. Our teams can spend time focusing on business problems up the value chain, rather than simply moving data from point A to point B.

VISHWANATH SUBRAMANIAN

Director of Data Engineering and Analytics at Starbucks


With a huge variety of data sources and types, data reliability and governance was of utmost importance, but difficult to achieve. They needed a way to build out their historical data and live aggregations together to ensure they were delivering real-time, accurate insights to their stores and partners.

They also struggled to provision clusters to support their data needs. Data engineering was often overwhelmed with spinning up and maintaining clusters. “Our engineering services were not optimal,” explained Vishwanath Subramanian, director of data engineering and analytics at Starbucks. “We struggled to scale compute in a timely manner, often taking over 30 minutes to scale clusters.” Once the data did make it downstream to the data science and analytics teams, the lack of a unified user experience acted as an impediment to innovation, blocking exploration, experimentation and reproducibility. To truly create meaningful connections with their customers, they needed to remove these barriers to innovation.

A single source of truth to brew up new ML use cases

To address these challenges, Starbucks developed BrewKit, a zero-friction analytics framework built on top of Azure Databricks. Their goals were to ensure the democratization of data by creating a single source of truth, while fostering cross-team collaboration to unlock the possibilities of machine learning at scale.

“We wanted to make sure the smallest of teams at Starbucks had the ability to do data science and data engineering at scale,” said Subramanian. “The only way to enable that was to empower them with a massively scalable unified analytics platform.”

With Azure Databricks and Delta Lake, their data engineers are able to build pipelines that support batch and real-time workloads on the same platform. This has enabled their data science teams to blend various data sets to train new models that improve the customer experience. Most importantly, data processing performance has improved dramatically, allowing them to deploy environments and deliver insights in minutes.

From a data science perspective, the interactive notebooks have enabled users to onboard quickly and collaborate more efficiently and more easily manage various use cases. Once models are developed, MLflow allows them to easily experiment and test models in a rapid fashion. “From a data team collaboration productivity standpoint, this has been huge. The tooling has been collaborative. We also now foster a culture of experimentation and self-service, and maintain shared responsibility across environments,” said Subramanian.


Azure Databricks: a core ingredient in Starbucks’ data-driven journey

With a Unified Data Analytics Platform at the core of its architecture, Starbucks’ entire data strategy has been transformed. Data can now flow seamlessly through their pipelines and models, allowing new ideas and solutions to flourish. The processing power of Databricks and Delta Lake paired with Azure services has increased performance 50-100x, giving data science and analytics teams the data they need faster.

Delta Lake provides a trusted, persistent storage layer that securely delivers quality data that enables downstream data analytics. This allows them to explore many analytics use cases across the board such as tour operations, quality of service analysis, demand forecasting and inventory management, personalized shopping experiences and much more.

“With Databricks, we can now take a strategic view into data analytics,” expressed Subramanian. “So much so, that our teams can now focus on business problems up the value chain rather than simply moving data from point A to point B.”

As Starbucks continues to focus on providing world-class customer experiences, Subramanian is excited about the impact Databricks will continue to have in achieving their mission. “At Starbucks, we are elevating customer connections through the convergence of data and AI,” concluded Subramanian. “As we extend our channels for delivery and adjust to the new norms in today’s new era, data will play an extremely crucial role in this effort.”


Moving to the cloud ushers in a new era of data-driven retailing
Columbia Sportswear migrates from Hadoop to Databricks to fuel data-driven decision-making across the business

Legacy analytics systems that were costly and slow

As the retail industry continues to digitize across all channels, Columbia has been at the forefront of leveraging data across their business lines to impact sales, purchasing, supply chain, and product optimization. For example, they wanted to understand how to leverage insights related to geography, brand affinity, gross margins, and costs to improve operations and make smarter decisions. Or how to leverage customer engagement data from product reviews and comments to inform marketing campaigns and improve customer support.

With troves of data at their disposal, the processing efficiency of both batch and real-time data for downstream analytics and reporting was not meeting internal service-level agreements. Hampered by specialty ETL tooling and legacy data warehouses that were siloed and complex to scale, the enterprise information management (EIM) team struggled to efficiently build data pipelines that unlocked access to curated data for various data teams and business stakeholders. Furthermore, their infrastructure was rigid and costly to manage and scale, which was problematic as the number of people needing access to data was on the rise.

“Our legacy systems could take weeks to ETL data for analytics and reporting,” explained Lara Minor, a senior enterprise data manager at Columbia Sportswear. “As a result, we were unable to support a variety of use cases, impacting analyst and line-of-business satisfaction.”

With various teams from the executives to data analysts and scientists all vying for company-wide data, they realized that they needed to re-platform their analytics system to the cloud to enable more agility and cost efficiency at scale. They also needed to streamline data preparation and ETL, while making it easier and safer for their stakeholders to access the data they need to make smarter decisions.

Getting data to those who need it as quickly as possible

The EIM team at Columbia decided to move to Microsoft Azure, which opened the door to use Azure Databricks and Delta Lake to upgrade their data processing and analytics capabilities. “We were looking for something that was scalable, elastic, and at a lower cost,” said Minor. “Azure and Databricks met those requirements.”

70% reduction in data pipeline creation time
48x faster ETL workloads

With Databricks, they are now able to build high-performance ETL pipelines that support batch and real-time workloads. The pipelines feed into Delta Lake, which provides secure access to curated data. “Delta Lake provides ACID capabilities that simplify data pipeline operations to increase pipeline reliability and data consistency,” explained Minor. “At the same time, features like caching and auto-indexing enable efficient and performant access to the data.”

Once the data is ingested, it can be directed to various endpoints across the company depending on the end-user and use case. For example, business analysts could connect directly with Power BI for sales reporting that requires near real-time information on-demand. They could make data accessible via Databricks interactive notebooks for data scientists to explore and train models. Or they could send data to their data warehousing tool for use cases with low latency and high concurrency requirements. Whichever data team needed access to the data, they were confident that the data was reliable and consistent.

Faster data pipelines, shorter time-to-insight

Shortening data processing times is key to rapidly delivering data insights to the business. Databricks has helped Columbia’s EIM team accelerate ETL and data preparation, achieving a 70% reduction in ETL pipeline creation time while reducing the amount of time to process ETL workloads from 4 hours to only 5 minutes, a 48x improvement.

With a scalable and performant platform that better supports batch and real-time workloads at their disposal, various data users are now empowered to make smarter decisions that impact business operations without having to be over-reliant on the EIM team.

“One of the benefits of this platform is how fast people can come up to speed on it. All that data is coming in, and more business units are using it across the enterprise in a self-service manner that was not possible before,” stated Minor. “I can’t say enough about the positive impact that Databricks has had on Columbia.”

With curated data at their fingertips, use cases — from forecasting consumer demands to analyzing product reviews to increase customer satisfaction — are being driven by data. As Minor concurs, the sky’s the limit in terms of how the team at Columbia can leverage data to make smarter business decisions and drive the business into the future.

More business units are using the platform in a self-service manner that was not possible before. I can’t say enough about the positive impact that Databricks has had on Columbia.

LARA MINOR

Senior Enterprise Data Manager at Columbia Sportswear


The future of entertainment with AI
Comcast transforms the viewing experience with ML-powered voice recognition

Infrastructure unable to support data and ML needs

As a global technology and media company connecting millions of customers to personalized experiences, Comcast struggled with massive data, fragile data pipelines, and poor data science collaboration. Instantly answering a customer’s voice request for a particular program while turning billions of individual interactions into actionable insights strained Comcast’s IT infrastructure and their data analytics and data science teams. To make matters more complicated, Comcast needed to deploy models to a disjointed and disparate range of environments: cloud, on-prem, and even directly to devices in some instances.

MASSIVE DATA: Billions of events generated by our entertainment system and 20+ million voice remotes resulting in petabytes of data that need to be sessionized for analysis.

FRAGILE PIPELINES: Complicated data pipelines that frequently failed and were hard to recover. Small files were difficult to manage, slowing data ingestion for downstream machine learning.

POOR COLLABORATION: Globally dispersed data scientists working in different scripting languages struggled to share and reuse code.

MANAGEMENT OF ML MODELS: Developing, training and deploying hundreds of models was highly manual, slow and hard to replicate, making it difficult to scale.

FRICTION BETWEEN DEV AND DEPLOYMENT: Dev teams wanted to use the latest tools and models while ops wanted to deploy on proven infrastructure.

10x reduction in overall compute costs to process data
90% reduction in required DevOps resources to manage infrastructure
Deployment times reduced from weeks to minutes


Automated infrastructure, faster data pipelines with Delta Lake

Comcast realized they needed to modernize their entire approach to analytics from data ingest to the deployment of machine learning models that deliver new features that delight their customers. Today, the Databricks Unified Data Analytics Platform on AWS enables Comcast to build rich data sets and optimize machine learning at scale, streamline workflows across teams, foster collaboration, reduce infrastructure complexity, and deliver superior customer experiences.

SIMPLIFIED INFRASTRUCTURE MANAGEMENT: Reduced operational costs through automated cluster management and cost management features such as autoscaling and spot instances.

PERFORMANT DATA PIPELINES WITH DELTA LAKE: Delta Lake is used for the ingest, data enrichment, and initial processing of the raw telemetry from video and voice applications and devices.

RELIABLY MANAGE SMALL FILES: Delta Lake enabled them to optimize files for rapid and reliable ingestion at scale (see the compaction sketch after this list).

COLLABORATIVE WORKSPACES: Interactive notebooks improve cross-team collaboration and data science creativity, allowing Comcast to greatly accelerate model prototyping for faster iteration.

SIMPLIFIED ML LIFECYCLE: Managed MLflow simplifies the machine learning lifecycle and model serving via the Kubeflow environment, allowing them to track and manage hundreds of models with ease.

RELIABLE ETL AT SCALE: Delta Lake provides efficient analytics pipelines at scale that can reliably join historic and streaming data for richer insights.
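The small-file compaction mentioned in the list above can be expressed with Delta Lake’s OPTIMIZE command and auto-compaction table properties on Databricks. This is a hedged sketch; the table name and Z-order columns are illustrative, not taken from the Comcast deployment.

```python
# Hedged sketch: compacting small ingested files with Databricks' Delta commands.
# The table name and Z-order columns are illustrative, not from this case study.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Rewrite many small files into fewer, larger ones and co-locate rows that are
# commonly filtered together (Z-ordering) to speed up downstream reads.
spark.sql("OPTIMIZE telemetry_bronze ZORDER BY (device_id, event_date)")

# Optionally have new writes compacted automatically as they land.
spark.sql("""
  ALTER TABLE telemetry_bronze SET TBLPROPERTIES (
    'delta.autoOptimize.optimizeWrite' = 'true',
    'delta.autoOptimize.autoCompact'   = 'true'
  )
""")
```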

Delivering personalized experiences with ML

In the intensely competitive entertainment industry, there is no time to press the Pause button. Armed with a unified approach to analytics, Comcast can now fast-forward into the future of AI-powered entertainment — keeping viewers engaged and delighted with competition-beating customer experiences.

EMMY-WINNING VIEWER EXPERIENCE: Databricks helps enable Comcast to create a highly innovative and award-winning viewer experience with intelligent voice commands that boost engagement.

REDUCED COMPUTE COSTS BY 10X: Delta Lake has enabled Comcast to optimize data ingestion, replacing 640 machines with 64, while improving performance. Teams can spend more time on analytics and less time on infrastructure management.

LESS DEVOPS: Reduced the number of DevOps full-time employees required for onboarding 200 users from 5 to 0.5.

HIGHER DATA SCIENCE PRODUCTIVITY: Fostered collaboration globally between data scientists by enabling different programming languages through a single interactive workspace. Also, Delta Lake has enabled the data team to use data at any point within the data pipeline, allowing them to act more quickly in building and training new models.

FASTER MODEL DEPLOYMENT: Reduced deployment times from weeks to minutes as operations teams deployed models on disparate platforms.


Healthier, happier pets with data and AI
Mars Petcare uses Databricks to accelerate the diagnosis of health issues

Disjointed teams and data slow analytical progress

Mars Petcare’s mission is to provide pet owners with the information they need to get a holistic view into the health of their pets. With multiple brands focused on different areas of the pet industry — from medical devices to nutrition — they have a wealth of diverse data, including veterinarian notes and diagnoses, dietary information, genomics data and more. Seizing on this opportunity, they set forth to combine all of their data to identify new ways to help improve pet health.

However, with multiple brands capturing, storing and analyzing their own data, the engineering team faced an uphill battle leveraging the data for analytics, as each business had its own source systems, data sets, processes and models. This massive diversity in data quality and format created a lot of complexity from a data engineering perspective.

Building ETL pipelines was complex and time-consuming due to the siloed nature of their business. These disjointed business units also impacted the productivity of their analytics and data scientists. They were unable to collaborate efficiently across teams and often struggled to get the data they needed to analyze and build models.

Unifying data across teams accelerates healthcare innovation

In order to fully realize the value of their various sources of data as a whole, their data teams needed a unified approach to data analytics. With Azure Databricks, they are able to easily access their various data sources with JDBC connections, provision infrastructure to any scale without burdening their engineering team, and collaborate on the data across various data teams and business units with ease.

Databricks was the clear choice due to its seamless integration with Azure services, infinite scale and collaborative notebooks. To improve their data management and pipelines, they rely on Delta Lake for meta-configuration and providing their analyst and data science teams with direct access to ACID-compliant data for analytics and modeling. Versioning allows the data engineers to debug issues with transaction logs and update history, and time travel enables them to restore an older version. With Delta Lake and Databricks working seamlessly together, they can visualize the metadata layer, look at transaction logs directly, and view file types and sizes — all in a single platform.
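A minimal sketch of that versioning and time-travel workflow is shown below, assuming a Delta table path; the path and version number are illustrative rather than taken from Mars Petcare’s environment.

```python
# Hedged sketch: Delta Lake versioning and time travel. The path and version
# number are illustrative.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
path = "/mnt/delta/silver/pet_health"

# Inspect the transaction history: one row per commit with operation details.
DeltaTable.forPath(spark, path).history() \
    .select("version", "timestamp", "operation").show(truncate=False)

# Read the table as it looked at an earlier version (or timestamp).
v3 = spark.read.format("delta").option("versionAsOf", 3).load(path)

# Recover from a bad write by overwriting the table with the older snapshot.
v3.write.format("delta").mode("overwrite").save(path)
```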

With their ETL pipelines in place, the analytics and data science teams can start working with the data more easily. Having access to versioned data allows their analysts to recreate projects easily, validate models on updated data sets, and even analyze previous versions for missing insights. The collaborative notebooks have democratized the data for various teams across business units — giving everyone the ability to access and leverage the data for their needs.

Faster time-to-insight results in better healthcare solutions for pets

The benefits of Databricks and Delta Lake have enabled the Mars Petcare team to accelerate pet healthcare innovation. From an operational standpoint, infrastructure management is more efficient — eliminating the complexity of spinning up clusters and lowering overall costs. Additionally, features like autoscaling have helped reduce compute usage, which has lowered overall cloud costs.

With data insights at their fingertips, they are revolutionizing the diagnosis of healthcare issues within pets, including the prediction of terminal diseases in older cats, identifying the types of behaviors that might indicate health issues, and using DNA analysis to identify genetic health conditions within dogs.

With a single source of truth and a unified system that promotes collaboration, their data teams across all of their brands can now use data to discover new ways to improve pet health and well-being.


Putting the patient’s health first with data and AI
Healthdirect Australia provides personalized and secure online patient care with Databricks

Data quality and governance, silos, and the inability to scale

Due to regulatory pressures, Healthdirect Australia set forth to improve overall data quality and ensure a level of governance on top of that. But they ran into challenges when it came to data storage and access. Multiple data silos also served as a blocker to efficiently prepare data for downstream analytics. These disjointed data sources impacted the consistency of data reads as data was oftentimes out-of-sync between the various systems in their stack. The low-quality data also led to higher error rates and processing inefficiencies. This fragmented architecture created significant operational overhead and limited their ability to have a comprehensive view of the patient.

Further, they needed to ingest over 1 billion data points due to a changing landscape of customer demand, such as bookings, appointments, pricing, eHealth transaction activity, etc. — estimated at over 1TB of data.

“We had a lot of data challenges. We just couldn’t process efficiently enough. We were starting to get batch overruns. We were starting to see that a 24-hour window isn’t the most optimum time in which we want to be able to deliver healthcare data and services,” explained Peter James, chief architect, Healthdirect Australia.

Ultimately, Healthdirect realized they needed to modernize their end-to-end process and tech stack to properly support the business.

Modernizing analytics with Databricks and Delta Lake

Databricks provides Healthdirect Australia with a Unified Data Analytics Platform that simplifies data engineering and accelerates data science innovation. The notebook environment enables them to make content changes in a controlled fashion rather than having to run bespoke jobs each time.

“Databricks has provided a big uplift for our teams and our data operations,” said James. “The analysts were working directly with the data operations teams. They are able to achieve the same pieces of work together, within the same timeframes that used to take twice as long. They’re working together and we’re seeing just a massive reduction in the speed at which we can deliver service.”

With Delta Lake, they’ve created logical data zones: Landing, Raw, Staging and Gold. Within these zones, they store their data “as is,” in its structured or unstructured state, in Delta Lake tables. From there they use a metadata-driven schema and hold the data within a nested structure within that table. This allows them to handle data consistently from every source and simplifies the mapping of data to the various applications pulling the data.
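A hedged sketch of that zone pattern follows: records are landed “as is” inside a nested struct with source metadata, and a metadata-driven mapping promotes them to the next zone. All paths, columns and mappings are illustrative, not Healthdirect’s actual configuration.

```python
# Hedged sketch of the Landing/Raw/Staging/Gold zone pattern described above.
# Zone paths, the nested payload column and the mapping are illustrative only.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Landing: store each source's records "as is" inside a nested struct, alongside
# metadata describing where and when the record arrived.
raw = spark.read.json("/mnt/sources/bookings/")  # illustrative source
landing = raw.select(
    F.lit("bookings").alias("source_system"),
    F.current_timestamp().alias("ingested_at"),
    F.struct([F.col(c) for c in raw.columns]).alias("payload"))  # nested "as is" record
landing.write.format("delta").mode("append").save("/mnt/delta/landing/bookings")

# A metadata-driven mapping decides how each source's payload is promoted to the
# next zone, so every source is handled by the same generic job.
mapping = {"bookings": {"booking_id": "payload.id", "service": "payload.service_code"}}
cols = [F.col(src).alias(dst) for dst, src in mapping["bookings"].items()]
staged = landing.select("source_system", "ingested_at", *cols)
staged.write.format("delta").mode("append").save("/mnt/delta/staging/bookings")
```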

6x improvement in data processing
20 million records ingested in 20 minutes


Meanwhile, through Structured Streaming, they were able to convert all of their ETL batch jobs into streaming ETL jobs that could serve multiple applications consistently. Overall, the advent of Spark Structured Streaming, Delta Lake and Databricks’ Unified Data Analytics Platform provides significant architecture improvements that have boosted performance, reduced operational overheads and increased process efficiencies.
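Converting a scheduled batch ETL step into an incremental Structured Streaming job over Delta tables can be sketched as follows; the paths are placeholders, and the trigger-once option is shown only to illustrate that the streaming job can still run on an existing schedule.

```python
# Hedged sketch: converting a scheduled batch ETL step into an incremental
# Structured Streaming job over Delta tables. Paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Before: a batch job that reprocessed the full source table on every run.
# df = spark.read.format("delta").load("/mnt/delta/raw/appointments")

# After: read the same Delta table as a stream; only new commits are processed,
# and the checkpoint tracks progress between runs.
updates = (spark.readStream
    .format("delta")
    .load("/mnt/delta/raw/appointments")
    .withColumn("processed_at", F.current_timestamp()))

(updates.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/appointments_etl")
    .trigger(once=True)  # process available data incrementally on the existing schedule
    .start("/mnt/delta/staging/appointments"))
```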

Faster data pipelines result in better patient-driven healthcare

As a result of the performance gains delivered by Databricks and the improved data reliability through Delta Lake, Healthdirect Australia realized improved accuracy of their fuzzy name match algorithm from less than 80% with manual verification to 95% and no manual intervention.

The processing improvements with Delta Lake and Structured Streaming allowed them to process more than 30,000 automated updates per month. Prior to Databricks, they had to use unreliable batch jobs that were highly manual to process the same number of updates over a span of 6 months — a 6x improvement in data processing.

They were also able to increase their data load rate to 1 million records per minute, loading their entire 20 million record data set in 20 minutes. Before the adoption of Databricks, it took more than 24 hours to process 1 million transactions, blocking analysts from making swift decisions to drive results.

Finally, data security, which was critical due to compliance requirements, was greatly improved. Databricks supports standard security and compliance requirements such as HIPAA, and Healthdirect was able to use Databricks to meet Australia’s security requirements. This yielded significant cost reductions and gave them continuous data assurance by monitoring access privileges, such as changes in roles, metadata-level security changes, data leakage, etc.

“Databricks delivered the time to market as well as the analytics and operational uplift that we needed in order to be able to meet the new demands of the healthcare sector,” said James.

Looking ahead, the future looks bright for Healthdirect Australia. With the help of Databricks, they have proven the value of data and analytics and how it can impact their business vision. With transparent access to data that boasts well-documented lineage and quality, participation across various business and analyst groups has increased — empowering teams to more easily and quickly extract value from their data with the goal of improving healthcare for everyone.

Databricks delivered the time to market as well as the analytics and operational uplift that we needed in order to be able to meet the new demands of the healthcare sector.

PETER JAMES

Chief Architect, Healthdirect Australia


Conclusion

Databricks helps enterprises simplify their architecture for production pipelines in order to increase reliability, reduce latency, and reduce the amount of support and troubleshooting required of data engineers, which lowers net infrastructure costs in the short term. Bigger picture, moving to a Delta architecture based on open standards increases the agility and flexibility of their data lakes. Instead of supporting various data silos, with portions of their data locked away in proprietary databases, they can make their data lakes a consistent, reliable source for their data analysts and data scientists.

And as enterprises move to minimize vendor lock-in by moving to cloud-agnostic and multi-cloud architectures, Databricks’ architecture, based on open standards, helps customers stay agile and flexible by simplifying access to new data sources and unlocking new use cases powered by data analytics.

Get a custom value assessment and PoC to see what infrastructure savings and business gains you can achieve with Databricks.

CONTACT US TODAY

© Databricks 2020. All rights reserved. Apache, Apache Spark, Spark and the Spark logo are trademarks of the Apache Software Foundation.

Privacy Policy | Terms of Use

Databricks has greatly improved collaboration within our cross-functional data team, empowering us to collectively work toward new data-driven innovations to improve workplace safety.

BRYANT EADON

CIO, StrongArm Technologies
