TRANSCRIPT
IDC Update on How Big Data Is Redefining High Performance Computing
Earl Joseph – [email protected]
Steve Conway – [email protected]
Chirag Dekate – [email protected]
Bob Sorensen – [email protected]
IDC Has Over 1,000 Analysts In 62 Countries
Agenda
A Short HPC Market Update
Big Data Challenges and Shortcomings
The High End of Big Data
• Examples of Very Large Big Data
Examples of How Big Data is Redefining
High Performance Computing
HPC
Market Update
What Is HPC?
IDC uses these terms to cover all technical
servers used by scientists, engineers, financial
analysts and others:
• HPC
• HPTC
• Technical Servers
• Highly computational servers
HPC covers all servers that are used for
computational or data intensive tasks
• From a $5,000 deskside server up to a supercomputer costing over $550 million
Top Trends in HPC
2013 declined overall – by $800 million
• For a total of $10.3 billion
• Mainly due to a few very large system sales in 2012
that weren’t repeated in 2013
• We expect growth in 2015 to 2018
Software issues continue to grow
The worldwide Petascale Race is at full speed
GPUs and accelerators are hot new technologies
Big data combined with HPC is creating new solutions
in new areas
IDC HPC Competitive Segments: 2013
HPC Servers: $10.3B total
• Supercomputers (over $500K): $4.0B
• Divisional ($250K - $500K): $1.4B
• Departmental ($100K - $250K): $3.4B
• Workgroup (under $100K): $1.6B
HPC WW Market Trends:
A 17 Year Perspective
HPC
Market Forecasts
HPC Forecasts
• Forecasting a 7.4% yearly growth from 2013 to 2018
• 2018 should reach $14.7 billion
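The two forecast figures can be cross-checked with a compound-growth calculation. The sketch below is only an illustration, not IDC's forecasting method: it assumes the $10.3 billion 2013 total from the market update as the base and a constant 7.4% growth rate applied once per year. The function name `project` is hypothetical.

```python
def project(base, rate, years):
    """Compound a starting value at a fixed annual growth rate."""
    return base * (1 + rate) ** years

# 2013 HPC server revenue in $ billions (from the market update slide).
revenue_2013 = 10.3
# Five annual steps take 2013 to 2018 (assumes a constant yearly rate).
revenue_2018 = project(revenue_2013, 0.074, 2018 - 2013)
print(f"Projected 2018 revenue: ${revenue_2018:.1f}B")
```

Under those assumptions the result rounds to the $14.7 billion quoted above.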
HPC Forecasts: By Industry/Applications
The Broader HPC Market
Big Data
Challenges And
Shortcomings
Defining Big Data
HPDA Market Drivers
More input data (ingestion)
• More powerful scientific instruments/sensor networks
• More transactions/higher scrutiny (fraud, terrorism)
More output data for integration/analysis
• More powerful computers
• More realism
• More iterations in available time
Real time, near-real time requirements
• Catch fraud before it hits credit cards
• Catch terrorists before they strike
• Diagnose patients before they leave the office
• Provide insurance quotes before callers leave the phone
The need to pose more intelligent questions
• Smarter mathematical models and algorithms
Top Drivers For Implementing Big Data
Organizational Challenges With Big Data:
Government Compared To All Others
Big Data Meets
HPC And
Advanced Simulation
High Performance Data Analysis
Needs HPC resources
• High complexity (algorithms)
• High time-criticality
• High variability
• (On premise or in cloud)
Data of all kinds
• The 4 V’s: volume, variety, velocity, value
• Structured, unstructured
• Partitionable, non-partitionable
• Regular, irregular patterns
Simulation & analytics
• Search, pattern discovery
• Iterative methods
• Established HPC users + new
commercial users
HPC Adoption Timeline (Examples)
(Timeline figure spanning 1960 to 2012)
Very Large
Big Data
Examples
NASA
Square Kilometre Array – Radio
Astronomy for Astrophysics
CERN
• LHC: the world’s leading accelerator -- Multiple Nobel Prizes
for particle physics work
• Innovation driven by the need to distribute massive data sets
and the accompanying applications
• ATLAS, one of the LHC’s two general-purpose detectors, generates 1PB of data
per second when running! (Not all of this is distributed.)
• Private cloud distribution to scientists in 20 EU member states
plus observer states (single largest user is the U.S.)
• Today: only 0.0000005% of the data is used
NOAA
HPC Will Be Used More for Managing Mega-IT Infrastructures
Examples of
Big Data
Redefining HPC
Finding suspicious patterns that we don’t
even know exist in related data sets
Use Case: PayPal Fraud Detection
The Problem
What Kind of Volume?
PayPal’s Data Volumes And HPDA Requirements
Where PayPal Used HPC
The Results
$710 million saved in the first year in fraud that they wouldn’t have been able
to detect before
There Are New Technologies That Will Likely Cause A Mass Explosion In Data – Requiring HPDA Solutions
GEICO: Real-Time Insurance Quotes
Problem: Need accurate automated phone quotes
in 100ms. They couldn’t do these calculations
nearly fast enough on the fly.
Solution: Each weekend, use a new HPC cluster to
pre-calculate quotes for every American adult and
household (60 hour run time)
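The approach above trades on-the-fly computation for a periodic batch job plus a fast lookup at call time. A minimal sketch of that pattern follows. It is not GEICO's system: `expensive_quote`, the pricing formula, and the customer records are hypothetical placeholders chosen only to show the structure.

```python
def expensive_quote(profile):
    """Hypothetical stand-in for a slow pricing model."""
    base = 500.0
    age_factor = 1.5 if profile["age"] < 25 else 1.0
    return round(base * age_factor + 40.0 * profile["claims"], 2)

def precompute_quotes(profiles):
    """Batch step (e.g. the weekend run): price everyone in advance."""
    return {pid: expensive_quote(p) for pid, p in profiles.items()}

def lookup_quote(table, pid):
    """Online step: a dictionary lookup instead of a model evaluation."""
    return table.get(pid)

# Made-up records, keyed by a hypothetical customer ID.
profiles = {
    "A100": {"age": 22, "claims": 1},
    "B200": {"age": 40, "claims": 0},
}
quote_table = precompute_quotes(profiles)
print(lookup_quote(quote_table, "A100"))
```

The online path never calls the pricing model, so its latency depends only on the lookup, at the price of quotes being as stale as the last batch run.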
Something To Think About -- GEICO: Changing
The Way One Approaches Solving a Problem
• Instead of processing each event one-at-a-time, process it for everyone on a regular basis
It can be dramatically cheaper and faster, and offers
additional ways to be more accurate
But most of all it can create new and more powerful capabilities
• Examples: For home loan applications – calculate for every adult
in the US and every home in the US
For health insurance fraud – track every procedure done on every US person by every doctor – and find patterns
Something To Think About -- GEICO: Changing
The Way One Approaches Solving a Problem
• Future Examples (continued):
If you add in large-scale data collection via sensors like
GPS, drones and RFID tags:
• New car insurance rules – The insurance company doesn’t have to pay if you break the law -- like speeding and having an accident
• You could track every car at all times – then charge $2 to see where the in-laws are in traffic if they are late for a wedding
• Google Maps could show in real-time where every letter and package is located
• But crooks could also use it in many ways – e.g. watching ATMs, looking for when guards are on break, …
U.S. Postal Service
MCDB = memory-centric database
CMS: Government Health Care Fraud
5 separate databases for the big USG health care programs
under Centers for Medicare and Medicaid Services (CMS)
Estimated fraud: $150B-$450B (<$5B caught today)
ORNL, SDSC have evaluation contracts to unify the
databases and perform fraud detection on various
architectures
Schrödinger: Cloud-based Lead
Discovery for Drug Design
NOVARTIS/SCHRÖDINGER:
Pharmaceutical company Novartis increased the resolution
of a drug discovery algorithm 10x and wanted to use it to
test 21 million small molecules as drug candidates
Novartis used the Schrödinger drug discovery app in
the AWS public cloud, with the help of Cycle Computing
The initial run used 51,000 AWS cores, cost $14,000 and
took <4 hours
… and it’s getting cheaper: a later run used 156,000
AWS cores with comparable costs and time
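A rough back-of-envelope reading of the initial run's figures is sketched below. It assumes every core was busy for the full duration, and it treats the stated "<4 hours" as exactly 4 hours, so the core-hour total is an upper bound and the per-core-hour cost a lower bound. These are illustrative estimates, not figures from the source.

```python
cores = 51_000
hours_upper_bound = 4      # the source says "<4 hours"
cost_usd = 14_000

# Upper bound on consumed core-hours, assuming all cores ran the whole time.
core_hours_max = cores * hours_upper_bound
# Lower bound on cost per core-hour, since actual core-hours were fewer.
cost_per_core_hour_min = cost_usd / core_hours_max

print(f"At most {core_hours_max:,} core-hours")
print(f"At least ${cost_per_core_hour_min:.3f} per core-hour")
```

If the later 156,000-core run really had comparable cost and time, its cost per core-hour would be roughly a third of this, which is one way to read "getting cheaper".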
Global Financial Services: Company X
One of the most respected firms in the global financial services
industry updates detailed information daily on several million
companies around the world.
Clients use the firm's credit ratings and other company information
in making lending decisions and for other planning, marketing, and
business decision making.
The firm uses statistical models to develop a company's scores and
ratings, and for years, the ratings have been prepared and analyzed
locally in near real time by the firm's personnel around the world.
• This practice is a major competitive advantage but resulted in the
creation of hundreds of distinct databases and more than a dozen
scoring environments.
• Several years ago, the company established a goal of centralizing these
resources and chose SAS as the centralization mechanism, including
SAS Grid Manager as part of the software stack.
Global Financial Services: Company X
The centralized IT infrastructure created using SAS preserves the
advantages of the company's locally created ratings and reports.
The new infrastructure provides an effective environment for
analytics development and accommodates multiple testing, development,
and production environments in a single stack.
It is flexible enough to allow dynamic prioritization among these
environments, according to a company executive. With help from
SAS Grid Manager, the company can maximize the use of its
computing resources. The software automatically assigns jobs to
server nodes with available capacity, instead of having users wait in
queue for time on fully utilized nodes.
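The scheduling behavior described above can be illustrated with a least-loaded dispatch sketch. This is not SAS Grid Manager's algorithm or API; the node names, slot counts, and the `assign_job` helper are all hypothetical.

```python
def assign_job(nodes, job_slots):
    """Place a job on the node with the most free slots, if any can hold it.

    `nodes` maps node name -> {"capacity": int, "used": int}.
    Returns the chosen node name, or None when no node has room.
    """
    candidates = [
        (info["capacity"] - info["used"], name)
        for name, info in nodes.items()
        if info["capacity"] - info["used"] >= job_slots
    ]
    if not candidates:
        return None
    # max() compares free slots first, then node name to break ties.
    _, best = max(candidates)
    nodes[best]["used"] += job_slots
    return best

# Made-up cluster state for illustration.
nodes = {
    "node-a": {"capacity": 8, "used": 8},   # fully utilized
    "node-b": {"capacity": 8, "used": 3},
    "node-c": {"capacity": 8, "used": 6},
}
print(assign_job(nodes, 2))
```

The job skips the fully utilized node and lands where capacity is available, which is the property the executive describes.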
The company executive estimates that it might cost 30% more to
purchase servers with enough capacity to handle these peak
workloads on their own.
Global Financial Services: Company X
Several million clients use
the firm’s credit ratings to
help make lending
decisions
Goal: increase efficiency for
updating ratings
Result: HPC multi-cluster
grid boosted efficiency 30%
-- no need to buy additional
clusters yet.
Real Estate
Worldwide vacation exchange &
rental leader
Goal: Update property valuations
several times per day (not possible
with enterprise servers)
Results:
• HPC technology enabled all updates in
8-9 hours
• Avoided move to heuristics
• Allowed company to focus on rental
side
Outcomes-Based Medical Diagnosis and
Treatment Planning
Enter the patient’s history and symptomology.
While the patient is still in the office, sift through millions of archived patient records for relevant outcomes.
Provider considers the efficacies of various treatments for “similar” patients (but is not bound by the findings).
Ergo, this functions as a powerful decision-support tool.
Benefits: better outcomes + rein in costly outlier practices.
Digital Television Services
A global leader with 30 million subscribers
Goal: maximize revenue & customer satisfaction
during high-growth period
Result: HPC has added €7.5 million in annual
revenue while increasing satisfaction
IDC HPDA Server Forecast
Fast growth from a small starting point: $933M
HPDA ecosystem >$2.6B in 2018
IDC HPDA Storage Forecast
Storage is the fastest-growing HPC market (9% CAGR)
And HPDA storage will grow even faster (26.5% CAGR)
In
Summary
Summary: HPDA Market Opportunity
HPDA: simulation + newer high-performance analytics
• IDC predicts fast growth from a small
starting point
HPC and high-end commercial analytics
are converging
• Algorithmic complexity is the common
denominator
Economically important use cases are
emerging
No single HPC solution is best for all
problems
HPDA User Talks: HPC User Forums, UK, Germany, France,
China and U.S. …
• HPC in Evolutionary Biology, Andrew Meade, University of Reading
• HPC in Pharmaceutical Research: From Virtual Screening to All-Atom Simulations of Biomolecules, Jan Kriegl, Boehringer-Ingelheim
• European Exascale Software Initiative, Jean-Yves Berthou, Electricite de France
• Real-time Rendering in the Automotive Industry, Cornelia Denk, RTT-Munich
• Data Analysis and Visualization for the DoD HPCMP, Paul Adams, ERDC
• Why HPCs Hate Biologists, and What We're Doing About It, Titus Brown, Michigan State University
• Scalable Data Mining and Archiving in the Era of the Square Kilometre Array, the Square Kilometre Array Telescope Project, Chris Mattmann, NASA/JPL
• Big Data and Analytics in HPC: Leveraging HPC and Enterprise Architectures for Large Scale Inline Transactional Analytics in Fraud Detection at PayPal, Arno Kolster, PayPal, an eBay Company
• Big Data and Analytics Vendor Panel: How Vendors See Big Data Impacting the Markets and Their Products/Services, Panel Moderator: Chirag Dekate, IDC
• Data Analysis and Visualization of Very Large Data, David Pugmire, ORNL
• The Impact of HPC and Data-Centric Computing in Cancer Research, Jack Collins, National Cancer Institute
• Urban Analytics: Big Cities and Big Data, Paul Muzio, City University of New York
• Stampede: Intel MIC And Data-Intensive Computing, Jay Boisseau, Texas Advanced Computing Center
• Big Data Approaches at Convey, John Leidel
• Cray Technical Perspective On Data-Intensive Computing, Amar Shan
• Data-intensive Computing Research At PNNL, John Feo, Pacific Northwest National Laboratory
• Trends in High Performance Analytics, David Pope, SAS
• Processing Large Volumes of Experimental Data, Shane Canon, LBNL
• SGI Technical Perspective On Data-Intensive Computing, Eng Lim Goh, SGI
• Big Data and PLFS: A Checkpoint File System For Parallel Applications, John Bent, EMC
• HPC Data-intensive Computing Technologies, Scott Campbell, Platform/IBM
• The CEA-GENCI-Intel-UVSQ Exascale Computing Research Centre, Marie-Christine Sawley, Intel
Big Data Software Shortcomings --
Today
Big Data Software:
Big Data Software Technology Stack