an intersystems guide to the data galaxy · an intersystems guide to the data ... through its...

38
An InterSystems Guide to the Data Galaxy Benjamin De Boe Product Manager

Upload: vankhuong

Post on 05-Apr-2018

250 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: An InterSystems Guide to the Data Galaxy · An InterSystems Guide to the Data ... through its Resilient Distributed Dataset abstraction ... InterSystems IRIS UIMA Integration marries

An InterSystems Guide to the Data GalaxyBenjamin De Boe

Product Manager

Page 2: An InterSystems Guide to the Data Galaxy · An InterSystems Guide to the Data ... through its Resilient Distributed Dataset abstraction ... InterSystems IRIS UIMA Integration marries

Analytics

Page 3: An InterSystems Guide to the Data Galaxy · An InterSystems Guide to the Data ... through its Resilient Distributed Dataset abstraction ... InterSystems IRIS UIMA Integration marries

3 | © InterSystems Corporation. All rights reserved. |

Page 4: An InterSystems Guide to the Data Galaxy · An InterSystems Guide to the Data ... through its Resilient Distributed Dataset abstraction ... InterSystems IRIS UIMA Integration marries

4 | © InterSystems Corporation. All rights reserved. |

Page 5: An InterSystems Guide to the Data Galaxy · An InterSystems Guide to the Data ... through its Resilient Distributed Dataset abstraction ... InterSystems IRIS UIMA Integration marries

5 | © InterSystems Corporation. All rights reserved. |

Page 6: An InterSystems Guide to the Data Galaxy · An InterSystems Guide to the Data ... through its Resilient Distributed Dataset abstraction ... InterSystems IRIS UIMA Integration marries

6 | © InterSystems Corporation. All rights reserved. |

Page 7: An InterSystems Guide to the Data Galaxy · An InterSystems Guide to the Data ... through its Resilient Distributed Dataset abstraction ... InterSystems IRIS UIMA Integration marries

7 | © InterSystems Corporation. All rights reserved. |

Page 8: An InterSystems Guide to the Data Galaxy · An InterSystems Guide to the Data ... through its Resilient Distributed Dataset abstraction ... InterSystems IRIS UIMA Integration marries

Analytical Workflows & Tools

An InterSystems Guide to the Data Galaxy

Page 9: An InterSystems Guide to the Data Galaxy · An InterSystems Guide to the Data ... through its Resilient Distributed Dataset abstraction ... InterSystems IRIS UIMA Integration marries

9 | © InterSystems Corporation. All rights reserved. |

Capturing & Storing Data

Capturing & Storingraw data curated data

Page 10: An InterSystems Guide to the Data Galaxy · An InterSystems Guide to the Data ... through its Resilient Distributed Dataset abstraction ... InterSystems IRIS UIMA Integration marries

10 | © InterSystems Corporation. All rights reserved. |

Capturing & Storing Data

Goals:

• Persist data in a form that maximizes options for analytics

Tools:

• Storage: Hadoop, HDFS, NoSQL DBMS

• Database: RDBMS, Multi-model DBMS

• Solutions: Splunk, Flume

Challenges:

• Volume, Velocity and Variety of data

• Provide flexible & efficient access to data

Page 11: An InterSystems Guide to the Data Galaxy · An InterSystems Guide to the Data ... through its Resilient Distributed Dataset abstraction ... InterSystems IRIS UIMA Integration marries

11 | © InterSystems Corporation. All rights reserved. |

Data Processing

Capturing & Storingraw data curated data

Processing

Page 12: An InterSystems Guide to the Data Galaxy · An InterSystems Guide to the Data ... through its Resilient Distributed Dataset abstraction ... InterSystems IRIS UIMA Integration marries

12 | © InterSystems Corporation. All rights reserved. |

Data Processing

Goals:

• Transform raw data into analytics-ready formats

Tools:

• Frameworks: Hadoop MapReduce, Spark, Flink, …

• ETL tools: ODI, Attunity, Talend, Informatica, SnapLogic, …

• Databases: SQL, Stored procedures, …

Challenges:

• Efficiency & scalability as volumes grow

• Trade-off between flexibility & usability

Page 13: An InterSystems Guide to the Data Galaxy · An InterSystems Guide to the Data ... through its Resilient Distributed Dataset abstraction ... InterSystems IRIS UIMA Integration marries

13 | © InterSystems Corporation. All rights reserved. |

Exploration & Analytics

Exploring

Business Analytics

Capturing & Storingraw data curated data

Processing

Page 14: An InterSystems Guide to the Data Galaxy · An InterSystems Guide to the Data ... through its Resilient Distributed Dataset abstraction ... InterSystems IRIS UIMA Integration marries

14 | © InterSystems Corporation. All rights reserved. |

Exploration & Analytics

Goals:

• Exploratory analysis of raw data’s general characteristics & distribution

• Deeper analysis through predefined dimensions, measures & KPIs

• Identify opportunities for data refinement & further analysis

Tools:

• Exploration: DBMS, Hive, SparkSQL, Periscope, …

• Analytics: Tableau, QlikView, Birst, Datameer, Oracle BI, Cognos, BusinessObjects, …

Challenges:

• Trade-off between flexibility & ease of use through predefined data models

• Support for semi-structured and structured data

• Efficiency & scalability as volumes grow

Page 15: An InterSystems Guide to the Data Galaxy · An InterSystems Guide to the Data ... through its Resilient Distributed Dataset abstraction ... InterSystems IRIS UIMA Integration marries

15 | © InterSystems Corporation. All rights reserved. |

Machine Learning

Exploring

Business Analytics

Machine Learning

Capturing & Storingraw data curated data

Processing

Page 16: An InterSystems Guide to the Data Galaxy · An InterSystems Guide to the Data ... through its Resilient Distributed Dataset abstraction ... InterSystems IRIS UIMA Integration marries

16 | © InterSystems Corporation. All rights reserved. |

Machine Learning

Goals:

• Identify correlations between input features and outcomes

• Define derived features and models that maximize predictive value

Tools:

• Frameworks: R, Spark ML, System ML, SciKit, …

• Products: SAS, SPSS, BigML, KNIME, …

Challenges:

• Abstract complexity of underlying algorithms

• Optimize complex computations on large datasets

Page 17: An InterSystems Guide to the Data Galaxy · An InterSystems Guide to the Data ... through its Resilient Distributed Dataset abstraction ... InterSystems IRIS UIMA Integration marries

17 | © InterSystems Corporation. All rights reserved. |

Closing the Loop

Exploring

Business Analytics

Predicting

Capturing & Storingraw data curated data

Processing

Machine Learning

Page 18: An InterSystems Guide to the Data Galaxy · An InterSystems Guide to the Data ... through its Resilient Distributed Dataset abstraction ... InterSystems IRIS UIMA Integration marries

18 | © InterSystems Corporation. All rights reserved. |

Closing the Loop

Goals:

• Apply predictive models to new data

• Leverage analytics output in transactional systems

Tools:

• Standard: PMML

• Tools: Zementis, SAS

Challenges:

• Efficient model scoring in transactional systems

• Avoid complexity and heterogeneity in production environment

Page 19: An InterSystems Guide to the Data Galaxy · An InterSystems Guide to the Data ... through its Resilient Distributed Dataset abstraction ... InterSystems IRIS UIMA Integration marries

An Open Analytics Platform

An InterSystems Guide to the Data Galaxy

Page 20: An InterSystems Guide to the Data Galaxy · An InterSystems Guide to the Data ... through its Resilient Distributed Dataset abstraction ... InterSystems IRIS UIMA Integration marries

20 | © InterSystems Corporation. All rights reserved. |

Recurring Themes

Diverse tasks require diverse toolsets: Open Platform

• There is no single tool that covers the entire spectrum

• A proper combination of tools can help address the usability vs flexibility trade-off

Importance of a flexible storage layer: Flexible Platform

• Tool diversity should not be an excuse for siloing

• Move computations closer to the data to increase efficiency

Horizontal scalability is a must: Scalable Platform

• Almost every task lists volume-related challenges

Page 21: An InterSystems Guide to the Data Galaxy · An InterSystems Guide to the Data ... through its Resilient Distributed Dataset abstraction ... InterSystems IRIS UIMA Integration marries
Page 22: An InterSystems Guide to the Data Galaxy · An InterSystems Guide to the Data ... through its Resilient Distributed Dataset abstraction ... InterSystems IRIS UIMA Integration marries

InterSystems IRIS Analytics Strategy

An Open Analytics Platform allows partners and end-users to build solutions and services

that leverage best-of-breed analytics technologies, both embedded in the platform and

complementary to it.

InterSystems IRIS: Database

Text

AnalyticsPMML UIMABI

Apache

Spark…

Integration

UIMA

third-party tools embedded technologies industry standards

MDX

JDBCIntegration

ConnectorConnector

Page 23: An InterSystems Guide to the Data Galaxy · An InterSystems Guide to the Data ... through its Resilient Distributed Dataset abstraction ... InterSystems IRIS UIMA Integration marries

23 | © InterSystems Corporation. All rights reserved. |

InterSystems IRIS: Database

The InterSystems IRIS Database sits at the core of our Open Analytics

Platform. It scales both up and out to meet the most demanding

workloads

• Parallel processing for vertical scalability

• ECP application servers and sharding for horizontal scalability

For SQL access, the InterSystems JDBC driver offers many optimizations to maximize

throughput, all fully transparent to analytical tools connecting to the database:

• Leverage distributed data layout for ingestion

• Exploit low-level storage format for select queries

• Shared memory access

Page 24: An InterSystems Guide to the Data Galaxy · An InterSystems Guide to the Data ... through its Resilient Distributed Dataset abstraction ... InterSystems IRIS UIMA Integration marries

24 | © InterSystems Corporation. All rights reserved. |

InterSystems IRIS: Analytics

InterSystems IRIS: Analytics comes with embedded technologies for exploring and

analysing both structured and unstructured data. Being embedded in the platform, they

can be easily leveraged in applications and solutions built on InterSystems IRIS.

InterSystems IRIS: Business Intelligence (fka DeepSee)

• Define and query multidimensional OLAP cubes

• Supports both ad-hoc analysis and embeddable dashboards

InterSystems IRIS: Text Analytics (fka iKnow)

• Identify concepts and their context from text through a unique

bottom-up approach

• Enable knowledge professionals to explore and consume large volumes of texts

Page 25: An InterSystems Guide to the Data Galaxy · An InterSystems Guide to the Data ... through its Resilient Distributed Dataset abstraction ... InterSystems IRIS UIMA Integration marries

25 | © InterSystems Corporation. All rights reserved. |

InterSystems IRIS: Apache Spark Connector

Apache Spark addresses the need for efficient, horizontally scalable data processing

through its Resilient Distributed Dataset abstraction

• Much higher developer productivity than Hadoop’s MapReduce framework

• Lazy execution & optimized code generation enables very high throughput

• Comes with broad set of libraries, including Machine Learning, Graph, etc

• Active and broad user community

Apache Spark’s Data Source API allows database vendors to implement optimizations that

leverage the underlying data store

• Spark’s Catalyst compiler will use these to push computations to the data

• Further convenience extensions are also available

Page 26: An InterSystems Guide to the Data Galaxy · An InterSystems Guide to the Data ... through its Resilient Distributed Dataset abstraction ... InterSystems IRIS UIMA Integration marries

26 | © InterSystems Corporation. All rights reserved. |

InterSystems IRIS: Apache Spark Connector

InterSystems IRIS Apache Spark Connector capabilities:

• Convenient Read/Write API

― Expose DB table as Dataset

― Expose arbitrary SQL query as Dataset

• Pushdown strategies

― filter(), select(), and more advanced where possible

• Leverage distributed DB data layout & parallelism

― Biggest gains vs vanilla JDBC connector

Page 27: An InterSystems Guide to the Data Galaxy · An InterSystems Guide to the Data ... through its Resilient Distributed Dataset abstraction ... InterSystems IRIS UIMA Integration marries

27 | © InterSystems Corporation. All rights reserved. |

InterSystems IRIS: Apache Spark Connector

Spark Spark Spark

Shard

Master

JD

BC

Shard Shard

Master

Spark Spark Spark

Shard Shard Shard

Inte

rSyst

em

s IR

IS

Spark

Connecto

r

Inte

rSyst

em

s IR

IS

Spark

Connecto

r

Inte

rSyst

em

s IR

IS

Spark

Connecto

r

Page 28: An InterSystems Guide to the Data Galaxy · An InterSystems Guide to the Data ... through its Resilient Distributed Dataset abstraction ... InterSystems IRIS UIMA Integration marries

28 | © InterSystems Corporation. All rights reserved. |

InterSystems IRIS: PMML Support

Advanced

AnalyticsTransactional

System

Model Consumer Model Producer

Page 29: An InterSystems Guide to the Data Galaxy · An InterSystems Guide to the Data ... through its Resilient Distributed Dataset abstraction ... InterSystems IRIS UIMA Integration marries

30 | © InterSystems Corporation. All rights reserved. |

InterSystems IRIS: PMML Support

Advanced

AnalyticsTransactional

System

Model Consumer Model Producer

PMML

Page 30: An InterSystems Guide to the Data Galaxy · An InterSystems Guide to the Data ... through its Resilient Distributed Dataset abstraction ... InterSystems IRIS UIMA Integration marries

32 | © InterSystems Corporation. All rights reserved. |

InterSystems IRIS: PMML Support

PMML standard enables best-of-breed approach for Predictive Analytics

• Leverage existing skillset in 3rd party tools for model building in lab environment

• Leverage InterSystems IRIS Data Platform for model execution in production environment

• Fits our Open Analytics Platform roadmap

Standard support in InterSystems IRIS: PMML Consumer

• Automatically generate optimized COS code based on PMML

• Leverage from other InterSystems IRIS embedded technologies, ...

Page 31: An InterSystems Guide to the Data Galaxy · An InterSystems Guide to the Data ... through its Resilient Distributed Dataset abstraction ... InterSystems IRIS UIMA Integration marries

34 | © InterSystems Corporation. All rights reserved. |

InterSystems IRIS: UIMA Integration

Combining different NLP tools is challenging at best

custom custom customNLP tech 1 NLP tech 2

Page 32: An InterSystems Guide to the Data Galaxy · An InterSystems Guide to the Data ... through its Resilient Distributed Dataset abstraction ... InterSystems IRIS UIMA Integration marries

35 | © InterSystems Corporation. All rights reserved. |

UIMA standard offers interoperability & scalability for NLP application developers

NLP tech 1 NLP tech 2

InterSystems IRIS: UIMA Integration

NLP tech 1 NLP tech 2UIMA UIMA UIMANLP tech 1 NLP tech 2

Page 33: An InterSystems Guide to the Data Galaxy · An InterSystems Guide to the Data ... through its Resilient Distributed Dataset abstraction ... InterSystems IRIS UIMA Integration marries

InterSystems IRIS: UIMA Integration

InterSystems IRIS UIMA Integration marries data platform and UIMA data processing

infrastructure for increased developer productivity and a quicker path to analytics

NLP tech 1 NLP tech 2NLP tech 1 NLP tech 2UIMANLP tech 1 ISC NLP

~

InterSystems IRIS Data Platform

AnnotationsSource Text

Page 34: An InterSystems Guide to the Data Galaxy · An InterSystems Guide to the Data ... through its Resilient Distributed Dataset abstraction ... InterSystems IRIS UIMA Integration marries

Conclusion

An InterSystems Guide to the Data Galaxy

Page 35: An InterSystems Guide to the Data Galaxy · An InterSystems Guide to the Data ... through its Resilient Distributed Dataset abstraction ... InterSystems IRIS UIMA Integration marries

38 | © InterSystems Corporation. All rights reserved. |

Conclusions

The InterSystems IRIS Data Platform is an Open Analytics Platform, helping customers

leverage the tools they know on a platform that scales:

• Through embedded technologies for exploration & analytics

• Through connectors for best-of-breed third-party technologies, allowing those tools to

leverage our platform in a way that optimizes overall system performance

• Through integration with relevant industry standards, enabling customers to leverage models

and components created in dedicated tools as part of new solutions

Page 36: An InterSystems Guide to the Data Galaxy · An InterSystems Guide to the Data ... through its Resilient Distributed Dataset abstraction ... InterSystems IRIS UIMA Integration marries

39 | © InterSystems Corporation. All rights reserved. |

Related Sessions

InterSystems IRIS: Database

• What’s Lurking in your Data Lake? (Tue, 3PM)

• We Want More, Solving Scalability (Tue, 4PM; Wed, 11AM)

• Speeding up JDBC (Wed, 10:30AM, flash talk)

InterSystems IRIS: Spark Connector

• Apache Spark Experience lab (runs daily)

InterSystems IRIS: Text Analytics

• iKnow: Treating Patients with iKnow and REST (Mon, 3:30PM)

• iFind: Catching Bad Guys with iFind and REST (Tue, 4PM)

• UIMA: Annotations on the Sofa (Wed, 9:45AM, flash talk)

Page 37: An InterSystems Guide to the Data Galaxy · An InterSystems Guide to the Data ... through its Resilient Distributed Dataset abstraction ... InterSystems IRIS UIMA Integration marries

Please complete the survey in the event app.

Session recording and slides will be available at: http://Learning.InterSystems.com Search for “Global Summit 2017”

Visit the Tech Exchange to learn more!

Page 38: An InterSystems Guide to the Data Galaxy · An InterSystems Guide to the Data ... through its Resilient Distributed Dataset abstraction ... InterSystems IRIS UIMA Integration marries

Thank you.