sympathy for data

Post on 05-Jul-2015

Category:

Data & Analytics

DESCRIPTION

Presentation of the tool Sympathy for Data at FSCONS 2014 in Gothenburg, Sweden.

TRANSCRIPT

Sympathy for Data: Creating FOSS in an enterprise environment

Stefan Larsson Combine AB

E-mail: stefan.larsson@combine.se Twitter: @lastsys

FSCONS 2014 / 2014-11-01

Outline

• Background and problem description

• Technology overview

• Demonstration

• Future and conclusion

Background and Problem Description

Spreading local innovation is difficult in a large organization

Management

Unit 1 Unit 2

Dept 2.1

Section 2.1.1

Group 2.1.1.1 Group 2.1.1.2

Section 2.1.2

Group 2.1.2.1 Group 2.1.2.2

Dept 2.2

Section 2.2.1

Group 2.2.1.1 Group 2.2.1.2

Section 2.2.2

Group 2.2.2.1 Group 2.2.2.2

Dept 2.3

Unit 3

Employee Employee

In 2009 we started coding during evenings and weekends

Ensure ownership! or

Make an agreement with your employer first!

We decided to ask our employer for funding through paid time

Selling Arguments

Company Lawyers

Maintenance Ensure Function

Ownership Code Contribution

Warranty and Responsibility

”Big Data” is a recent marketing gimmick; engineers have lived with it for decades

Issue: Details

Volume: Storage, memory and distribution.

Velocity: Rapid results from data and data generation rate.

Variety: Many different data sources and data structures.

Veracity: Truth or accuracy of data.

Business Intelligence evolving into Data Science

[Figure: business value (low to high) plotted against time (past to future). Labels: Business Intelligence, Data Science; Retrospective, Forward thinking.]

Redrawn from ”Big Data - Understanding how data powers big business” by Bill Schmarzo, Wiley, 2013

It is easy to get stuck in ”why”

[Figure: business value (low to high) plotted against analytics sophistication, with the stages Reporting, Analysis and Action.]

Questions shown in the figure:

• What should I do next?

• What result should I expect?

• What if trends continue?

• Why did this happen?

• How did we do?

• How many, how often, where?

Redrawn from ”Big Data - Understanding how data powers big business” by Bill Schmarzo, Wiley, 2013

”Data Science” can be much more complex than BI

Business Intelligence: Well Formed Data Source -> ETL -> Analyze -> Report

Data Science: Unstructured Data Sources (three) -> ELT -> Analysis / Modelling -> Report / Prediction -> Action

Engineers are usually not software developers, but can have great scripting skills

Data 1

Data 2

Data 3

Data import script

File

Clean and group data script

File File

Analyze data script

Visualize / report result script

File

80-90% of the work

Conclusions / Actions

Extract, Load, Transform

Those engineers who are uncomfortable with writing scripts tend to use Microsoft Excel for everything

Data 1

Data 2

Data 3

Excel

Copy/Paste

Mouse

Manual labor

Keyboard

Result

No reader

No reader

With independent work the individual data formats are often incompatible

Data 1

Data 2

Data 3

Engineer 1: Data import -> Clean and group data -> Analyze data -> Visualize / report result

Engineer 2: Data import -> Clean and group data -> Analyze data -> Visualize / report result

Engineer 3: Data import -> Clean and group data -> Analyze data -> Visualize / report result

80-90% of the work

Well-defined data formats at the inputs and outputs of operations simplify reuse of scripts

Data 1

Data 2

Data 3

Analyze data

Data import -> Clean and group data -> Analyze data -> Visualize / report result

Analyze data

Engineer 1

Engineer 2

Engineer 3

80-90% of the work

The Pareto Principle states that 20% of the work solves 80% of the problem; we are attacking the ELT problem

Basic Requirement | Advantage | Challenge

Isolated execution environment. | Guarantee functionality. | Design environment(s).

Data type system for inputs and outputs. | Well defined data. | Design type system.

Library of reusable operations. | Saving time and improving quality of operations. | Granularity of operations.

Graphical editor to build data flow graphs. | No coding knowledge required for user. | Visualization and user interaction concepts.

The Result Became ”Sympathy for Data”

Technology Overview

The platform is based on Python

• Python 2.7 with NumPy and SciPy as a foundation.

    • Easy for Matlab users to convert.

    • Plenty of computational and plotting libraries to choose from.

• HDF5 for storage of intermediate data.

    • Easy to read subsets of data.

• User Interface: PySide (Qt)

    • Started in C++ but switched to Python for faster development rate.

• No feedback loops in flows, just list recursion.

• Type system since tables are not enough.
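
As an illustration of the "no feedback loops" point: if a flow is a directed acyclic graph of blocks, a valid execution order can be found with a topological sort, and a feedback loop shows up as an error. The sketch below is not taken from Sympathy for Data; the block names and the `execution_order` function are hypothetical, and the slides do not say how the tool itself orders blocks.

```python
from collections import deque


def execution_order(edges):
    """Return an order in which blocks can run, or raise if the flow has a loop.

    `edges` maps each block name to the blocks that consume its output.
    (Hypothetical representation, chosen only for this sketch.)
    """
    # Count how many inputs each block is still waiting for.
    indegree = {node: 0 for node in edges}
    for targets in edges.values():
        for target in targets:
            indegree[target] = indegree.get(target, 0) + 1

    # Blocks without inputs can run immediately.
    ready = deque(sorted(n for n, d in indegree.items() if d == 0))
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for target in edges.get(node, ()):
            indegree[target] -= 1
            if indegree[target] == 0:
                ready.append(target)

    # Any block never reached sits on a cycle, i.e. a feedback loop.
    if len(order) != len(indegree):
        raise ValueError("flow contains a feedback loop")
    return order


# Hypothetical four-block flow mirroring the import/clean/analyze/report steps.
flow = {
    "import": ["clean"],
    "clean": ["analyze", "report"],
    "analyze": ["report"],
    "report": [],
}
order = execution_order(flow)
```

With this flow, `order` lists the import block first and the report block last; a flow such as `{"a": ["b"], "b": ["a"]}` raises `ValueError`.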

We work with text and tables in combination with containers

Data Containers

Text

Table

List

Record (Named Tuple)

Dictionary (String Keys)

In the future: image, sound, etc.

Example of types

type1: (desc: text, data: [table], prop: { (f1: text, f2: table) })

type2: (desc: text, content: [type1])

Record with fields ’desc’, ’data’ and ’prop’.

type1 is referred to in type2.
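
To make the two example types concrete, here is one way they could be modelled with Python's standard `typing` module. This is a sketch, not Sympathy for Data's actual type system: `Text`, `Table`, `Prop`, `Type1` and `Type2` are names invented here, and a table is assumed to be a simple mapping from column name to a list of numbers.

```python
from typing import Dict, List, NamedTuple

# Stand-ins (assumptions): text is a plain string, a table maps
# column names to columns of numbers.
Text = str
Table = Dict[str, List[float]]


class Prop(NamedTuple):
    """Record (f1: text, f2: table), the value type of the 'prop' dictionary."""
    f1: Text
    f2: Table


class Type1(NamedTuple):
    """type1: (desc: text, data: [table], prop: { (f1: text, f2: table) })"""
    desc: Text
    data: List[Table]        # list container
    prop: Dict[str, Prop]    # dictionary with string keys


class Type2(NamedTuple):
    """type2: (desc: text, content: [type1])"""
    desc: Text
    content: List[Type1]     # type2 refers to type1


# Illustrative values only; not real measurement data.
sample = Type1(
    desc="run 1",
    data=[{"t": [0.0, 1.0], "v": [2.5, 2.7]}],
    prop={"sensor": Prop(f1="volts", f2={"gain": [1.0]})},
)
collection = Type2(desc="all runs", content=[sample])
```

Records map onto named tuples, lists onto `List[...]` and dictionaries onto `Dict[str, ...]`, which follows the container list on the slide.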

We are using separate worker processes for each block

Scheduler

Worker 1 Worker 2 Worker 3 Worker 4
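
The idea of running every block in its own worker process can be sketched with the standard library alone. The code below is an assumption-laden illustration, not the tool's scheduler: the block sources in `BLOCKS` are hypothetical one-liners, JSON over stdin/stdout stands in for the HDF5 intermediate files mentioned earlier, and blocks run one after another rather than across four parallel workers.

```python
import json
import subprocess
import sys

# Hypothetical blocks: each one is a tiny Python program that reads JSON
# from stdin and writes JSON to stdout.
BLOCKS = {
    "import": "import json, sys; json.dump([3, 1, 2], sys.stdout)",
    "clean": "import json, sys; json.dump(sorted(json.load(sys.stdin)), sys.stdout)",
    "analyze": (
        "import json, sys; d = json.load(sys.stdin); "
        "json.dump({'max': max(d)}, sys.stdout)"
    ),
}


def run_block(name, payload):
    """Run one block in a separate interpreter process and return its output."""
    proc = subprocess.run(
        [sys.executable, "-c", BLOCKS[name]],
        input=json.dumps(payload),
        capture_output=True,
        text=True,
        check=True,  # a crashing block raises instead of corrupting the flow
    )
    return json.loads(proc.stdout)


def run_flow(order):
    """Minimal sequential scheduler: feed each block's output to the next."""
    data = None
    for name in order:
        data = run_block(name, data)
    return data


result = run_flow(["import", "clean", "analyze"])
```

Because each block lives in its own process, a failure or a conflicting library in one block cannot take down the scheduler or the other blocks, which is the isolation benefit listed among the basic requirements.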

Demonstration

Future and Conclusion

To sum up, Sympathy for Data was born because nothing fulfilled our needs

• Existing solutions found on the market only work with well-formed tables.

• Evaluated software requires data to be preprocessed.

• Faster and cheaper to adapt our own platform for our needs.

• Many engineers are not ”multi-instrumentalists”.

• And of course, personal interest and commitment.

Sympathy for Data is currently powering several customer applications

• Automation of manual ELT workflows with heterogeneous data sources.

• Failure/warranty prediction.

• Replacing existing outdated Matlab scripts.

And recycling code between applications is working well…

We still need to work on some important areas

• Mature development environment for blocks.

• Improve support for interactive work.

• Clean up library with ”Any”-type.

• Introduce type for functions.

• Higher-order functions — develop for singular case, scale to plural.

• Improve performance.

• Polish, polish, polish… The software is still quite rough.
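
The "develop for singular case, scale to plural" item can be illustrated with an ordinary higher-order function that turns a block written for one table into a block for a list of tables. This is only a sketch of the idea; `lift` and `column_mean` are hypothetical names, and a table is assumed to be a mapping from column name to a list of numbers.

```python
from typing import Callable, List, TypeVar

A = TypeVar("A")
B = TypeVar("B")


def lift(block: Callable[[A], B]) -> Callable[[List[A]], List[B]]:
    """Turn a block for one item into a block for a list of items."""
    def plural(items: List[A]) -> List[B]:
        return [block(item) for item in items]
    return plural


def column_mean(table):
    """Singular case: mean of every column in one table."""
    return {name: sum(col) / len(col) for name, col in table.items()}


# Plural case: the same block applied to a list of tables.
means = lift(column_mean)([{"v": [1.0, 3.0]}, {"v": [2.0, 4.0]}])
```

The block author only writes and tests `column_mean`; the list version comes for free.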
