sympathy for data

Post on 05-Jul-2015

DESCRIPTION

Presentation of the tool Sympathy for Data at FSCONS 2014 in Gothenburg, Sweden.

TRANSCRIPT

  • Sympathy for Data: Creating FOSS in an enterprise environment

    Stefan Larsson Combine AB

    E-mail: stefan.larsson@combine.se Twitter: @lastsys

    FSCONS 2014 / 2014-11-01

  • Outline

    Background and problem description

    Technology overview

    Demonstration

    Future and conclusion

  • Background and Problem Description

  • Spreading local innovation is difficult in a large organization

    [Figure: organization chart. Management sits above Unit 1, Unit 2 and Unit 3. Unit 2 contains Dept 2.1, Dept 2.2 and Dept 2.3; the departments contain sections (2.1.1, 2.1.2, 2.2.1, 2.2.2), each section contains two groups, and individual employees sit at the bottom.]

  • In 2009 we started coding during evenings and weekends

    Ensure ownership, or make an agreement with your employer first!

  • We decided to ask our employer for funding through paid time

    Selling arguments: maintenance, ensure function.

    Company lawyers: ownership, code contribution, warranty and responsibility.

  • Big Data is a recent marketing gimmick; engineers have lived with it for decades

    Issue     Details
    Volume    Storage, memory and distribution.
    Velocity  Rapid results from data and data generation rate.
    Variety   Many different data sources and data structures.
    Veracity  Truth or accuracy of data.

  • Business Intelligence evolving into Data Science

    [Figure: business value (low to high) plotted against time (past to future). Business Intelligence is retrospective; Data Science is forward thinking.]

    Redrawn from Big Data - Understanding how data powers big business by Bill Schmarzo, Wiley, 2013

  • It is easy to get stuck in why

    [Figure: business value (low to high) plotted against analytics sophistication, which runs from Reporting through Analysis to Action. The questions on the chart, listed from the Action end down to the Reporting end:]

    What should I do next?
    What result should I expect?
    What if trends continue?
    Why did this happen?
    How did we do?
    How many, how often, where?

    Redrawn from Big Data - Understanding how data powers big business by Bill Schmarzo, Wiley, 2013

  • Data Science can be much more complex than BI

    [Figure: two pipelines compared.]

    Business Intelligence: Well Formed Data Source -> ETL -> Analyze -> Report

    Data Science: Unstructured Data Sources (several) -> ELT -> Analysis / Modelling -> Report / Prediction -> Action

  • Engineers are usually not software developers, but can have great scripting skills

    [Figure: Data 1, Data 2 and Data 3 -> data import script -> file -> clean and group data script -> files -> analyze data script -> visualize / report result script -> file -> conclusions / actions. The extract, load and transform stages are marked as 80-90% of the work.]

  • Those engineers who are uncomfortable with writing scripts tend to use Microsoft Excel for everything

    [Figure: Data 1, Data 2 and Data 3 are brought into Excel through copy/paste, mouse and keyboard (manual labor) to produce a result. Labels: "No reader" (twice).]

  • With independent work the individual data formats are often incompatible

    [Figure: Engineer 1, Engineer 2 and Engineer 3 each build their own chain (data import -> clean and group data -> analyze data -> visualize / report result) on Data 1, Data 2 and Data 3. The label "80-90% of the work" marks part of the chain.]

  • Well-defined data formats at inputs and outputs of operations simplify reuse of scripts

    [Figure: Data 1, Data 2 and Data 3 feed a single shared chain (data import -> clean and group data -> analyze data -> visualize / report result) used by Engineer 1, Engineer 2 and Engineer 3, with further "Analyze data" steps attached. The label "80-90% of the work" marks part of the chain.]
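
One way to picture the idea on this slide: if every operation declares the data type it accepts and the type it produces, a chain can be checked before it runs, and one engineer's operation can be dropped into another engineer's flow. The following is a minimal Python sketch under that assumption. The `Operation` class, the `connect` helper and the type names "text" and "table" are hypothetical and are not taken from Sympathy for Data.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Operation:
    """A reusable step with declared input and output types (hypothetical)."""
    name: str
    input_type: str
    output_type: str
    func: Callable


def connect(*operations):
    """Chain operations, rejecting any pair whose port types do not match."""
    for upstream, downstream in zip(operations, operations[1:]):
        if upstream.output_type != downstream.input_type:
            raise TypeError(
                f"{upstream.name} outputs {upstream.output_type}, "
                f"but {downstream.name} expects {downstream.input_type}"
            )

    def flow(data):
        for operation in operations:
            data = operation.func(data)
        return data

    return flow


# A table is modelled here as a dict of column name -> list of values.
import_csv = Operation(
    "import", "text", "table",
    lambda text: {"x": [float(v) for v in text.split(",")]},
)
clean = Operation(
    "clean", "table", "table",
    lambda table: {"x": [v for v in table["x"] if v >= 0]},
)
mean = Operation(
    "analyze", "table", "table",
    lambda table: {"mean": [sum(table["x"]) / len(table["x"])]},
)

pipeline = connect(import_csv, clean, mean)
result = pipeline("1,-5,3")
```

Connecting the steps in an order whose types do not line up, for example `connect(clean, import_csv)`, fails with a `TypeError` before any data is processed.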

  • The Pareto Principle states that 20% of the work solves 80% of the problem; we are attacking the ELT problem

    Basic requirement: Isolated execution environment.
      Advantage: Guarantee functionality.
      Challenge: Design environment(s).

    Basic requirement: Data type system for inputs and outputs.
      Advantage: Well defined data.
      Challenge: Design type system.

    Basic requirement: Library of reusable operations.
      Advantage: Saving time and improving quality of operations.
      Challenge: Granularity of operations.

    Basic requirement: Graphical editor to build data flow graphs.
      Advantage: No coding knowledge required for user.
      Challenge: Visualization and user interaction concepts.

  • The Result Became Sympathy for Data

  • Technology Overview

  • The platform is based on Python

    Python 2.7 with NumPy and SciPy as a foundation.

      Easy for Matlab users to convert.

      Plenty of computational and plotting libraries to choose from.

    HDF5 for storage of intermediate data.

      Easy to read subsets of data.

    User Interface: PySide (Qt)

      Started in C++ but switched to Python for faster development rate.

    No feedback loops in flows, just list recursion.

      Type system since tables are not enough.
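
A flow without feedback loops is a directed acyclic graph, so a valid execution order can always be found by topological sorting. The sketch below illustrates that property with Python's standard `graphlib` module (Python 3.9 or later). The block names and the `execution_order` helper are made up for this example, and nothing here is meant to describe how Sympathy for Data orders its blocks internally.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical flow: each block maps to the set of blocks it depends on.
flow = {
    "import": set(),
    "clean": {"import"},
    "analyze": {"clean"},
    "report": {"analyze"},
}


def execution_order(graph):
    """Return a valid execution order; raise ValueError on a feedback loop."""
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as error:
        # The second element of args holds the nodes that form the cycle.
        raise ValueError(f"feedback loop in flow: {error.args[1]}") from error


order = execution_order(flow)
```

A graph such as `{"a": {"b"}, "b": {"a"}}` contains a feedback loop and is rejected instead of being scheduled.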

  • We work with text and tables in combination with containers

    Data: Text, Table

    Containers: List, Record (Named Tuple), Dictionary (String Keys)

    In the future: image, sound, etc.

  • Example of types

    type1: (desc: text, data: [table], prop: { (f1: text, f2: table) })

    type2: (desc: text, content: [type1])

    type1 is a record with fields desc, data and prop.

    type1 is referenced in type2.
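
For readers who think in Python, the two example types can be approximated with the standard `typing` module. This is only a sketch. It assumes that `( ... )` denotes a record, `[ ... ]` a list and `{ ... }` a dictionary with string keys, following the container list on the previous slide. The class names `Prop`, `Type1` and `Type2` and the dict-of-columns stand-in for a table are inventions for this example.

```python
from typing import Dict, List, NamedTuple

Text = str
Table = Dict[str, list]  # stand-in for a table: column name -> values


class Prop(NamedTuple):
    """The anonymous record (f1: text, f2: table) stored inside prop."""
    f1: Text
    f2: Table


class Type1(NamedTuple):
    desc: Text
    data: List[Table]
    prop: Dict[str, Prop]


class Type2(NamedTuple):
    desc: Text
    content: List[Type1]  # type2 refers to type1


item = Type1(
    desc="run 1",
    data=[{"t": [0, 1], "v": [2.5, 2.7]}],
    prop={"meta": Prop(f1="sensor A", f2={"gain": [1.0]})},
)
batch = Type2(desc="all runs", content=[item])
```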

  • We are using separate worker processes for each block

    Scheduler

    Worker 1 Worker 2 Worker 3 Worker 4
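
The scheduler-and-workers idea can be sketched with the standard library alone. In the example below every block is a small script that runs in its own Python process, and a minimal scheduler passes the output of one block to the next. Several things are assumptions made for the sake of a short example: the block scripts, the `run_block` and `run_flow` helpers, the use of JSON over pipes for data exchange (the slides mention HDF5 for intermediate data), and running the blocks one after another rather than on four parallel workers.

```python
import json
import subprocess
import sys

# Hypothetical blocks: each is a small script that reads JSON from stdin
# and writes JSON to stdout.
CLEAN_BLOCK = """
import json, sys
values = json.load(sys.stdin)
json.dump([v for v in values if v is not None], sys.stdout)
"""

SUM_BLOCK = """
import json, sys
values = json.load(sys.stdin)
json.dump(sum(values), sys.stdout)
"""


def run_block(source, data):
    """Run one block in its own Python process and return its output."""
    completed = subprocess.run(
        [sys.executable, "-c", source],
        input=json.dumps(data),
        capture_output=True,
        text=True,
        check=True,  # a crashing block raises CalledProcessError here
    )
    return json.loads(completed.stdout)


def run_flow(blocks, data):
    """Minimal scheduler: run the blocks in order, one worker process each."""
    for source in blocks:
        data = run_block(source, data)
    return data


total = run_flow([CLEAN_BLOCK, SUM_BLOCK], [1, None, 2, 3])
```

Because each block lives in a separate process, a block that crashes surfaces as an exception in the scheduler instead of taking the whole application down with it.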

  • Demonstration

  • Future and Conclusion

  • To sum up, Sympathy for Data was born since nothing fulfilled our needs

    Existing solutions found on the market only work with well-formed tables.

    Evaluated software requires data to be preprocessed.

    Faster and cheaper to adapt our own platform for our needs.

    Many engineers are not multi-instrumentalists.

    And of course: personal interest and commitment.

  • Sympathy for Data is currently powering several customer applications

    Automation of manual ELT-workflows with heterogeneous data sources.

    Failure/warranty prediction.

    Replacing existing outdated Matlab scripts.

  • And recycling code between applications is working well

  • We still need to work on some important areas

    Mature development environment for blocks.

    Improve support for interactive work.

    Clean up library with Any-type.

    Introduce type for functions.

    Higher-order functions: develop for the singular case, scale to the plural.

    Improve performance.

    Polish, polish, polish. The software is still quite rough.
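
The higher-order function item in the list above can be illustrated in a few lines of plain Python: an operation is written once for a single table and then lifted to work on a list of tables. This is a sketch of the general idea only. The `apply_to_list` and `drop_negative` names and the dict-of-columns table are hypothetical and say nothing about how the feature would look in Sympathy for Data.

```python
def apply_to_list(operation):
    """Lift an operation on one item into an operation on a list of items."""
    def lifted(items):
        return [operation(item) for item in items]
    return lifted


def drop_negative(table):
    """Singular case: remove negative values from every column of one table."""
    return {
        name: [v for v in column if v >= 0]
        for name, column in table.items()
    }


# Plural case: the same operation, now applied to a list of tables.
drop_negative_all = apply_to_list(drop_negative)

tables = [{"x": [1, -2, 3]}, {"x": [-1, 4]}]
cleaned = drop_negative_all(tables)
```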