Post on 05-Jul-2015
Sympathy for Data
Creating FOSS in an enterprise environment
Stefan Larsson Combine AB
E-mail: stefan.larsson@combine.se Twitter: @lastsys
FSCONS 2014 / 2014-11-01
Outline
• Background and problem description
• Technology overview
• Demonstration
• Future and conclusion
Background and Problem Description
Spreading local innovation is difficult in a large organization
Management
  Unit 1
  Unit 2
    Dept 2.1
      Section 2.1.1
        Group 2.1.1.1, Group 2.1.1.2
      Section 2.1.2
        Group 2.1.2.1, Group 2.1.2.2
    Dept 2.2
      Section 2.2.1
        Group 2.2.1.1, Group 2.2.1.2
      Section 2.2.2
        Group 2.2.2.1, Group 2.2.2.2
    Dept 2.3
  Unit 3
Employee, Employee
In 2009 we started coding during evenings and weekends
Ensure ownership, or make an agreement with your employer first!
We decided to ask our employer for funding through paid time
Selling Arguments
Company Lawyers
Maintenance Ensure Function
Ownership
Code Contribution
Warranty and Responsibility
”Big Data” is a recent marketing gimmick; engineers have lived with it for decades
Issue | Details
Volume | Storage, memory and distribution.
Velocity | Rapid results from data and data generation rate.
Variety | Many different data sources and data structures.
Veracity | Truth or accuracy of data.
Business Intelligence evolving into Data Science
[Figure: business value (low to high) against time (past to future). Labels: Business Intelligence, Data Science, Retrospective, Forward thinking.]
Redrawn from ”Big Data - Understanding how data powers big business” by Bill Schmarzo, Wiley, 2013
It is easy to get stuck in ”why”
[Figure: business value (low to high) against analytics sophistication (reporting, analysis, action). Questions shown in the figure:]
• What should I do next?
• What result should I expect?
• What if trends continue?
• Why did this happen?
• How did we do?
• How many, how often, where?
Redrawn from ”Big Data - Understanding how data powers big business” by Bill Schmarzo, Wiley, 2013
”Data Science” can be much more complex than BI
Business Intelligence: Well Formed Data Source -> ETL -> Analyze -> Report
Data Science: Unstructured Data Sources (three shown) -> ELT -> Analysis / Modelling -> Report / Prediction -> Action
Engineers are usually not software developers, but can have great scripting skills
[Figure: Data 1, Data 2 and Data 3 pass through a chain of scripts (data import, clean and group data, analyze data, visualize / report result) with intermediate files between the steps, ending in conclusions / actions. Labels: Extract, Load, Transform; 80-90% of the work.]
Those engineers who are uncomfortable with writing scripts tend to use Microsoft Excel for everything
[Figure: Data 1, Data 2 and Data 3 are brought into Excel and turned into a result through manual labor (copy/paste, mouse, keyboard). The label ”No reader” appears twice.]
With independent work the individual data formats are often incompatible
[Figure: Engineer 1, Engineer 2 and Engineer 3 each build their own chain (data import, clean and group data, analyze data, visualize / report result) for Data 1, Data 2 and Data 3. Label: 80-90% of the work.]
Well-defined data formats at the inputs and outputs of operations simplify reuse of scripts
[Figure: Data 1, Data 2 and Data 3 enter one chain of data import, clean and group data, and visualize / report result, with a separate analyze data step for each of Engineer 1, Engineer 2 and Engineer 3. Label: 80-90% of the work.]
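The idea of declaring a data format on every input and output can be sketched in a few lines of Python. This is only an illustration: the `Operation` class, the `connect` helper and the format names (`csv-text`, `table`, `number`) are hypothetical and are not taken from Sympathy for Data.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Operation:
    """A reusable step with a declared input and output format (hypothetical)."""
    name: str
    input_format: str
    output_format: str
    run: Callable


def connect(*operations):
    """Chain operations, refusing any pair whose formats do not match."""
    for upstream, downstream in zip(operations, operations[1:]):
        if upstream.output_format != downstream.input_format:
            raise TypeError(
                "%s produces %r but %s expects %r"
                % (upstream.name, upstream.output_format,
                   downstream.name, downstream.input_format))

    def pipeline(data):
        for operation in operations:
            data = operation.run(data)
        return data

    return pipeline


# Illustrative operations; the format names are made up for this sketch.
import_csv = Operation(
    "import", "csv-text", "table",
    lambda text: [line.split(",") for line in text.splitlines()])
clean = Operation(
    "clean", "table", "table",
    lambda rows: [row for row in rows if all(cell.strip() for cell in row)])
count_rows = Operation("analyze", "table", "number", len)

flow = connect(import_csv, clean, count_rows)
row_count = flow("a,1\nb,\nc,3")  # the row with an empty cell is dropped
```

Because the mismatch is detected when the operations are connected, an engineer can swap in a colleague's analysis step without first running the whole chain to find out whether the data fits.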
The Pareto Principle states that 20% of the work solves 80% of the problem; we are attacking the ELT problem
Basic Requirement | Advantage | Challenge
Isolated execution environment. | Guarantee functionality. | Design environment(s).
Data type system for inputs and outputs. | Well defined data. | Design type system.
Library of reusable operations. | Saving time and improving quality of operations. | Granularity of operations.
Graphical editor to build data flow graphs. | No coding knowledge required for user. | Visualization and user interaction concepts.
The Result Became ”Sympathy for Data”
Technology Overview
The platform is based on Python
• Python 2.7 with NumPy and SciPy as a foundation.
  • Easy for Matlab users to convert.
  • Plenty of computational and plotting libraries to choose from.
• HDF5 for storage of intermediate data.
  • Easy to read subsets of data.
• User Interface: PySide (Qt)
  • Started in C++ but switched to Python for faster development.
• No feedback loops in flows, just list recursion.
  • Type system since tables are not enough.
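A flow without feedback loops is a directed acyclic graph, so an execution order always exists. The sketch below uses Kahn's algorithm to find one and to reject a flow that does contain a loop. The `execution_order` function and the block names are hypothetical; the slides do not say how Sympathy for Data orders its blocks.

```python
from collections import deque


def execution_order(flow):
    """Return block names so that every block comes after its inputs.

    `flow` maps each block name to the list of blocks it reads from;
    every block must appear as a key. Raises ValueError if the flow
    contains a feedback loop.
    """
    pending = {block: len(inputs) for block, inputs in flow.items()}
    consumers = {block: [] for block in flow}
    for block, inputs in flow.items():
        for source in inputs:
            consumers[source].append(block)

    # Blocks without inputs can run immediately.
    ready = deque(sorted(b for b, count in pending.items() if count == 0))
    order = []
    while ready:
        block = ready.popleft()
        order.append(block)
        for consumer in consumers[block]:
            pending[consumer] -= 1
            if pending[consumer] == 0:
                ready.append(consumer)

    # Any block left over is part of, or downstream of, a loop.
    if len(order) != len(flow):
        raise ValueError("flow contains a feedback loop")
    return order


flow = {
    "import": [],
    "clean": ["import"],
    "analyze": ["clean"],
    "report": ["analyze", "clean"],
}
order = execution_order(flow)
```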
We work with text and tables in combination with containers
Data: Text, Table (in the future: image, sound, etc.)
Containers: List, Record (Named Tuple), Dictionary (String Keys)
Example of types
type1: (desc: text, data: [table], prop: { (f1: text, f2: table) })
type2: (desc: text, content: [type1])
Record with fields ’desc’, ’data’ and ’prop’.
type1 is referred to in type2.
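As a rough illustration, the two example types can be mirrored with Python's `typing.NamedTuple`. This is a sketch and not the platform's own type system: the `Table` alias (column name to list of values), the class names and the sample values are assumptions made for the example.

```python
from typing import Dict, List, NamedTuple

Text = str
# Stand-in for a table: column name -> list of values (an assumption).
Table = Dict[str, List[float]]


class Fields(NamedTuple):
    """The anonymous record (f1: text, f2: table) stored inside 'prop'."""
    f1: Text
    f2: Table


class Type1(NamedTuple):
    """type1: (desc: text, data: [table], prop: { (f1: text, f2: table) })"""
    desc: Text
    data: List[Table]
    prop: Dict[str, Fields]


class Type2(NamedTuple):
    """type2: (desc: text, content: [type1])"""
    desc: Text
    content: List[Type1]


# Made-up sample values, only to show how the containers nest.
speed = {"time": [0.0, 1.0], "speed": [10.0, 12.5]}
run = Type1(
    desc="test drive",
    data=[speed],
    prop={"calibration": Fields(f1="sensor A", f2={"gain": [1.02]})},
)
campaign = Type2(desc="autumn campaign", content=[run, run])
```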
We are using separate worker processes for each block
Scheduler
Worker 1 Worker 2 Worker 3 Worker 4
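The sketch below shows one way to run every block in its own process, here with the standard `subprocess` module and JSON passed over stdin/stdout. It is an assumption-laden illustration: `run_block`, `run_flow` and the idea of describing a block as a Python expression are hypothetical, and this minimal scheduler runs the blocks one after another, whereas a real scheduler could start independent blocks at the same time.

```python
import json
import subprocess
import sys

# Source run inside each worker: read JSON from stdin, evaluate the block
# (given as a Python expression over `data`), write JSON to stdout.
# Purely illustrative; not how Sympathy for Data defines its blocks.
WORKER_TEMPLATE = (
    "import json, sys\n"
    "data = json.load(sys.stdin)\n"
    "json.dump({expression}, sys.stdout)\n"
)


def run_block(expression, data):
    """Run one block in its own Python process and return its result."""
    completed = subprocess.run(
        [sys.executable, "-c", WORKER_TEMPLATE.format(expression=expression)],
        input=json.dumps(data),
        capture_output=True,
        text=True,
        check=True,  # a crashing block raises instead of passing bad data on
    )
    return json.loads(completed.stdout)


def run_flow(blocks, data):
    """Minimal scheduler: run the blocks in sequence, one process each."""
    for expression in blocks:
        data = run_block(expression, data)
    return data


blocks = [
    "[x for x in data if x is not None]",  # clean
    "[x * 2 for x in data]",               # transform
    "sum(data)",                           # analyze
]
result = run_flow(blocks, [1, None, 2, 3])
```

Keeping each block in a separate process means that a crash or a memory leak in one block cannot take down the scheduler or the other blocks.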
Demonstration
Future and Conclusion
To sum up, Sympathy for Data was born since nothing fulfilled our needs
• Existing solutions found on the market only work with well-formed tables.
• Evaluated software requires data to be preprocessed.
• Faster and cheaper to adapt our own platform for our needs.
• Many engineers are not ”multi-instrumentalists”.
• And of course, personal interest and commitment.
Sympathy for Data is currently powering several customer applications
• Automation of manual ELT workflows with heterogeneous data sources.
• Failure/warranty prediction.
• Replacing existing outdated Matlab scripts.
And recycling code between applications is working well…
We still need to work on some important areas
• Mature development environment for blocks.
• Improve support for interactive work.
• Clean up library with ”Any”-type.
• Introduce type for functions.
• Higher-order functions — develop for singular case, scale to plural.
• Improve performance.
• Polish, polish, polish… The software is still quite rough.
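The ”develop for singular case, scale to plural” item can be illustrated with a small higher-order function that turns an operation on one item into an operation on a list of items. The names `lift` and `mean` are hypothetical and only serve this sketch; the slides do not describe how the feature would look in Sympathy for Data.

```python
def lift(operation):
    """Turn an operation on one item into an operation on a list of items."""
    def lifted(items):
        return [operation(item) for item in items]
    return lifted


def mean(column):
    """Written and tested for the singular case: one column of numbers."""
    return sum(column) / len(column)


mean_of_each = lift(mean)                # plural: a list of columns
mean_of_each_group = lift(mean_of_each)  # a list of lists of columns

columns = [[1.0, 2.0, 3.0], [10.0, 20.0]]
means = mean_of_each(columns)
```

The block author only writes and tests `mean`; applying `lift` once or twice scales it to lists and nested lists without touching the original code.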