Put Down That Checkbook! - Big Data without the Big Bucks


Upload: charlie-greenbacker

Post on 26-Jan-2015





DESCRIPTION

Hiring data scientists sure is expensive. One way to afford top talent is to stop throwing your money away on costly "big data" software that over-promises and under-delivers. This talk will offer an opinionated definition of data science, argue why free & open source software is usually the right choice for data scientists, and describe some of the leading free & open source software tools for data science available today.

TRANSCRIPT

Page 1: Put Down That Checkbook! - Big Data without the Big Bucks

Put Down That Checkbook! Big Data without the Big Bucks

Charlie Greenbacker
Director of Data Science

Altamira Technologies Corporation


Agenda

•  What is a Data Scientist?
•  Why use Open Source Software (OSS)?
•  Survey of OSS Tools for Data Science


About me: @greenbacker
Theories: popular tripe
Methods: sloppy
Conclusions: highly questionable

photo: Columbia Pictures


Best reason for not finishing PhD


@ExploreAltamira


WHAT IS A DATA SCIENTIST?


credit: Drew Conway (http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram)


“A data scientist is someone who understands the domains of programming, machine learning, data mining, statistics, and hacking”

Paul Cooper, ITProPortal.com http://www.itproportal.com/2014/02/11/how-to-pick-a-data-scientist-the-right-way/


Computer Programming

Mathematics & Analytic Methodology

Distributed Computing & Big Data

Data Science

•  Statistical Analysis
•  Data Mining
•  Machine Learning
•  Natural Language Processing
•  Social Network Analysis
•  Data Visualization
•  etc.

Domain Knowledge & Communication Skills

Altamira Technologies Corporation 2014


WHY USE OSS?


What is Open Source Software (OSS)?

The Open Source Definition:

1.  Free Redistribution
2.  Source Code
3.  Derived Works

more: opensource.org


WHY USE OSS?


photo: Karen (https://flic.kr/p/5njby2)

THERE ARE NO SILVER BULLETS.


photo: Paul Inkles (https://flic.kr/p/e2QMS5)

IF YOUR BOSS BUYS SOMETHING, YOU DAMN WELL BETTER USE IT.


photo: Valugi (http://bit.ly/1jrvVBC)

BUDGETS DON’T SCALE.


SURVEY OF OSS TOOLS FOR DATA SCIENCE


Statistical Analysis

Name: R
Creator: Gentleman, Ihaka, et al.
License: GPL Version 2
Website: r-project.org
Source: cran.us.r-project.org/src/base/
Features:
–  Language & environment for statistical computing & viz
–  Linear and nonlinear modeling, classical statistical tests, time-series analysis, graphical techniques, and more…
–  5000+ packages available in CRAN repository


Data Mining

Name: Pandas
Creator: Wes McKinney, et al.
License: BSD 3-Clause License
Website: pandas.pydata.org
Source: github.com/pydata/pandas
Features:
–  Data analysis workflow in Python
–  DataFrame object for fast manipulation & indexing
–  Tools for reading & writing data between formats
–  Label-based slicing, indexing, and subsetting of data
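
A minimal sketch of the DataFrame features listed above, assuming pandas is installed. The rows reuse tool names and licenses from this survey; the variable names are illustrative only and are not from the talk.

```python
import io

import pandas as pd

# Rows reuse tool names, categories, and licenses from the survey slides.
df = pd.DataFrame(
    {
        "tool": ["R", "Pandas", "Impala", "Mahout"],
        "category": [
            "Statistical Analysis",
            "Data Mining",
            "Data Mining",
            "Machine Learning",
        ],
        "license": [
            "GPL Version 2",
            "BSD 3-Clause",
            "Apache License 2.0",
            "Apache License 2.0",
        ],
    }
).set_index("tool")

# Label-based indexing: select one row by its index label.
pandas_row = df.loc["Pandas"]

# Label-based slicing: unlike positional slices, .loc includes both endpoints.
middle = df.loc["Pandas":"Impala", ["category"]]

# Subsetting with a boolean mask: keep only the Apache-licensed tools.
apache_tools = df[df["license"] == "Apache License 2.0"].index.tolist()

# Reading & writing between formats: round-trip the table through CSV text.
csv_text = df.to_csv()
restored = pd.read_csv(io.StringIO(csv_text), index_col="tool")

print(pandas_row["license"])
print(middle)
print(apache_tools)
print(restored.index.tolist())
```

The same `.loc` syntax covers single labels, label ranges, and column lists, which is what makes label-based access convenient for exploratory work.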


Data Mining

Name: Impala
Creator: Cloudera
License: Apache License 2.0
Website: impala.io
Source: github.com/cloudera/impala
Features:
–  MPP query engine implemented on Hadoop
–  Low latency, high concurrency SQL & BI queries
–  Same interfaces as Apache Hive, but ~24x faster
–  Written in C++; does not use MapReduce


Machine Learning

Name: Mahout
Creator: ASF
License: Apache License 2.0
Website: mahout.apache.org
Source: svn.apache.org/viewvc/mahout
Features:
–  Distributed/scalable ML library for Hadoop
–  Classification, Clustering, Collaborative filtering
–  Logistic regression, naïve Bayes, random forest, neural networks, HMM, k-means, SVD, PCA, ALS, LDA, etc.


Machine Learning

Name: Scikit-learn
Creator: Cournapeau, et al.
License: BSD 3-Clause License
Website: scikit-learn.org
Source: github.com/scikit-learn/scikit-learn
Features:
–  ML library for Python built on NumPy, SciPy, matplotlib
–  Support for classification, clustering, dimensionality reduction, regression, model selection, preprocessing
–  SVM, k-NN, PCA, NNMF, crossval, feature extraction, ...
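
A hedged sketch of how several of these pieces (preprocessing, PCA, k-NN, cross-validation) can be chained together. It assumes a current scikit-learn release, where cross-validation helpers live in `sklearn.model_selection`, and uses the iris dataset bundled with the library. The particular pipeline and its parameters are illustrative choices, not something taken from the talk.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small dataset shipped with scikit-learn, so no download is needed.
X, y = load_iris(return_X_y=True)

# Preprocessing -> dimensionality reduction -> classification, as one estimator.
# n_components=2 and n_neighbors=5 are arbitrary illustrative settings.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    KNeighborsClassifier(n_neighbors=5),
)

# 5-fold cross-validation; each fold refits the whole pipeline on its training split.
scores = cross_val_score(model, X, y, cv=5)
mean_accuracy = scores.mean()

print("fold accuracies:", scores)
print("mean accuracy:", mean_accuracy)
```

Wrapping the steps in a pipeline keeps the scaler and PCA from seeing the held-out fold during fitting, which is the main reason to prefer it over transforming the data once up front.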


Machine Learning + NLP

Name: Mallet
Creator: UMass (McCallum, et al.)
License: Common Public License 1.0
Website: mallet.cs.umass.edu
Source: hg-iesl.cs.umass.edu/hg/mallet
Features:
–  Java-based “Machine Learning for Language Toolkit”
–  Document classification, clustering, topic modeling, information extraction & sequence tagging, etc.
–  Efficient implementation of LDA for topic modeling


Natural Language Processing

Name: NLTK
Creator: Bird, Loper, et al.
License: Apache License 2.0
Website: nltk.org
Source: github.com/nltk/nltk
Features:
–  Natural Language Toolkit for Python
–  Built-in support for dozens of corpora & trained models
–  Libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning
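
A small sketch of the tokenization and stemming features, assuming NLTK is installed. It sticks to `wordpunct_tokenize` and `PorterStemmer` because, as far as I know, neither needs a separate corpus or model download; taggers and parsers generally do. The sample sentence is just the feature bullet above.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

# Sample text: the feature bullet from the slide above.
text = (
    "Libraries for classification, tokenization, stemming, "
    "tagging, parsing, and semantic reasoning"
)

# Regex-based tokenizer: splits into runs of word characters and runs of punctuation.
tokens = wordpunct_tokenize(text)

# Drop punctuation tokens, keeping alphabetic words only.
words = [t for t in tokens if t.isalpha()]

# Porter stemming strips common English suffixes to approximate a word's root.
stemmer = PorterStemmer()
stems = [stemmer.stem(w) for w in words]

for word, stem in zip(words, stems):
    print(word, "->", stem)
```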


Natural Language Processing

Name: Stanford CoreNLP
Creator: Stanford NLP Group
License: GPL Version 2
Website: nlp.stanford.edu/software/corenlp.shtml
Source: github.com/stanfordnlp/CoreNLP
Features:
–  Suite of high-quality, Java-based NLP tools
–  Includes POS tagger, named entity recognizer, parser, coreference resolution, sentiment analysis, SUTime, etc.
–  Includes models for English, Chinese, Arabic, German


NLP + Geospatial Analysis

Name: CLAVIN
Creator: Berico Technologies
License: Apache License 2.0
Website: clavin.io
Source: github.com/Berico-Technologies/CLAVIN
Features:
–  Extracts location names from text, resolves to gazetteer
–  Employs context-based geospatial entity resolution
–  ~75% accuracy, processes 1M documents per hour
–  Built on Hadoop, CoreNLP, OpenNLP, GeoNames.org


Social Network Analysis

Name: NetworkX
Creator: Los Alamos National Lab
License: BSD 3-Clause License
Website: networkx.github.io
Source: github.com/networkx/networkx
Features:
–  Python structures for graphs, digraphs, & multigraphs
–  Support for creating, manipulating, & analyzing the structure, dynamics, & functions of complex networks
–  Provides standard graph algorithms & analysis metrics
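
A minimal sketch of building a graph and computing a few standard metrics, assuming NetworkX is installed. The five-person network and its names are hypothetical, made up purely for illustration.

```python
import networkx as nx

# Hypothetical toy social network: a triangle (ann, bob, cat) with a tail (dan, eve).
G = nx.Graph()
G.add_edges_from(
    [
        ("ann", "bob"),
        ("ann", "cat"),
        ("bob", "cat"),
        ("cat", "dan"),
        ("dan", "eve"),
    ]
)

# Betweenness: share of shortest paths between other nodes that pass through a node.
betweenness = nx.betweenness_centrality(G)

# Closeness: inverse of the average shortest-path distance to every other node.
closeness = nx.closeness_centrality(G)

# Clustering coefficient: fraction of a node's neighbor pairs that are themselves linked.
clustering = nx.clustering(G)

# A standard graph algorithm: shortest path between two nodes.
shortest = nx.shortest_path(G, "ann", "eve")

# The node that most often sits between others.
broker = max(betweenness, key=betweenness.get)

print("broker:", broker)
print("closeness:", closeness)
print("clustering:", clustering)
print("ann -> eve:", shortest)
```

Here `cat` bridges the tight triangle and the tail, so it scores highest on betweenness even though its clustering coefficient is lower than that of `ann` or `bob`.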


Social Network Analysis

Name: Gephi
Creator: UTC France
License: GPL Version 3
Website: gephi.org
Source: github.com/gephi/gephi
Features:
–  Network analysis and visualization package for Java
–  Dynamic network analysis with temporal filtering
–  Metrics include: community detection, betweenness, closeness, clustering coefficient, PageRank, etc.


Data Visualization

Name: D3.js
Creator: Mike Bostock
License: BSD 3-Clause License
Website: d3js.org
Source: github.com/mbostock/d3
Features:
–  JavaScript library based on HTML, SVG, and CSS
–  Binds data to DOM & enables transformations
–  ~200 examples, including: force-directed graphs, choropleths, treemaps, dendrograms, animations, etc.


Fusion, Analysis, and Visualization

Name: Lumify
Creator: Altamira
License: Apache License 2.0
Website: lumify.io
Source: github.com/altamiracorp/lumify
Features:
–  Built on Hadoop, Storm, Accumulo, Elasticsearch, etc.
–  Integrates structured data, text, images, video
–  Cell-level security & access controls
–  Live, shared collaborative workspaces


Final Thought…

Save your $$$ for:

People
–  salaries, training, etc.

Resources
–  hardware, AWS, etc.

Proprietary software
–  if no viable OSS alternative exists

photo: Brett Weinstein (http://bit.ly/1dHXvqJ)

Springer’s FINAL THOUGHT


open source software for data scientists

oss4ds.com


Charlie Greenbacker @greenbacker | oss4ds.com