When Big Data Meet Python @ COSCUP 2012
2012
When Big Data Meet Python
Jimmy Lai (賴弘哲)
2012/08/19
Slides: http://www.slideshare.net/jimmy_lai/when-big-data-meet-python
When big data meet python by Jimmy Lai is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
About Me
• 賴弘哲 (Jimmy Lai)
• Interests: Data mining, Machine Learning, Natural Language Processing, Distributed Computing, Python
• LinkedIn profile: http://goo.gl/XTEM5
• Currently at Oxygen Intelligence Taiwan Limited (引京聚點知識結構搜索公司), working on semantic analysis of big data
Outline
1. Big Data
a. Concept
b. Technical issues
2. Big Data + Python
a. Related open source tools
b. Example
Benefits of Big Data
1. Creating transparency
2. Enabling experimentation to discover needs, expose variability, and improve performance
3. Segmenting populations to customize actions
4. Replacing/supporting human decision making with automated algorithms
5. Innovating new business models, products and services
(May 2011). Big Data: The next frontier for innovation, competition, and productivity. McKinsey Global Institute.
e.g. http://www.data.gov/
Shortage of talent with deep data analysis skills
Initiative from the White House
• (Mar 2012) Big Data Research and Development Initiative, the White House.
• National Science Foundation encourages education on Big Data.
• The government invests in developing state-of-the-art technologies, harnessing those technologies, and expanding the workforce for Big Data.
Big Data Issues
• Collecting
– User Generated Content
– Machine Generated Data
• Storage
• Computing
• Analysis
• Visualization
Big Data Techniques: Collecting
• Crawler
– Collect raw data
– E.g. Heritrix, Nutch
• Scraping
– Parse information from raw data
– E.g. Yahoo! Pipes, Scrapy
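The split between crawling (collecting raw data) and scraping (parsing information out of it) can be sketched with Python's standard library alone. This is a minimal illustration, not how Heritrix, Nutch, or Scrapy work internally. `RAW_PAGE`, `LinkAndTitleScraper`, and `scrape` are hypothetical names, and the HTML string stands in for a page that a crawler would have fetched.

```python
from html.parser import HTMLParser

# Hypothetical raw data that a crawler might have collected.
RAW_PAGE = """
<html><head><title>Food blog</title></head>
<body><a href="/post/1">Ramen review</a> <a href="/post/2">Dumpling review</a></body></html>
"""


class LinkAndTitleScraper(HTMLParser):
    """Parse the page title and all link targets out of raw HTML."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text inside <title> is collected.
        if self._in_title:
            self.title += data


def scrape(raw_html):
    """Turn raw HTML into a structured record."""
    scraper = LinkAndTitleScraper()
    scraper.feed(raw_html)
    return {"title": scraper.title.strip(), "links": scraper.links}


result = scrape(RAW_PAGE)
```

The extracted links are what a crawler would queue next, while the title is an example of the structured information a scraper keeps.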
Big Data Techniques: Storage
• Big Table
– Distributed key-value storage
– E.g. HBase, Cassandra
• NoSQL
– Does not use SQL for manipulation
– Does not use the relational database model
– E.g. MongoDB, Redis, CouchDB
Big Data Techniques: Computing
• Batch
– MapReduce
– E.g. Hadoop
• Real-time
– Stream processing
– E.g. S4, Storm
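To make the batch model concrete, here is a minimal single-process sketch of the three MapReduce phases (map, shuffle, reduce) applied to word counting. It only illustrates the programming model; a framework such as Hadoop runs the same phases distributed across many machines. All function names are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data meet python", "python meet big data tools"])
```

Because each map call sees only one record and each reduce call sees only one key, both phases can be parallelized without shared state.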
Big Data Techniques: Analysis
• Data mining – Weka
• Machine learning – scikit-learn
• Natural language processing – NLTK, Stanford NLP
• Statistics – R
Big Data Techniques: Visualization
• Abstract
• Interactive
• E.g. Processing, Gephi, D3.js
Why Python?
• Good code readability for fast development.
• Scripting language: less code, higher productivity.
• Fast-growing among open source communities.
– Commit statistics from ohloh.net
When Big Data meet Python
• Collecting (User Generated Content, Machine Generated Data)
– Scrapy: scraping framework
• Storage
– PyMongo: Python client for MongoDB
• Computing
– Hadoop streaming: Linux pipe interface
– Disco: lightweight MapReduce in Python
• Analysis
– Pandas: data analysis/manipulation
– Statsmodels: statistics
– NLTK: natural language processing
– Scikit-learn: machine learning
• Visualization
– Matplotlib: plotting
– NetworkX: graph visualization
(Diagram label: Infrastructure)
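The "Linux pipe interface" of Hadoop streaming means that the mapper and reducer are ordinary programs that read lines and write tab-separated key/value lines, with a sort by key in between. The sketch below mimics that contract in one process. `mapper`, `reducer`, and `simulate` are hypothetical names, and `simulate` is only a local stand-in for `cat input | mapper | sort | reducer`, not a Hadoop API.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts of each word; the input must already be sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def simulate(lines):
    """Local stand-in for the pipe: mapper output is sorted, then reduced."""
    return list(reducer(sorted(mapper(lines))))


output = simulate(["big data", "big python"])
```

In a real streaming job each function would live in its own script, reading from standard input and printing to standard output, and the framework would perform the sort.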
When Big Data meet Python: Collecting
Scrapy: web scraping framework
• Simple and extensible
• Components:
– Scheduler
– Downloader
– Spider (Scraper)
– Item pipeline
http://scrapy.org/
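How the four components cooperate can be sketched as a toy crawl loop. This is not Scrapy's API: every name below is hypothetical, and `FAKE_WEB` is an in-memory stand-in for real pages so that the sketch runs without network access.

```python
from collections import deque

# Hypothetical in-memory "web" so the sketch runs without network access.
FAKE_WEB = {
    "/": {"text": "home", "links": ["/a", "/b"]},
    "/a": {"text": "ramen review", "links": ["/"]},
    "/b": {"text": "dumpling review", "links": ["/a"]},
}


def downloader(url):
    """Downloader: fetch the raw response for a URL."""
    return FAKE_WEB[url]


def spider(url, response):
    """Spider: extract an item and the links to follow from a response."""
    item = {"url": url, "text": response["text"]}
    return item, response["links"]


def item_pipeline(item):
    """Item pipeline: post-process each scraped item."""
    item["text"] = item["text"].title()
    return item


def crawl(start_url):
    """Scheduler: queue URLs, skip ones already seen, and drive the other parts."""
    queue, seen, items = deque([start_url]), {start_url}, []
    while queue:
        url = queue.popleft()
        item, links = spider(url, downloader(url))
        items.append(item_pipeline(item))
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return items


items = crawl("/")
```

Keeping the components separate is what makes such a framework extensible: the spider can change per site while scheduling, downloading, and post-processing stay the same.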
When Big Data meet Python: Storage
MongoDB: NoSQL database
• PyMongo: client for Python
• Document (JSON)-oriented
• No schema
• Scalable
– Auto-sharding
– Replica set
• File storage
• MapReduce aggregation
http://www.mongodb.org/
When Big Data meet Python: Computing
Disco
• Distributed computing:
– MapReduce
– Disco distributed file system
• Write code in Python
– Easy and fast to profile
– Easy and fast to debug
http://discoproject.org/
When Big Data meet Python: Analysis
Pandas
• Data analysis library
• Data structures for fast data manipulation
– Slicing
– Indexing
– Subsetting
• Handling missing data
• Aggregation
• Time series
http://pandas.pydata.org/
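The features listed for pandas can be tried on a tiny table. The sketch below assumes pandas is installed; the column names and ratings are made-up sample data, not taken from the talk.

```python
import pandas as pd

# Made-up sample data: one missing rating to demonstrate missing-data handling.
df = pd.DataFrame({
    "date": pd.to_datetime(["2012-08-01", "2012-08-01", "2012-08-02", "2012-08-02"]),
    "restaurant": ["A", "B", "A", "B"],
    "rating": [4.0, None, 5.0, 3.0],
})

# Subsetting: keep only rows whose rating is above 3.5.
high = df[df["rating"] > 3.5]

# Handling missing data: replace the missing rating with the column mean.
filled = df["rating"].fillna(df["rating"].mean())

# Aggregation: mean rating per restaurant (missing values are skipped).
mean_by_restaurant = df.groupby("restaurant")["rating"].mean()

# Time series: index by date and compute the daily mean rating.
daily = df.set_index("date")["rating"].resample("D").mean()
```

Each step returns a new object and leaves `df` unchanged, which makes it easy to chain or inspect intermediate results.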
When Big Data meet Python: Analysis
Statsmodels
• Statistical analysis
• Statistical models
• Fit data with model
• Statistical tests
• Data exploration
• Time series analysis
http://statsmodels.sourceforge.net/
When Big Data meet Python: Analysis
scikit-learn
• Machine learning algorithms
• Supervised learning
• Unsupervised learning
• Dataset
• Preprocessing
• Feature extraction
• Model
• Selection
• Pipeline
http://scikit-learn.org/
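Feature extraction, a supervised model, and a pipeline can be combined in a few lines. The sketch below assumes scikit-learn is installed. The four training sentences and their labels are made-up toy data, so the classifier is only an illustration of the workflow.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up toy reviews with positive/negative labels.
train_texts = [
    "great food tasty",
    "delicious great service",
    "terrible food cold",
    "awful slow service",
]
train_labels = ["pos", "pos", "neg", "neg"]

# Pipeline: bag-of-words feature extraction followed by a Naive Bayes classifier.
model = Pipeline([
    ("features", CountVectorizer()),
    ("classifier", MultinomialNB()),
])
model.fit(train_texts, train_labels)

prediction = model.predict(["great tasty noodles"])[0]
```

Wrapping both steps in one `Pipeline` ensures that new text passes through the same feature extraction that was fitted on the training data.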
When Big Data meet Python: Analysis
NLTK: Natural Language Toolkit
• Natural language processing
• Annotated corpora and resources
• Information extraction workflow:
1. Sentence Segmentation
2. Tokenization
3. POS tagging
4. Named Entity Recognition
5. Relation Recognition
http://nltk.org/
When Big Data meet Python: Visualization
Matplotlib
• Plotting
– Histograms
– Power spectra
– Bar charts
– Error charts
– Scatter plots
• Full control over plotting details
http://matplotlib.sourceforge.net/
When Big Data meet Python: Visualization
NetworkX
• Graph algorithms and visualization
• Draw graph with layout:
– Circular
– Random
– Spectral
– Spring
– Shell
– Graphviz
http://networkx.lanl.gov/
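A graph algorithm and several of the listed layouts can be combined as follows. The sketch assumes NetworkX and NumPy are installed; the graph is made-up sample data. Each layout function returns a mapping from node to a 2D position, which a drawing call can then use.

```python
import networkx as nx

# Made-up sample graph linking dishes and ingredients.
G = nx.Graph()
G.add_edges_from([
    ("ramen", "noodles"),
    ("ramen", "broth"),
    ("dumpling", "noodles"),
])

# Graph algorithm: shortest path between two nodes.
path = nx.shortest_path(G, "broth", "dumpling")

# Layouts: each maps every node to an (x, y) position.
layouts = {
    "circular": nx.circular_layout(G),
    "spring": nx.spring_layout(G, seed=42),  # seed fixes the random start
    "shell": nx.shell_layout(G),
}
```

With Matplotlib available, passing one of these mappings as `pos` to `nx.draw(G, pos=layouts["spring"])` renders the graph in that layout.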
聚寶評 www.ezpao.com
Food search engine
Searches food reviews across major blogs
聚寶評 www.ezpao.com
Semantic analysis search engine
Analysis of dishes shared by users
Positive/negative review analysis
Review topic analysis
Thank you for your attention. Q & A
We are hiring!
• Core engine algorithm R&D engineer
• System R&D engineer
• Web application R&D engineer
Oxygen Intelligence Taiwan Limited
引京聚點 知識結構搜索股份有限公司
• About the company: http://www.ezpao.com/about/
• Job openings: http://www.ezpao.com/join/
• Please send your résumé to [email protected]