Data Analysis with Python – openwith.net (데이터분석201511v2 Part5.pdf)

2015.11 – 윤형기 (hky@openwith.net) – Big Data Analytics Training (2015-11)

Page 1:

Data Analysis with Python

2015.11

윤형기 (hky@openwith.net)

Big Data Analytics Training (2015-11)

Page 2:

RECOMMENDER SYSTEMS WITH PYTHON

Page 3:

Recommender Systems

Page 4:

Overview

Page 5:

Everything is a recommendation

• Information Overload
  – "People read around 10 MB worth of material a day, hear 400 MB a day, and see 1 MB of information every second" – The Economist, November 2006
  – In 2015, consumption will rise to 74 GB a day – UCSD study, 2014

Page 6:

The value of recommendations

• Netflix: 2/3 of the movies watched are recommended

• Google News: recommendations generate 38% more clickthrough

• Amazon: 35% sales from recommendations

• Choicestream: 28% of the people would buy more music if they found what they liked.

Page 7:

The “Recommender problem”

• Let C be the set of all users and S the set of all possible recommendable items.

• Let u be a utility function measuring the usefulness of item s to user c, i.e., u : C × S → R, where R is a totally ordered set.

• For each user c ∈ C, we want to choose the items s ∈ S that maximize u.

  – Utility is usually represented by a rating, but can be any function (see the toy sketch below).
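As a literal reading of this definition, the following toy sketch (a hypothetical utility table u, not from the slides) picks, for each user, the item that maximizes u(c, s):

u = {('c1', 's1'): 4, ('c1', 's2'): 2, ('c2', 's1'): 1, ('c2', 's2'): 5}  # toy u(c, s) values
users = {c for c, s in u}
items = {s for c, s in u}
# for each user c, choose the item s maximizing u(c, s); unrated pairs count as -inf
best = {c: max(items, key=lambda s: u.get((c, s), float('-inf'))) for c in users}
print(best)  # {'c1': 's1', 'c2': 's2'}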

Page 8:

Approaches

• User/item-based
  – User-based: find users similar to me and recommend what they liked
  – Item-based: find items similar to those I have previously liked

• Content-based: recommend based on item features

• Others
  – Hybrid: combine any of the above
  – Personalized Learning to Rank: treat recommendation as a ranking problem
  – Demographic: recommend based on user features
  – Social recommendations (trust-based)
  – Deep Learning
  – Context-aware recommendation

Page 9:

Recommender Systems in General

• The core of the recommendation engine can be viewed as a general data mining problem:

Page 10:

Basic Steps of CF:

1. Identify the set of ratings for the target/active user
2. Identify the set of users most similar to the target/active user according to a similarity function (neighborhood formation)
3. Identify the products these similar users liked
4. Generate a prediction – the rating the target user would give – for each of these products
5. Based on the predicted ratings, recommend a set of the top-N products (a sketch follows below)
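The slides show no code for these steps; a minimal user-based CF sketch along these lines (toy ratings dictionary, Pearson similarity via numpy – an illustration, not the course's implementation) might look like:

import numpy as np

ratings = {  # user -> {item: rating}; hypothetical toy data
    'A': {'i1': 5, 'i2': 3, 'i3': 4},
    'B': {'i1': 3, 'i2': 1, 'i4': 3},
    'C': {'i1': 4, 'i3': 5, 'i4': 2},
}

def pearson(u, v):
    # step 2: similarity over the items both users rated
    common = set(ratings[u]) & set(ratings[v])
    if len(common) < 2:
        return 0.0
    x = np.array([ratings[u][i] for i in common], dtype=float)
    y = np.array([ratings[v][i] for i in common], dtype=float)
    if x.std() == 0 or y.std() == 0:
        return 0.0
    return float(np.corrcoef(x, y)[0, 1])

def predict(target, item):
    # steps 3-4: weight neighbors' mean-centered ratings by similarity
    num = den = 0.0
    mean_t = np.mean(list(ratings[target].values()))
    for other in ratings:
        if other == target or item not in ratings[other]:
            continue
        sim = pearson(target, other)
        mean_o = np.mean(list(ratings[other].values()))
        num += sim * (ratings[other][item] - mean_o)
        den += abs(sim)
    return mean_t + num / den if den else mean_t

# step 5: rank the items the target user has not yet rated
unseen = {'i4'} - set(ratings['A'])
print(sorted(unseen, key=lambda i: predict('A', i), reverse=True))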

Page 11:

Personalised vs Non-Personalised CF

• CF recommendations are personalized since the “prediction” is based on the ratings expressed by similar users – Those neighbors are different for each target user

• A non-personalized collaborative-based recommendation can be generated by averaging the recommendations of ALL the users

• How would the two approaches compare?

Page 12:

User-User Collaborative Filtering

Page 13:

UB Collaborative Filtering

• A collection of users u_i, i = 1, …, n and a collection of products p_j, j = 1, …, m

• An n × m matrix of ratings v_ij, with v_ij = ? (missing) if user i did not rate product j

• The prediction for user i and product j is then computed from the neighbors' ratings
  – similarity can be computed by Pearson correlation; the standard formulas are reproduced below
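The slide's formulas did not survive extraction; the standard Pearson similarity and mean-centered prediction usually shown at this point (sums over co-rated items j and over neighbors k, respectively) are:

$$\operatorname{sim}(a,b) = \frac{\sum_{j}(v_{aj}-\bar v_a)(v_{bj}-\bar v_b)}{\sqrt{\sum_{j}(v_{aj}-\bar v_a)^2}\,\sqrt{\sum_{j}(v_{bj}-\bar v_b)^2}} \qquad \hat v_{ij} = \bar v_i + \frac{\sum_{k}\operatorname{sim}(i,k)\,(v_{kj}-\bar v_k)}{\sum_{k}\lvert\operatorname{sim}(i,k)\rvert}$$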

Page 14:

Item-Item Collaborative Filtering

Page 15:

Item Similarity Computation

• Correlation-based similarity – uses the Pearson-r correlation, computed only over the users who rated both item i and item j (written out below).

  – R_u,i = rating of user u on item i
  – R̄_i = average rating of the i-th item
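Using the definitions above, and restricting the sums to the set U of users who rated both items, the Pearson-r item similarity is conventionally written:

$$\operatorname{sim}(i,j) = \frac{\sum_{u\in U}(R_{u,i}-\bar R_i)(R_{u,j}-\bar R_j)}{\sqrt{\sum_{u\in U}(R_{u,i}-\bar R_i)^2}\,\sqrt{\sum_{u\in U}(R_{u,j}-\bar R_j)^2}}$$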

Page 16:

DATA ANALYSIS WITH PYTHON

Page 17:

ipython

• Concept – a command shell for interactive computing (supports multiple programming languages)

• Features
  – introspection
  – additional shell features (terminal- and Qt-based)
  – tab completion & history, rich media

• IPython Notebook
  – browser-based coding, math, inline plots, rich media
  – interactive data visualization and integration with GUI toolkits
  – flexible, embeddable interpreters to load into one's own projects
  – performance tools for parallel computing

• Profiling & optimization (example below)
  – %time, %timeit in IPython
  – %prun: profile a statement with cProfile
  – %run -p: profile whole programs
  – the line_profiler module, for line-by-line timing
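A small, hypothetical target function makes the magics above concrete (run inside an IPython session; %lprun assumes the line_profiler extension is installed):

import numpy as np

def slow_sum(n):
    # deliberately naive loop so the profilers have something to show
    total = 0.0
    for i in range(n):
        total += np.sqrt(i)
    return total

# In an IPython session:
#   %time   slow_sum(100000)              # one-shot wall/CPU time
#   %timeit slow_sum(100000)              # repeated, averaged timing
#   %prun   slow_sum(100000)              # cProfile breakdown by function
#   %load_ext line_profiler
#   %lprun -f slow_sum slow_sum(100000)   # line-by-line timing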

Page 18:

numpy

• Concept – extends Python with numerical data processing (example below)

• Key features – large, multi-dimensional arrays and matrices, plus high-level mathematical functions to operate on them

• Background – evolved from Numeric
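A short illustration of the points above (multi-dimensional arrays plus vectorized, high-level math):

import numpy as np

a = np.arange(12).reshape(3, 4)   # a 3x4 array
print(a.mean(axis=0))             # column means, computed without a loop
print(a.dot(a.T))                 # matrix product (3x3)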

Page 19:

Page 20:

matplotlib

• Concept – a plotting library for Python and its NumPy arrays

• Key features – an object-oriented API for embedding plots into applications (example below)
  • works with general-purpose GUI toolkits (wxPython, Qt, or GTK+)

• pylab
  – a state-machine interface (cf. OpenGL)
  – similar to MATLAB
  – SciPy makes use of matplotlib
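A minimal example in the object-oriented style mentioned above (the pylab state-machine interface would instead call plt.plot directly):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)
fig, ax = plt.subplots()          # explicit figure/axes objects (OO API)
ax.plot(x, np.sin(x), label='sin(x)')
ax.set_xlabel('x')
ax.legend()
plt.show()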

Page 21:

scipy

• Concept – an open-source Python library for science and analytics

• NumPy and scipy
  – built on top of the NumPy array object
  – part of the NumPy stack, together with matplotlib, pandas and SymPy

• Main contents (example below)
  – optimization, linear algebra, integration, interpolation, special functions, FFT, signal and image processing, ODE solvers
  – other tools for science and engineering work

• License – BSD license
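A small taste of the routines listed above, assuming a standard scipy installation:

import numpy as np
from scipy import integrate, optimize

area, err = integrate.quad(np.sin, 0, np.pi)   # numerical integral of sin on [0, pi] = 2
root = optimize.brentq(np.cos, 0, 3)           # root of cos(x) near pi/2
print(area, root)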

Page 22:

pandas

• Concept – a software library for data analysis in Python
  – data munging/preparation/cleaning/integration
  – a rich data manipulation tool (built on NumPy)
  – fast, intuitive data structures
  – a middle ground between Python and a DSL such as R (?)
  – similar to R's data.frame
  – easy-to-use, highly consistent API

Page 23:

Details

• Key features (a group-by example follows below)
  – the DataFrame object – data analysis with integrated indexing
  – support for many formats (CSV, text files, Excel, SQL databases, HDF5)
  – integrated handling of data alignment and missing data
  – reshaping and pivoting of data sets
  – label-based slicing, indexing and subsetting of large data sets
  – aggregating/transforming data with a group-by engine for split-apply-combine operations on data sets
  – hierarchical axis indexing
  – time-series functionality
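A toy sketch of the split-apply-combine ("group by") engine and pivoting mentioned above (hypothetical sales data):

import pandas as pd

df = pd.DataFrame({
    'dept':  ['a', 'a', 'b', 'b'],
    'year':  [2014, 2015, 2014, 2015],
    'sales': [10, 12, 7, 9],
})
print(df.groupby('dept')['sales'].agg(['mean', 'sum']))        # split-apply-combine
print(df.pivot(index='dept', columns='year', values='sales'))  # reshape/pivot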

Page 24:

The pandas Data Model

• Series (small demo below)
  – a 1D labeled numpy array – subclass of numpy.ndarray
  – data: any dtype
  – index labels need not be ordered; duplicates are possible (but result in reduced functionality)

• DataFrame
  – a 2D table with row and column labels – potentially heterogeneous columns
  – ndarray-like, but not an ndarray; each column can have its own dtype
  – row and column index
  – size mutable: columns can be inserted and deleted
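The data model in a few lines (label indexing, per-column dtypes, size mutability):

import pandas as pd

s = pd.Series([1.0, 2.0], index=['x', 'y'])          # 1D with index labels
df = pd.DataFrame({'n': [1, 2], 'tag': ['a', 'b']})  # int and object columns side by side
df['new'] = df['n'] * 10                             # size-mutable: insert a column
print(s['x'], df.dtypes, sep='\n')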

Page 25:

Hands-on Practice

• Basics

• Applications

Page 26:

• Analysis of structured data – databases (SQL, NoSQL) – machine learning, mining – mathematical/statistical analysis

• Analysis of unstructured data – re, pawk, etc. – www.nltk.org – …

• Big data – various approaches

Page 27:

Other Key Issues

Page 28:

Python and R

• Strengths of R:
  – specialized for math and statistics (a DSL)
  – some 5,000 specialized packages

• Strengths of Python:
  – a powerful general-purpose language
    • a full object-oriented language (OOL)
    • dynamic typing
    • …
  – some 50,000 packages and extensive frameworks

• Difficulties of integration
  – differences in design philosophy
  – from the Python side: how to implement a more pythonic approach
  – namespaces, etc.

Page 29:

• Various attempts
  – (1) Rserve
    • a TCP/IP server for R
    • lets various clients access R (e.g., C/C++/C#/Ruby, …)
    • through pyRserve, a Python client can call R directly
    • R code can call back into Python

  – (2) rPython
    • calls Python from R, e.g.:
      python.call( "len", 1:3 )
      a <- 1:4
      b <- 5:8
      python.exec( "def concat(a,b): return a+b" )
      python.call( "concat", a, b )

  – (3) rpy2 (a sketch follows below)
    • calls R from Python
    • grew out of rpy
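A minimal rpy2 sketch of direction (3) – Python calling R – assuming R and the rpy2 package are installed:

import rpy2.robjects as robjects

robjects.r('x <- rnorm(100)')        # evaluate R code from Python
mean_fn = robjects.r['mean']         # look up an R function by name
print(mean_fn(robjects.r['x'])[0])   # call it on an R vector; result is an R vector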

Page 30:

Python and Big Data

• Background
  – Big Data & Hadoop
  – using Jython programs
  – using Hadoop Streaming
  – source: http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/

• Python MapReduce programming overview
  – environment: Hadoop on Linux
  – data: the WordCount example applied to Gutenberg texts (https://www.gutenberg.org/)

Page 31:

• Method – Hadoop Streaming
  • treats every Hadoop job as standard I/O (stdin, stdout)
  • (http://hadoop.apache.org/docs/r1.1.2/streaming.html#Hadoop+Streaming)
  • a utility bundled with Hadoop – a program written in any language can be used as a Hadoop Map/Reduce job; that is, a Python job reads its input from sys.stdin and writes its results to sys.stdout.

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \

-input myInputDirs \

-output myOutputDir \

-mapper /bin/cat \

-reducer /bin/wc

Page 32:

• Linux shell – the command interpreter
  – Standard I/O
    • when a command runs, a process is created, and that process opens 3 streams:
    • stdin – standard input reads the input data
    • stdout – standard output writes the output data
    • stderr – standard error writes the error messages
  – redirection and pipes

Page 33:

Python MapReduce Code

• Map step (/home/hduser/mapper.py)

#!/usr/bin/env python
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the Reduce step
        print '%s\t%s' % (word, 1)

Page 34:

• Reduce step (/home/hduser/reducer.py)

#!/usr/bin/env python
from operator import itemgetter
import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN, already sorted by key by the Hadoop shuffle
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        # count was not a number: silently discard the line
        continue
    # since the input is sorted, all counts for a given word arrive together
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print '%s\t%s' % (current_word, current_count)
        current_count = count
        current_word = word

# emit the last word, if any
if current_word == word:
    print '%s\t%s' % (current_word, current_count)

Page 35:

• Testing – test the mapper and reducer separately; once verified, run them as a MapReduce job. For example:
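In the style of the tutorial cited earlier, a local smoke test can use a shell pipeline that mimics the Hadoop sort phase (assuming both scripts are executable, e.g. via chmod +x):

$ echo "foo foo quux labs foo bar quux" | /home/hduser/mapper.py
$ echo "foo foo quux labs foo bar quux" | /home/hduser/mapper.py | sort -k1,1 | /home/hduser/reducer.py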

• Running the Python code on Hadoop – (1) copy the data into HDFS

$ bin/hadoop dfs -copyFromLocal /tmp/gutenberg /user/hduser/gutenberg

– (2) run the MR job
$ bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar \

-file /home/hduser/mapper.py -mapper /home/hduser/mapper.py \

-file /home/hduser/reducer.py -reducer /home/hduser/reducer.py \

-input /user/hduser/gutenberg/* -output /user/hduser/gutenberg-output

Page 36:

• Sample run

$ bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar -mapper /home/hduser/mapper.py -reducer /home/hduser/reducer.py -input /user/hduser/gutenberg/* -output /user/hduser/gutenberg-output
additionalConfSpec_:null null=@@@userJobConfProps_.get(stream.shipped.hadoopstreaming
packageJobJar: [/app/hadoop/tmp/hadoop-unjar54543/] [] /tmp/streamjob54544.jar tmpDir=null
[...] INFO mapred.FileInputFormat: Total input paths to process : 7
[...] INFO streaming.StreamJob: getLocalDirs(): [/app/hadoop/tmp/mapred/local]
[...] INFO streaming.StreamJob: Running job: job_200803031615_0021
[...] INFO streaming.StreamJob: map 0% reduce 0%
[...]
[...] INFO streaming.StreamJob: map 100% reduce 100%
[...] INFO streaming.StreamJob: Job complete: job_200803031615_0021
[...] INFO streaming.StreamJob: Output: /user/hduser/gutenberg-output

Page 37:

• Execution results (output screenshot omitted)

Page 38:

Model Evaluation

Page 39:

• MSE (mean squared error) – one common measure of accuracy
  • used in the regression setting

• The catch
  – a method is generally designed to make the MSE small on the training data at hand
  – what matters is how well the method works on new data – "test data"
  – there is no guarantee that the smallest training MSE carries over

• Training vs. test MSE
  – in general, the more flexible a method is, the lower its training MSE will be; i.e., it will "fit" or explain the training data very well
  – however, the test MSE may in fact be higher for a more flexible method than for a simple approach like linear regression

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat y_i\bigr)^2$$

Page 40:

Levels of Flexibility: Example 1

LEFT – Black: truth; Orange: linear estimate; Blue: smoothing spline; Green: smoothing spline (more flexible)

RIGHT – Red: test MSE; Grey: training MSE; Dashed: minimum possible test MSE (irreducible error)

Page 41:

Levels of Flexibility: Example 2

LEFT – Black: truth; Orange: linear estimate; Blue: smoothing spline; Green: smoothing spline (more flexible)

RIGHT – Red: test MSE; Grey: training MSE; Dashed: minimum possible test MSE (irreducible error)

Page 42:

Levels of Flexibility: Example 3

LEFT – Black: truth; Orange: linear estimate; Blue: smoothing spline; Green: smoothing spline (more flexible)

RIGHT – Red: test MSE; Grey: training MSE; Dashed: minimum possible test MSE (irreducible error)

Page 43:

Bias/ Variance Tradeoff

• Two competing forces govern the choice of learning method

• Bias
  – = the error introduced by modeling a real-life problem (usually extremely complicated) with a much simpler model
  – e.g., linear regression assumes a linear relationship between Y and X, which is unlikely in real life
  – the more flexible/complex a method is, the less bias it will generally have

• Variance
  – = how much the estimate of f would change given a different training data set
  – generally, the more flexible a method is, the more variance it has

Page 44:

The Trade-off

It can be shown that for any given X = x₀, the expected test MSE for a new Y at x₀ will be equal to

$$\mathrm{Expected\ Test\ MSE} = E\bigl[(Y - \hat f(x_0))^2\bigr] = \mathrm{Bias}\bigl[\hat f(x_0)\bigr]^2 + \mathrm{Var}\bigl[\hat f(x_0)\bigr] + \sigma^2$$

where σ² is the irreducible error. That is, as the method gets more complex, the bias decreases and the variance increases – but the expected test MSE may go up or down!

Page 45:

– ROC curves (also known as a sensitivity/specificity plot)
  • examine the tradeoff between detecting true positives while avoiding false positives
  • the axes are equivalent to sensitivity and (1 – specificity), respectively
  • (Tip) Best practice is to use AUC in combination with a qualitative examination of the ROC curve.

Page 46:

Estimating future performance

• (Background)
  – to estimate a model's performance on unseen data (the caret package in R)

• The holdout method
  • (Recommended) In addition to the training and test datasets, keep a third validation dataset for iterating and refining the chosen model, leaving the test dataset to be used only once, as a final step, to report an estimated error rate for future predictions.
  • A typical split: training : test : validation = 50% : 25% : 25%.

Page 47:

• Cross-validation (or k-fold CV)
  – randomly divides the data into k completely separate random partitions called folds (most common is 10-fold CV)
  – (Tip) An extreme case of CV is the leave-one-out method.

• Bootstrap sampling
  – uses random samples of the data to estimate properties of a larger set; the results from the various random datasets are then averaged to obtain a final estimate of future performance
  – (vs. k-fold CV) Where CV divides the data into separate partitions, in which each example can appear only once, bootstrapping allows examples to be selected multiple times through sampling with replacement.

Page 48:

Resampling methods

• Tools that involve repeatedly drawing samples from a training set and refitting a model of interest on each sample, in order to obtain more information about the fitted model
  – Model assessment: estimate test error rates
  – Model selection: select the appropriate level of model flexibility

• Computationally expensive – but today we have powerful computers

• Two resampling methods:
  – Cross-validation
  – Bootstrapping

Page 49:

Leave-One-Out Cross Validation (LOOCV)

• Similar to the validation-set approach, but tries to address that approach's disadvantages

• For each suggested model, do:
  – Split the data set of size n into:
    • a training data set (blue) of size n − 1
    • a validation data set (beige) of size 1
  – Fit the model using the training data
  – Validate the model using the validation data, and compute the corresponding MSE
  – Repeat this process n times
  – The MSE for the model is computed as shown in (5.1) below:

[ISLR §5.1, Figure 5.3] A schematic display of LOOCV: a set of n data points is repeatedly split into a training set (shown in blue) containing all but one observation, and a validation set (shown in beige) that contains only that observation. The test error is then estimated by averaging the n resulting MSEs.

A prediction ŷ₁ is made for the excluded observation, using its value x₁. Since (x₁, y₁) was not used in the fitting process, MSE₁ = (y₁ − ŷ₁)² provides an approximately unbiased estimate for the test error. But even though MSE₁ is unbiased, it is a poor estimate because it is highly variable, being based on a single observation. We can repeat the procedure by selecting (x₂, y₂) for the validation data, training on the remaining n − 1 observations and computing MSE₂; repeating this approach n times produces n squared errors, MSE₁, …, MSEₙ. The LOOCV estimate for the test MSE is the average of these n test error estimates:

$$\mathrm{CV}_{(n)} = \frac{1}{n}\sum_{i=1}^{n}\mathrm{MSE}_i \qquad (5.1)$$

LOOCV has two major advantages over the validation-set approach. First, it has far less bias: each fit uses n − 1 observations, almost the entire data set, so LOOCV tends not to overestimate the test error rate as much as the validation-set approach does. Second, in contrast to the validation approach, which yields different results when applied repeatedly due to randomness in the split, LOOCV always produces the same result.

Page 50:

k-fold Cross Validation

• LOOCV is computationally intensive, so we can run k-fold Cross Validation instead

• With k-fold Cross Validation, we divide the data set into K different parts (e.g. K = 5, or K = 10, etc.)

• We then remove the first part, fit the model on the remaining K-1 parts, and see how good the predictions are on the left out part (i.e. compute the MSE on the first part)

• We then repeat this K different times taking out a different part each time

• By averaging the K different MSEs, we get an estimated validation (test) error rate for new observations

[ISLR §5.1.3, Figure 5.5] A schematic display of 5-fold CV: a set of n observations is randomly split into five non-overlapping groups. Each fifth acts in turn as a validation set (shown in beige), with the remainder as a training set (shown in blue). The test error is estimated by averaging the five resulting MSE estimates.

An alternative to LOOCV is k-fold CV. This approach involves randomly dividing the set of observations into k groups, or folds, of approximately equal size. The first fold is treated as a validation set, the method is fit on the remaining k − 1 folds, and the mean squared error MSE₁ is computed on the observations in the held-out fold. This procedure is repeated k times; each time, a different group of observations is treated as the validation set. This yields k estimates of the test error, MSE₁, MSE₂, …, MSEₖ, and the k-fold CV estimate is computed by averaging them:

$$\mathrm{CV}_{(k)} = \frac{1}{k}\sum_{i=1}^{k}\mathrm{MSE}_i \qquad (5.3)$$

LOOCV is a special case of k-fold CV in which k is set equal to n. In practice, one typically performs k-fold CV using k = 5 or k = 10. The most obvious advantage of k = 5 or k = 10 over k = n is computational: LOOCV requires fitting the statistical learning method n times, which can be expensive (except for linear models fit by least squares, where a shortcut formula applies), whereas cross-validation with small k is a very general approach applicable to almost any statistical learning method.
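The same estimate with k = 10 folds instead of n, again as a scikit-learn sketch (assumed library, toy data):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X = np.random.rand(30, 1)
y = 3 * X.ravel() + np.random.randn(30) * 0.1
scores = cross_val_score(LinearRegression(), X, y,
                         cv=KFold(n_splits=10, shuffle=True, random_state=0),
                         scoring='neg_mean_squared_error')
print(-scores.mean())   # CV(k) estimate of the test MSE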

Page 51:

K-fold Cross Validation

Page 52:

MODEL PERFORMANCE EVALUATION

Page 53:

Performance Evaluation

• Measuring performance
  – working with classification prediction data in R
  – measuring performance with the confusion matrix
  – beyond accuracy: other measures of performance
    • the kappa statistic
    • sensitivity and specificity
    • precision and recall
    • the F-measure
  – visualizing performance tradeoffs
    • ROC curves

• Estimating future performance
  – the holdout method
  – cross-validation
  – bootstrap sampling

Page 54:

Working with Classification Prediction Data

• 3 major types of data are used to evaluate a classifier
  – actual class values
    • > actual_outcome <- testdata$outcome
  – predicted class values
    • > predicted_outcome <- predict(model, testdata)
  – the estimated probability of the prediction
    • = the model's internal prediction probability

• Usually, predict() allows the type of prediction to be specified
  – (ex) class types: prob, posterior, raw
  – (ex) sms_results.csv – the model is either very confident or gives a somewhat less extreme probability
  – In spite of such mistakes, is the model still useful?

Page 55:

Confusion Matrices in Detail

• Confusion matrix
  – = a table that categorizes predictions according to whether they match the actual value in the data

Page 56:

Measuring Performance with Confusion Matrices

• $\mathrm{accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}$

• $\mathrm{error\ rate} = \dfrac{FP + FN}{TP + TN + FP + FN} = 1 - \mathrm{accuracy}$

• Example:

  – table(sms_results$actual_type, sms_results$predict_type)
  – xtabs(~ actual_type + predict_type, sms_results)
  – library(gmodels)
  – CrossTable(sms_results$actual_type, sms_results$predict_type)
  – (154 + 1202) / (154 + 1202 + 5 + 29) # accuracy
  – (5 + 29) / (154 + 1202 + 5 + 29) # error rate
  – 1 - 0.9755396 # error rate = 1 - accuracy
  – library(caret)
  – confusionMatrix(sms_results$predict_type, sms_results$actual_type, positive = "spam")

Page 57:

Beyond Accuracy – Other Performance Measures

• kappa statistic
  – accounts for the possibility of a correct prediction by chance alone
  – a value of 1 indicates perfect agreement between the model's predictions and the true values – a rare occurrence
  – (common interpretation:)
    • Poor agreement = less than 0.20 / Fair agreement = 0.20 to 0.40
    • Moderate agreement = 0.40 to 0.60 / Good agreement = 0.60 to 0.80
    • Very good agreement = 0.80 to 1.00

  – $\kappa = \dfrac{\Pr(a) - \Pr(e)}{1 - \Pr(e)}$

  – note: there are several ways of computing the kappa statistic

Page 58:

– # example using the SMS classifier

• pr_a <- 0.865 + 0.111; pr_a

• pr_e <- 0.868 * 0.886 + 0.132 * 0.114; pr_e

• k <- (pr_a - pr_e) / (1 - pr_e); k

• library(vcd)

• Kappa(table(sms_results$actual_type, sms_results$predict_type))

• library(irr)

• kappa2(sms_results[1:2])
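For a Python counterpart, scikit-learn ships a chance-corrected agreement score under the same kappa definition (a recent sklearn is assumed; toy labels again):

from sklearn.metrics import cohen_kappa_score

actual    = ['spam', 'ham', 'ham', 'spam', 'ham']
predicted = ['spam', 'ham', 'spam', 'spam', 'ham']
print(cohen_kappa_score(actual, predicted))  # agreement corrected for chance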

Page 59:

• Sensitivity and specificity
  – balance the tradeoff between being overly conservative and overly aggressive
  – e.g., an e-mail filter; …
  – a model's sensitivity (= true positive rate)
    • = the proportion of positive examples that were correctly classified
    • = # of true positives / total # of positives in the data – those correctly classified (the true positives) plus those incorrectly classified (the false negatives)

    • $\mathrm{sensitivity} = \dfrac{TP}{TP + FN}$

  – a model's specificity (= true negative rate)
    • = the proportion of negative examples that were correctly classified
    • = # of true negatives / total # of negatives

    • $\mathrm{specificity} = \dfrac{TN}{TN + FP}$

Page 60:

• Example: SMS classifier
  – sens <- 154 / (154 + 29)
  – spec <- 1202 / (1202 + 5)
  – library(caret)
  – > sensitivity(sms_results$predict_type, sms_results$actual_type, positive = "spam")
  – [1] 0.8415301
  – > specificity(sms_results$predict_type, sms_results$actual_type, negative = "ham")
  – [1] 0.9958575

• sensitivity of 0.842 == 84% of spam messages were correctly classified

• specificity of 0.996 == 99.6% of non-spam messages were correctly classified; alternatively, 0.4 percent of valid messages were rejected as spam

• Rejecting 0.4 percent of valid SMS messages may be unacceptable, or it may be a reasonable tradeoff given the reduction in spam.

Page 61:

• Precision and recall
  – precision (= positive predictive value)
    • = the proportion of positive examples that are truly positive; i.e., when the model predicts the positive class, how often is it correct? A precise model only predicts the positive class in cases very likely to be positive; it is very trustworthy.
    • if the model were very imprecise – e.g., Google returning unrelated results – users would eventually switch to a competitor. For an SMS spam filter, high precision means the model carefully targets only the spam while ignoring the ham.

    • $\mathrm{precision} = \dfrac{TP}{TP + FP}$

  – recall
    • measures how complete the results are; = # of true positives / total # of positives
    • the same quantity as sensitivity; only the interpretation differs
    • a model with high recall captures a large portion of the positive examples, meaning it has wide breadth
      – e.g., in search, high recall returns a large number of documents pertinent to the query
      – similarly, the SMS spam filter has high recall if the majority of spam messages are correctly identified

    • $\mathrm{recall} = \dfrac{TP}{TP + FN}$

Page 62:

– # Precision and recall

– prec <- 154 / (154 + 5)

– prec

– rec <- 154 / (154 + 29)

– rec

– library(caret)

– posPredValue(sms_results$predict_type, sms_results$actual_type, positive = "spam")

– sensitivity(sms_results$predict_type, sms_results$actual_type, positive = "spam")

Page 63:

• F-measure
  – precision + recall ==> F-measure (= F1 score or F-score)
  – combines precision and recall using the harmonic mean

  • $F\text{-measure} = \dfrac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} = \dfrac{2 \times TP}{2 \times TP + FP + FN}$

  – computing the F-measure from the precision & recall values above:
    • > f <- (2 * prec * rec) / (prec + rec)
    • this is the same as using the counts from the confusion matrix:
    • > f2 <- (2 * 154) / (2 * 154 + 5 + 29)
  – note: this assumes equal weight for precision and recall, which is not always valid; other weights can be applied instead, but choosing them is tricky at best and arbitrary at worst
  – better: use the F-score in combination with methods that consider a model's strengths and weaknesses more globally, such as those described in the next section (a Python sketch follows below)
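The three measures side by side in Python (scikit-learn assumed; pos_label marks the positive class, here "spam"; toy labels):

from sklearn.metrics import precision_score, recall_score, f1_score

actual    = ['spam', 'ham', 'ham', 'spam', 'ham']
predicted = ['spam', 'ham', 'spam', 'spam', 'ham']
print(precision_score(actual, predicted, pos_label='spam'))
print(recall_score(actual, predicted, pos_label='spam'))
print(f1_score(actual, predicted, pos_label='spam'))  # harmonic mean of the two above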

Page 64:

Visualizing Performance Tradeoffs

– compares learners side-by-side in a single chart
– prediction() from the ROCR package

• ROC curves
  – examine the tradeoff between detecting true positives while avoiding false positives
  – the curves are defined on a plot with:
    • the proportion of true positives on the Y axis and the proportion of false positives on the X axis
    • equivalent to sensitivity and (1 – specificity), respectively
    • == a sensitivity/specificity plot
  – The points comprising ROC curves indicate the true positive rate at varying false positive thresholds. To create the curves, a classifier's predictions are sorted by the model's estimated probability of the positive class, with the largest values first.
  – Beginning at the origin, each prediction's impact on the true positive rate and false positive rate traces the curve vertically (for a correct prediction) or horizontally (for an incorrect prediction).

Page 65:

Page 66:

• 3 hypothetical classifiers are contrasted in the plot:
  – (i) the diagonal
    • represents a classifier with no predictive value – it detects true positives and false positives at exactly the same rate; it is the baseline by which other classifiers may be judged, and ROC curves falling close to this line indicate models that are not very useful
  – (ii) the perfect classifier
    • has a curve passing through the point with a 100% TP rate and a 0% FP rate
  – (iii) most real-world classifiers
    • fall somewhere in the zone between perfect and useless

• AUC (area under the ROC curve)
  – the closer the curve is to the perfect classifier, the better; this can be measured by the AUC
  – AUC ranges from 0.5 (a classifier with no predictive value) to 1.0 (a perfect classifier)
    • 0.9–1.0 = A (outstanding) / 0.8–0.9 = B (excellent/good)
    • 0.7–0.8 = C (acceptable/fair) / 0.6–0.7 = D (poor)
    • 0.5–0.6 = F (no discrimination)
  – note: two ROC curves may have different shapes yet the same AUC, which can be misleading; use AUC in combination with a qualitative examination of the ROC curve

Page 67:

– library(ROCR)
– > pred <- prediction(predictions = sms_results$prob_spam, labels = sms_results$actual_type)

– # ROC curves
– > perf <- performance(pred, measure = "tpr", x.measure = "fpr")
– > plot(perf, main = "ROC curve for SMS spam filter", col = "blue", lwd = 2)

– # add a reference line to the graph
– > abline(a = 0, b = 1, lwd = 2, lty = 2)

– # calculate AUC
– > perf.auc <- performance(pred, measure = "auc")
– > str(perf.auc)
– > as.numeric(perf.auc@y.values)
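Mirroring the ROCR snippet in Python (scikit-learn and matplotlib assumed; toy probabilities stand in for sms_results$prob_spam):

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

actual = [1, 0, 1, 1, 0, 0, 1, 0]                # 1 = spam; hypothetical labels
prob_spam = [.9, .2, .8, .6, .4, .1, .7, .3]     # model's estimated P(spam)
fpr, tpr, _ = roc_curve(actual, prob_spam)       # TPR/FPR at varying thresholds
plt.plot(fpr, tpr, lw=2, label='AUC = %.3f' % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], 'k--')                  # reference diagonal (no-skill baseline)
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()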

Page 68:

Wrap-up

Page 69:

Big Data Trends

Page 70:

Expansion

• YARN & Hadoop 2.x – real-time (stream) processing – graph processing – …

Image source: http://ko.hortonworks.com/get-started/yarn/

Page 71:

Integration

• Attempts – the Enterprise Data Hub

• Completion(?) – the Lambda architecture

Page 72:

Spark

• Architecture:
  – developed in 2009 at UC Berkeley's AMPLab
  – in-memory approach
  – cached intermediate data sets
  – multi-step DAG execution engine
  – …

• "Over time, fewer projects will use MapReduce, and more will use Spark"
  – Doug Cutting, creator of Hadoop

Page 73:

Analytics

• Functional specialization

Page 74:

End-User Focus

• Big data appliances
  – each vendor's servers + big data solution + NoSQL + management tools + …

• Cloud computing and big data
  – cloud services
    • Amazon EMR, Google, …, domestic (Korean) cloud services
  – Big Data as a Service (BDaaS)
    • = an external provider delivers analysis tools or analysis results
    • a kind of managed service (similar to SaaS)
  – a trend toward integration with cloud storage
  – Google's BigQuery
    • Dremel, REST API
  – …

Page 75:

Diffusion

Page 76:

Closing Remarks

Page 77:

• Strategy – a big data mindset
  • analytical thinking
  • big data and small data

– "data as an asset"
  • the importance of metadata management – data quality
  • the culture and practices within an organization that block data sharing and collaboration

– internalizing technology and analytical thinking
  • an open-source software strategy
  • an analytical corporate culture

• Creativity
  – complexity theory and System Dynamics
  – data exhaust

• "Big Data is All Data"

Page 78:
