Deep Learning Text NLP and Spark Collaboration. Korean Deep Learning Text NLP & Spark

Spark Day 2017

Deep Learning Text NLP with Spark Collaboration

스사모 (Spark Korea User Group), SSG.COM BigData Team

Hoondong Kim (김훈동)

Who am I?
• Hoondong Kim (김훈동)
• 스사모 (Korea Spark User Group)
• Leader of the Big Data part at SSG.COM, the Shinsegae Group online portal
• Microsoft MVP (Most Valuable Professional) in Hadoop, Spark, Machine Learning, and Azure ML
• Major in BigData RealTime Analytics & NoSQL
• http://hoondongkim.blogspot.kr

I will say …
• Buzzwords (AI, ML, DL …)
• Chatbots & Text NLP …
• Deep Learning NLP
  • Various algorithms
  • Performance
  • Implementation code and review of results
  • Pain Point
• Spark Collaboration for Deep Learning
• Conclusion

Buzzword: on AI…

Buzzword: on Machine Learning…

• Black-box approach vs. rule-based

Buzzword: on Deep Learning…

Buzzword: on chatbots

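Closed Domain Chatbot vs Open Domain Chatbot

Retrieval Based Model vs Generative Model

Easy vs Hard

• Closed domain: a limited domain, a specific field.
• Retrieval based: pre-defined responses; sometimes rule-based, sometimes text classification.
• Open domain: no restriction on domain or intent; an open field.
• Generative: generates answers according to the question, producing and creating appropriate answers even for questions it was not trained on.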
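
As a rough illustration of the retrieval-based, closed-domain corner of this comparison, the sketch below maps a question to an intent with keyword rules and returns a pre-defined response. The intents, keywords, and responses are hypothetical and do not come from the talk; a trained text classifier could stand in for the keyword matching.

```python
import re

# Hypothetical intents, keywords and canned responses (illustrative only).
RESPONSES = {
    "delivery": "Your order status is available under My Orders.",
    "refund": "Refunds are processed within a few business days.",
}
KEYWORDS = {
    "delivery": {"delivery", "shipping", "arrive"},
    "refund": {"refund", "return", "cancel"},
}
FALLBACK = "Sorry, I can only help with delivery and refund questions."


def classify_intent(question):
    """Rule-based intent detection: pick the intent with the most keyword hits."""
    tokens = set(re.findall(r"\w+", question.lower()))
    best_intent, best_hits = None, 0
    for intent, words in KEYWORDS.items():
        hits = len(tokens & words)
        if hits > best_hits:
            best_intent, best_hits = intent, hits
    return best_intent  # None when no keyword matched


def reply(question):
    """Retrieval-based answer: look up a pre-defined response, never generate one."""
    return RESPONSES.get(classify_intent(question), FALLBACK)
```

Anything outside the two hypothetical intents falls through to the fallback message, which is what makes the closed-domain case the easier one to build.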

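Machine Learning vs Deep Learning

Machine Learning

[Pros]

• Has a longer history.
• Many implementations and proven methodologies exist.
• Methodologies exist for a wider variety of requirements.
• Sometimes reaches good performance quickly with little data.

[Cons]

• In certain fields, accuracy falls well short of Deep Learning.
• Feature selection, validation, and similar steps take more manual work than Deep Learning's end-to-end approach, and may be impractical when there are very many features.

Deep Learning

[Pros]

• In certain fields, it can deliver a dramatic performance improvement.
• Even without heavy effort on feature selection, the model often learns and selects features end to end.

[Cons]

• Often requires a fairly large amount of labeled data.
• With a small amount of data, it underfits more often than ML does.
• Requires high-end compute resources such as GPUs and BigData-scale computing environments.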

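What is Deep Learning from a developer's perspective?

• The Jeff Dean role

• What we do better…
  • Large-scale data processing
  • Data preprocessing and postprocessing
  • Algorithm implementation (code-level implementation)
  • Deep Learning involves more low-level coding.

• The math in Deep Learning, especially CNN, RNN, RNN variants, and RL, is not that complicated.
• Earlier Machine Learning
  • Markov-Chain Monte Carlo
  • Gibbs Sampler
  • Variational Inference
  • Deep Belief Network
• Current Deep Learning
  • CNN
  • RNN
  • LSTM …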

Mathematical formula

Engineering Art

Machine Learning vs Deep Learning
Sentiment Analysis for Korean

Positive/negative sentiment classification of Korean sentences

1. Naïve Bayes       83.2%
2. Word2Vec + CNN    85.4%

- Source of the positive/negative accuracy scores: 송치성 (바벨피쉬)

Small dataset. On easy classification problems such as positive/negative, Machine Learning is also very accurate, and training is much faster!
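
To make the Naïve Bayes baseline concrete, here is a minimal pure-Python sketch of a multinomial Naïve Bayes classifier with add-one smoothing. It is not the code behind the scores above; the four training sentences are made up, and they are assumed to be already tokenized (real Korean text would first go through a morphological analyzer).

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # log P(label) + sum over words of log P(word | label)
            score = math.log(n_docs / total_docs)
            for w in tokens:
                if w in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, made-up training sentences (already tokenized); not the talk's data.
train_docs = [
    ["배송", "빠르고", "좋아요"],   # delivery / fast / good
    ["품질", "좋아요", "만족"],     # quality / good / satisfied
    ["배송", "느리고", "별로"],     # delivery / slow / not great
    ["품질", "별로", "실망"],       # quality / not great / disappointed
]
train_labels = ["pos", "pos", "neg", "neg"]
model = NaiveBayes().fit(train_docs, train_labels)
```

Training is a single counting pass over the data, which is why this kind of model trains so much faster than a neural network.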

Machine Learning vs Deep Learning
Sentiment Analysis for Korean

138-class classification of Korean sentences (1:1 customer support data)

1. Multinomial Naïve Bayes    32.31%
2. Count Vector + SVM         17.28%
3. TF-IDF + SVM               51.21%
4. Word2Vec + CNN             59.00%

- Data used: SSG.COM 1:1 customer support (CS) data, 138-class classification problem (predict the top 1)
- Training Data: 1,649,415 records

When there is plenty of data: let's move on to a somewhat harder multi-class classification problem.
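
The Count Vector and TF-IDF rows differ only in how each term is weighted before it reaches the SVM. The sketch below shows both weightings on made-up tokenized inquiries, using the textbook form tf * log(N / df); this is an assumption for illustration, and libraries typically apply smoothed variants of the formula.

```python
import math
from collections import Counter


def count_vectors(docs):
    """Raw term counts per document."""
    return [Counter(tokens) for tokens in docs]


def tfidf_vectors(docs):
    """TF-IDF weights using tf * log(N / df), one common textbook variant."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # document frequency: count each term once per doc
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


# Toy, made-up tokenized inquiries; not the talk's data.
docs = [
    ["order", "delivery", "late"],
    ["order", "refund", "request"],
    ["order", "coupon", "error"],
]
counts = count_vectors(docs)
tfidf = tfidf_vectors(docs)
```

In this toy corpus "order" appears in every document, so its TF-IDF weight drops to zero while its raw count stays at 1; terms that occur in only one document keep a positive weight.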

Deep Learning Text Classification Deep Dive for Korean

1. Word2Vec + CNN (Batch Normalize + Augmentation)

2. Word2Vec + LSTM

3. Word2Vec + CNN + LSTM

4. Word2Vec + Bidirectional GRU

5. Word2Vec + Bidirectional GRU + Attention Network

6. FastText

7. Glove + LSTM

- Data used: SSG.COM 1:1 customer support (CS) data, labeled by customers

Deep Learning Text Classification Deep Dive for Korean

1. Word2Vec + CNN (Batch Normalize + Augmentation)               72.30%
2. Word2Vec + LSTM                                               73.94%
3. Word2Vec + CNN + LSTM                                         72.97%
4. Word2Vec + Bidirectional GRU                                  74.36%
5. Word2Vec + Bidirectional GRU + Attention Network              73.15%
6. FastText                                                      72.50%
7. Glove + LSTM (Tensorflow GPU vs BigDL on Spark Cluster)       ...ing

- Data used: SSG.COM 1:1 customer support (CS) data, 31-class classification in total (predict the top 1), labeled by customers
- Training Data: 1,649,415 records
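
The scores here are top-1 accuracies: a prediction counts only if the single highest-scoring class equals the customer's label. A minimal sketch of that metric follows; the score rows and labels are made up, and three classes are used instead of 31 to keep it short.

```python
def top1_accuracy(score_rows, true_labels):
    """Fraction of examples whose highest-scoring class equals the true label."""
    correct = 0
    for scores, truth in zip(score_rows, true_labels):
        predicted = max(range(len(scores)), key=scores.__getitem__)
        if predicted == truth:
            correct += 1
    return correct / len(true_labels)


# Made-up class scores for 4 examples over 3 classes.
scores = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.4, 0.5, 0.1],
    [0.2, 0.2, 0.6],
]
labels = [0, 2, 0, 2]
accuracy = top1_accuracy(scores, labels)
```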

Code On!

On training speed

• Best Score is Word2Vec + Bidirectional GRU

• Tesla M40 GPU

• Training Data: 1.65 million records

• Learning Rate 0.0005

• 5 epochs took 50,000 seconds = 833 minutes = about 14 hours
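
The timing arithmetic can be checked in a few lines. The per-epoch time and throughput figures are derived here and were not stated in the talk; the throughput assumes every epoch passes over all 1,649,415 training records.

```python
total_seconds = 50_000      # reported for 5 epochs
epochs = 5
train_rows = 1_649_415      # training set size reported for this data

minutes = total_seconds / 60
hours = total_seconds / 3600
seconds_per_epoch = total_seconds / epochs
# Assumes each epoch processes the full training set once.
rows_per_second = train_rows * epochs / total_seconds

print(f"{minutes:.0f} min, {hours:.1f} h")
print(f"{seconds_per_epoch:.0f} s per epoch, {rows_per_second:.0f} rows/s")
```

At about 13.9 hours for a single configuration, every additional hyperparameter trial costs roughly half a day on one GPU.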

Deep Learning Hyper Parameter Optimization

Deep Learning Text NLP – Pain Point

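1. Time to code a proven algorithm < preprocessing needed to feed data into that algorithm (reusable, of course) < Hyper Parameter Optimization
2. Data preprocessing with Pandas and Numpy? One single core for hundreds of gigabytes of data or more??
3. Small GPU memory, and too few GPU machines.
4. Parallel Deep Learning coding on GPUs resembles early Map/Reduce coding.
5. Deep Learning on hundreds of gigabytes of data or more?
6. Repeat, repeat, repeat; change the parameters and repeat again….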
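
The "change the parameters and repeat" loop in the last item can be written as a simple random search. This is a generic sketch, not the tuning procedure used in the talk: the search space is hypothetical (only the 0.0005 learning rate appears in the slides), and `fake_train_and_score` is a stand-in for a real training run that would take hours.

```python
import random


def random_search(train_and_score, space, n_trials=20, seed=0):
    """Sample hyperparameters at random and keep the best-scoring trial."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: rng.choice(values) for name, values in space.items()}
        score = train_and_score(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical search space (illustrative values only).
space = {
    "learning_rate": [0.005, 0.001, 0.0005, 0.0001],
    "hidden_units": [64, 128, 256],
    "dropout": [0.2, 0.5],
}


def fake_train_and_score(params):
    """Stand-in for a real training run; returns a made-up validation score."""
    lr_penalty = abs(params["learning_rate"] - 0.0005)
    size_penalty = abs(params["hidden_units"] - 128) / 1000
    return -lr_penalty - size_penalty


best_params, best_score = random_search(fake_train_and_score, space, n_trials=50)
```

Each trial is independent of the others, so trials can run in parallel across machines.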

Python Machine Learning on Spark

Spark Deep Learning Deep Dive

• Keras + Tensorflow + Spark : elephas

Jupyter on Spark + Hadoop Cluster

BigDL Deep Learning Job on Hadoop Yarn Manager (by Spark Job)

BigDL Deep Learning Text Classification

Spark & Deep Learning

• BigDL example

• BigDL Magic Button

https://software.intel.com/en-us/articles/deploying-bigdl-on-azure-data-science-vm

Tensorflow on Spark

• Deep Water example

Miscellaneous

• Introduction to Deep Learning tools
  • http://ankivil.com/choosing-a-deep-learning-software/

• Deep learning framework speed comparison
  • https://tensorflow.blog/2017/02/13/chainer-mxnet-cntk-tf-benchmarking/

• Keras
  • Version 2 is in preparation.
  • CNTK also looks likely to become a Keras backend.
  • Keras to be absorbed into the Tensorflow code base??

Miscellaneous

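Conclusion

• Deep Learning is not a silver bullet, but…

• It is true that in certain fields it delivers a quantum jump in performance.

• As more tools appear, the barrier to entry is getting lower.

• Developers have the advantage in many respects (provided they have picked up a minimum of theory).

• Given the current situation, Deep Learning looks set to become more widespread and commonplace.

• Judging from the pain points, the conditions are forming for Spark to gain momentum again.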
