![Page 1: Hadoop:’ challenge - Proideadata.proidea.org.pl/.../hadoop_challange.pdf · Hadoop:’ challenge accepted!0 ArkadiuszOsiński arkadiusz.osinski@allegrogroup.com0 RobertMroczkowski](https://reader034.vdocuments.mx/reader034/viewer/2022051914/60062bc24027b358a5637693/html5/thumbnails/1.jpg)
Hadoop: challenge accepted!
Arkadiusz Osiński arkadiusz.osinski@allegrogroup.com
Robert Mroczkowski [email protected]
ToC
- Hadoop basics
- Gather data
- Process your data
- Learn from your data
- Visualize your data
BigData
- Petabytes of (un)structured data
- 12% of data is analyzed
- A lot of data is not gathered
- How to gain knowledge?
Word cloud: Power, Big Data, Data Lake, Scalability, Petabytes, MapReduce, Commodity
HDFS
- Storage layer
- Distributed file system
- Commodity hardware
- Scalability
- JBOD
- Access control
- No SPOF
YARN
- Distributed computing layer
- Operations run where the data resides
- MapReduce…
- and other applications
- Resource management
Let’s squeeze our data to get some juice!
Gather data

flume-twitter.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
flume-twitter.sources.Twitter.channels = MemChannel
flume-twitter.sources.Twitter.consumerKey = (…)
flume-twitter.sources.Twitter.consumerSecret = (…)
flume-twitter.sources.Twitter.accessToken = (…)
flume-twitter.sources.Twitter.accessTokenSecret = (…)
flume-twitter.sources.Twitter.keywords = hadoop, big data, nosql
Process your data
- Hadoop Streaming!
- No need to write code in Java
- You can use Python, Perl or Awk
Process your data

#!/usr/bin/python
# Mapper: emit "<date>\t1" for every tweet that mentions the keyword
import sys
import json
import datetime as dt

keyword = 'hadoop'

for line in sys.stdin:
    data = json.loads(line.strip())
    if keyword in data['text'].lower():
        # Use a separate name: reassigning dt would shadow the datetime module
        day = dt.datetime.strptime(data['created_at'],
                                   '%a %b %d %H:%M:%S +0000 %Y').strftime('%Y-%m-%d')
        print('{0}\t1'.format(day))
Process your data

#!/usr/bin/python
# Reducer: count the records per date (input arrives sorted by key)
import sys

(counter, datekey) = (0, '')

for line in sys.stdin:
    line = line.strip().split("\t")
    if datekey != line[0]:
        if datekey:
            print("{0}\t{1}".format(datekey, counter))
        datekey = line[0]
        counter = 1
    else:
        counter += 1

# Emit the last key, unless there was no input at all
if datekey:
    print("{0}\t{1}".format(datekey, counter))
Process your data

yarn jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-files ./map.py,./reduce.py \
-mapper ./map.py \
-reducer ./reduce.py \
-input /tweets/2014/04/*/*/* \
-input /tweets/2014/05/*/*/* \
-output /tweet_keyword
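Hadoop Streaming feeds input lines to the mapper, sorts the mapper output by key, and pipes the sorted stream to the reducer. The following is a minimal local sketch of that contract, useful for a dry run before submitting a job. It assumes the same keyword-per-day logic as the scripts above, rewritten as plain functions; the names `map_line`, `reduce_sorted` and `run_local` are hypothetical, and the sample records are made up for illustration.

```python
import json
from datetime import datetime

KEYWORD = "hadoop"
TWITTER_TIME = "%a %b %d %H:%M:%S +0000 %Y"


def map_line(line):
    """Yield 'YYYY-MM-DD<TAB>1' for a tweet whose text mentions the keyword."""
    data = json.loads(line)
    if KEYWORD in data.get("text", "").lower():
        day = datetime.strptime(data["created_at"], TWITTER_TIME).strftime("%Y-%m-%d")
        yield "{0}\t1".format(day)


def reduce_sorted(lines):
    """Count consecutive lines sharing a key; the input must be sorted by key."""
    current, counter = None, 0
    for line in lines:
        key = line.split("\t")[0]
        if key != current:
            if current is not None:
                yield "{0}\t{1}".format(current, counter)
            current, counter = key, 0
        counter += 1
    if current is not None:
        yield "{0}\t{1}".format(current, counter)


def run_local(raw_lines):
    """Emulate map -> sort by key -> reduce, as the framework does between phases."""
    mapped = [out for line in raw_lines for out in map_line(line)]
    return list(reduce_sorted(sorted(mapped)))


# Made-up sample records, not real tweets.
SAMPLE = [
    json.dumps({"text": "Hadoop rocks", "created_at": "Thu Apr 24 10:00:00 +0000 2014"}),
    json.dumps({"text": "nothing here", "created_at": "Thu Apr 24 11:00:00 +0000 2014"}),
    json.dumps({"text": "more HADOOP", "created_at": "Fri Apr 25 09:00:00 +0000 2014"}),
    json.dumps({"text": "hadoop again", "created_at": "Thu Apr 24 12:00:00 +0000 2014"}),
]

if __name__ == "__main__":
    for row in run_local(SAMPLE):
        print(row)
```

The explicit `sorted` call stands in for the shuffle phase: the reducer only compares each key with the previous one, so it gives wrong counts on unsorted input.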
Process your data

(….)
2014-04-24 864
2014-04-25 1121
2014-04-26 593
2014-04-27 649
2014-04-28 1084
2014-04-29 1575
2014-04-30 1170
2014-05-01 1164
2014-05-02 1175
2014-05-03 779
2014-05-04 471
(….)
Recommendations
Which product will a client want?
We have historical data on user interactions with items.
Simple Example
Let’s just use Mahout - it’s easy!
> apt-get install mahout
> cat simple_example.csv
1,101
1,102
1,103
2,101
> hdfs dfs -put simple_example.csv
> mahout recommenditembased -s SIMILARITY_LOGLIKELIHOOD -b \
-Dmapred.input.dir=/mahout/input/wikilinks/simple_example.csv \
-Dmapred.output.dir=/mahout/output/wikilinks/simple_example \
-Dmapred.job.queue.name=atmosphere_prod
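The job above reads boolean "user,item" pairs and recommends items that tend to appear together with the ones a user already has. As a rough illustration of that item-based idea, here is a small sketch that scores candidates by raw co-occurrence counts. This is a deliberate simplification and not Mahout's implementation: the command above asks for log-likelihood similarity, which is not reproduced here. The function name `recommend` and the sample pairs are hypothetical.

```python
from collections import defaultdict
from itertools import permutations


def recommend(pairs, top_n=2):
    """Item-based recommendations from boolean (user, item) pairs.

    Similarity is the raw co-occurrence count, a stand-in for the
    log-likelihood similarity requested in the Mahout command.
    """
    items_by_user = defaultdict(set)
    for user, item in pairs:
        items_by_user[user].add(item)

    # cooc[a][b] = number of users who interacted with both a and b
    cooc = defaultdict(lambda: defaultdict(int))
    for items in items_by_user.values():
        for a, b in permutations(items, 2):
            cooc[a][b] += 1

    result = {}
    for user, seen in items_by_user.items():
        scores = defaultdict(int)
        for item in seen:
            for other, count in cooc[item].items():
                if other not in seen:
                    scores[other] += count
        # Highest score first; ties broken by item id for determinism
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        result[user] = [item for item, _ in ranked[:top_n]]
    return result


# Hypothetical interactions in the same "user,item" shape as the CSV above.
PAIRS = [(1, 101), (1, 102), (1, 103), (2, 101), (2, 104), (3, 102), (3, 104)]

if __name__ == "__main__":
    for user, items in sorted(recommend(PAIRS).items()):
        print(user, items)
```

Raw counts favor popular items; a statistical measure such as log-likelihood is meant to correct for that, which is presumably why it was chosen for the real job.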
Simple Example
Tadadam!
> hdfs dfs -text /mahout/output/wikilinks/simple_example/part-r-00000.snappy
1 [105:1.0,104:1.0]
2 [106:1.0,105:1.0]
3 [103:1.0,102:1.0]
4 [105:1.0,102:1.0]
5 [107:1.0,106:1.0]
Wiki Case
We have the links between Wikipedia articles and want to propose new links between them.
“Wikipedia (/ˌwɪkɨˈpiːdiə/ or /ˌwɪkiˈpiːdiə/ WIK-i-PEE-dee-ə) is a collaboratively edited, multilingual, free Internet encyclopedia that is supported by the non-profit Wikimedia Foundation. Volunteers worldwide collaboratively write Wikipedia's 30 million articles in 287 languages, including over 4.5 million in the English Wikipedia. Anyone who can access”
Wiki Case
http://users.on.net/%7Ehenry/pagerank/links-simple-sorted.zip

#!/usr/bin/awk -f
# Turn "src: dst1 dst2 ..." into one "src,dst" pair per line
BEGIN { OFS=","; }
{
  gsub(":","",$1);
  for (i=2;i<=NF;i++) {
    print $1,$i
  }
}
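For readers less used to Awk, the same transformation can be sketched in Python. This assumes, as the Awk script implies, that each input line holds a source id followed by a colon and a whitespace-separated list of target ids; the function name `explode_links` and the sample line are hypothetical.

```python
def explode_links(line):
    """Turn 'src: dst1 dst2 ...' into ['src,dst1', 'src,dst2', ...]."""
    fields = line.split()
    if not fields:
        return []
    src = fields[0].replace(":", "")
    return ["{0},{1}".format(src, dst) for dst in fields[1:]]


if __name__ == "__main__":
    # Hypothetical input line in the assumed "src: dst ..." layout.
    for pair in explode_links("1: 5 7 9"):
        print(pair)
```

The output has the "user,item" shape that the recommender job expects, with the linking article in the role of the user and the linked article in the role of the item.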
Wiki Case
yarn jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-Dmapreduce.job.max.split.locations=24 \
-Dmapreduce.job.queuename=hadoop_prod \
-Dmapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
-Dmapred.text.key.comparator.options=-n \
-Dmapred.output.compress=false \
-files ./mahout/mapper.awk \
-mapper ./mapper.awk \
-input /mahout/input/wikilinks/links-simple-sorted.txt \
-output /mahout/output/wikilinks/fixedinput
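The `-n` comparator option in the command above requests numeric rather than textual key ordering. The difference matters for ids of varying length, as this small plain-Python illustration shows (it is not Hadoop code, and the key values are made up):

```python
# Made-up article ids, stored as text the way streaming keys are.
keys = ["10", "9", "100", "2"]

# Default text ordering compares character by character.
text_order = sorted(keys)

# Numeric ordering compares the values the strings represent.
numeric_order = sorted(keys, key=int)

if __name__ == "__main__":
    print(text_order)
    print(numeric_order)
```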
Wiki Case
The Mahout library computed the similarity matrix and produced recommendations for 824 articles.
Importantly, we gathered no a priori knowledge and just ran the algorithms out of the box.
Wiki Case
Acadèmia_Valenciana_de_la_Llengua
- FIFA: Valencia
- October_1: calendar
- Prehistoric_Iberia: link appeared recently
- Ceuta: Spanish city on the north coast of Africa
- Roussillon: part of France by the border with Spain
- Sweden: :)
- Turís: municipality in the Valencian Community
- Vulgar_Latin: language article
- Western_Italo-Western_languages: language article
- Àngel_Guimerà: Spanish writer
Tweets
Let’s find groups of:
• tags
• users
Tweets
• Our data is not random
• We’ve picked specific keywords
• We’ll do analysis in two orthogonal directions
Tweets
{
"filter_level":"medium",
"contributors":null,
"text":"PROMOCIÓN MES DE MAYO. con ...",
"geo":null,
"retweeted":false,
"lang":"es",
"entities":{
"urls":[
{ "expanded_url":"http://www.agmuriel.com",
"indices":[ 69, 91 ],
"display_url":"agmuriel.com/#!-/c1gz",
"url":"http://t.co/APpPjRRTXn" } ]
}
(…)
Tweets

#!/usr/bin/python
# Mapper: for English tweets, emit "hashtag<TAB>tweet text"
import json, sys

for line in sys.stdin:
    line = line.strip()
    if '"lang":"en"' in line:
        tweet = json.loads(line)
        try:
            # Collapse whitespace so embedded tabs/newlines cannot break the record
            text = " ".join(tweet['text'].lower().split())
            if text:
                tags = tweet["entities"]["hashtags"]
                for tag in tags:
                    print(tag["text"] + "\t" + text)
        except KeyError:
            continue

#!/usr/bin/python
# Reducer: concatenate all tweet texts that share a hashtag
import sys

(lastKey, text) = (None, "")
for line in sys.stdin:
    (key, value) = line.strip().split("\t", 1)
    if lastKey and lastKey != key:
        print(lastKey + "\t" + text)
        (lastKey, text) = (key, value)
    else:
        (lastKey, text) = (key, text + " " + value)

# Flush the last group, which the loop itself never prints
if lastKey:
    print(lastKey + "\t" + text)
Tweets
yarn jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-Dmapreduce.job.queuename=atmosphere_time \
-Dmapred.output.compress=false \
-Dmapreduce.job.max.split.locations=24 \
-Dmapred.reduce.tasks=20 \
-files ~/mahout/twitter_map.py,~/mahout/twitter_reduce.py \
-mapper ./twitter_map.py \
-reducer ./twitter_reduce.py \
-input /project/atmosphere/tweets/2014/04/*/* \
-output /project/atmosphere/tweets/output \
-outputformat org.apache.hadoop.mapred.SequenceFileOutputFormat
Get SequenceFile with proper mapping
Tweets
mahout seq2sparse \
-i /project/atmosphere/tweets/output \
-o /project/atmosphere/tweets/vectorized -ow \
-chunk 200 -wt tfidf -s 5 -md 5 -x 90 -ng 2 -ml 50 -seq -n 2
Calculate vector representation for text
{10:0.6292275202550768,14:0.7772211575566166}
{10:0.6292275202550768,14:0.7772211575566166}
{3:0.37796447439954967,14:0.37796447439954967,19:0.654653676423271,22:0.534522474858859}
{17:1.0}
{3:0.37796447439954967,14:0.37796447439954967,19:0.654653676423271,22:0.534522474858859}
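Each vector above maps a term index to a weight, and the squared weights in every vector sum to 1, which is consistent with the `-wt tfidf` and `-n 2` options. A minimal sketch of how such vectors can be produced follows. The weighting formula is a common textbook TF-IDF variant and is not claimed to match Mahout's exact formula; the function name `tfidf_vectors` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term_index: weight} dict per document, L2-normalized.

    Weight = term frequency * log(N / document frequency), a textbook
    variant that may differ from what Mahout computes.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    index = {term: i for i, term in enumerate(vocab)}
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(docs)

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = {index[t]: c * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}
        # Terms present in every document get weight 0 and are dropped
        raw = {i: w for i, w in raw.items() if w > 0}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({i: w / norm for i, w in raw.items()} if norm else {})
    return vectors


# Hypothetical documents standing in for the per-hashtag texts.
DOCS = ["linux ubuntu patch", "linux windows", "zumba fitness"]

if __name__ == "__main__":
    for vector in tfidf_vectors(DOCS):
        print(vector)
```

Normalizing to unit length keeps long and short texts comparable, which matters for the distance-based clustering that follows.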
Tweets
It’s time to begin clustering
Let’s find 100 clusters
mahout kmeans \
-i /tweets_5/vectorized/tfidf-vectors \
-c /tweets_5/kmeans/initial-clusters \
-o /tweets_5/kmeans/output-clusters \
-cd 1.0 -k 100 -x 10 -cl -ow \
-dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure
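The command above runs k-means with a squared Euclidean distance measure. For reference, here is a compact single-machine sketch of the standard k-means loop (Lloyd's algorithm). It is not Mahout's distributed implementation; the parameters `k`, `max_iter` and `delta` only loosely mirror the `-k`, `-x` and `-cd` flags, and the sample points are made up.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=10, delta=1e-6, seed=0):
    """Lloyd's algorithm on dense tuples; returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assignments = []
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        moved = max(squared_distance(o, n) for o, n in zip(centroids, new_centroids))
        centroids = new_centroids
        if moved <= delta:
            break  # converged: no centroid moved more than delta
    return centroids, assignments


# Made-up 2-D points forming two well-separated groups.
POINTS = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

if __name__ == "__main__":
    centers, labels = kmeans(POINTS, 2)
    print(centers)
    print(labels)
```

The result depends on the initial centroids, so real runs usually either fix a seed or compare several starts.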
Tweets
Glance at results (one cluster per line)
- BURN, FAT, WEIGHTLOSS, FITNESS, ZUMBA
- OPEN, SOFTWARE, LINUX, UBUNTU, OPENSUSE, PATCHING
- LEATHER, WALLET, MAN
Tweets
It was easy because tags are strongly correlated (they co-occur).
Tweets
Bigger challenge – user clustering
LINUX UBUNTU WINDOWS OS PATCH MAC HACKED MICROSOFT
FREE CSRRACING WON RACEYOURFRIENDS ANDROID CSRCLASSIC
Tweets
Bigger challenge – user clustering
• Results show that the dataset is strongly skewed toward mobile and games
• The dataset wasn’t random – we subscribed to specific keywords
• The OS result is great!
Tweets
HADOOP WORLD
run predictive machine learning algorithms on hadoop without even knowing mapreduce.: data scientists are very... http://t.co/gdmqm5g1ar
rt @mapr: google cloud storage connector for #hadoop: quick start guide now avail http://t.co/17hxtvdlir #bigdata
Tweets
HADOOP WORLD
Cloudera wants to do big data in real time.
Hortonworks wants to replace Cloudera by research.
Visualize data

add jar hive-serdes-1.0-SNAPSHOT.jar;
create table tw_data_201404
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\012'
STORED AS TEXTFILE
LOCATION '/tweets/tw_data_201404'
AS SELECT
  v_date,
  -- alias added: the later join refers to this column as "tag"
  LOWER(hashtags.text) AS tag,
  lang,
  COUNT(*) AS total_count
FROM logs.tweets
LATERAL VIEW EXPLODE(entities.hashtags) t1 AS hashtags
WHERE v_date like '2014-04-%'
GROUP BY v_date, LOWER(hashtags.text), lang
Visualize data

add jar elasticsearch-hadoop-hive-2.0.0.RC1.jar;
CREATE EXTERNAL TABLE es_export (
  v_date string,
  tag string,
  lang string,
  total_count int,
  info string
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
  'es.resource' = 'trends/log',
  'es.index.auto.create' = 'true'
);
Visualize data

INSERT overwrite TABLE es_export
SELECT distinct may.v_date, may.tag, may.lang, may.total_count, 'nt'
FROM tw_data_201405 may
LEFT outer JOIN tw_data_201404 april ON april.tag = may.tag
WHERE april.tag is null AND may.total_count > 1;
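The query above is an anti-join: the LEFT OUTER JOIN followed by the `april.tag is null` filter keeps only May rows whose tag never appeared in April. The same logic can be sketched in Python on in-memory rows. The function name `new_tags`, the dict-based row layout and the sample rows are hypothetical, chosen only to mirror the table columns.

```python
def new_tags(may_rows, april_rows, min_count=1):
    """Return distinct May rows whose tag is absent from April and whose count exceeds min_count."""
    april_tags = {row["tag"] for row in april_rows}
    seen = set()
    result = []
    for row in may_rows:
        key = (row["v_date"], row["tag"], row["lang"], row["total_count"])
        if row["tag"] in april_tags or row["total_count"] <= min_count:
            continue
        if key not in seen:  # mirrors SELECT DISTINCT
            seen.add(key)
            result.append(row)
    return result


# Made-up rows in the shape of the tw_data_* tables.
APRIL = [{"v_date": "2014-04-30", "tag": "hadoop", "lang": "en", "total_count": 12}]
MAY = [
    {"v_date": "2014-05-01", "tag": "hadoop", "lang": "en", "total_count": 9},
    {"v_date": "2014-05-02", "tag": "newtag", "lang": "en", "total_count": 4},
    {"v_date": "2014-05-02", "tag": "rare", "lang": "en", "total_count": 1},
]

if __name__ == "__main__":
    for row in new_tags(MAY, APRIL):
        print(row)
```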
Visualize data
Tag: eurovisiontve
Thank you!
Questions?