Big Data & Hadoop & How We Use It at Alchetron

Post on 14-Apr-2017

BIG DATA • HADOOP • HDFS • MAP-REDUCE • ALCHETRON • FEEDBACKS • Q/A

BIG DATA & HADOOP

To understand BIG DATA we first have to understand data!!!

THIS DRAWING WAS CREATED 40,000 YEARS AGO. THIS WAS THE FIRST TIME HUMANS STARTED RECORDING DATA.

AS TIME PASSED WE STARTED CREATING MORE DATA, AS YOU CAN SEE IN THIS PICTURE, WHICH IS 3,000-10,000 YEARS OLD.

STONE TABLETS

This man invented the printing press in 1439, which means more data could be recorded than ever before.

Johannes Gutenberg

100 crore (1 billion) books were printed by the 18th century & my dear friends, you were still not born…

THIS GUY INVENTS THE WORLD WIDE WEB

Sir Tim Berners-Lee invented the World Wide Web, which went public in 1991. With the web running on top of the internet, the amount of data generated by mankind explodes!!

30 years of mobile Technology

In the next 20 years, computing will move to the microscopic level. Computers won't be in our pockets but inside our bodies & minds. This is where technology & biology will merge, which will multiply and enhance our capabilities a thousand times.

Technological change will be so rapid & exponential

With the invention of the internet + small & less expensive storage devices, data creation explodes!!

Data generation statistics

2.7 zettabytes of data exist in the digital universe today.

Facebook stores, accesses, and analyzes 50+ Petabytes of user generated data.

Walmart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 petabytes of data.

More than 5 billion people are calling, texting, tweeting and browsing on mobile phones worldwide.

YouTube users upload 48 hours of new video every minute of the day.

In 2008, Google was processing 20,000 terabytes of data (20 petabytes) a day.

With the invention of the internet, data creation explodes

SO WHAT IS BIG DATA??

Every day, we create 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few.

This data is big data.

Who will manage BIG DATA?

HADOOP

Open Source Apache Project

Written in Java

Runs on Linux, Mac OS X, Windows, and Solaris

Commodity hardware

Contents

• History of Hadoop
• The current applications of Hadoop
• Hadoop HDFS + MAP-REDUCE
• Other Hadoop projects

Fun Fact of Hadoop

"The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria."

---- Doug Cutting, Hadoop project creator

History of Hadoop

• Doug Cutting is working on Apache Nutch
• The “Map-reduce” paper appears in 2004
• He reads the paper: “It is an important technique!”
• Nutch is extended with it
• He joins Yahoo! in 2006
• The great journey begins…

• Yahoo! became the primary contributor in 2006.
• Yahoo! deployed large-scale science clusters in 2007.
• Tons of Yahoo! Research papers emerge:
  – WWW
  – CIKM
  – SIGIR
• Yahoo! began running major production jobs in Q1 2008.

Hadoop consists of 2 parts: HDFS & MapReduce.

HDFS

Namenodes & datanodes are simply machines that help the client store data. Metadata (which blocks make up a file and where they live) is stored on the namenode & the actual data is stored on the datanodes.
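To make the namenode/datanode split concrete, here is a minimal in-memory sketch in Python. It is a toy model, not the HDFS API: the classes `NameNode` and `DataNode`, the functions `put_file` and `read_file`, the round-robin placement, and the tiny `block_size` and `replication` values are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DataNode:
    """Hypothetical datanode: holds the actual block contents."""
    name: str
    blocks: dict = field(default_factory=dict)  # block id -> data


@dataclass
class NameNode:
    """Hypothetical namenode: holds metadata only, never file contents."""
    # filename -> list of (block id, [names of datanodes holding the block])
    metadata: dict = field(default_factory=dict)


def put_file(namenode, datanodes, filename, data, block_size=4, replication=2):
    """Split data into blocks and copy each block to `replication` datanodes.

    Placement is simple round-robin (an assumption for this sketch).
    Requires replication <= len(datanodes).
    """
    entries = []
    for start in range(0, len(data), block_size):
        index = start // block_size
        block_id = f"{filename}#blk{index}"
        holders = [datanodes[(index + r) % len(datanodes)]
                   for r in range(replication)]
        for node in holders:
            node.blocks[block_id] = data[start:start + block_size]
        entries.append((block_id, [node.name for node in holders]))
    namenode.metadata[filename] = entries


def read_file(namenode, datanodes, filename):
    """Ask the namenode where the blocks are, then fetch them from datanodes."""
    by_name = {node.name: node for node in datanodes}
    parts = []
    for block_id, holder_names in namenode.metadata[filename]:
        parts.append(by_name[holder_names[0]].blocks[block_id])
    return "".join(parts)


namenode = NameNode()
datanodes = [DataNode("dn1"), DataNode("dn2"), DataNode("dn3")]
put_file(namenode, datanodes, "notes.txt", "hello hadoop")
print(namenode.metadata["notes.txt"])   # metadata only: block ids and locations
print(read_file(namenode, datanodes, "notes.txt"))
```

The client only ever asks the namenode where blocks are; the bytes themselves come from the datanodes, and each block exists on more than one machine so that losing a single datanode does not lose the file.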

A TaskTracker is a daemon that runs on a datanode machine. It accepts tasks (Map, Reduce and Shuffle operations) from a JobTracker.

A JobTracker is a daemon that typically runs on the namenode machine. It farms out MapReduce tasks to specific nodes in the cluster, ideally the nodes that already hold the data, or at least nodes in the same rack.
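The "ideally the nodes that have the data, or at least are in the same rack" preference can be sketched as a small selection function. This is only an illustration of the idea, not the JobTracker's actual scheduling code; the function `pick_node`, the locality labels, and the node and rack names are hypothetical.

```python
def pick_node(block_holders, free_nodes, rack_of):
    """Choose a node for a map task over one block.

    Preference order: a free node that already holds the block,
    then a free node in the same rack as a holder, then any free node.
    Returns (node, locality label), or (None, "no free node").
    """
    for node in free_nodes:
        if node in block_holders:
            return node, "data-local"
    holder_racks = {rack_of[holder] for holder in block_holders}
    for node in free_nodes:
        if rack_of[node] in holder_racks:
            return node, "rack-local"
    if free_nodes:
        return free_nodes[0], "off-rack"
    return None, "no free node"


# Hypothetical three-node cluster spread over two racks.
rack_of = {"n1": "rackA", "n2": "rackA", "n3": "rackB"}

print(pick_node(["n1"], ["n3", "n1"], rack_of))  # ('n1', 'data-local')
print(pick_node(["n1"], ["n2", "n3"], rack_of))  # ('n2', 'rack-local')
print(pick_node(["n1"], ["n3"], rack_of))        # ('n3', 'off-rack')
```

Moving the computation to the data is cheaper than moving the data to the computation, which is why the data-local choice comes first.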

Map-Reduce Architecture

Map-reduce is basically a data processing engine.

To understand it deeply you should have hands-on Java coding experience.

Let's try to learn the architecture of map-reduce.

An example
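As a minimal sketch of the map, shuffle, and reduce steps, here is the classic word-count job simulated in memory in plain Python. It is not Hadoop's Java API and runs on a single machine; the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map over every line, shuffle the pairs, then reduce each group."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values)
                for key, values in shuffle(pairs).items())


print(word_count(["big data", "Big Hadoop data"]))
# {'big': 2, 'data': 2, 'hadoop': 1}
```

In a real cluster each call to the map function runs on whichever node holds that piece of the input, and the shuffle moves the pairs across the network so that all values for one key reach the same reducer.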

BORED ALMOST THERE

JUST ONE MORE CODE

Another Example code

Nowadays (as per latest job market)…

• Software Developer Intern - IBM - Somers, NY +3 locations - Agile development - Big data / Hadoop / data analytics a plus
• Software Developer - IBM - San Jose, CA +4 locations - include Hadoop-powered distributed parallel data processing system, big data analytics ... multiple technologies, including Hadoop

Other Hadoop Projects Ecosystem

• Hadoop Core
  – Distributed File System
  – MapReduce Framework
• Pig (initiated by Yahoo!)
  – Parallel Programming Language and Runtime
• HBase (initiated by Powerset)
  – Table storage for semi-structured data
• ZooKeeper (initiated by Yahoo!)
  – Coordinating distributed systems
• Hive (initiated by Facebook)
  – SQL-like query language and metastore

TYPICAL HADOOP CLUSTER HANDLING & PROCESSING PETA BYTES OF DATA

1000 TB = 1 PETABYTE (APPROX.)
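The figures quoted in this deck can be cross-checked with a few lines of Python. This sketch assumes decimal (power-of-1000) units; with binary units the factor is 1024, which may be why the conversion is given as approximate. The names `UNITS`, `to_bytes`, and `convert` are hypothetical.

```python
# Decimal storage units, each 1000 times larger than the previous one.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]


def to_bytes(value, unit):
    """Convert a value in a decimal unit to bytes."""
    return value * 1000 ** UNITS.index(unit)


def convert(value, from_unit, to_unit):
    """Convert a value from one decimal unit to another."""
    return to_bytes(value, from_unit) / 1000 ** UNITS.index(to_unit)


print(convert(1000, "TB", "PB"))     # 1.0  -> 1000 TB is 1 PB
print(convert(20_000, "TB", "PB"))   # 20.0 -> Google's 20,000 TB a day is 20 PB
print(convert(2.5e18, "B", "EB"))    # 2.5  -> 2.5 quintillion bytes is 2.5 EB
```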

Nowadays… Who uses Hadoop?

• Amazon/A9
• Alchetron
• Fox Interactive Media
• Google
• IBM
• Facebook
• Quantcast
• Rackspace/Mailtrust
• Veoh
• Yahoo!
• More at http://wiki.apache.org/hadoop/PoweredBy

Let's see how we implemented this at Alchetron

When you visit Alchetron.com you are interacting with data processed with Hadoop!!

• Search Index
• Organizing data
• Content Filtering

References

• For more information:
  – http://hadoop.apache.org/
  – http://developer.yahoo.com/hadoop/
  – http://alchetron.com/What-is-Big-data-1530-W
  – http://alchetron.com/Big-Data-Hadoop-260-W
