Big Data & Hadoop & how we use it at Alchetron
TRANSCRIPT
Brief: BIG DATA & HADOOP
Alchetron.com: Free Social Encyclopedia
BIG DATA | HADOOP | HDFS | MAP-REDUCE | ALCHETRON | FEEDBACKS | Q/A
To understand BIG DATA, we first have to understand data!
THIS DRAWING WAS CREATED 40,000 YEARS AGO. THIS WAS THE FIRST TIME HUMANS STARTED RECORDING DATA.
AS TIME PASSED WE STARTED CREATING MORE DATA, AS YOU CAN SEE IN THIS PICTURE, WHICH IS 3,000-10,000 YEARS OLD.
STONE TABLETS
This man invented the printing press around 1439, which meant far more data could be recorded than before.
Johannes Gutenberg
100 crore (1 billion) books had been printed by the 18th century, and my dear friends, you were not even born yet...
THIS GUY INVENTS THE WORLD WIDE WEB
Sir Tim Berners-Lee invents the World Wide Web (proposed in 1989, publicly available in 1991). With the web, the amount of data generated by mankind explodes!!
30 years of mobile Technology
In the next 20 years, computing will move to the microscopic level. Computers won't be in our pockets but inside our bodies and minds. This is where technology and biology will merge, which will multiply and enhance our capabilities a thousand times.
Technological change will be so rapid & exponential
With the invention of the internet plus small, less expensive storage devices, data creation explodes!!
Data generation statistics
2.7 zettabytes of data exist in the digital universe today.
Facebook stores, accesses, and analyzes 50+ Petabytes of user generated data.
Walmart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 petabytes of data.
More than 5 billion people are calling, texting, tweeting and browsing on mobile phones worldwide.
YouTube users upload 48 hours of new video every minute of the day.
In 2008, Google was processing 20,000 terabytes of data (20 petabytes) a day
SO WHAT IS BIG DATA?
Every day, we create 2.5 quintillion bytes of data — so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere : sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals to name a few.
This data is big data.
Who will manage BIG DATA?
HADOOP
• Open source Apache project
• Written in Java
• Runs on Linux, Mac OS X, Windows, and Solaris
• Runs on commodity hardware
Contents
• History of Hadoop
• The current applications of Hadoop
• Hadoop HDFS + MAP-REDUCE
• Other Hadoop projects
Fun fact about Hadoop
"The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria."
---- Doug Cutting, Hadoop project creator
History of Hadoop
• Doug Cutting is working on Apache Nutch.
• 2004: he reads the “Map-reduce” paper: “It is an important technique!”
• He extends Apache Nutch with it.
• 2006: he joins Yahoo!
The great journey begins…
• Yahoo! became the primary contributor in 2006.
• Yahoo! deployed large-scale science clusters in 2007.
• Tons of Yahoo! Research papers emerge:
– WWW
– CIKM
– SIGIR
• Yahoo! began running major production jobs in Q1 2008.
Hadoop consists of two parts: HDFS and MapReduce.
HDFS
NameNodes and DataNodes are simply machines that help the client store data. Metadata (which blocks make up a file and where they live) is stored on the NameNode; the actual data is stored on the DataNodes.
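The split between metadata and actual data can be sketched in a few lines of plain Python. This is only a toy model, not the HDFS API: the `NameNode` and `DataNode` classes, the `put` and `get` helpers, the 4-byte block size, the replication factor of 2 and the round-robin placement are all assumptions made for illustration (real HDFS uses blocks of tens of megabytes and rack-aware placement).

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4      # bytes; tiny on purpose (real HDFS blocks are tens of MB)
REPLICATION = 2     # copies per block; an assumed value for this sketch


@dataclass
class DataNode:
    """Holds the actual block contents."""
    name: str
    blocks: dict = field(default_factory=dict)   # block_id -> bytes


@dataclass
class NameNode:
    """Holds only metadata: which blocks make up a file and where they live."""
    files: dict = field(default_factory=dict)    # path -> [(block_id, [node names])]


def put(namenode, datanodes, path, data):
    """Split data into blocks, store replicas on DataNodes, record metadata."""
    entries = []
    for i in range(0, len(data), BLOCK_SIZE):
        index = i // BLOCK_SIZE
        block_id = f"{path}#{index}"
        # Round-robin placement (an assumption); real HDFS placement is rack-aware.
        targets = [datanodes[(index + r) % len(datanodes)] for r in range(REPLICATION)]
        for node in targets:
            node.blocks[block_id] = data[i:i + BLOCK_SIZE]
        entries.append((block_id, [node.name for node in targets]))
    namenode.files[path] = entries


def get(namenode, datanodes, path):
    """Ask the NameNode where the blocks are, then read them from DataNodes."""
    by_name = {node.name: node for node in datanodes}
    return b"".join(by_name[locations[0]].blocks[block_id]
                    for block_id, locations in namenode.files[path])


namenode = NameNode()
datanodes = [DataNode("dn1"), DataNode("dn2"), DataNode("dn3")]
put(namenode, datanodes, "/logs/a.txt", b"hello hadoop")
print(namenode.files["/logs/a.txt"])            # metadata only, no file contents
print(get(namenode, datanodes, "/logs/a.txt"))  # contents reassembled from DataNodes
```

Note that the NameNode in this sketch never touches the file bytes; it only answers "which blocks, on which machines", which is the division of labour described above.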
A TaskTracker is a daemon that runs on a DataNode machine. It accepts tasks (Map, Reduce and Shuffle operations) from a JobTracker.
A JobTracker is a daemon that runs on the NameNode machine (larger clusters often give it a master machine of its own). It farms out MapReduce tasks to specific nodes in the cluster, ideally the nodes that hold the data, or at least nodes in the same rack.
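The "ideally the nodes that have the data, or at least the same rack" preference can be sketched as a three-step fallback. This is a simplified illustration, not the JobTracker's actual scheduler: `pick_tracker`, the `rack_of` lookup table and the host names are hypothetical.

```python
def pick_tracker(block_hosts, free_trackers, rack_of):
    """Return (tracker, locality) for one map task.

    block_hosts   -- hosts holding a replica of the task's input block
    free_trackers -- hosts whose TaskTracker has a free slot, in arrival order
    rack_of       -- hypothetical host -> rack lookup table
    """
    # 1. Best case: a free tracker on a host that already stores the block.
    for host in free_trackers:
        if host in block_hosts:
            return host, "node-local"
    # 2. Next best: a free tracker in the same rack as some replica.
    replica_racks = {rack_of[h] for h in block_hosts}
    for host in free_trackers:
        if rack_of[host] in replica_racks:
            return host, "rack-local"
    # 3. Otherwise any free tracker; the block has to travel across racks.
    return (free_trackers[0], "off-rack") if free_trackers else (None, "none")


rack_of = {"n1": "rackA", "n2": "rackA", "n3": "rackB", "n4": "rackB"}
print(pick_tracker({"n1"}, ["n3", "n1"], rack_of))  # ('n1', 'node-local')
print(pick_tracker({"n1"}, ["n3", "n2"], rack_of))  # ('n2', 'rack-local')
print(pick_tracker({"n1"}, ["n4", "n3"], rack_of))  # ('n4', 'off-rack')
```

Moving the computation to the data, rather than the data to the computation, is the point: a node-local task reads its block from local disk instead of the network.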
Map-Reduce Architecture
Map-reduce is basically a data processing engine.
To understand it deeply, you should have some experience coding in Java.
Let's try to learn the architecture of map-reduce.
An example
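A minimal sketch of the map, shuffle and reduce phases, assuming the classic word-count job. It is plain Python that imitates the data flow on one machine, not Hadoop's Java API; the function names `map_phase`, `shuffle_phase`, `reduce_phase` and `word_count` are made up for this illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group all values that share a key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, then shuffle, then reduce over the input lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


lines = ["big data needs hadoop", "hadoop stores big data"]
counts = word_count(lines)
print(counts)
```

On a real cluster the mappers run in parallel on the nodes that hold the input blocks, and the shuffle moves each key's values over the network to the reducer responsible for that key; the three functions keep the same roles.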
BORED? ALMOST THERE!
JUST ONE MORE CODE
Another Example code
Nowadays (as per the latest job market)…
• Software Developer Intern - IBM - Somers, NY +3 locations - Agile development - Big data / Hadoop / data analytics a plus
• Software Developer - IBM - San Jose, CA +4 locations - include Hadoop-powered distributed parallel data processing system, big data analytics ... multiple technologies, including Hadoop
Other Hadoop Projects Ecosystem
• Hadoop Core
– Distributed File System
– MapReduce Framework
• Pig (initiated by Yahoo!)
– Parallel Programming Language and Runtime
• HBase (initiated by Powerset)
– Table storage for semi-structured data
• ZooKeeper (initiated by Yahoo!)
– Coordinating distributed systems
• Hive (initiated by Facebook)
– SQL-like query language and metastore
TYPICAL HADOOP CLUSTER HANDLING & PROCESSING PETABYTES OF DATA
1,000 TB = 1 PETABYTE (APPROX.)
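Why "approx."? The answer depends on whether decimal or binary units are meant. A small sketch of the arithmetic, using the standard decimal (TB, PB) and binary (TiB, PiB) definitions; the variable names are only for illustration.

```python
TB = 10**12          # decimal terabyte, in bytes
PB = 10**15          # decimal petabyte, in bytes
TIB = 2**40          # binary tebibyte, in bytes
PIB = 2**50          # binary pebibyte, in bytes

tb_per_pb = PB // TB                 # decimal units: exactly 1000
tib_per_pib = PIB // TIB             # binary units: exactly 1024
pb_from_20000_tb = 20_000 * TB / PB  # the 2008 Google figure quoted earlier
gap = PIB / PB                       # how much larger a PiB is than a PB

print(tb_per_pb, tib_per_pib, pb_from_20000_tb, round(gap, 3))
```

So 1,000 TB is exactly 1 PB in decimal units, while in binary units it takes 1,024 TiB to make 1 PiB, and a PiB is roughly 12.6% larger than a PB.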
Nowadays… who uses Hadoop?
• Amazon/A9
• Alchetron
• Fox Interactive Media
• Google
• IBM
• Facebook
• Quantcast
• Rackspace/Mailtrust
• Veoh
• Yahoo!
• More at http://wiki.apache.org/hadoop/PoweredBy
Let's see how we implemented this at Alchetron
When you visit Alchetron.com, you are interacting with data processed with Hadoop!!
Search Index
Organizing data
Content Filtering
References
• For more information:
– http://hadoop.apache.org/
– http://developer.yahoo.com/hadoop/
– http://alchetron.com/What-is-Big-data-1530-W
– http://alchetron.com/Big-Data-Hadoop-260-W