traffic measurements
![Page 1: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/1.jpg)
Traffic Measurements
Modified from Carey Williamson
![Page 2: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/2.jpg)
Network Traffic Measurement
• A focus of networking research for 20+ years
• Collect data or packet traces showing packet activity on the network for different applications
• Study, analyze, characterize Internet traffic
• Goals:
  – Understand the basic methodologies used
  – Understand the key measurement results to date
2
![Page 3: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/3.jpg)
Why Network Traffic Measurement?
• Understand the traffic on existing networks
• Develop models of traffic for future networks
• Useful for simulations, capacity planning studies
3
![Page 4: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/4.jpg)
Measurement Environments
• Local Area Networks (LANs)
  – e.g., Ethernet LANs
• Wide Area Networks (WANs)
  – e.g., the Internet
• Wireless LANs
• …
4
![Page 5: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/5.jpg)
Requirements
• Network measurement requires hardware or software measurement facilities that attach directly to the network
• These allow you to observe all packet traffic on the network, or to filter it to collect only the traffic of interest
• Assumes broadcast-based network technology and superuser permission
5
![Page 6: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/6.jpg)
Measurement Tools (1 of 3)
• Can be classified into hardware and software measurement tools
• Hardware: specialized equipment
  – Examples: HP 4972 LAN Analyzer, DataGeneral Network Sniffer, others...
• Software: special software tools
  – Examples: tcpdump, xtr, SNMP, others...
6
![Page 7: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/7.jpg)
Measurement Tools (2 of 3)
• Measurement tools can also be classified as active or passive
• Active: the monitoring tool generates traffic of its own during data collection (e.g., ping, pchar)
• Passive: the monitoring tool is passive, observing and recording traffic info, while generating none of its own (e.g., tcpdump)
7
![Page 8: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/8.jpg)
Measurement Tools (3 of 3)
• Measurement tools can also be classified as real-time or non-real-time
• Real-time: collects traffic data as it happens, and may even be able to display traffic info as it happens, for real-time traffic management
• Non-real-time: collected traffic data may only be a subset (sample) of the total traffic, and is analyzed off-line (later), for detailed analysis
8
![Page 9: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/9.jpg)
Potential Uses of Tools (1 of 4)
• Protocol debugging
  – Network debugging and troubleshooting
  – Changing network configuration
  – Designing, testing new protocols
  – Designing, testing new applications
  – Detecting network weirdness: broadcast storms, routing loops, etc.
9
![Page 10: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/10.jpg)
Potential Uses of Tools (2 of 4)
• Performance evaluation of protocols and applications
  – How protocol/application is being used
  – How well it works
  – How to design it better
10
![Page 11: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/11.jpg)
Potential Uses of Tools (3 of 4)
• Workload characterization
  – What traffic is generated
  – Packet size distribution
  – Packet arrival process
  – Burstiness
  – Important in the design of networks, applications, interconnection devices, congestion control algorithms, etc.
11
![Page 12: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/12.jpg)
Potential Uses of Tools (4 of 4)
• Workload modeling
  – Construct synthetic workload models that concisely capture the salient characteristics of actual network traffic
  – Use as representative, reproducible, flexible, controllable workload models for simulations, capacity planning studies, etc.
12
![Page 13: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/13.jpg)
13
![Page 14: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/14.jpg)
Traffic Measurement Time Scales
• Performance analysis
  – representative models
    • throughput, packet loss, packet delay
  – Microseconds to minutes
• Network engineering
  – network configuration
  – capacity planning
  – demand forecasting
  – traffic engineering
  – Minutes to years
• Different measurement methods
![Page 15: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/15.jpg)
Properties
• Most basic view of traffic is as a collection of packets passing through routers and links
• Packets and Bytes
  – One can capture/observe packets at some location
  – Packet arrivals
    • interarrivals
    • count traffic at timescale T
      – Captures workload generated by traffic on a per-packet basis
  – Packet Size
    • time series of Byte count
      – Captures the amount of consumed bandwidth
    • packet size distribution
      – router design etc.
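The packet-level quantities listed above (interarrivals, packet counts at timescale T, and the byte-count time series) can be sketched as follows. This is a minimal illustration, assuming a trace is available as (timestamp, size) pairs; the function names and the sample trace are hypothetical.

```python
def interarrivals(timestamps):
    """Gaps between consecutive packet arrival times (seconds)."""
    ts = sorted(timestamps)
    return [b - a for a, b in zip(ts, ts[1:])]

def series_at_timescale(packets, T):
    """Per-bin packet and byte counts for bins of width T seconds.

    packets: iterable of (timestamp_seconds, size_bytes) pairs.
    Bin 0 starts at the first arrival in the trace.
    """
    packets = sorted(packets)
    if not packets:
        return [], []
    start = packets[0][0]
    nbins = int((packets[-1][0] - start) // T) + 1
    pkt_counts = [0] * nbins
    byte_counts = [0] * nbins
    for ts, size in packets:
        i = int((ts - start) // T)
        pkt_counts[i] += 1      # per-packet workload
        byte_counts[i] += size  # consumed bandwidth
    return pkt_counts, byte_counts

# Hypothetical trace: (arrival time in seconds, packet size in bytes)
trace = [(0.0, 40), (0.25, 1500), (0.5, 1500), (2.0, 40), (2.75, 576)]
gaps = interarrivals([t for t, _ in trace])
pkts, byts = series_at_timescale(trace, T=1.0)
```

Changing T changes the view of the same trace: a small T exposes bursts, a large T smooths them out.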
15
![Page 16: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/16.jpg)
Higher-level Structure
• Transport protocols and applications
• ON/OFF process
  – bursty workload
  – Packet-level
  – Packet Train
    • interarrival threshold
  – Session
    • single execution of an application
    • Human generated
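A packet train defined by an interarrival threshold can be sketched like this: a gap larger than the threshold ends the current train and starts a new one. The threshold value and the arrival times are hypothetical.

```python
def packet_trains(timestamps, threshold):
    """Group arrival times into trains; a gap > threshold starts a new train."""
    trains = []
    for ts in sorted(timestamps):
        if trains and ts - trains[-1][-1] <= threshold:
            trains[-1].append(ts)   # still inside the current train (ON period)
        else:
            trains.append([ts])     # gap exceeded: new train after an OFF period
    return trains

# Hypothetical arrival times in seconds
arrivals = [0.0, 0.125, 0.25, 3.0, 3.125, 9.0]
trains = packet_trains(arrivals, threshold=0.5)
```

The result depends strongly on the chosen threshold, which is why the threshold is part of the train definition.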
16
![Page 17: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/17.jpg)
Flows
• Set of packets passing an observation point during a time interval with all packets having a set of common properties
  – Header field contents, packet characteristics, etc.
• IP flows
  – source/destination addresses
  – IP or transport header fields
  – prefix
• Network-defined flow
  – network’s workload
  – ingress and egress
  – Traffic matrix and Path matrix
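One common way to make the flow definition concrete is to use the 5-tuple (source, destination, source port, destination port, protocol) as the set of common properties. The sketch below assumes that choice and a hypothetical packet record layout; it ignores the time-interval part of the definition for brevity.

```python
from collections import defaultdict

def aggregate_flows(packets):
    """Group packets by 5-tuple and count packets and bytes per flow."""
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for p in packets:
        key = (p["src"], p["dst"], p["sport"], p["dport"], p["proto"])
        flows[key]["packets"] += 1
        flows[key]["bytes"] += p["size"]
    return dict(flows)

# Hypothetical packet records (documentation address ranges)
packets = [
    {"src": "192.0.2.1", "dst": "198.51.100.7", "sport": 40000, "dport": 80, "proto": "tcp", "size": 60},
    {"src": "192.0.2.1", "dst": "198.51.100.7", "sport": 40000, "dport": 80, "proto": "tcp", "size": 1500},
    {"src": "192.0.2.2", "dst": "198.51.100.7", "sport": 40001, "dport": 53, "proto": "udp", "size": 80},
]
flows = aggregate_flows(packets)
```

Keying on address prefixes or on ingress/egress points instead of the 5-tuple would give the coarser flow types listed above.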
17
![Page 18: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/18.jpg)
Semantically Distinct Traffic Types
• Control Traffic
  – Control plane
    • Routing protocols
      – BGP, OSPF, IS-IS
    • Measurement and management
      – SNMP
    • General control packets
      – ICMP
  – Data plane
• Malicious Traffic
18
![Page 19: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/19.jpg)
19
![Page 20: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/20.jpg)
Challenges
• Practical issues
  – Observability
    • Core simplicity
      – Flows
      – Packets
    • Distributed Internetworking
    • IP Hourglass
  – Data volume
  – Data sharing
20
![Page 21: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/21.jpg)
Challenges
• Statistical difficulties
  – Long tails and High variability
    • Instability of metrics
    • Modeling difficulty
    • Confounding intuition
  – Stationarity and stability
    • Stationarity: joint probability distribution does not change when shifted in time
    • Stability: consistency of properties over time
  – Autocorrelation and memory in system behavior
  – High dimensionality
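Autocorrelation in a traffic time series can be estimated with the usual sample autocorrelation at a given lag. This is a minimal sketch; the per-interval counts are hypothetical, and the estimator assumes the series is roughly stationary over the window.

```python
def autocorrelation(xs, lag):
    """Sample autocorrelation of the series xs at the given lag."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    if var == 0 or lag >= n:
        return 0.0  # constant series or lag too large: no usable estimate
    cov = sum((xs[i] - mean) * (xs[i + lag] - mean) for i in range(n - lag))
    return cov / var

# Hypothetical packet counts per interval with visible memory (runs of high and low)
counts = [10, 9, 8, 2, 1, 2, 9, 10, 8, 1, 2, 1]
r1 = autocorrelation(counts, 1)
```

Values near 0 at all lags suggest little memory; slowly decaying positive values suggest the long-range dependence often reported for network traffic.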
21
![Page 22: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/22.jpg)
Tools
• Packet Capture
  – General purpose systems
    • libpcap
    • tcpdump
    • ethereal
    • scriptroute
    • …
  – Special purpose system
  – Control plane traffic
    • GNU Zebra
    • Routeviews
22
![Page 23: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/23.jpg)
Data Management
• Full packet capture and storage is challenging
• Limitations of commodity PC
• Data stream management
• Big Data platforms
  – Hadoop, etc.
23
![Page 24: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/24.jpg)
Data Reduction
• Lossy compression
• Counters
  – SNMP Management Information Base
• Flow capture
  – Packet trains
  – Packet flows
24
![Page 25: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/25.jpg)
Data Reduction
• Sampling
  – Basic packet sampling
    • Random: with fixed probability
    • Deterministic: periodic samples
    • Stratified: multi step sampling
  – Trajectory sampling
    • Choose the same randomly sampled packets at all locations
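The three basic packet sampling schemes can be sketched as below. The stratified variant here is one possible reading (split the stream into consecutive buckets, then pick one packet at random from each); the slides do not spell out the exact steps, so treat that part as an assumption.

```python
import random

def random_sample(packets, p, rng):
    """Random sampling: keep each packet independently with fixed probability p."""
    return [pkt for pkt in packets if rng.random() < p]

def deterministic_sample(packets, n):
    """Deterministic (periodic) sampling: keep every n-th packet."""
    return packets[::n]

def stratified_sample(packets, bucket, rng):
    """Stratified sampling (assumed form): one random packet per consecutive bucket."""
    return [rng.choice(packets[i:i + bucket])
            for i in range(0, len(packets), bucket)]

rng = random.Random(1)        # fixed seed so the sketch is repeatable
stream = list(range(10))      # stand-in for ten packets
periodic = deterministic_sample(stream, 3)
strata = stratified_sample(stream, 3, rng)
```

Periodic sampling is cheap but can alias with periodic traffic; random and stratified sampling avoid that at the cost of a random number per packet or per bucket.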
25
![Page 26: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/26.jpg)
Data Reduction
• Summarization
  – Bloom filters
  – Sketches: Dimension reducing random projections
  – Probabilistic counting
  – Landmark/sliding window models
26
![Page 27: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/27.jpg)
Review: Bloom Filters
• Given a set S = {x1, x2, x3, …, xn} on a universe U, want to answer queries of the form: Is y ∈ S?
• Bloom filter provides an answer in
  – “Constant” time (time to hash).
  – Small amount of space.
  – But with some probability of being wrong.
• Alternative to hashing with interesting tradeoffs.
![Page 28: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/28.jpg)
Review: Bloom Filters
Start with an m bit array B, filled with 0s.
B = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Hash each item xj in S k times. If Hi(xj) = a, set B[a] = 1.
B = 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0
To check if y is in S, check B at Hi(y). All k values must be 1.
B = 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0
Possible to have a false positive; all k values are 1, but y is not in S.
B = 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0
n items, m = cn bits, k hash functions
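The insert and query steps can be sketched in a few lines. The slides do not say how the k hash functions Hi are built; deriving them by salting SHA-256 with the index i is an assumption made here for illustration.

```python
import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m = m              # number of bits in the array B
        self.k = k              # number of hash functions
        self.bits = [0] * m     # start with all 0s

    def _positions(self, item):
        # Assumed construction: H_i(item) = SHA-256("i:item") mod m
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for a in self._positions(item):
            self.bits[a] = 1    # set B[a] = 1 for each of the k hashes

    def __contains__(self, item):
        # All k positions must be 1; a "yes" may be a false positive
        return all(self.bits[a] for a in self._positions(item))

bf = BloomFilter(m=64, k=3)
for addr in ["10.0.0.1", "10.0.0.2"]:
    bf.add(addr)
```

A "no" answer is always correct, since adding an item can only set bits; only "yes" answers carry a chance of error.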
![Page 29: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/29.jpg)
Review: Bloom Filters
• Tradeoffs
• Three parameters.
  – Size m/n : bits per item.
  – Time k : number of hash functions.
  – Error f : false positive probability.
29
![Page 30: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/30.jpg)
Review: Bloom Filters
• False Positive Probability
• Pr(specific bit of filter is 0) is
  $$p' = (1 - 1/m)^{kn} \approx e^{-kn/m} = p$$
• If r is fraction of 0 bits in the filter then false positive probability is
  $$(1 - r)^k \approx (1 - p')^k \approx (1 - p)^k = (1 - e^{-k/c})^k$$
• Approximations valid as r is concentrated around E[r].
  – Martingale argument suffices.
• Find optimal at k = (ln 2)m/n by calculus.
  – So optimal fpp is about $(0.6185)^{m/n}$
n items, m = cn bits, k hash functions
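As a numeric check of the tradeoff, the sketch below evaluates the approximate false positive probability (1 − e^(−kn/m))^k and the optimal k = (ln 2)m/n. The choice of 8 bits per item is only an example.

```python
import math

def false_positive_rate(m, n, k):
    """Approximate false positive probability for m bits, n items, k hashes."""
    return (1 - math.exp(-k * n / m)) ** k

def optimal_k(m, n):
    """Real-valued k that minimizes the approximation above."""
    return (m / n) * math.log(2)

bits_per_item = 8                       # example value of c = m/n
k_star = optimal_k(bits_per_item, 1)    # about 5.5
fpp_star = false_positive_rate(bits_per_item, 1, k_star)
fpp_k6 = false_positive_rate(bits_per_item, 1, 6)   # nearest practical integer k
```

In practice k must be an integer, so the achieved rate is slightly above the optimum; with 8 bits per item both values are close to 2%.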
![Page 31: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/31.jpg)
Data Reduction
• Dimensionality reduction– Clustering
– Principal Component Analysis
• Probabilistic models– Distribution models– Dependence structure
• Inference– Traffic Matrix estimation
31
![Page 32: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/32.jpg)
Curse of Dimensionality.
• A major problem is the curse of dimensionality.
• If the data x lies in high dimensional space, then an enormous amount of data is required to learn distributions or decision rules.
• Example: 50 dimensions. Each dimension has 20 levels. This gives a total of $20^{50}$ cells. But the number of data samples will be far less. There will not be enough data samples to learn.
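The size of the example is easy to underestimate. With 50 dimensions and 20 levels per dimension the grid has 20^50 cells; the short calculation below compares that with a hypothetical (and very generous) sample count.

```python
dimensions = 50
levels = 20
cells = levels ** dimensions          # exact integer, about 1.1e65

samples = 10 ** 12                    # hypothetical: a trillion data samples
cells_per_sample = cells // samples   # cells for every single sample
```

Even with a trillion samples, almost every cell stays empty, so per-cell estimates are not possible.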
![Page 33: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/33.jpg)
Dimension Reduction
• One way to avoid the curse of dimensionality is by projecting the data onto a lower-dimensional space.
• Techniques for dimension reduction:
  – Principal Component Analysis (PCA)
  – Fisher’s Linear Discriminant
  – Multi-dimensional Scaling.
  – Independent Component Analysis.
  – …
![Page 34: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/34.jpg)
Principal Component Analysis
• PCA is the most commonly used dimension reduction technique.
  – Also called the Karhunen-Loeve transform
• PCA – data samples
• Compute the mean
• Compute the covariance:
![Page 35: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/35.jpg)
Principal Component Analysis
• Compute the eigenvalues and eigenvectors of the matrix
• Solve
• Order them by magnitude:
• PCA reduces the dimension by keeping direction such that
![Page 36: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/36.jpg)
Principal Component Analysis
• For many datasets, most of the eigenvalues $\lambda$ are negligible and can be discarded.
The eigenvalue measures the variation in the direction e
Example:
![Page 37: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/37.jpg)
Why Principal Component Analysis?
• Motive
  – Find basis vectors along which the data has high variance
  – Encode data with a small number of basis vectors with low MSE
![Page 38: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/38.jpg)
Dimensionality Reduction
[Figure: bar chart of variance (%) for principal components PC1 through PC10]
Can ignore the components of less significance.
You do lose some information, but if the eigenvalues are small, you don’t lose much
– n dimensions in original data
– calculate n eigenvectors and eigenvalues
– choose only the first p eigenvectors, based on their eigenvalues
– final data set has only p dimensions
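The PCA steps (mean, covariance, eigenvectors, projection) can be sketched without external libraries for the simplest case p = 1. The sketch assumes the covariance is normalized by 1/N and finds only the leading eigenvector by power iteration; a real analysis would normally use a linear algebra library and keep several components. The sample data are hypothetical.

```python
import math

def mean_vector(X):
    return [sum(col) / len(X) for col in zip(*X)]

def covariance(X):
    """Covariance matrix with 1/N normalization (an assumption)."""
    mu = mean_vector(X)
    n, d = len(X), len(mu)
    C = [[0.0] * d for _ in range(d)]
    for x in X:
        c = [xi - mi for xi, mi in zip(x, mu)]
        for i in range(d):
            for j in range(d):
                C[i][j] += c[i] * c[j] / n
    return C

def top_eigenpair(C, iters=500):
    """Power iteration: largest eigenvalue and its unit eigenvector.

    Fails if the start vector is orthogonal to the leading eigenvector.
    """
    d = len(C)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(C[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm == 0:
            break
        v = [wi / norm for wi in w]
    lam = sum(v[i] * sum(C[i][j] * v[j] for j in range(d)) for i in range(d))
    return lam, v

def project(X, v):
    """Coordinates of the centered data along direction v (p = 1)."""
    mu = mean_vector(X)
    return [sum((xi - mi) * vi for xi, mi, vi in zip(x, mu, v)) for x in X]

# Hypothetical 2-D data that varies mostly along the first axis
X = [(2.0, 0.0), (0.0, 0.0), (-2.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
C = covariance(X)
lam, v = top_eigenpair(C)
explained = lam / (C[0][0] + C[1][1])   # share of total variance kept
reduced = project(X, v)
```

Here the leading direction keeps 80% of the variance, so the 2-D data can be stored as one coordinate per sample with modest loss.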
![Page 39: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/39.jpg)
Dimensionality Reduction
[Figure: variance versus dimensionality]
![Page 40: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/40.jpg)
PCA and Discrimination
• PCA may not find the best directions for discriminating between two classes.
• Example: suppose the two classes have 2D Gaussian densities as ellipsoids.
• 1st eigenvector is best for representing the probabilities.
• 2nd eigenvector is best for discrimination.
![Page 41: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/41.jpg)
Linear methods
• Principal Component Analysis (PCA)
One-dimensional manifold
![Page 42: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/42.jpg)
Nonlinear Manifolds
Unroll the manifold
PCA and MDS see the Euclidean distance
What is important is the geodesic distance
![Page 43: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/43.jpg)
To preserve structure, preserve the geodesic distance and not the Euclidean distance.
![Page 44: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/44.jpg)
Two methods
• Tenenbaum et al.’s Isomap Algorithm
  – Global approach.
  – On a low dimensional embedding
    • Nearby points should be nearby.
    • Faraway points should be faraway.
• Roweis and Saul’s Locally Linear Embedding Algorithm
  – Local approach
    • Nearby points nearby
![Page 45: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/45.jpg)
Isomap
• Estimate the geodesic distance between faraway points.
• For neighboring points, Euclidean distance is a good approximation to the geodesic distance.
• For faraway points, estimate the distance by a series of short hops between neighboring points.
  – Find shortest paths in a graph with edges connecting neighboring data points
Once we have all pairwise geodesic distances, use classical metric MDS
![Page 46: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/46.jpg)
Isomap - Algorithm
• Determine the neighbors.
  – All points in a fixed radius.
  – K nearest neighbors
• Construct a neighborhood graph.
  – Each point is connected to the other if it is a K nearest neighbor.
  – Edge length equals the Euclidean distance
• Compute the shortest paths between two nodes
  – Floyd’s algorithm
  – Dijkstra’s algorithm
• Construct a lower dimensional embedding.
  – Classical MDS
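The first three Isomap steps (neighbors, neighborhood graph, shortest paths) can be sketched as follows, using K nearest neighbors and Floyd's algorithm. The final classical MDS step is left out, and the half-circle test data are hypothetical.

```python
import math

def geodesic_distances(points, k):
    """Approximate geodesic distances via a K-nearest-neighbor graph."""
    n = len(points)
    D = [[0.0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for i in range(n):
        # K nearest neighbors of point i by Euclidean distance
        nearest = sorted((j for j in range(n) if j != i),
                         key=lambda j: math.dist(points[i], points[j]))[:k]
        for j in nearest:
            d = math.dist(points[i], points[j])
            D[i][j] = D[j][i] = d      # undirected edge, length = Euclidean distance
    # Floyd's algorithm: all-pairs shortest paths through the graph
    for mid in range(n):
        for i in range(n):
            for j in range(n):
                if D[i][mid] + D[mid][j] < D[i][j]:
                    D[i][j] = D[i][mid] + D[mid][j]
    return D

# Hypothetical data: 9 points on a half circle of radius 1 (a 1-D manifold in 2-D)
points = [(math.cos(i * math.pi / 8), math.sin(i * math.pi / 8)) for i in range(9)]
D = geodesic_distances(points, k=2)
euclid_ends = math.dist(points[0], points[8])   # straight line across: about 2
geo_ends = D[0][8]                              # along the arc: close to pi
```

The end points are 2 apart in Euclidean distance but roughly 3.1 apart along the curve, which is the distance Isomap tries to preserve. If K is too small the graph can be disconnected and some entries stay infinite.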
![Page 47: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/47.jpg)
Isomap
![Page 48: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/48.jpg)
Observations
48
![Page 49: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/49.jpg)
Overview of Traffic Analysis
49
![Page 50: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/50.jpg)
Traffic Samples from Internet2
50
![Page 51: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/51.jpg)
Packet Trains and Autocorrelation
51
![Page 52: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/52.jpg)
Observation #1
• The traffic model that you use is extremely important in the performance evaluation of routing, flow control, and congestion control strategies
  – Have to consider application-dependent, protocol-dependent, and network-dependent characteristics
  – The more realistic, the better
52
![Page 53: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/53.jpg)
Observation #2
• Characterizing aggregate network traffic is hard
  – Lots of (diverse) applications
  – Just a snapshot: traffic mix, protocols, applications, network configuration, technology, and users change with time
53
![Page 54: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/54.jpg)
Observation #3
• Packet arrival process is not Poisson
  – Packets travel in trains
  – Packets travel in tandems
  – Packets get clumped together (ack compression)
  – Interarrival times are not exponential
  – Interarrival times are not independent
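One quick, rough check of the exponential assumption is the coefficient of variation (standard deviation divided by mean) of the interarrival times, which equals 1 for an exponential distribution. The sketch below compares simulated Poisson arrivals with a hypothetical train-like pattern; a value near 1 is necessary but not sufficient for Poisson behavior.

```python
import random
import statistics

def coefficient_of_variation(gaps):
    """Standard deviation over mean of the interarrival times."""
    return statistics.pstdev(gaps) / statistics.mean(gaps)

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Exponential gaps, as a Poisson arrival process with rate 10 packets/s would give
poisson_gaps = [rng.expovariate(10.0) for _ in range(10000)]

# Hypothetical train-like traffic: nine back-to-back gaps, then a long pause
bursty_gaps = ([0.001] * 9 + [1.0]) * 100

cv_poisson = coefficient_of_variation(poisson_gaps)
cv_bursty = coefficient_of_variation(bursty_gaps)
```

The train-like pattern gives a coefficient of variation near 3, far from the value of 1 expected under a Poisson model.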
54
![Page 55: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/55.jpg)
Observation #4
• Packet traffic is bursty
  – Average utilization may be very low
  – Peak utilization can be very high
  – Depends on what interval you use!!
  – Traffic may be self-similar
    • bursts exist across a wide range of time scales
  – Defining burstiness (precisely) is difficult
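The dependence on the measurement interval can be made concrete with a peak-to-mean ratio computed at two different interval lengths. Peak-to-mean is only one of several possible burstiness measures, and the byte counts below are hypothetical.

```python
def peak_to_mean(byte_counts, window):
    """Peak/mean traffic volume after summing `window` consecutive bins.

    Assumes the series has at least `window` bins and nonzero total traffic.
    """
    sums = [sum(byte_counts[i:i + window])
            for i in range(0, len(byte_counts) - window + 1, window)]
    mean = sum(sums) / len(sums)
    return max(sums) / mean

# Hypothetical byte counts per 1 ms bin: two isolated bursts in 16 ms
counts = [0, 0, 0, 100, 0, 0, 0, 0, 0, 0, 0, 0, 0, 100, 0, 0]
ratio_1ms = peak_to_mean(counts, window=1)   # fine time scale
ratio_8ms = peak_to_mean(counts, window=8)   # coarse time scale
```

The same trace looks very bursty at 1 ms (ratio 8) and perfectly smooth at 8 ms (ratio 1). For self-similar traffic the ratio would stay high across many time scales.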
55
![Page 56: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/56.jpg)
Observation #5
• Traffic is non-uniformly distributed amongst the hosts on the network
  – Example: 10% of the hosts account for 90% of the traffic (or 20-80)
  – Why?
    • Clients versus servers, geographic reasons, popular ftp sites, web sites, etc.
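A statement such as "10% of the hosts account for 90% of the traffic" can be checked directly from per-host byte counts. The host names and volumes below are hypothetical.

```python
def top_share(bytes_by_host, fraction):
    """Share of total bytes sent by the busiest `fraction` of hosts."""
    volumes = sorted(bytes_by_host.values(), reverse=True)
    top_n = max(1, round(len(volumes) * fraction))
    return sum(volumes[:top_n]) / sum(volumes)

# Hypothetical per-host byte counts: one server and nine clients
bytes_by_host = {"server": 8100}
bytes_by_host.update({f"client{i}": 100 for i in range(9)})

share = top_share(bytes_by_host, 0.10)
```

Here the busiest 10% of hosts (the single server) carries 90% of the bytes.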
56
![Page 57: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/57.jpg)
Observation #6
• Network traffic exhibits “locality” effects
  – Pattern is far from random
  – Temporal locality
  – Spatial locality
  – Persistence and concentration
  – True at host level, at gateway level, at application level
57
![Page 58: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/58.jpg)
Observation #7
• Well over 90% of the byte and packet traffic on most networks is TCP/IP
  – By far the most prevalent
  – Often as high as 95-99%
  – Most studies focus only on TCP/IP for this reason
58
![Page 59: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/59.jpg)
Observation #8
• Most conversations are short
  – Example: 90% of bulk data transfers send less than 10 kilobytes of data
  – Example: 50% of interactive connections last less than 90 seconds
  – Distributions may be “heavy tailed”
    • i.e., extreme values may skew the mean and/or the distribution
59
![Page 60: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/60.jpg)
Observation #9
• Traffic is bidirectional
  – Data usually flows both ways
  – Not just acks in the reverse direction
  – Usually asymmetric bandwidth though
  – Pretty much what you would expect from the TCP/IP traffic for most applications
60
![Page 61: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/61.jpg)
Observation #10
• Packet size distribution is bimodal
  – Lots of small packets for interactive traffic and acknowledgements
  – Lots of large packets for bulk data file transfer type applications
  – Very few in between sizes
61
![Page 62: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/62.jpg)
Network Security Monitoring and Analysis based on Big Data Technologies
Bingdong Li
62
![Page 63: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/63.jpg)
Objectives
• A network security monitoring and analysis system based on Big Data technologies that provides
  – Network measurement
  – Real-time continuous monitoring and interactive visualization
  – Intelligent network object classification and identification based on role behavior as context
![Page 64: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/64.jpg)
Objectives
[Figure: diagram with labels Big Data, Machine Learning, Network Security]
![Page 65: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/65.jpg)
65
![Page 66: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/66.jpg)
System Design
• Data Collection
66
![Page 67: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/67.jpg)
System Design
• Online Real Time Process
67
![Page 68: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/68.jpg)
System Design
• NoSQL Storage
68
![Page 69: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/69.jpg)
System Design
• User Interfaces
69
![Page 70: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/70.jpg)
70
![Page 71: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/71.jpg)
Monitoring and Visualization
• Real time: response within a time constraint
• Interactive: involves user interaction
• Continuously: “continue to be effective over time in light of the inevitable changes that occur” (NIST)
71
![Page 72: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/72.jpg)
Network Status
72
![Page 73: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/73.jpg)
Top N
73
![Page 74: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/74.jpg)
Hadoop
Dhruba Borthakur
74
![Page 75: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/75.jpg)
Hadoop, Why?
• Need to process Multi Petabyte Datasets
• Expensive to build reliability in each application.
• Nodes fail every day
  – Failure is expected, rather than exceptional.
  – The number of nodes in a cluster is not constant.
• Need common infrastructure
  – Efficient, reliable, Open Source Apache License
• The above goals are the same as Condor, but
  – Workloads are IO bound and not CPU bound
![Page 76: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/76.jpg)
Commodity Hardware
Typically in 2 level architecture
– Nodes are commodity PCs
– 30-40 nodes/rack
– Uplink from rack is 3-4 gigabit
– Rack-internal is 1 gigabit
![Page 77: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/77.jpg)
Goals of HDFS
• Very Large Distributed File System
  – 10K nodes, 100 million files, 10 PB
• Assumes Commodity Hardware
  – Files are replicated to handle hardware failure
  – Detects failures and recovers from them
• Optimized for Batch Processing
  – Data locations exposed so that computations can move to where data resides
  – Provides very high aggregate bandwidth
• User Space, runs on heterogeneous OS
![Page 78: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/78.jpg)
HDFS Architecture
[Figure: HDFS architecture with Client, NameNode, SecondaryNameNode, and DataNodes. Steps: 1. filename, 2. block ID and DataNodes, 3. read data. Links labeled Cluster Membership]
• NameNode: Maps a file to a file-id and list of DataNodes
• DataNode: Maps a block-id to a physical location on disk
• SecondaryNameNode: Periodic merge of Transaction log
![Page 79: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/79.jpg)
Distributed File System
• Single Namespace for entire cluster
• Data Coherency
  – Write-once-read-many access model
  – Client can only append to existing files
• Files are broken up into blocks
  – Typically 128 MB block size
  – Each block replicated on multiple DataNodes
• Intelligent Client
  – Client can find location of blocks
  – Client accesses data directly from DataNode
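The block model can be illustrated with a small calculation: split a file into 128 MB blocks and assign each block to several DataNodes. The round-robin placement and the replication factor of 3 are assumptions for illustration only; this is not how HDFS actually chooses replica locations, and the DataNode names are hypothetical.

```python
MIB = 1024 * 1024

def split_into_blocks(file_size, block_size=128 * MIB):
    """Block lengths in bytes for a file; only the last block may be short."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])

def place_replicas(num_blocks, datanodes, replication=3):
    """Hypothetical round-robin placement; needs replication <= len(datanodes)."""
    return {
        b: [datanodes[(b + r) % len(datanodes)] for r in range(replication)]
        for b in range(num_blocks)
    }

blocks = split_into_blocks(300 * MIB)            # a 300 MB file
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
```

A client reading the file would ask the NameNode for this block-to-DataNode mapping and then fetch each block directly from one of the listed DataNodes.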
![Page 80: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/80.jpg)
![Page 81: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/81.jpg)
NameNode Metadata
• Meta-data in Memory
  – The entire metadata is in main memory
  – No demand paging of meta-data
• Types of Metadata
  – List of files
  – List of Blocks for each file
  – List of DataNodes for each block
  – File attributes, e.g., creation time, replication factor
• A Transaction Log
  – Records file creations, file deletions, etc.
![Page 82: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/82.jpg)
DataNode
• A Block Server
  – Stores data in the local file system (e.g. ext3)
  – Stores meta-data of a block (e.g. CRC)
  – Serves data and meta-data to Clients
• Block Report
  – Periodically sends a report of all existing blocks to the NameNode
• Facilitates Pipelining of Data
  – Forwards data to other specified DataNodes
![Page 83: Traffic Measurements](https://reader036.vdocuments.mx/reader036/viewer/2022062422/568137ea550346895d9f9e70/html5/thumbnails/83.jpg)
Data Flow
[Figure: data flow among Web Servers, Scribe Servers, Network Storage, Hadoop Cluster, Oracle RAC, and MySQL]