CISA: Continually Improving Stream Analysis
Nancy McMillan, Doug Mooney, Dave Burgoon
March 14, 2003
04/21/23 2
Agenda
• Background and Overview
• Architecture
• Algorithms
• Results
MURALS: Multiple Use Real-time Analytics for Large Scale Data
Major information technology initiative
• Objective: Develop intellectual property addressing the challenges created by:
– Data generation/collection at previously unimaginable rates
– Growing expectation that real-time decision-making is feasible and necessary for competitive advantage
– Dramatic increase in the data-to-information ratio
– Compelling need for balance between result precision and timeliness
Sponsored development of two technologies
• InfoRes: Addresses IT issues associated with real-time querying of very large relational databases
• CISA: Addresses IT issues associated with real-time analysis of high-volume (varying arrival speed) stream data
Background: Our problem space
Many data sources supplying stream data
Stream data can be summarized by a set of features/summary statistics over some time window
Each data source needs to be continually classified or characterized
Classification/characterization of a single data source may depend on data from other data sources
Examples:
• Computers connecting to a firewall
• Sensor networks
Internet Security Example: Who is trying to inappropriately access a company’s network?
There are 19 firewalls recording connections in a log file
• Date/Time
• Source and Destination IP addresses
• Protocol
• Action (Accept/Drop/Decrypt/..)
• Service
• Rule
Inbound and outbound connections and warnings over a six-day period in July 2002 were logged
• but connections from site-to-site VPNs are not
• only externally initiated connections are being analyzed
• more data (6 days in September) were provided later
The Problem: The faster data arrives, the more processing power is required for real-time analysis.
Every data arrival initiates some tasks (store data, recalculate features, update decisions, etc.), each of which requires computational time
• Systems designed for gushing data waste resources when data trickles.
• Systems designed for slower data flow fail when data arrives too fast.
More sophisticated analysis techniques (better features, decision algorithms, etc.) require more computational time, but can provide better answers
• Analytics designed for gushing data don’t provide the best answer possible when data trickles.
• Analytics designed for slower data flow don’t provide timely answers when data arrives too fast.
To what data arrival rate should the system be designed?
The CISA Answer: A precision-speed trade-off
When the data arrives more slowly than the system design rate, the best possible answer is provided
• All data is considered.
• Best analysis techniques are used.
As the data flows faster than the system design rate, the accuracy and/or precision of the solution degrades smoothly.
System achieves precision-speed trade-off through:
• Architecture
– Answer not based on all current data
– Requires feedback from algorithm so most important data is considered
• Algorithms
– Partial/approximate solutions provided
Architecture and Algorithm Overview: How CISA achieves precision-speed trade-off
Architecture
• Assign analysis tasks to asynchronously operating objects
– storage, characterization, decision-making, and visualization
• Prioritize analysis tasks associated with each new piece of data
– Data likely to impact analysis is analyzed sooner
Algorithm
• Use incremental algorithms where possible
– Update previous answer with new data rather than re-analyze all data
• Stop or modify iterative or multi-step algorithms before completion when new data arrivals need to enter algorithm
– Partial/approximate solutions provided
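As an illustration of "update previous answer with new data rather than re-analyze all data", here is a minimal Python sketch of an incrementally maintained summary statistic. It uses Welford's method for a running mean and variance, which is a standard technique and not necessarily what CISA used. The class name `RunningStats` and the sample values are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Count, mean, and variance maintained incrementally (Welford's method)."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        """Fold one new observation into the previous answer."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        """Sample variance; 0.0 until at least two observations arrive."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


stats = RunningStats()
for hits_per_sec in [2.0, 4.0, 4.0, 5.0]:  # illustrative values only
    stats.update(hits_per_sec)
print(stats.n, stats.mean, stats.variance)
```

Each arrival costs constant time regardless of how much data has already been seen, which is the property that makes an incremental algorithm attractive when data arrives quickly.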
Agenda
• Background and Overview
• Architecture
• Algorithms
• Results
CISA Architectural Components: Diagram
[Diagram. Components shown: Source Data Objects (Source 1, Source 2, . . .), Data Management Object (Database holding Source 1 and Source 2), Algorithm Objects (Algorithm, with PRIORITIZE steps), and Visualization/Monitor. Legend: Raw Data, Summary Statistic/Feature, Algorithm, Direct Connection.]
Internet Security Example Architecture: Diagram
[Diagram. Groupings shown: Source Data Objects, Data Management Object, Algorithm Object, and Visualization/State Reporter.
Labeled elements: Firewall 1, Firewall 2, Source 1 Feature/State, Source 2 Feature/State, Database Requester, Database (Source 1, Source 2), Decision Maker, PRIORITIZE steps.
Topics: Source 1 Data, Source 2 Data, Database, Source 1 Feature, Source 2 Feature, Decision Update, Decision Made.
Roles: Publisher, Listener, Listener-Publisher.
Legend: Log Data Message, Feature calculation Message, State Update Message, Direct Connection.]
Implementation:
• Access database
• JMS object communication
• SAS Analytics
• Java
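The topic-based wiring named in the diagram can be illustrated with a small in-process stand-in. This Python sketch is not JMS and is synchronous, whereas the slides describe asynchronously operating objects. It only shows the listener-publisher pattern: a source publishes log records to a data topic, a feature object listens and republishes a feature, and a decision maker listens to feature topics. The `Broker` class, the topic names, the record fields, and the 50% drop threshold are all assumptions made for this example.

```python
from collections import defaultdict


class Broker:
    """Tiny synchronous publish/subscribe hub (a stand-in, not JMS)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        for callback in list(self._subscribers[topic]):
            callback(message)


def wire_source(broker, source):
    """Listener-publisher: consumes log data, publishes an updated feature."""
    state = {"hits": 0, "drops": 0}

    def on_log(record):
        state["hits"] += 1
        if record["action"] == "drop":
            state["drops"] += 1
        drop_pct = 100.0 * state["drops"] / state["hits"]
        broker.publish(f"{source}/feature",
                       {"source": source, "drop_pct": drop_pct})

    broker.subscribe(f"{source}/data", on_log)


def wire_decision_maker(broker, sources, decisions, threshold=50.0):
    """Listener: turns feature messages into a (hypothetical) decision."""
    def on_feature(message):
        label = "suspicious" if message["drop_pct"] > threshold else "normal"
        decisions[message["source"]] = label

    for source in sources:
        broker.subscribe(f"{source}/feature", on_feature)


broker = Broker()
decisions = {}
for name in ("source1", "source2"):
    wire_source(broker, name)
wire_decision_maker(broker, ["source1", "source2"], decisions)

for action in ("drop", "drop", "accept"):
    broker.publish("source1/data", {"action": action})
broker.publish("source2/data", {"action": "accept"})
print(decisions)
```

Because every object only knows topic names, a storage or visualization listener could be attached to the same topics without changing the publishers.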
Advantages / Issues: Related to rapid prototyping decisions
JMS
Advantages
• Asynchronous
• Prioritized Lists
• Open Source / Off-the-shelf
• Platform Independent
Issues
• Slow – system resources, "thrashing", db, (network speeds)
• JMS implementations vary slightly
Access
Advantages
• Easy communication with Java
• Easily and quickly developed
– data storage and
– feature calculation
Issues
• Slow
• Not available on many platforms
Agenda
• Background and Overview
• Architecture
• Algorithms
• Results
Candidate CISA Algorithms: A very broad group of statistical methods…
Feature characteristics
• Relies on more than one feature
• Some of the individual features take time to compute or measure
• Meaningful nested "sub-algorithms" can be built on increasing sets of features
Data source characteristics
• The algorithm can efficiently update its current solution when feature values for only a small group of source objects change
• There is a natural method for prioritizing objects
Construction Methodologies: General
Feature Priority
• Order features (statically)
• Create series of nested models that use an increasing number of features
• Develop a function to assign priorities based on feature order and current object classification
Data Source Priority
• Order data sources (dynamically)
• Assign priorities based on uncertainty of classification or cost of misclassification
• Incremental algorithms are usually essential
Combinations of Both
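One way to read "assign priorities based on uncertainty of classification or cost of misclassification" is sketched below in Python. The scoring rule (one minus the largest class probability, scaled by a cost) is an assumption chosen for illustration, not the function used in CISA. The source names, probabilities, and costs are made up.

```python
import heapq


def priority(class_probs, misclassification_cost=1.0):
    """Higher when the current classification is less certain or costlier."""
    uncertainty = 1.0 - max(class_probs.values())
    return uncertainty * misclassification_cost


def build_queue(sources):
    """Build a max-priority queue; heapq is a min-heap, so negate priorities."""
    heap = [(-priority(probs, cost), name)
            for name, (probs, cost) in sources.items()]
    heapq.heapify(heap)
    return heap


# Hypothetical sources: current class probabilities and a misclassification cost.
sources = {
    "10.0.0.1": ({"normal": 0.95, "scan": 0.05}, 1.0),
    "10.0.0.2": ({"normal": 0.55, "scan": 0.45}, 1.0),
    "10.0.0.3": ({"normal": 0.80, "scan": 0.20}, 3.0),
}

queue = build_queue(sources)
order = [heapq.heappop(queue)[1] for _ in range(len(sources))]
print(order)
```

Because priorities depend on the current classification, the ordering is dynamic: the queue would be rebuilt or updated as decisions change.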
Construction Methodologies: Examples
Feature Priority: Decompose an algorithm into subalgorithms that use subsets of features. Prioritize feature computation.
• Example: Decision tree using X1, X2, …, Xn
• Prioritize order of Xi computation based on tree structure
• Use pruned trees to classify: {X1}, {X1, X2}, {X1, X2, X3}, …, {X1, X2, …, Xn}
Data Source Priority:
• Example: Cluster analysis (all features needed)
• Objects with incomplete feature sets get higher priority
• Objects with more uncertain classifications get higher priority
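The pruned-tree idea can be sketched as follows in Python. When a feature needed by a split has not been computed yet, the tree is treated as pruned at that node and the node's majority label is returned. The tree shape, thresholds, and labels here are hypothetical and are not taken from the slides.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Leaf:
    label: str


@dataclass
class Split:
    feature: str
    threshold: float
    below: "Node"   # subtree used when feature value < threshold
    above: "Node"   # subtree used otherwise
    majority: str   # answer if the tree is pruned at this node


Node = Union[Leaf, Split]


def classify(node: Node, features: Dict[str, float]) -> str:
    """Descend until a leaf, or until a needed feature is not yet available."""
    while isinstance(node, Split):
        if node.feature not in features:
            return node.majority  # partial answer from the pruned tree
        if features[node.feature] < node.threshold:
            node = node.below
        else:
            node = node.above
    return node.label


# Hypothetical two-level tree over features X1 and X2.
tree = Split("X1", 0.5,
             below=Leaf("G"),
             above=Split("X2", 0.2, below=Leaf("B"), above=Leaf("G"),
                         majority="B"),
             majority="G")

print(classify(tree, {}))                      # no features yet
print(classify(tree, {"X1": 0.9}))             # only X1 available
print(classify(tree, {"X1": 0.9, "X2": 0.3}))  # full feature set
```

Computing features in the order the tree consults them means each additional feature can only refine, never invalidate, the path already taken.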
Feature Priority Construction: Decision tree example
[Figure: decision tree with splits X1 < 0.00134771, X2 < 0.16844, X4 < 0.148293, X6 < 0.722813, X3 < 0.248832, and X5 < 34.5; leaves are labeled G or B.]
Agenda
• Background and Overview
• Architecture
• Algorithms
• Results
Internet Security Example: Who is trying to inappropriately access the company’s network?
There are 19 firewalls recording connections in a log file
• Date/Time
• Source and Destination IP addresses
• Protocol
• Action (Accept/Drop/Decrypt/..)
• Service
• Rule
Inbound and outbound connections and warnings over a six-day period in July 2002 were logged
• but connections from site-to-site VPNs are not
• only externally initiated connections are being analyzed
• more data (6 days in September) were provided later
External Network Connectors: Summary statistics/features
Quickly calculated features
• % Drop
• % Accept
• Hits/Sec
• # Hits
More time consuming features
• # Different Services
• Different Services/Hit
• # Different IPs
• Different IPs/Hit
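The features listed above could be computed per external source roughly as in the Python sketch below. The record fields, the reading of "Different IPs" as distinct destination addresses, and the hits-per-second definition (hits divided by the observed time span, with a one-second floor) are assumptions made for this example. The slides do not define them. Both functions expect a non-empty list of connections from a single source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Connection:
    timestamp: float  # seconds since some reference time
    dest_ip: str
    action: str       # e.g. "accept" or "drop"
    service: str


def quick_features(conns):
    """Counting features that need only simple tallies."""
    hits = len(conns)
    drops = sum(1 for c in conns if c.action == "drop")
    accepts = sum(1 for c in conns if c.action == "accept")
    span = max(c.timestamp for c in conns) - min(c.timestamp for c in conns)
    return {
        "hits": hits,
        "pct_drop": 100.0 * drops / hits,
        "pct_accept": 100.0 * accepts / hits,
        "hits_per_sec": hits / max(span, 1.0),  # assumed definition
    }


def slow_features(conns):
    """Distinct-count features, which require tracking sets of values."""
    hits = len(conns)
    services = {c.service for c in conns}
    ips = {c.dest_ip for c in conns}
    return {
        "distinct_services": len(services),
        "services_per_hit": len(services) / hits,
        "distinct_ips": len(ips),
        "ips_per_hit": len(ips) / hits,
    }


# Illustrative window of connections from one external source.
window = [
    Connection(0.0, "192.0.2.1", "drop", "ssh"),
    Connection(1.0, "192.0.2.2", "drop", "http"),
    Connection(2.0, "192.0.2.1", "drop", "ssh"),
    Connection(4.0, "192.0.2.1", "accept", "smtp"),
]
quick = quick_features(window)
slow = slow_features(window)
print(quick)
print(slow)
```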
[Figure: groups of external network connectors]
• Port Scans (N=36): High Services, Large Drop %
• Slow Port and IP Scans (N=3): High Services, High Number of IPs, High Number of Hits, Low Hits/Sec, Large Drop %
• Fast IP Address Scans (N=10): Low Services, High Number of Hits, High IP/Hit, High Number of Hits/Sec, Large Drop %, Mostly Foreign, Represent 40% of External Connections
• Suspicious (N=4636): Large Drop %, Medium IP/Hit, Low everything else
• Suspicious-Too Early to Tell (N=8055): Large Drop %, High IP/Hit, Few Hits
• Normal (N=7828): High Accept %
Dates: 7/21/02 - 7/27/02
External Network Connectors: Classifications
70%-80% of IPs stay in same group from day to day.
Class                      Sources  Connections  Percentage
Port Scans                      36      218,658      14.40%
Mostly Foreign IP Sweeps        10      602,438      39.68%
Port and IP Sweeps               3        9,165       0.60%
Normal                       7,828      205,990      13.57%
Suspicious                   4,636      455,687      30.02%
Few Connections              8,055       26,163       1.72%
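As a quick arithmetic check, the Percentage column can be reproduced from the Connections column. This short Python sketch divides each class's connection count by the total across the six classes, using only the figures in the table.

```python
# Connection counts per class, copied from the table.
connections = {
    "Port Scans": 218_658,
    "Mostly Foreign IP Sweeps": 602_438,
    "Port and IP Sweeps": 9_165,
    "Normal": 205_990,
    "Suspicious": 455_687,
    "Few Connections": 26_163,
}

total = sum(connections.values())
shares = {name: round(100.0 * count / total, 2)
          for name, count in connections.items()}

for name, share in shares.items():
    print(f"{name:26s} {share:6.2f}%")
```

A handful of sources (the 10 in the IP sweep class) account for the largest share of connections, while the 8,055 "Few Connections" sources contribute under 2%.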
External Network Connectors: Rule-based, feature priority classification algorithm
[Figure: classification levels, with axis labels Classification and Priority]
Level 0 (no features added)
Level 1 (adds Drop %): Normal, Too Early to Tell
Level 2 (adds Ratio Measures): Normal, IP Scan Only, Port Scan Only, Both IP and Port Scan, Unknown, Too Early to Tell
Level 3 (adds Distinct Services): Normal, IP Scan Only, Port Scan Only, Both IP and Port Scan, Unknown, Too Early to Tell
Level 4 (adds Distinct IP Addresses): Normal, IP Scan Only, Port Scan Only, Both IP and Port Scan, Unknown, Too Early to Tell
Additional labels in the figure: Suspicious, Too Early to Tell
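The level structure suggests that a source can be classified at the deepest level whose features have all been computed so far. The Python sketch below shows only that level-selection step. The feature keys are hypothetical names for the "features added" at each level, and the actual classification rules are not given in the slides, so none are invented here.

```python
# (level, feature added at that level), in the order shown on the slide.
LEVELS = [
    (1, "drop_pct"),
    (2, "ratio_measures"),
    (3, "distinct_services"),
    (4, "distinct_ips"),
]


def deepest_level(available):
    """Return the deepest level whose cumulative feature set is available."""
    level = 0
    for number, feature in LEVELS:
        if feature not in available:
            break  # later levels build on this one, so stop here
        level = number
    return level


print(deepest_level(set()))
print(deepest_level({"drop_pct", "ratio_measures"}))
print(deepest_level({"drop_pct", "distinct_services"}))
```

In the last call the result stays at level 1: distinct services are available, but level 3 builds on level 2, whose ratio measures are missing.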
Precision-Speed Trade-off: Expected results
[Figure: percentage (0 to 100) versus connections per second. Series: Correctly classified same level algorithm; Correctly classified different level algorithm; Consistently classified; Inconsistently classified.]
Precision-Speed Trade-off: Observed results
[Figure: percentage (0.0% to 100.0%) versus connections per second. Series: Correctly classified same level algorithm; Correctly classified different level algorithm; Consistently classified; Inconsistently classified.]
External Network Connectors: Dynamic, data source priority algorithm
Traditional cluster analysis (e.g., K-means) is time consuming on large datasets
Incremental clustering algorithm required for reasonable performance
Our approach:
• After first cluster analysis, use centroid locations to seed the next analysis
• Used the SAS procedure FASTCLUS for proof-of-concept purposes
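The seeding approach can be illustrated with a plain K-means written in Python. This is not FASTCLUS and is not the authors' implementation. It only shows how the centroids from one analysis can serve as the starting point for the next, so that the second run needs few iterations when the data have changed little. The points and initial seeds are made up.

```python
import math


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)),
                key=lambda k: math.dist(p, centroids[k]))
            for p in points]


def recompute(points, labels, centroids):
    """Move each centroid to the mean of its members (keep it if empty)."""
    updated = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if members:
            updated.append(tuple(sum(axis) / len(members)
                                 for axis in zip(*members)))
        else:
            updated.append(old)
    return updated


def kmeans(points, seeds, max_iter=100):
    """Run K-means from the given seeds; return centroids, labels, iterations."""
    centroids = [tuple(s) for s in seeds]
    labels = []
    for iteration in range(1, max_iter + 1):
        labels = assign(points, centroids)
        updated = recompute(points, labels, centroids)
        if updated == centroids:
            return centroids, labels, iteration
        centroids = updated
    return centroids, labels, max_iter


# First analysis, started from arbitrary (poor) seeds.
batch1 = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids1, labels1, cold_iters = kmeans(batch1, seeds=[(0, 0), (1, 0)])

# Next analysis: more data has arrived; seed with the previous centroids.
batch2 = batch1 + [(0.4, 0.4), (10.2, 10.2)]
centroids2, labels2, warm_iters = kmeans(batch2, seeds=centroids1)
print(cold_iters, warm_iters)
```

With a warm start the assignments are typically already correct, so the second run mostly just nudges the centroids and confirms convergence.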
Outlier: n = 1 (0.32% of connections)
• Extremely high services
• China
Dates: 8/11/02 - 8/17/02
[Figure: cluster plot with regions labeled Cluster 0 through Cluster 5]
• Cluster 0: n = 5207 (10.11% of connections); High Accept %, Mix Max Hits, Mix IP/Hit
• Cluster 1: n = 2561 (17.16% of connections); High Drop %, Medium IP/Hit
• Cluster 2: n = 7 (50.35% of connections); High Drop %, High Num Hits, High Num IPs, High Max Hits/Sec
• Cluster 3: n = 180 (17.81% of connections); High Services and/or Max Hits/Sec, Mixed
• Cluster 4: n = 4 (1.42% of connections); High Drop %, High Services, 94.5% of connections from Korea, 1 of 4 IPs from Korea, Average 23 sec between hits
• Cluster 5: n = 5104 (2.82% of connections); High IP/Hit, High Drop %
Dates: 8/11/02 - 8/17/02
External Network Connector Classifications: Dashboard report
[Figure: dashboard report. Measures shown: Drop %, Service/Hit, IPs/Hit, Max Hit/Sec, IPs Scanned, Services Scanned, % of Sources, % Connections.]
External Network Connector Classifications: Outlier report
Src: 211.96.31.129
Country: CHINA
Org ID: SCH-CHENGDU-HUITEC
[Figure: outlier report plots. Recoverable labels: a 0.0 to 1.0 scale over intervals 1 to 6 (40 Minutes); a plot spanning -1.5 to 1.5 on both axes with a cluster legend (0 to 7); measures Services Scanned, Drop %, Service/Hit, IPs/Hit, Max Hit/Sec, IPs Scanned.]