A Data-Centric Approach to Insider Attack Detection in Database Systems
Sunu Mathew
Joint work with: Michalis Petropoulos, Hung Q. Ngo, Shambhu Upadhyaya
Computer Science and Engineering, Univ. at Buffalo
13th International Symposium on Recent Advances in Intrusion Detection (RAID 2010), Ottawa, Canada (September 2010)
Insider Threats against Database Systems
Insider threats: a hard problem with vast scope
Insiders operate within existing rules and policies
High potential for damage
Examples: impersonation, information leak, data harvesting
Most critical threat: insider access to sensitive information
Hence, the focus is on an effective solution to insider threats against database systems
2
Contributions
Modeling user queries from a data-centric viewpoint (as opposed to prior syntax-centric approaches)
Extracting feature vectors based on the results of query execution
Sampling strategies to address performance issues
Taxonomy for analyzing query anomalies from a data-centric viewpoint
3
Outline
Limitations of Syntax Centric Approaches
Data-Centric User Profiling
Taxonomy of Query Anomalies (Data-Centric)
Experimental Evaluation
Conclusions and Future work
4
Syntax Centric Modeling
Models queries using a characteristic or frequency vector based on parsed syntax elements (e.g., [Kamra et al., VLDB J. 2008])

SELECT p.product_name, p.product_id
FROM PRODUCT p
WHERE p.cost = 100 AND p.weight > 80;

SQL command: SELECT
Select clause relations: PRODUCT
Select clause attributes: product_name, product_id
Where clause relations: PRODUCT
Where clause attributes: cost, weight

SQL command: SELECT
Select clause relations: PRODUCT
Select clause attributes: product_name, product_id
Where clause attributes: cost
5
Limitations of Syntax Centric Approach
Two similar-looking queries can produce completely different results:

SELECT p.product_name, p.product_id FROM PRODUCT p WHERE p.cost > 100 AND p.weight = 80;

False negative: similar syntax, different results.

Two different-looking queries can produce similar results:

SELECT p.product_name, p.product_id FROM PRODUCT p WHERE p.cost = 100 AND p.weight > 80 AND p.product_name IS NOT NULL;

False positive: different syntax, same results.
Syntax analysis is hard! ([Calvanese et al., PODS 98])
6
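The mismatch above is easy to reproduce. The sketch below is illustrative (not from the talk): it builds a tiny, made-up PRODUCT table in an in-memory SQLite database and shows that two queries with identical syntax features (same command, relations, and attributes) select disjoint rows.

```python
# Illustrative only: table contents and values are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product (product_name TEXT, product_id INTEGER, cost REAL, weight REAL)"
)
conn.executemany(
    "INSERT INTO product VALUES (?, ?, ?, ?)",
    [("widget", 1, 100, 95), ("gadget", 2, 150, 80), ("gizmo", 3, 100, 70)],
)

# Same syntax-centric feature vector (command, relations, attributes) ...
q1 = "SELECT product_name, product_id FROM product WHERE cost = 100 AND weight > 80"
q2 = "SELECT product_name, product_id FROM product WHERE cost > 100 AND weight = 80"

r1 = conn.execute(q1).fetchall()
r2 = conn.execute(q2).fetchall()
# ... but disjoint result sets on this data:
print(r1)  # [('widget', 1)]
print(r2)  # [('gadget', 2)]
```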
Outline
Limitations of Syntax Centric Approaches
Data-Centric User Profiling
Taxonomy of Query Anomalies (Data-Centric)
Experimental Evaluation
Conclusions and Future work
7
Alternate viewpoint – Data Centric
When are two queries similar/different?
Model based on what data is accessed (i.e., the results of query execution)
Visualize the database as a single relation (Universal Relation, [Ullman et al., TODS 84])
Compute a statistical summary (S-Vector) of a query's result tuples
Fixed-size feature vector (depends on the DB schema, independent of DB size and number of result tuples)
S-Vectors for past queries are used to profile the user
8
Statistics Vector For Sample Database
Database Schema:
  Product
    Product.type (varchar)
    Product.cost (numeric)

S-Vector Features:
  Product.type.ncount
  Product.type.ndistinct
  Product.cost.Min
  Product.cost.Max
  Product.cost.Mean
  Product.cost.StdDev
  Product.cost.Median
9
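As a sketch of how such a summary might be computed: the helper below is hypothetical, but the feature names follow this slide's schema (Product.type varchar, Product.cost numeric), with string attributes summarized by counts and numeric attributes by order/moment statistics.

```python
# Hypothetical helper; feature names follow the slide's S-Vector.
import statistics

def s_vector(rows):
    """rows: list of (type, cost) result tuples -> dict of S-Vector features."""
    types = [r[0] for r in rows]
    costs = [r[1] for r in rows]
    return {
        "Product.type.ncount": len(types),
        "Product.type.ndistinct": len(set(types)),
        "Product.cost.Min": min(costs),
        "Product.cost.Max": max(costs),
        "Product.cost.Mean": statistics.mean(costs),
        "Product.cost.StdDev": statistics.pstdev(costs),
        "Product.cost.Median": statistics.median(costs),
    }

v = s_vector([("a", 100), ("b", 200), ("a", 300)])
print(v["Product.type.ndistinct"], v["Product.cost.Median"])  # 2 200
```

Note the vector's length is fixed by the schema (one slot per feature per attribute), regardless of how many result tuples the query returns.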
Outline
Limitations of Syntax Centric Approaches
Data-Centric User Profiling
Data-Centric Analysis of Query Anomalies
Experimental Evaluation
Conclusions and Future work
10
Analyzing query anomalies -- 1
Based on the schema and statistics of the execution result

Type 1: Different Result Schema / Different Result Tuples (different rows and columns)

Query 1: SELECT p.cost FROM PRODUCT p WHERE p.type = 'abc';
Query 2: SELECT p.type FROM PRODUCT p WHERE p.cost < 1000;

Both syntax-centric and data-centric approaches should be effective
Typically studied in the context of Role-Based Access Control (RBAC) and masquerade detection
11
Analyzing query anomalies -- 2
Type 2: Similar Result Schema / Different Result Tuples (similar columns, different rows)

Base query: SELECT * FROM PRODUCT p WHERE p.cost = 1000;

Type 2(a): Distinct syntax
SELECT * FROM PRODUCT p WHERE p.cost < 1000 AND p.type = 'abc';

Type 2(b): Similar syntax
SELECT * FROM PRODUCT p WHERE p.cost < 1000 AND p.cost > 1000;

Syntax-centric schemes are "blind" to 2(b)
12
Analyzing query anomalies --3
Type 3: Similar Result Schema / Similar Result Tuples (similar rows and columns)

Base query: SELECT p.type FROM PRODUCT p WHERE p.cost < 1000;

Type 3(a): Different syntax / similar semantics
SELECT p.type FROM PRODUCT p WHERE p.cost < 1000 AND p.type IN (SELECT q.type FROM PRODUCT q);
Syntax-centric schemes generate false positives here

Type 3(b): Different syntax / different semantics
SELECT p.type FROM PRODUCT p WHERE true;
The attacker gains additional knowledge (detected by syntax-centric, undetected by data-centric)
3(b) is rare (the attacker has to control the result statistics)
13
Data-Centric Classification of Query Anomalies
14
Anomaly Case                                   | Detected by Syntax-Centric? | Detected by Data-Centric?
Type 1 (Different Schema/Different Results)    | Yes                         | Yes
Type 2 (Similar Schema/Different Results)
  a) Distinct syntax                           | Yes                         | Yes
  b) Similar syntax                            | No                          | Yes
Type 3 (Similar Schema/Similar Results)
  a) Different syntax / similar semantics      | False positive              | Yes (allowed)
  b) Different syntax / different semantics    | Yes                         | No (rare)
Outline
Limitations of Syntax Centric Approaches
Data-Centric User Profiling
Data-Centric Analysis of Query Anomalies
Experimental Evaluation
Conclusions and Future work
15
Experimental Evaluation
Goals:
Type 1 and 2(a) anomalies (detected by both syntax-centric and data-centric): show that data-centric performs at least as well as syntax-centric
Type 2(b) anomalies (undetected by syntax-centric, detected by data-centric): show good performance for the data-centric approach
Type 3 is outside the scope (addressed in other research)
Compare with syntax-centric ([Kamra et al. 2008])
Test Setup
17
Query Result Sampling -- Strategies
A potential drawback of the data-centric approach: it requires query execution
The extra database server load may not be significant, but some examination of query results is needed for online detection
Compromise: use only k tuples of the execution result to approximate the S-Vector
Two strategies: Top (Initial)-k and Random-k
Suitable for the pipelined query execution model of commercial RDBMSs
18
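The two strategies can be sketched as follows. This is a hypothetical implementation (function names are invented), assuming result tuples arrive as an iterable stream; Random-k is shown as one-pass reservoir sampling, which fits a pipelined executor because neither strategy needs the full result materialized.

```python
# Hypothetical sampling helpers; assumes tuples arrive as a stream.
import random

def top_k(result_stream, k):
    """Initial-k: take the first k tuples off the pipeline and stop."""
    sample = []
    for row in result_stream:
        sample.append(row)
        if len(sample) == k:
            break
    return sample

def random_k(result_stream, k, seed=0):
    """Random-k via reservoir sampling: uniform k-subset in one pass."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(result_stream):
        if i < k:
            sample.append(row)
        else:
            j = rng.randint(0, i)
            if j < k:
                sample[j] = row  # replace with decreasing probability
    return sample

print(top_k(iter(range(1000)), 5))       # [0, 1, 2, 3, 4]
print(len(random_k(range(1000), 5)))     # 5
```

The approximate S-Vector is then computed over the sample instead of the full result set.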
Evaluating Type 1, 2(a) (Different Query Result Schema/Different Syntax)
Role based anomaly (masquerade) detection
Algorithms: Naïve Bayes, Decision Tree, SVM and Euclidean clustering
Syntax Centric formats – C (crude) quiplet , M (medium) quiplet, F (fine) quiplet ( [Kamra et al. VLDB J. 2008] )
Data-Centric formats (S-Vector, S-V Initial and S-V Random with 20, 10 and 5 result tuples)
19
Results – Type 1, 2(a) (Different Query Result Schema/Different Syntax)
20
Chair vs Faculty (detection accuracy, %):

Algorithm   C.quip  M.quip  F.quip  S-V(all)  S-V I-20  S-V R-20  S-V I-10  S-V R-10  S-V I-5  S-V R-5
N-Bayes     81.67   85.33   75      85        85        82.67     78.33     77        81.67    90
Dec-Tree    88      87.67   87.67   96.33     88.3      88.3      89        88.67     88.67    88.67
SVM         83.3    81      87.67   82.33     74.67     77        71.33     75.67     68       74.33
Clustering  73.3    72      65.6    92        92.6      92.33     94        94        92.6     93.33

Similar performance as expected, with high detection rates (close to 100%)
Clustering performs better with the S-Vector
Good results with S-V Initial and S-V Random
Evaluating Type 2(b) anomalies (Different Query Result Stats/ Similar Syntax)
Volume anomalies, sensitive tuples
Supervised two-class training is not feasible
Manually created 'attack set' based on the dataset

Base query: SELECT * FROM vApplicants WHERE reviewStatusID = 'a' AND reviewStatusID = 'b';
Variation: SELECT * FROM vApplicants WHERE reviewStatusID = 'b' OR reviewStatusID = 'b';

One-class anomaly detection: cluster-based outlier detection and Attrib-Deviation
21
Evaluating Type 2(b) (Different Query Result Stats/ Similar Syntax) -- Schemes
Cluster-based outlier detection:
Queries can be considered as points in some space; construct role (e.g., Faculty) clusters
Approximate a cluster by a single point (e.g., its centroid) and use Euclidean distance with a threshold for anomaly detection

Attrib-Deviation:
Data anomalies appear as anomalies in S-Vector dimensions
Perform anomaly detection on a per-dimension basis (e.g., a 3×StdDev threshold); the number of anomalous dimensions is used to rank test queries
22
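A minimal sketch of both detectors described above, assuming S-Vectors are represented as equal-length lists of floats. The function names, toy data, and thresholds are illustrative, not the paper's implementation.

```python
# Illustrative one-class detectors over S-Vectors (lists of floats).
import math
import statistics

def centroid(train):
    """Per-dimension mean of the training S-Vectors."""
    return [statistics.mean(col) for col in zip(*train)]

def cluster_outlier(train, query, threshold):
    """Flag the query if its Euclidean distance to the role centroid
    exceeds the threshold."""
    return math.dist(centroid(train), query) > threshold

def attrib_deviation(train, query, n_sigma=3.0):
    """Count dimensions where the query deviates more than n_sigma
    standard deviations from the training mean; the count ranks
    test queries by how anomalous they are."""
    anomalous = 0
    for col, q in zip(zip(*train), query):
        mu = statistics.mean(col)
        sd = statistics.pstdev(col)
        if abs(q - mu) > n_sigma * sd:
            anomalous += 1
    return anomalous

train = [[10.0, 1.0], [12.0, 1.1], [11.0, 0.9]]
print(attrib_deviation(train, [11.0, 50.0]))  # 1
```

Because Attrib-Deviation looks at each dimension independently, a single anomalous attribute is enough to flag a query even when the overall distance to the centroid is small.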
Results – Type 2(b) anomalies (Different Query Result Stats/ Similar Syntax)
23
Syntax-centric vs data-centric (detection and false-positive rates, %):

Algorithm           Cqlet  Mqlet  Fqlet  S-V(all)  S-V I-20  S-V R-20  S-V I-10  S-V R-10  S-V I-5  S-V R-5
Cluster-Outlier
  Detection         23.5   26.4   17.6   83.8      12        67.7      6.4       45.1      6.4      35.4
  False Positive    14.4   11.8   15.8   10.5      3.9       6.5       3.9       5.2       2.6      6.6
Attrib-Deviation
  Detection         0      17.6   2.9    87        87        87        87        87        12.9     64.5
  False Positive    0      4.8    4.8    22.6      26        15        23.8      15.8      23.8     20.4

Cluster-based: data-centric better; Initial-k poor
Attrib-Deviation: data-centric better; Initial-k effective!
Initial-k Sampling -- Performance
But why does Initial-k perform well with Attrib-Deviation?
A single anomalous attribute yields many anomalous dimensions
On a per-dimension basis, the selective ordering of results affects anomalous attributes differently
Different statistical measurements may be affected differently (e.g., Max, Median)
Performance drops only when the result sample size is too low (e.g., k = 5)
This has positive practical implications for Initial-k sampling
24
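The order-sensitivity point can be illustrated with made-up data: when only the first k tuples are kept, a statistic like Max is missed entirely if the extreme value arrives late in the result stream, while a statistic like Median merely shifts, so different S-Vector dimensions degrade differently under Initial-k.

```python
# Illustrative data only: one extreme value placed at the end of the result.
import statistics

full = list(range(1, 101)) + [10_000]  # anomalous value arrives last
k = 20
prefix = full[:k]                      # Initial-k sample

print(max(full), max(prefix))          # 10000 20  -> Max missed by the prefix
print(statistics.median(full), statistics.median(prefix))  # 51 10.5
```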
Outline
Limitations of Syntax Centric Approaches
Data-Centric User Profiling
Data-Centric Analysis of Query Anomalies
Experimental Evaluation
Conclusions and Future Work
25
Conclusions and Future Work
Handling queries other than Select-Project-Join:
Union queries, e.g.
SELECT g.name, g.gpa FROM GRADS g UNION SELECT u.name, u.gpa FROM UGRADS u;
Computed attributes (Avg, etc.) and GROUP-BY attributes
ORDER-BY may lead to false positives
Dynamic databases and correlated queries
Feasibility of a randomized result set
Combining syntax-centric and data-centric schemes
26
Questions?
Thank You!
27