A Data-Centric Approach to Insider Attack Detection in Database Systems
Sunu Mathew
Joint work with: Michalis Petropoulos, Hung Q. Ngo, Shambhu Upadhyaya
Computer Science and Engineering, Univ. at Buffalo
13th International Symposium on Recent Advances in Intrusion Detection (RAID 2010), Ottawa, Canada (September 2010)
Insider Threats against Database Systems
Insider threats: a hard problem with vast scope
Insiders operate within existing rules and policies
High potential for damage
Examples: impersonation, information leak, data harvesting
Most critical threat: insider access to sensitive information
Hence, the focus is on an effective solution to insider threats against database systems
2
Contributions
Modeling user queries from a data-centric viewpoint (as opposed to prior syntax-centric approaches)
Extracting feature vectors based on the results of query execution
Sampling strategies to address performance issues
Taxonomy for analyzing query anomalies from a data-centric viewpoint
3
Outline
Limitations of Syntax Centric Approaches
Data-Centric User Profiling
Taxonomy of Query Anomalies (Data-Centric)
Experimental Evaluation
Conclusions and Future work
4
Syntax Centric Modeling
Models queries using a characteristic or frequency vector based on parsed syntax elements (e.g., [Kamra et al., VLDB J. 2008])

SELECT p.product_name, p.product_id
FROM PRODUCT p
WHERE p.cost = 100 AND p.weight > 80;

SQL command: SELECT
Select clause relations: PRODUCT
Select clause attributes: product_name, product_id
Where clause relations: PRODUCT
Where clause attributes: cost, weight

SQL command: SELECT
Select clause relations: PRODUCT
Select clause attributes: product_name, product_id
Where clause attributes: cost
5
Limitations of Syntax Centric Approach
Two similar-looking queries can produce completely different results:

SELECT p.product_name, p.product_id FROM PRODUCT p WHERE p.cost > 100 AND p.weight = 80;

False negative: similar syntax, different results.

Two different-looking queries can produce similar results:

SELECT p.product_name, p.product_id FROM PRODUCT p WHERE p.cost = 100 AND p.weight > 80 AND p.product_name IS NOT NULL;

False positive: different syntax, same results.
Syntax analysis is hard! ([Calvanese et al., PODS 98])
6
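The mismatch above is easy to reproduce. The sketch below is illustrative (not from the talk): it builds a tiny, made-up PRODUCT table in an in-memory SQLite database and shows that two queries with identical syntax features (same command, relations, and attributes) select disjoint rows.

```python
# Illustrative only: table contents and values are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product (product_name TEXT, product_id INTEGER, cost REAL, weight REAL)"
)
conn.executemany(
    "INSERT INTO product VALUES (?, ?, ?, ?)",
    [("widget", 1, 100, 95), ("gadget", 2, 150, 80), ("gizmo", 3, 100, 70)],
)

# Same syntax-centric feature vector (command, relations, attributes) ...
q1 = "SELECT product_name, product_id FROM product WHERE cost = 100 AND weight > 80"
q2 = "SELECT product_name, product_id FROM product WHERE cost > 100 AND weight = 80"

r1 = conn.execute(q1).fetchall()
r2 = conn.execute(q2).fetchall()
# ... but disjoint result sets on this data:
print(r1)  # [('widget', 1)]
print(r2)  # [('gadget', 2)]
```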
Outline
Limitations of Syntax Centric Approaches
Data-Centric User Profiling
Taxonomy of Query Anomalies (Data-Centric)
Experimental Evaluation
Conclusions and Future work
7
Alternate viewpoint – Data Centric
When are two queries similar/different?
Model based on what data is accessed (i.e., the results of query execution)
Visualize the database as a single relation (Universal Relation, [Ullman et al., TODS 84])
Compute a statistical summary (S-Vector) of a query's result tuples
Fixed-size feature vector (depends on the DB schema, independent of DB size and number of result tuples)
S-Vectors for past queries are used to profile the user
8
Statistics Vector For Sample Database
Database Schema:
  Product
    Product.type (varchar)
    Product.cost (numeric)

S-Vector Features:
  Product.type.ncount
  Product.type.ndistinct
  Product.cost.Min
  Product.cost.Max
  Product.cost.Mean
  Product.cost.StdDev
  Product.cost.Median
9
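As a sketch of how such a summary might be computed: the helper below is hypothetical, but the feature names follow this slide's schema (Product.type varchar, Product.cost numeric), with string attributes summarized by counts and numeric attributes by order/moment statistics.

```python
# Hypothetical helper; feature names follow the slide's S-Vector.
import statistics

def s_vector(rows):
    """rows: list of (type, cost) result tuples -> dict of S-Vector features."""
    types = [r[0] for r in rows]
    costs = [r[1] for r in rows]
    return {
        "Product.type.ncount": len(types),
        "Product.type.ndistinct": len(set(types)),
        "Product.cost.Min": min(costs),
        "Product.cost.Max": max(costs),
        "Product.cost.Mean": statistics.mean(costs),
        "Product.cost.StdDev": statistics.pstdev(costs),
        "Product.cost.Median": statistics.median(costs),
    }

v = s_vector([("a", 100), ("b", 200), ("a", 300)])
print(v["Product.type.ndistinct"], v["Product.cost.Median"])  # 2 200
```

Note the vector's length is fixed by the schema (one slot per feature per attribute), regardless of how many result tuples the query returns.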
Outline
Limitations of Syntax Centric Approaches
Data-Centric User Profiling
Data-Centric Analysis of Query Anomalies
Experimental Evaluation
Conclusions and Future work
10
Analyzing query anomalies -- 1
Based on the schema and statistics of the execution result

Type 1: Different Result Schema / Different Result Tuples (different rows and columns)

Query 1: SELECT p.cost FROM PRODUCT p WHERE p.type = 'abc';
Query 2: SELECT p.type FROM PRODUCT p WHERE p.cost < 1000;

Both syntax-centric and data-centric approaches should be effective
Typically studied in the context of Role-Based Access Control (RBAC) and masquerade detection
11
Analyzing query anomalies -- 2
Type 2: Similar Result Schema / Different Result Tuples (similar columns, different rows)

Base query: SELECT * FROM PRODUCT p WHERE p.cost = 1000;

Type 2(a): Distinct syntax
SELECT * FROM PRODUCT p WHERE p.cost < 1000 AND p.type = 'abc';

Type 2(b): Similar syntax
SELECT * FROM PRODUCT p WHERE p.cost < 1000 AND p.cost > 1000;

Syntax-centric schemes are "blind" to 2(b)
12
Analyzing query anomalies --3
Type 3: Similar Result Schema / Similar Result Tuples (similar rows and columns)

Base query: SELECT p.type FROM PRODUCT p WHERE p.cost < 1000;

Type 3(a): Different syntax / similar semantics
SELECT p.type FROM PRODUCT p WHERE p.cost < 1000 AND p.type IN (SELECT q.type FROM PRODUCT q);
Syntax-centric schemes generate false positives here

Type 3(b): Different syntax / different semantics
SELECT p.type FROM PRODUCT p WHERE true;
The attacker gains additional knowledge (detected by syntax-centric, undetected by data-centric)
3(b) is rare (the attacker has to control the result statistics)
13
Data-Centric Classification of Query Anomalies
14
Anomaly Case                                   | Detected by Syntax-Centric? | Detected by Data-Centric?
Type 1 (Different Schema/Different Results)    | Yes                         | Yes
Type 2 (Similar Schema/Different Results)
  a) Distinct syntax                           | Yes                         | Yes
  b) Similar syntax                            | No                          | Yes
Type 3 (Similar Schema/Similar Results)
  a) Different syntax / similar semantics      | False positive              | Yes (allowed)
  b) Different syntax / different semantics    | Yes                         | No (rare)
Outline
Limitations of Syntax Centric Approaches
Data-Centric User Profiling
Data-Centric Analysis of Query Anomalies
Experimental Evaluation
Conclusions and Future work
15
Experimental Evaluation
Goals:
Type 1 and 2(a) anomalies (detected by both syntax-centric and data-centric): show that data-centric performs at least as well as syntax-centric
Type 2(b) anomalies (undetected by syntax-centric, detected by data-centric): show good performance for the data-centric approach
Type 3 is outside the scope (addressed in other research)
Compare with syntax-centric ([Kamra et al. 2008])
Test Setup
17
Query Result Sampling -- Strategies
A potential drawback of the data-centric approach: it requires query execution
The extra database server load may not be significant, but some examination of query results is needed for online detection
Compromise: use only k tuples of the execution result to approximate the S-Vector
Two strategies: Top (Initial)-k and Random-k
Suitable for the pipelined query execution model of commercial RDBMSs
18
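The two strategies can be sketched as follows. This is a hypothetical implementation (function names are invented), assuming result tuples arrive as an iterable stream; Random-k is shown as one-pass reservoir sampling, which fits a pipelined executor because neither strategy needs the full result materialized.

```python
# Hypothetical sampling helpers; assumes tuples arrive as a stream.
import random

def top_k(result_stream, k):
    """Initial-k: take the first k tuples off the pipeline and stop."""
    sample = []
    for row in result_stream:
        sample.append(row)
        if len(sample) == k:
            break
    return sample

def random_k(result_stream, k, seed=0):
    """Random-k via reservoir sampling: uniform k-subset in one pass."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(result_stream):
        if i < k:
            sample.append(row)
        else:
            j = rng.randint(0, i)
            if j < k:
                sample[j] = row  # replace with decreasing probability
    return sample

print(top_k(iter(range(1000)), 5))       # [0, 1, 2, 3, 4]
print(len(random_k(range(1000), 5)))     # 5
```

The approximate S-Vector is then computed over the sample instead of the full result set.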
Evaluating Type 1, 2(a) (Different Query Result Schema/Different Syntax)
Role based anomaly (masquerade) detection
Algorithms: Naïve Bayes, Decision Tree, SVM and Euclidean clustering
Syntax Centric formats – C (crude) quiplet , M (medium) quiplet, F (fine) quiplet ( [Kamra et al. VLDB J. 2008] )
Data-Centric formats (S-Vector, S-V Initial and S-V Random with 20, 10 and 5 result tuples)
19
Results – Type 1, 2(a) (Different Query Result Schema/Different Syntax)
20
Chair vs Faculty (detection accuracy, %):

Algorithm   C.quip  M.quip  F.quip  S-V(all)  S-V I-20  S-V R-20  S-V I-10  S-V R-10  S-V I-5  S-V R-5
N-Bayes     81.67   85.33   75      85        85        82.67     78.33     77        81.67    90
Dec-Tree    88      87.67   87.67   96.33     88.3      88.3      89        88.67     88.67    88.67
SVM         83.3    81      87.67   82.33     74.67     77        71.33     75.67     68       74.33
Clustering  73.3    72      65.6    92        92.6      92.33     94        94        92.6     93.33

Similar performance as expected, with high detection rates (close to 100%)
Clustering performs better with the S-Vector
Good results with S-V Initial and S-V Random
Evaluating Type 2(b) anomalies (Different Query Result Stats/ Similar Syntax)
Volume anomalies, sensitive tuples
Supervised two-class training is not feasible
Manually created 'attack set' based on the dataset

Base query: SELECT * FROM vApplicants WHERE reviewStatusID = 'a' AND reviewStatusID = 'b';
Variation: SELECT * FROM vApplicants WHERE reviewStatusID = 'b' OR reviewStatusID = 'b';

One-class anomaly detection: cluster-based outlier detection and Attrib-Deviation
21
Evaluating Type 2(b) (Different Query Result Stats/ Similar Syntax) -- Schemes
Cluster-based outlier detection:
Queries can be considered as points in some space; construct role (e.g., Faculty) clusters
Approximate a cluster by a single point (e.g., its centroid) and use Euclidean distance with a threshold for anomaly detection

Attrib-Deviation:
Data anomalies appear as anomalies in S-Vector dimensions
Perform anomaly detection on a per-dimension basis (e.g., a 3×StdDev threshold); the number of anomalous dimensions is used to rank test queries
22
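A minimal sketch of both detectors described above, assuming S-Vectors are represented as equal-length lists of floats. The function names, toy data, and thresholds are illustrative, not the paper's implementation.

```python
# Illustrative one-class detectors over S-Vectors (lists of floats).
import math
import statistics

def centroid(train):
    """Per-dimension mean of the training S-Vectors."""
    return [statistics.mean(col) for col in zip(*train)]

def cluster_outlier(train, query, threshold):
    """Flag the query if its Euclidean distance to the role centroid
    exceeds the threshold."""
    return math.dist(centroid(train), query) > threshold

def attrib_deviation(train, query, n_sigma=3.0):
    """Count dimensions where the query deviates more than n_sigma
    standard deviations from the training mean; the count ranks
    test queries by how anomalous they are."""
    anomalous = 0
    for col, q in zip(zip(*train), query):
        mu = statistics.mean(col)
        sd = statistics.pstdev(col)
        if abs(q - mu) > n_sigma * sd:
            anomalous += 1
    return anomalous

train = [[10.0, 1.0], [12.0, 1.1], [11.0, 0.9]]
print(attrib_deviation(train, [11.0, 50.0]))  # 1
```

Because Attrib-Deviation looks at each dimension independently, a single anomalous attribute is enough to flag a query even when the overall distance to the centroid is small.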
Results – Type 2(b) anomalies (Different Query Result Stats/ Similar Syntax)
23
Syntax-centric vs data-centric (detection and false-positive rates, %):

Algorithm           Cqlet  Mqlet  Fqlet  S-V(all)  S-V I-20  S-V R-20  S-V I-10  S-V R-10  S-V I-5  S-V R-5
Cluster-Outlier
  Detection         23.5   26.4   17.6   83.8      12        67.7      6.4       45.1      6.4      35.4
  False Positive    14.4   11.8   15.8   10.5      3.9       6.5       3.9       5.2       2.6      6.6
Attrib-Deviation
  Detection         0      17.6   2.9    87        87        87        87        87        12.9     64.5
  False Positive    0      4.8    4.8    22.6      26        15        23.8      15.8      23.8     20.4

Cluster-based: data-centric better; Initial-k poor
Attrib-Deviation: data-centric better; Initial-k effective!
Initial-k Sampling -- Performance
But why does Initial-k perform well with Attrib-Deviation?
A single anomalous attribute yields many anomalous dimensions
On a per-dimension basis, the selective ordering of results affects anomalous attributes differently
Different statistical measurements may be affected differently (e.g., Max, Median)
Performance drops only when the result sample size is too low (e.g., k = 5)
This has positive practical implications for Initial-k sampling
24
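The order-sensitivity point can be illustrated with made-up data: when only the first k tuples are kept, a statistic like Max is missed entirely if the extreme value arrives late in the result stream, while a statistic like Median merely shifts, so different S-Vector dimensions degrade differently under Initial-k.

```python
# Illustrative data only: one extreme value placed at the end of the result.
import statistics

full = list(range(1, 101)) + [10_000]  # anomalous value arrives last
k = 20
prefix = full[:k]                      # Initial-k sample

print(max(full), max(prefix))          # 10000 20  -> Max missed by the prefix
print(statistics.median(full), statistics.median(prefix))  # 51 10.5
```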
Outline
Limitations of Syntax Centric Approaches
Data-Centric User Profiling
Data-Centric Analysis of Query Anomalies
Experimental Evaluation
Conclusions and Future Work
25
Conclusions and Future Work
Handling queries other than Select-Project-Join:
Union queries, e.g.
SELECT g.name, g.gpa FROM GRADS g UNION SELECT u.name, u.gpa FROM UGRADS u;
Computed attributes (Avg, etc.) and GROUP-BY attributes
ORDER-BY may lead to false positives
Dynamic databases and correlated queries
Feasibility of a randomized result set
Combining syntax-centric and data-centric schemes
26
Questions?
Thank You!
27