Integrating Cybersecurity Log Data Analysis in Hadoop
Bryan Stearns, Susan Urban, Sindhuri Juturu
Texas Tech University 2014 NSF Research Experience for Undergraduates Site Program
Abstract:
In cybersecurity, the growth of Advanced Persistent Threats (APTs) and other sophisticated attacks makes it increasingly important to analyze network and system activity from all event sources. When logs are recorded through different software packages, the resulting big data can take a form known as dirty data that is difficult to merge and cross-analyze. Additionally, storing logs in a big data architecture like Apache Hadoop can make data joins time-consuming. Merging data as it is stored rather than at access time can greatly simplify unified analysis, yet doing so requires knowing in advance what kinds of merges will be required. Services wishing to holistically analyze dirty data must therefore merge information externally each time data is pulled. This research is developing a file system called HVID (Hadoop Value-oriented Integration of Data) that represents dirty log data as a single table while maintaining its raw form, using a novel variant of column-oriented storage. The system utilizes the open-source big data system Hadoop HBase to enable fast access to unified views without the need for predetermined joins. This design will allow more natural and efficient holistic analysis of stored cybersecurity data, both with external mining applications and with local MapReduce tools.
Introduction:
o The amount of unstructured or semi-structured data recorded in cybersecurity endeavors is growing every day [1].
o Increasingly sophisticated cybersecurity threats require large-scale holistic pattern analysis for proper detection [1][2].
o Heterogeneous “dirty” data generated from multiple network sources needs customized unification to be useful for such purposes [3].
o MapReduce is desirable to analyze unstructured data, but existing unification methods structure data externally from unstructured storage platforms, requiring additional I/O to feed merged data back to storage [3].
o A better method is needed to unify and manipulate disparate information from across services in a network!
Methods:
Equipment
o 64-bit single-node virtual Linux machine
o IBM BigInsights V3.0.0.0
o Station with 8-thread 1.87GHz processor and 8GB RAM
Process
o Design – The design proceeded with two fronts: abstract and physical. Abstract design focused on the data to be merged, while physical focused on the selection, implementation, and optimization of features available in Hadoop.
o Implement – The designed data architecture was created along with basic access features. Java was used for system creation and interfacing.
o Test – Basic speed tests were performed on working features. Generic system time properties at the start and completion of operations were used for tests.
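The "generic system time properties" used for the speed tests presumably amount to wall-clock deltas taken at the start and completion of an operation. A minimal Java sketch of that measurement style (a hypothetical illustration, not the project's actual test harness):

```java
public class TimedOp {
    // Wall-clock timing of one operation: system time at start vs. at completion.
    static long timeMillis(Runnable op) {
        long start = System.currentTimeMillis(); // time at start
        op.run();
        return System.currentTimeMillis() - start; // elapsed at completion
    }

    public static void main(String[] args) {
        long ms = timeMillis(() -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) sum += i; // stand-in for a load or select
        });
        System.out.println("operation took " + ms + " ms");
    }
}
```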
Objectives:
Design and prototype a file system that:
o Runs in Hadoop
o Supports structured, semi-structured, and unstructured data
o Provides quick access to merged tables
o Does not restrict what columns are used for merging
o Supports MapReduce operations upon merged data
Implications
o Faster value-based retrieval of data
o Eliminate need to individually merge tables containing shared features
o Reduce I/O needed for holistic unstructured MapReduce analysis
Conclusion:
The HVID design:
o Unifies data into a common format via a unique value-based structure
o Allows heterogeneous datasets to be merged by any field
o Supports external or internal data mining and manipulation
o Supports internal MapReduce on merged data
o Should allow value-based merge and join queries to run in time comparable to plain select queries
o Requires less space than comparable row-oriented solutions*
o Can utilize backup copies for improved data interconnectivity
o Requires further research and development
o Provides a means to unite dirty cybersecurity data in storage without the need to explicitly outline how information should be merged until it is needed.
References:
[1] A. A. Cárdenas, P. K. Manadhata, and S. P. Rajan, "Big Data Analytics for Security," IEEE Security & Privacy, vol. 11, pp. 74-76, 2013.
[2] A. K. Sood and R. J. Enbody, "Targeted Cyberattacks: A Superset of Advanced Persistent Threats," IEEE Security & Privacy, vol. 11, p. 7, 2013.
[3] T.-F. Yen, A. Oprea, K. Onarlioglu, T. Leetham, W. Robertson, A. Juels, et al., "Beehive: large-scale log analysis for detecting suspicious activity in enterprise networks," presented at the Proceedings of the 29th Annual Computer Security Applications Conference, New Orleans, Louisiana, 2013.
Design:
o An inverted variant of column-oriented storage was developed in which values are the primary key and row IDs are the dependent values.
o This physical association of rows with shared values allows dynamic views on relations without lengthy scans and comparisons.
o HBase was chosen as the HVID platform for its flexibility, support for read-intensive applications, and ability to integrate with Hadoop MapReduce.
o Value-oriented data reside in row-key byte arrays for quick scanning.
o Rows are sorted for fast collection of values from contiguous ranges.
o Large unstructured data is stored in a separate row-oriented table.
o This row-oriented form can be used to store backups of value-oriented data while enhancing row-ID resolution of value-based queries.
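The design above can be sketched in miniature. The snippet below is a hypothetical Java illustration (not the HVID implementation): a TreeMap stands in for HBase's lexicographically sorted row keys, and composite keys of the form value:table:rowID follow the scheme shown in Figure 1, so every row sharing a value lands in one contiguous range that a prefix scan can collect without a full-table scan.

```java
import java.util.*;

public class ValueIndexSketch {
    // Sorted map emulates HBase's lexicographic row-key ordering.
    private final TreeMap<String, Boolean> index = new TreeMap<>();

    // Store a cell under the composite key value:table:rowID.
    public void put(String value, String table, String rowId) {
        index.put(value + ":" + table + ":" + rowId, Boolean.TRUE);
    }

    // A value lookup is a contiguous prefix scan: ';' is the character
    // after ':', so [value + ":", value + ";") bounds all keys for value.
    public List<String> rowsWithValue(String value) {
        List<String> hits = new ArrayList<>();
        for (String key : index.subMap(value + ":", value + ";").keySet()) {
            hits.add(key.substring(value.length() + 1)); // keep table:rowID
        }
        return hits;
    }

    public static void main(String[] args) {
        ValueIndexSketch colB = new ValueIndexSketch();
        // Column B of the two example tables from Figure 1.
        colB.put("b1", "T1", "row1");
        colB.put("b1", "T2", "row1");
        colB.put("b2", "T1", "row2");
        colB.put("b2", "T2", "row2");
        colB.put("b3", "T1", "row3");
        colB.put("b4", "T2", "row3");
        System.out.println(colB.rowsWithValue("b1")); // [T1:row1, T2:row1]
    }
}
```

In real HVID the keys would be row-key byte arrays rather than Java strings, but the contiguous-range property is the same.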
Future Work:
o Complete working implementation for data lookup
o Compare functionality with and without backup row-oriented records
o Modify BulkLoad to support multi-table output from a single MapReduce
o Analyze row key structure for region balancing
o Load implementation onto cluster
o Benchmark various operations in various cluster configurations
o Create Hive interface for increased functionality and ease-of-use
o Create automatic upload system and interface
o Create web-interface for database access
o Explore support for varying data-types within a field via qualifiers
o Explore inverted clustering techniques based on inverted data
DISCLAIMER: This material is based upon work supported by the National Science Foundation and the Department of Defense under Grant No. CNS-1263183. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation or the Department of Defense.
Results:
o Custom database generation and access tools were created for HBase using Java. Full functionality is not yet complete, but load and basic data retrieval have been implemented for the row-oriented segment of HVID.
o Whether row-oriented copies of data should be kept alongside value-oriented tables cannot be determined until a full prototype is built and merge speeds are tested under various configurations.
o While results are inconclusive, selection of rows through value-oriented storage shows promise.
Figure 1: Value-Based storage
Preliminary Testing:
o Collective storage consumption by value-based HVID tables was found to be 86% of that required for a classic HBase table.
o Further compression can be realized when employing numeric data
o Value-based access is not yet fully implemented for testing.
o Results show little difference between HVID and classic HBase for basic row-oriented retrieval (when using rows as a backup form).
o The system must be tested on a full Hadoop cluster before speed-test results can be considered meaningful.
Figure 2: Preliminary Space and Time Comparisons
[Bar charts:]
o Space taken for database storage* (Row Based vs. Value Based, MB): Classic HBase 116; HVID 108 and 128.
o Time to select 360k rows by value** (ms): Classic HBase 11,500; HVID 5,500.
* Tests used a 25MB .tsv text file as source data; when using non-duplicated value-oriented usage.
** Only row IDs were selected; pulling content remains to be implemented.
Figure 1 content (Source Data vs. Classic HBase row-oriented storage vs. HVID storage):

Source Data:
T1 (columns A, B, C): row1 = (a1, b1, c1); row2 = (a2, b2, c2); row3 = (a3, b3, c3)
T2 (columns B, C, D): row1 = (b1, c2, d1); row2 = (b2, c4, d2); row3 = (b4, c5, d3)

Classic HBase (row-oriented) storage, one entry per table row:
T1:row1 → {(A, a1), (B, b1), (C, c1)}
T1:row2 → {(A, a2), (B, b2), (C, c2)}
T1:row3 → {(A, a3), (B, b3), (C, c3)}
T2:row1 → {(B, b1), (C, c2), (D, d1)}
T2:row2 → {(B, b2), (C, c4), (D, d2)}
T2:row3 → {(B, b4), (C, c5), (D, d3)}

HVID (value-oriented) storage, one sorted key per value:table:rowID:
A: a1:T1:row1, a2:T1:row2, a3:T1:row3
B: b1:T1:row1, b1:T2:row1, b2:T1:row2, b2:T2:row2, b3:T1:row3, b4:T2:row3
C: c1:T1:row1, c2:T1:row2, c2:T2:row1, c3:T1:row3, c4:T2:row2, c5:T2:row3
D: d1:T2:row1, d2:T2:row2, d3:T2:row3
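As a worked example of "merged by any field", the hypothetical sketch below (illustrative, not the HVID code) joins Figure 1's T1 and T2 on column B using only the value-oriented keys (e.g. b1:T1:row1). Because rows sharing a value are already adjacent in sorted order, the join pairs them per value with no full-table comparison.

```java
import java.util.*;

public class ValueJoinSketch {
    // Natural join on one value-oriented column: pair T1/T2 rows per shared value.
    static List<String> join(SortedSet<String> column) {
        // Group row IDs by value, then by source table.
        Map<String, Map<String, List<String>>> byValue = new TreeMap<>();
        for (String key : column) {
            String[] p = key.split(":"); // value, table, rowID
            byValue.computeIfAbsent(p[0], v -> new TreeMap<>())
                   .computeIfAbsent(p[1], t -> new ArrayList<>())
                   .add(p[2]);
        }
        // Emit one result per T1/T2 row pair that shares a value.
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Map<String, List<String>>> e : byValue.entrySet())
            for (String r1 : e.getValue().getOrDefault("T1", List.of()))
                for (String r2 : e.getValue().getOrDefault("T2", List.of()))
                    out.add(e.getKey() + ": T1." + r1 + " = T2." + r2);
        return out;
    }

    public static void main(String[] args) {
        // Column B of Figure 1, keyed value:table:rowID.
        TreeSet<String> colB = new TreeSet<>(List.of(
            "b1:T1:row1", "b1:T2:row1", "b2:T1:row2",
            "b2:T2:row2", "b3:T1:row3", "b4:T2:row3"));
        join(colB).forEach(System.out::println);
        // prints:
        // b1: T1.row1 = T2.row1
        // b2: T1.row2 = T2.row2
    }
}
```

Values b3 and b4 appear in only one table, so they contribute no pairs; the value index makes that evident without touching the other table's rows.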