Source: eprints.qut.edu.au/16507/1/Bradley_Schatz_Thesis.pdf

Digital Evidence: Representation and Assurance

by

Bradley Schatz

Bachelor of Science (Computer Science), UQ, Australia 1995

Thesis submitted in accordance with the regulations for the Degree of Doctor of Philosophy

Information Security Institute Faculty of Information Technology

Queensland University of Technology

October, 2007


Keywords

Digital evidence, computer based electronic evidence, digital forensics, computer forensics, forensic computing, evidence provenance, evidence representation, knowledge representation.


Abstract

The field of digital forensics is concerned with finding and presenting evidence sourced from digital devices, such as computers and mobile phones. The complexity of such digital evidence is constantly increasing, as is the volume of data which might contain evidence. Current approaches to interpreting and assuring digital evidence rely implicitly on the use of tools and representations made by experts in addressing the concerns of juries and courts. Current forensics tools are best characterised as not easily verifiable, lacking in ease of interoperability, and burdensome on human process.

The tool-centric focus of current digital forensics practice impedes access to and transparency of the information represented within digital evidence as much as it assists, by nature of the tight binding between a particular tool and the information that it conveys. We hypothesise that a general and formal representational approach will benefit digital forensics by enabling higher degrees of machine interpretation, facilitating improvements in tool interoperability and validation. Additionally, such an approach will increase human readability.

This dissertation summarises research which examines at a fundamental level the nature of digital evidence and digital investigation, in order that improved techniques which address investigation efficiency and assurance of evidence might be identified. The work follows three related themes: representation, analysis techniques, and information assurance.

The first set of results describes the application of a general purpose representational formalism to representing the diverse information implicit in event based evidence, as well as domain knowledge and investigator hypotheses. This representational approach is used as the foundation of a novel analysis technique which uses a knowledge based approach to correlate related events into higher level events, which correspond to situations of forensic interest.

The second set of results explores how digital forensic acquisition tools scale and interoperate, while assuring evidence quality. An improved architecture is proposed for storing digital evidence, analysis results and investigation documentation in a manner that supports arbitrary composition into a larger corpus of evidence.

The final set of results focuses on assuring the reliability of evidence. In particular, these results focus on assuring that timestamps, which are pervasive in digital evidence, can be reliably interpreted relative to real world time. Empirical results are presented which demonstrate that simple assumptions cannot be made about computer clock behaviour. A novel analysis technique for inferring the temporal behaviour of a computer clock is proposed and evaluated.

Table of Contents

Keywords
Abstract
Table of Contents
List of Tables
List of Figures
List of Abbreviations
Declaration
Previously Published Material
Acknowledgements

Chapter 1. Introduction
    1.1 Digital forensics and digital evidence
    1.2 Contributions
    1.3 Dissertation roadmap

Chapter 2. Background: Digital forensics
    2.1 A brief history of digital forensics
    2.2 Digital evidence & digital forensics defined
        2.2.1 The nature of digital evidence
        2.2.2 Perspectives on the digital investigation process
    2.3 Digital forensics tools
        2.3.1 Acquisition tools
        2.3.2 Examination & analysis tools
        2.3.3 Integrated digital investigation environments
        2.3.4 Models of tools
        2.3.5 Current approaches to tool integration
    2.4 Key challenges
        2.4.1 Volume & complexity
        2.4.2 Effective forensics tools and techniques
        2.4.3 Meeting the standard for scientific evidence
    2.5 Conclusions

Chapter 3. Related work
    3.1 Event correlation for forensics
        3.1.1 Approaches to modeling events
        3.1.2 Event patterns and event pattern languages
        3.1.3 Observations
    3.2 Current approaches to evidence representation and format
        3.2.1 Digital evidence container formats
        3.2.2 Representation of digital investigation documentation
        3.2.3 Observations
    3.3 Reliable interpretation of time
        3.3.1 An introduction to computer timekeeping
        3.3.2 Reliable time synchronization
        3.3.3 Factors affecting timekeeping accuracy
        3.3.4 Usage of timestamps in forensics
        3.3.5 Observations
    3.4 Conclusion

Chapter 4. Digital evidence representation: addressing the complexity and volume problems of digital forensics
    4.1 Introduction
    4.2 Background on knowledge representation
        4.2.1 Historical foundations
        4.2.2 Defining knowledge representation
        4.2.3 Hybrid approaches
    4.3 Semantic markup languages
        4.3.1 A basic introduction to the RDF data model
        4.3.2 RDF serialisation
        4.3.3 Adding semantics to published RDF data
    4.4 KR in digital forensics and IT security
    4.5 A formal KR approach to investigation documentation and digital evidence
    4.6 Conclusion

Chapter 5. Event representation in forensic event correlation
    5.1 Introduction: Event correlation in digital forensics
    5.2 Ontologies, KR and a new approach
        5.2.1 Knowledge representation framework
        5.2.2 Application architecture
    5.3 Implementation
        5.3.1 The design of the event representation
        5.3.2 Log parsers
        5.3.3 A heuristic correlation language – FR3
    5.4 Case study 1: Intrusion forensics
        5.4.1 Investigation using FORE
        5.4.2 Experimental results
    5.5 Case study 2: Extending the approach to new domains
        5.5.1 Integration of standard ontologies
        5.5.2 Integrating new domains
        5.5.3 Experimental results
    5.6 Conclusion

Chapter 6. Sealed digital evidence bags
    6.1 Introduction
    6.2 Definitions
    6.3 An extensible information architecture for digital evidence bags
        6.3.1 Storage container architecture
        6.3.2 Information architecture
        6.3.3 Integrity
        6.3.4 Evidence assurance
        6.3.5 Clarifications
    6.4 Usage scenario: imaging and annotation
    6.5 Experimental results
    6.6 Conclusion and future work

Chapter 7. Temporal provenance & uncertainty
    7.1 Introduction
    7.2 Characterising the behaviour of drifting clocks
        7.2.1 Experimental setup
        7.2.2 Analysis and discussion of results
    7.3 Identifying computer timescales by correlation with corroborating sources
        7.3.1 Experimental setup
        7.3.2 Challenges in correlating browser and squid logs
        7.3.3 Analysis methodology
        7.3.4 Clickstream correlation algorithm
        7.3.5 Results
        7.3.6 Non-cached records correlation algorithm
        7.3.7 Results
    7.4 Discussion
        7.4.1 Relation to existing work
    7.5 Conclusions

Chapter 8. Conclusions and future work
    8.1 Summary of contributions and achievements
    8.2 Discussion of main themes and conclusions
        8.2.1 Addressing complexity and volume of digital evidence
        8.2.2 Assurance of fundamental temporal information
    8.3 Implications of work
    8.4 Opportunities for further work
        8.4.1 Document oriented evidence
        8.4.2 Ontologies in digital forensics
        8.4.3 Temporal assumptions underlying event correlation
        8.4.4 Characterising temporal behavior of computers
        8.4.5 Event pattern languages

Chapter 9. Bibliography

List of Tables

Table 1: Challenges in digital forensics - DFRWS 2006 keynote
Table 2: RDF/XML serialisation of two triples
Table 3: RDF/XML serialisation using XML Namespace abbreviation
Table 4: Alternative but semantically equivalent RDF syntax tailored to type definition
Table 5: RDF/XML serialisation of statement "A Person named Kevin Bacon and a Person named Sarah Jessica Parker starred in the Movie 'Footloose'."
Table 6: N3 serialisation of statement from Table 5
Table 7: A simple Movie related ontology
Table 8: Web Session / Causality Correlation Rule
Table 9: OSExploit Heuristic Rule
Table 10: SAP Related Events
Table 11: Identity Masquerade Rule
Table 12: Door Entry - Login rule
Table 13: The file content of a browser log SDEB
Table 14: XML/RDF content of Investigation Documentation File named jbloggs.cache.index.dat.rdf
Table 15: Digital Evidence Bag instance data stored in the Tag File
Table 16: Evidence Content message digest property
Table 17: Investigation Documentation Container Metadata stored in the Tag File
Table 18: Annotated information from composing SDEB

List of Figures

Figure 1: Corresponding phases of linear process models of digital forensic investigation
Figure 2: Event based digital investigation framework
Figure 3: Digital crime scene specific investigation phases
Figure 4: Carrier's digital forensics tool abstraction layer model
Figure 5: Turner's digital evidence bag
Figure 6: Trivial set of physical, digital and document evidence
Figure 7: Current Semantic Web standards
Figure 8: Basic RDF node-arc-node triple
Figure 9: RDF statement "A person named Kevin Bacon starred in a movie named 'Footloose'"
Figure 10: Unambiguous meaning is given to concepts and instances through naming with URIs
Figure 11: RDF Graph representing statement "A Person named Kevin Bacon and a Person named Sarah Jessica Parker starred in the Movie 'Footloose'."
Figure 12: The FORE Architecture
Figure 13: Instance and Class/Subclass relationships between events
Figure 14: Causal ancestry graph of exploit
Figure 15: Related events remain unconnected because of surrogate proliferation
Figure 16: Correlated event graphs after proliferated surrogates merged
Figure 17: Causal ancestry graph of identity masquerading scenario
Figure 18: Referencing nested and external digital evidence bags
Figure 19: Proposed sealed digital evidence bag structure
Figure 20: RDF Graph relating original data object and image
Figure 21: RDF graph resulting from addition of new documentation to embedded DEB
Figure 22: Experimental setup for logging temporal behaviour of Windows PCs in a small business network
Figure 23: Clock skew of Domain Controller "Rome" offset from civil time
Figure 24: Clock skew of workstation "Florence" offset from civil time
Figure 25: Clock skew of workstation "Milan" offset from civil time (zoomed)
Figure 26: Clock skew of workstation "Trieste" offset from civil time
Figure 27: Clock skew of "Rome" vs. "Milan" offset from civil time (zoomed)
Figure 28: Experimental setup for correlation
Figure 29: Matching is complicated by only the most recent record being present in the history
Figure 30: Correlated skew (clickstream) vs. experimental skew (timeline) for host "Milan" do not correlate because of presence of false positives
Figure 31: Correlated skew vs. experimental skew for host "Milan" correlates when false positives are removed
Figure 32: "Pompeii" cache correlation
Figure 33: History Correlation vs. Timescale
Figure 34: Incomplete information

List of Abbreviations

AAFS: American Academy of Forensic Sciences

ACPO: (UK) Association of Chief Police Officers

AFF: Advanced Forensics Format

API: Application Programming Interface

APIC: Advanced Programmable Interrupt Controller

BIOS: Basic Input Output System

CART: Computer Analysis and Response Team

CDESF: Common Digital Evidence Storage Format

CEP: Complex Event Processing

CERN: The European Particle Physics Laboratory (Conseil Européen pour la Recherche Nucléaire)

CERT: Computer Emergency Response Team

CFSAP: Computer Forensics Secure Analyse Present

CFTT: Computer Forensics Tool Testing

DAML: DARPA Agent Markup Language

DARPA: Defense Advanced Research Projects Agency

DCO: Drive Configuration Overlay

DC: (MS Windows) Domain Controller

DCS: Digital Crime Scene

DE: Digital Evidence

DEB: Digital Evidence Bag

DEID: Digital Evidence Identifier

DF: Digital forensics

DFRWS: Digital Forensics Research Workshop

DL: Description Logic


DMCA: Digital Millennium Copyright Act

DTD: Document Type Definition

DLG: Directed Labelled Graph

DO: Data Object

ERP: Enterprise Resource Planning

FBI: (US) Federal Bureau of Investigation

FOL: First Order (Predicate) Logic

FSM: Finite State Machine

FTK: Forensics Toolkit

HDD: Hard Disk Drive

HTML: Hypertext Markup Language

HPA: Host Protected Area

IDS: Intrusion Detection System

IE: Internet Explorer

IOCE: International Organisation on Computer Evidence

KIF: Knowledge Interchange Format

KR: Knowledge Representation

LSID: Life Sciences Identifier

MAC: Media Access Control

MD5: Message Digest 5

MRU: Most Recently Used

N3: Notation 3

NIJ: (US) National Institute of Justice

NIST: (US) National Institute of Standards and Technology

NL: Natural Language

NSRL: (US) National Software Reference Library

NTP: Network Time Protocol

OIL: Ontology Inference Layer

OWL: Web Ontology Language

P2P: Peer To Peer

PDA: Personal Digital Assistant

RAID: Redundant Array of Inexpensive Disks

RDF: Resource Description Framework

RDFS: RDF Schema


RTC: Real Time Clock

SDEB: Sealed Digital Evidence Bag

SGML: Standard Generalized Markup Language

SIM: Subscriber Identity Module

SNTP: Simple Network Time Protocol

SUO: Standard Upper Ontology

SUMO: Suggested Upper Merged Ontology

SWGDE: Scientific Working Group on Digital Evidence

SPARQL: SPARQL Protocol and RDF Query Language

TSK: The Sleuth Kit

UMM: Unified Modelling Methodology

URI: Uniform Resource Identifier

URL: Uniform Resource Locator

URN: Uniform Resource Name

UTC: Coordinated Universal Time

WWW: World Wide Web

W3C: World Wide Web Consortium

XML: Extensible Markup Language

XML-NS: XML Namespace

XSD: XML Schema Definition


Declaration

The work contained in this dissertation has not been previously submitted to meet requirements for an award at this or any other higher education institution. To the best of my knowledge and belief, this dissertation contains no material previously published or written by any other person except where due reference is made.

Signed: ……………………………… Date:……………………………


Previously Published Material

The following papers have been published or presented, and contain material based on the content of this dissertation.

Schatz, B., Mohay, G. and Clark, A. (2004) 'Rich Event Representation for Computer Forensics', Proceedings of the 2004 Asia Pacific Industrial Engineering and Management Systems (APIEMS 2004), Brisbane, Australia.

Schatz, B., Mohay, G. and Clark, A. (2004) 'Generalising Event Forensics Across Multiple Domains', Proceedings of the 2004 Australian Computer Network and Information Forensics Conference (ACNIFC 2004), Perth, Australia.

(revised version published as)

Schatz, B., Mohay, G. and Clark, A. (2005) 'Generalising Event Correlation Across Multiple Domains', Journal of Information Warfare, vol. 4, iss. 1, pp. 69-79.

Schatz, B. and Clark, A. (2006) 'An information architecture for digital evidence integration', Proceedings of the 2006 Australian Security Response Team Annual Conference (AUSCERT 2006), Gold Coast, Australia.

Schatz, B., Mohay, G. and Clark, A. (2006) 'Establishing temporal provenance of computer event log evidence', Digital Investigation, 3 (Supplement 1), pp. 89-107.

(also published as)

Schatz, B., Mohay, G. and Clark, A. (2006) 'Establishing temporal provenance of computer event log evidence', Proceedings of the 2006 Digital Forensics Research Workshop (DFRWS 2006), West Lafayette, USA.


In loving memory of my father, Gregory Schatz.


Acknowledgements

This dissertation, like most, is the product of one author, yet has been shaped by a cast of supporters, colleagues, friends and family. I would like to express my sincere appreciation in the following paragraphs.

I would like to thank Adjunct Professor George Mohay, Dr. Andrew Clark and Associate Professor Peter Best for their supervision. George and Andrew's guidance and inspiration have been instrumental in directing the course of this research. Thank you both for giving me the opportunity to spend this time researching and for freely sharing your insight, time, energy and experience.

George deserves special mention for the methodical and focused attention which he applied to my writing; it has been a true pleasure writing papers together. Andrew's patience and willingness to offer an alternative perspective have often helped clarify otherwise murky waters.

The Information Security Institute (ISI) provided the resources and environment for me to perform this research, contributions which I greatly appreciate. Without the help of Ed Dawson, Colin Boyd, and Mark Looi I would not have found the opportunity to research at the ISI. Many thanks to the ISI staff who have helped along the way. Additional thanks go to SAP Research, who supported some research related to Chapter 5.

I would like to express my appreciation to Peter Best for providing important direction related to the event correlation work presented in Chapter 5. Peter Kingsley (Qld. Police Forensics Unit) provided valuable assistance in provenance related issues which contributed towards the results in Chapter 6.

Many thanks go to my colleagues who have helped along the way by reading drafts of papers and discussing ideas. I would especially like to thank Jason Smith (ISI), Mark Branagan (ISI) and Julienne Vayssiere (SAP Research).

Finally, to my family, thank you all for your encouragement and understanding during the period of this research. In particular I would like to give heartfelt thanks to my wife Kelly, who has patiently supported me throughout this period, enduring far too much seriousness and absence on my part.


Chapter 1. Introduction

All our lauded technological progress – our very civilization – is like the axe in the hand of the pathological criminal.

(Albert Einstein)

The aim of the work described in this thesis is to examine, at a fundamental level, the nature of digital (computer based) evidence, so that improved digital investigation techniques addressing investigation efficiency and evidence assurance might be identified. In our opinion, the current tool-centric focus of digital forensics impedes access to the information represented within digital evidence as much as it assists, by nature of the tight binding between the tool and the information which it conveys. We hypothesise that a shift in focus towards the data, by employing a common representational approach, would benefit the field in areas such as tool interoperability, assurance and validation.

1.1 Digital forensics and digital evidence

In times past, computer evidence meant "the regular print out from a computer" [120]. Computer evidence today means data from storage media such as hard drives and floppy disks, captures of data transmitted over communications links, emails, and log files generated by operating systems. What was formerly called computer evidence is now also called digital evidence, including new classes of evidence drawn from a plethora of digital devices which do not fit the conventional concept of a computer. PDAs, mobile phones, engine management systems in cars, and even washing machines are all examples of this.

A body of widely accepted techniques for seizing computers and digital storage media and copying (or imaging) the contents of media has been developed. Further techniques address analysis of digital content for information relevant to a particular investigation, and presentation of the evidence. Finally, these techniques and resultant information must be both independently verifiable and understood in contexts such as courts of law. Protocols have been established which prevent contamination of evidence, and establish continuity of evidence. These techniques and protocols, which ultimately aim to find evidence and have it accepted as such by the courts, form a part of the practice of digital forensics or computer forensics.
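As one illustration of how an acquired image is made independently verifiable (a common practice in the field; the function name and chunk size below are illustrative assumptions, not drawn from this thesis), a message digest such as MD5 is recorded at acquisition time and recomputed later by any party who wishes to confirm the image is unchanged:

```python
import hashlib

def image_digest(path: str, algorithm: str = "md5", chunk_size: int = 1 << 20) -> str:
    """Compute a message digest of a disk image, reading it in chunks so that
    arbitrarily large images can be hashed without loading them into memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # iter() with a sentinel yields successive chunks until EOF (b"").
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

If the digest recomputed during later analysis matches the one recorded at seizure, this supports the claim that the image has not been altered; a mismatch indicates contamination or corruption.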

Since at least 1967¹ courts of law have been faced with challenges relating to admitting computer related evidence. In these instances the courts have generally followed rules of evidence, based on their legal tradition, for guidance on whether to admit evidence into proceedings. The "regular print outs from a computer" required the testimony of the computer's maintainer asserting the correct operation of the computer, and were often admitted using the "business records" exemption to the hearsay rule of evidence.

In the year 2007, the courts regularly accept as evidence a wide variety of digital document based evidence which has similar non-digital equivalents. For example, emails are similar to postal mail, word processor documents to typewritten documents, and accounting software records to bookkeeping ledgers. Digital evidence artefacts based on new ideas with less direct relation to real world practice or artefacts have also become widely accepted. For example, processes which recover deleted files or partial data from slack space² are widely understood and accepted.

The principles by which digital evidence is evaluated, accepted into legal

proceedings, and ascribed weight vary widely from jurisdiction to jurisdiction.

Countries with a common law background, including the United Kingdom, Australia and the United States, nevertheless share a number of common principles. In

¹ According to Parker [100], the first successfully prosecuted case (in a federal US jurisdiction) involving the criminal use of a computer concluded on January 10, 1967. The defendant, a computer programmer, worked on a reporting system for overdrawn checking accounts for the National City Bank of Minneapolis. The defendant, whose personal checking account was with the same bank and subject to the same processing system, patched the program to hide a growing personal debt. The situation was discovered when a computer failure caused processing to revert to manual methods.

² Slack space is an emergent artefact of filesystems related to their block oriented allocation strategies; it refers to an area in the last block or cluster used to store the final part of a file. Where the final chunk of the file only partly uses this block or cluster, the remainder of the storage area remains unused. It is this unused area that is referred to as slack space.
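The block arithmetic behind slack space can be sketched in a few lines (a simplified model; the 4 KiB cluster size is an assumption for illustration, as real filesystems choose cluster sizes at format time):

```python
import math

def slack_space(file_size: int, cluster_size: int = 4096) -> int:
    """Bytes left unused in the final cluster allocated to a file.

    Assumes a fixed, hypothetical cluster size; real filesystems
    select cluster sizes when the volume is formatted.
    """
    if file_size == 0:
        return 0
    clusters = math.ceil(file_size / cluster_size)
    return clusters * cluster_size - file_size

# A 10,000-byte file on 4 KiB clusters occupies three clusters
# (12,288 bytes), leaving 2,288 bytes of slack.
print(slack_space(10_000))  # 2288
```

Any data previously stored in that unused tail of the cluster may survive and be recoverable, which is why slack space is of forensic interest.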


1998, Sommer described the following basic principles for evaluating the acceptability

of new types of evidence not previously considered by courts:

• authentic – the evidence should be “specifically linked to the circumstances

and persons alleged – and produced by someone who can answer questions

about such links”.

• accurate – the evidence should be “free from any reasonable doubt about the

quality of procedures used to collect the material, analyse the material if that is

appropriate and necessary and finally to introduce it into court - and produced

by someone who can explain what has been done. In the case of exhibits which

themselves contain statements - a letter or other document, for example – ‘accuracy’ must also encompass accuracy of content; and that normally requires the document’s originator to make a Witness Statement and be available for cross-examination”

• complete – “tells within its own terms a complete story of (a) particular set of

circumstances or events” [121]

In addition to these considerations, technically oriented evidence (forensic

evidence) must exhibit the following properties:

• chain of evidence – “there should be a clear chain of custody or continuity of

evidence”

• transparent – “a forensic method needs to be transparent, that is, freely

testable by a third party expert”

• explainable – “in the case of material derived from sources with which most

people are not familiar quite extensive explanations may be needed”

• accurate – when evidence is presented that contains statements which were

originally created by a computer, “accuracy must encompass the accuracy of

the process which produced the statement as well as accuracy of content”

[121]

Digital evidence is by its nature fundamentally different from existing types of

physical evidence. By itself it contains no informational value. This contrasts with

regular forms of evidence such as documents or testimony, both of which are readily

understood by the literate and conversant. The content of (or information latent within)

the digital evidence is dependent upon the process by which it is interpreted. While this

process of interpreting the data is based upon principles of computer science, and could


potentially be performed manually by a skilled expert with sufficient time and

motivation, imperatives such as efficiency and reliability have driven the adoption of

tools to mechanise these tasks. In the early days of printouts this distinction was not clear, but as the seizure of storage media became general police practice, the role of tools in processing digital evidence came to the fore.

That digital forensics has been made the subject of a recent special issue of the Communications of the Association for Computing Machinery (CACM) [5] is indicative of the transition of the field towards the mainstream. The field is still,

however, far from mature, with widely acknowledged challenges remaining. The courts

are beginning to impose stricter standards as to what is admitted as scientific fact. The

United States legal system is increasingly less tolerant of “junk science”: data, research and conclusions which are presented as scientific in nature but lack the rigour and methodology of the scientific method. The challenge for digital forensics is to firm its foundations in scientific discipline, so that it can counter arguments that the field is a junk science.

In the United States, judges now act as gatekeepers for novel scientific evidence. Under the 1993 Daubert v Merrell Dow decision, expert evidence must satisfy the following strict criteria, which are commonly referred to as the “Daubert factors” [103]:

• whether the technique “can be (and has been) tested”,

• whether the technique has been “subjected to peer review and publication”,

• “the known or potential rate of error… and the existence of and maintenance

of standards controlling the technique’s operation”, and

• “general acceptance.”

Sommer is doubtful whether some of the computer forensic evidence accepted

by courts, and in turn the processes being used to interpret such evidence, can meet

these tests, citing issues of disclosure, testing and repeatability as having been

neglected or not applied uniformly [121, 122]. These concerns are paralleled by recent

calls for the practice of digital forensics to establish itself more like a forensic science

[98].

Despite some preliminary work advancing the theory behind validation [12], calls to validate the computer forensics toolset have produced largely lacklustre results. For example, the Computer Forensics Tool Testing (CFTT) project of the US National Institute of Standards and Technology (NIST) has so far tested only a handful of hardware write blocking and imaging devices, the function of which is already widely accepted.


The remainder of the digital forensics toolset generally employed remains without rigorous validation. Commercial tools such as Guidance Software’s EnCase³ (arguably the de-facto standard in commercial forensics software) and Access Data’s FTK⁴ remain unvalidated and are conspicuous for their lack of identified error rates. Surprisingly, they have not yet faced a Daubert challenge. The tools are monolithic in nature, their internal workings a closely guarded secret by nature of their closed source and commercial origins. Furthermore, in-depth tests by the community are limited by licensing contracts and legislation such as the Digital Millennium Copyright Act (DMCA) [78]. Open source development models have been proposed as a means of increasing the reliability of forensics tools [60]; however, the open model of development does not necessarily assure peer review or a reduced error rate.

The rapid and constant change of technology has resulted in a situation where, in general, forensics tools are best characterised as point solutions addressing specific tasks.

In the longest standing area of the practice, media acquisition and analysis, a

number of commercial and open source tools exist which acquire, interpret and explore

digital evidence sourced from storage media such as hard disks and flash drives.

Leading commercial tools such as EnCase and FTK integrate acquisition, analysis,

documentation and reporting functions at various layers of abstraction. These tools do

not, however, take a general approach to processing digital evidence. Integration of new analysis techniques and new sources of evidence, and validation of analysis and interpretation results, are hampered by the monolithic nature of the tools.

Outside the more established areas of the practice, an ecosystem of small task

specific tools exists, for example in the area of mobile device forensics. While

invaluable for specific tasks at hand, they lack an integrated approach. Employing data

from one tool in another tool may require reformatting, leaving the conversion process

a manual one and at risk of human error. Similarly, investigation related documentation

must be methodically and manually maintained outside of the context of the tool.

The constantly changing nature of computing and information technology

creates additional challenges beyond those of raising the practice to the level of rigour

of the forensic sciences. Firstly, the emergence of new or improved devices and

software constantly introduces new sources of complexity which must be addressed in order to acquire, interpret and analyse evidence, often requiring the development of new techniques. This is referred to as the complexity problem.

Secondly, the volume of data which may have relevance to an investigation is

³ http://www.guidancesoftware.com/
⁴ http://www.accessdata.com/


increasing markedly. The number of individual units of potential evidence is growing because of network effects, with evidence increasingly distributed over multiple devices. Furthermore, the volume of data in each unit under consideration is also rising sharply, with multiple terabyte acquisitions becoming common. This is referred to as the volume problem.

With the prevalence of cases involving digital evidence approaching a watershed, the challenge for digital forensics is to increase both reliability and rigour while at the same time increasing the efficiency of investigation. Without these

efficiencies, mounting pressure on the courts to accept and employ digital evidence

based on “junk science” could lead to failures in the administration of justice, or limit

the employment of digital evidence to only the few who can afford it.

1.2 Contributions

New paradigms for interacting with, managing, processing and presenting

digital evidence are needed for achieving these efficiencies and reliability of findings.

Current approaches to digital investigation rely overly on human intervention as the glue which binds together the operation of disparate tools over opaque data. All the while, investigation documentation must be generated and maintained sufficiently to provide assurance of the authenticity and provenance of evidence and the reproducibility of findings.

The aim of the work described in this dissertation is to investigate at a

fundamental level the nature of information examined, inferred and reported in digital

investigations, to identify techniques which facilitate documenting of digital

investigations, analysis of digital evidence, and reporting of findings, while at the same

time assuring reliability and authenticity of digital evidence. Our research addresses the

complexity problem by supporting the expression of arbitrary information related to the

investigation, and the volume problem by enabling scalable approaches to digital

evidence.

This dissertation summarises a significant body of research performed

following three intertwined themes: representation, analysis techniques, and

information assurance. The original contributions contained within this dissertation are:

• The proposition that a formal knowledge representation approach to digital evidence will yield benefits that address the current digital forensics problems of complexity and volume.

• Demonstration of the usefulness of a particular representational formalism,

RDF/OWL, in representing arbitrary and diverse information both implicit in


event log based evidence, investigation related documentation, and wider

domain knowledge. This is demonstrated in the context of building improved

forensic correlation tools, and in building interoperable forensics tools and

digital evidence storage formats.

• Demonstration of a novel analysis technique which supports automated

identification of high level forensically interesting situations by means of

heuristic event correlation rules which operate over event oriented information

described in the RDF/OWL formalism. Furthermore,

- A novel means of addressing the problem of surrogate proliferation⁵ is demonstrated, improving automated correlation through interactive (human guided) declaration of hypothetical equivalence relationships between surrogates.

• Demonstration of a novel approach to the problem of digital evidence storage

containers, proposing an architecture for containers of digital evidence and

arbitrary investigation related information. Our proposal enables composition

of evidence units and arbitrary related information into a larger corpus of

evidence, while assuring the integrity of evidence. Furthermore;

- A unique naming scheme for identifying digital evidence which

enables separate and subsequent addition of arbitrary information

without violating the integrity of original evidence or the evidence

container is defined.

• An analysis of the temporal behaviour of PC clocks as generally implemented

in the Microsoft Windows 2000 and XP operating systems and empirical

results demonstrating the unreliability of timestamps sourced from computers

running these operating systems.

- A novel approach for characterising the temporal behaviour of a host,

based on correlating commonly available local timestamps and

timestamps from a reference source.

1.3 Dissertation roadmap

This dissertation contains this chapter and another seven, described in overview below:

Chapter 1: Introduction

⁵ Defined in Chapter 5.


This chapter provides a brief introduction to the field of digital forensics, and

the subject of that field: digital evidence. A brief summary of the challenges in digital forensics is presented, followed by a summary of the contributions of this dissertation and, finally, the dissertation roadmap.

Chapter 2: Background: Digital forensics

This chapter is a comprehensive review of digital forensics and digital evidence

from practice and research perspectives. The chapter begins by describing the historical

context and evolution of the field. The field is then characterised by presenting multiple

perspectives of it, including noted definitions, the nature of digital evidence and its

relation to digital forensics tools. Finally, current approaches to representing and

documenting digital evidence are described, and limitations in evidence representation

are identified.

Chapter 3: Related work

This chapter reviews background material and related work relevant to the

work described in this dissertation in Chapters 4 to 7. Section 3.1 describes the

literature related to event correlation, both specifically related to computer forensics

and in a more general context. Chapter 5 builds upon this background. Section 3.2

describes current approaches to maintaining investigation documentation and evidence

storage, which is a subject of Chapter 6. Both sections 3.1 and 3.2 make observations related to representation from which Chapter 4 follows. Section 3.3 describes work related to computer timekeeping, forming background information for Chapter 7.

Chapter 4: Digital evidence representation: addressing the complexity &

volume problems of digital forensics

Following from the observed limitations in evidence representation made in

Chapter 3, this chapter reviews literature in the fields of Knowledge Representation

(KR) and markup languages, with the goal of representing digital evidence. These are

currently the two primary approaches to representing and communicating knowledge

outside of natural language. The historical context of KR is described, followed by a

description of the major approaches to KR. The historical context of markup languages

is then relayed, leading to a description of the current state of KR, its influence on

markup languages, and the current research agenda towards building a “Semantic Web”

of knowledge. A brief introduction to the RDF/OWL representational formalism is then

presented. Finally, the chapter concludes by proposing that this formalism would be of

benefit towards solving the complexity and volume problems of computer forensics.
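The triple-based style of representation underlying RDF can be given a minimal flavour in plain Python. The vocabulary, URIs and event fields below are invented for illustration; the research itself employs the RDF/OWL formalism and associated toolkits rather than ad-hoc tuples:

```python
# A sketch of event information as RDF-style (subject, predicate, object)
# triples. All names and URIs here are hypothetical.
EX = "http://example.org/forensic#"

triples = [
    (EX + "event/1", EX + "type", EX + "LogonEvent"),
    (EX + "event/1", EX + "host", "workstation-7"),
    (EX + "event/1", EX + "user", "jsmith"),
    (EX + "event/1", EX + "occurredAt", "2007-03-14T09:26:53Z"),
]

def objects(subject, predicate):
    """Query the triple store: all objects for a (subject, predicate) pair."""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(objects(EX + "event/1", EX + "user"))  # ['jsmith']
```

Because every statement has the same uniform shape, arbitrary new properties and evidence sources can be added without changing any schema, which is the property that makes this style of representation attractive for heterogeneous forensic data.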

Chapter 5: Heuristic event correlation for forensics

This chapter addresses the themes of evidence representation and analysis

techniques. Following from the proposal of the RDF/OWL formalism as a


representation layer for documenting arbitrary information in a machine and human readable manner, this chapter demonstrates that the formalism is a useful generic representation upon which digital forensics applications might be built. This is

performed in the context of correlation of computer and other event logs for the

purpose of forensics. An automated technique of identifying situations of interest in

computer forensic investigations is presented.

A novel means of resolving the problem of surrogate proliferation in event based knowledge is proposed, and it is shown that this technique assists in increasing the quality of automatically inferred results and in reducing the volume of entities which must be manually considered. Finally, it is demonstrated that the approach is extensible and

generalisable to support integration of and reasoning with evidence from multiple

heterogeneous domains. This is demonstrated by applying the approach to a forensic

scenario involving Enterprise Resource Planning (ERP) software and door logs as well

as commonly available computer event logs.
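The general style of heuristic event correlation rule can be caricatured as follows. This is an illustrative reconstruction only: the event fields, the rule, and the ten-minute window are invented for the example, and the actual technique operates over event information described in RDF/OWL rather than Python dictionaries:

```python
from datetime import datetime, timedelta

# Hypothetical event records; real input would be parsed from
# heterogeneous logs (operating system, ERP, door access, ...).
events = [
    {"type": "logon",  "user": "jsmith", "time": datetime(2007, 3, 14, 9, 26)},
    {"type": "delete", "user": "jsmith", "time": datetime(2007, 3, 14, 9, 31),
     "path": "C:/payroll.xls"},
]

def correlate_logon_then_delete(events, window=timedelta(minutes=10)):
    """Heuristic rule: flag file deletions occurring shortly after a
    logon by the same user, as a candidate 'situation of interest'."""
    situations = []
    logons = [e for e in events if e["type"] == "logon"]
    for d in (e for e in events if e["type"] == "delete"):
        for l in logons:
            if l["user"] == d["user"] and \
               timedelta(0) <= d["time"] - l["time"] <= window:
                situations.append((l, d))
    return situations

for logon, delete in correlate_logon_then_delete(events):
    print(f"{delete['user']} deleted {delete['path']} "
          f"{delete['time'] - logon['time']} after logon")
```

A rule engine evaluates many such patterns over the pooled events, surfacing candidate situations for an investigator to confirm rather than asserting conclusions outright.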

Chapter 6: Sealed digital evidence bags

This chapter addresses themes of representation and assurance in considering

how forensics tools scale and interoperate in an automated manner, while assuring

evidence quality. A novel architecture is proposed for storing and representing digital

evidence, analysis results, and investigation documentation in a manner that supports arbitrary composition of evidence units and related information into a larger corpus of evidence. Finally, a proof of concept is demonstrated by describing a prototype implementation of this architecture.
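The integrity property at the heart of such a container can be illustrated with a content-addressed naming sketch. This is not the architecture proposed in Chapter 6, merely a minimal illustration of the underlying idea: if evidence is named by a cryptographic digest of its content, annotations can be added alongside it, referring to it by name, without ever touching the original bytes:

```python
import hashlib

def evidence_id(data: bytes) -> str:
    """Name an evidence unit by the SHA-256 digest of its content.
    The 'evidence:sha256:' prefix is a hypothetical naming convention."""
    return "evidence:sha256:" + hashlib.sha256(data).hexdigest()

image = b"\x00" * 512          # stand-in for an acquired disk image
eid = evidence_id(image)

# Later annotations refer to the evidence by name; the original bytes
# are never modified, so the name (and hence integrity) is preserved.
annotations = [
    {"about": eid, "note": "Acquired from suspect laptop, 2007-03-14"},
]

assert evidence_id(image) == eid   # adding annotations cannot change the id
```

Any tampering with the stored image changes its digest and therefore breaks the link to every annotation naming it, making the corruption detectable.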

Chapter 7: Temporal provenance and uncertainty

This chapter addresses the theme of assurance of evidence, examining one of

the primary challenges in relating real world events to events found in computer event

sources: determining the real world meaning of a particular timestamp. While

timestamps are ubiquitous in computer records, the clocks which generate them are

often unreliable, fluctuating with changes in temperature and other factors. This chapter

presents empirical results identifying where the real world behaviour of Microsoft

Windows based computer clocks diverge from the ideal. A novel analysis technique for

assuring the interpretation of timestamps generated by a particular computer clock

based on commonly available event logs is proposed and evaluated.
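The flavour of such a characterisation can be sketched with an ordinary least-squares fit of local clock readings against reference times. The sample values below are invented for illustration, and the analysis in Chapter 7 is considerably more involved:

```python
# Sketch: characterise a host clock by fitting its timestamps against a
# reference time source. The sample data (a clock 5 s fast, drifting at
# 1000 ppm) is hypothetical.
ref   = [0.0, 600.0, 1200.0, 1800.0]    # reference time (seconds)
local = [5.0, 605.6, 1206.2, 1806.8]    # host clock at the same instants

n = len(ref)
mean_r = sum(ref) / n
mean_l = sum(local) / n

# Ordinary least squares: local ~ rate * ref + offset
rate = (sum((r - mean_r) * (l - mean_l) for r, l in zip(ref, local))
        / sum((r - mean_r) ** 2 for r in ref))
offset = mean_l - rate * mean_r

print(f"offset {offset:.1f} s, drift {(rate - 1) * 1e6:.0f} ppm")
```

Given such a model, a timestamp recorded by the host can be mapped back towards reference time, with the residuals of the fit giving a rough measure of the uncertainty of that interpretation.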

Chapter 8: Conclusion & Future Work

The concluding chapter identifies areas for future work.


Chapter 2. Background: Digital forensics

“Normal science does and must continually strive to bring theory and fact into closer agreement, and that activity can easily be seen as testing or as a search for confirmation or falsification.”

(Thomas Kuhn)

This chapter describes in detail the field of digital forensics. Section 2.1 begins

by describing the historical context and evolution of the field. Following this, Section

2.2 relates key definitions of digital forensics and digital evidence. Section 2.2.1

describes in detail the nature of digital evidence, and the following section, Section

2.2.2, describes the digital investigation process by surveying a number of process

models which have been proposed. The subject of Section 2.3 is digital forensics tools.

Finally, key research challenges in the field of digital forensics are outlined.

2.1 A brief history of digital forensics

The field of digital forensics is in a transitional state. Its origins are in solving

pragmatic acquisition and chain of evidence problems related to investigations,

performed, by and large, by law enforcement personnel with little formal background in

computing. The field has progressed to a point where, at national levels, best practice

standards and certification are being considered. Internationally, however, there is no

single accepted statement of standards or best practices, nor is there a generally

accepted governing body for the field [21]. The practice is not yet at a stage that

qualifies it to be called a forensic science.

The transitional nature of the field has impacts on any attempt to characterise

or critique it. The following sections examine the fundamental character of digital

forensics by reflecting on milestones in the history of the practice and definitions and

perspectives of the field made by actors who have shaped the field’s history.

The 1980s saw the beginnings of a need to deal with computer based evidence, which mainly involved minicomputers or mainframe computers. In 1984 in the UK, New Scotland Yard formed its Computer Crime Unit, and in the USA, the FBI


established a Magnetic Media Program, their first computer forensics initiative [9,

106]. The Magnetic Media Program later became the Computer Analysis and Response

Team (CART).

The late 1980s and early 1990s saw the proliferation of the PC platform, and in the early 1990s, widespread recognition that new techniques were required for preserving digital evidence. The first specific forensic imaging tool, IMDUMP, was developed in the USA, and was superseded in 1991 by a tool called Safeback [89]. In the UK in the same year, another

disk imaging application called the Data Image Back-Up System (DIBS) was produced

[9].

Computer forensics practitioners began to organise and evaluate their

techniques and practices; in 1993 the First International Law Enforcement Conference

on Computer Evidence was hosted by the FBI. Subsequent conferences led to the 1995

formation of the International Organization on Computer Evidence (IOCE), and the

1997 meeting which resolved to develop best practice standards [21]. Around this time

audio and video technologies were moving from analogue to digital, which led

practitioners to consider whether the same principles of computer forensics applied to

all types of digital evidence [142].

Efforts to define the principles of computer forensics resulted in 1999 in the

adoption by the IOCE of proposals authored by member organisations, the Scientific

Working Group on Digital Evidence (SWGDE), from the USA, and the Association of

Chief Police Officers (ACPO), from the UK [21]. The ACPO proposal has evolved into

what is known as the “Good Practice Guide for Computer based Electronic Evidence”

[6]. In 2002, based on the IOCE’s 2000 submission, the G8 issued the “G8 Proposed

principles for the procedures relating to digital evidence”. In Australia, the move

towards formal standardisation of the management and treatment of digital evidence

has begun with the 2003 definition of “Guidelines for the management of IT evidence”

[1].

The academic history of computer forensics goes back to the late 1980s and early 1990s with work by Collier and Spaul [33], Sommer [120] and Spafford [123]. By the late 1990s very little had been published in the open literature on computer forensics [88]; the new millennium, however, has seen an upturn in both digital forensics targeted publications and conferences, including the first two specifically targeted journals. The first digital

forensics targeted conference, The Digital Forensics Research Workshop, was

established in 2001, followed by the International Journal of Digital Evidence in 2002

and the International Journal of Digital Investigation in 2004.


That digital forensics has been made the subject of a recent special issue of the Communications of the Association for Computing Machinery (CACM) [5] is indicative of the transition of the field towards the mainstream.

2.2 Digital evidence & digital forensics defined

There was a time when the primary area of technical innovation was analogue electronics. In those pre-digital days, the courts adapted to new forms of evidence, including documents transmitted and received by telex, and visual and audio recordings on magnetic tape. Such evidence was termed electronic evidence. Today, the courts are coming to terms with a higher order of electronic evidence: digital evidence.

What exactly constitutes digital evidence is a moving target caused by the

continual emergence of new digital technologies, and additionally, because of the broad

definition of the word evidence. There are, however, a number of generally accepted

definitions which have been given by leading organisations and authors in the field

which serve to delineate the territory. These are presented below:

• information of probative value⁶ that is stored or transmitted in binary form (SWGDE)

[131]

• information stored or transmitted in binary form that may be relied upon in court

(IOCE) [57]

• any data stored or transmitted using a computer that support or refute a theory of how

an offence occurred or that address critical elements of the offence such as intent or

alibi (Casey) [27]

The above definitions are sourced from entities based primarily in the United States. The Association of Chief Police Officers of England, Wales and Northern Ireland defines Computer Based Electronic Evidence as:

• information and data of an investigative value that is stored on or transmitted by a computer (ACPO) [6]

The terms “digital evidence” and “computer based electronic evidence” are used synonymously. While the final definition parallels the earlier three, subtle differences remain. By defining computer evidence in relation to the investigative process, rather than to the legal one, the ACPO definition addresses digital data from the time it becomes a part of an investigation. The other definitions, however, limit their subject to data which has been examined and found relevant towards

⁶ Probative value: “The extent to which evidence could rationally affect the assessment of the probability of the existence of a fact in issue” [68]


establishing some theory. The IOCE definitions address this shortfall between the two

styles of definitions by defining the term Data Objects:

• Objects or information of potential probative value that are associated with physical

items. Data objects may occur in different formats without altering the original

information.

For this dissertation, the author chooses to adopt the IOCE definition of the

term digital evidence: “information of probative value⁷ that is stored or transmitted in

binary form”.

The term computer forensics was in informal use in academic publications

from at least 1992 [121]; however, the term remained informally defined for many years. A commonly cited definition of the field in Australian literature is McKemmish’s 1999 definition of forensic computing:

“The process of identifying, preserving, analysing and presenting digital evidence in a manner that is legally acceptable” [76]

The American Academy of Forensic Sciences defines forensics as follows:

The word forensic comes from the Latin word forensis: public; to the forum or public discussion; argumentative, rhetorical, belonging to debate or discussion. From there it is a small step to the modern definition of forensic as belonging to, used in or suitable to courts of judicature, or to public discussion or debate. Forensic science is science used in public, in a court or in the justice system. Any science, used for the purposes of the law, is a forensic science. [2]

This broad definition of forensics, and McKemmish’s earlier definition inform

the definition of computer forensics given by the Scientific Working Group on Digital

Evidence (SWGDE), whose definition is:

the scientific examination, analysis, and/or evaluation of digital evidence in legal matters [131].

Researchers attending the first Digital Forensics Research Workshop, 2001,

defined Digital Forensic Science as:

The use of scientifically derived and proven methods toward the preservation, collection, validation, identification, analysis, interpretation, documentation and presentation of digital evidence derived from digital sources for the purpose of facilitating or furthering the reconstruction of events found to be criminal, or helping to anticipate unauthorized actions shown to be disruptive to planned operations. [98]

This broad definition reflects a change in the forums in which the techniques of computer forensics are applied. While traditionally computer forensics was targeted exclusively at the legal forum, it is increasingly practised in non-legal contexts such as corporate investigations, intelligence and the military.

7 Probative value: “The extent to which evidence could rationally affect the assessment of the probability of the existence of a fact in issue” [68]


CHAPTER 2 – Background: Digital forensics 15

The terms digital forensics, forensic computing and computer forensics are

today arguably used interchangeably. Historically, computer forensics and forensic

computing8 related to the interpretation of computer related evidence in courts of law.

Technology, however, does not stand still, nor does language, and the meaning of these terms has remained under continual negotiation. Two factors underlie this process: the changing state of uptake of digital technologies and, with it, moves within organisations to govern and regulate the use of information technology.

The late 1980s and early 1990s were characterised by mainstream use of stand-alone PCs: the internet was in its infancy, confined largely to academic circles. In this

environment, the main subject of computer forensics was indeed the basic components

of computing: persistent storage, such as floppy disks and hard disks, and the software

itself. As computers became inter-networked and proliferated into small devices such

as PDAs and mobile phones, the field has broadened its scope beyond the computer to

include network forensics and small scale device forensics. Digital forensics more

accurately describes this new state of affairs.

New imperatives are further shaping the field: demand for computer forensics is rising outside of the traditional legal context. Today, digital forensics is practised by law enforcement, the military, intelligence agencies, and the corporate sector. Each of these sectors brings with it a divergent agenda, primarily related to the rigour required of any conclusions made by an investigation.

In the law enforcement context, the primary objective is the prosecution of an

alleged perpetrator of crimes. This necessarily dictates application of strict judicial

standards to the practice of computer forensics, because of the impacts on the freedoms

and liberties of the accused. In the military context, a secondary objective may be

prosecution; however this objective is subordinate to continuity of operations. In this

context, the practitioner of computer forensics is prepared to sacrifice accuracy for

immediate answers. This being the case, and under a time imperative, the conclusions

made by the practice of computer forensics in the military context cannot necessarily be

expected to be rigorous. The term digital investigation arguably reflects these subtle

changes in focus.

Despite the changes in agenda signalled by the use of the term digital investigation over computer forensics, it is the subject matter of the field which defines it and unifies its competing definitions. Digital evidence remains at the centre of the field.

The next section examines the nature of digital evidence.

8 For the reader interested in the evolution of definitions of the field, see “To Revisit: What is Forensic Computing?” [53]


2.2.1 The nature of digital evidence

Digital evidence is an interpretation of data, either at rest (when found on a

hard drive) or in motion (for example network communications) or a combination of

the two. Where it is derived from non-volatile computer storage devices, such as hard

disks and flash drives, digital evidence could constitute portions of or the entirety of the

data contained in the drive. Network traffic dumps are recordings of data of a more

dynamic and mobile nature, recording data communications over a period of time.

Digital evidence is, in the primary instance, derived from physical properties,

whether it is a particular arrangement of magnetic fields on the platter of a hard disk,

the charge state of a transistor in memory, or the oscillation of electromagnetic waves

in a wireless network. Despite this physical basis, the nature of digital evidence is

ephemeral and independent of its storage or transmission medium – its true value is in

its interpretation as information. It is this nature that dictates the primary properties of

digital data and consequently digital evidence: its latency, fidelity, and volatility. These

properties impact the practice of digital forensics in fundamental ways.

By latency, we refer to the latent nature of digitally encoded data. Binary data

conveys no information in and of itself. For example, the binary number 01100001

could be interpreted as the decimal number 97, or the character “a” depending on its

context. For it to be interpreted as information, it must first be processed.
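The latency property can be illustrated in a few lines of Python (a purely illustrative sketch, not drawn from the thesis): the same bit pattern yields different information under different interpretive contexts.

```python
# The bit pattern 01100001, interpreted three ways.
raw = 0b01100001

as_integer = raw                   # as an unsigned integer
as_character = chr(raw)            # as an ASCII character
as_bytes = raw.to_bytes(1, "big")  # as one byte of raw storage

print(as_integer)    # 97
print(as_character)  # a
print(as_bytes)      # b'a'
```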

The fidelity property means that the data found on a digital device at a crime

scene may be freely copied and treated as if it was the original, provided that the

copying process can be demonstrated to be correct. This property is exploited in digital

forensics by inductively treating the copy as if it is an original, enabling evidence to be

easily shared between people and tools.

This kind of action upon digital data carries substantial risk of accidental

deletion or modification: we refer to this risk by the term volatility. Exploitation of the

fidelity property and the threat of volatility lead to the question arising as to whether

the copy is an authentic replica of the original. Evidence presented in legal proceedings must be authentic: that is, one must be sure that it is neither fabricated nor a distortion.

Authenticating a piece of digital evidence is performed by testimony which

reliably identifies (or individuates) a particular piece of digital evidence, and then

establishes a chain of custody, the location of the evidence at all times since its

copying. The latent and voluminous nature of digital evidence makes discriminating

differences between two pieces of data in practice infeasible without tools. As such,


identification of digital evidence is achieved primarily through the use of hashing to generate a unique, and more easily distinguishable, individuator.
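The role of hashing as an individuator can be sketched as follows (illustrative Python; SHA-256 is used here for concreteness, though practice at the time commonly relied on MD5 or SHA-1):

```python
import hashlib

def individuate(data: bytes) -> str:
    """Return a compact individuator for a blob of digital evidence:
    the hex digest of its cryptographic hash."""
    return hashlib.sha256(data).hexdigest()

original = b"contents of the acquired drive image ..."
copy = bytes(original)  # a bit-for-bit duplicate

# Identical data yields identical digests: the copy can be
# authenticated against the digest recorded for the original.
assert individuate(copy) == individuate(original)

# A single changed byte yields an entirely different digest.
tampered = b"Contents of the acquired drive image ..."
assert individuate(tampered) != individuate(original)
```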

The term “digital evidence” is a general term, only a little more specific than

the term “data”. Accordingly, it is used to refer to a plethora of things when talking

about digital forensics. In this section, we consider a number of categories which may

be used to discriminate between what is being referred to when speaking of digital

evidence.

Digital Crime Scene: After Carrier’s definition, a digital crime scene is the

data contained in a digital device, such as a hard drive or mp3 player, found at a

physical crime scene [26]. This is equivalent to the IOCE defined Data Object [130].

Well accepted protocols exist for preserving and collecting the digital crime scenes

associated with computer storage media, however other classes of digital crime scene,

such as volatile memory, and mobile phone memory are still under active scrutiny.

The use of the term “digital crime scene” acknowledges that the mere presence

of data at a physical crime scene (by way of being stored in a digital device) does not

make it evidence. Numerous sources draw the analogy that best practice in digital forensic evidence collection performs the equivalent of seizing the entire physical crime scene, including, for example, doors, rooms and even buildings.

Investigation Documentation: In outlining his “Big Computer Forensic

Challenges”, Spafford observes that practitioners and researchers in the field of digital

forensics do not use standard terminology [98]. Where they do, we observe it in areas

where the terminology is taken directly from the canon of computer science and relates

directly to elements of software, hardware and data abstraction layers.

While the field has achieved a strong consensus on terminology as it relates to

discussing artefacts found within the digital crime scene, common terminology for

describing investigation-related artefacts remains elusive. Where investigation

documentation is discussed, it is referred to sometimes as “case documentation”, and

sometimes as “metadata” [30, 45]. Turner discusses elements of information related to

individuating features such as the name of the person capturing information, and

introduces a feature called a “tag continuity block” to help track chain of custody

related information [135].

Despite the lack of standards for what kind of documentation is required for

admission of evidence, a number of general classes of investigation documentation may

be identified:


• Continuity of Evidence: (also chain of evidence) involving tracking the

possession of the digital crime scene, and ultimately any digital evidence

identified.

• Provenance: identifying the genesis of the digital crime scene, for example

Case Number, Examiner, Evidence Number, Unique Description, Acquisition

Time

• Individuating characteristics: identifying the digital crime scene, or derived

digital data objects, or digital evidence

• Integrity: records which may enable identifying if the digital crime scene has

been modified

• Contemporaneous notes: notes made in the course of examination of the

digital crime scene which address reproducibility (e.g. EnCase bookmarks, Autopsy event sequencer, Notes, Audit)

• Error records: bad sectors, read failures, SMART errors

A subset of these classes of information, specifically continuity of evidence,

provenance, integrity, and individuating characteristics, comprises the central pieces of information used in assuring that evidence is authentic. For this reason, we refer

to this set of information as evidence assurance documentation.

Protocols for individuating a piece of evidence and maintaining a chain of

custody have been successfully adapted from practice with physical evidence to the

handling of digital evidence. The protocols, however, remain implemented largely

through manual processes. Where tools support the maintenance or generation of these

types of documentation, their scope is limited to within the tool’s confines, preventing

third-party tools from participating in the documentation maintenance task.

Digital Evidence: While the presentation of a document printed from a

computer along with verbal evidence asserting its provenance may have been

acceptable in the past, today the authenticity of digital evidence is increasingly subject

to scrutiny, as is the accuracy of any conclusions drawn. Under this scrutiny we expect

the findings of digital investigations presented as evidence to a court are dependent on

the accompanying investigative documentation that asserts its pedigree.

2.2.2 Perspectives on the digital investigation process

The drive for establishing a recognised set of standards for performing digital

forensics has resulted in reflection on what tasks are performed in a digital


investigation, and to what end. Beyond the goals of standards setting, descriptions of

forensic processes are also useful for training and directing research.

Early descriptions of digital forensics processes, such as Mandia’s Intrusion

Response oriented methodology, and Farmer and Venema’s early guidelines have been

criticised as being too specific, focusing on the specifics of technology rather than on

generalised process [109]. Since these early attempts, a number of other processes and

frameworks have been proposed, which are described in the following sections.

Linear process models

A number of authors have proposed models which describe a digital investigation as a process consisting of a number of phases, intended to be performed one after the other (in a sequential fashion). Casey describes the phases of a

digital investigation as:

• Preliminary Considerations: Authority to conduct investigation

• Planning: Preparation and Methodology

• Recognition: Identification of potential sources of digital evidence

• Preservation, collection and documentation: Crime scene documentation

establishing provenance and chain of custody

• Classification, comparison, and individualization: Examination and search

for digital evidence

• Reconstruction: Deleted or damaged digital evidence recovery, slack space

search. Reconstruct relational and functional aspects of the crime [27]

The US National Institute of Justice (NIJ), in their “Electronic Crime Scene

Investigation: A Guide for First Responders” describe a four phase process, consisting

of the following four phases:

• Collection: “search for, recognition of, collection of, and documentation of

electronic evidence.”

• Examination: “make evidence visible and explain its origin and significance…

search for information… data reduction”

• Analysis: “looks at the product of the examination for its significance and

probative value to the case. Examination is a technical review that is the

province of the forensic practitioner, while analysis is performed by the

investigative team.”


• Reporting: “outlines the examination process and the pertinent data

recovered” [92]

A further preparatory phase was also implied by posing the question of whether

the first responder’s unit had the requisite capability to perform the other phases.

The results of a review of the terminology used for describing the phases of

linear process models are presented in Figure 1. Phases which may be considered as

either equivalent in nature, or simply more specific are arranged in columns.

A few points may be observed. Firstly, despite similarities in the activities or

goals identified for particular phases, terminology remains varied, and the subtleties

implied are not clearly defined. The differences in terminology may be explained by

the granularity of the models. For example, the 2001 NIJ model prescribes a Collection

phase, while their 2004 model [93] describes Assessment and Acquisition phases

without reference to Collection. It would appear, however, that Assessment and

Acquisition would form sub-phases of a general Collection phase.

Regarding granularity, we observe that the least granular and most abstract of the models presented is the Computer Forensics Secure Analyse Present (CFSAP) model of Mohay et al. [89], and the most granular is Reith’s Abstract Digital Forensics Model, which describes nine phases in all [109].

Beebe’s Hierarchical, Objectives-Based Framework addresses granularity

issues by proposing a hierarchical structure by which sub-phases may be related to less

granular, higher level phases [13]. Additionally, she relates a class of concerns, which

she calls Principles, which overarch many or all phases and sub-phases of the investigative process. Two Principles she identifies are Evidence

Preservation and Documentation.


[Figure 1 is a table aligning the phases of four models in columns: Casey 2000 (Preliminary Considerations; Planning; Recognition; Preservation, Collection & Documentation; Classification, Comparison and Individualization; Reconstruction), DFRWS 2001 (Identification; Preservation; Collection; Examination; Analysis; Presentation), NIJ 2001 (Collection; Examination; Analysis; Reporting) and NIJ 2004 (Preparation; Assessment; Acquisition; Examination; Documenting & Reporting).]

Figure 1: Corresponding phases of linear process models of digital forensic investigation

Overarching principles

Beebe’s inclusion of Principles in her forensic process model was novel in

explicitly relating principles to process. It was, however, making explicit a relationship

which had been previously implicit. The early standards efforts of the IOCE presented

a set of principles for the standardised recovery of computer based evidence:

• Upon seizing digital evidence, actions taken should not change that evidence.

• When it is necessary for a person to access original digital evidence, that person must

be forensically competent.

• All activity relating to the seizure, access, storage, or transfer of digital evidence must

be fully documented, preserved, and available for review.

• An individual is responsible for all actions taken with respect to digital evidence while

the digital evidence is in their possession.

• Any agency that is responsible for seizing, accessing, storing, or transferring digital

evidence is responsible for compliance with these principles. [130]

Event based digital investigation framework

More recently an investigation process model was proposed based on the

procedures conventionally used in investigating regular (or physical) crime scenes [26].

In this model, the data content of a digital device is conceptualised as a Digital Crime

Scene, with one digital crime scene per digital device. The process model, called the

Event Based Digital Investigation Framework, contains the following high level

phases: Readiness, Deployment, Physical Crime Scene Investigation, Digital Crime


Scene investigation, and Presentation. The goals related to each phase are described

below:

• Readiness: Operations readiness (training, methodology) infrastructure

readiness (prepping infrastructure for evidence gathering, forensic readiness)

• Deployment: Detection & notification (crime detected or incident detected),

confirmation and authorisation phase (search warrants)

• Physical Crime Scene: search for physical evidence

• Digital crime scene: preservation of system state, search for evidence,

reconstruction of digital events

• Presentation: Results presented to intended audience

The phases described above are depicted in Figure 2, with arrows indicating

flow through the process phases. We note here that the identification of a digital device

at a physical crime scene may instigate a digital crime scene investigation, and

transitively, that digital evidence found on a digital crime scene may instigate an

investigation at a newly identified physical location.

Figure 2: Event based digital investigation framework

The sub-phases of the digital crime scene investigation phase are depicted in

Figure 3. The first two phases are similar to the early phases of the linear process models; however, the final phase, Event Reconstruction & Documentation, proposes a set of

sub-phases which attempt to prove and disprove hypotheses related to events that may

have caused digital evidence found in the crime scene.


Figure 3: Digital crime scene specific investigation phases

The inclusion of Documentation as a part of each sub-phase here points to the implicit inclusion of one of Beebe’s principles of evidence.

2.3 Digital forensics tools

The concerns of officers of the court and juries alike are highly abstracted from the minutiae of digital technology in general, and are highly reliant on the

testimony of expert witnesses in bridging the gap in understanding. Expert witnesses

and digital forensic investigators are in turn reliant on digital forensics tools in

interpreting digital evidence. While it is important that the science underlying the interpretation of digital evidence be understood from first principles by the technical expert,

on a day to day basis, tools are required to ensure efficiencies in the digital forensic

process. Without these efficiencies, mounting pressure on the courts to accept and

employ digital evidence could lead to a failure in the administration of justice, or limit

the employment of digital evidence to only the few who can afford it.

Early practitioners of computer forensics used generic operating systems tools

and “rolled their own”. While today a burgeoning commercial market segment exists,

producing tools for the digital forensics market, digital investigation still commonly

requires the use of multiple types of tools.

We describe three main classes of digital forensics tools below, based on their

relevance to the investigative process described in Section 2.2.2: acquisition tools, and examination and analysis tools. The third class, integrated tools, attempts to

address all phases of the investigation process from within one integrated environment.

2.3.1 Acquisition tools

Of all the areas of digital forensics, media acquisition is perhaps the most mature. This class of tools serves to make an exact copy of a digital crime

scene, which is commonly known as an image. The fundamental principle of

preserving the integrity of the crime scene, for example a hard drive, is routinely

satisfied by using hardware write blockers or operating systems modified to prevent

writing to the crime scene media. A cryptographic hash of the crime scene media, taken

at the time of acquisition, forms the foundation of the chain of custody and


maintenance of integrity by tying the physical evidence from which it is derived to the

digital crime scene. In some jurisdictions it is routine practice to print a copy of this

hash to use as a contemporaneous note establishing this link.

As raw data cannot exist without being contained or encoded in some way or

another, acquisition tools need a container in which to store the image. This container is

typically some other piece of raw media, such as a hard drive, or as the contents of a

file.
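The core of such an acquisition tool can be sketched as a loop that copies the source into the container while hashing the data stream (an illustrative Python sketch; the paths, chunk size and choice of SHA-256 are assumptions, and write blocking is presumed to be enforced in hardware or by the operating system, outside this code):

```python
import hashlib

def acquire(source_path: str, image_path: str, chunk_size: int = 1 << 20) -> str:
    """Copy a source device or file into an image container, hashing the
    data stream as it is read. Returns the hex digest to be recorded for
    the chain of custody."""
    digest = hashlib.sha256()
    with open(source_path, "rb") as src, open(image_path, "wb") as img:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)  # hash the stream as it passes through
            img.write(chunk)      # store it in the image container
    return digest.hexdigest()
```

The returned digest can be printed and retained as the contemporaneous note described above, and the image later re-hashed to verify its integrity.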

Despite the apparent simplicity of the process described above, numerous

problems emerge which must be considered. For example:

• Can we prove the write blocking technology actually ensures the integrity of

the digital crime scene?

• Is the digital crime scene copy an accurate copy of the original? In operation,

hard drives often have bad sectors which are unreadable. How does the presence of these affect the maintenance of integrity? How do we record their presence?

• Is the digital crime scene copy a complete copy of the original? New drives

contain special areas protected from regular access, which may also be relevant to the investigation at hand.

While the question of accuracy of write blocking technology is typically

addressed by the reputation of the particular tool used, or by independent validation, the

latter two questions relate to completeness. Acquisition of a digital crime scene involves

far more than a simple copy operation, and potentially requires recording of a larger

amount of information outside of the data stored on the regular part of the drive.

The proliferation of mobile devices such as PDAs and mobile phones, which employ markedly different storage technologies such as embedded flash memory and smart-card SIMs, is today presenting new challenges to building acquisition tools. Interest in acquiring and analysing the RAM of running computers is similarly at the forefront of the digital forensics agenda.

2.3.2 Examination & analysis tools

In the early days of forensics, the primary role of digital forensic examination

tools was to interpret raw data into information. As datasets containing potential

evidence have become larger, the information gleaned from interpretation tools has become increasingly unwieldy, leading to “needle in a haystack” problems. To

address this, integrated tools have emerged which provide various techniques for


searching, navigating, filtering and examining the information. This section describes

tools related to these concerns.

Four higher level strategies are typically employed for finding relevant

information within this raw data: structural interpretation, signature based searching,

file classification and event correlation.

Structural interpretation refers to exploiting the structured nature of most forms of digital data. For example, the average hard drive is structured into partitions,

then file systems, then files and so on. Digital data, and the software that acts upon it,

is organised into layers of abstraction to reduce complexity. For example, as it applies

to storage, general purpose operating systems have long provided the familiar file and

directory abstractions for storing and organising data. Details of hard drive sector

addressing, and file indexing, are hidden from the average user at lower layers of

abstraction. Similar abstraction layers exist in software architectures and in network

data communications. Much of the job of forensic analysis tools is to exploit the

structure of raw binary data so data objects of an appropriate abstraction layer, and of

evidentiary value, may be found.
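As a concrete instance of structural interpretation, the sketch below parses the classic DOS/MBR partition table found in the first sector of a disk (illustrative Python; real tools such as The Sleuth Kit handle many more layouts and abstraction layers):

```python
import struct

def parse_mbr_partitions(sector0: bytes):
    """Interpret the four 16-byte partition entries at offset 446 of a
    512-byte DOS/MBR boot sector."""
    if len(sector0) < 512 or sector0[510:512] != b"\x55\xaa":
        raise ValueError("not a valid MBR boot sector")
    partitions = []
    for i in range(4):
        entry = sector0[446 + i * 16 : 462 + i * 16]
        # byte 0: boot flag; byte 4: type; bytes 8-11: start LBA;
        # bytes 12-15: sector count (the CHS fields are skipped).
        boot, ptype, start_lba, sectors = struct.unpack_from("<B3xB3xII", entry)
        if ptype != 0x00:  # type 0x00 marks an unused slot
            partitions.append({"type": ptype, "start_lba": start_lba,
                               "sectors": sectors, "bootable": boot == 0x80})
    return partitions
```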

EnCase and FTK are storage media analysis tools which provide primarily

structural interpretation functions over a common set of evidence types. Similar

functionality is provided in the open-source tools The Coroner’s Toolkit9 (TCT) and The Sleuth Kit10 (TSK), the latter of which is based on the former.

Signature based interpretation refers to a class of interpretation techniques

best exemplified by the class of tools known as file carving utilities, which search raw

digital data for characteristics which are unique to particular species of files. File

carving utilities such as scalpel [111] or foremost11 are able to identify potential

instances of image files such as GIF and JPEG, and documents in Microsoft Word

format by identifying local structure, regardless of the underlying filesystem’s

presence. Local structure is used to identify data objects rather than searching global

structure. Such interpretation strategies are useful in instances where the global

structure has become corrupted.
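The essence of signature-based carving can be shown in a few lines (an illustrative Python sketch; real carvers such as scalpel and foremost handle fragmentation, maximum sizes and many file formats):

```python
# JPEG files begin with a start-of-image marker and end with an
# end-of-image marker; a carver scans raw bytes for these signatures,
# ignoring any filesystem structure.
JPEG_SOI = b"\xff\xd8\xff"
JPEG_EOI = b"\xff\xd9"

def carve_jpegs(raw: bytes):
    """Yield (offset, candidate_bytes) for each apparent JPEG in raw data."""
    start = raw.find(JPEG_SOI)
    while start != -1:
        end = raw.find(JPEG_EOI, start + len(JPEG_SOI))
        if end == -1:
            break
        yield start, raw[start : end + len(JPEG_EOI)]
        start = raw.find(JPEG_SOI, end + len(JPEG_EOI))
```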

Recent investigations in characterising large corpuses of hard drives have

employed signature methods for identifying credit card numbers, social security

numbers and email addresses [44].

9 http://www.porcupine.org/forensics/tct.html
10 http://www.sleuthkit.org/
11 http://foremost.sourceforge.net/


Antivirus and malware identification tools can also potentially characterise

unknown files that may contain malicious content. There are

presently, however, no widely available hash sets for the identification of malware.

File Classification is another widely used analysis technique. The

predominance of the file-based storage paradigm12 causes much of digital forensics practice today to involve the examination of files or artefacts related to files. An aspect of

the volume problem of digital forensics is that the number of files in investigations is

rapidly growing, as is the size of the files. Means of reducing this data volume have

consequently become important. One commonly employed technique is to filter data

based on categorisations of files. For example, authentic component files of the

Windows operating system might typically be considered as not contributing to the

goals of an investigation. A number of databases of cryptographic hashes of known

files exist for the purposes of data reduction, and of individuating and classifying unknown files by hash equivalence with known ones. The National Software Reference Library (NSRL) is perhaps the best known, containing hashes of over 31,000,000 files [112].

Unknown files may also be characterised by their content, by exploiting local

structure within the files specific to particular file formats, in much the same way that

file carving tools work. This class of tools is useful for identifying files irrespective of

the name of the file. A wily adversary may rename a file to indicate another class of

usage, for example renaming a picture with the name foo.gif to foo.dll. Naïve searching

for image files by looking for files ending in common image extension names would

miss this file.
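Content-based classification amounts to checking a file's leading "magic" bytes rather than its name, as in this illustrative sketch (a handful of signatures only; real classification libraries recognise hundreds of formats):

```python
# A small table of leading "magic" byte signatures (illustrative subset).
MAGIC = {
    b"GIF87a": "gif image",
    b"GIF89a": "gif image",
    b"\xff\xd8\xff": "jpeg image",
    b"%PDF-": "pdf document",
    b"MZ": "windows executable",
}

def classify(content: bytes) -> str:
    """Classify a file by its content, irrespective of its name."""
    for magic, kind in MAGIC.items():
        if content.startswith(magic):
            return kind
    return "unknown"

# A GIF renamed foo.dll is still identified as an image by its content.
print(classify(b"GIF89a" + b"\x00" * 16))  # gif image
```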

Event Correlation refers to an array of techniques applied to comprehending

the dynamic behaviour of systems, based on events and patterns of events in their

history. Digital evidence is often rich in event oriented evidence: from the modified-

accessed-created MAC times stored for each file to the event logs generated by most

long running services as a function of their operation.

Efforts such as ECF apply correlation techniques to infer higher-level situations from low-level event log data [3]. Garfinkel uses correlation techniques to

identify similar features across entire corpuses of drives, a technique which could prove

useful for identifying computers with similar usage patterns [44]. Finally, another

useful form of classification is similarity. Fuzzy hashing is a technique which identifies

files which are nearly identical [65].
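A minimal form of event correlation is merging per-file MAC timestamps into a single ordered timeline, as sketched below (illustrative Python with invented timestamps; timeline tools built on The Sleuth Kit work on this principle):

```python
def build_timeline(file_records):
    """file_records: iterable of (path, {"m": t, "a": t, "c": t}) with
    POSIX timestamps. Returns (time, kind, path) events in time order."""
    events = [(ts, kind, path)
              for path, mac in file_records
              for kind, ts in mac.items()]
    return sorted(events)

records = [
    ("letter.doc", {"m": 1100, "a": 1500, "c": 1000}),
    ("tool.exe",   {"m": 900,  "a": 1400, "c": 900}),
]
timeline = build_timeline(records)
print(timeline[0])   # (900, 'c', 'tool.exe')
print(timeline[-1])  # (1500, 'a', 'letter.doc')
```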

12 Some architectures have used different storage paradigms: the Palm Pilot and IBM’s OS/390 do not use files, but rather a record-oriented storage paradigm.


2.3.3 Integrated digital investigation environments

In the more mature area of media analysis, where most practical activity is

occurring in the field, a number of commercial products have emerged that combine

acquisition, analysis, and reporting functionality in one integrated tool. EnCase and

FTK are prime examples of this class of tool.

A number of task-specific features are found in integrated digital investigation environments: Navigation, Search and Presentation.

Navigation features enable the investigator to visualise and explore the

structure of the digital crime scene. In practical terms, this feature is implemented in EnCase as a tree-styled user interface element which represents the structure of the digital crime scene using abstractions at various layers of the media analysis stack.

Search features enable the identification of data objects which conform to various criteria, such as keyword or regular expression equivalence, date ranges, or data object classifications. These criteria are evaluated against the content of data objects (such as in free text search of documents) or the attributes of objects (as in finding all pictorial image files with a .jpg or .gif file extension). Filtering, which we have mentioned previously, can be seen as the opposite of search: it limits the perspective of search, presentation and navigation functionality to the data objects not matching the filter criteria.
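The interplay of search criteria and filtering can be sketched as follows; the toy records and criteria are hypothetical.

```python
from datetime import date

# A toy corpus of data objects: name, content and modified-date records.
objects = [
    {"name": "notes.txt", "content": "meeting with john", "modified": date(2006, 3, 1)},
    {"name": "logo.gif", "content": "", "modified": date(2006, 5, 9)},
    {"name": "report.doc", "content": "quarterly figures", "modified": date(2007, 1, 2)},
]

def search(objs, keyword=None, extensions=None, after=None):
    """Select objects matching content, attribute and date criteria."""
    hits = []
    for o in objs:
        if keyword and keyword not in o["content"]:
            continue
        if extensions and not any(o["name"].endswith(e) for e in extensions):
            continue
        if after and o["modified"] <= after:
            continue
        hits.append(o["name"])
    return hits

print(search(objects, keyword="john"))               # ['notes.txt']
print(search(objects, extensions=(".gif", ".jpg")))  # ['logo.gif']
# Filtering as the complement: objects NOT matching the criteria.
print([o["name"] for o in objects
       if o["name"] not in search(objects, after=date(2006, 12, 31))])
```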

Finally, presentation functionality is related to presenting data objects, their

attributes (such as file metadata), and content in meaningful ways. As digital data may

have multiple interpretations, multiple viewer types may be appropriate for interpreting

data object content. For example, it may be instructive to read the textual content of a

HTML page in some instances, and the rendered page in others.

This class of tools typically provides support for recording some investigation documentation, such as case IDs and investigator details.

2.3.4 Models of tools

Despite the pivotal nature of tools in digital forensics, little academic work has

focused on this subject.

Carrier has proposed a model of digital forensics examination and analysis tools which characterises the digital forensic tool as an interpreter of data from one layer of abstraction to data at another, higher layer of abstraction. In this model (presented in

Figure 4) a forensic tool implements a rule set which translates input data from one

layer of abstraction into output data at another layer of abstraction. In performing this

transformation, a tool may introduce an error.


Figure 4: Carrier's digital forensics tool abstraction layer model
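Carrier's model can be illustrated concretely: the function below implements a small rule set translating raw bytes at the physical media layer (an MBR sector) into partition entries at the media management layer. The synthetic sector is hypothetical; the 16-byte entry layout at offset 446 is the standard MBR format.

```python
import struct

# A rule set translating data at the physical-media layer (raw bytes)
# into data at the media-management layer (partition entries). The MBR
# keeps four 16-byte entries at offset 446; each holds a type byte, a
# starting LBA and a sector count (both little-endian).
def parse_mbr_partitions(sector0: bytes):
    parts = []
    for i in range(4):
        entry = sector0[446 + 16 * i: 446 + 16 * (i + 1)]
        status, p_type, lba, count = struct.unpack("<B3xB3xII", entry)
        if p_type != 0:  # type 0 marks an unused slot
            parts.append({"type": p_type, "start_lba": lba, "sectors": count})
    return parts

# Build a synthetic sector: one NTFS-style (0x07) partition at LBA 2048.
sector = bytearray(512)
sector[446:462] = struct.pack("<B3xB3xII", 0x80, 0x07, 2048, 409600)
sector[510:512] = b"\x55\xaa"  # boot signature
print(parse_mbr_partitions(bytes(sector)))
```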

Tools which purely decode structure, such as media analysis tools, may

inadvertently introduce errors caused by implementation error; however these types of

error are difficult to measure and in the commercial sphere are not disclosed. File

carving tools introduce a different kind of error apart from implementation error:

abstraction error. An example of this is that signatures may inadvertently match data

that is not a valid file, leading to false positives13.

In practice, a tool may internally implement multiple abstraction layer

transformations. The open source forensics tools generally address only a few

abstraction layer transformations related to a particular class of structure. For example,

The Sleuth Kit (TSK) takes as input data from the media management abstraction layer (which is concerned with volumes and partitions) and outputs data at

the file system layer of abstraction, such as files (both deleted and regular) and

directories. Commercial tools tend to be more monolithic in nature and integrate

abstraction layers from separate domains. For example, while EnCase includes

abstraction layer translators equivalent to those found in TSK, it additionally includes a

translation layer which translates Redundant Array of Inexpensive Disks (RAID)

images from the physical media layer to the media management layer.

2.3.5 Current approaches to tool integration

Without doubt the integrated environments provided by commercial tools

provide tangible benefits to forensic investigations involving primarily media analysis.

These tools provide classification based filtering, search, navigation and case related

documentation maintenance services among others. The coverage and functionality of

these tools, however, often falls short, either when interpreting data objects outside of

the purview of the tool, or when searching for evidence in novel ways.

13 A false positive is a test result that incorrectly indicates a positive finding. In this case a signature might match some data, indicating that a file of a particular type has been found, when in fact the file is not of that type.


Interoperating with commercial forensics tools is cumbersome. For example,

while third party libraries have been developed to access proprietary evidence

containers such as that used by EnCase, access to case-related data is hampered both by

the absence of an API for accessing the internal abstractions, and the proprietary nature

of the format of the case file. Investigators are left to manually export or convert data objects to files outside of the tool and to continue the investigation, maintaining case-related concerns by hand.

Of course, falling back to manual methods of case maintenance merely implies

performing the related activities using the protocols and methods practiced and

established before the advent of digital forensics specific tools. This does, however,

allow more opportunity for human error to creep in and results in more documentation

and case maintenance work.

The architecture underlying the implementation of a tool has a profound

impact on extensibility, robustness, scalability and integration with third party tools.

Commercial tools such as EnCase, ProDiscover, and FTK are all monolithic in nature.

The robustness problems associated with tightly coupled interpretation tool libraries are well demonstrated by FTK's propensity to crash on particular files [4]. ProDiscover, by means of an embedded Perl interpreter, and EnCase, by means of an embedded proprietary scripting language, provide some support for extensibility by defining an API to access internal abstractions; however, a brief investigation of the APIs reveals that, in the case of ProDiscover, the API is minimally documented, with an unclear and dynamic runtime data model. With regard to scalability, many functions performed by digital

runtime data model. With regard to scalability, many functions performed by digital

forensics tools, such as hash generation, data carving, thumbnailing, and keyword

searches are IO bound. Monolithic tools are not able to easily adapt to approaches such as distributed computing to address these issues because of the lack of granularity and the tight coupling of the modules implementing their architectures.
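Decomposed per data object, an IO-bound task such as bulk hashing parallelises readily; a minimal sketch, using in-memory byte strings in place of files drawn from an image.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

# Bulk hashing decomposes naturally per data object -- the granularity
# that monolithic tools lack. Byte strings stand in for files here.
evidence = {f"file{i}": bytes([i]) * 1024 for i in range(8)}

def sha1_of(item):
    name, data = item
    return name, hashlib.sha1(data).hexdigest()

# Distribute the per-object work across a small worker pool.
with ThreadPoolExecutor(max_workers=4) as pool:
    digests = dict(pool.map(sha1_of, evidence.items()))

print(len(digests), digests["file0"][:8])
```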

The open-source model of development and software licensing has been

proposed as a potential solution to the problem of reliability of tools supporting digital

investigations [23, 60]. Regardless of whether open access to source code results in

more reliable tools, for the present time open source tools are the primary area within

computer science where digital forensics tools research is demonstrated and proved.

The majority of open source DF tools today are interpretation tools, operating on binary

devices or files and command line arguments, and producing binary files or text as

output. In the context of usage of these tools, case documentation is maintained outside

of the tool’s purview.


Notable open source digital forensics investigation tools are the Autopsy

forensic browser [25], PyFLAG14 and the TULP2G small scale device forensics

framework [137]. The architecture employed by Autopsy is a component oriented one,

with the user interface components running in separate processes to the filesystem

interpretation layer tools. The latter are sourced from separate projects, including the

related Sleuth Kit project. Theoretically, this separation of functionality enhances

robustness by limiting the effects of software faults to the implementing module, rather

than affecting the whole application. Autopsy provides limited support for the

maintenance of case related documentation.

While not open source in nature, the XIRAF digital forensics prototype [7] uses

a similar architecture to TSK, utilising wrappers around existing open source

interpretation tools.

A number of groups have begun experimenting with clustered computing

architectures as foundations for forensics tools, towards the goal of addressing the IO

and CPU bound processing issues inherent in current monolithic architectures. The

prototype Distributed Environment for Large-scale inVestigations (DELV) investigated

the feasibility of speeding up processing by spreading an entire hard disk image across

the RAM of a cluster of commodity PCs, moving the processing to each node [114].

The Open Computer Forensics Architecture (OCFA) employs a distributed processing

model for recursively processing data objects found within a digital crime scene [62].

Similar to XIRAF and Autopsy, interpretation is realised by wrapping existing

interpretation tools.

2.4 Key challenges

At the 2006 DFRWS conference, the keynote speech, “Challenges in Digital

Forensics” was delivered by Ted Lindsey, a computer scientist at the FBI [70]. In his speech, a number of challenges were identified. These are presented in Table 1.

14 http://pyflag.sourceforge.net/


Table 1: Challenges in digital forensics - DFRWS 2006 keynote

Device diversity Volume of evidence

Video and rich media Whole drive encryption

Wireless Anti-forensics

Virtualisation Live response

Distributed evidence Usability & visualisation

These challenges as enumerated by Lindsey at DFRWS 2006 are a mix of: new

technologies (e.g. wireless, whole drive encryption), situational technology trends (e.g.

device diversity, volume of evidence, distributed evidence), and techniques (e.g. live response, usability & visualisation).

In 2005, the following list of challenges was presented by Mohay [87]:

• Education & certification

• Embedded systems

• Corporate governance and forensic readiness

• Monitoring the internet

• Tools

• Data volumes

In 2005 and 2004, Casey summarised the key challenges as:

• Counter forensics

• Networked evidence

• Keeping pace with technology

• Tool testing

• Adapting to shifts in law

• Developing standards and certification [28, 29]

A subset of these challenges can be generalised to the following list of

challenges, which have been selected as relevant to this dissertation.


2.4.1 Volume & Complexity

The “volume of evidence” and “distributed evidence” challenges cited by

Lindsey in the previous section are both exemplars of what is referred to as the volume

problem in digital forensics. This refers to the following trends:

• the quantity of data which may be relevant to an investigation is increasing

markedly;

• the number of individual units of potential evidence is increasing, driven by the increasing occurrence of evidence distributed over multiple devices; and

• the quantity of data in each unit under consideration is also increasing

markedly, with multiple terabyte acquisitions becoming common.

Further complicating this situation is the rate at which storage capacity is

growing compared to access times. With hard drive capacities doubling on a yearly basis15, and access times improving by only 10%, software will soon have to treat disks

more as sequential devices than random access devices [49]. Such a situation is a fine

example of what Casey calls “keeping pace with technology”.

15 And have been doing so since around 1989.

Lindsey’s challenges which are more related to specific technologies (e.g. virtualisation), as well as the “device diversity” challenge, are related to what is referred to as the complexity problem in digital forensics. This refers to the introduction of new sources of complexity, which must be addressed in order to investigate evidence. Addressing this often requires the development of new techniques. Additionally, the latent nature of digital evidence makes complexity inherent in its analysis.

Addressing the complexity problem requires acknowledging that the problem has two dimensions: the rate at which new concepts are generated and the change in meaning of existing concepts over time.

For example, 10 years ago the main subject of media analysis was the content of individual pieces of magnetic storage media: specifically partition tables and filesystems. The emergence of storage virtualisation added another layer of abstraction in between: the volume. Whereas before a filesystem was tied to a specific contiguous section of bytes on a particular piece of media, now a filesystem may be contained within a volume which spans multiple pieces of storage media, such as when a RAID array is used. This change has both added new concepts and redefined existing relationships in the lexicon of digital forensics.

To date, this kind of conceptual evolution has been handled by two means: a) modifying the conceptual model to include the new concepts and relationships, or b)


leaving this information outside of the scope of the tool. Considering the storage

volume example given above, the integrated forensics tool EnCase has had its internal

model changed to include abstractions for RAID volumes and corresponding

component media regions, and the tool's interaction model has been tweaked to represent these abstractions. The Sleuth Kit, the open source forensics tool addressing similar analysis tasks, however, does not address storage virtualisation; rather, it leaves management of this conceptual change in the hands of the human tool operator.
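The volume abstraction can be illustrated with a toy RAID-0 mapping from a logical volume offset to a position on a member disk. The stripe size and disk count below are illustrative, and real arrays add metadata, ordering rules and (for other RAID levels) parity.

```python
# The volume layer in miniature: map a logical RAID-0 volume offset to
# a (disk, offset-on-disk) pair.
def raid0_locate(offset, stripe_size=64 * 1024, disks=3):
    stripe = offset // stripe_size     # which stripe overall
    disk = stripe % disks              # stripes rotate across the disks
    stripe_on_disk = stripe // disks   # stripes preceding it on that disk
    return disk, stripe_on_disk * stripe_size + offset % stripe_size

print(raid0_locate(0))           # (0, 0)
print(raid0_locate(64 * 1024))   # (1, 0): second stripe, next disk
print(raid0_locate(200 * 1024))  # stripe 3 wraps back to disk 0
```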

The complexity and volume problems are well illustrated in the Gorshkov case,

which involved credit card theft from at least 11 online entities, and subsequent

fraudulent use of those cards through PayPal16 and eBay17 [8]. Successful prosecution

of this case involved evidence drawn from multiple computers, under the control of

multiple company entities, in multiple jurisdictions (some of which was acquired over

the internet from Russia). The evidence was drawn from multiple sources, such as

backups, hard disk images, emails, archive copies, and hard copies. Interpreting the

evidence required numerous applications and multiple operating systems.

As we have said, digital investigation covers a very broad range of conceptual

entities, and any schema or model attempting to fully describe the domain quickly

becomes insufficient as technology inexorably marches on. In this light, a means of

representing evidence and related information expressive enough to represent all of the

information we wish, while not committing us to a particular data model, is desirable.

Furthermore, such a model should be extensible enough that new information may be

added by arbitrary means, as new tools and techniques emerge, without breaking

existing tools, nor violating the integrity of the existing information. Conversely, a

means to declaratively attach semantics to data, without resorting to modifying the

tools which operate over the information hold promise for integration of arbitrary and

heterogeneous data.

Addressing the volume and complexity challenges requires new approaches to

building tools which acknowledge the rate of change of technology, and enable

continued tool functioning despite new sources of complexity. We hypothesise that by

focusing on the relationship between tools and formal representation, a key theme of

this research, new approaches might be identified which address these challenges.

2.4.2 Effective forensics tools and techniques

The rapid rate of evolution of technology is a significant cause of volume and

complexity problems. Existing tools and techniques are unable to keep pace with these

16 http://www.paypal.com/
17 http://www.ebay.com/


changes. It is well acknowledged that new approaches to building tools are necessary:

this situation is reflected by a recent upturn in focus on research into new tools and

techniques.

The first DFRWS in 2001 focused primarily on frameworks and principles for

digital forensics, rather than on forensics tools and techniques [88]. Two speakers

however highlighted the need for tools and techniques to be evolved:

• The social aspects of our analytical endeavors are in need of focus, too. We

need tools that zero in on truly useful information and quickly deduce whether

it is material to the investigation or not. We need to identify a social “end-

game”. Are we prepared to take serious action to thwart wrongdoing in all its

forms? (Spafford, DFRWS 2001).

• The constant appearance of new and improved technology (e.g., cellular

phones, personal digital assistants [PDAs], the Global Positioning System) has

moved the target of media analysis tools way out of range for quick response.

(Baker, DFRWS 2001).

By the time of the DFRWS 2006, the research priorities of the field had indeed

begun to shift towards addressing these challenges. Papers were presented covering

tool validation, memory analysis, tool integration (twice) and evidence correlation (four

times). All up, 8 of the 17 papers presented at DFRWS 2006 were related to

development of new techniques and tools.

Analysis of the prevalence of specific technologies cited as challenges by

Lindsey, Mohay and Turner reinforces the need for new and more effective forensics

tools and techniques. To quote Mohay:

These tools need to target the ever increasing volume and heterogeneity of

digital evidence and its sources, and they need to be inter-operable [88].

Tool interoperability (or tool integration) implicitly involves integration of

data, a goal related to the “Distributed evidence” challenge. This goal relates directly to

the representation theme of the research described in this dissertation. Additionally, the

wider challenge of “effective forensics tools and techniques” encompasses our theme of

analysis techniques.

2.4.3 Meeting the standard for scientific evidence

In the early 1990s, much of the focus of the field was on building effective

forensics tools, and having them accepted in court. Frameworks for characterizing the

field of forensics, such as forensics process models, and protocols for ensuring integrity


and chain of evidence were primary concerns. Today, there appears to be consensus on

appropriate methodologies and protocols for dealing with digital evidence, a conclusion

which can be implied by the widespread adoption of digital evidence in proceedings.

Despite the apparent need to trade off expediency or other factors for rigour in

some contexts of digital investigations, the need for rigour in the conclusions is a principal tenet. In particular, the traditional forensic sciences are based on the

application of reliable scientific methods – seeking to use techniques or tools only after

rigorous and thorough analysis. The field of digital forensics (at least in the United

States) is struggling to meet the court’s standards for scientific evidence [78].

At the first DFRWS, it was concluded that for digital forensic science to be

considered a discipline, it must have the following characteristics [98]:

• Theory: a body of statements and principles that attempts to explain how

things work

• Abstractions and models: considerations beyond the obvious, factual, or

observed

• Elements of practice: related technologies, tools, and methods

• Corpus of literature and professional practice

• Confidence and trust in results: usefulness, purpose

It is acknowledged that the field exhibits only some of these characteristics: for example, elements of practice are observable in the development of and trust in forensics tools; however, they are not tied to scientifically rigorous evaluation [98].

With the exception of the recent work of Carrier [24], which bridges between forensic

investigation process and computer science theory, little work has been contributed in

the area of theory.

The theme of building the field to become a discipline is echoed by Mohay’s

“education and certification” [87] and “standardised specification & testing of tools”

[88] challenges, and Casey’s “Adapting to shifts in law” and “Developing standards

and certification” challenges [28].

The final sub-challenge above, “Confidence and trust in results”, directly

relates to the final theme of our research, which is assurance of digital evidence, and

analysis results.


2.5 Conclusions

The utility of the computer as a tool of production, communication, and

commerce has resulted in widespread adoption over the latter half of the twentieth

century and the start of the new millennium. Digital technology is now pervasive.

Network effects and the rapid pace of change in digital technology have led to a

situation where the employment of digital evidence is complicated by the burden of

large quantities of highly complex data. The challenge for digital forensics is to increase reliability and rigour, while at the same time increasing the efficiency of

investigation. New techniques for interpreting and analysing evidence and new

approaches to building interoperable forensics tools are required.

Addressing these key challenges requires new approaches to building tools

which acknowledge the rate of change of technology, and enable continued tool

functioning despite new sources of complexity. We hypothesise that by focusing on the

relationship between tools and formal representation, new approaches might be

identified which address these challenges.


Chapter 3. Related work

“The search for truth is in one way hard, and in another easy – for it is evident that no one of us can master it fully, nor miss it wholly. Each one of us adds a little to our knowledge of nature, and from all the facts assembled arises a certain grandeur.”

(Aristotle)

The preceding chapters have provided context and background to this

dissertation as a whole. This chapter provides background material and related work

specifically relevant to the work described in Chapters 4 to 7, and consists of four

sections.

Section 3.1 describes the literature related to event correlation, both

specifically related to DF and to the wider computer security context, and Section 3.2 describes current approaches to maintaining investigation documentation and storage of

digital evidence. Both of these sections relate to the problems of complexity and

volume in digital forensics, and provide motivation for Chapter 4, which proposes

formal knowledge representation as a means of addressing these problems. Chapter 5

builds upon the event correlation background in section 3.1, and the approach proposed

in Chapter 4, proposing, implementing and evaluating a novel approach to representing

heterogeneous event oriented evidence, and a novel technique for automated

identification of forensically interesting situations. Chapter 6 builds on the background

material in section 3.2 and the approach proposed in Chapter 4, proposing a digital

evidence storage format which enables tool integration and inclusion of arbitrary

investigation related documentation.

Finally, Section 3.3 describes related work in computer timekeeping, which

forms background to Chapter 7, which focuses on assuring the correct interpretation of

digital timestamps, and Section 3.4 concludes the chapter.


3.1 Event correlation for forensics

This section surveys the literature in event correlation, particularly focusing on

approaches taken in representing events, event patterns, and scenarios, which is a

subject of Chapter 5.

Event correlation is a term which has emerged from a number of computer

security application domains, in particular in the areas of network management and

intrusion detection. It is used to describe an array of techniques applied to

comprehending the dynamic behaviour of systems, based on events and patterns of

events in their history. As in these domains, in the digital forensics domain we find the

need for event correlation.

Abbott et al. have, in their Event Correlation for Forensics (ECF) research, translated textual log events into instances of a generalised data model (a canonical form) implemented using a relational database [3], performing either interactive or automated scenario identification over these events.

Stallard and Levitt employed an anomaly based expert systems approach to

identifying semantic inconsistencies in investigation related data. Their approach

translated MAC times generated by TCT and the UNIX lastlog into an XML

representation, which was asserted into the JESS expert systems shell. Knowledge is

encoded as heuristic rules which specify invariant conditions related to logins and

potential file modifications.

Elsaesser and Tanner employ an AI-based approach to automated diagnosis of how an attacker might have compromised a system [39]. Using a model of the topology

of a network, the configuration of systems, and a set of “action templates”, a class of

artificial reasoner called a “planner” generates hypothetical attack sequences which

could have led to a particular situation. These hypothetical attack sequences are then

run in a simulated environment, and the generated logs compared with the logs of the

real world system. The action templates correspond to specifications of how a

particular action will transition the state of the world from one state to the next.

Approaches to event correlation in the IDS and network management domains

have focused on single domains of interest only, and have employed models of

correlation that are very specific in nature. Repurposing these specific existing

approaches to the more general task of event correlation in the CF domain is made

difficult for a number of reasons. Existing event pattern languages do not necessarily

generalise to application in wider domains. For example, while state machine based

event pattern languages may work well for events related to protocols, they do not work

well for patterns where time and duration are uncertain [37]. Most approaches focus


exclusively on events, and ignore context related information such as environmental

data and configuration information. Furthermore, few approaches have available

implementations in a form that is readily modifiable.

Where we have modifiable implementations of event correlation systems, we

find that extension is complicated by the software paradigm underlying its

implementation, and that the systems are weak on semantics. For example, extending

the STATL language [38] involves considerable burden. Adding new vocabulary to the

event language is slowed because of compilation and linkage overheads. Addition of

concepts outside of the event pattern language requires reengineering of the STAT

language compiler and supporting framework. Finally, no means of specifying the

semantics of the vocabulary of the language is available.

3.1.1 Approaches to modeling events

The representation used to model events has a significant impact on the

usability of correlation approaches, including conceptual expressiveness, extensibility,

ease of integration of new information and maintainability. We describe here a number

of existing representation approaches observed in the event correlation literature.

The MODEL language, a component of the DECS network management

system, used an object oriented (OO) style model of classes of events related together

in class/subclass relationships (which in this case was referred to as semantic

generalization) [143]. The event correlator translates from event patterns specified in

the MODEL language directly to C++ and, presumably, is encumbered by the

maintainability characteristics of C++ software development and deployment.

Expert systems based approaches such as the EMERALD IDS [69] combine a

similar knowledge model, which supports class/subclass models of events, with a rule

language. The model however is dynamically constructed at run time, eliminating the

C++ compile-link phase, resulting in simpler extensibility and more rapid evolution

compared to the DECS approach.
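Semantic generalisation via class/subclass relations can be sketched as follows; the event hierarchy is illustrative, not MODEL's or EMERALD's actual vocabulary.

```python
# Semantic generalisation as class/subclass relations: a rule written
# against a general event class also matches its specialisations.
class Event: ...
class LoginEvent(Event): ...
class FailedLogin(LoginEvent): ...
class FileEvent(Event): ...

def matches(event, pattern_class):
    """A pattern over a class matches any instance of a subclass."""
    return isinstance(event, pattern_class)

e = FailedLogin()
print(matches(e, LoginEvent))  # True: a failed login is a login event
print(matches(e, FileEvent))   # False
```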

A number of challenges were identified with the ECF [3] approach. The

approach does not incorporate notions such as semantic generalization in its modelling

approach, and identification of a methodology for mapping the detail-rich, domain-specific information contained in log files to the canonical representation appeared to

be elusive. The conceptual model of the canonical form implied that every event was

seen as a time-subject-object-action tuple (TSOA), a notion which proved to be an

impediment when attempting to represent arbitrary event log entries. This canonical

form was supplemented by the addition of shadow data: an arbitrary set of name-value pairs which could be associated with a canonical entry.


Employing a trivial example, the canonical form worked well for making

statements such as

“at 12:00 on the 1st January john hit the ball”

The data model of the TSOA canonical form alone, however, prevented

expressing even slightly more complex statements (which unfortunately are on the

lower end of conceptual complexity when considering event logs) such as the

following:

“at 12:00 on the 1st January john logged into the host www”

While at first glance the statement resembles the simple time-subject-action-

object example above, this statement includes an explicit classifier indicating the

class of the noun it modifies: that “www” is a (network) host. This extra information

was stored as a name-value pair (e.g. hostname-www) in the shadow table. No

mechanism exists, however, for interpreting the relationship of the name-value pair:

for example, does hostname-www relate to the subject or the object?
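The limitation can be made concrete with a small sketch (Python; the field names and the shadow-pair layout below are our assumptions for illustration, not ECF's actual schema):

```python
# A sketch of ECF's canonical time-subject-object-action (TSOA) form,
# supplemented by "shadow" name-value pairs. Field names are hypothetical.

def canonical_event(time, subject, action, obj, shadow=None):
    return {"time": time, "subject": subject, "action": action,
            "object": obj, "shadow": shadow or {}}

# "at 12:00 on the 1st January john hit the ball" fits the tuple directly.
simple = canonical_event("01-01 12:00", "john", "hit", "ball")

# "at 12:00 on the 1st January john logged into the host www": the fact
# that www is a *host* can only be stored as an opaque name-value pair.
login = canonical_event("01-01 12:00", "john", "logged into", "www",
                        shadow={"hostname": "www"})

# Nothing in the data model records whether the shadow pair qualifies the
# subject or the object; the relation holds only by human inspection.
assert login["shadow"]["hostname"] == login["object"]
```

The final assertion holds only because a human reader knows that hostname qualifies the object; neither the tuple nor the pair records that relationship.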

In practice, the canonical event was used in so many ways that its meaning was

unclear, requiring extensive human inference and interpretation in use, as did the

relationship between the name-value pairs and the entities in the canonical record.

Further, it is unclear how the information model could be extended beyond the limits of

the relational model adopted.

3.1.2 Event patterns and event pattern languages

In the CF domain, the only work performed on automatically identifying event

patterns is the ECF work. This work used a rule based approach, which is characterized

by statements of the form “IF condition THEN conclusion”, which they

referred to as “Logical Event Patterns” (LEPs). LEPs are specified in a custom XML

variant and are evaluated against the SQL database oriented repository of events. LEPs

do not support semantic generalisation of events, nor does the underlying canonical

data model.
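As an illustration only (LEPs themselves were written in a custom XML variant and evaluated against an SQL repository), a rule of the “IF condition THEN conclusion” form can be sketched in Python with hypothetical event fields:

```python
# Minimal sketch of rule-based ("IF condition THEN conclusion") event
# pattern matching. Event fields and rule contents are hypothetical.

def apply_rules(events, rules):
    """Return the conclusions of every rule whose condition holds."""
    conclusions = []
    for event in events:
        for condition, conclusion in rules:
            if condition(event):
                conclusions.append(conclusion(event))
    return conclusions

events = [
    {"action": "login_failed", "subject": "john", "count": 5},
    {"action": "login_ok", "subject": "mary", "count": 1},
]

# IF repeated failed logins THEN flag a possible brute force attempt.
rules = [(lambda e: e["action"] == "login_failed" and e["count"] >= 3,
          lambda e: ("possible_brute_force", e["subject"]))]

assert apply_rules(events, rules) == [("possible_brute_force", "john")]
```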

Chapter 4 similarly investigates automated event pattern identification,

focusing on formal models of representing evidence.

It is in the area of misuse intrusion detection systems (IDS) that there has been

most work investigating the matching of event patterns. These IDS use either signature

or rule based approaches or a mixture of the two. Signature based approaches typically


operate at a higher level of abstraction than rule based approaches by using declarative

languages that model different aspects of a situation. Both signature and rule based

techniques typically entail specifying event signatures or rules using some kind of

event language.

A number of signature based alert correlation languages aim to correlate events

based on abstract models of intrusion goals. The LAMBDA correlation language is a

signature based approach that matches signatures of event consequences with event

prerequisites, generating Prolog based correlation rules [35]. This language uses an ad

hoc combination of XML and Prolog syntax to model both Attacks and Alerts.

JIGSAW uses a similar technique for correlating pre and post conditions, focusing

more on language syntax [132]. They model pre and post conditions as “requires” and

“provides” relationships of events. Ning et al criticize JIGSAW as overly restrictive,

and weaken the requires/provides relation in Hyper-Alerts to allow correlation in

absence of certain prerequisites [95]. Similarly to LAMBDA, both JIGSAW and

Hyper-Alerts are translatable to rules.

CEP [101] employs a rule language called RAPIDE for event pattern

recognition and correlation. This language contains features for matching over parameters

such as causal ancestry and repetition, as well as simple property based comparison.

STATL uses finite state machine (FSM) models to specify signatures. Doyle et al

critique using FSMs: “Representing events as transitions through a single chain of

states precludes recognizing the achievement of a set of attack preconditions that have

no innate required time order.” (p. 21) [37] Techniques for translating FSMs into rules

are well established.
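A minimal sketch of an FSM signature of the kind STATL-style languages use (with invented event names) also illustrates Doyle et al's critique: the single chain of states matches only one ordering of otherwise equivalent preconditions:

```python
# Sketch of a finite state machine signature: a single chain of states,
# advanced by matching event types in order. Event names are hypothetical.

def fsm_match(signature, events):
    """Return True if the events contain the signature as a subsequence."""
    state = 0
    for event in events:
        if state < len(signature) and event == signature[state]:
            state += 1
    return state == len(signature)

signature = ["probe", "exploit", "install_backdoor"]

assert fsm_match(signature, ["probe", "noise", "exploit", "install_backdoor"])
# The same three preconditions in a different order do not match, even
# though no innate time order may actually be required for the attack:
assert not fsm_match(signature, ["exploit", "probe", "install_backdoor"])
```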

The line where rule and signature based approaches become expert systems is

blurred. Two differentiators would be the use of dynamic, object based knowledge

models, and a translation stage between signature specifications and underlying rule

representations.

A number of approaches have applied expert systems and logic based

reasoning to event correlation. The EMERALD IDS [69] uses an expert system

implemented using the P-BEST rule language to specify intrusion rules. Doyle et al

criticise P-BEST for lacking any concepts specific to event recognition [37]. The rule

language employed by Stallard and Levitt, called JESS, has a similar heritage to P-

BEST [125].

Of the correlation languages reviewed, only STATL was available with source

code. RAPIDE is available only as executables, and has not been updated since 1998.

JIGSAW has no implementation. Hyper-Alerts, while implemented, have no publicly

available implementation, nor has LAMBDA. P-BEST is only available embedded in the


EMERALD product [69], and is not easily modified, nor is it straightforward how to

access the P-BEST functionality. It has very few features differentiating it from the

class of languages based on the CLIPS expert system [143].

3.1.3 Observations

Event correlation in the forensics domain is complicated by the high conceptual

complexity of, and volume of, events which might be drawn upon in an investigation.

Existing approaches focused primarily on correlation techniques. We hypothesise that a

formal and general approach to representation of events would benefit the field by

addressing complexity issues.

3.2 Current approaches to evidence representation and format

This section surveys the literature as it relates to digital evidence container

formats and representations of investigations. This forms the background and related

work for Chapter 6, the focus of which is evidence and tool integration through formal

representation.

Despite the apparent maturity of the media acquisition area of digital forensics,

changes in the technology landscape present new challenges. Until recently, digital

crime scenes have typically been bitwise images of the entire content of data storage

media such as hard drives or floppy drives, or images of partitions contained within

these storage media. Today, the notion of creating an image of a particular device has

become complicated by the presence of multiple data streams within devices. For

example, today’s hard drives may contain Drive Configuration Overlays (DCO) or

Host Protected Areas (HPA). These are areas of the drive addressed separately from the regular

data area.

3.2.1 Digital evidence container formats

A number of containers for images are in common usage, such as simple binary

data files produced by the venerable UNIX copy tool, dd, and the EnCase Expert

Witness file format, produced by EnCase’s imaging tools. The latter, besides serving as

a container for images and Palm Pilot memory [31], contains checksums, a

hash for verifying the integrity of the contained image, error information describing bad

sectors on the source media, and metadata related to provenance. The Advanced

Forensics Format (AFF) is a disk image container which supports storing arbitrary

metadata as name, value pairs [45].

There is, however, little standardisation of storage containers or consideration

of how to record aspects such as those described above. The current state of the art has


given rise to a variety of ad hoc and proprietary formats for storing evidence content,

and related evidence metadata. Conversion between the evidence formats utilized and

produced by the current generation of forensic tools is complicated. The process is time

consuming and manual in nature, and there exists the potential that it may produce

incorrect evidence data, or lose metadata [30]. Validation of the results produced is

hindered by this lack of format standardisation.

It is with these concerns in mind that calls have been made for a universal

container for the capture of digital evidence. Recently, the term “digital evidence bags”

was proposed to refer to a container for digital crime scene artefacts, metadata, integrity

information, and access and usage audit records [135]. Subsequently, the Digital

Forensics Research Workshop (DFRWS) formed a working group with a goal

of defining a standardised Common Digital Evidence Storage Format (CDESF) for

storing digital evidence and associated metadata [30].

The Advanced Forensics Format (AFF), recently proposed as a disk image

storage format, stores acquisition related metadata in the same container as

the disk image. Garfinkel et al describe the AFF and summarise the key characteristics

of nine different forensic file formats. They also outline the desirable characteristics for

an image storage container [45]. They conclude that the AFF is the only publicly

disclosed forensic format which supports storage of arbitrary metadata. The metadata

storage mechanism in the AFF is, however, limited to name/value pairs and makes no

provision for attaching semantics to the name.

EnCase18 uses a monolithic case file for storing case related metadata and

stores filesystem images in separate and potentially segmented files. The format of the

case file is proprietary.

Turner’s Digital Evidence Bag (DEB) attempts to replicate the key features of

physical evidence bags, which are used for traditional evidence capture. The key

structural components of a physical evidence bag are the bag itself, a means of bag

identification (potentially a serial number), an area for recording evidence related

information (which Turner refers to as a tag) and, optionally, a tamper evident security

seal.

The key features of physical evidence bags are categorised as follows:

Evidence Metadata Records: Standard evidence metadata includes a

description of the evidence, the location, date and time of the acquisition of the

evidence.

18 http://www.guidancesoftware.com/


Provenance Records: Includes chain of custody information, as well as

information pertaining to the collector of the evidence.

Identification Records: Identification information includes a unique serial

number (or seal number) which uniquely identifies the bag, and other case related

information such as the case number, item number, collecting organisation, suspect and

victim.

Integrity Device: Pieces of evidence collected at an investigation scene are

placed in evidence bags and sealed on the spot, potentially with a tamper evident tape

closure seal. This seal, and the construction characteristics of the bag itself, help to

ensure the integrity of the evidence by indicating tampering.

Evidence Content: The physical object found at the crime scene which is

preserved inside the bag.

It is worth noting here that the use of the features listed above varies depending

on jurisdiction.

Turner’s proposal translates a number of aspects of the above features of the

physical evidence bag into the digital realm. A file archive structure is proposed which

defines a specific naming scheme for files containing digital evidence, separate files

containing evidence metadata, and a singular file which contains evidence integrity,

provenance and identification information. Figure 5 depicts the structure of Turner’s

digital evidence bag.

Figure 5: Turner's digital evidence bag (the figure depicts a bag comprising digital evidence files .bag01 to .bagNN, corresponding evidence metadata files .index01 to .indexNN, and a single tag file .tag)

A DEB is a collection of the Digital Evidence files, Index Files and a single

Tag File. Turner does not detail the implementation of the container grouping these


evidence files, however we expect that in practice, the container layer of the DEB

would be an archive such as tar, zip or similar.

Individual elements of digital evidence collected (such as filesystem images,

network traces, or the contents of image files) are stored in digital evidence files, which

are identified by a file extension .bagNN. The NN refers to a unique number.

Correspondingly, evidence metadata, such as file last access time is stored in similarly

named files with an extension .indexNN. The pairing of a single digital evidence file

with its corresponding evidence metadata file is referred to by Turner as an evidence

unit. Turner does not describe the naming of the files other than the extensions defined.

It is unclear whether multiple pieces of content are stored in a .bagNN file.

Integrity, provenance and identification information are stored as unstructured

text within the tag file, which is identified by the file extension .tag. The tag file also

enumerates the names of all of the Evidence Units.
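Assuming a flat file listing and the extensions Turner defines (base names are unspecified by Turner, so those below are invented), evidence units might be reconstructed by pairing files on their NN suffix:

```python
# Sketch: grouping the files of a Turner-style digital evidence bag into
# evidence units by pairing .bagNN with .indexNN. Naming beyond the
# extensions is not specified by Turner, so the base names are assumed.

import re

def evidence_units(filenames):
    bags, indexes = {}, {}
    for name in filenames:
        m = re.match(r"(.*)\.(bag|index)(\d+)$", name)
        if m:
            (bags if m.group(2) == "bag" else indexes)[m.group(3)] = name
    # An evidence unit is a (content, metadata) pair sharing the same NN.
    return {nn: (bags[nn], indexes.get(nn)) for nn in sorted(bags)}

units = evidence_units(["evidence.bag01", "evidence.index01",
                        "evidence.bag02", "evidence.index02", "case.tag"])
assert units["01"] == ("evidence.bag01", "evidence.index01")
assert len(units) == 2
```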

The architecture of Turner’s digital evidence bags is oriented towards a single

monolithic digital evidence bag being used in a case, as a container for all digital

evidence acquired. Secondary evidence (evidence derived from the analysis of earlier

acquired evidence, such as files extracted from a filesystem image) would appear in

this scheme to be added to the same digital evidence bag as the original image. This

involves modification to the tag file and the addition of new files to the evidence bag.

Integrity is assured by onion-like hashing of the contents of the tag file.
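One plausible reading of this onion-like integrity scheme (a sketch, not Turner's exact construction) chains each revision of the tag file to the hash of the previous revision, so that every earlier layer is fixed by the layers above it:

```python
# Sketch of onion-style integrity: each revision of the tag file embeds
# the hash of the previous revision, chaining the history together.
# This is a simplified reading of the scheme, not Turner's exact format.

import hashlib

def seal(tag_text, previous_hash=""):
    """Return (sealed_text, hash); the seal covers the prior hash too."""
    sealed = tag_text + "\nprevious-hash: " + previous_hash
    return sealed, hashlib.sha256(sealed.encode()).hexdigest()

tag1, h1 = seal("evidence-unit: 01")
tag2, h2 = seal("evidence-unit: 01\nevidence-unit: 02", previous_hash=h1)

# Tampering with the first revision would change h1, which would no
# longer match the previous-hash recorded inside the sealed revision 2.
assert ("previous-hash: " + h1) in tag2
assert h1 != h2
```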

A potentially confusing aspect of Turner’s DEB proposal is that modification

of the tag file, and the addition of new files to the DEB may lead the layman to the

conclusion that the monolithic bag is never sealed, thus raising doubts as to the

integrity of the evidence. While this may be seen more as an impedance mismatch in

translating the evidence bag metaphor, we suggest an alternate architecture for digital

evidence bags, which is presented in Chapter 5. The architecture we present favours

treating evidence bags as immutable objects. Addition of information is achieved

outside the bag, in much the same way that information is added to the tag of a physical

evidence bag without breaking the tamper evident tape.

Turner’s structure does not define a scheme for referencing evidence and

metadata between digital evidence bags. Therefore the ability to compose multiple

evidence bags into a corpus is not addressed.

The format and vocabulary of the investigation documentation maintained in

the DEB has no formally defined syntax, data model or semantics. The syntax appears

ad hoc and the vocabulary overly abbreviated. In this context, little attention has been

paid to the nature of the metadata that is being stored, with no consideration being


given to the relationship between the metadata and wider case related information, nor

information found within the digital crime scene.

3.2.2 Representation of digital investigation documentation

Treatment of the wider issues of investigation related documentation has been

covered in an abstract sense by Bogen and Dampier [17] who attempt to model the

knowledge discovered during the identification and analysis phase of the investigation

process using the Unified Modelling Language (UML). Further development of this

work described a Unified Modelling Methodology (UMM), the purpose of which was

largely as a framework with which to describe and think about planning, performing,

and documenting forensics tasks. This methodology described three unique

perspectives from which to view computer forensics as a system:

Investigative Process View: abstract models of the forensic process and

concrete models of specific tasks at hand. The “Sequence of activities performed by

investigators and examiners.” In this view, abstract models refer to the various digital

investigation process models described in Section 2.2.2. The concrete models refer

to plans of action such as those described in

Mandia’s Incident Response Methodology or, even more concretely, the lower edges of

Beebe’s Hierarchical Process Model.

Case Domain View: “A model of the information domain of the case; the

relevant information items that the investigators know, and the relevant information

items that the examiner seeks.” The view refers to models of the entities involved in the

case, both real world and virtual, and the relationships between them. For example, an

abstract case domain model might be constructed that documents concepts relevant to

the investigation, such as physical and tangible objects (e.g. computers and mobile

phones), transactions (e.g. payments or sales), and places.

Bogen and Dampier propose that such a model may be used as a tool for

planning an investigation, by using the concept diagram as a model for identifying

classes of concepts which might be related to the case at hand. Concrete models are

populated by individual entities, which are instances of the classes defined in the

abstract model. For example, a simple abstract model might contain the concepts

computer and user. Considering what instances of these concepts might exist in a

particular case might lead to numerous instances of both, each instance representing a

particular computer or user of a computer.

Evidence View: “Represents information about the incident and evidence of the

incident.” [18] The evidence view relates to the product of the investigation: evidence

that relates to the goals of the investigation. The evidence view would thus include


models of timelines, hypotheses regarding the incident, and supporting evidence. The

bookmark feature of Encase could be seen as an example of this class of model

concept.

Bogen’s work is significant in that it addresses conceptualising digital

investigations from a number of crosscutting perspectives. While the work proposes the

use of a graphical language for modelling these domains, it stops short of presenting

any actual models, or exploring means for tools to use these models.

A number of authors have proposed the definition of domain specific languages

for the purpose of representing and describing digital investigation related information.

Prior to proposing the Unified Modelling Methodology described above, Bogen and

Dampier proposed that a Computer Forensics Experience Modelling Language

(CFEML) would be of use in modelling of experiences, lessons learned, and knowledge

discovered in the course of an investigation [17]. While the purpose and need for a

CFEML was only proposed in an abstract sense, in the same year Stephenson proposed

the Digital Investigation Process Language (DIPL) [126]. This language, whose syntax

is based on the Common Intrusion Specification Language (CISL) of the IDS field, and

ultimately on LISP S-expressions [113], focuses on modelling the forensic

investigation process and on the entities involved in it with a heavy emphasis on

intrusion response. Using the perspectives defined by Bogen’s UMM, it focuses on the

investigative process view and case domain view, and aims to be suitable for

describing in a narrative the actions performed in an investigation, and the entities

being acted upon. It is, however, strongly influenced by a network incident response

viewpoint, and lacks a means for ascribing semantics to vocabulary in an extensible

and machine readable way.

Both XIRAF [7] and TULP2G [137] process digital crime scenes into single

document based XML tree representations, operating primarily in the case domain,

whereas DIPL appears to operate more in both the case domain and investigative

process domain. An important contribution of the XIRAF work is in conceptualising

the XML representation of information extracted from the digital crime scene (DCS) as

annotations of particular byte ranges within the DCS, which also implies defining a

composable addressing scheme. This approach maintains a direct linkage between

information and its source, addressing the provenance of the information. Both

approaches avoid describing the semantics of their XML based representations, and

address information integration by tree manipulation.
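The XIRAF annotation idea can be sketched as facts attached to byte ranges addressed by a composable path into the digital crime scene. The address syntax and fact fields below are invented for illustration and differ from XIRAF's actual scheme:

```python
# Sketch of byte-range annotation of a digital crime scene (DCS): each
# extracted fact points back at the bytes it was derived from, preserving
# provenance. The address syntax and field names below are invented.

def annotate(address, start, length, fact):
    return {"address": address, "range": (start, length), "fact": fact}

annotations = [
    annotate("image1/partition0/mft-entry-42", 0, 512,
             {"type": "file-record", "name": "report.doc"}),
    annotate("image1/partition0", 1_048_576, 4096,
             {"type": "deleted-sector-run"}),
]

# Provenance: every fact is traceable to a byte range in the source image.
assert all(a["range"][1] > 0 and a["address"].startswith("image1")
           for a in annotations)
```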

The alternate approach to persistent, machine readable representations of case

related information is the dynamic generation model. EnCase falls into this category of

approach. We suspect that similar to TSK, which uses C-structure based data models


internally, EnCase employs internal structural models in interpreting digital data. The

only way to interact with or otherwise view these models is, however, through the GUI

representation of this structure, or, to a limited degree, through the use of a custom

scripting language and limited object model.

3.2.3 Observations

The current efforts to create a standard digital evidence storage container do

not address storage and integration of arbitrary information related to the investigation.

In a wider context, efforts such as Bogen and Dampier’s modeling efforts attempt to

describe the process of digital investigation. We hypothesise that by adopting a formal

approach to representation we might bridge between these two related areas, enabling

digital forensics tools to both interpret evidence, and also maintain documentation

related to the surrounding investigation.

3.3 Reliable interpretation of time

This section provides background on computer timekeeping, the unreliability

of computer clocks, and methods of computer clock synchronisation. This forms the

background and related work for Chapter 7, the subject of which is assuring the reliable

interpretation of timestamps found in digital evidence.

One of the primary challenges in the use of digital evidence is assuring that

digital timestamps might be reliably interpreted as times in the real world. While

timestamps are ubiquitous in computer records, the clocks which generate them are

often unreliable, with errors ranging from seconds to potentially years,

fluctuating with changes in temperature and other factors.

3.3.1 An introduction to computer timekeeping

A battery powered real time clock (RTC) (also called BIOS or CMOS clock) is

used to keep time while a computer is switched off. While the RTC is used as the basis

for determining time when the computer boots, the interpretation of this time is

operating system specific. For example, the family of Windows operating systems

interpret the RTC as civil time19, whereas the Linux operating system may interpret the

RTC as either civil time or UTC by configuration.

Commonly, UNIX operating systems implement a software clock (called the

system clock) by setting a counter from the RTC at boot, and employ a hardware timer

(such as RTC timer interrupts, an advanced programmable interrupt controller (APIC)

19 Civil time refers to the government mandated time in a particular jurisdiction, incorporating regions specific offsets such as daylight savings time.


or other means) as an oscillator. Stevens suggests that all instances of the Windows OS

base their timescale on the RTC throughout operation [128]. There is, however,

evidence to suggest that, similar to UNIX implementations, Windows 2000 and above

similarly employ a software clock rather than use the RTC directly [79, 81, 115].

3.3.2 Reliable time synchronization

PC clocks are commonly known to be inaccurate, because of the inherent

instability of the crystal oscillators with which clocks are implemented. These vary

widely with temperature, voltage and noise fluctuations [83]. Since the late 1980s,

significant effort has been invested in the development of techniques for obtaining

reliable sources of time via directly connected atomic clocks, radio clocks, GPS and

the network time protocol (NTP). As far back as 1994, researchers were demonstrating

synchronisation of computer clocks to an accuracy of 10 ms across the Pacific Ocean

using NTP [82]. Today, system clocks on the UNIX platform are able to be

synchronised to reliable time sources on a nanosecond scale.

NTP is the most prevalent method of time synchronisation on UNIX hosts.

From Windows 2000 onwards, Microsoft has included in their operating systems a

restricted version of NTP, called Simple Network Time Protocol (SNTP). While it is

protocol compatible with NTP, SNTP does not implement the clock discipline

algorithms present in the former, and is not capable of delivering the same degree of

precision.

The degree to which reliable network time may be utilized by a particular PC

running Windows varies. By default, standalone Windows 2000 workstations have the

SNTP service switched off, while Windows XP workstations by default will

synchronize with an SNTP service hosted at time.windows.com once every week. Both

2000 and XP workstations in a domain network will by default synchronize via SNTP

from the Domain Controller (DC).

In theory then, stand alone Windows XP workstations will become

synchronized with civil time by use of SNTP once a week. Stand alone Windows 2000

PCs will likely drift away from civil time.
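The scale of the problem can be illustrated with a back-of-envelope drift calculation, assuming a typical crystal frequency tolerance of 50 parts per million (an assumed figure; real oscillators vary):

```python
# Back-of-envelope clock drift: a crystal oscillator with a frequency
# error of 50 parts per million (an assumed, typical tolerance) gains or
# loses about 4.3 seconds per day.

def drift_seconds(elapsed_seconds, error_ppm):
    return elapsed_seconds * error_ppm / 1_000_000

per_day = drift_seconds(86_400, 50)
per_week = drift_seconds(7 * 86_400, 50)

assert round(per_day, 1) == 4.3    # roughly 4.3 s of drift per day
assert round(per_week, 1) == 30.2  # roughly 30 s between weekly SNTP syncs
```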

3.3.3 Factors affecting timekeeping accuracy

A number of interrelated factors influence the accuracy of both timekeeping on

computers, and the interpretation of timestamps sourced from them. We summarise

these below:

System clock implementation: As discussed previously, the quartz crystals

used as oscillators in computers are notoriously unstable and known to be inaccurate


over time. In addition, implementing the correct local time offsets for civil time is

complicated by changes in region-specific time zones. Recent evidence for the

importance of this may be illustrated by the flurry of patches related to the Melbourne

Commonwealth Games daylight savings time extension in early 2006 [80].

Clock configuration: It is common to see Windows workstations with the time

zone set to the default installation time zone. Another clock configuration error is the

commonly occurring example of systems where the BIOS time has not been correctly

set.

Tampering: The practice of setting computer clocks back or forward for

reasons such as evading digital rights management or misdirection of investigation is

often referred to as tampering. Timestamps, like any data, are subject to the possibility

of deliberate modification.

Synchronisation Protocol: The Windows time synchronisation protocol is

based on SNTP and is only designed to keep computers synchronised to within 2

seconds in a particular site and 20 seconds within a distributed enterprise. Furthermore,

computers using NTP and SNTP without cryptographic authentication are subject to

protocol based attacks.

Misinterpretation: Timestamps are related to a particular frame of reference,

and their correct interpretation requires knowledge of that context. For example, to

interpret the time to which an Internet Explorer timestamp corresponds, in the civil

time where it was generated, one needs to know the time zone offset. Other sources of

uncertainty are the ambiguity as to what point in time the timestamp refers to for a

particular event (the start time or the end time?), and whether the

timestamp was generated at the time of the event or at the time of writing to the event log.

Bugs: Software errors in the implementation of software clocks or the

algorithms which convert the in memory clock to a timestamp have the potential for

adversely affecting timekeeping accuracy.
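The misinterpretation factor above can be illustrated concretely: the same naive timestamp denotes instants hours apart depending on the frame of reference assumed. The offsets below are illustrative examples:

```python
# Sketch of the misinterpretation factor: a naive timestamp found in a
# log denotes different instants depending on the assumed frame of
# reference. The offsets below are illustrative examples.

from datetime import datetime, timedelta, timezone

raw = datetime(2006, 1, 1, 12, 0, 0)   # as recorded; no zone information

as_utc = raw.replace(tzinfo=timezone.utc)
as_brisbane = raw.replace(tzinfo=timezone(timedelta(hours=10)))  # UTC+10

# Interpreted under the wrong frame of reference, the event is placed
# ten hours away from where it actually occurred.
assert (as_utc - as_brisbane) == timedelta(hours=10)
```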

3.3.4 Usage of timestamps in forensics

Aside from the comprehensive study of computer time behaviour on UNIX

systems performed in the context of developing the NTP infrastructure [82, 83], to the

best of our knowledge there has been little research in either characterising the

behaviour of the timescale of unsynchronised Windows computers, or on automated

means of identifying that behaviour.

In the computer forensics literature, timelining is often referenced as a

fundamental tool in determining ordering and likely cause and effect. Stevens has

proposed a model and algorithm for relating timestamps taken from multiple timelines


[128]. In this model, a base clock is set to UTC, and subordinate clocks are defined by

skews from parent clocks with additional skews further generated from time drift rates.
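The model just described can be sketched as mapping a subordinate clock reading back to its parent's timescale via a skew and a drift rate (a simplified reading of [128]; the function and parameter names are ours):

```python
# Sketch of a Stevens-style clock model: a base clock at UTC, with each
# subordinate clock defined by a skew from its parent plus a drift rate.
# A simplified reading of [128]; function and parameter names are ours.

def to_parent_time(child_time, skew, drift_rate, elapsed):
    """Map a subordinate clock reading onto its parent's timescale.

    skew: offset (s) of the child from the parent at a reference instant;
    drift_rate: additional skew accumulated per second of elapsed time."""
    return child_time - skew - drift_rate * elapsed

# A workstation clock 120 s ahead of UTC, drifting a further +0.5 s/day:
elapsed = 10 * 86_400                       # ten days since the reference
utc = to_parent_time(child_time=1_000_000.0, skew=120.0,
                     drift_rate=0.5 / 86_400, elapsed=elapsed)
assert abs(utc - (1_000_000.0 - 125.0)) < 1e-6  # skew + accumulated drift
```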

Gladyshev and Patel propose using corroborating sources of time to find the

time bounds of events with an unknown time of occurrence by examining ordering

relationships with events with known times. They define both a formalism and an

algorithm for determining these temporal bounds [47].
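In its simplest form, the bounding idea reduces to taking the latest known predecessor and the earliest known successor of the event in question (a drastic simplification of the formalism in [47]):

```python
# Sketch of bounding an event of unknown time by ordering relationships
# with events of known time; a simplification of the formalism in [47].

def time_bounds(before_times, after_times):
    """The unknown event occurs after all of before_times and before all
    of after_times, so it lies in [max(before), min(after)]."""
    return max(before_times), min(after_times)

# Corroborating sources: a file created at t=100 and a log entry at t=130
# are known to precede the event; an email timestamped t=200 follows it.
lower, upper = time_bounds(before_times=[100, 130], after_times=[200])
assert (lower, upper) == (130, 200)
```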

Weil argues for dynamic analysis of the temporal behaviour of suspect

systems, proposing correlation of timestamps embedded within locally cached web

pages with the modified and accessed times (MAC times) of the cached files [141].

3.3.5 Observations

Computer clocks are inherently unreliable, which casts doubt on the usage of

timestamps in forensic investigations. Methods of post-hoc characterisation of the

behaviour of a particular computer’s clock are of interest in assuring the correct

interpretation of timestamps.

3.4 Conclusion

This chapter has summarised the literature related to event correlation in

forensics, which is a focus of Chapter 5; current approaches to evidence representation

and storage, and representation of digital investigation related documentation, which

are related to Chapter 6; and reliable interpretation of time, which is related to Chapter 7.

Analysis of the literature as it relates to event correlation and digital evidence

has led to the hypothesis that digital forensics tools would benefit from a formal

approach to representation. The next chapter describes and contextualises the field of

knowledge representation (KR), where we look for inspiration and formalisms with

which to address the representational challenges in digital forensics.


Chapter 4. Digital evidence representation: addressing the complexity and volume

problems of digital forensics

“If scientific reasoning were limited to the logical processes of arithmetic, we should not get very far in our understanding of the physical world. One might as well attempt to grasp the game of poker entirely by the use of the mathematics of probability.”

(Vannevar Bush)

Analysis of the field of digital forensics has indicated that examining the nature

of the information which it operates on may help address the complexity and volume

problems described in Section 2.4.1. This chapter looks to the field of knowledge

representation (KR) for inspiration, and proposes that a KR based approach to digital

evidence representation will yield benefits in solving these problems. In particular,

semantic markup languages, which are described in Section 4.3, are employed towards

solving these problems in Chapters 5 and 6.

The chapter is structured as follows. Section 4.1 introduces the representational

challenges involved in digital evidence, describing why the current natural language

based approach to documenting investigations hinders tool interoperability and

potentially introduces errors. Section 4.2 provides background on the field of

knowledge representation, Section 4.2.1 describes its historical foundations, Section

4.2.2 describes key definitions, and Section 4.2.3 describes hybrid approaches to KR.

Section 4.3 describes the synthesis of markup languages and KR which has led to the

current generation of semantic markup languages, the Resource Description

Framework (RDF) and the Web Ontology Language (OWL). Section 4.3.1 introduces

RDF; Section 4.3.2 describes the XML serialisation of RDF, which is intended for

publishing and machine interpretation; and Section 4.3.3 introduces ontology

languages, and OWL. Section 4.4 reviews the literature of digital forensics and

computer security for knowledge representation related themes, and finally, Section 4.5


puts forward the proposition that the field of forensics would benefit from a formal

approach to representing evidence and related investigative information.

4.1 Introduction

The simplest of digital forensics investigations will involve numerous

documentary artefacts as evidence. Examples of these are printouts of data objects

identified as evidence, evidence manifests, and investigation reports. In the course of

investigation, other documents may be kept or produced, including chain of custody

documentation, file notes recording analysis activities and results, and provenance

documentation.

The current state of affairs is that much of the information related to digital

forensics investigations is recorded in documents such as these in natural language. The

vocabulary employed in these documents is drawn from multiple domains: law

enforcement, legal, computing, and general spoken English. This situation is similar

within the digital crime scene (defined previously in Section 2.2.2), where voluminous

amounts of information are stored in free text form and semi-structured text form.

Despite much research into Natural Language Processing (NLP), such textual

information is still unsuitable for machines to reason with. For example, consider the

following two trivial sentences:

“The box is in the pen. The pen is in the box.”

While the two statements are both syntactically and grammatically valid English, together they describe what at first glance is a contradictory state of affairs. Only by treating the word “pen” as having two meanings in this context, first as a fenced area and then as a writing instrument, can one resolve the apparent spatial contradiction.

Machine interpretation of natural language is complicated not only by the free

form nature of English grammar and syntax, but also by the context dependence of

interpreting semantics of language terms. This dependence on context, and the

additional real world knowledge and reasoning which are required to resolve

ambiguities, are some of the lower level problems in natural language understanding. Machine

understanding of natural language remains, today, one of the grand challenges in

computing.

The preoccupation of computer forensics has, until recently, been with the immediate goals of interpreting binary data. While this is of fundamental importance, it cannot be forgotten that the function of computer forensics is not only to glean


knowledge from digital evidence, but to communicate and analyse such knowledge in

a rigorous and verifiable manner.

This communication problem is best demonstrated with the following example.

Consider the simple set of evidence depicted in Figure 6. Such a set of evidence may be

presented in a case where inappropriate content of some description is found on a

computing resource. The figure depicts two pieces of physical evidence, which are the

containers of the two sets of digital evidence, the digital crime scene, and a set of

extracted files. The digital crime scene is a bitwise image taken of a hard drive, and the

extracted files are files found within the filesystem of the digital crime scene. The

analysis report, imaging records, and chain of custody, are all regular textual

documents, and the evidence printouts/visual aids are visual printouts of the extracted

files. The blue lines connecting pieces of evidence in the figure indicate where

references must be made from one piece of evidence to another. Red lines indicate a

“part of” relationship: the extracted files are contained on the CD, and are also

contained within the digital crime scene. Finally, the digital crime scene is contained

within the hard drive.

Figure 6: Trivial set of physical, digital and document evidence

Evaluating this evidence involves verifying that the digital crime scene exactly

matches the crime scene referred to in the imaging records, and verifying that the files

found on the CD are found in the digital crime scene, in the locations described.

Performing these verifications requires human interaction, which is necessary

because of the use of natural language in the Analysis Report and Imaging Records.

The references to the particular hard drive, the names and paths of the files on the CD,

and the hash of the digital crime scene all need to be located by the analyst and

interpreted to refer to the correct artefacts, tools selected, and then employed to perform

the verification.
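The hash-matching part of this verification is the kind of check that could be automated once the references are machine readable. The following sketch assumes SHA-1 digests were recorded; the function names, digest algorithm, and file layout are assumptions for illustration, not the thesis's method, and locating files within the image's filesystem would require further tooling:

```python
import hashlib

def sha1_of(path: str) -> str:
    """Hash a file in chunks so large evidence images fit in memory."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(image_path: str, recorded_hash: str, extracted: dict) -> bool:
    """Check the image against the hash in the imaging records, then check
    each extracted file on the CD against its recorded digest.
    `extracted` maps a file path to the digest noted in the analysis report."""
    if sha1_of(image_path) != recorded_hash:
        return False
    return all(sha1_of(path) == digest for path, digest in extracted.items())
```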

Such a corpus of evidence falls squarely within the purview of media forensics,

the most established area of the field. While such verification actions might seem trivial

to perform at first glance, in practice, it is complicated by numerous factors. For

example, if the files were found by file carving, how does one document the location of


the files? How does one validate that those locations form a valid file? What if the

digital crime scene was striped across multiple disks, as in a RAID array? How does

one document the RAID array configuration? What if the investigation is dealing with

thousands of files, and hundreds of digital crime scenes? The basic task of verifying

simple claims becomes, under these circumstances, a laborious manual exercise, due to

communication problems related to natural language.

This natural language problem may additionally be seen within the digital

crime scene. Event logs have in the past been employed in computing for a number of

purposes, including auditing system activity, recording performance information, and

recording system state for post mortem debugging, among other things. As such, they

record information about the computing environment, referring to entities such as hosts,

users, software agents, and activities with an ad-hoc vocabulary, irregular syntax, and

varying naming schemes. While such event log records are more structured than natural

language, machine interpretation of them is arguably as difficult as that of natural language, owing to the considerable amount of domain knowledge required to infer their semantics.

Beyond the problems preventing practical machine interpretation of natural

language, further problems confound the use of natural language as a common

language for documenting all aspects of a digital investigation. Producing suitably

complete and precise documentation over the course of the investigation requires

repetitive and methodical attention to detail. As such, it carries with it the threat of

unintentional introduction of errors and the omission of important details.

4.2 Background on knowledge representation

Having identified that natural language, and to a lesser extent visual and audible media, are presently the primary mediums for describing aspects of investigations, we ask the question: are there other languages (or representations) that are more amenable to automated processing, and perhaps even machine reasoning, which may still carry meaning to humans? We look to the field of knowledge representation for potential guidance.

4.2.1 Historical foundations

The concept of knowledge representation has been a persistent one at the centre

of the field of artificial intelligence (AI) since its founding conference in the mid-1950s.

In the early years it was, however, not explicitly recognised as an important issue in its

own right [75]. Early approaches in this period to representing knowledge in “thinking

machines” and automated problem solving are best characterised as ad hoc, with formal


semantics remaining absent. Consider, for example, the language LISP, which was the mainstay of the AI field at the time. LISP’s basic tree-like list data structure, with the addition of cross links, forms a malleable basis for organising data into hierarchical and graph based structures. It, however, lacks any foundations of intelligent reasoning; rather, its foundations are computational. Any intelligent reasoning that may be embodied in such programs must be implicit in the procedural code of the application.

Knowledge representation emerged as a field in its own right in the mid-1960s, with the following two decades seeing a number of approaches to knowledge representation develop; frames, production systems, and logic based approaches were the predominant varieties.

The logic based approach takes the view that machine reasoning may be

realised by implementing programs which use the language of mathematical logic.

These approaches share the common approach of representing a domain of interest as a

set of propositions which embody specific information. Knowledge is encoded by

axioms which define logical implications which may be made about the information.

The earliest attempts used first order predicate logic (FOL) as their basis [50], which was seen as appealing due both to its general expressive power and its well defined semantics [41]. The use of FOL has been persistent since. FOL is, however, computationally intractable, which led to experiments with smaller subsets offering better tractability. This led to the PROLOG language, first introduced in 1972.

PROLOG supports declarative specification of information as symbol–value pairs, and axiom definition using a restricted form of FOL. This implementation of logic based inferencing, based on declarative specification of logical rules, has been used to implement numerous expert systems.

Logical approaches are criticised for being unable to deal with exceptions to rules, or to exploit approximate or heuristic models of knowledge. The expression of meta-knowledge (description of what the knowledge can be used for) is also a limitation [85]. Nor do they allow for incomplete or contradictory knowledge, or subjective or time dependent knowledge [10].

A number of approaches commonly based on unstructured graph based

representations emerged in the early 1960s and came to be known as semantic net

based representation schemes. The common points to these schemes were a graph

structure representing concepts and instances (or objects) and a set of inference

procedures which operate over these nodes [75]. Three types of edges are defined: property edges, which assign properties (such as age) to the source concepts; IS-A edges, which define class/subclass relationships between concepts; and instance edges, which relate objects to classes. Semantic nets are criticised for being absent


of formal semantics, leaving the meaning of the network to the intuition of the users

and programmers who use these network based representations [10].
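These three edge types can be illustrated with a toy semantic net; the entity, class, and property names below are invented:

```python
# A toy semantic net with the three edge types described above.
is_a = {"laptop": "computer", "computer": "device"}      # IS-A edges
instance_of = {"exhibit_12": "laptop"}                   # instance edges
properties = {"exhibit_12": {"serial": "XK-4471"}}       # property edges

def classes_of(obj):
    """Infer every class an object belongs to by walking the IS-A chain,
    a typical semantic-net inference procedure."""
    cls = instance_of.get(obj)
    result = []
    while cls is not None:
        result.append(cls)
        cls = is_a.get(cls)
    return result

print(classes_of("exhibit_12"))  # ['laptop', 'computer', 'device']
```

Note that nothing in the data structure itself fixes what IS-A "means"; the meaning lives in the inference procedure, which is exactly the absence of formal semantics the criticism targets.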

At direct odds with the viewpoint of the logic based approach, the Frame based

approach [84] attempts to imitate how the human mind works, drawing its inspiration

from psychology and linguistics:

Whenever one encounters a new situation (or makes a substantial change in one’s viewpoint) he selects from memory a structure called a frame, a remembered framework to be adapted to fit reality by changing details as necessary.

A frame is a data structure for representing a stereotyped situation, like being in a room, or going to a child’s birthday party. Attached to each frame are several kinds of information. Some of this information is about how to use the frame. Some is about what one can expect to happen next. Some is about what to do if these expectations are not confirmed. [84]

Under the frames-based approach, knowledge is represented in structured

networks, frames are related in class/subclass taxonomies and the relationships between

classes are attached to each frame at a placeholder called a slot. Inference is viewed as

a process of matching a particular situation at hand with a stereotyped situation which

the viewer has experienced in the past. The stereotyped situation gives guidance as to

inferring consequences. The object-oriented programming paradigm, embodied by

languages such as Smalltalk and Java, has taken the structure of the frame based approach

and applied it to data structures.

Production systems (or rule systems) share a similar psychological philosophy

to the frame based approach: that human problem solving is an empirical phenomenon

which may be viewed in terms of goals, plans and other complex mental structures

[36]. A set of production rules, which represent “if–then” or “pattern–action”

inference rules are applied to a set of knowledge (or knowledge base) by repeatedly

matching the pattern part of a rule, and where a match is found, certain actions are

taken on the set of knowledge. This style of system formed the foundations of research

into expert systems and knowledge based systems, which attempt to capture the guesses

of the sort that an expert human would make.
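The match-and-fire cycle described above can be sketched as a minimal forward-chaining loop over ground facts; real production systems additionally support variables in their patterns, and the facts and rules here are invented:

```python
# Working memory: a set of ground facts as (subject, relation, object) tuples.
facts = {("exhibit_12", "is_a", "laptop"),
         ("exhibit_12", "seized_from", "office")}

rules = [
    # (pattern, consequent): if every pattern fact holds, assert the consequent.
    ({("exhibit_12", "is_a", "laptop")},
     ("exhibit_12", "is_a", "computer")),
    ({("exhibit_12", "is_a", "computer")},
     ("exhibit_12", "may_contain", "digital_evidence")),
]

changed = True
while changed:                      # repeat until no rule adds a new fact
    changed = False
    for pattern, consequent in rules:
        if pattern <= facts and consequent not in facts:
            facts.add(consequent)   # "action" part: extend working memory
            changed = True

print(("exhibit_12", "may_contain", "digital_evidence") in facts)  # True
```

The second rule only fires once the first has run, illustrating how chains of plausible inference emerge from repeated matching.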

4.2.2 Defining knowledge representation

Despite the apparently fundamental nature of knowledge representation in AI, it remained for many years without direct definition. Davis et al describe a knowledge representation by the five distinct roles it plays:

• a surrogate, a substitute for the thing itself;

• a set of ontological commitments;

• a fragmentary theory of intelligent reasoning;


• a medium for pragmatically efficient computation; and

• a medium of human expression [36].

We consider these perspectives in more detail below, using digital forensics as

an example domain for representing knowledge:

KR as a Surrogate: A knowledge representation of an investigation serves as a

surrogate for things related to that investigation: things that exist in the physical and

virtual world, as well as actions, beliefs, suppositions and conclusions. Each surrogate

corresponds to its referent in the real and virtual worlds of the crime scene and the

surrounding investigation. The correspondence between the representational surrogate

and actual referent is the semantics of the representation.

Using these definitions, it becomes readily apparent that in media acquisition,

an “image” is a surrogate for the content of a particular piece of digital storage media

found at a crime scene.

KR as a set of Ontological Commitments: When one builds representations of

physical things, it is difficult if not impossible to build representations with the same

fidelity as the original. A representation contains simplifications and assumptions

which depend on the purpose of the representation or the perspective of the developer. For

example a particular representation may distinguish people based on their names and

mobile phone numbers. A more involved representation related to medical diagnosis

might represent component body parts such as organs and systems of the body.

These two different perspectives on representing a human are examples of

different conceptualisations of the things which we choose to value within the context

of our representation. Both conceptualisations model the entities, categories of entities,

and relationships in a simplified view of the world. They differ, however. One is

concerned purely with identity, the other with anatomy.

To be able to discuss a conceptualisation in detail, we need a means of fully

describing the conceptualisation as a whole and its constituent parts. The term

“ontology” refers to such a description using a formal vocabulary. The most commonly

given definition for the term is “an explicit specification of a conceptualisation” [51].

The act of defining a particular representation carries with it a set of implicit

agreements upon how to view and talk about the world. We refer to these implicit

agreements as ontological commitments. According to Davis et al, “these commitments

are in effect a strong pair of glasses that determine what we can see, bringing some part

of the world into sharp focus, at the expense of blurring other parts” [36]. Referring

again to our two example conceptualisations of a human, we see that one representation


focuses on viewing humans as simple possessors of identity and endpoints of voice

communications, whereas the other focuses on anatomical structure.

The impact of committing to a particular ontology has effects not only at the

domain level, but also at the primitive language level. Production systems (rules) view

the world as facts (symbol-value pairs) and knowledge (axiomatic rules of plausible

inference between them), whereas frame based systems are based on conceptualising

the world by correspondence with prototypical objects.

One ontological commitment pervasive in the digital forensics literature is the

view of data as either evidence or metadata. This commitment constrains the rich

variety of information which relates to an evidence unit to being merely a property of the

evidence unit.

KR as a Fragmentary Theory of Intelligent Reasoning: Knowledge

representation has its roots in the field of AI, the goal of which is intelligent machine

reasoning. Just what constitutes intelligent reasoning varies from one formalism to

another, and depends ultimately upon the intellectual origins of the formalism. Logic

based formalisms such as FOL rely on mathematics and propositional logic as the basis

of intelligent reasoning, whereas rule and frame based systems have at their roots

behaviourist views of intelligent reasoning completely devoid of logic.

As these theories propose a model of inference, it is important to then consider

what inferences are allowed by the model. For the FOL approaches this is simply any

set of logical inferences provided by traditional formal logic. A frame-based

representation, however, “encourages jumping to possibly incorrect conclusions based

on good matches, expectations, or defaults” [36].

KR as a Medium for Pragmatically Efficient Computation: A knowledge representation, and the theory of intelligent reasoning on which it is based, is only useful in so far as it is actually able to be usefully employed in computation. Early research into

connectionist models of inference (best exemplified by neural networks) was quickly outclassed in the 1960s by rule and frame based approaches due in part to the

computational resource needs of the approach. The relative abundance of computing

resources in recent years has now made connectionist models pragmatically efficient,

allowing a resurgence in research in connectionist models and an appreciation of which

tasks they are more suitable for [85].

KR as a Medium of Human Expression: Whether it is for the purpose of communicating

meaning to a computer or a human, the final role of a knowledge representation is of a

medium of expression and communication by humans. In the context of digital

forensics, where tracking the provenance of any interpretation or analysis result is a


necessity, the extent of a representation’s interpretability by human readers has direct

implications on managing complexity.

As a medium of human expression, it bears considering what the expressive

limitations of the representation are: what can and cannot be said, and

to what precision.

4.2.3 Hybrid approaches

Despite early criticism between the communities surrounding the frame and logic based approaches to KR, it became apparent that effective machine reasoning would benefit from hybrid approaches involving the application of multiple theories of intelligent reasoning within the same representation. Within the schools of KR, research was directed towards addressing observed deficiencies. The limited expressiveness of FOL based approaches has been addressed by proposing hybrid logics, for example Nonmonotonic Logics and Modal Logics [10], and by limiting the context in which FOL applies.

An awareness of the syntactic problems and undecidable nature of using FOL

as a representation language led to the development of so called Description Logics

(DL). This branch of logic began by attempting to address these problems by adopting

a semantic nets inspired model (which has been shown to be directly translatable to

FOL) and restricting the expressive power of the language to a decidable subset of

FOL. Description Logics model the world as atomic concepts (unary predicates) and

roles (binary predicates), using a small number of epistemologically adequate

constructors to build complex concepts and roles [10].

Focusing on the wider goal of building practical knowledge based systems, the

CYC project (named after the stressed syllable of the word encyclopaedia), embarked

upon in 1984, attempts to implement “the commonsense knowledge of a human being”

[67]. The knowledge representation language employed by CYC, called CYC-L, addressed the vocabulary and syntax issues of representing

instance based knowledge, and the semantic linkages needed for defining ontologies.

CYC-L is based on FOL, and intelligent reasoning is implemented in the large by first

order logic theorem provers, and in the small, by domain specific micro-reasoners. In

this system, the axioms (rules) representing general theories about the world are

assumed true by default, and where exception based knowledge applies it is limited in

application by context.

The DARPA Knowledge Sharing Effort, initiated circa 1990 [91], researched

means of enabling knowledge sharing between computer systems. Its central theme was

that knowledge sharing required communication between systems, and that this, in turn,


required a common language. The research centred on defining such a common

language, and the surrounding ecosystem of tools and methodologies which were

required to interoperate with it. The common language proposed was a machine

readable syntax of FOL, called the Knowledge Interchange Format (KIF) [46], which has

since evolved into the current ISO draft Common Logic Standard [77].

4.3 Semantic markup languages

While the knowledge representation field has pursued questions of how best to

represent and reason with knowledge, the world has experienced a revolutionary

change in the way information is shared and communicated. First, by the adoption of

text based email, then by structured text formats, and finally the World Wide Web

(WWW). Today, we have a ubiquitous and globally interconnected repository of

interconnected information based on the simple linking and embedding of documents.

These two streams of research and development have not gone entirely without

interplay, and are in some areas converging towards the goal of building a globally

interconnected repository of knowledge, a so called “Semantic Web”. This section

describes in brief the history of digital information sharing and publishing which has

led to the approaches which are today being employed for storing and sharing data, and

finally how the lessons learned in building the World Wide Web are influencing current

knowledge representation research.

General purpose markup languages have their roots in the 1960s, in work performed by both the Graphic Communications Association (GCA) and IBM [34].

IBM developed a language called the Generalized Markup Language (GML), which

was used internally as a source document from which numerous different types of

documents could be generated. GML used tags (<> and </>) very much like we see

today in HTML and XML. Recognition of the need for standards in markup and

Document Type Definition (DTD) led to the establishment of the American National

Standards Institute (ANSI) committee on Computer Languages for the Processing of

Text. The Standard Generalized Markup Language (SGML) followed, becoming an

ISO standard in 1986 [58]. In 1990, Tim Berners-Lee took tags from a sample SGML

DTD used by CERN and added the concept of hypertext links to form the basis for the

markup language which was to become the Hypertext Markup Language (HTML), and

one of the foundations of the emerging WWW.

As the WWW became ubiquitous, a general awareness formed as to HTML’s

structural limitations; in particular, HTML did not enable declaration of the semantics of newly added tags, nor was its syntax easy to interpret programmatically. This

situation led to the HTML language beginning to diverge along vendor lines, with


interoperability of documents beginning to become a problem. The eXtensible Markup

Language (XML) [138] began as a language intended to address these limitations, by

defining a simple to use and extensible subset of SGML. XML defined a syntax with a

data model based on tree-like structures similar to LISP’s S-expressions, the balanced parameter list. XML exhibits two properties which are useful towards achieving the

goal of extensible interchange of information:

• Well-formedness: a syntactic constraint which enables interchange of

information despite a party only being capable of understanding

portions of a document. “Well-formedness is a fundamental tool for

allowing documents to include extended information while remaining

processable by older "down-level" applications.” [14]

• Vocabulary mix-in: As it is practically infeasible to predefine a

vocabulary that spans all application domains, XML takes the approach

that all tags are potentially scoped by arbitrary and separate

namespaces, via the XML Namespace facility. This enables ad hoc use

of vocabularies from arbitrary application domains.
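As a sketch of vocabulary mix-in, the following uses Python's standard ElementTree to parse a document mixing two vocabularies; the forensics namespace URI and element names are invented, while the Dublin Core URI is the well-known one:

```python
import xml.etree.ElementTree as ET

# Two vocabularies mixed in one document via XML Namespaces.
doc = """<evidence xmlns="http://example.org/forensics"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:creator>A. Examiner</dc:creator>
  <image hash="ab12cd"/>
</evidence>"""

root = ET.fromstring(doc)

# ElementTree expands each tag to {namespace-URI}localname, so an application
# that only understands one vocabulary can still pick out its own elements
# while ignoring the rest of the document.
creator = root.find("{http://purl.org/dc/elements/1.1/}creator")
image = root.find("{http://example.org/forensics}image")
print(creator.text)            # A. Examiner
print(image.attrib["hash"])    # ab12cd
```

Because each tag carries its namespace, the two vocabularies cannot collide even if they happen to reuse the same local names.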

As the focus on the use of the WWW began to shift from information

dissemination to information exchange, numerous parties began to find a need for

publishing machine usable descriptions of collections of distributed information. For

example, Microsoft proposed the Channel Definition Format (CDF) for describing push

based web content, the Platform for Internet Content Selection (PICS) for rating web

content [66], and Netscape proposed the Metadata Content Framework (MCF) for

generally describing metadata content [52]. XML alone proved insufficient for

addressing these needs, especially in the areas of schematic expressiveness and

evolution, and integration of data from heterogeneous sources. The Resource

Description Framework (RDF) [63] arose out of these efforts.

In the very late 1990s, Berners-Lee, now a leading figure within the standards

body which produced RDF, the World Wide Web Consortium (W3C), began to

enunciate a vision to create a universal medium for sharing information and exchanging

data, which he referred to as the semantic web. A semantic web activity was initiated

within the W3C to pursue this goal, drawing on many lessons learned in building the

WWW and applying them to the task of knowledge representation.

Berners-Lee opines that the centralised, “all knowledge about my thing is

contained here” approach taken by most existing knowledge representation systems is

stifling and unmanageable, and proposes that these shortcomings might be addressed

by adopting a decentralised approach in much the same way as was employed with


hypertext [16]. A key architectural principle of the web, which enabled it to scale where earlier hypertext systems had failed, was the notion that not all information had to be published in the same place. The definition of the Uniform Resource Locator (URL), a hypertext link spanning arbitrary information servers, was the key to enabling distributed and interconnected documents. The URL forms the foundation of the RDF approach.

The W3C has standardised a number of technologies towards the goal of the

semantic web. These technologies are layered in a stack-like manner, similar to that

observed in networking. The current semantic web stack comprises three logical layers:

a data layer, an ontology layer, and a query layer (presented below in Figure 7).

[Figure 7: Current Semantic Web standards. The stack comprises a query layer (SPARQL), an ontology layer (OWL Full, OWL DL, OWL Lite, and RDFS), and a data layer (RDF), resting on foundations of URI, XML, and XML Namespaces.]

Further logical layers are under development or envisaged, including

standardised rule based inference, trust, and explanation enabling services.

The following sections describe in detail the lower layers of the semantic web

stack (RDF and OWL) which are employed in Chapters 5 and 6.

4.3.1 A basic introduction to the RDF data model

RDF is a framework defining a data model based on the directed, labelled graph (DLG), and can be seen to be influenced by both the semantic net and frame KR approaches described earlier. This section presents a basic introduction to the RDF data model, as it is used as a representational format in Chapters 5 and 6. For a more comprehensive introduction see [110].

In the DLG model of RDF, graph nodes are either resources (things or entities, i.e. people, places, events, and the like) or values (such as numeric values, times, or other resources). The directed nature of the graph corresponds to a constraint that the subject, the node from which a graph edge originates, may only be a resource, while target nodes (nodes at which graph edges terminate) may be either a resource or a value. Graph edges correspond to properties (or attributes) of the subject. An example of a simple graphical depiction of an RDF graph is presented in Figure 8^20.

Figure 8: Basic RDF node-arc-node triple

In Figure 8, we see a simple graph which attaches a property (represented as an arc) called "starredIn" to the resource (represented as a rectangle) called Kevin Bacon. The value (represented as an ellipse) of the "starredIn" relationship is "Footloose". Through the knowledge that "starred in" is terminology relating to theatrical works, the hypothesis that "Kevin Bacon" is a proper name, and a little inference besides, one might infer that the meaning of this graph is "Kevin Bacon starred in 'Footloose'".

A reader who experienced childhood in the 1980s might further add the

knowledge that “Kevin Bacon” corresponds to a person, and that “Footloose” is a

movie. A graph presenting this knowledge in the RDF formalism is presented in Figure

9.

Figure 9: RDF statement "A person named Kevin Bacon starred in a movie named

'Footloose'"

In inferring that there is a class of things called a Movie in the world, we also suppose there is a particular one with the name "Footloose". Accordingly, we create a corresponding surrogate for the movie in the graph. We also create a surrogate for the concept of Movie, and relate it to the particular movie named Footloose through the introduction of a Class/Instance relationship. The relationship labelled "rdf:type" denotes this relationship using the RDF-defined vocabulary for a Class/Instance relationship^21. Similarly, we do the same for the person with the name Kevin Bacon and the class Person.

20 We note that this is not a legal RDF graph, as the identifiers of the nodes and arcs are not legal URIs, but we present it for simplicity of discussion.

A fundamental premise of the RDF data model is that everything is named with a Uniform Resource Identifier (URI) [15]. URIs are a generalised addressing scheme, of which the Uniform Resource Locator (URL), used to link together web documents, is a subset. When modelling data with RDF, concepts, instances, properties, and even data types are all named using URIs^22.

The use of URIs enables reuse of common concepts and instances. Returning to

our example, we wish to create an unambiguous identifier for the node whose “name”

property connects to “Kevin Bacon”. In this case, for convenience we turn to a

canonical source of information related to movies, The Internet Movie Database

(IMDB)^23, and use the URL for this actor's details page:

http://www.imdb.com/name/nm0000102/. Similarly we do the same for the movie

“Footloose”. The modified RDF graph is presented in Figure 10.

Figure 10: Unambiguous meaning is given to concepts and instances through naming with URIs

By reusing an identifier which is universally scoped, we provide an

unambiguous meaning for the instance that we are modelling. In this case, the

semantics are determined by the IMDB organisation, and can be determined by

fetching and perusing the (human readable) content of the URL in a web browser. It is

just as easy for one to create their own URI to represent a concept, individual or

property. This means any entity, be it an individual, a professional group, or a business

may create their own identifiers. In this case, we have minted our own URLs for

identifying the concepts Person and Movie^24. By this means, vocabularies may

21 This is an abbreviation, so that "rdf:type" is to be interpreted as "the URI for the type predicate drawn from the 'rdf' vocabulary". In practice this means taking the "rdf" namespace URI defined at the top of the document and concatenating it with the predicate, to form an actual URI. The URI for this predicate is thus the URL http://www.w3.org/1999/02/22-rdf-syntax-ns#type.
22 This is not strictly true. Nodes may also lack a name, in which case they are known as blank nodes.
23 http://www.imdb.com/
24 We note here that this namespace does not necessarily need to resolve to a web page. Rather, it is a scope in which to define terminological names used in the particular conceptualisation that we are defining.


separately evolve within an area of expertise, yet be used outside of their original

purview.

RDF supports integration of information published in separate documents by

merging together uniquely named nodes and arcs. Returning again to our example,

suppose we publish the graph representing “A person named Kevin Bacon starred in

the Movie Footloose” in one document, and in a similar document a graph representing

the statement “A person named Sarah Jessica Parker starred in the movie named

‘Footloose’”. How can we combine the information from the two RDF graphs?

In combining our “Kevin Bacon” graph with our “Sarah Jessica Parker” graph,

an RDF implementation will preserve uniqueness of nodes based on their identifiers,

and merge nodes with the same identifiers, leading to the graph in Figure 11:

[Figure 11: RDF Graph representing statement "A Person named Kevin Bacon and a Person named Sarah Jessica Parker starred in the Movie 'Footloose'." Both actors' starredIn arcs terminate at the single movie node http://www.imdb.com/title/tt0087277/ (named "Footloose", of rdf:type Movie), while http://www.imdb.com/name/nm0000102/ (named "Kevin Bacon") and http://www.imdb.com/name/nm0000572/ (named "Sarah Jessica Parker") are each of rdf:type Person.]

A naïve merge of the two graphs would lead to a duplication of nodes

representing the concepts Person and Movie and the instance Footloose. A correct

implementation of the RDF semantics would however merge the duplicate nodes into

one and yield a graph conforming to the knowledge we are trying to represent: “A

Person named Kevin Bacon and a Person named Sarah Jessica Parker starred in the

Movie Footloose."^25

4.3.2 RDF serialisation

The graph model defined by the RDF, while useful both for description in

terms of model theory, and for visualisation, is not suited for publishing as it is an

abstract graph. For this reason, a serialization of RDF to XML was defined at the time

of the definition of RDF. The serialisation (and subsequent ones) are based on

converting the graphical representation into a set of 3-tuples (or triples), where each

25 Note that for brevity this graph is still not a valid RDF graph, as we have not yet given URI identifiers to the properties used in the graph.

Page 94: Digital Evidence: Representation and Assuranceeprints.qut.edu.au/16507/1/Bradley_Schatz_Thesis.pdf · Digital Evidence: Representation and Assurance . by . Bradley Schatz . Bachelor

68 CHAPTER 4 – Digital evidence representation

triple is a unique (resource, property, value) tuple corresponding to two nodes joined by

an edge. The triples for the fragment of the graph beginning with the movie would be:

(http://www.imdb.com/title/tt0087277/, http://www.isi.qut.edu.au/Movie/name, "Footloose")
(http://www.imdb.com/title/tt0087277/, rdf:type, http://www.isi.qut.edu.au/Movie/Movie)

These two triples would map to the RDF/XML serialization presented below in

Table 2.

Table 2: RDF/XML Serialisation of two triples

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ns0="http://www.isi.qut.edu.au/Movie/">
  <rdf:Description rdf:about="http://www.imdb.com/title/tt0087277/"
                   ns0:name="Footloose">
    <rdf:type rdf:resource="http://www.isi.qut.edu.au/Movie/Movie"/>
  </rdf:Description>
</rdf:RDF>

All vocabulary used in RDF is scoped by a particular namespace, and the

vocabulary used by the RDF syntax is no exception to this rule. For this reason the

RDF in Table 2 contains a namespace declaration for the RDF vocabulary used. By

defining an abbreviation for our ad hoc namespace, and setting it as the base of all

unscoped names, we obtain the text presented in Table 3.

Table 3: RDF/XML serialisation using XML Namespace abbreviation

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:isimv="http://www.isi.qut.edu.au/Movie/"
         xml:base="http://www.isi.qut.edu.au/Movie/">
  <rdf:Description rdf:about="http://www.imdb.com/title/tt0087277/"
                   isimv:name="Footloose">
    <rdf:type rdf:resource="Movie"/>
  </rdf:Description>
</rdf:RDF>

An alternate and equivalent syntax for the above, which is tailored to declaring

instances based on their Class, is presented in Table 4:

Table 4: Alternative but semantically equivalent RDF syntax tailored to type definition

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:isimv="http://www.isi.qut.edu.au/Movie/"
         xml:base="http://www.isi.qut.edu.au/Movie/">
  <isimv:Movie rdf:about="http://www.imdb.com/title/tt0087277/">
    <isimv:name>Footloose</isimv:name>
  </isimv:Movie>
</rdf:RDF>

Finally, the entire merged document representing the statement “A Person

named Kevin Bacon and a Person named Sarah Jessica Parker starred in the Movie

Footloose.” is presented in Table 5:


Table 5: RDF/XML serialisation of statement “A Person named Kevin Bacon and a Person named Sarah Jessica Parker starred in the Movie 'Footloose'."

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:isimv="http://www.isi.qut.edu.au/Movie/"
         xml:base="http://www.isi.qut.edu.au/Movie/">
  <isimv:Person rdf:about="http://www.imdb.com/name/nm0000102/">
    <isimv:starredIn>
      <isimv:Movie rdf:about="http://www.imdb.com/title/tt0087277/">
        <isimv:name>Footloose</isimv:name>
      </isimv:Movie>
    </isimv:starredIn>
    <isimv:name>Kevin Bacon</isimv:name>
  </isimv:Person>
  <isimv:Person rdf:about="http://www.imdb.com/name/nm0000572/">
    <isimv:name>Sarah Jessica Parker</isimv:name>
    <isimv:starredIn rdf:resource="http://www.imdb.com/title/tt0087277/"/>
  </isimv:Person>
</rdf:RDF>

The RDF/XML serialisation of RDF has received criticism, with good reason, for being unwieldy and difficult to read. This has led to numerous efforts to produce more human-usable serialisations. One of the earliest of these is the N3 serialisation. Table 6 presents the same RDF graph as above, serialised to N3 triples. In

this serialization, multiple property-value pairs may be associated with a single

definition of a resource. A number of shorthand terms are defined. In this example, we

see the “a” shorthand for “rdf:type” used.

Table 6: N3 serialisation of statement from Table 5

@prefix isimv: <http://www.isi.qut.edu.au/Movie/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

<http://www.imdb.com/title/tt0087277/>
    isimv:name "Footloose" ;
    a isimv:Movie .

<http://www.imdb.com/name/nm0000102/>
    isimv:name "Kevin Bacon" ;
    isimv:starredIn <http://www.imdb.com/title/tt0087277/> ;
    a isimv:Person .

<http://www.imdb.com/name/nm0000572/>
    isimv:name "Sarah Jessica Parker" ;
    isimv:starredIn <http://www.imdb.com/title/tt0087277/> ;
    a isimv:Person .

4.3.3 Adding semantics to published RDF data

While RDF provides a machine readable and extensible language for

representing arbitrary information on the web, it fails to provide a means for declaring

the semantics of the vocabulary used. The XML foundations of the RDF provide a

simple means of declaring schema by use of the DTD, this really only provides for

describing constraints on document structure, and the use of which is actually

incompatible with the RDFs open world view of data [139]. RDF provides no means

for describing RDF properties, nor does it enable describing the relationships between

properties and other resources [140].


The vocabulary description language RDF Schema (RDFS) [140] begins to

address these descriptive goals, by providing meta-level classes and properties for

describing classes, properties and other resources. Additionally, RDFS is a semantic

extension of the RDF language. The RDF Schema language is similar to frame-based representations and object-oriented languages in how it describes classes and properties; however, it differs in that it does not describe the world in terms of properties belonging to a particular class. Rather, properties are first-class entities in

their own right, with descriptions in terms of what classes are appropriate as the domain

and range of the property. By decoupling the notion of property from class, the RDF

vocabulary description language permits vocabulary descriptions to be further extended

by later descriptions, facilitating an extensible schema declaration.

The RDF Schema language is a lightweight vocabulary description language,

intended to address these fundamental descriptions, and is sufficient for describing

class/subclass, and property/sub-property structural relationships between vocabulary

terms, and the domain and range of properties. It was, however, defined as a minimal

solution to these goals, and falls short of providing the necessary fundamentals for

functioning as an ontology language.

The Web Ontology Language, which is conventionally abbreviated as OWL, is

a standard language defined by the W3C. It is based on earlier research efforts into

ontology languages such as DAML+OIL [54], and may be seen to trace its roots to

frame based representations and semantic networks.

Beyond what is provided by the RDF Schema language, OWL enables:

• naming of and linking together of ontologies in a web like manner;

• a simple form of describing instances of classes described by the

language and the interrelationships between them;

• the description of restrictions on properties, based either on cardinality

or data/object type;

• a means of inferring that instances with various properties are members

of a particular class; and

• the relating of concepts within separate ontologies (or conceptualisations) by means of declaring which concepts mean the same thing.

To demonstrate the difference in expressiveness of RDFS and OWL, we illustrate the kinds of statements which can be made in each, using a number of examples drawn from [56].

Using RDFS we can:

• Declare classes like Country, Person, Student, and Canadian;


• State that Student is a subclass of Person;

• State that Canada and England are both instances of the class

Country;

• Declare Nationality as a property relating the classes Person (its

domain) and Country (its range);

• State that age is a property, with Person as its domain and integer as

its range; and

• State that Peter is an instance of the class Canadian, and that his age

has value 48.

With OWL we can additionally:

• State that Country and Person are disjoint classes;

• State that Canada and England are distinct individuals;

• Declare HasCitizen as the inverse property of Nationality;

• State that the class Stateless is defined precisely as those members of

the class Person that have no values for the property Nationality;

• State that the class MultipleNationals is defined precisely as those

members of the class Person that have at least 2 values for the

property Nationality;

• State that the class Canadian is defined precisely as those members of

the class Person that have Canada as a value of the property

nationality… [56].

The development of OWL has been heavily influenced by research into the

decidability of Description Logics, resulting in three profiles of the language being

defined: Lite, Description Logic (DL), and Full. Each flavour of the OWL language

places different restrictions on the logical constructors available for category

description, based upon the impact the feature has on the decidability of the resulting

logic. OWL/Lite is the least computationally intensive of the three, and OWL/DL

restricts the resulting description logic to features which ensure decidability. OWL/Full

and RDFS are both formally undecidable, which means not all true statements can be

inferred [110].

A second influence Description Logics research has had on the OWL language

has been that OWL uses Description Logic style model theory to formalise the meaning

of the language [56]. The semantics of the OWL language are clearly defined, both in

terms of model-theory and FOL axioms.


The syntax of OWL is based on the RDF/XML serialisation described earlier.

An ontology which describes the simple Movie related example discussed above, is

presented using the OWL in Table 7.

Table 7: A simple Movie related ontology

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:owl="http://www.w3.org/2002/07/owl#"
         xml:base="http://www.isi.qut.edu.au/Movie/"
         xmlns:isimv="http://www.isi.qut.edu.au/Movie/">
  <owl:Class rdf:ID="Person"/>
  <owl:Class rdf:ID="Movie"/>
  <owl:DatatypeProperty rdf:ID="name">
    <rdfs:domain rdf:resource="#Person"/>
  </owl:DatatypeProperty>
  <owl:ObjectProperty rdf:ID="starredIn">
    <rdfs:domain rdf:resource="#Person"/>
    <rdfs:range rdf:resource="#Movie"/>
  </owl:ObjectProperty>
</rdf:RDF>

This ontology (in Table 7), when processed by an implementation of OWL,

would merge with the earlier RDF data and provide a machine readable description of

the semantics of the vocabulary used in the earlier information. By decoupling data

definition from data description (or schema, or ontology), the schema of data does not

need to be defined a priori, as in regular relational database models of information. Data

may be defined using the RDF data model, and then when it is deemed necessary,

OWL may be used to attach semantics to this data.

The example in Table 7 shows OWL being used as a means of defining the vocabulary used in RDF documents, attaching semantics to data. In this sense

ontologies may be used to transform raw data into information. Ontologies are,

however, useful outside of this kind of usage, and are in some cases, such as in building

large conceptualisations, useful in their own right. Cataloguing efforts such as the Gene

Ontology [133] and the NCI Cancer Ontology [90] are primary examples of such

applications.

4.4 KR in digital forensics and IT security

In Section 3.1 we described a number of approaches to event correlation in the fields of forensics and intrusion detection that have implicitly used knowledge representation techniques. These approaches include AI planner

languages, expert systems shells and description logic environments. Apart from the

published research results arising from work described in this chapter, and chapters 5


and 6, there is little published research investigating the wider role of knowledge

representation in computer forensics. Below is a survey of such related work.

Stephenson’s PhD dissertation work [127] focused on describing and

representing actual and ideal digital investigations, and validating digital investigation

processes against formal models. The representational approach taken by Stephenson

was to define a process language called the Digital Investigation Process Language

(DIPL), which employs Rivest's S-Expressions as a syntax, defines a range of

vocabulary related to digital investigations, and finally, represents the digital

investigation as a process. Stephenson’s primary contribution is the proposal and

demonstration of the use of formal mathematical methods to validate actual

investigation procedure described in the language with investigative process standards

represented as formal Petri net models. While Stephenson's goal of describing

investigations is a shared motivation with the work described in this dissertation, we

have chosen to adopt a representational approach based on formal knowledge

representation roots. This is due to the insufficiency of S-Expressions in the areas of

schematic expressiveness and evolution, and integration of data from heterogeneous

sources.

Brinson et al. recently proposed a taxonomy scheme characterising the field of

cyber forensics^26, which they called a Cyber forensics ontology [22]. This approach

takes a liberal interpretation of the meaning of the term ontology, avoiding a “formal

explicit description of concepts” [97] in the cyber forensics domain. The model

proposed organises concepts from the cyber forensics domain into a hierarchy; however

the relationships between concepts appear to be neither consistent nor specified.

Regardless of these omissions, this work could form a useful starting point for

developing formal ontologies related to cyber forensics.

Slay and Schulz [119] have, since the research described in this dissertation was

completed, employed ontologies as a means of describing a specific conceptualisation

of files and suspicious media content in a computer forensics application. Their

conceptualisation contains various categories of media files, and the idea of

suspiciousness, and is employed by a filesystem search application. While the work

concerns itself with categorising files as suspicious based on properties and

relationships of files (i.e. file size and topological proximity to other files), their

ontology does not attempt to encompass these concepts, presumably leaving these

concerns to implementation in code.

26 The term "cyber forensics" is not defined in this paper. On examination, the authors appear to be referring to the digital forensics field.


We have also identified a number of applications of ontologies in the computer

and information security field, especially relating to intrusion detection.

Raskin et al. argue for the adoption of ontology as a powerful means for

organising and unifying the terminology and nomenclature of the information security

field [105]. They propose that the use of ontology in the information security field will

increase systematicity, allow for modularity, and could make new phenomena within the security domain predictable. Further work by these authors investigates hybrid ontology-oriented/NLP approaches to detecting deception in natural language text [104].

Schumacher focuses on systematic approaches to improving software security,

by using Security Patterns, the application of the design patterns approach to security

[116]. Ontologies are used as a means to model both the security concepts referred to

by the patterns, as well as the patterns themselves.

Undercoffer et al. produced an ontology which can be used to describe a model

of a computer attack, which they call a “Target Centric Ontology for Intrusion

Detection" [136]. A DL classifier, in conjunction with a rule language, is used to

classify event instances as belonging to particular classes of interest, which are in turn

described using the OWL precursor DAML+OIL.

Doyle et al. [37], in reviewing the expressiveness of the state of the art in

intrusion detection correlation languages, suggest the Knowledge Representation (KR)

system CYC [67] may be of use, positing that the CYC system provides powerful

constructs for reasoning with abstract and concrete concepts across multiple domains.

Goldman et al., in their IDS alert fusion prototype SCYLLARUS [48], employed the KR system CLASSIC [19] to model a site's security policy, static

network, software configuration, and intrusion events all within the same

representational formalism.

4.5 A formal KR approach to investigation documentation and digital evidence

The problems of building an interconnected and distributed web of machine

and human interpretable information (a so-called “Semantic Web”) parallel, in the

large, the problems we observe in acquiring, assembling, and interpreting corpuses of

digital evidence and investigation related information. As we have said, digital

investigation covers a very broad range of conceptual entities, and any schema or

ontology attempting to fully describe the domain quickly becomes insufficient as

technology inexorably marches on. In this light, a means of representing evidence and

related investigation information, expressive enough to represent all of the information

we wish to represent, but not committing us to a particular conceptual schema, is


desirable in order that usability is not hampered by debate over terminology and

conceptual granularity.

It is well known that formal specification of systems aids implementation and

correctness. For example, formal specification of software has led to significant

outcomes in producing provably correct software. We propose that the field of

computer forensics would similarly benefit from a formal approach, but in this case a

formal approach to representing knowledge about investigations, and information

within the digital crime scene. Such a formal approach would form a middle ground

between machine and human understanding, by adopting a common language with

extensible vocabulary, clearly defined semantics, and a regular syntax. We summarise

below the attributes of the proposed approach in the computer forensics context, by

relating the advantages of modelling information using RDF/OWL identified by

Reynolds et al. in [110].

Integration of arbitrary information: The representation should employ a

simple and consistent model of data, in combination with a globally unique naming

scheme, in order that separately documented information may be easily combined into

a consistent and larger whole. Such a model of data and naming scheme will enable a

corpus of forensic evidence to be decomposed and composed. The benefits of this

relate to the volume problem, by enabling the sharing of evidence in small pieces, and in

enabling scalable approaches to processing evidence (due to the elimination of large

shared resources such as databases as a container of information). The naming scheme

should enable arbitrary information to be expressed and arbitrary vocabulary terms to

be created, and in addition, enable reuse of existing vocabulary terms. Related to the

complexity problem, such a representation would enable addition of new types of

information to a corpus of evidence without need to modify existing tools.

Support for semi-structured data: The representation should allow

information to be represented in the data model without need for considering or

deciding upon a particular conceptual model a priori, in order that information may be

rapidly integrated, without becoming bogged down in issues of semantics. At a later

point semantics may be attached through relating entities to elements of an ontology.

The complexity problem is addressed by rapidly enabling integration of new and

arbitrary information.

Extensibility and resilience to change: The complexity problem in computer

forensics indicates that forensics tools must address ever-increasing complexity. A new tool that exhibits backwards compatibility would, in light of this complexity, retain the

ability to interpret prior generations of information and models, despite changing

definitions of terminology over time. A representation that exhibits forwards


compatibility should be evolvable: existing tools should remain able to interpret newer

generations of information expressed in the representation. For example, if a new

storage technology is developed, an imaging application which operates on this kind of

storage may record further information related to the source of evidence. Existing tools

which operate over images must still be able to interpret the image despite the presence

of the new information.

Classification and Inference: Such a representation should enable describing

the world not only by names, but by relationships between entities, and inclusion in

classes of things. It should additionally be conducive to inferring new knowledge based on existing knowledge regarding a concept’s relationships.

Provenance: A representation should provide the ability not only to express information, but also to express where that information came from.

This, in particular, is important in the forensics context, where any facts identified must

be substantiated by evidence. This need for substantiation is a considerable burden in

computer forensics given the amount of natural language currently required to describe

these provenance issues.

The research described in Chapters 5 and 6 applies this proposed approach to reducing the volume and complexity of events sourced from computer and network event logs, and to easing the construction of corpuses of digital evidence.

4.6 Conclusion

In Section 4.1 the motivation for knowledge representations was presented in

the context of documenting digital investigations. Section 4.2 described in broad brush

strokes the history of the field of knowledge representation, discussing in turn the goals

of the field. In short, the field attempts to answer two questions: “How can we express what we know?” and “How can we reason with what we express?”

A key theme which emerges on analysing the field is its preoccupation with reasoning. In fact, today the field tends to refer to itself as Knowledge

Representation & Reasoning, rather than referring to itself as simply Knowledge

Representation. This reflects a realisation that, when addressing the goals of artificial

intelligence, both the representational formalism and the model of intelligent reasoning affect

numerous factors such as expressiveness, computational tractability and pragmatic

usefulness. The field remains today an active research area.

Section 4.3 described recent standardisation efforts on semantic information

markup, then indicated areas where the knowledge representation field has influenced

these efforts. Such efforts have multiple stakeholders advancing their varied research

and development agendas. In turn these vary from addressing the more lofty AI related


ambitions of a globally published knowledge base, to the more pragmatic, “soft AI”

goals of publishing information in a manner that it may be unambiguously interpreted

and further intermixed.

Section 4.4 described KR related work in the IT security and forensics fields,

and concluded that representation has to date been an implicit subject in forensics.

Section 4.5 puts forward the proposition that the field of forensics would

benefit from a formal approach to representation, both in documenting investigations and in automating reasoning about evidence. The section summarizes the scientific

premise motivating much of the work described in this thesis.

The next chapter investigates whether semantic web KR formalisms are

suitable for use as the basis for developing DF analysis tools.


Chapter 5. Event representation in forensic event correlation

“Nature herself cannot err, because she makes no statements. It is men who may fall into error, when they formulate propositions”

(Bertrand Russell)

Chapter 4 put forward the proposition that a knowledge representation based

approach (in particular the semantic markup languages RDF/OWL) to digital evidence

representation will yield benefits that help solve the current digital forensics problems of

complexity and volume. This chapter describes the design and implementation of such

a knowledge representation based approach, and then demonstrates the proof of

concept of the approach. Section 5.1 introduces the problem of event representation and

correlation in forensics. Section 5.2 describes the design of the knowledge

representation based approach, employing RDF/OWL as a representation with which

diverse event related information might be expressed in a human and machine readable

manner. Section 5.3 describes its implementation. The approach is evaluated in two

case studies in Sections 5.4 and 5.5. The former evaluates whether the approach is

feasible in the context of event correlation; the latter evaluates whether the

approach can scale to integrate information sourced from heterogeneous logs across

multiple domains.

The research work described in this chapter has led to the publication of the

following papers:

Schatz, B., Mohay, G. and Clark, A. (2004) 'Rich Event Representation for Computer Forensics', Proceedings of the 2004 Asia Pacific Industrial Engineering and Management Systems Conference (APIEMS 2004), Gold Coast, Australia.

Schatz, B., Mohay, G. and Clark, A. (2004) 'Generalising Event Forensics Across Multiple Domains', Proceedings of the 2004 Australian Computer Network and Information Forensics Conference (ACNIFC), Perth, Australia.

Schatz, B., Mohay, G. and Clark, A. (2005) 'Generalising Event Correlation Across Multiple Domains' (revised version), Journal of Information Warfare, vol. 4, iss. 1, pp. 69-79.


5.1 Introduction: Event correlation in digital forensics

In cases involving computer related crime, event oriented evidence such as computer event logs and telephone call records is coming under increased scrutiny.

The volume problem described earlier refers to the current state of computer forensics,

where the number of sources of potential evidence in any particular computer forensic

investigation has grown considerably. Evidence of an occurrence can potentially be

drawn from multiple computers, networks, and electronic systems and from disparate

personal, organizational, and governmental contexts. Furthermore the complexity

problem is evident in the amount of technical knowledge required to manually interpret

event logs. The knowledge required for interpretation encompasses multiple domains of

expertise, ranging from computer networking to forensic accounting.

When comparing the number of security related events to the total number of

events logged by modern computer systems, we find that in practice, security related

events comprise only a small proportion of logged information. This means there is a

large amount of event log information that is not related to security, but that is available

to the computer forensics investigators for use in identifying activities and events of

potential forensic interest. In addition, forensic event correlation may consider event

logs from other, disparate sources, which are not computer event logs per se. These

would include traditional sources such as electronic door logs, telephone call records,

and bank transaction records, and newly emerging ones from the plethora of

embedded devices, both consumer and industrial.

In order for forensic investigators to effectively investigate this mass of event

oriented data, automated methods for extracting event records and then classifying

events and patterns of events into higher level terminology and vocabulary are

necessary. New techniques are needed to assist investigators with voluminous, low-level event oriented evidence. Semantically rich representational models and automated

methods of correlating event information expressed in such models are becoming a

necessity. We need means to rapidly integrate knowledge from new types of

heterogeneous event records, in a manner that makes explicit the environmental or

implicit concepts associated with those logs, in order to facilitate both human understanding and machine processing. A general solution is needed.

Our approach enables this, forms the basis for automated heuristic correlation techniques, and provides extensibility for new models of event patterns and

correlation. We have defined an extensible and semantically grounded domain model

and an ad hoc forensic event ontology expressed using the Web Ontology Language


(OWL). This ontology describes a simple conceptualisation of event correlation related

concepts and relationships, enabling the expression of facts representing events and transactions, as well as environment-based knowledge (for example, real world information such as people and places, as opposed to event based knowledge).

Because of the richness or abundance of detail in the events we consider, we

call our prototype system Forensics of Rich Events (FORE). The system is

demonstrated using a scenario consisting of event correlation of events sourced from

security related logs in the context of an intrusion forensics investigation, and then

demonstrated in a cross domain context, integrating accounting system type event

records.

While the validity of the results produced by forensic tools is of serious import

to the forensic and legal community, in this work we do not focus on how the outcomes

of this tool would be made acceptable to a court of law. There is, however, an extensive

body of work explaining the deductions of expert and rule systems that would provide

the foundations for addressing such concerns; for example see [129].

5.2 Ontologies, KR and a new approach

Section 3.1 described the literature related to event correlation in the forensics

domain, identifying three related works: the ECF work of Abbott et al., Stallard and

Levitt’s anomaly based intrusion forensics, and Elasser and Tanner’s work on abducing

explanations for intrusions. Each of these approaches used a different representational formalism for modelling event log related information: relational modelling, expert system rules, and the planning language PDDL [74], respectively. The

focus of this work is primarily on information integration and the representation of

event related evidence, whereas these former works focus primarily on analysis

techniques, eschewing issues of semantics and integration of heterogeneous events

from multiple domains.

Our approach is to employ the RDF/OWL formalism for representing arbitrary

event log related information, and higher order concepts such as causal relationships,

and validate the approach in the context of correlation of heterogeneous event logs. Our

correlation approach relies on heuristic rules which abstract low level situations of

interest into higher level situations. In this research, these rules were developed using domain knowledge. In real world applications of this approach, we expect automated

means of identifying rules, such as data mining, would be necessary to make the rule

bases scale to a useful coverage of the domains of interest.


5.2.1 Knowledge representation framework

A number of factors influenced our choice to use RDF/OWL as the

representational formalism. Firstly, the current thrust of research in KR and the

Semantic Web is related to this formalism, and consequently, a wide variety of

RDF/OWL implementations are freely available. Secondly, this research has led to a

large body of knowledge upon which we might draw.

We initially investigated Description Logic (DL) reasoner implementations as

potential KR&R environments supporting OWL. The CLASSIC system appears to

have been at a standstill for some time, with little evidence of an active user community and no support for OWL despite their shared heritage. Current

implementations of DL reasoners, such as FaCT, appeared promising, as they have

implemented translators from OWL to their native syntax [55]. Instance reasoning in

FaCT is, however, limited by its failure to support datatypes defined by XML Schema

(XSD) [124]. For our research, support for time datatypes supporting the expression of

instants (or timestamps) as date and time values is a necessity that is not satisfied by

any of the current breed of DL reasoners.

JTP [40] initially appeared useful, as it has a built in time ontology and a

temporal reasoner, OWL support, and is open source. The FOL implementation of JTP,

however, proved extremely slow, the surrounding online community proved dormant, and simple knowledge base operations such as removal of previously

asserted facts were not supported. OWLJessKB [64], an implementation of OWL

using the JESS [43] production system, provides a reasoning method over OWL and has a small but active community; however, OWLJessKB does not support reasoning over

time in a clean way. Further, JESS is closed source software.

The JENA semantic web toolkit, a Java based RDF/OWL implementation [73], has recently added both a forward chaining reasoner, similar to JESS, and a backward chaining reasoner, similar to the tabled Prolog system XSB. The ontology API is clear and well

documented; the source is distributed under a liberal license, and is supported by an

active community. For these reasons we chose JENA as the knowledge representation

and reasoning (KR&R) framework to use as the foundations of our architecture.

5.2.2 Application architecture

As none of the existing event correlation systems or languages reviewed (see

Section 3.1) employed a formal representation at their foundations, a prototype system

was constructed. Our prototype system, called Forensics of Rich Events (FORE), is

composed of the following components: generic event log parser, event log ontology,


correlation rules, correlation rule parser, event browser and the JENA semantic web

framework (see Figure 12).

Figure 12: The FORE Architecture (components: Generic Log Parser with Apache, Win32, Door and SAP log specifications; FR3 Rule Parser; Correlation Rules; Forensic Ontology; Knowledge Base; Rule Base; Event Browser; JENA Framework)

Raw event logs are parsed into RDF event instances by the generic log parser

and inserted into the knowledge base. Correlation rules (expressed in a language called

FR3, which we describe later) are parsed into the native format of JENA, and applied to

event instances by the JENA inference engine. Investigators may interact with the

knowledge base containing the events and entity information using the event browser.

The components implementing the architecture are described in detail in the

following sections.

5.3 Implementation

5.3.1 The design of the event representation

While a number of ontologies related to security and intrusion detection were identified [32, 99, 136], in developing our prototype system we initially (for case study 1, described in Section 5.4) eschewed using an externally developed ontology as the basis for conceptualising the event correlation domain. We did this so we might focus on

validating the approach of employing RDF and OWL in building event correlation

tools. For the initial case study, we limited our use of the features of OWL to class and

property definition, avoiding object properties, property hierarchies, and constraints

over properties. Case study 2, described in Section 5.5, considers the use of component

ontologies in integrating event oriented evidence sourced from multiple domains.


Our ontology is rooted in two base classes: an Entity class that represents

tangible “objects” in our world, and an Event class that represents changes in state over

time.

At the time of this work, some guidance existed towards modelling time in the

form of the OWL-S time ontology [99]. This work, performed in the context of agent

oriented computing, was related to describing the time of occurrence of events both as

instants and durations, and the topological relationships related to this model of time.

This model of time, however, had no implementation of a temporal reasoner. For this

reason, we adopted a simple instant based model of time that assumes basic events to

happen at an instant of time. Our basic temporal ordering property is thus supported by

reasoning over the startTime owl:DatatypeProperty of the Event class. We assume a

simplified model of time, avoiding the implications of timing irregularities, such as

clock drift, cross time zone event sources, deliberate modification of time records, and

lack of time synchronization (Chapter 7 investigates the reliability of assuming correct

clock operation, and addresses the theme of assurance of timestamps).
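The instant-based ordering can be sketched as follows. This is a Python illustration, not the thesis's Java/JENA code: each basic event carries an xsd:dateTime startTime string, and temporal ordering reduces to comparing the parsed instants.

```python
from datetime import datetime

def parse_instant(xsd_datetime):
    """Parse an xsd:dateTime such as 2002-03-04T20:30:00Z to a datetime."""
    return datetime.strptime(xsd_datetime, "%Y-%m-%dT%H:%M:%S%z")

def happened_before(event_a, event_b):
    """Basic temporal ordering over the startTime property of two events."""
    return parse_instant(event_a["startTime"]) < parse_instant(event_b["startTime"])
```
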

Causal linkage is modelled as an owl:ObjectProperty, whose domain and range

are both Event. In effect, this is a unidirectional relation where a parent event has a

collection of causal ancestors. We borrow Luckham’s definition of causality “If the

activity signified by event A had to happen in order for the activity signified by event B

to happen, then A caused B” (p.95) [71]. The causal linkage property is represented in

OWL as the following:

<owl:ObjectProperty rdf:ID="causality">
  <rdfs:range rdf:resource="#Event"/>
  <rdfs:domain rdf:resource="#Event"/>
</owl:ObjectProperty>

Composite events are implemented by the creation of new events of the new

abstract and more generalized concept as the result of the successful matching of event

patterns and firing of a correlation rule. For example, we define a

DownloadExecutionEvent. This class is a composite event; its semantics are that a user

has executed content that has previously been downloaded, for example by

downloading with a web browser, or as an email attachment. This event composes

lower level events: a FileReceiveEvent and a subsequent ExecutionEvent. The inter-instance and class/subclass relationships are depicted below in Figure 13.


Figure 13: Instance and Class/Subclass relationships between events

The expressiveness of RDF/OWL enables the translation of event log entries

into instances of information with fixed and specific semantics. The presence of

class/subclass relationships in the event forensics ontology enables the definition of

abstract classes of events sharing similar characteristics. These abstract event classes in

turn enable the expression of correlation rules matching over abstract notions, while

still operating over specific information. For example, a correlation rule composing a

FileReceiveEvent will, in the presence of an ontology describing a

WebFileDownloadEvent (an event sourced from web server logs) as a subclass of

FileReceiveEvent, just as happily match the latter, more specific, event.
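This class/subclass matching behaviour can be sketched as follows. The fragment is illustrative only: in FORE, subclass inference is performed by JENA over the OWL ontology, and the single-parent dict below merely stands in for that class hierarchy.

```python
# A single-parent map standing in for the OWL class hierarchy (illustration).
SUBCLASS_OF = {
    "WebFileDownloadEvent": "FileReceiveEvent",
    "FileReceiveEvent": "Event",
}

def is_a(event_class, target_class):
    """True if event_class is target_class or a (transitive) subclass of it."""
    while event_class is not None:
        if event_class == target_class:
            return True
        event_class = SUBCLASS_OF.get(event_class)
    return False

# A rule written against the abstract FileReceiveEvent therefore also
# matches the more specific WebFileDownloadEvent.
```
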

5.3.2 Log parsers

A set of regular expression based parsers was developed to introduce

unstructured and heterogeneous event log data into the JENA knowledge base. These

parsers use a similar syntax to that used by the ECF [3] effort’s parsers for matching

event specifications, and create sets of OWL instances in the knowledge base. Under

this scheme, a single instance of a Windows 2000 login event sourced from the

windows security log would be converted into three instances: one representing the

event, one representing the user, and one representing the host. This is shown in the

RDF/XML syntax following:


<Win32ConsoleLoginEvent rdf:ID="loginInstance1">
  <startTime rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime"
    >2002-03-04T20:30:00Z</startTime>
  <user rdf:resource="#user1" />
  <host rdf:resource="#host1" />
</Win32ConsoleLoginEvent>

<Win32DomainAccount rdf:ID="user1">
  <userName>jbloggs</userName>
  <domain>DSTO</domain>
</Win32DomainAccount>

<Win32Host rdf:ID="host1">
  <hostName>s3</hostName>
</Win32Host>

Notable points here are the usage of XML Schema (XSD) for encoding the

time of the event in the representation, and the use of RDF resource references to link

the event with instances of entities representing the user and the host. The lack of

precision caused by the omission of a year in syslog based timestamps is addressed by

a declarative feature at the event parser layer to skew time data to the correct year. This

may be used to address timing irregularities where the irregularity is quantifiable and

regular.
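The year-skew idea can be sketched as follows (Python for illustration; the actual feature is a declaration in the parser specification, and the year value used in the example is hypothetical):

```python
from datetime import datetime

# Syslog timestamps omit the year, so the parser specification declares the
# year the log covers, and the parser fills it in when constructing the
# xsd:dateTime value.
def syslog_to_instant(stamp, declared_year):
    """Convert a year-less syslog timestamp such as 'Mar 14 20:30:00'."""
    parsed = datetime.strptime(stamp, "%b %d %H:%M:%S")
    return parsed.replace(year=declared_year)
```
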

A further simplification is that we assume the integrity of the event sources has

not been compromised.
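The one-record-to-several-instances split performed by these parsers can be sketched as follows. The input line format below is invented for illustration; the real parsers are driven by declarative regular expression specifications rather than hard-coded patterns.

```python
import re

# One hypothetical Windows login record format, showing the split of a
# single raw record into linked event, user and host instances.
LOGIN_RE = re.compile(
    r"(?P<time>\S+) ConsoleLogin user=(?P<domain>\w+)\\(?P<user>\w+) host=(?P<host>\w+)"
)

def parse_login(line):
    m = LOGIN_RE.match(line)
    if m is None:
        return None
    return {
        "event": {"type": "Win32ConsoleLoginEvent",
                  "startTime": m.group("time"),
                  "user": "#user1", "host": "#host1"},
        "user": {"id": "user1", "type": "Win32DomainAccount",
                 "userName": m.group("user"), "domain": m.group("domain")},
        "host": {"id": "host1", "type": "Win32Host",
                 "hostName": m.group("host")},
    }
```
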

5.3.3 A heuristic correlation language – FR3

The approach taken in this work is to perform correlation by successive

application of rules which have been defined by a domain expert. The problem of event

correlation and event pattern languages in particular lies in how to describe these

events, relationships and constraints. As discussed previously, we employ RDF/OWL

for describing event instances, classes of events, and relationships between them; however, its expressiveness is insufficient for describing temporal constraints. The most

common approach for addressing expressiveness limitations in OWL is to employ

rules.

For this reason, and because most of the correlation languages reviewed are translatable to rules, we chose to use JENA’s built-in rule engine.

Investigations indicated the rule language understood by JENA’s rule engine was

overly verbose, so a rule language, which we dub FR3, was created to express readable

and manageable correlation rules.

Our language is based on the syntax of the language F-Logic [61] and the XML

specific features of another F-Logic inspired language, TRIPLE [118]. We have

however adopted much simpler semantics, avoiding path expressions and reified

statements (statements about statements).


Namespace support

FR3 has specific support for XML namespaces and resource identifiers.

Resource identifiers follow the OWL standard of URIs [15]. Namespaces are declared as a clause of the form abbreviation := namespace. as follows:

fore :="http://www.isrc.qut.edu.au/fore#".

The usage of namespaces (along with a number of OWL features we will not

discuss here) enables integration of concepts from separate ontologies. The fore

namespace declaration resolves the concepts used in FR3 rules to concepts specified in

the FORE forensic ontology. Similarly, an RDF namespace declaration enables FR3

rules to reason over type information.

Object shorthand

A shorthand form of object attribute access is expressed using a linguistic

grouping called molecules, and is supported via the following F-Logic inspired syntax:

object[property -> value; property2 -> value2]

This is a convenient form of syntax which, in the head of a rule, enables the assignment of values to the properties of an object without repeated use of the object.

The equivalent form of these clauses in JENA’s rule language would be:

(object property value), (object property2 value2)

In an object oriented paradigm the previous could be expressed as

object.property = value; object.property2 = value2;

In the tail of the rule this form is interpreted as an equality test, whereas in the

head of the rule it is interpreted as variable assignment.

Heuristic rules

Rules are specified as follows:

antecedents -> consequences;

This can be read as IF antecedents THEN consequences, where antecedents and

consequences may contain any number of molecules or procedures. Molecules appearing

in the head of the rule (an alternative term for the consequences) are interpreted as new

facts of knowledge to be inserted into the knowledge base. Molecules appearing in the


antecedents (also known as the tail of the rule) must occur in the knowledge base for

the IF part of the rule to be satisfied. Variables are introduced by including a question

mark at the beginning of an identifier.

Reasoning

The JENA toolkit is employed as the knowledge base, RDF/OWL parser, and

reasoner. We use the RETE [42] based forward chaining reasoning engine to

implement our rule language. The RETE algorithm is a speed-efficient but space-expensive pattern-matching algorithm with a long history of use in expert systems and

rule languages.
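The fixpoint that forward chaining computes can be sketched as follows. This is a naive Python loop for illustration, not the RETE algorithm itself, which reaches the same closure incrementally and far more efficiently; the toy rule mirrors the type hierarchy inference discussed in this section.

```python
# Naive forward chaining: apply every rule to the current fact set and add
# its conclusions, repeating until no new facts appear (the fixpoint).
def forward_chain(facts, rules):
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            for new_fact in rule(facts):
                if new_fact not in facts:
                    facts.add(new_fact)
                    changed = True
    return facts

# A toy rule mirroring type hierarchy inference: every LoginEvent is an Event.
def lift_login_events(facts):
    return {(s, "rdf:type", "Event")
            for (s, p, o) in facts
            if p == "rdf:type" and o == "LoginEvent"}
```
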

At the time of this work, JENA’s OWL implementation did not support rules

matching on all subtypes of an abstract type. For example, a rule matching events of

class LoginEvent would not fire when a Win32TerminalLoginEvent was added to the

knowledge base. The machinery of JENA that implemented the semantics of OWL type

hierarchy inference relied on a hybrid implementation involving both forward (RETE)

and backward chaining (similar to Prolog) reasoners. The inferred types of the

Win32TerminalLoginEvent were not available as facts to the forward chaining rule

engine, as they were only computed backwards as a query. The OWL implementation

of JENA was modified to pre-compute the type hierarchy information using the RETE

engine so that these facts were available.

An example correlation rule

The following correlation rule in Table 8 is used to causally correlate Apache

web log entries with a particular user logon session. Standard web server access logs

only provide enough detail to determine the host that downloaded content; they contain nothing that aids in discriminating the responsible user on that host. As this is the case, we

correlate the download with all login sessions that exist on the host at the time of the

download.
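In Python terms (an illustrative paraphrase, not the FR3 rule itself), this correlation is an interval-containment test: a download is linked to every login session on the same host whose interval contains the download's start time. ISO 8601 timestamps in the same time zone compare correctly as strings, so plain string comparison suffices here.

```python
# Paraphrase of the web session correlation for illustration only.
def sessions_containing(download, sessions):
    """Return the login sessions during which this download occurred."""
    return [s for s in sessions
            if s["host"] == download["clientHost"]
            and s["startTime"] < download["startTime"] < s["finishTime"]]
```
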


Table 8: Web Session / Causality Correlation Rule

Rule
fore := "http://www.isrc.qut.edu.au/fore#".
?e1[rdf:type -> fore:WebFileDownloadEvent;
    fore:clientHost -> ?sh;
    fore:startTime -> ?t2],
?e3[rdf:type -> fore:LoginSessionEvent;
    fore:host -> ?sh;
    fore:startTime -> ?t1;
    fore:finishTime -> ?t3],
lessThan(?t1, ?t2), lessThan(?t2, ?t3)
-> ?e1[fore:causality -> ?e3];

Meaning: Take a web file download event ?e1 that came from the Host ?sh and occurred at time ?t2, and take a login session on the same Host. If the web file download event occurred during the login session, then add the login session event to the causal ancestry (causality property) of the web file download event.

All of the event classes mentioned refer to concepts defined in the document

identified in the fore: namespace declaration.

Event browser

The GUI event browser provides a number of methods of interacting with the

events in the knowledge base. Two views form the basis of the user interface, the event

causality view, and the entity view.

The event causality view provides a display of all events matching a certain context, displaying the properties of each event in a drill down manner. It further provides the means to drill down, following the causal ancestry of a sequence of events.

We implement a simple query interface for finding sets of event instances based on

type and property values.

The entity view presents all entities identified in the event base, along with

their properties. Entities selected in this view may be used as the basis of a query of all

related events. The Entity View provides an operation enabling the investigator to

hypothesise an identity equivalence relationship between otherwise distinct instances.

This is discussed in Section 5.4.1.

5.4 Case study 1: Intrusion forensics

This section describes the results of the application of the FORE system to a

forensic scenario identified by the ECF research, comparing and contrasting the investigation approaches enabled by each system, from the perspective of the

forensic investigative process.

The scenario consists of the following trace of events, and provides support for

the following hypothesis: A particular person downloaded and executed an exploit

against a computer, and later gained elevated privileges on that computer. In this

example, we assume the exploit allows the user to reset the administrator password to a


known value. The various heterogeneous event logs from which each event is sourced are identified in parentheses.

1. Person P enters room R (door log)

2. P logs on to Windows 2000 workstation W (Windows 2000 Security log)

3. P downloads exploit file F from Apache web server A (Apache Web Server log)

4. P executes the downloaded file F on a host W (Windows 2000 Security log)

5. Workstation W is rebooted (Windows 2000 Security log)

6. Administrator logs on to the workstation W a short time later (Windows 2000 Security log)

In this scenario, a user noticing the server rebooting, or a user being unable to log in as Administrator, would most likely alert the investigator. While it is

likely the attacker would cover their tracks by deleting the event log, one still can

envisage finding the log entries either via forensic analysis of the disk where the event

log is located, or from a secured log host.

This scenario sourced events from the following security log types:

• Windows security logs: records of resource authentication in the Windows OS,

• Apache web server logs: records of accesses of web resources,

• Door proximity logs: logs of proximity card readers controlling access to rooms.

We are not focusing on cross-domain correlation, so we will not address the door log in this case.

5.4.1 Investigation using FORE

The FORE approach to forensic investigation supports three methods of

interacting with event log based data: search, hypothetical entity correlation, and

automated notification. We expect that most investigations would use a mixture of all three methods. An example application of each method is described below.


Automated investigation: Notification

Most signature based approaches to Network Management and IDS enable

specification of signatures which, when they match, indicate an occurrence of interest.

Adopting this approach, we facilitate the specification of correlation rules that operate

over events in the KB. In this case, our investigation merely involves looking for a

certain set of events that are related to misuse oriented correlation rules.

We have developed a set of rules that causally correlate authentication events

and login sessions, and common actions performed on a computer, during a session of

activity. We have also defined a misuse rule that will detect the OS exploit scenario

described previously. This rule is presented in Table 9.

Table 9: OSExploit Heuristic Rule

Rule:
fore := "http://www.isrc.qut.edu.au/fore#".
?e1[rdf:type -> fore:DownloadExecutionEvent;
    fore:startTime -> ?t1;
    fore:host -> ?h;
    fore:user -> ?u;
    fore:causality -> ?e2],
?e2[rdf:type -> fore:Win32RebootEvent;
    fore:host -> ?h;
    fore:startTime -> ?t2],
?e3[rdf:type -> fore:LoginEvent;
    fore:startTime -> ?t3;
    fore:hasUser -> fore:AdministratorUser;
    fore:host -> ?h],
during(?t2, ?t1, "http://www.w3.org/2001/XMLSchema#duration^^P10M"),
notEqual(?u, fore:AdministratorUser),
lessThan(?t1, ?t2),
lessThan(?t2, ?t3),
makeTemp(?s)
-> ?s[rdf:type -> fore:OSExploitEvent;
      fore:causality -> ?e1];

Meaning: Match an event instance of class Win32RebootEvent with an event instance of class DownloadExecutionEvent that occurs before it.

If the user who caused these events is not the Administrator user, and the latter two events occurred within 10 minutes of each other, create a new OSExploitEvent and link its causality property to the DownloadExecutionEvent.

By searching for instances of the class OSExploitEvent the investigator may

immediately and directly find an OSExploitEvent that has been automatically inferred.

The OSExploitEvent is a semantic generalization of the correlation of a

DownloadExecutionEvent followed by a Win32RebootEvent followed by a LoginEvent

with account Administrator. We require that the reboot be within a 10-minute duration

from the DownloadExecutionEvent. The sequence of events which are causally related

to an OSExploitEvent is displayed as a graph in Figure 14. It should be noted that the DownloadExecutionEvent pattern would be matched by an instance of an

ApacheWebFileDownloadEvent, as the latter is a concrete subclass of the former (not

shown in the figure). A Win32LoginSessionEvent will similarly satisfy the

LoginSessionEvent pattern.
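To make the rule's matching semantics concrete, the following Python sketch mimics it over simple event objects. The class names mirror the ontology, but the code is an illustrative assumption, not the FORE implementation; in particular, instances of concrete subclasses (such as ApacheWebFileDownloadEvent) satisfy patterns written against their abstract parents, just as in the ontology.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Illustrative event classes; the subclass relationships mirror the
# ontology's generalisation hierarchy.
@dataclass
class Event:
    start: datetime
    host: str

@dataclass
class DownloadExecutionEvent(Event):
    user: str

class ApacheWebFileDownloadEvent(DownloadExecutionEvent):
    pass  # concrete subclass; matches rules written against its parent

@dataclass
class Win32RebootEvent(Event):
    pass

@dataclass
class LoginEvent(Event):
    user: str

def os_exploit_candidates(events, window=timedelta(minutes=10)):
    """Sketch of the Table 9 heuristic: a non-Administrator download/execute,
    a reboot of the same host within `window`, then an Administrator login."""
    out = []
    for e1 in events:
        if not isinstance(e1, DownloadExecutionEvent) or e1.user == "Administrator":
            continue
        for e2 in events:
            if not (isinstance(e2, Win32RebootEvent) and e2.host == e1.host
                    and e1.start < e2.start <= e1.start + window):
                continue
            for e3 in events:
                if (isinstance(e3, LoginEvent) and e3.user == "Administrator"
                        and e3.host == e1.host and e2.start < e3.start):
                    out.append((e1, e2, e3))
    return out
```

Note that the isinstance checks give the same subclass-matching behaviour as the rule engine: an ApacheWebFileDownloadEvent binds to the DownloadExecutionEvent pattern without any extra code.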


Figure 14: Causal ancestry graph of exploit

This method of investigation will only succeed in cases where the name of the client computer in the Apache web server logs is the same as the name of the computer in the related Windows security log events. This is often not the case, as

Windows uses a host name in security log events, whereas Apache, by default, uses IP

addresses. We discuss below a method of addressing this shortcoming.

Hypothetical entity correlation

Heterogeneous authentication environments make the notion of identity in the

security field difficult. Login names are often different from the real name of a user,

and a user may have several different login names associated with different computer

systems. Similarly, identifying computer hosts from log entries is complicated by the

use of hostnames in some cases, and IP addresses in others. Finally, the usage of

dynamic addressing further complicates this situation.

In the absence of information describing which names belong to a single entity,

when building a representation of a situation such as this, one has no alternative but to

create a separate and unique surrogate for each unique name. This leads to a

proliferation of surrogates for which there actually exists only a single referent in the

real (or virtual) world. We refer to this problem as surrogate proliferation.

The FORE approach provides a novel means of investigating under the

presence of surrogate proliferation. The entity view of the GUI provides an operation

enabling the investigator to incorporate hypotheses regarding equivalence relationships


between surrogates. For example, one might hypothesise that separate individual entities represent the same individual.

Consider the following unrelated sets of correlated events in Figure 15. The

previous web server related rule in Table 8 would not correlate the

ApacheWebFileDownloadEvent with the Win32LoginSessionEvent, as the surrogate

hosts (the client in the web log and the host in the LoginSessionEvent) are not the same.

Similarly, in an unrelated scenario where we were interested in correlating remote shell

sessions with the activities on a client computer, the SSHPasswordAuthenticationEvent

would not correlate with the Win32ProcessCreationEvent that executed the SSH client,

putty.exe.

Figure 15: Related events remain unconnected because of surrogate proliferation

If the investigator looks in the entities view of the event browser he will see all

of the individual entities that have been identified from the event logs, including the

Host with IP address 131.181.6.167 and another host with name “DSTO”. Through

examining DNS logs, or by other means, the investigator may hypothesise that the Host

“DSTO” and the Host with IP address 131.181.6.167 are in fact the same host. The

investigator can select the two entities and invoke the sameAs operation on the two.

The individual entities representing the two Hosts are now treated by the OWL

implementation as one single entity, a single instance of Host that combines all

properties of the prior two.

The sameAs operation relies on the underlying semantics of the OWL

individual equivalence mechanism, owl:sameAs. This language feature may be used to

state that seemingly different individuals (or classes) are actually the same. This single

(now merged) individual will now suffice to fire the WebFileDownload-LoginSession

causality rule discussed previously, and causally correlate the

ApacheWebFileDownloadEvent to the Win32LoginSessionEvent, via a different rule as

shown in Figure 16. Additional rules may now correlate the connection initiated by the

“putty” SSH client to another UNIX host.
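The observable effect of merging surrogates can be sketched with a small union-find structure: merged surrogates pool their properties and answer lookups as a single individual. This is only an illustrative model of how owl:sameAs behaves for individuals, not how an OWL reasoner is implemented.

```python
class SurrogateStore:
    """Minimal sketch of hypothetical entity unification (cf. owl:sameAs).
    Union-find merges surrogates; property lookups consult the merged set."""

    def __init__(self):
        self.parent = {}  # surrogate -> representative chain
        self.props = {}   # representative -> merged property dict

    def add(self, name, **props):
        self.parent.setdefault(name, name)
        self.props.setdefault(name, {}).update(props)

    def find(self, name):
        # Follow parent links to the representative surrogate.
        while self.parent[name] != name:
            name = self.parent[name]
        return name

    def same_as(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra
            # The merged individual carries all properties of the prior two.
            self.props[ra].update(self.props[rb])

    def properties(self, name):
        return self.props[self.find(name)]
```

After `same_as("DSTO", "131.181.6.167")`, a correlation rule that requires both a host name and an IP address on a single Host can fire against either surrogate, which is the behaviour exploited above.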


Figure 16: Correlated event graphs after proliferated surrogates are merged

With these causal links correlated, the rules for the OSExploitEvent will now

be satisfied, driving the creation of an OSExploitEvent and its subsequent display in

the event view of the user interface.

Search oriented investigation

The prototype enables one to follow an interactive search methodology to explore

arbitrary hypotheses. The investigator uses the query interface of the FORE event

browser to find all instances of a LoginEvent with the user property equal to the

Administrator user. With our current knowledge base, this will return a

Win32LoginEvent (which is a subclass of LoginEvent), corresponding to the login on

machine “DSTO”. The investigator also now knows that the host “DSTO” runs a

Windows operating system, by virtue of the specific nature of the Win32LoginEvent.

The investigator at this point may be interested in what other users were doing

on the computer in the time leading to this event. By querying for all instances of the

LoginSessionEvent class prior to the administrator login event on the machine “DSTO”,

the investigator will find that the user DSTO\bob was logged into the machine

previously. The LoginSessionEvent is an abstract event we use to represent a user’s

login session on a machine.

Examination of the event’s causal ancestry information will now accelerate the

process. In this case, we would find many events of the user DSTO\bob, causally

correlated together by rules relating web file downloads to login sessions, logins to

logouts, and so on. By navigating through the causal ancestry graph the investigator

may become suspicious of the DownloadExecutionEvent. This event generalises the

events of the user downloading a file (in this case sourced from an Apache log) and the

execution of the file. Examining the processName property of the

DownloadExecutionEvent reveals the file ‘rawrite2.exe’. The investigator knows that

this file is a tool for copying bootable floppy images to a floppy disk. Exploring the

causal graph further reveals that the user has downloaded a file, ‘bd030426.bin’ which


is a bootable image file that, in this case, contains a utility to wipe the Windows

administrator password, resetting it to a known value.

5.4.2 Experimental results

We ran our prototype implementation against the previously presented scenario

with the knowledge base containing events sourced from the ECF dataset. A number of

instances of OSExploit events were immediately returned, demonstrating the correct

operation of the automated investigation component of the system.

Investigation of the entities found in the knowledge base revealed many hosts

referred to by name, and others referred to by IP address, which led the investigator to formulate the hypothesis that a number of the host entries were in fact surrogates for the same host. By examining the network's current configuration, it was hypothesised that the host with name "DSTO" likely also had the IP address 131.181.6.167.

Upon expressing the hypothesis that the two surrogates in fact corresponded to the

same host by invoking the sameAs operation on the two surrogates (performed in the

UI by selecting two surrogates and right clicking to access the sameAs operation),

another instance of the OSExploit event was automatically generated. This

demonstrates that the hypothetical entity equivalence function of the prototype works

correctly, and furthermore, that this technique is effective in reducing the number of false negatives that are unavoidable without a means of hypothesis specification.

Comparison of the ECF and FORE approaches

Both the ECF and FORE approaches support querying event data in the

exploration of a hypothesis. FORE differs from ECF in that the underlying event base,

in the case of FORE, contains many linkages that have been inferred by event

correlation rules at the time events are loaded into the system. These causal linkages

enable an investigator to explore simple relationships without manual inference. ECF, however, contains none of these causal linkages. Following causal linkages in ECF

involves the operator inferring the linkages, and expressing the conditions required in

SQL queries. For example, for the scenario explored previously, the investigator would

have to write a series of SQL queries to successively narrow the set of events in

question. Investigation using ECF thus requires considerably more human inference

than using FORE.

FORE adds investigation features not found in ECF. In addition to providing a

general purpose KR framework and a complementary event correlation language, our

approach introduces the ability to hypothetically unify entities of equivalent

identity, further enhancing the effectiveness of existing rule based correlation


approaches. We facilitate the representation of generalized events, enabling

investigators to reason with generalized concepts at higher levels of abstraction. FORE

aims for high semantic consistency with little information loss, whereas ECF values

information normalization.

5.5 Case study 2: Extending the approach to new domains

In the previous section we presented our approach, which represents event or

transaction based knowledge as well as environment-based knowledge by defining an

extensible and semantically grounded domain model (a forensic ontology) expressed

using the Web Ontology Language (OWL). We created our own rule based correlation

language, FR3, based on the observation that most rule and signature based correlation

techniques are translatable to rules. We demonstrated the application of the approach to

the forensic investigation of a scenario in a single, homogeneous domain, using an ad

hoc ontology.

In the work described in this section, we demonstrate that the approach is

extensible and can be generalised to support forensic investigations involving multiple

heterogeneous domains. We demonstrate its applicability by applying it to two new

domains of event based evidence, along with the domain discussed previously. Where

we previously employed an ad hoc ontology, we now refine our approach by

integrating third party ontologies as our foundations, demonstrating that the approach

can scale by virtue of enabling separate development and subsequent integration of

information described by domain ontologies and knowledge encoded in inference rules.

This provides freedom to the expert in advancing forensic understanding within a

narrow domain, while also providing the necessary structure to relate and communicate

that understanding to less sophisticated practitioners and generic reasoning tools.

In this section we address the following potential scenario which illustrates the

motivation for our work and serves as a test of the success of our approach. We have

identified a scenario of potential misuse in an accounting environment where a

company is using the SAP27 ERP system.

The scenario consists of the following actions, with the sources of the events in

question indicated in parenthesis:

1. Person P enters room R (Door log)

2. P logs on to Windows 2000 workstation (Win32 System Log)

3. P runs the SAP application client (Win32 System Log)

27 http://www.sap.com/


4. P logs into the SAP application as Q, which fails (SAP Security Audit

Log)

5. P logs off the Windows workstation (Win32 System Log)

Detection of this scenario could indicate a user mistyping their username or

password. It could, however, also indicate a user attempting (or succeeding) to log in as another user, or to an account which they are not authorised to use. Persistent

recurrences of this event could potentially indicate the user methodically guessing the

password of another user.

5.5.1 Integration of standard ontologies

An upper ontology refers to a set of elementary, generalised and abstract

concepts that should form the basis of all other ontologies. The two primary efforts

towards defining upper ontologies are the Standard Upper Ontology (SUO) and CYC

(which stands for enCYClopaedia) [67]. Recently, the CYC upper ontology has been

made public as a part of the openCYC.org project. The SUO working group, under the

auspices of the IEEE, is working towards forming this ontology from a number of upper

ontologies, including the Suggested Upper Merged Ontology (SUMO) [94] and CYC.

Both efforts further define middle level ontologies which are more domain

specific than their upper counterparts. Reed and Lenat observe that in practice, most

work on ontology merging and reuse occurs in the middle and lower levels of ontology,

where the defining vocabulary for a domain is located [108].

SUMO [94] provides two middle level ontologies related to our work:

distributed computing, and geography. Chen and Finin (2004) have defined a set of

ontologies, collectively referred to as SOUPA, for context-aware pervasive computing

environments, which addresses concerns such as location, places and time. It imports

subsets of the OWL-S web services ontologies, and defines a spatial ontology based on

a subset of the openCYC spatial ontology.

We chose to use the SOUPA ontology for representation of place and space

related concepts as SOUPA is more lightweight than the SUMO ontologies.

Lightweight ontologies perform better in automated inference, as fewer concepts and instances need to be considered by the inferencing engine.

Further, the SOUPA efforts have demonstrated this ontology working with the JENA

toolkit.

Of the ontologies related to security, the security ontology of Raskin et al [105]

appeared to be promising; however, the ontology was unavailable at the published URL.

Of the available security ontologies, the closest fit to our needs was the NERD

ontology. To integrate it, we first had to translate it into OWL. This was


straightforward, as it is specified using the CLASSIC DL language which, like OWL, is

based on DL foundations.

The NERD ontology is far more granular than ours in its modelling of the composition

of network and host structure. For example, in our original ontology, we modelled the

IP address of a host as a property with domain the Host class and range the simple

datatype string. In the NERD ontology there are another three layers of abstraction in

between: the classes Interface, IPSetup, and IPAddress. We use a succession of

anonymous instances to represent this host. Rather than stating “the host with IP

address 131.181.6.3” in our original ad hoc ontology, we must make the statement “the

host whose interface has an ipsetup with IP address 131.181.6.3” using the NERD

ontology. This is expressed using this ontology as:

<nerd:Host>
  <nerd:hasInterface>
    <nerd:Interface>
      <nerd:hasIPSetup>
        <nerd:IPSetup>
          <nerd:hasIPAddress>
            <nerd:IPAddress>
              <nerd:ipaddress>131.181.6.3</nerd:ipaddress>
            </nerd:IPAddress>
          </nerd:hasIPAddress>
        </nerd:IPSetup>
      </nerd:hasIPSetup>
    </nerd:Interface>
  </nerd:hasInterface>
</nerd:Host>

This introduces many more entities into the system per log entry, which could

quickly overload the information conveyed in the entity view. In response to this, we

modified the presentation layer of our GUI to only present the outermost enclosing

instance, with the child properties connected to anonymous instances represented as

path elements. For example, in our entity viewer, we would represent the Host above

as:

[hasInterface.hasIPSetup.hasIPAddress.ipaddress=131.181.6.3]
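This collapsing of anonymous intermediate instances into a property path can be sketched as follows; the nested dictionary is an assumed stand-in for the chain of RDF instances, not the prototype's actual data model.

```python
def path_summary(instance):
    """Sketch: collapse a chain of anonymous instances into a dotted
    property path for the entity view. Assumes a single-property chain
    ending in a literal value."""
    parts = []
    node = instance
    while isinstance(node, dict):
        (prop, child), = node.items()  # exactly one property per level
        parts.append(prop)
        node = child
    return "[%s=%s]" % (".".join(parts), node)

# The Host above, as a nested stand-in structure:
host = {"hasInterface": {"hasIPSetup": {"hasIPAddress": {"ipaddress": "131.181.6.3"}}}}
# path_summary(host) -> "[hasInterface.hasIPSetup.hasIPAddress.ipaddress=131.181.6.3]"
```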

5.5.2 Integrating new domains

The door log entries contain the date, time, card id, name of assigned owner,

the door name, and the zone. In our case, the door is named by both the room it controls

access to and the building containing the room. Integrating this knowledge into our

prototype first involves identifying the concepts implicit in the event log data, and then

determining an appropriate place for the concepts in our ontologies.

As we wish to represent Rooms and Buildings, we hook in our Room concept

by inheriting from the SOUPA class SpacedInAFixedStructure. Similarly, we inherit

Building from FixedStructure. We hooked a DoorEvent into our existing ad hoc event


ontology by inheriting it from our existing Event class. We next wrote an event parser

specification specific to the door logs, which matches the door log syntax and declares the OWL instances necessary to represent a door entry. Below we present an

example door log event, as created by the parser:

<fore:DoorEvent>
  <fore:building>
    <fore:Building rdf:ID="building1">
      <spc:name>GP. S BLOCK</spc:name>
    </fore:Building>
  </fore:building>
  <fore:room>
    <fore:Room rdf:ID="room0">
      <spc:name>GP. S BLOCK RM S826A</spc:name>
      <spc:spatiallySubsumedBy>
        <fore:Building rdf:about="building1"/>
      </spc:spatiallySubsumedBy>
    </fore:Room>
  </fore:room>
  <fore:user>
    <fore:DoorSwipeCard rdf:ID="doorcard1">
      <fore:cardID>42281</fore:cardID>
      <fore:name>RICCO LEE</fore:name>
    </fore:DoorSwipeCard>
  </fore:user>
  <fore:startTime rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2004-03-04T20:30:00Z</fore:startTime>
</fore:DoorEvent>

SAP Security Audit Logs record, among other things, the success or failure of

logins to SAP, along with the date and time of the event, and the host (or in SAP

terminology, terminal) that the user attempted to log in from. Supporting the SAP related events specific to our scenario required adding the following new concepts to our ontology (presented in Table 10):

Table 10: SAP Related Events

• ServiceAuthenticationEvent: Authentication of a user to a resource; specifically, a resource that is a service.
• SAPAuthenticationEvent: Authentication of a user by SAP (login success or failure). Inherits ServiceAuthenticationEvent.
• SAPClientLoginSuccessEvent: Successful login to SAP. Inherits SAPAuthenticationEvent.
• SAPClientLoginFailureEvent: Unsuccessful login to SAP. Inherits SAPAuthenticationEvent.
• SAPClientProcessCreationEvent: The SAP client program has been run on a client terminal.
• IdentityMasqueradeEvent: Multiple login names have been used to access a service from the context of a single login account.

A case of identity masquerading is identified by recognising that a user has used multiple identities to access resources. We recognise this by looking for SAP authentication events that occur from the context of a single user's OS login session, but where the user identity is not consistent. The LoginSessionEvent is a


higher level abstraction which represents a user’s interactive login session on a host. In

Table 11 we present a correlation rule in our language FR3, which detects instances of

this scenario:

Table 11: Identity Masquerade Rule

Rule:
?e1[rdf:type -> fore:LoginSessionEvent;
    fore:startTime -> ?t1;
    fore:finishTime -> ?t3;
    fore:host -> ?h;
    fore:user -> ?u1],
?e2[rdf:type -> fore:SAPAuthenticationEvent;
    fore:startTime -> ?t2;
    fore:terminal -> ?h;
    fore:user -> ?u2],
le(?t1, ?t2),
le(?t2, ?t3),
notEqual(?u1, ?u2),
makeTemp(?s)
-> ?s[rdf:type -> fore:IdentityMasqueradeEvent;
      fore:causality -> ?e1, ?e2],
   ?e2[fore:causality -> ?e1];

Meaning: Match an event instance of class LoginSessionEvent with an event instance of class SAPAuthenticationEvent, where the LoginSessionEvent's host is the same host as the terminal in the SAPAuthenticationEvent, the SAPAuthenticationEvent occurs within the time boundaries of the LoginSessionEvent, and the users in the two events are not the same user.

If so, create an event of type IdentityMasqueradeEvent, link its causality property to the matched events, and link the causality property of the SAPAuthenticationEvent to the LoginSessionEvent.
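A procedural sketch of this rule's semantics follows. The dictionary event shapes and the pluggable same_user test are illustrative assumptions; the latter stands in for a hypothesised owl:sameAs equivalence between usernames, which can suppress near-duplicate mismatches such as "jsmith" versus "j.smith".

```python
def masquerade_events(sessions, sap_events, same_user=lambda a, b: a == b):
    """Sketch of the Table 11 rule: an SAP authentication from the terminal
    of an OS login session, inside that session's time bounds, by a user who
    is not (hypothesised to be) the same as the session's user."""
    hits = []
    for s in sessions:        # dicts with keys: user, host, start, finish
        for e in sap_events:  # dicts with keys: user, terminal, time
            if (e["terminal"] == s["host"]
                    and s["start"] <= e["time"] <= s["finish"]
                    and not same_user(s["user"], e["user"])):
                hits.append((s, e))
    return hits
```

Passing a widened same_user predicate plays the role of the sameAs merge described in Section 5.5.3: pairs previously flagged as masquerading because of trivially different usernames stop matching.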

Correlating door entries with interactive logins to a workstation is achieved

using the rule presented in Table 12:

Table 12: Door Entry- Login rule

Rule:
?e1[rdf:type -> fore:DoorEvent;
    fore:user -> ?u;
    fore:startTime -> ?t1],
?e3[rdf:type -> fore:TerminalEvent;
    fore:user -> ?u;
    fore:startTime -> ?t3],
fail(
    ?e2[rdf:type -> fore:DoorEvent;
        fore:user -> ?u;
        fore:startTime -> ?t2],
    lessThan(?t1, ?t2),
    lessThan(?t2, ?t3)
)
-> ?e3[fore:causality -> ?e1];

Meaning: Match an event instance of class DoorEvent with an event instance of class TerminalEvent that occurs before it.

If they refer to the same user and there is not another door event in between, then link the TerminalEvent’s causality property to the DoorEvent.
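The fail(...) block is negation as failure: the rule fires only when no intervening DoorEvent exists for that user. Procedurally this is equivalent to linking each terminal event to the latest preceding door entry by the same user, as the following sketch (with assumed dictionary event shapes) shows.

```python
def correlate_door_logins(door_events, terminal_events):
    """Sketch of the Table 12 rule: link a terminal event to the most recent
    preceding door entry by the same user. The 'no intervening DoorEvent'
    condition (negation as failure in FR3) is equivalent to choosing the
    latest earlier door event."""
    links = []
    for t in terminal_events:  # dicts with keys: user, time
        earlier = [d for d in door_events
                   if d["user"] == t["user"] and d["time"] < t["time"]]
        if earlier:
            links.append((max(earlier, key=lambda d: d["time"]), t))
    return links
```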

5.5.3 Experimental results

We ran our extended software against the previously presented multi-domain

scenario with a knowledge base containing some hundreds of events sourced from the

three different domains. The event browser immediately identified the scenario, along

with a number of false positives. The scenario was identified by instances of the IdentityMasqueradeEvent class appearing in the event browser. We provide further

means for finding instances by querying for the specific event, or by using high level


views which limit the set of events displayed to higher level concepts closer to the

concerns and vocabulary of the investigator. The user interface enables the investigator

to “drill down” to the events which caused it. In this example, the

IdentityMasqueradeEvent has causal links to the LoginSessionEvent and the

SAPAuthenticationEvent that triggered its creation.

In Figure 17, we present a graph of events that correspond to the scenario,

which can be explored by an investigator using the drill-down feature of the interface.

The causal relationships correlated by the rules above are presented in bold.

Other links are correlated by rules not presented here.

Figure 17: Causal ancestry graph of identity masquerading scenario. (Nodes: DoorEvent [user=P], TerminalLoginEvent [user=P], LoginSessionEvent [user=P], SAPClientProcessCreationEvent [user=P, host=F], SAPClientLoginSuccessEvent [user=Q, terminal=F], TerminalLogoutEvent [user=P], IdentityMasqueradeEvent.)

In our test environment, like many real world deployments of SAP, the SAP

username is not necessarily the same as the OS username for the same user. The rule presented in Table 11 resulted in many false positives, as the test for

inequality fires the rule for minor differences in username. For example, “jsmith” and

“j.smith” are treated as separate users.

In this case the surrogate proliferation problem creates false positives. To

resolve this, we explicitly select the users in question, and indicate that they should be

treated as representing the same thing, again using the sameAs functionality provided

by the OWL semantics. As a result, IdentityMasqueradeEvent instances based on this kind of identity mismatch are removed from the knowledge base and event viewer. This approach

to hypothetically resolving identity between a user identified from a door log, and a


user identified in a login, similarly allowed us to causally correlate door logs with

logins to computers.

5.6 Conclusion

The FORE prototype holds the promise of collaborative development of

correlation rules that correlate events across and within domains, reducing the amount

of manual inference and query tasks, and assisting in interactive investigation. At a

higher level, we have demonstrated that correlation rules can automatically correlate

whole forensic scenarios without interactive investigation by human operators.

The four contributions of this chapter are aligned with themes of representation

and analysis techniques.

Firstly, the work investigates whether the RDF/OWL formalism is a useful

general representation upon which a digital forensics application, requiring a wide

representational scope, might be built. An experimental result of case study 1 (Section

5.4) is that we find that RDF/OWL is a useful formalism for representing low level

computer security and systems related events, composite and abstract events related to

higher order suspicious situations, and entities referred to in those events. The instance

model of RDF enabled definition of a surrogate per event and entity, and the class

based model of abstraction enabled ascribing semantics to these event and entity

instances. The experiment demonstrates that the representation is of use in addressing

the complexity problem by enabling integration of arbitrary information from various

computer security and systems event logs.

The RDF/OWL representation is not, however, sufficiently expressive

to describe and represent heuristic knowledge describing complex relationships

involving temporal constraints, instance matching, and declaration of new property

values or new instances. This necessitated shifting outside the knowledge

representation to employ a rule language (FR3) for these purposes.

In case study 2 (Section 5.5) we have demonstrated that the representation is

extensible and generalisable to support reasoning across multiple heterogeneous

domains. We do so by successfully applying the prototype to a forensic scenario that

involves both ERP security transaction logs, and door logs, in addition to computer

security logs such as those which we have considered in our previous efforts.

Furthermore, we demonstrate that our approach can scale, by supporting the separate

development and subsequent integration of domain models, event parsers, and

correlation rules, by experts in their respective domains.

In this case, however, we addressed integration of information with differing

ontological commitments, by integrating information modelled by an existing network


intrusion detection related ontology, in addition to events sourced from the enterprise

resource planning system, SAP. The extensibility of the representational approach was

demonstrated by the ease with which an existing domain model was integrated into our

prior prototype.

The RDF/OWL language alone was not, however, expressive enough to

provide the language tools to address the areas of impedance mismatch between our

prior (ad-hoc) ontology and the intrusion related ontology. In this case the mismatch

was resolved by modifying our existing ontology and heuristics to operate at the same

level of granularity and commitments of the new ontology. An alternate approach

would have been to adopt rules to bridge across these mismatches.

In practice, the approach and implementation described carried with it an

ontological commitment which focused on modelling of situations and entities,

simplifying the subtle relationships between events and their occurrent time. This

simplified model of time carries with it the assumption that all of the clocks on the

separate machines are synchronised. While network time infrastructure such as NTP

facilitates synchronisation of computer clocks down to the millisecond, we expect that

in practice all but the simplest of forensic investigations will involve multiple computer

time sources in various states of de-synchronisation. Further work is required in

adapting event correlation techniques to work with models of time which incorporate

notions of multiple independent timelines, hypothetical specifications of clock

timescale behaviour, and automated methods for identifying the temporal behaviour of

computer clocks from event logs. This last theme is investigated further in Chapter 7.

An additional time related simplification is the embedded assumption that

values of entities remain invariant over time, whereas in reality, attribute values vary

over time. For example, while it may be widely true that a particular person's name remains the same over the period of their life, this assumption fails to hold

when one considers events such as marriage, and officially sanctioned name change via

deed poll. Models of entity attribute values which account for different values over

time require further investigation.

Another limitation of the prototype described here is that it eschews

maintenance of provenance information. First, the parser in its current state does not record the source of the event instances that it generates. Second, the rule engine does not record

which successful rule firings lead to which new inferred composite events.

Documentation of both of these is important as any automated conclusions must be

verifiable and traceable back to the original evidence.

The second contribution of this chapter is the demonstration of a novel analysis

technique for automated detection of a computer forensic situation, based upon


information automatically derived from digital event logs. We present a heuristic rule

based approach that has the ability to manage the scalability and semantic issues arising

in such inter-domain forensics.

Such rule based approaches have a number of shortcomings. While abstraction

goes some way towards reducing the number of rules required for automated detection,

rules must still be authored by experts. Research into automated means of identifying

potential rules and associations is warranted; approaches such as data mining hold

promise. Furthermore, rules are by nature crisp in their definition, precluding

incorporation of fuzzy concepts. For example, in the OSExploit detection rules, we

implied a causal relationship by requiring that the Win32RebootEvent and LoginEvent

be within 10 minutes of each other, under the hypothesis that an attacker would operate

quickly, and to avoid the complication of the rule matching every Login after a reboot.

Intuitively, the further the events which correlate to the OSExploitEvent are away from

each other, the less likely they are to be causally correlated. Where one draws the

line on the relatedness of two events of these types is by nature subjective and could

benefit from techniques which acknowledge this.
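To make the shape of such a crisp rule concrete, the 10-minute window can be sketched in ordinary procedural code. This is an illustrative reconstruction only: the prototype expresses the rule in FR3 over RDF instances, and the dictionary-based event encoding and function name below are our own assumptions.

```python
from datetime import datetime, timedelta

# Illustrative sketch of the crisp temporal window discussed above;
# the actual prototype states this as an FR3 rule over RDF instances.
WINDOW = timedelta(minutes=10)

def correlate_os_exploit(events):
    """Pair each Win32RebootEvent with any LoginEvent on the same host
    occurring within 10 minutes after it, inferring an OSExploitEvent."""
    reboots = [e for e in events if e["type"] == "Win32RebootEvent"]
    logins = [e for e in events if e["type"] == "LoginEvent"]
    inferred = []
    for r in reboots:
        for l in logins:
            if l["host"] == r["host"] and \
                    timedelta(0) <= l["time"] - r["time"] <= WINDOW:
                inferred.append({"type": "OSExploitEvent",
                                 "host": r["host"],
                                 "evidence": (r, l)})
    return inferred
```

The hard-coded `WINDOW` is exactly the kind of crisp boundary criticised above: a login 10 minutes and 1 second after the reboot is silently ignored.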

The third contribution is the identification of a novel means of resolving the

problem of surrogate proliferation in interpreting names in event logs, which is

described in Section 5.4.1. Surrogate proliferation refers, in this case, to the problem

which arises from a single real (or virtual) world entity having multiple names by

which it is referred to, which leads to the necessary creation of one surrogate per name.

For example, while some event log entries (such as those taken from firewall logs) may

describe events related to a host by referring to its IP address, other event log entries

may refer to the same entity by its DNS name. This abundance of multiple names for

the same entity increases the number of entities which must be considered in interpreting

and correlating event logs.

This problem of surrogate proliferation can be observed throughout the digital

forensics domain. The event normalising task in Stephenson’s End to End Digital

Investigation (EEDI) methodology [127] refers to this problem in the context of

resolving records referring to the same network event being received from multiple

sensors. Similar problems may be observed in ascribing identity to particular versions

of files (i.e. operating system) found across multiple digital crime scenes.

The technique addressing this problem (described in Section 5.4.1) exploits a

general feature of the RDF/OWL formalism: the owl:sameAs language term, and

associated OWL defined semantics. That the general reasoning machinery of the

knowledge representation is employed to solve this problem demonstrates the


immediate benefits of employing a knowledge representation towards reducing

the complexity and volume of digital evidence.
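The effect of owl:sameAs smushing on surrogate proliferation can be sketched with a small equivalence-merging routine. This is a hand-rolled illustration of the semantics only; the prototype relies on the knowledge representation's own reasoning machinery, and the host names in the usage below are invented.

```python
# Sketch of owl:sameAs surrogate merging using union-find;
# an OWL reasoner applies these semantics automatically.
def merge_surrogates(same_as_pairs):
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in same_as_pairs:
        parent[find(a)] = find(b)

    # Group every mentioned name under one canonical surrogate.
    groups = {}
    for name in list(parent):
        groups.setdefault(find(name), set()).add(name)
    return groups
```

Given owl:sameAs assertions linking an IP address, a DNS name, and a NetBIOS-style name for one host, the routine collapses the three surrogates into a single group, just as the reasoner collapses them into one entity during correlation.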

This approach has a number of limitations related to expressiveness. Besides

the temporal simplifications described previously, we additionally simplified our model

of events by not modelling positional relationships between textual events within the

log file. Fully describing the information within an event log file requires a detailed

understanding of the meaning of the log file entries, and as such requires considerable

domain knowledge. Additionally, writing parsers which translate between textual event

log records and information expressed in the RDF/OWL representation carries with it

the additional burden of understanding the representation, and modelling methodology

employed. Description of event logs in this manner requires orders of magnitude more

storage, which exacerbates the volume problem.

The final, and significant limitation of the approach, is that the current

generation of RDF/OWL reasoners and data-stores were observed to be problematic in

scaling to large volumes of information. Wholesale import of event logs into current

generation data-stores and reasoners yields unsatisfactory results (measured by the

amount of time taken to import event logs and perform correlation), leading to the

conclusion that further work is required in identifying scalable methods of search and

reasoning over the OWL/RDF representation.

The next chapter proposes the same KR approach towards solving the

challenges of tool interoperability, and integration of arbitrary information.


Chapter 6. Sealed digital evidence bags

“There are more things in heaven and earth, Horatio, Than are dreamt of in your philosophy.”

(William Shakespeare)

The previous chapter proposed and demonstrated the use of formal knowledge

representation in automating correlation of digital event oriented evidence, to facilitate

identifying situations of interest from heterogeneous and disparate domains. This

chapter addresses themes of representation and assurance, considering how forensics

tools might scale and interoperate in an automated fashion, while assuring evidence

quality. The chapter considers the problem of sharing digital evidence between tools, or even more widely, between organisations.

The chapter is structured as follows. Section 6.1 introduces the problem of

digital evidence storage formats, the related literature of which is described in Section

3.2. Section 6.2 enumerates a number of definitions of terms related to digital evidence

and related documentary artefacts. Section 6.3 proposes a novel integrated storage

container architecture and KR based information architecture for digital evidence bags,

which we call sealed digital evidence bags (SDEB). This approach supports arbitrary

composition of evidence units, and related information, into a larger corpus of evidence.

Section 6.4 describes the compositional nature of the architecture in the context of a

usage scenario: building digital forensics tools and acquiring digital evidence from hard

disks. Section 6.5 describes experimental results validating the compositional nature of

the prototype approach, and Section 6.6 presents the conclusions of the chapter and

relates opportunities for future work.

The research work described in this chapter has led to the publication of the

following paper:

B Schatz, A Clark, (2006) ‘An information architecture for digital evidence integration’, Proceedings of the 2006 Australian Computer Emergency Response Team Annual Conference (AusCERT 2006), Gold Coast, Australia.


6.1 Introduction

The rapid pace of innovation in digital technologies presents substantial

challenges to digital forensics. New memory and storage devices and refinements in

existing ones provide constant challenges for the acquisition of digital evidence. The

proliferation of competing file formats and communications protocols challenges one’s

ability to extract meaning from the arrangement of ones and zeros within. Overarching

these challenges are the concerns of assuring the integrity of any evidence found, and

reliably explaining any conclusions drawn.

Researchers and practitioners in the field of digital forensics have responded to

these challenges by producing tools for acquisition and analysis of evidence. To date,

these efforts have resulted in a variety of ad hoc and proprietary formats for storing

evidence content, analysis results, and evidence metadata, such as integrity and

provenance information. Conversion between the evidence formats utilized and

produced by the current generation of forensic tools is complicated. The process is time

consuming and manual in nature, and there exists the potential that it may produce

incorrect evidence data, or lose metadata [30].

It is with these concerns in mind that calls have been made for a universal

container format for the capture and storage of digital evidence. Recently, the term

“Digital evidence bags” was proposed to refer to a container for digital evidence,

evidence metadata, integrity information, and access and usage audit records [135].

Subsequently, the DFRWS formed a working group with a goal of defining a

standardised Common Digital Evidence Storage Format (CDESF) for storing digital

evidence and associated metadata [30]. For further background on digital evidence

container formats, see Section 3.2.1.

Another source of complications related to the ad hoc nature of forensic tools is

the absence of a common representational format for Investigation Documentation.

This includes a number of generally related classes of information, such as Continuity

of Evidence, Provenance, Integrity, and Contemporaneous Notes (see Section 2.2). This

is not a trivial problem owing to the nature of the forensics domain, which deals with

massive conceptual complexity within multiple layers of abstraction. The challenge

here is to identify a means that decouples the evidence container formats and

investigation documentation used by forensics tools from the implementation logic of

these tools. Furthermore, this needs to be accomplished in a manner that facilitates the

assurance of provenance and maintains integrity.

This problem of evidence representation is not simply limited to the challenge

of tool interoperability. In outlining the “Big Computer Forensic Challenges”, Spafford


observes that practitioners and researchers in the field of digital forensics do not use

standard terminology [98], and indeed it is clear that there is limited attention paid to

the formal definition of taxonomies or ontologies describing this domain.

We propose the use of ontologies in addressing these terminological and

representational problems. We have produced a number of basic ontologies modelling

the domain of digital evidence acquisition, computer hardware, and networks, and

described these ontologies using the Web Ontology Language (OWL). In combination

with semantic markup languages such as RDF, ontologies encourage knowledge

sharing and reuse within a domain, which has the potential to lead towards a

convergence of vocabulary in the forensics domain.

In this chapter we propose an extensible architecture for integrating digital

evidence by applying an ontology based approach to Turner’s digital evidence bags

concept. We enumerate the representational requirements for the investigation

information component of an open common digital evidence storage format, and

formalise the domain by describing it with an ontology. An architecture for digital

evidence bags is demonstrated which facilitates modular composition of forensic tools

by way of an extensible information architecture. Further, a novel means of identifying

digital evidence, and digital evidence bags is proposed which supports arbitrary

referencing of information within and between digital evidence bags. The proposal

modifies Turner’s design to strengthen evidence assurance, proposing an sealed

(immutable) bag metaphor.

6.2 Definitions

Our concerns involve representation and terminology. To avoid confusion, the

following terms used throughout the chapter, and in our digital evidence ontology, are

defined below. As the subject is digital evidence, we omit the use of the word digital in

our definitions.

Continuity of Evidence Documentation: Information maintained to track

who has handled evidence since it was preserved.

Digital Evidence: A term which loosely refers to a related set of Evidence

Content or Secondary Evidence and Investigation Documentation.

Evidence Content: Stream of bytes of computer data: typically data which is

stored in a file, or a stream of a file, or in raw storage, such as the ordered sectors of a

disk.

Evidence Content File: A file containing evidence content.

Image: A contiguous sequence of bytes, which is a copy of a digital crime

scene.


Integrity Documentation: Information which is used to detect the

modification of evidence content or metadata.

Investigation Documentation: Contextual information which is related to

Evidence Content. For example, commonly gathered Investigation Documentation

related to a JPEG image might be the file name, the path which it was stored in, and the

last modification, last access and creation times of the file.

Investigation Documentation File: A specific file containing arbitrary

Investigation Documentation.

Provenance Documentation: Information which relates to the provenance of

the evidence. For example, information about who captured the evidence, where it was

stored, and what tools were used falls into this category.

Secondary Evidence: Digital evidence produced by an analysis tool.

6.3 An extensible information architecture for digital evidence bags

The primary aim of our work is to identify a general solution which meets the

representational needs for storing arbitrary information, including both investigation

documentation and secondary evidence, in digital evidence bags in a manner that is

both machine and human readable. We seek to do this in a manner that allows separate

evolution of, definition of, and interoperability between the abstractions which are used

in forensic tools, in a manner that is not dependent on the management of a single

entity or governing body. The secondary aim of the work is to produce an evidence

container that enables a compositional approach to evidence sharing and integration.

We look to the near future, where analysis cases may involve digital evidence

from sources orders of magnitude more numerous than the current norm. In fact we see

the beginnings of this challenge as investigations of P2P networks involve multiple

terabyte sized images, sourced from numerous locations and computers. We expect that

the monolithic approaches to digital evidence containers will not scale to this future, for

reasons such as evidence bag size, concurrent access, and IO efficiency.

For example, consider the case where two multi-terabyte images must be

acquired. The use of a single monolithic DEB for containing both images could imply

serialising access to the DEB, and prohibit acquiring the images in parallel. With

current IO speeds, this would add tens if not hundreds of hours to the acquisition time.

To address these scaling issues we propose a compositional rather than

monolithic approach to assembling of a corpus of digital evidence. This requires

amongst other things defining an identification scheme that is independent of location

and global in nature. This architecture facilitates the building of a corpus of evidence


by recursively embedding digital evidence bags within digital evidence bags, as well as

by intra-bag reference, which we depict in Figure 18. We call the architecture the

Sealed Digital Evidence Bags (SDEB) in reference to Turner’s proposal of the DEB.

Figure 18: Referencing nested and external digital evidence bags

For example, in the case of the multi-terabyte imaging scenario discussed

above, both imaging processes could happen in parallel, producing two digital evidence

bags. A further digital evidence bag, which references both these images could then be

used for adding provenance documentation such as the examiner’s name and case

number.

The architecture may be described in terms of two orthogonal components, the

storage container architecture and the information architecture.

6.3.1 Storage container architecture

The storage container architecture describes how data streams containing data

objects, investigation documentation, and evidence bag documentation are contained in

one archive.

Sealable digital evidence bags follow a similar structure to Turner’s bags. The

key difference is the use of RDF/XML to represent the Tag and Investigation

Documentation related information, in order to facilitate an interoperable

representation. The Tag File of any digital evidence bag is called Tag.rdf. The naming

of the Investigation Documentation files is tool or user determined, however the

extension is .rdf to signify that the format of the file is RDF.

The RDF/XML format does not support recursive definition of RDF/XML content within another RDF/XML content block, and makes no provision for arbitrary text outside the XML syntax. This leads us to maintain

integrity information regarding the content of the Tag in a file external to the Tag,

unlike the DEB proposal. Turner’s DEB uses an onion like approach where a hash of

the previous contents of the Tag is recursively appended to the Tag. We instead define


a Tag Integrity File, called Tag.rdf.sig, which contains integrity information pertaining

to the Tag.

Sealable digital evidence bags are designed to be created and populated with

evidence and investigation documentation, then sealed exactly once. The Tag of an

SDEB is immutable after the Tag Integrity File has been added to the SDEB. Before

that, the bag is unsealed and mutable.

The structure of the SDEB is presented in Figure 19.

Figure 19: Proposed sealed digital evidence bag structure

To demonstrate the SDEB architecture in context, we have developed a

prototype online acquisition tool for creating a digital evidence bag containing images

of the Internet Explorer cache and history index files (these are also referred to as web

browser logs). These files are typically located in a number of subfolders of the \Local

Settings\Temporary Internet Files\ path under the user’s profile directory on a

Windows host. The files in question are all named index.dat.

We present the file oriented contents of the digital evidence bag produced by

the prototype tool called acquireIELogs.py in Table 13. The tool creates images of the

browser log files, naming them according to a programmatic naming scheme based on

their original filename (in this case, index.dat), in combination with the user name, the

kind of file (cache or history), and the specific history file set.


Table 13: The file content of a browser log SDEB

jbloggs.history.MSHist012006010420060105.index.dat.rdf
jbloggs.history.MSHist012006010420060105.index.dat
jbloggs.history.MSHist012006010320060104.index.dat.rdf
jbloggs.history.MSHist012006010320060104.index.dat
jbloggs.history.MSHist012005121220051219.index.dat.rdf
jbloggs.history.MSHist012005121220051219.index.dat
jbloggs.history.MSHist012005121920051226.index.dat.rdf
jbloggs.history.MSHist012005121920051226.index.dat
jbloggs.cache.index.dat.rdf
jbloggs.cache.index.dat
jbloggs.history.index.dat.rdf
jbloggs.history.index.dat
Tag.rdf
Tag.rdf.sig
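The programmatic naming scheme can be sketched as a simple join of the name components. The component order is inferred from the filenames listed in Table 13; the function name and defaults are our own.

```python
def image_name(user, kind, history_set=None, original="index.dat"):
    """Build an image filename in the style of Table 13, e.g.
    jbloggs.history.MSHist012006010420060105.index.dat.
    Sketch only: the component order is inferred from the table."""
    parts = [user, kind]
    if history_set:  # daily/weekly history sets carry a set identifier
        parts.append(history_set)
    parts.append(original)
    return ".".join(parts)
```

Because every image file keeps its source filename as a suffix, each image's companion Investigation Documentation File is simply the image name with .rdf appended.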

6.3.2 Information architecture

The information architecture is described by two ontologies, a representation

layer, and a unique naming scheme for referring to arbitrary information.

Unambiguous identification of evidence and arbitrary information

Recalling that in RDF (see Section 4.3.1), Subjects, Predicates and Objects are

named using a URI, we use a special category of URI called a Uniform Resource Name

(URN) [86] for identifying digital evidence bags, investigation documentation, and

arbitrary secondary evidence instances. URNs are intended to serve as persistent,

location-independent resource identifiers.

Following work performed in the life sciences area in uniquely identifying

proteins in distributed databases (which has resulted in the definition of the Life

Sciences Identifier (LSID) standard [117]), we propose a digital evidence specific URN

scheme. This scheme, which we call the Digital Evidence IDentifier (DEID), is based on

the organisation of the tool user, and employs message digest algorithms as a globally

unique identifier. The format of a Digital Evidence Identifier is as follows:

urn:deid:organisation:digestalgorithm:digest:discriminator

For example, we identify a particular image taken of a file in our example

further below using the following URN:

urn:deid:isi.qut.edu.au:sha1:dc04e8f06b2a32e7d673c380c4d2c8a1d5ea17d4:image

The string “deid” (following the LSID convention, which uses the lower-case string “lsid” in its URNs) is used to provide a unique namespace for digital evidence identifiers. We provide scoping information in the organisation field, which would potentially enable one to resolve a URN back to a set of information or an evidence bag, as has been employed in the LSID work. The digestalgorithm field refers to the


message digest algorithm used to generate the text in the following field. The discriminator field is provided for further addition of naming terms. It should be noted

that we rely on the collision resistance of message digest algorithms to assure globally

unique names. Given that flaws may be found in cryptographic hashes over time, our

proposal provides for the use of other digest algorithms.
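A DEID can be constructed mechanically from evidence content. The sketch below assumes the digest is computed over the evidence bytes themselves, which the examples imply but the text does not state outright; the function name is our own.

```python
import hashlib

def make_deid(organisation, content, discriminator, algorithm="sha1"):
    """Construct a Digital Evidence IDentifier URN of the form
    urn:deid:organisation:digestalgorithm:digest:discriminator
    (sketch; assumes the digest is taken over the evidence bytes)."""
    digest = hashlib.new(algorithm, content).hexdigest()
    return ":".join(["urn", "deid", organisation, algorithm,
                     digest, discriminator])
```

Because the digest field depends only on the content, two tools imaging the same bytes independently arrive at the same name, differing only in the discriminator they choose.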

Of course these identifiers are long and unwieldy and not suited for use as

names for the evidence we are concerned with. Evidence may be given more human

friendly, case specific names by asserting further RDF triples which have the identifier

as the subject. An example of this kind of usage is given in the case study in Section

6.4.

Where it is necessary to refer to the contents of a particular file, for example a

digital evidence container file in the same DEB, the DEB implementation interprets the

standard URI file protocol (e.g. file://./foo) to find the file.

DE and SDEB Ontologies

The representation approach underlying the SDEB information architecture is

that every real world or virtual world entity has a corresponding surrogate represented

using RDF/OWL. Prior approaches blur the distinction between entities. For example,

an AFF container holding a file image might define a number of name-value pairs to

describe the set of sectors from which it read the file, and another name value pair to

describe the serial number of the hard drive from which it read the sectors. The

ontological commitment of the name-value pair representation places the subject of any

statements in the background, leaving its identity and surrogate implicit. This

representation works well for making statements about a single entity such as a hard

drive image, however, when then number of discernable entities which must be

identified increases beyond a single instance, the representation becomes clumsy due to

its absence of surrogates. For example, say that the file was read from a filesystem on a

RAID5 array. How does one document; the volume, the various physical drives

composing the volume, the RAID5 configuration, and the relationships between them?

The SDEB approach creates a surrogate for every entity which it documents,

distinguishing between an image, its content, the source media from which it was

copied, and allows the representation of secondary evidence in the same

representational formalism.

Two ontologies are defined to describe a sufficient set of concepts and

properties required for describing both the storage related components of a SDEB, and

the digital investigation related concepts. The SDEB ontology defines concepts such as

DigitalEvidenceBag, TagIntegrityFile, EvidenceDocumentationFile, and


EvidenceMetadataFile, and the properties bagContents and contains. The Digital

Evidence ontology describes a wider range of concepts relating to imaging (FileImage),

data (ContiguousBytes), tools (AcquisitionTool) and disk structure (Partition).

The Investigation Documentation Files produced by the prototype tool all

contain information of a similar format to that presented in Table 14 (abridged). The

Investigation Documentation File is used for storing arbitrary information related to the

case. As such, it never contains information related to the SDEB ontology. We can see

in the example below that some documentation is stored related to Digital Evidence –

the FileImage defined in the Digital Evidence ontology. A further ontology is

referenced in this example – a web browser specific one that is required to represent the

concepts related to web browser cache and history files – the subject of the prototype

imaging application.

Table 14: XML/RDF content of Investigation Documentation File named jbloggs.cache.index.dat.rdf

<de:FileImage rdf:about="urn:deid:isi.qut.edu.au:sha1:4056e4786fc460d9adbe98a0bc19b29a2104c476:image">
  <de:imageContainer rdf:resource="file:///./jbloggs.cache.index.dat"/>
  <de:imageOf rdf:resource="urn:deid:isi.qut.edu.au:sha1:4056e4786f…b29a2104c476:original"/>
  <de:acquisitionTool>
    <de:OnlineAcquisitionTool rdf:about="http://www.isi.qut.edu.au/2005/acquireIELogs.py">
      <de:name>acquireIELogs.py</de:name>
      <de:version>0.1</de:version>
    </de:OnlineAcquisitionTool>
  </de:acquisitionTool>
</de:FileImage>

<wb:BrowserCacheFile rdf:about="urn:deid:isi.qut.edu.au:sha1:4056e4786f…b29a2104c476:original">
  <fs:filePath>D:\Documents and...Files\Content.IE5\\index.dat</fs:filePath>
  <de:messageDigest rdf:datatype="http://www.w3.org/2000/09/xmldsig#sha1">4056e4786fc460d9adbe98a0bc19b29a2104c476</de:messageDigest>
</wb:BrowserCacheFile>

This file (Table 14) contains RDF instance data which asserts two top-level instances: a FileImage and a BrowserCacheFile. The instances describe the

relationship between the Evidence Content (the content of an Evidence Content File in

the digital evidence bag) and the original data object, which is a Web Browser Cache

File, located on a particular host.

Our ontology here discriminates between the original data object, the web

browser cache file (which at one point in time resided on some piece(s) of physical

storage media) and the image of that file. As the contents of these two files are, from

the digital perspective, identical, this results in a DEID URN with the same message

digest value. We discriminate between the two instances by using the labels “image”

and “original” in the discriminator field of the DEID URN. This distinguishes between

the FileImage and the BrowserCacheFile. The de:imageContainer property links the


FileImage instance in Table 14 with the contents of the file jbloggs.cache.index.dat as seen in Table 13.

The tool generates Provenance Documentation identifying itself by name,

location, and version, relating itself to the FileImage by use of the acquisitionTool

property. Provenance information identifying the examiner running the tool would be

added to a separate evidence bag, which refers to this sealed one. We do this to

simplify the acquisition tool, preferring that more complex data entry and annotation tasks be performed using a task-specific tool, such as an analogue of Turner’s Tag editor application.

The property and class names used in the vocabulary above are defined in

ontologies specific to the domains of discourse that we are dealing with. The prefix de

is an alias for an ontology stored in the document located at

http://isi.qut.edu.au/2005/digitalevidence, which describes the digital evidence domain.

Hence, de:FileImage refers to a specific concept (a class) defined in this ontology.

Similarly we define an ontology for filesystem related concepts aliased as fs

(http://isi.qut.edu.au/2005/filesystem) and web browser related aliased as wb

(http://isi.qut.edu.au/2005/webbrowser).

Figure 20 depicts a portion of the RDF graph implied by the instance data presented in Table 14, illustrating the discrimination between the original data object and its image discussed above.

Figure 20: RDF Graph relating original data object and image

The Tag File contains the RDF data representing the SDEB’s contents and related integrity information. The DEID of the deb:DigitalEvidenceBag instance is based on the hash of the content of the Investigation Documentation Files, in the order in which they are defined in Table 15. The deb:bagContents property is an ordered list whose members refer to instances of digital investigation documentation contained in the Investigation Documentation Files.


Table 15: Digital Evidence Bag instance data stored in the Tag File

<deb:DigitalEvidenceBag rdf:about="urn:deid:isi.qut.edu.au:sha1:44bc23235f5e797aae992e5de09524e9071fd8c6">
  <deb:bagContents>
    <rdf:Seq>
      <rdf:li rdf:resource="urn:deid:isi.qut.edu.au:sha1:dc04e8f06b2a32e7d673c380c4d2c8a1d5ea17d4:image"/>
      <rdf:li rdf:resource="urn:deid:isi.qut.edu.au:sha1:4a03ed30ebdf919004d4b40222b721c4771adee9:image"/>
      <rdf:li rdf:resource="urn:deid:isi.qut.edu.au:sha1:c117652d98a4f612979c19f5701d278e025749fa:image"/>
      <rdf:li rdf:resource="urn:deid:isi.qut.edu.au:sha1:05de1243f67753150334968a2effcc4f8114ef45:image"/>
      <rdf:li rdf:resource="urn:deid:isi.qut.edu.au:sha1:f3a9fd3fcc017d822f10bc4466b6d19ddbdd5042:image"/>
      <rdf:li rdf:resource="urn:deid:isi.qut.edu.au:sha1:4056e4786fc460d9adbe98a0bc19b29a2104c476:image"/>
    </rdf:Seq>
  </deb:bagContents>
</deb:DigitalEvidenceBag>

6.3.3 Integrity

Current best practice for ensuring the integrity of digital evidence involves the use of collision-resistant message digest functions. Typically, a message digest is taken of the original evidence and recorded in a manner that asserts the time at which the digest was taken (often via contemporaneous notes or printouts). The integrity of subsequent images, or copies of images, may then be ensured by taking the message digest of the image or copy and comparing it with the original message digest.

In this proposal, integrity of evidence and investigation documentation is

ensured by the use of chained message digests. Besides using the message digest of

each piece of Evidence Content as a component of a unique identifier for both the

Evidence Content Documentation instance and the Digital Investigation Documentation

instance, we also define a property of the de:EvidenceContext class called de:messageDigest. This property is presented in context in Table 16.

Table 16: Evidence Content message digest property

<wb:IEBrowserCacheFile rdf:about="urn:deid:isi.qut.edu.au:sha1:4056e4786fc460d9adbe98a0bc19b29a2104c476:original">
  <de:messageDigest rdf:datatype="http://www.w3.org/2000/09/xmldsig#sha1">4056e4786fc460d9adbe98a0bc19b29a2104c476</de:messageDigest>
</wb:IEBrowserCacheFile>
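The DEID scheme just described can be illustrated in a few lines of Python. This is a sketch only: make_deid is a hypothetical helper (not part of the thesis prototypes), while the authority string and the discriminator labels follow the examples in this chapter.

```python
import hashlib

def make_deid(content: bytes, discriminator: str,
              authority: str = "isi.qut.edu.au") -> str:
    """Build a DEID URN from Evidence Content.

    The identifier embeds the SHA-1 message digest of the content;
    the discriminator ("image" or "original") distinguishes the
    file image from the original data object.
    """
    digest = hashlib.sha1(content).hexdigest()
    return f"urn:deid:{authority}:sha1:{digest}:{discriminator}"

# Identical content yields identical digests, so the image and the
# original data object differ only in the discriminator field.
content = b"example evidence content"
image_id = make_deid(content, "image")
original_id = make_deid(content, "original")
```

Because the digest component is derived from the content itself, any two tools imaging the same data independently arrive at the same identifier stem.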

The value of the de:messageDigest property is the hash of the Digital Evidence

Content obtained from the file. Work in the XML Signature area has already defined a datatype representing a SHA-1 message digest, together with a URI identifying that datatype; we use the URI http://www.w3.org/2000/09/xmldsig#sha1 to specify the datatype of this property.

Integrity of the Investigation Documentation Files is maintained within the Tag

File, by definition of separate de:InvestigationDocumentationContainer instances per

Investigation Documentation File, as presented in Table 17. Integrity of the content is

assured by the inclusion of a message digest of the Investigation Documentation File,

using the de:messageDigest property.

Table 17: Investigation Documentation Container Metadata stored in the Tag File.

<deb:InvestigationDocumentationContainer>
  <deb:contains rdf:resource="urn:deid:isi.qut.edu.au:sha1:dc04e8f06b2a32e7d673c380c4d2c8a1d5ea17d4:image"/>
  <de:messageDigest rdf:datatype="http://www.w3.org/2000/09/xmldsig#sha1">731251ae7216b935cccf51a4018a00d8d89a89cd</de:messageDigest>
  <fs:filePath>file:///./jbloggs.history.index.dat.rdf</fs:filePath>
</deb:InvestigationDocumentationContainer>

As the focus of this chapter is not the mechanics of integrity maintenance, we

do not specify the format or contents of the Tag Integrity File. We expect that the

contents of the file may be formatted according to the XML Signatures standard [11],

or some other standard. We do not consider here what kind of archive is used as the bag

medium.

6.3.4 Evidence assurance

We provide no construct that directly translates to the audit oriented functions

of the Tag Continuity Blocks of the DEB proposal, as we expect that further application

of tools to sealed bags will result in new digital evidence bags being produced. The

Provenance Documentation within these new bags would refer back to the original bag,

thus serving this role.

6.3.5 Clarifications

It appears that the DEB allows a number of pieces of evidence to be stored in a

single Evidence Content File. We restrict the definition of the Evidence Content File to

refer to a container with exactly one piece of evidence content.

6.4 Usage scenario: imaging and annotation

We demonstrate the modular manner in which forensic tools may interoperate

with evidence bags built using the sealed digital evidence bags approach by way of the

following hypothetical example.


In this case, the examiner uses a DEB enabled hard drive imaging application

for acquiring the evidence image. This tool is scripted together from a variant of the

UNIX dd29 tool, and the Linux hdparm utility30. The examiner acquires the hard drive

using this utility, resulting in a digital evidence bag containing an Evidence Content

File, called hda.dd, an Investigation Documentation File, called hda.dd.rdf, as well as

Tag.rdf. The imaging application is designed to be as simple as possible, and to produce a

sealed digital evidence bag. It automatically generates a message digest of the Tag.rdf

file and stores it in the Tag Integrity File, Tag.rdf.sig. At this point the evidence bag is

sealed and considered immutable, subject to the underlying implementation scheme of the Tag Signature.

The examiner has further data associated with this digital evidence bag, namely

the Job ID, a case specific name, the examiner’s name and identifying details, and

perhaps the serial number printed on the drive. An evidence annotation program is used

by the examiner to create a new, unsealed digital evidence bag, with the original digital evidence bag embedded within it. A new Tag File is created within this new bag by the

annotation application. The additional data is entered using the annotation user

interface, and added to the Tag File. In this case the annotation editor eschews creating

a new Investigation Documentation File, as no new evidence has been acquired.

There are two distinct activities involved in the above scenario: evidence

acquisition and evidence annotation. By the former, we refer to the process of making

an exact copy of a piece of digital evidence, for example a hard disk. The latter refers to

the act of recording details relevant to the acquisition process and the evidence source.

By modularizing these two tasks, individual tool complexity is reduced, which has the

potential to increase reliability and enable testing at a more granular level. Bugs in the

consuming forensic tool (the annotation tool) are thus less likely to jeopardize the integrity of the product of the evidence acquisition task.

The tool annotates the information in the original sealed digital evidence bag

by asserting new properties and their values, related to the DEID of the particular piece

of information from the subject bag, as new RDF triples. These triples are stored in the

Tag File of the new unsealed DEB. In reference to the above example, the new data is

related to the instance representing the hard disk by means of its unique identifier. A

depiction of a portion of the RDF graph formed from the new information as well as

the original investigation documentation is presented in Figure 21.

29 A low level block oriented copying tool found on most UNIX variants.
30 A utility which queries information such as serial numbers, size, and addressing information from hard disks.


Figure 21: RDF graph resulting from addition of new documentation to embedded DEB

Modularity is facilitated not only in terms of interoperability between forensic tools, but also by modular composition of ontologies. In this way an organisation could

create its own specific ontology (say for the purpose of adding an organisation specific

identifier) which would seamlessly integrate with the existing RDF graph and ontology.

We allude above to a further ontology (fooPolice) which defines the fooPolice:jobID

property.
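A sketch of such an organisation-specific annotation is given below. The fooPolice prefix stands for the hypothetical organisation ontology alluded to above, the jobID value is invented, and the DEID reuses the FileImage identifier from the earlier examples in this chapter.

```xml
<rdf:Description rdf:about="urn:deid:isi.qut.edu.au:sha1:4056e4786fc460d9adbe98a0bc19b29a2104c476:image">
  <fooPolice:jobID>JOB-0042</fooPolice:jobID>
</rdf:Description>
```

On loading, this statement simply merges into the existing description of the FileImage instance, with no coordination between ontology authors required.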

6.5 Experimental results

To validate the effectiveness of the approach, two prototype SDEB aware

applications were created. The first, as mentioned previously, was an online acquisition

tool for acquiring Internet Explorer web browser cache and history logs. The second

prototype tool was a generic validation tool, the function of which was to validate the

integrity of an arbitrary SDEB.

The online imaging application was used to create an SDEB from a Windows

computer: many of the examples in Section 6.3 are taken from the produced SDEB.

The validation application was built in Java, using JENA as the RDF/OWL implementation. The validation application implemented the storage architecture by the

following process:

1. validate that the SDEB is a valid ZIP archive

2. load the Tag.rdf file into a JENA knowledge base

3. find an instance of DigitalEvidenceBag in the KB

4. find any associated EvidenceDocumentationContainers via the bagContents property

5. load the contents of all EvidenceDocumentationContainers into the KB, first ensuring their integrity by comparison with the associated message digest


The semantics of the information architecture are implemented by the nature of the JENA RDF/OWL implementation: on loading the RDF/XML, instances of the same surrogate residing in separate files are merged together where their DEID is the same.

This leads to one integrated KB of digital investigation related information from the

separate RDF files contained within the SDEB. Both the Digital Evidence and SDEB

ontologies are also loaded into the KB.
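The merge behaviour that JENA provides here can be illustrated independently of any RDF library as a union of (subject, predicate, object) triples keyed by DEID. This is a simplified sketch (blank nodes and datatypes are ignored), with property values drawn from the chapter's examples:

```python
def merge_graphs(*graphs):
    """Union RDF-style triples from several sources.

    Statements about the same DEID, even when loaded from separate
    evidence bags, accumulate on one merged description.
    """
    merged = set()
    for graph in graphs:
        merged |= set(graph)
    return merged

def describe(graph, subject):
    """All (predicate, object) pairs asserted about one subject."""
    return {(p, o) for s, p, o in graph if s == subject}

DEID = "urn:deid:isi.qut.edu.au:sha1:4056e4786fc460d9adbe98a0bc19b29a2104c476:image"
bag_one = [(DEID, "rdf:type", "de:FileImage")]          # from the first SDEB
bag_two = [(DEID, "de:acquiredBy", "Bradley Schatz")]   # from the composing SDEB
merged = merge_graphs(bag_one, bag_two)
```

Because the DEID is globally unique, the union of triple sets is all that is needed to associate the annotation with the original instance.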

The integrity of the Digital Evidence Content Files is then validated by the following process:

1. find any associated instances of DigitalEvidence by traversing the contains property

2. validate the integrity of the contents of the related Digital Evidence Content File by comparison with the associated message digest
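The integrity checks in the two processes above can be sketched with the Python standard library alone. This is a simplified model, not the prototype itself: the prototype was built in Java with JENA, RDF parsing is elided here, and the recorded_digests parameter stands in for the de:messageDigest assertions that would be loaded from the Tag File.

```python
import hashlib
import zipfile

def validate_bag(bag_file, recorded_digests):
    """Verify named members of an SDEB against their recorded digests.

    bag_file: path or file-like object for the bag's ZIP archive.
    recorded_digests: member name -> expected SHA-1 hex digest
    (in the prototype these come from de:messageDigest properties).
    Returns a dict of member name -> True/False.
    """
    results = {}
    # The bag must first be a well-formed ZIP archive; ZipFile
    # raises BadZipFile otherwise (step 1 of the process above).
    with zipfile.ZipFile(bag_file) as bag:
        for name, expected in recorded_digests.items():
            actual = hashlib.sha1(bag.read(name)).hexdigest()
            results[name] = (actual == expected)
    return results
```

A mismatch on any member indicates that either the content or the documentation has changed since the bag was sealed.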

The compositional nature of the architecture was validated by manually

generating a new digital evidence bag containing a single Investigation Documentation

File (in addition to the Tag File) which contained the information excerpted in Table 18. The new SDEB was loaded into the same KB as the previously described DEB, and the contents of the KB were dumped as RDF/XML and manually inspected. The JENA RDF

implementation had successfully merged the information from both evidence bags into

the KB, associating the investigation documentation describing who acquired the Image

with the related FileImage in the original SDEB.

Table 18: Annotated information from composing SDEB

<de:Image rdf:about="urn:deid:isi.qut.edu.au:sha1:4056e4786fc460d9adbe98a0bc19b29a2104c476:image">
  <de:acquiredBy>
    <foaf:Person>
      <foaf:name>Bradley Schatz</foaf:name>
      <foaf:mbox rdf:resource="mailto:[email protected]"/>
    </foaf:Person>
  </de:acquiredBy>
</de:Image>

6.6 Conclusion and future work

The contributions of this chapter are aligned with themes of representation and

assurance. Similar to the research presented in the previous chapter, the research

described in this chapter investigates whether the RDF/OWL formalism is a useful general representation upon which digital forensics applications may be built. In this

case, however, we focus on representing information related to the investigation, rather

than the evidence.


The first contribution is the proposal of a formal knowledge representation as a

means of documenting digital investigations, in order that digital forensics tools may

interoperate and evidence documentation may be simply integrated without human

intervention. We demonstrate that semantic markup languages, in particular

RDF/OWL, are a suitable common information representation layer for digital evidence

related information, and digital investigation information. The proposed approach

addresses the complexity problem by demonstrating a general approach to documenting

investigation related information and storing evidence, which at the same time

increases automation.

In this context, the instance based data model of RDF/OWL enabled the

definition of surrogates for arbitrary investigation related information. Compared to the

attribute-value data model implicit in prior approaches to evidence containers, this formal approach enables description of discrete entities, whereas prior approaches preclude describing more than the single implicit entity (in this case the single disk image in the container) due to their omission of objects or instances from their data model.

Proof of concept was demonstrated by way of describing the operation of a

prototype online acquisition application. We have focused on validating the approach

to representation in this work by building a simple set of digital investigation and evidence related ontologies, and a prototype acquisition tool, which are published at

http://www.isi.qut.edu.au/2005/sdeb/31. This ontology is however ad hoc, and we believe

that the field of digital forensics would benefit from a standardised ontology describing

its domain.

The RDF/OWL representation was sufficiently expressive to represent and

document all aspects of the investigation and evidence considered, with one exception: integrity information related to the Tag File. This required stepping

outside the representation to define a Tag Integrity File for storing integrity related

statements about the information stored in the Tag File. This was necessitated by a

fundamental problem in the RDF/OWL formalism with regard to making statements

about statements (or more formally, reified statements). A simple example of this is

statements along the lines of “John thinks that Mary likes Bill”.

We observe that in the digital forensics domain, the omission of reification

from a common representation will be detrimental because of the provenance related

31 The prototype implementation and ontology use the term ‘Evidence Metadata’ where we now use ‘Investigation Documentation’. This refinement in terminology is intended to signify the arbitrary information which may be related to the evidence by multiple layers of abstraction.


concerns of digital forensics. Reification will assist in making possible statements such

as “The pasco tool interpreted the following statements from file X”.

The second contribution is a conceptual advance, proposing an improvement

on the Digital Evidence Bag (DEB) proposal of Turner. Our proposal, which we call

the Sealed Digital Evidence Bag (SDEB) enables arbitrary composition of evidence

bags and information within evidence bags, without modifying any data in original

evidence bags. This proposal improves upon the DEB proposal by simplifying aspects

of evidence authentication.

Central to the compositional approach is our proposal of a globally unique

identification scheme for identifying digital evidence and related information, which

we dub Digital Evidence IDentifiers (DEID). This unique naming scheme enables

automated integration of information from separate evidence bags by the

implementation of the underlying knowledge representation layer. This demonstrates

that employing a knowledge representation as a common language for documenting the

digital investigation provides immediate benefits towards solving the complexity

inherent in integrating this information.

The final benefit of the SDEB approach is that it enables granular composition

and decomposition of evidence into a corpus of inter-related evidence bags, which

addresses the volume problem by facilitating automated validation and scalable

processing of evidence.

The next chapter addresses the theme of analysis techniques and evidence

assurance.


Chapter 7. Temporal provenance & uncertainty

“I used to be Snow White, but I drifted.”

(Mae West)

Chapters 4, 5, and 6 addressed themes of representation in computer forensics.

They did so at the documentation and evidence level, representing digital evidence and

the surrounding world, in a manner that is semantically crisp enough that both

machines and humans may unambiguously interpret such evidence. This chapter,

however, drops down a level to question the foundations of a particular part of

representation: time.

One of the key challenges in the field of digital forensics is “Meeting the

standard for scientific evidence”, which was the subject of Section 2.4.3. One of the

numerous aspects of this challenge is “Confidence and trust in results”[98]. This

chapter focuses on the trustworthiness of digital time-stamped data, and in particular,

assuring that digital timestamps might be reliably interpreted as times in the real world.

The chapter is structured as follows. Section 7.1 introduces the problem of

computer timestamps (related literature is described in Section 3.3). Section 7.2

presents the empirical results of a study identifying where the real-world behaviour of computer clocks diverges from the ideal, based on observations of computer clocks in a real-world Windows network. Section 7.3 proposes a correlation approach for

characterising the behaviour of a remote clock from client and server side logs. Two

algorithms implementing this approach are described and experimental results

evaluated. Section 7.4 compares the two algorithms and experimental results. Finally

Section 7.5 summarises the conclusions of the chapter and describes future work.

The research work described in this chapter led to the publication of the

following paper:

B. Schatz, G. Mohay, A. Clark (2006), ‘Establishing temporal provenance of computer event log evidence’, Proceedings of the 2006 Digital Forensics Workshop (DFRWS 2006), Lafayette, USA, and published as Digital Investigation, 3 (Supplement 1), pp. 89-107.

7.1 Introduction

The use of timestamps in digital investigations is fundamental and pervasive.

Timestamps are used to relate events which happen in the digital realm to each other

and to events which happen in the physical realm, helping to establish event ordering

and cause and effect. A well known difficulty with timestamps, however, is how to

interpret and relate the timestamps generated by separate computer clocks when they

are not known to be synchronized [128]. Commonly observed differences in time from computer to computer are caused by location-specific time variations (such as time zones), the rate of drift of the hardware clocks in modern computers, and misconfiguration and inadequate synchronisation.

Current approaches to inferring the real-world interpretation of timestamps assume idealised models of computer clock time, neglecting influences such as synchronisation and deliberate clock tampering. For example, to determine the clock

skew of a computer being seized, it is commonly recommended that a record be made

“of the CMOS time on seized or examined system units in relation to actual time,

obtainable using radio signal clocks or via the Internet using reliable time servers.”

[20]. CERT recommend that, “As you collect a suspicious system’s current date, time

and command history … determine if there is any discrepancy between the collected

time and date and the actual time and date within your time zone” [96].

While this approach will approximately identify the skew between the local

time and the observed computer time at the time of the check, it says nothing about the

passage of time on the computer’s clock prior to that point [141]. Uncertainty remains

as to the behaviour of the clock of the suspect computer prior to seizure. This further

leads to uncertainty as to what real world time to ascribe to any timestamp based on

this clock.

In this work we explore two themes related to this uncertainty. Firstly, we

investigate whether it is reasonable to assume uniform behaviour of computer clocks

over time, and test this assumption by attempting to characterise how computer clocks

behave in practice. Secondly, we investigate the feasibility of automatically identifying

the local time on a computer by correlating timestamps embedded in digital evidence

with corroborative time sources.


7.2 Characterising the behaviour of drifting clocks

Having identified that computer clocks are unreliable, we attempt here to

experimentally validate whether one can make informed assumptions about their

behaviour, as seems to be the current practice in forensic investigations. We do this by

empirically studying the temporal behaviour of a network of computers in a commonly

deployed small business environment.

7.2.1 Experimental setup

The subject of our case study is a network of machines in active use by a small

business. The network consists of a Windows 2000 domain, containing one Windows

2000 server, a Domain Controller (DC), and a variety of Windows XP and 2000

workstations. Access to the internet is provided by a Linux based firewall. In this case,

the Windows 2000 DC (the server) has not been configured to synchronize with any

reliable time source, and as such has been drifting away from the civil timescale for

some time. The Linux firewall also provides both a squid32 web proxy server, and an

NTP server, which is synchronised with a stratum 2 NTP server33. All workstations are

configured to use the squid proxy cache for web access.

Our goal here is to observe both the temporal behaviour of the Windows 2000

DC, and the effects of synchronization on the subordinate workstation computers. We

would expect that the timescales of the workstation computers would approximate that

of the DC, because of the use of SNTP in this network arrangement (see Chapter 3).

To observe this behaviour, we have constructed a simple service that logs both

the system time of the host computer and the civil time for the location, which we

obtain via SNTP from the local NTP server. The program samples both sources of time

and logs the results to a file. Figure 22 depicts the network topology and time related

infrastructure for this experiment.
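The core of such a service can be sketched as follows. This is a minimal model with both time sources injected as callables; in the deployed service the civil-time callable performed an SNTP query against the local NTP server, which is elided here.

```python
import time

def sample_skew(civil_time_fn, system_time_fn=time.time):
    """Take one paired sample of the two timescales.

    Returns (civil, skew), where skew = system - civil in seconds;
    a positive skew means the host clock is ahead of civil time.
    """
    civil = civil_time_fn()
    system = system_time_fn()
    return civil, system - civil

def log_skew(civil_time_fn, samples, interval_s, sink):
    """Sample repeatedly and emit 'civil_time skew' records."""
    for _ in range(samples):
        civil, skew = sample_skew(civil_time_fn)
        sink(f"{civil:.3f} {skew:+.3f}")
        time.sleep(interval_s)
```

Pairing each system-time reading with a civil-time reading taken at the same moment is what lets the later graphs plot skew against the civil timescale.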

32 http://www.squid-cache.org/ 33 Stratum refers to the distance from a reference clock.


Figure 22: Experimental setup for logging temporal behaviour of Windows PCs in a small business network

The logging program was deployed on all workstations and the server on the 1st

February 2006, and the results checked in mid-March. Unfortunately, the program was rendered short-lived, as a particular bug in the Windows service implementation of Python (the implementation language) saw the log service crash after writing 4 KB of debug messages to the standard output stream. On fixing the bug, a new version was

redeployed on the 21st March, 2006 for 20 days (until the 10th April), and then results

collected.

7.2.2 Analysis and discussion of results

The graphs presented below are based on the sampled timescales taken from

machines in the subject network. The x-axis is the time and date of the sample, taken

from the civil timescale, as served by the NTP server. The y-axis is the difference in

time between the system time and civil time at that moment, in seconds.

Figure 23 is the graph of results taken from the domain controller of the

Windows 2000 server based network. The solid line of samples shows a uniform drift

of the system time away from civil time for the time period 21st March through 10th

April. The other two sets of samples from the 1st February through the 21st March

2006 are samples taken by the initial version of the program in the time after a boot

(before the program crashed). In Figure 23 two clusters are visible outside the aforementioned line, one around the 1st February and one around the 13th March. These two


clusters indicate reboots at that time. Extrapolating the solid line shows the drift of the

server to be at a near uniform rate.

Figure 23: Clock skew of Domain Controller "Rome" offset from civil time.
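A drift rate like this can be estimated from the sampled (civil time, skew) pairs by an ordinary least-squares line fit. The sketch below is our own illustration; the thesis extrapolates the drift graphically rather than prescribing a fitting method.

```python
def drift_rate(samples):
    """Least-squares slope of skew against civil time.

    samples: iterable of (civil_time_s, skew_s) pairs.
    Returns the drift in seconds per day; a clock drifting at a
    near-uniform rate yields the same slope over any sub-interval.
    """
    pts = list(samples)
    n = len(pts)
    mean_t = sum(t for t, _ in pts) / n
    mean_s = sum(s for _, s in pts) / n
    num = sum((t - mean_t) * (s - mean_s) for t, s in pts)
    den = sum((t - mean_t) ** 2 for t, _ in pts)
    return num / den * 86400  # convert s/s of drift to s/day
```

Comparing the fitted slope over different sub-intervals gives a simple check of whether the uniform-drift assumption holds for a given clock.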

Figure 24 shows results taken from a Windows 2000 workstation called

Florence over the same period. It displays a general time drift trend which matches that

of the domain controller. The faulty logging service has generated far more samples

than were generated in the case of the server. This is caused by the habitual shutting

down of the computer by the user at the end of each work day, resulting in a number of

samples generated every time the machine reboots (before the initial version of the

program crashes).


Figure 24: Clock skew of workstation "Florence" offset from civil time.

The scale of the graph is misleading as to the number of outlier values present

from 8:19:34AM through 8:25:56AM on the 20th February. The cross at 0 skew

actually represents 38 outlier values, which do not fit a model of time where the clock

is synchronised to the DC. It seems highly irregular that during this period the machine

became synchronized to within one second of Civil Time (a timescale to which the network in question has no configured reference).

The default auditing configuration of the Windows network failed to include

the necessary privilege to identify whether this was user instigated. The accuracy to

which the clock became synchronised with civil time leads us to suspect that this was

not the result of user interaction; rather that it was the action of some program which

had access to an external, reliable time source. The Windows Update service was active during this period; we speculate that it was the cause of the synchronisation with the Civil Timescale.

The graph presented in Figure 25 shows the skew data taken from a Windows

XP workstation named Milan.34 Again the drift rate generally remains constant, and

correlated with that of the server; however there are two sets of anomalies which

deviate from this general trend. Immediately noticeable are the almost vertical lines

which indicate a resynchronisation with the DC timescale from wide time skews. We

speculate that these features indicate a computer reboot immediately before. The

second anomaly is the two peaks on the graph around the 6th and 7th April.

34 The scale of this graph differs from the previous graphs to present a clearer view of the features under discussion. We note that the overall form of the graph when taken at the previous scale follows the same gradient and offset.


On closer investigation, the vertical line on the 4th of April reveals that over a

period of 22 minutes and 0 seconds of real time, the system clock only advanced 20

minutes 51 seconds. In total the system clock loses 1 minute 9 seconds over this period.

This behaviour occurred in small, incremental changes and is consistent with the disciplining of a skewed clock back into synchronisation with a trusted source.
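The figures above can be checked directly; the short computation below reproduces the quoted loss and gives the effective rate of the clock while it was being disciplined.

```python
real_elapsed = 22 * 60          # 22 minutes 0 seconds of real time
clock_elapsed = 20 * 60 + 51    # system clock advanced 20 minutes 51 seconds
loss = real_elapsed - clock_elapsed   # 69 s, i.e. 1 minute 9 seconds
rate = clock_elapsed / real_elapsed   # clock ran at roughly 94.8% of real time
```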

Figure 25: Clock skew of workstation "Milan" offset from civil time (zoomed).

A check of this workstation’s RTC via the BIOS configuration interface a few days after the results were collected revealed that the RTC was minutes ahead of the system time measured just moments before. It would appear from this that

either Windows XP does not update the RTC, or that update of this particular RTC

failed. Interestingly, we see similar behaviour for the PC named “Trieste” (shown

below in Figure 26) which is the only other Windows XP host on the network. All four

other workstations (which are running Windows 2000) do not exhibit this behaviour.

The near linear relationship of the lower ends of the vertical reboot lines may indicate

the rough drift of the RTC.35

35 As the focus of this section is on describing observed deviations of Windows based clocks from the ideal, we leave experiments which conclusively determine the behaviour of the Windows XP clock at boot to others.


Figure 26: Clock skew of workstation “Trieste” offset from civil time.

The graph in Figure 27 combines data from the DC and Milan (Figures 23 and 21) for the period where peaks are seen in the skew graph.36 We can see here that the DC maintained a stable timescale for the period (part of its data, the points forming a thin line, shows through under the peak), while Milan drifted away sharply at the peaks. At the start of the peak we can see that Milan began drifting away from the DC at a rate of around 1 second every 14 minutes, before re-synchronising with the DC.

Figure 27: Clock skew of "Rome" vs. "Milan" offset from civil time (zoomed).

Investigation of the event logs of the computer “Milan” revealed an inordinate

number of Print subsystem warning events in the system logs (which appeared to

indicate repeated retries of installation of a print driver) before this time. No other

36 Note that colour would help with this graph. Rome data is plotted as points and Milan as crosses.


events of interest were found. This drift is unlikely to have been the result of a single operator action, as the corresponding change in skew would have been immediately visible as a discontinuity between two points.

The remaining three workstations stayed synchronised with the DC, with no

temporal anomalies observed. The skew timelines for these were similar to Figure 23.

For reasons of brevity they are not reproduced here.

From these results we draw a number of conclusions. In general, we find that

Windows hosts (2K and XP) integrated with a Windows based time synchronisation

network will stay synchronised. The anomalies observed above, however, indicate that

making reliable statements about the timescale of a particular workstation computer

within a Windows Domain network (and as such the interpretation of timestamps from

these workstations) is problematic.

Windows computers not in a Domain network, either untethered from reliable sources of time (such as Windows 2000 hosts) or loosely tethered (such as computers running the XP OS), may suffer from the same problem. Indeed, as XP hosts are tethered to synchronise with time.windows.com on a far less frequent basis (weekly),

there will be larger periods of de-synchronisation. The observation that the host

“Milan” became synchronised with civil time for a period, and the further observation

of it drifting away from the DC timescale and civil time (for no observable reason)

indicate that other factors are influencing the behaviour of the clock.

7.3 Identifying computer timescales by correlation with corroborating sources

Given our uncertainty with respect to the timescale of a particular computer (as

identified in the previous section), we seek automated methods for identifying the

temporal behaviour of a computer. In this section we describe an automated approach

which correlates timestamped events found on a suspect computer with timestamped

events from a more reliable, corroborating source.

Web browser records are increasingly employed as evidence in investigations, and are a rich source of timestamped data. The ISP-side records which correspond to these are proxy logs. We expect that the common practice of deploying transparent proxies by ISPs will see a greater availability of this kind of event log, and corresponding interest in their use as evidence by law enforcement.

Because of the increasing ubiquity of web browsers, we have chosen to use the

web browser and proxy records as data sources for use in characterising temporal

behaviour. We expect here that in the process of an investigation, proxy logs which


relate to a suspect computer may be obtained from the ISP which has served as the

computer’s gateway to the Internet.

We assume that these records on the proxy are produced by a computer which is synchronised with an accurate time source. While this might not at present be a generalisable assumption, we look towards a near future where the provenance of audit records receives closer attention from ISPs and business in general, as forensic preparedness finds its way onto the agenda for compliance reasons, among others.

7.3.1 Experimental setup

Our experimental setup uses the same infrastructure which was used in the first

study. Relevant to this experiment is the deployment of the Microsoft Internet Explorer

web browser on all Windows based machines, and the presence of a Squid HTTP proxy

on the firewall, which the computers are configured to use to access the web. The

experimental setup is depicted in Figure 28.

Figure 28: Experimental setup for correlation

This experiment takes the browser records from the machines in the network and correlates them with the proxy entries from the squid log, to determine the temporal behaviour of the Windows machines on the network. The correctness of the correlation techniques is evaluated using the data collected from the previous experiment.


7.3.2 Challenges in correlating browser and squid logs

IE stores records of browsing access in two subsystems: the cache and the history. These records are all stored in separate files, each called index.dat, located in different directories.

The IE cache subsystem stores locally cached copies of web content, such as pages and images, in files with extensions such as .jpg amongst others. An index mapping web addresses to these locally stored copies is kept in a file called index.dat. The cache index files contain entries for all cacheable resources visited, including the component files of a particular viewable page (for example, images, sounds, and flash animations).

The history subsystem creates a historical record of URLs visited over time in a

set of index.dat files. Three separate types of history file are kept: the root history, daily

sort history and weekly sort history. Within these files are records of visits to top level

viewable pages:

• Pages visited by typing a URL

• Pages visited by clicking on a hypertext link

• Documents opened within Windows Explorer by double clicking (e.g. .xls, .doc, …)

The cache and history index.dat files all share a similar undocumented binary file format. Despite the lack of official documentation, a number of reverse engineering analyses of the format have been published, and a number of tools are available which will interpret the content of these files. For a good description of the file format, especially notable in distinguishing some subtle semantic differences in interpreting the timestamps in these records, see [20].

We initially used the Pasco [59] tool for extracting the data contained in these files. We chose this tool as it had freely available source code. In practice, however, we suspected that it was generating spurious results. This prompted us to perform our own reverse engineering effort. Our new tool identified a bug in Pasco whereby a spurious record was generated by an unchecked file read at an offset outside the bounds of the file.37

The squid proxy cache logs a record of all web transactions which it processes in a file called access.log. This is a textual log file. The fields of interest to us are the resource access time (which, like the timestamps in the IE index files, marks the end of the transaction) and the URL visited.
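For illustration, a minimal sketch of extracting these two fields from a native-format squid access.log line follows (the sample line is invented; the field positions assume squid's default native log format):

```python
from datetime import datetime, timezone

def parse_access_log_line(line):
    """Extract (access_time, url) from a native-format squid access.log line.

    In the native format, the first field is the transaction end time as a
    UNIX epoch value with millisecond resolution, and the seventh field is
    the requested URL.
    """
    fields = line.split()
    when = datetime.fromtimestamp(float(fields[0]), tz=timezone.utc)
    return when, fields[6]

# An invented sample line in squid's native format.
line = ("1143897156.123    456 192.168.0.5 TCP_MISS/200 1024 "
        "GET http://example.com/index.html - DIRECT/10.0.0.1 text/html")
when, url = parse_access_log_line(line)
print(url)   # http://example.com/index.html
```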

37 Our new parsing tool, imaginatively named pasco2, is available at http://www.bschatz.org/2006/pasco2/


Our experiment involves translating the web browser records and squid logs

into a common representation and matching entries from the two sources based on the

URL visited. We assume that the last accessed time from the squid record is relative to

civil time (kept tightly synchronised using, for example, NTP), and compare that time

with the last accessed time from the corresponding history or cache record.
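The per-record comparison reduces to a simple skew computation (a minimal sketch; the example timestamps are invented):

```python
from datetime import datetime, timezone

def skew_seconds(history_last_accessed, squid_time):
    """Skew of one matched record: positive means the workstation clock
    was ahead of the proxy's civil-time-synchronised clock."""
    return (history_last_accessed - squid_time).total_seconds()

# Invented timestamps for a single matched URL visit.
h = datetime(2006, 4, 4, 10, 0, 3, tzinfo=timezone.utc)   # IE history record
s = datetime(2006, 4, 4, 10, 0, 1, tzinfo=timezone.utc)   # squid access.log
print(skew_seconds(h, s))   # 2.0
```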

The primary challenge related to correlation is in determining which entry in

the Squid cache log corresponds to a particular entry in the cache or history records. As

IE records are most recently used (MRU) records, there will not be a one to one

mapping between history entries and Squid events. We illustrate this with the following

example.

Figure 29: Matching is complicated by only the most recent record present in the history.

Figure 29 depicts the relationship between records of visits to a particular page

over two days. On the first day, the user has visited the site once, and on the second day

has visited the site a further three times. As the history is a MRU record, the visits at

7:00 and 7:36 are absent from the IE history, with the visit at 8:21 being the only record

left for that day. Simple matching based on the URL field of each record will result in 4

potential matches for each history file record. The addition of further history records

and cache records related to visits to this URL complicates the matter even further. Our

correlation approach must in this case determine which potential match is the correct

match.

7.3.3 Analysis methodology

For the two algorithms explored, the sampled timescales from the previous

experiment in Section 7.2 are used as a baseline for determining which matches are true

or false. True positives are data points output by the correlation algorithm which

correlate with the timescale identified in Section 7.2. False positives are matches

generated by the algorithm which do not correlate with the timescale. True negatives

are prospective matches that are rightly discarded by the algorithm. False negatives are


data points which would correlate with the timescale, but which the algorithm misidentifies as not correlated.
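This classification can be sketched in code (a simplified illustration; the tolerance threshold and the skew values are our own invention, not figures from the experiment):

```python
def classify(match_skews, baseline_skew, tolerance=2.0):
    """Classify correlated skew values against the sampled baseline timescale.

    A match whose skew lies within `tolerance` seconds of the baseline is
    counted a true positive; otherwise it is a false positive. (The tolerance
    value here is hypothetical.)
    """
    tp = [s for s in match_skews if abs(s - baseline_skew) <= tolerance]
    fp = [s for s in match_skews if abs(s - baseline_skew) > tolerance]
    return tp, fp

# Two matches near the baseline, one wild outlier.
tp, fp = classify([-0.5, 0.3, 41.7], baseline_skew=0.0)
print(len(tp), len(fp))   # 2 1
```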

7.3.4 Clickstream correlation algorithm

Our initial approach to correlation is based on the concept of a clickstream. We

borrow this term from the web content industry, where it refers to the path taken by an

individual visitor navigating a website, in a particular session. Our hypothesis was that

as a user navigates through a website, the time taken to read each page, and select a link

and follow it, and so on, would lead to a set of unique timing characteristics between

page visits for a particular clickstream.

We define a clickstream as a time ordered sequence of page hits within a

website. The intra-hit time is the time period between two successive object access

events in a clickstream. We constrain the definition of clickstream such that the intra-

hit time for successive hits is within max seconds of each other and further than min

seconds apart. We define a maximum limit so that we may disambiguate sessions. The

function of the minimum is described further below. Finally, the dimensions of a

clickstream are the ordered set of intra-hit times, the unique timing characteristics

between page visits.
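The segmentation of hits into clickstreams and the computation of their dimensions might be sketched as follows (a simplified rendering; the min/max values and the sample timestamps are illustrative only):

```python
MIN_GAP = 1.0        # seconds; below this, sub-resource noise dominates
MAX_GAP = 20 * 60.0  # seconds; beyond this, we treat the hit as a new session

def clickstreams(hit_times):
    """Split a time-ordered list of page-hit timestamps (in seconds) into
    clickstreams, returning each stream's dimensions: the ordered intra-hit
    times between successive hits."""
    streams, current = [], [hit_times[0]]
    for t in hit_times[1:]:
        gap = t - current[-1]
        if MIN_GAP < gap <= MAX_GAP:
            current.append(t)
        else:
            if len(current) > 1:
                streams.append([b - a for a, b in zip(current, current[1:])])
            current = [t]
    if len(current) > 1:
        streams.append([b - a for a, b in zip(current, current[1:])])
    return streams

# One session of four hits, then a new session after a long gap.
print(clickstreams([0.0, 12.0, 47.0, 110.0, 5000.0, 5030.0]))
# [[12.0, 35.0, 63.0], [30.0]]
```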

The algorithm attempts to fit a clickstream identified in the web browser

records to corresponding events in the squid logs. The heuristic here is that the longer

the clickstream, the more unique the dimensions of it will be, thus giving a single

unique match when fitting to the other event stream. This is, however, complicated by

the presence of sub-page resources such as images and flash content which is

immediately loaded after the page: the timing characteristics of these are less unique.

Figure 30: Correlated skew (clickstream) vs. experimental skew (timeline) for host “Milan” do not correlate because of the presence of false positives.


Figure 30 and Figure 31 are of a clickstream correlation run, graphed with the

timescale log of the workstation “Milan”. The clickstream correlation dataset in Figure

30 is graphed as crosses. It contains 75 results, in which we can see 4 clusters of clickstream results.38 Clearly there is conflicting data. The two clusters visible off the timeline actually contain 5 false positive values, which are causing the problem. These 5 values are known to be false positives because our earlier experiment established the actual time on the computer clock at each point, which is plotted on the graph as dots. Removing these false positives from the result set results in the graph

labelled Figure 31, where we can see tight correlation with the workstation’s timescale.

Figure 31: Correlated skew vs. experimental skew for host “Milan” correlates when false

positives are removed.

The results of running the same correlation algorithm on the host “Pompeii”, which generated far less web traffic over the period, are presented in Figure 32. In this case the clickstream correlation algorithm produces no false positives.

38 We note here that a colour graph would be more illuminating, as the timeline values on the graph dominate. The apparent line on y-axis 0 is actually the individual timeline samples (which are graphed as dots) merging to form a solid line. Three clusters are visible about the 11/04 x-axis coordinate, and a further cluster is visible just after 07/04 on the 0 y-axis.


Figure 32: “Pompeii” cache correlation.

7.3.5 Results

In practice the rate of false positives increased when comparing intra-hit times below one second. We expect that this is because measurement error becomes more pronounced as the intra-hit time shrinks. Values of around 20 minutes for the maximum intra-hit time, and values of over 1 second for the minimum, produced clickstreams with the best uniqueness properties (as measured by a reduction in the rate of false positives).

Modifying the algorithm to filter clickstream acceptance on clickstream length produced a similar effect on the false positive rate; however, with clickstreams of larger size the rate of true positives falls off quickly, and the rate of false negatives becomes high.

The algorithm performed far better on cache records than on history records.

We expect that this is caused by the difference in granularity of record keeping in the

sources. As the cache stores cache records both for top level web pages and component

content such as images, style sheets and the like, clickstreams are more likely to be

formed. The IE history subsystem only records the top level page views, so is less

likely to produce long clickstreams in situations where users do not heavily explore

websites.

Designing an algorithm which eliminates these false positives is complicated

by the fact that the last access timestamp of any particular cache record is unreliable, as

it may have been accessed more recently by the user (before the cached content

expired). In this case, no corresponding Squid event would be logged even though the cache record timestamp is updated, thus introducing a skew into the expected offset of the matching Squid event.

For this reason, we set about identifying a means of identifying IE records

which must have been requested via the Squid proxy and not from the local cache.

7.3.6 Non-cached records correlation algorithm

After some further investigation, documentation of another effort at reverse

engineering the index.dat format came to light [134]. This work identified another field

in the IE History record which recorded the total number of accesses to a particular web

resource. For records where this field has a value of one, we can be sure that there has

only been one access and that the record has come directly via the squid proxy.

Our new algorithm reduces uncertainty by choosing only history records which

must have come directly via squid, bypassing the local cache. Furthermore it places a

high value on matching entries for which there exists only one corresponding match in

the squid log.

The algorithm is defined as follows:

• All matching history and squid records with common IP and URL are

found, each of these matches is called a history-squid tuple.

• A subset of these tuples, called the base set is identified, where for

each matched URL, only one history-squid tuple exists and the history

record is “non-cached”.

• We call the set of remaining history-squid tuples the remainder set.

• The base-set mean is the mean of the skews of each history-squid tuple

in the base set.

• We further cluster the remainder set based on URL of the history-squid

tuples (we note here that for a particular URL visit, we might have

multiple history records and multiple squid records). We call each

cluster of history-squid tuples with a common URL within this

remainder set a remainder set cluster.

• For each of remainder set clusters, we find the history-squid tuple with

a skew closest to the base set mean, and add it to the initially empty

inferred set, discarding the rest of the tuples from that cluster.

• The results of this algorithm are the union of the base set and the

inferred set.
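The steps above can be sketched as follows (a simplified rendering of the algorithm; the tuple layout and the example values are our own):

```python
from statistics import mean

def correlate(tuples):
    """tuples: list of (url, skew_seconds, access_count) history-squid matches.

    Implements the base-set / inferred-set selection described above:
    unique non-cached matches form the base set; for each remaining URL
    cluster, the tuple with skew closest to the base-set mean is inferred.
    """
    by_url = {}
    for url, skew, count in tuples:
        by_url.setdefault(url, []).append((skew, count))

    base, remainder = [], {}
    for url, matches in by_url.items():
        if len(matches) == 1 and matches[0][1] == 1:   # unique and non-cached
            base.append(matches[0][0])
        else:
            remainder[url] = [skew for skew, _ in matches]

    base_mean = mean(base)
    inferred = [min(skews, key=lambda s: abs(s - base_mean))
                for skews in remainder.values()]
    return base + inferred

result = correlate([
    ("http://a.example/", 1.9, 1),    # base set: unique, access count 1
    ("http://b.example/", 2.1, 1),    # base set
    ("http://c.example/", 2.0, 3),    # ambiguous URL: pick the tuple whose
    ("http://c.example/", 55.0, 3),   # skew is nearest the base-set mean
])
print(sorted(result))   # [1.9, 2.0, 2.1]
```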


7.3.7 Results

In practice this algorithm produces a set of data which correlates well with the

timescales produced by our previous experiment. For example, Figure 33 is a graph of

the output of the algorithm described above overlaid over the timescale for host Milan

obtained from the previous experiment.

Of 1188 unique history records, 821 history-squid tuples were identified. One would expect the number of history-squid tuples to be higher; however, URLs with encoded GET requests are not matched, because of squid’s anonymised logging of this kind of URL.

In practice there is a sufficiently high proportion of non-cached hits in the history for our algorithm to work effectively. The algorithm identifies 304 potential non-cached matches, and from these a base set of 110 matches. In total the algorithm generates 134 data points (see Figure 33).

Figure 33: History Correlation vs. Timescale.

Comparison with the sampled timescale reveals numerous false positives

(around 15-20) for which we have no explanation.

We are confident that the algorithm generates many false negatives caused by

its simplicity in selecting non-cached hits for the base set. A more comprehensive

algorithm would in addition to finding history records with an accessed count of 1, use

the temporal ordering relationship between the history record sets. For example, say the

oldest weekly sort file contains a record for a particular URL with an accessed count equal to one. If a newer sort file contains a record for the same URL, with an accessed count of two, then one can be sure that the newer record corresponds to a non-cached access.
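Such an extension might be sketched as follows (a hypothetical record layout; the logic only illustrates the accessed-count argument above):

```python
def noncached_visits(sorted_records):
    """sorted_records: per-URL (timestamp, access_count) pairs ordered oldest
    to newest, drawn from successive history sort files. Whenever the access
    count increments by exactly one between files, the newer record must
    correspond to a non-cached access via the proxy."""
    visits = []
    prev_count = 0
    for ts, count in sorted_records:
        if count == prev_count + 1:
            visits.append(ts)
        prev_count = count
    return visits

# Count goes 1 -> 2 across two weekly sort files: both visits are non-cached.
print(noncached_visits([(100, 1), (900, 2)]))   # [100, 900]
```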


7.4 Discussion

In this section the two algorithms are compared, and the general problems

related to correlating these types of event logs are outlined.

Of the two algorithms, the history correlation algorithm performed better. It generates results which cover a far wider period of time than the cache oriented algorithm, giving greater insight into the temporal behaviour of the computer. Furthermore, its ratio of true to false positives is far higher.

The history algorithm was originally the worse performing of the two approaches. At that point in time, determining whether the high rate of false positives was caused by a tool implementation error or by an error in the correlation algorithm was problematic. Boyd’s paper [20] was essential in identifying that our interpretation of the weekly history timestamps was mistaken. Despite having re-implemented a new set of index.dat file parsers (and discovered a third timestamp in the history records39), we had still used the semantics defined by the pasco tool. Our model was corrected to treat the first timestamp in the weekly sort history record as the accessed time, offset by the local time zone offset in operation. This resulted in the high rate of true positives and low rate of false positives seen previously in Figure 33.

Both approaches to developing a correlation algorithm outlined above make a

closed world assumption – that the algorithm has access to all of the information that it

needs. In practice, development of the algorithm was complicated by this not being the

case. Consider for example Figure 34, which was generated using the same history

correlation algorithm as that seen in Figure 33. The input to the algorithm was however

a dataset which omitted a particular squid access log.

Strong correlation with the computer’s timescale is evident; however, there are

in this case false positives at the extremes of the graph. Examination of these false positives indicated that they related to records from the omitted squid access log, which had been excluded from the correlation run for processing speed reasons. The omission of these records resulted in the algorithm picking a match from another squid log file, with a far greater offset. Adding the excluded log produces the results seen previously in Figure 33.

39 A 32-bit MS-DOS timestamp was identified at offset 0x50 within the history record. Within the root history file, this timestamp is interpreted as the last accessed time, as is apparent by comparing it with the last access time shown in the Internet Explorer history viewer. In practice, the value is always a small amount after the 64-bit FILETIME based last accessed time.


Figure 34: Incomplete information

There is another problem related to the closed world assumption. By assuming

that all data is present, we assume a perfectly functioning logging system in Internet

Explorer and Squid.

The corollary of this assumption is that we assume a perfect implementation of our index.dat parser: that we have interpreted the semantics of the records correctly, and have avoided bugs in our implementation. Clearly, this is not a valid assumption, given the challenges in reverse engineering the file format and the inevitability of bugs.

We expect that the false positives present in the history correlation algorithm are

attributable to these.

Despite the challenges outlined above in correlating IE History and Cache records with Squid access logs, we find that we are able to produce from the correlation a dataset which agrees reasonably with the timescales sampled in the first experiment.

We expect that far higher rates of true positives are possible; both algorithms

ignore large parts of the dataset, as our heuristics and rules only apply to a small

proportion of the dataset where we can infer certainty in matches. Algorithms which

model uncertainty in matching records and incorporate probabilistic methods hold

promise towards this goal; Markov Chain Monte Carlo methods have been identified as a

potential approach. We expect that the principles underlying Gladyshev and Patel’s

[47] event bounding approach could have relevance.

7.4.1 Relation to existing work

We compare our approach here to the two closest approaches identified in the

literature, which are summarised in Section 3.3.4.


The approach of Gladyshev and Patel [47] differs from ours in that we deal predominantly with events which do have a timestamp, but where there is uncertainty as to the real world time to which it corresponds. The approach taken by Gladyshev and Patel instead tries to find the temporal bounds of an event which may or may not have a timestamp associated with it.

Our work has similar objectives to that of Weil [141], but differs significantly in two respects. Firstly, we investigate the degree to which timescales are unstable. Secondly, Weil’s approach relies on manual classification of cached web pages as dynamically or statically generated, because the technique depends specifically on dynamic content in order for the embedded timestamps to be interpreted. In addition, we present two algorithms which enable the automatic determination of the behaviour of a suspect computer’s clock by comparison with a commonly logged corroborative source.

7.5 Conclusions

This chapter has investigated a key problem which lies at the foundations of

evidence representation: how to assure the reliability of timestamps found in digital

evidence. The contributions of this chapter are aligned with the theme of assurance, and

tangentially, representation.

The first contribution is an analysis of the temporal behaviour of PC clocks as generally implemented in the Windows operating system, and empirical results demonstrating the unreliability of timestamps sourced from Windows based computers. This was presented in Section 7.2.

The second contribution, presented in Section 7.3, demonstrated the feasibility

of automatically characterising the temporal behaviour of a computer by correlating

timestamps embedded in digital evidence with corroborative time sources. Two

algorithms were proposed and evaluated, and experimental results were presented

which demonstrate that the latter algorithm produces outputs which correlate

reasonably with the timescales of the subject computers. We have additionally

described how the history correlation algorithm could be modified to produce a higher

rate of true positives.

There are a number of areas where future work is warranted. First, in order that

results based on this kind of correlation may be more clearly interpreted and explained

in forums such as courts of law, a means of qualifying and quantifying the error

involved would be of use. Second, in order that the resolution of the characterised

timescales may increase, improved algorithms which incorporate uncertainty in record

matching should be investigated. Finally, the Internet Explorer index.dat file format is


still not fully understood. We expect that a clearer understanding of the file format

would lead to a reduction in errors.


Chapter 8. Conclusions and future work

“(I am) acutely aware of the difficulties created by saying that when Aristotle and Galileo looked at swinging stones, the first saw constrained fall, the second a pendulum. Nevertheless, I am convinced that we must learn to make sense of sentences that at least resemble these.”

(The Structure of Scientific Revolutions, Thomas Kuhn)

A widespread migration of communication and publishing from analogue to

digital formats is occurring. It has been reported in 2001 that at least 93 percent of

information created was in digital form, and in 2000, that 70 percent of corporate

records were kept in digital format [107]. The effect of this migration is that digital

information is increasingly presented as evidence in legal and other proceedings. A

revolution in the way that courts of law, and law enforcement treat, and view evidence

is underway.

The nature of digital technologies and information in digital formats is

markedly different from traditional evidence forms, because of the latent nature of the

information in digital data, the capacity for possibly undetectable modification and,

conversely, the capacity for perfect copying. Existing approaches to evidence are being

reinterpreted in this new technical context, and new techniques for interpreting relevant

information derived from digital evidence are the focus of a new field currently called

digital forensics.

The work described in this dissertation examines at a fundamental level the

role of representation in interpreting and analysing digital evidence, identifying where a

formal approach to documenting digital investigations and digital evidence reduces the

complexity and volume problems in the field. Additionally, the work identifies flaws in

fundamental assumptions in the interpretation of temporal evidence, and proposes a

novel method of characterising the temporal behaviour of hosts.


8.1 Summary of contributions and achievements

As previously summarised in Section 1.2, the principal achievements and

contributions of the dissertation include the following:

• Proposition of formal knowledge representation as an approach to solving

current digital forensics problems of complexity and volume;

• Demonstration of the usefulness of a particular representational formalism,

RDF/OWL, in representing arbitrary and diverse information implicit in event

log based evidence, investigation related documentation and wider domain

knowledge. This is demonstrated in the context of building improved forensic

correlation tools, and in building interoperable forensics tools and digital

evidence storage formats;

• Identification of particular areas of the digital forensics domain where the

RDF/OWL formalism is insufficiently expressive;

• Demonstration of a novel analysis technique which supports automated

identification of high level forensically interesting situations by means of

heuristic event correlation rules which operate over general information;

- A novel means of addressing the problem of surrogate proliferation,

improving automated correlation by interactive (human guided)

declaration of hypothetical equivalence relationships between

surrogates;

• Proposal of a novel architecture for containers of digital evidence and arbitrary

investigation related information, in a manner that enables composition of

evidence units and related information into a larger corpus of evidence, while

assuring the integrity of evidence;

- Definition of a unique naming scheme for identifying digital evidence

which enables separate and subsequent addition of arbitrary

information without violating the integrity of original evidence;

• An analysis of the temporal behaviour of PC clocks as generally implemented

in the Windows OS and empirical results demonstrating the unreliability of

timestamps sourced from Windows-based computers; and

- A novel approach for characterising the temporal behaviour of a host

based on correlating commonly available local timestamps and

timestamps from a reference source.


8.2 Discussion of main themes and conclusions

The work described in this dissertation has examined at a fundamental level the

nature of digital evidence and its use in digital investigations, following three

interwoven themes: representation, analysis techniques, and information assurance.

8.2.1 Addressing complexity and volume of digital evidence

Chapter 3 concluded that the field of digital forensics might benefit from the

application of formal knowledge representation to digital evidence and digital

investigations. Chapter 4 investigated the history of formal representation, in the

context of Knowledge Representation and Semantic markup languages, introduced the

RDF/OWL formalism, and proposed that this formalism would be of benefit to

addressing the complexity and volumes in forensic event correlation.

Chapter 5 investigated using this formalism in the context of event correlation

for forensic purposes. The primary outcome of this chapter was to show that the

RDF/OWL formalism is useful as a general representation and is expressive enough to

represent and integrate digital evidence sourced from disparate arbitrary event oriented

sources, composite and abstract events corresponding to higher level situations, and

entities referred to in those events. This was demonstrated by building tools which

translated heterogeneous event logs into the formalism.
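The translation step above can be caricatured in a few lines. The sketch below uses a toy in-memory list of subject-predicate-object triples in place of a real RDF library, and every vocabulary term (the namespace, `LoginEvent`, `occurredAt`, and so on) is invented for illustration rather than taken from the thesis's actual ontologies.

```python
# Toy illustration of mapping a parsed event-log record onto triples.
# The namespace and property names are hypothetical, not the thesis's vocabulary.
EX = "http://example.org/forensics#"

def translate_login_record(record_id, user, host, timestamp):
    """Map one parsed log record to subject-predicate-object triples."""
    event = EX + "event/" + record_id
    return [
        (event, EX + "type",       EX + "LoginEvent"),
        (event, EX + "user",       EX + "user/" + user),
        (event, EX + "host",       EX + "host/" + host),
        (event, EX + "occurredAt", timestamp),
    ]

graph = []
graph += translate_login_record("1001", "alice", "ws01", "2007-03-01T09:15:00Z")
graph += translate_login_record("1002", "bob",   "ws02", "2007-03-01T09:20:00Z")

# Once heterogeneous logs share one representation, a correlation tool can
# query by predicate regardless of each record's original format.
logins = [s for (s, p, o) in graph
          if p == EX + "type" and o == EX + "LoginEvent"]
```

The point of the sketch is only that, after translation, events from disparate sources become uniformly queryable data.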

The second outcome was to show that the formalism is useful for building tools

which analyse such information. This was demonstrated by building automated

correlation tools which automatically identified forensically interesting scenarios from

event log based evidence based on heuristic rules. This was additionally demonstrated

by the ease with which investigator hypotheses regarding entity identity could be used

to solve the problem of surrogate proliferation, reducing the volume of entities under

consideration.
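The effect of an investigator-declared equivalence can be sketched with a union-find structure: asserting that two surrogates denote the same entity (in the spirit of owl:sameAs) collapses them for all subsequent correlation. The surrogate names below are invented for the illustration.

```python
# Hedged sketch of resolving surrogate proliferation via declared equivalences.
parent = {}

def find(x):
    """Return the canonical representative of x's equivalence class."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path compression
        x = parent[x]
    return x

def assert_same(a, b):
    """Investigator-declared hypothetical equivalence (cf. owl:sameAs)."""
    parent[find(a)] = find(b)

# Three surrogates that (hypothetically) denote one person in different logs.
surrogates = ["dhcp:192.168.0.7", "smb:WORKGROUP\\alice", "mail:alice@example.org"]
for s in surrogates:
    find(s)

assert_same("dhcp:192.168.0.7", "smb:WORKGROUP\\alice")
assert_same("smb:WORKGROUP\\alice", "mail:alice@example.org")

distinct = {find(s) for s in surrogates}   # entities left under consideration
```

Three surrogates reduce to one entity under consideration, which is the volume reduction the paragraph describes.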

A final outcome of this chapter is the identification of areas where the formalism is insufficiently expressive.

Chapter 6 showed that formal representation is useful in documenting digital

investigations and sharing digital evidence. This was demonstrated by the proposal of

an improved approach to digital evidence containers which enables more scalable

processing of evidence, extensible integration of arbitrary information, and simplified

evidence authentication.
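One ingredient of such a container, a stable hash-derived name for an evidence unit, can be sketched as follows. The `evidence:` prefix and the annotation structure are illustrative assumptions, not the actual naming syntax of the proposed containers.

```python
import hashlib

# Sketch of a content-derived unique name for an evidence unit. Because the
# name is a function of the original bytes only, arbitrary information can be
# attached later, keyed by the name, without touching the original evidence.
def evidence_name(data: bytes) -> str:
    return "evidence:sha256:" + hashlib.sha256(data).hexdigest()

original = b"\x00\x01\x02 raw disk image bytes (placeholder)"
name = evidence_name(original)

# Later annotations live in a separate, extensible document keyed by the name.
annotations = {name: ["acquired 2007-03-01 by examiner X"]}
annotations[name].append("contains NTFS volume")

# The original bytes, and hence the name and integrity check, are unchanged.
assert evidence_name(original) == name
```

This is how composition and integrity can coexist: new documents reference the fixed name rather than modifying the named data.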


8.2.2 Assurance of fundamental temporal information

In Chapter 7, the focus moved away from wider representational issues to the low level interpretation of digital timestamps. Empirical results were presented

which draw into question commonly made assumptions about the passage of time on

computers, showing that such assumptions are not generally applicable because of the

unreliability of computer clocks, on Windows systems in particular. More generally,

other operating systems and embedded devices are likely to suffer from similar

problems. Following from empirical results identifying ways in which real-world

computer clock operation deviates from the ideal, an analysis technique for

characterising the behaviour of a computer clock in the past based on commonly

available event log data was presented.
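In its simplest form, characterising a clock from paired observations reduces to fitting an offset and a drift rate. The least-squares sketch below is a caricature of that idea, not the thesis's actual technique, and the sample values are fabricated: each pair matches a reference (UTC) reading with the subject computer's local clock reading.

```python
# Toy reconstruction of clock behaviour from paired timestamps:
# fit local = offset + drift * reference by ordinary least squares.
samples = [  # (reference_time_s, local_clock_s) -- invented data
    (0.0,     120.0),
    (3600.0,  3722.0),
    (7200.0,  7324.0),
    (10800.0, 10926.0),
]

n = len(samples)
sx  = sum(r for r, _ in samples)
sy  = sum(l for _, l in samples)
sxx = sum(r * r for r, _ in samples)
sxy = sum(r * l for r, l in samples)

drift  = (n * sxy - sx * sy) / (n * sxx - sx * sx)  # local seconds per reference second
offset = (sy - drift * sx) / n                      # local lead at reference t = 0
```

Here the fabricated clock starts 120 s fast and gains 2 s per hour; real event-log data would of course be noisier and would, as Chapter 7 showed, exhibit discontinuities that a single (offset, drift) pair cannot capture.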

8.3 Implications of Work

The representational approach underlying Chapters 5, 6 and 7 has clear

implications for the development of forensics tools. Having identified that tool

interoperability is complicated by machines' inability to read natural language and, conversely, humans' inability to understand binary data, we have demonstrated that

employing a formal knowledge representation as a middle language for documenting

investigation and evidence related information is of benefit towards building digital

forensics tools. In particular, we have shown here the first practical use of the

RDF/OWL formalism in digital forensics, demonstrating that ontologies, in concert

with semantic markup languages, are a practical middle language for recording

evidence assurance documentation.

The SDEB approach has the potential to form a lingua franca for forensics

tools to use in assembling a corpus of evidence bags, maintaining assurance

documentation, automated validation of the integrity of evidence, and scalable evidence

processing. Similarly it was shown that the same representational approach may be

used for documenting, in a machine and human readable way, the information

interpreted from event logs.

These results point towards a document-oriented approach to digital evidence,

which relies on a common representational formalism for documenting all information

produced, interpreted or inferred by forensics tools. Documents produced by such tools would, through a consistent syntax and easily interpreted and extensible semantics, be

able to be integrated with otherwise unrelated information, and be read and

manipulated by generic libraries and programs. While concrete results which validate

that the representation is expressive and extensible and that the SDEB information


architecture is interoperable were presented, this work has only covered a small portion

of possible tool integration scenarios.

One implication of the results is the impact that an ontology and document

oriented approach to digital evidence might have upon firming the terminology used in

the field. We are not the first to argue for terminological precision in forensics; a

number of parties have observed that the terminology in the field is used in differing

ways. Some have proposed the use of ontologies as a useful tool for discussing and

defining the field, from a theoretical standpoint, from the top down. The practical

employment of ontologies in approaches such as have been described in this

dissertation has the potential to shape the terminology of the field from the bottom up,

with human readable results expressed using semantically grounded vocabulary,

passively shaping the investigator’s conception of digital evidence and the information

interpreted and derived from it.

The implications of the results showing the unreliability of a Windows based

time synchronisation infrastructure are clear. In cases where establishing the precise

time at which a computer event occurred is important, one cannot assume that

computers running MS Windows 2000 or XP have behaved in uniform ways with

respect to keeping time. Where precision is not so necessary, it would be expected that

corroborating sources of timestamped evidence might be useful in characterising the

behaviour of computer clocks, and thus enabling one to challenge the acceptability of

blanket assumptions made about clock behaviour. Where one expects to depend on the

correctness of timestamps, other, more reliable, measures must be taken towards

assuring synchronised computer clocks. Areas such as real time stock trading and

banking would be potential areas where this kind of forensic preparedness could be

warranted.

8.4 Opportunities for further work

This section of the chapter addresses areas of future work that have been

identified over the course of this research.

8.4.1 Document oriented evidence

While the work described in Chapters 5 and 6 validated that the representation is expressive and extensible and that the SDEB information architecture is interoperable, it covered only a small portion of possible tool integration scenarios.

Future work is required to ascertain how to best integrate evidence and

information with conflicting ontological commitments, the impact of part/whole


relationships on identifying entities, and how to practically integrate investigation

domain concepts such as hypotheses, investigator actions, assumptions, suspicions, and

likelihood into the approach. Furthermore, representing multiple, possibly conflicting

interpretations of evidence presents future challenges.

Additional work is required to investigate the linkages between both Bogen and

Dampier’s work on conceptualizing the digital forensics investigation process [17], and

Stephensen’s formal verification of investigation work [127], and this document

oriented evidence approach. We suspect that such a document oriented approach would

help bridge between the abstract goals of their work, and practical automation of the

digital investigation process.

In certain areas of documenting the investigation and representing event log

evidence, expressiveness problems were identified in the RDF/OWL language.

Reification was identified as an area where the RDF/OWL formalism was deficient in

expressiveness, and information provenance related statements were identified as areas

where the representation failed to effectively express information. The representation was additionally found to be insufficiently expressive to describe and

represent heuristic knowledge involving temporal constraints, instance matching, and

declaration of new property values and instances. Finally, the formalism is not suited

to efficiently modelling and reasoning with models of time involving uncertainty,

multiple timelines, and property values which vary over time. Future work is needed to

establish where the current and future generations of knowledge representations

address these problems, and to direct future research in the field of knowledge

representation.

A limitation to the RDF/OWL approach to representation is its high resource

consumption. While current generation RDF stores routinely scale to hundreds of

millions of statements, the implementation of OWL semantics remains a problem

because of the large amount of computation and querying/search involved. In the

context of the SDEB, performance would not be problematic because of the limited

amount of data required for SDEB composition. In the context of representing the

information content of digital evidence, as has been described in Chapter 5 in the

context of event correlation, the approach will likely fail to scale with current

approaches to RDF/OWL implementations. Future work is needed to address these

scaling difficulties.

8.4.2 Ontologies in digital forensics

This work principally employed ontologies as a means of ascribing a fixed

semantics to digital evidence and related information, for the purpose of documenting


knowledge related to a case, and as an information format compatible with rule based

reasoning. Another form of reasoning which is possible with description logics is

categorization, which is performed by description logic reasoners (in this work we did

not employ these, as detailed in Section 5.2.1). From data asserted in RDF, and an OWL

ontology, a description logic reasoner can classify instances of information as

belonging to a class or category, based on the relationships between the individual and

other classes or instances. Future work is necessary to determine the extent to which

this kind of reasoning could automatically identify situations or information of interest.

The ontologies used in the course of this research have been developed in an ad

hoc manner and built only for the purpose at hand. There has been no attempt to create

a comprehensive digital forensics ontology. Such an ontology would be of worth both

in building consensus on the meaning of the digital forensics related vocabulary,

highlighting areas where language is used in inexact or confusing ways, and as machine

readable semantics for tool interoperability. Building such an ontology is, however,

complicated by established linguistic conventions (e.g. UK usage of the term "computer based electronic evidence" vs. US usage of "digital evidence"), the context dependent nature of terminology, and the difficulty of limiting the scope of the ontology.

Future work applying automated ontology construction methods (i.e. "ontology

learning” [72]) could potentially produce a digital forensics ontology with low human

time, energy and consensus costs, and at the same time identify areas of the digital

forensics vocabulary which are used in divergent ways.

8.4.3 Temporal assumptions underlying event correlation

Forensic event correlation is fundamentally different from event correlation in

the IDS or network management fields, so it is important to be cognisant of the assumptions normally made there, typically concerning temporal nearness. Many IDS operate on small time windows to enable

the systems to scale. A problem with this approach is that situations exceeding the

windows of temporal focus are missed. For example, a commonly observed problem

with IDS is that the wily adversary need only wait 24 hours between steps in a multi-step attack in order for the IDS to miss the ongoing attack.

In event correlation in the forensics context, determining what window of time

to apply to an event pattern is problematic for the same reason that IDS use limited

time windows: scalability. Simple assumptions such as forgetting state after 24 hours

may help limit the state space of correlation algorithms, but may produce false

(For a good survey of approaches to building ontologies, see [102].)


negatives, which may be acceptable in the IDS context but not in the forensic one.
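The trade-off can be made concrete with a toy two-step pattern (a probe followed by an exploit from the same source). All event data below is invented; the point is only that a bounded window buys scalability at the cost of completeness.

```python
from datetime import datetime, timedelta

# Invented events: the adversary waits 25 hours between the two steps.
events = [
    {"t": datetime(2007, 3, 1, 10, 0), "kind": "probe",   "src": "10.0.0.5"},
    {"t": datetime(2007, 3, 2, 11, 0), "kind": "exploit", "src": "10.0.0.5"},
]

def correlate(events, window):
    """Pair each probe with a later exploit from the same source within `window`."""
    hits = []
    for a in events:
        for b in events:
            if (a["kind"] == "probe" and b["kind"] == "exploit"
                    and a["src"] == b["src"]
                    and timedelta(0) < b["t"] - a["t"] <= window):
                hits.append((a, b))
    return hits

missed = correlate(events, timedelta(hours=24))   # IDS-style bounded window
found  = correlate(events, timedelta(days=365))   # forensic, near-unbounded
```

The 24-hour window produces a false negative for exactly the reason given above, while the wider forensic window finds the pattern at the cost of retaining far more state.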

Temporal correlation methods such as those we have proposed in Chapter 7

imply models of time more complex than can be described using terms such as offset

and drift; our research visually depicted the relationship between our reference timeline

(UTC) and the subject computer. The initial results characterising the temporal

behaviour of a particular clock showed that this relationship might be described by

successive time offsets and drift rates. The problem is that in practice there are no

events stored on the computer which allow one to see changes in rate of the passage of

time at this micro level granularity.

The presence of false positives in the results generated by the correlation

method precludes its use as a directly usable means of interpreting unreliable

timestamps to corresponding times on a reference timescale. Upstream tools seeking to

work with events with unreliable timestamps may use results generated by this

correlation method, however the raw results would need to be manually interpreted into

a set of assumptions about the passage of time relative to the reference timescale.

Assuming that the false positives problem might be solved by a more thorough

reverse engineering of the IE cache and history file format, automated upstream tool

use of the correlation results is still complicated by the granularity of the correlation

results, and the likely limited period which the results would cover. For this reason,

extending the concrete results of correlation requires production of a set of assumptions

about the passage of time in between the samples. These assumptions, or temporal

theories, about the passage of time could be used by a correlation tool to ascribe a

theoretical real world time to an event.

The results regarding temporal provenance indicate that event correlation

processes would benefit from richer models of temporal progress including timescale

deviation, event time uncertainty, and orthogonal to this, assumptions about these.

What effect such notions might have upon event pattern languages is an open question.

It would appear likely that their effect on the algorithmic complexity of correlation approaches would be highly adverse.

8.4.4 Characterising temporal behavior of computers

The study in Chapter 7 focused on the Windows platform because of its

dominance in deployment. While a number of studies have observed widespread

temporal skews in computer networks, the extent to which the results relate to temporal

behaviour of other operating systems is still an open question. The computers in this

experiment were tethered to a time source where synchronisation occurred often.


Future work is needed in characterising the behaviour of Windows PCs that are either

untethered from or loosely tethered to reliable time sources, and also on the behaviour

of UNIX and RTOS variants.

8.4.5 Event pattern languages

This work addresses means for analyzing event log based evidence, utilizing

RDF/OWL for representing entities and events, and rules for expressing correlation

relationships. In this context correlation refers to an abstraction relationship between a

set of events and a higher level event or situation. This correlation relationship is in

turn dependent upon a variety of relationships between the lower level events,

including temporal constraints and constraints over property relationships with entities

involved in the events and the wider environment.

The problem of event correlation and event pattern languages in particular lies

in how to describe these events, relationships and constraints. This work relied on

OWL for modeling events and relationships; however its expressiveness is insufficient

for describing temporal constraints. This led to the employment of a rule language for

declaring these. Some work has been performed on extending description logics to

incorporate temporal descriptions; however the work is preliminary.

Future investigations of event pattern languages would benefit from working

with abstract notions of time such as before, after, coincident, and during, rather than

reasoning with time as a single discrete numerical value. How a language incorporating

these notions would interact with temporal models such as those mentioned in Section

7.3 similarly requires further investigation.
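A pattern language built on such qualitative notions might expose predicates over intervals rather than raw numeric comparison. The sketch below implements a small subset of Allen-style interval relations; the intervals and event names are invented for the illustration.

```python
# Qualitative temporal predicates over (start, end) intervals --
# a subset of Allen's interval relations, for illustration only.

def before(a, b):
    """a ends strictly before b starts."""
    return a[1] < b[0]

def after(a, b):
    return before(b, a)

def during(a, b):
    """a lies strictly inside b."""
    return b[0] < a[0] and a[1] < b[1]

def coincident(a, b):
    return a == b

# Invented event intervals (arbitrary time units).
download = (100, 160)   # a file download
session  = (90, 400)    # a login session
logoff   = (500, 501)   # a logoff event

assert during(download, session)
assert before(session, logoff)
assert after(logoff, download)
```

A rule such as "a download *during* a session, *before* the logoff" could then be stated without committing to any particular window size or timescale.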


Chapter 9. Bibliography

[1] HB171-2003 Guidelines for the management of IT evidence. 2003, Standards Australia International: Sydney, Australia.

[2] AAFS. So you want to be a forensic scientist? 2006 [Viewed Nov 2006]; Available from: http://www.aafs.org/default.asp?section_id=resources&page_id=choosing_a_career.

[3] Abbott, J., J. Bell, A. Clark, O. de Vel, and G. Mohay. Computer forensics (CF): Automated recognition of event scenarios for digital forensics. in 2006 ACM Symposium on Applied Computing. 2006. Dijon, France: ACM Press.

[4] AccessData. FTK Crashes or Hangs on Certain Files. 2006 [Viewed 29 Nov 2006]; Available from: http://www.accessdata.com/media/en_us/print/techdocs/techdoc.FTK_crashes_or_hangs_on_certain_files.en_us.pdf.

[5] ACM, Next-generation cyber forensics. Communications of the ACM, 2006. 49(2).

[6] ACPO. Good Practise Guide for Computer based Electronic Evidence. 2006 [Viewed 19 Oct 2006]; Available from: http://www.acpo.police.uk/asp/policies/Data/gpg_computer_based_evidence_v3.pdf.

[7] Alink, W., R.A.F. Bhoedjang, P.A. Boncz, and A.P. de Vries, XIRAF – XML-based indexing and querying for digital forensics. Digital Investigation (6th Digital Forensics Research Workshop), 2006. 3(Supplement 1): p. 89-107.

[8] Attfield, P., United States v Gorshkov detailed forensics and case study: expert witness perspective, in 1st International Workshop on Systematic Approaches to Digital Forensic Engineering. 2005: Taipei, Taiwan. p. 3-24.

[9] Austen, J., Some stepping stones in computer forensics. Information Security Technical Report, 2003. 8(2): p. 37-41.

[10] Baader, F., Logic-based Knowledge Representation, in Artificial Intelligence Today: Recent Trends and Developments. 1999, Springer.

[11] Bartel, M., J. Boyer, B. Fox, B. LaMacchia, and E. Simon. XML-Signature Syntax and Processing. 2002 [Viewed 9 Jan 2006]; Available from: http://www.w3.org/TR/xmldsig-core/.

[12] Beckett, J., Digital Forensics: Validation and Verification in a Dynamic Work Environment, in 40th Annual Hawaii International Conference on Systems Science. 2007: Hawaii.

[13] Beebe, N.L. and J.G. Clark, A Hierarchical, Objectives-Based Framework for the Digital Investigations Process, in 4th Digital Forensics Research Workshop. 2004: Baltimore, MD.

[14] Berners-Lee, T., D. Connolly, and R.R. Swick. Web Architecture: Describing and Exchanging Data. 1999 [Viewed 4 Dec 2006]; Available from: http://www.w3.org/1999/06/07-WebData.

[15] Berners-Lee, T., R. Fielding, and L. Masinter. Uniform Resource Identifiers (URI): Generic Syntax. 1998 [Viewed 9 January 2006]; Available from: http://www.ietf.org/rfc/rfc2396.txt.


[16] Berners-Lee, T., J. Hendler, and O. Lassila, The Semantic Web. Scientific American, 2001. 284(5): p. 28-37.

[17] Bogen, A. and D. Dampier. Knowledge discovery and experience modeling in computer forensics media analysis. in International Symposium on Information and Communication Technologies. 2004: Trinity College Dublin.

[18] Bogen, A.C. and D.A. Dampier. Unifying computer forensics modeling approaches: a software engineering perspective. in 1st International Workshop on Systematic Approaches to Digital Forensic Engineering. 2005.

[19] Borgida, A., R.J. Brachman, D.L. McGuinness, and L.A. Resnick, CLASSIC: A Structural Data Model for Objects, in ACM SIGMOD International Conference on Management of Data. 1989: Portland, Oregon.

[20] Boyd, C. and P. Forster, Time and date issues in forensic computing – a case study. Digital Investigation, 2004: p. 18-23.

[21] Brill, A.E., M. Pollitt, and C.M. Whitcomb, The Evolution of Computer Forensic Best Practices: An Update on Programs and Publications. Journal of Digital Forensic Practice, 2006. 1(1): p. 2-11.

[22] Brinson, A., A. Robinson, and M. Rogers, A cyber forensics ontology: Creating a new approach to studying cyber forensics, in 6th Digital Forensics Research Workshop. 2006: Lafayette, IN.

[23] Carrier, B. Open Source Digital Forensics Tools: The Legal Argument. @stake Research Report 2002 [Viewed Dec 2006]; Available from: http://www.digital-evidence.org/papers/opensrc_legal.pdf.

[24] Carrier, B. A Hypothesis-Based Approach to Digital Forensic Investigations (Ph.D. Thesis). 2006. West Lafayette: Purdue University.


[25] Carrier, B. The sleuth kit & autopsy: Forensics tools for linux and other unixes. 2006 [Viewed 29 Nov 2006]; Available from: http://www.sleuthkit.org/.

[26] Carrier, B. and E. Spafford, An Event-based Digital Forensic Investigation Framework, in 4th Digital Forensic Research Workshop. 2004: Baltimore, MD.

[27] Casey, E., Digital evidence and computer crime. 2000, San Diego, Calif: Academic Press.

[28] Casey, E., State of the field: growth, growth, growth. Digital Investigation, 2004. 1(4): p. 241-309.

[29] Casey, E., Digital arms race – The need for speed. Digital Investigation, 2005. 2(4): p. 229-280.

[30] CDESF. Common Digital Evidence Storage Format. 2004 [Viewed 21 December 2005]; Available from: http://www.dfrws.org/CDESF/index.html.

[31] CDESF. Survey of Disk Image Storage Formats. 2006 [Viewed Dec 2006]; Available from: http://www.dfrws.org/CDESF/survey-dfrws-cdesf-diskimg-01.pdf.

[32] Chen, H., T. Finin, and A. Joshi, An Ontology for Context-Aware Pervasive Computing Environments, in Adjunct Proceedings of the 6th International Conference on Ubiquitous Computing. 2003: Seattle, Washington.

[33] Collier, P.A. and B.J. Spaul, A Forensic Methodology for Countering Computer Crime. Journal of Forensic Science, 1992. 32(1).

[34] Connolly, D., R. Khare, and A. Rifkin. The Evolution of Web Documents: The Ascent of XML. 1997 [Viewed 4 Dec 2006]; Available from: http://www.cs.caltech.edu/~adam/papers/xml/ascent-of-xml.html.


[35] Cuppens, F. and A. Miege, Alert Correlation in a Cooperative Intrusion Detection Framework, in IEEE Symposium on Security and Privacy. 2002: Berkeley, California.

[36] Davis, R., H. Shrobe, and P. Szolovits, What Is a Knowledge Representation? AI Magazine, 1993. 14(1): p. 17-33.

[37] Doyle, J., I. Kohane, W. Long, H. Shrobe, and P. Szolovits, Event Recognition Beyond Signature and Anomaly, in IEEE Workshop on Information Assurance and Security. 2001: United States Military Academy, West Point, New York.

[38] Eckmann, S. and G. Vigna, STATL: An Attack Language for State-based Intrusion Detection. 2000, Dept. of Computer Science, University of California: Santa Barbara.

[39] Elsaesser, C. and M. Tanner. Automated diagnosis for computer forensics. 2001 [Viewed Feb 2007]; Available from: http://www.mitrecorp.org/work/tech_papers/tech_papers_01/elsaesser_forensics/esaesser_forensics.pdf.

[40] Fikes, R., J. Jenkins, and G. Frank, JTP: A System Architecture and Component Library for Hybrid Reasoning, in Proceedings of the Seventh World Multiconference on Systemics, Cybernetics, and Informatics. 2003: Orlando, Florida.

[41] Fikes, R. and T. Kehler, The role of frame-based representation in reasoning. Communications of the ACM, 1985. 28(9): p. 904-920.

[42] Forgy, C., Rete: A Fast Algorithm for the Many Patterns/Many Objects Match Problem. Artificial Intelligence, 1982. 19(1): p. 17-37.

[43] Friedman-Hill, E. JESS: The Rule Engine for the Java™ Platform. 2003 [Viewed Nov 2003]; Available from: http://herzberg.ca.sandia.gov/jess/.


[44] Garfinkel, S., Forensic feature extraction and cross-drive analysis. Digital Investigation (6th Digital Forensics Research Workshop), 2006. 3(Supplement 1): p. 71-81.

[45] Garfinkel, S.L., D.J. Malan, K.-A. Dubec, C.C. Stevens, and C. Pham, Disk Imaging with the Advanced Forensics Format, Library and Tools. Advances in Digital Forensics (2nd Annual IFIP WG 11.9 International Conference on Digital Forensics), 2006.

[46] Genesereth, M.R. and R.E. Fikes, Knowledge Interchange Format, Version 3.0 Reference Manual. 1992, Technical Report Logic-92-1, Computer Science Department, Stanford University.

[47] Gladyshev, P. and A. Patel, Formalising Event Time Bounding in Digital Investigations. International Journal of Digital Evidence, 2005. 4(2).

[48] Goldman, R., W. Heimerdinger, S. Harp, C. Geib, V. Thomas, and R. Carter, Information Modeling for Intrusion Report Aggregation, in DARPA Information Survivability Conference and Exposition II. 2001: Anaheim, CA.

[49] Gray, J. and D. Patterson, A conversation with Jim Gray. ACM Queue, 2003. 1(4).

[50] Green, C. Application of Theorem Proving to Problem Solving. in 1st International Joint Conference on Artificial Intelligence. 1969: Stanford Research Institute, Artificial Intelligence Group.

[51] Gruber, T.R., Toward principles for the design of ontologies used for knowledge sharing? International Journal of Human Computer Studies, 1995. 43(5-6): p. 907-928.

[52] Guha, R.V. and T. Bray. Meta Content Framework Using XML. 1997 [Viewed 20 Dec 2006]; Available from: http://www.w3.org/TR/NOTE-MCF-XML/.


[53] Hannan, M., To Revisit: What is Forensic Computing?, in 2nd Australian Computer, Network & Information Forensics Conference. 2004: Perth, Australia.

[54] Harmelen, F., P.F. Patel-Schneider, and I. Horrocks. Reference description of the DAML+OIL (March 2001) ontology markup language. 2001 [Viewed 20 July 2004]; Available from: http://www.daml.org/2001/03/reference.html.

[55] Horrocks, I. The FaCT system. 1999 [Viewed Nov 2003]; Available from: http://www.cs.man.ac.uk/~horrocks/FaCT/.

[56] Horrocks, I., P.F. Patel-Schneider, and F. van Harmelen, From SHIQ and RDF to OWL: The making of a web ontology language. Journal of Web Semantics, 2003. 1(1): p. 7-26.

[57] IOCE. G8 Proposed principles for the procedures relating to digital evidence. 2002 [Viewed 16 Jan 2007]; Available from: http://ncfs.org/documents/ioce2002/reports/g8ProposedPrinciples.pdf.

[58] ISO, ISO 8879:1986 Information processing — Text and office systems — Standard Generalized Markup Language (SGML). 1986.

[59] Jones, K.J. Pasco – An Internet Explorer Activity Forensics Analysis Tool. 2004 [Viewed April 2006]; Available from: http://sourceforge.net/project/shownotes.php?group_id=78332&release_id=237810.

[60] Kenneally, E.E., Gatekeeping Out Of The Box: Open Source Software As A Mechanism To Assess Reliability For Digital Evidence. Virginia Journal of Law and Technology, 2001. 6(3).

[61] Kifer, M., G. Lausen, and J. Wu, Logical Foundations for Object-Oriented and Frame-Based Languages. Journal of the Association for Computing Machinery, 1995. 42(3): p. 741-843.


[62] KLPD. The Open Computer Forensics Architecture (OCFA). 2006 [Viewed 30 Nov 2006]; Available from: http://ocfa.sourceforge.net/.

[63] Klyne, G. and J. Carroll. Resource Description Framework (RDF): Concepts and Abstract Syntax. 2004 [Viewed 21 December 2005]; Available from: http://www.w3.org/TR/rdf-concepts/.

[64] Kopena, J. OWLJessKB: A Semantic Web Reasoning Tool. 2003 [Viewed Feb 2003]; Available from: http://edge.cs.drexel.edu/assemblies/software/owljesskb/.

[65] Kornblum, J., Identifying almost identical files using context triggered piecewise hashing. Digital Investigation (6th Digital Forensics Research Workshop), 2006. 3(Supplement 1): p. 91-97.

[66] Lassila, O., Web metadata: a matter of semantics. Internet Computing, IEEE, 1998. 2(4): p. 30-37.

[67] Lenat, D.B., CYC: A Large-Scale Investment in Knowledge Infrastructure. Communications of the ACM, 1995. 38(11): p. 33-38.

[68] LexisNexis, Butterworths Encyclopaedic Australian Legal Dictionary. 2006.

[69] Lindqvist, U. and P.A. Porras. Detecting computer and network misuse through the production-based expert system toolset (P-BEST). in IEEE Symposium on Security and Privacy. 1999. Berkeley, California.

[70] Lindsey, T. Challenges in Digital Forensics. 2006 [Viewed 7 Mar 2007]; Available from: http://www.dfrws.org/2006/proceedings/Lindsey-pres.pdf.

[71] Luckham, D., The Power of Events. 2002, Indianapolis, Indiana: Pearson Education.

[72] Maedche, A. and S. Staab, Ontology learning for the Semantic Web. IEEE Intelligent Systems, 2001. 16(2): p. 72-79.


[73] McBride, B., Jena: a semantic web toolkit. IEEE Internet Computing, 2002. 6(6): p. 55-59.

[74] McDermott, D., The 1998 AI Planning Systems Competition. AI Magazine, 2000. 21(2).

[75] McCalla, G. and N. Cercone, Guest Editor's Introduction: Approaches to Knowledge Representation. IEEE Computer, 1983: p. 12-18.

[76] McKemmish, R., What is Forensic Computing? Trends and Issues in Crime and Criminal Justice, 1999(118).

[77] Menzel, C., Common Logic Standard, in Metadata Forum Symposium on Ontologies. 2003: Santa Fe.

[78] Meyers, M. and M. Rogers, Computer Forensics: Meeting the Challenges of Scientific Evidence. Advances in Digital Forensics (1st Annual IFIP WG 11.9 International Conference on Digital Forensics), 2005. 1(1).

[79] Microsoft. How Windows Keeps Track of the Date and Time. 2006 [Viewed April 2006]; Available from: http://support.microsoft.com/?kbid=232488.

[80] Microsoft. Microsoft products do not reflect Australian daylight saving time changes for the year 2006. 2006 [Viewed April 2006]; Available from: http://support.microsoft.com/kb/909915.

[81] Microsoft. The system clock may run fast when you use the ACPI power management timer as a high-resolution counter on Windows 2000-based, Windows XP-based, and Windows Server 2003-based computers. 2006 [Viewed April 2006]; Available from: http://support.microsoft.com/?kbid=821893.

[82] Mills, D.L., Precision synchronization of computer network clocks. ACM Computer Communications Review, 1994. 24(2): p. 28-43.


[83] Mills, D.L., A brief history of NTP time: confessions of an Internet timekeeper. ACM Computer Communications Review, 2003. 33(2): p. 9-22.

[84] Minsky, M., A Framework for Representing Knowledge, in The Psychology of Computer Vision, P.H. Winston, Editor. 1974, McGraw-Hill: New York.

[85] Minsky, M., Logical vs. Analogical or Symbolic vs. Connectionist or Neat vs. Scruffy. Artificial Intelligence at MIT, Expanding Frontiers, 1991. 1.

[86] Moats, R. URN Syntax. 1997 [Viewed 6 Jan 2006]; Available from: http://www.ietf.org/rfc/rfc2141.txt.

[87] Mohay, G. Technical Challenges and Directions for Digital Forensics. in 1st International Workshop on Systematic Approaches to Digital Forensic Engineering. 2005.

[88] Mohay, G., From Computer Forensics to Digital Forensics, in 1st International Conference on Information Security and Computer Forensics. 2006: Chennai, India.

[89] Mohay, G., A. Anderson, B. Collie, R. McKemmish, and O. de Vel, Computer and Intrusion Forensics. 2003: Artech House, Inc. Norwood, MA, USA.

[90] NCI. The National Cancer Institute Thesaurus in OWL. 2003 [Viewed Jan 2007]; Available from: http://www.mindswap.org/2003/CancerOntology/.

[91] Neches, R., R. Fikes, T.W. Finin, T.R. Gruber, R. Patil, T.E. Senator, and W.R. Swartout, Enabling Technology for Knowledge Sharing. AI Magazine, 1991. 12(3): p. 36-56.

[92] NIJ, Electronic Crime Scene Investigation: A Guide for First Responders. 2001, National Institute of Justice: Washington, DC.


[93] NIJ, Forensic Examination of Digital Evidence: A Guide for Law Enforcement. 2004, National Institute of Justice: Washington, DC.

[94] Niles, I. and A. Pease, Towards a Standard Upper Ontology, in 2nd International Conference on Formal Ontology in Information Systems (FOIS-2001), C. Welty and B. Smith, Editors. 2001: Ogunquit, Maine.

[95] Ning, P., Y. Cui, and D. Reeves, Constructing attack scenarios through correlation of intrusion alerts, in 9th ACM Conference on Computer and Communications Security. 2002: Washington, DC.

[96] Nolan, R., C. O'Sullivan, J. Branson, and C. Waits, First Responders Guide to Computer Forensics. 2005, Software Engineering Institute, Carnegie Mellon University: Pittsburgh, PA.

[97] Noy, N.F. and D.L. McGuinness. Ontology Development 101: A Guide to Creating Your First Ontology. 2001 [Viewed 2004]; Available from: http://www.ksl.stanford.edu/people/dlm/papers/ontology101/ontology101-noy-mcguinness.html.

[98] Palmer, G. (ed), A Road Map for Digital Forensic Research, in First Digital Forensic Research Workshop, G. Palmer, Editor. 2001: Utica, New York.

[99] Pan, F. and J.R. Hobbs, Time in OWL-S, in 2004 AAAI Spring Symposium Series - Semantic Web Services. 2004: Stanford University.

[100] Parker, D.B., Rules of ethics in information processing. Communications of the ACM, 1968. 11(3): p. 198-201.

[101] Perrochon, L., E. Jang, S. Kasriel, and D.C. Luckham, Enlisting Event Patterns for Cyber Battlefield Awareness, in DARPA Information Survivability Conference & Exposition. 2000: Hilton Head, South Carolina.

[102] Pinto, H.S. and J.P. Martins, Ontologies: How can They be Built? Knowledge and Information Systems, 2004. 6(4).


[103] Pollack, J. U.S. v Plaza, Acosta (Cr. No. 98-362-10, 11, 12). 2002 [Viewed 7 Jan 2002]; Available from: http://www.paed.uscourts.gov/documents/opinions/02d0046p.pdf.

[104] Raskin, V., C.F. Hempelmann, and K.E. Triezenberg. Semantic Forensics: An Application of Ontological Semantics to Information Assurance. in Second Workshop on Text Meaning and Interpretation. 2004.

[105] Raskin, V., C.F. Hempelmann, K.E. Triezenberg, and S. Nirenburg, Ontology in information security: a useful theoretical foundation and methodological tool, in Workshop on New Security Paradigms. 2001: Cloudcroft, New Mexico.

[106] RCFL. Regional Computer Forensic Laboratory Program: Fiscal Year 2003 Annual Report. 2003 [Viewed Dec 2006]; Available from: http://www.rcfl.gov/downloads/documents/RCFL_Nat_Annual.pdf.

[107] Redgrave, L.M., A.S. Prasad, J.B. Fliegel, T.S. Hiser, and J.H. Jessen, The Sedona Principles: Best Practices Recommendations & Principles for Addressing Electronic Document Production, in The Sedona Conference Working Group Series. 2004, The Sedona Conference.

[108] Reed, S.L. and D.B. Lenat, Mapping Ontologies into Cyc, in AAAI Workshop on Ontologies and the Semantic Web. 2002: Edmonton, Canada.

[109] Reith, M., C. Carr, and G. Gunsch, An Examination of Digital Forensic Models. International Journal of Digital Evidence, Fall, 2002. 1(2).

[110] Reynolds, D., C. Thompson, J. Mukerji, and D. Coleman. An assessment of RDF/OWL modelling. 2005 [Viewed Aug 2006]; Available from: http://www.hpl.hp.com/techreports/2005/HPL-2005-189.pdf.


[111] Richard III, G.G. and V. Roussev, Scalpel: A Frugal, High Performance File Carver, in Digital Forensics Research Workshop. 2005: New Orleans, LA.

[112] Richard III, G.G. and V. Roussev, Next-generation digital forensics. Communications of the ACM, 2006. 49(2): p. 76-80.

[113] Rivest, R. SEXP---(S-expressions). 1997 [Viewed 4 Dec 2006]; Available from: http://theory.lcs.mit.edu/%7Erivest/sexp.html.

[114] Roussev, V. and G.G. Richard III, Breaking the Performance Wall: The Case for Distributed Digital Forensics, in 5th Digital Forensics Research Workshop. 2005: New Orleans, LA.

[115] SandhillConsulting. How Microsoft Windows NT 4.0 Handles Time. 1998 [Viewed April 2006]; Available from: http://folkworm.ceri.memphis.edu/ew-doc/PROGRAMMER/NTandTime.html.

[116] Schumacher, M., Security Engineering with Patterns. Lecture Notes in Computer Science, 2003. 2754.

[117] Seneger, M. Life Sciences Identifiers LSID Response. 2004 [Viewed 6 Jan 2006]; Available from: http://www.omg.org/cgi-bin/doc?lifesci/2003-12-02.

[118] Sintek, M. and S. Decker, TRIPLE---A Query, Inference, and Transformation Language for the Semantic Web, in International Semantic Web Conference (ISWC). 2002: Sardinia.

[119] Slay, J. and F. Schulz, Development of an Ontology Based Forensic Search Mechanism: Proof of Concept. Journal of Digital Forensics, Security and Law, 2006. 1(1): p. 19-34.

[120] Sommer, P. Computer Forensics: an introduction. 1997 [Viewed Dec 2006]; Available from: http://www.virtualcity.co.uk/vcaforens.htm.


[121] Sommer, P., Digital Footprints: Assessing Computer Evidence. Criminal Law Review Special Edition, 1998: p. 61-78.

[122] Sommer, P. Digital Evidence: Emerging Problems in Forensic Computing. 2002 [Viewed 16 Jan 2007]; Available from: http://www.cl.cam.ac.uk/research/security/seminars/2002/2002-05-21.pdf.

[123] Spafford, E.H. and S.A. Weeber, Software forensics: can we track code to its authors? Computers and Security, 1993. 12(6): p. 585-595.

[124] Sperberg-McQueen, C.M. and H. Thompson. XML Schema. 2001 [Viewed November 2003]; Available from: http://www.w3.org/XML/Schema.

[125] Stallard, T. and K. Levitt. Automated analysis for digital forensic science: semantic integrity checking. in Computer Security Applications Conference. 2003. Las Vegas, Nevada.

[126] Stephenson, P., A Comprehensive Approach to Digital Incident Investigation. 2003, Elsevier Information Security Technical Report.

[127] Stephenson, P. Structured Investigation of Digital Incidents in Complex Computing Environments (Ph.D. Thesis). 2004.

[128] Stevens, M.W., Unification of relative time frames for digital forensics. Digital Investigation, 2004. 1: p. 225-239.

[129] Swartout, W., C. Paris, and J. Moore, Explanations in knowledge systems: design for explainable expert systems. IEEE Expert, 1991. 6: p. 58-64.

[130] SWGDE, Digital Evidence: Standards and Principles. Forensic Science Communications, 2000. 2(2).

[131] SWGDE. SWGDE and SWGIT Glossary of Terms. 2005 [Viewed 25 Aug 2007]; Available from: http://68.156.151.124/documents/swgde2005/SWGDE%20and%20SWGIT%20Combined%20Master%20Glossary%20of%20Terms%20-July%2020..pdf.

[132] Templeton, S.J. and K. Levitt, A Requires/Provides Model for Computer Attacks, in New Security Paradigms Workshop. 2000: Ballycotton, County Cork, Ireland.

[133] TGO. The Gene Ontology. 2006 [Viewed Jan 2007]; Available from: http://www.geneontology.org/.

[134] Thomas, L.K. Reverse engineering index.dat. 2003 [Viewed April 2006]; Available from: http://www.latenighthacking.com/projects/2003/reIndexDat/.

[135] Turner, P. Unification of Digital Evidence from Disparate Sources (Digital Evidence Bags). in 5th Digital Forensics Research Workshop. 2005. New Orleans.

[136] Undercoffer, J., A. Joshi, T. Finin, and J. Pinkston, A Target-Centric Ontology for Intrusion Detection, in 18th International Joint Conference on Artificial Intelligence. 2004: Acapulco, Mexico.

[137] van den Bos, J. and R. van der Knijff, TULP2G–An Open Source Forensic Software Framework for Acquiring and Decoding Data Stored in Electronic Devices. International Journal of Digital Evidence, 2005. 4(2).

[138] W3C. Extensible Markup Language (XML). 1998 [Viewed 7 Mar 2007]; Available from: http://www.w3.org/TR/REC-xml.

[139] W3C. XML Schema Requirements. 1999 [Viewed 21 Dec 2006]; Available from: http://www.w3.org/TR/NOTE-xml-schema-req.

[140] W3C. RDF Vocabulary Description Language 1.0: RDF Schema. 2004 [Viewed 21 Dec 2006]; Available from: http://www.w3.org/TR/rdf-schema/.


[141] Weil, C., Dynamic Time & Date Stamp Analysis. International Journal of Digital Evidence, 2002. 1(2).

[142] Whitcomb, C., An historical perspective of digital evidence: A forensic scientist's view. International Journal of Digital Evidence, 2002. 1(1).

[143] Yemini, S.A. and S. Kliger, High Speed and Robust Event Correlation. IEEE Communications, 1996: p. 433-450.