Welsh Government Workshop


DESCRIPTION

Abaca: Technically Assisted Sensitivity Review of Digital Records. A presentation of our proof-of-concept classifier for assisting the sensitivity review of digital records.

TRANSCRIPT

Abaca: Technically Assisted Sensitivity Review of Digital Records

Agenda

● Transferring of Records to Archives
● The Digital Problem
● The Abaca Project
● Abaca Classifier Experiment
● The Test Collection
● The Abaca Project - Where Next?
● Break-Out Group Session
● Groups Discussion

Transferring of Records to Archives
● Department selects and appraises records for permanent preservation
  – In paper, about 5% of output selected; digital may rise to 20%
● Prior to transfer, department must complete sensitivity review
  – Paper review is well understood
  – Digital presents many new challenges and is not so well understood
● Hence our research!

The Digital Problem
● The file has gone
● Volume will increase
  – The way business is done has changed
  – Largely unstructured despite EDRMs
● Big transfers of departmental records
● Appraisal
  – Separate issue, not addressed today
● Precautionary closure
  – Need to research a solution
● Not unique to public records

Our Approach
● Provide a Framework of Utilities ...
  – to assist the Review Process
● Need Methods ...
  – that respect the reality of Digital Records in all their “Glory”
  – that can be tailored to specific circumstances
● Need Tools ...
  – to help reviewers be more productive

The Abaca Project

● Research to show that utilities will help
● Two Phases
  – Proof of Concept (In Progress)
  – Full Project (Seeking external funding)
● Today we are describing our proof-of-concept work
● Abaca: Technically Assisted Sensitivity Review of Digital Records

Abaca Classifier Experiment
● Overview of the Task & Approach
● Predicting Exemptions using a Classifier
  – Features
  – Types of Features
● Example Sensitive Document
● Research Question
● Overview of Classification
● Evaluation Methodology
● Results

The Task

Produce a classifier that can predict the presence of sensitive material within unstructured text.

Initially focusing on two FOIA sensitivities:
Section 27: International Relations
Section 40: Personal Information

Approach

Manually review sensitive data to create a test collection.

Split test collection into training and test sets.

Train a classifier to predict the sensitivities in documents using the set of identified features.

Test the classifier on previously “unseen” documents.

Measure classification success.
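
To make this workflow concrete, the sketch below shows the split/train/predict/measure loop, assuming scikit-learn; the toy documents, labels and the choice of a linear SVM are illustrative assumptions rather than the project's actual setup.

```python
# Minimal sketch of the Approach: split a labelled collection, train a text
# classifier, predict on unseen documents, and measure classification success.
# scikit-learn and the toy data below are assumptions for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import balanced_accuracy_score

# A (toy) test collection: documents with manual sensitivity judgements (1 = sensitive).
documents = [
    "Briefing on relations with a foreign government delegation",
    "Routine minutes of the weekly planning meeting",
    "Memo containing a named individual's date of birth and home address",
    "Press release announcing a public consultation",
]
labels = [1, 0, 1, 0]

# Split the collection into training and test sets.
train_docs, test_docs, train_labels, test_labels = train_test_split(
    documents, labels, test_size=0.5, random_state=0, stratify=labels)

# Represent documents as tf/idf feature vectors and train a classifier.
vectoriser = TfidfVectorizer()
classifier = LinearSVC()
classifier.fit(vectoriser.fit_transform(train_docs), train_labels)

# Predict sensitivities for previously "unseen" documents and measure success.
predictions = classifier.predict(vectoriser.transform(test_docs))
print(balanced_accuracy_score(test_labels, predictions))
```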


Predict Exemptions Using a Classifier

[Diagram: External Resources feed Feature Extraction, in which features are represented as real numbers and documents as feature vectors; Learn Classifier produces a Learned Model; Run Classifier applies the Learned Model to further feature vectors to produce Predictions.]


Features

Document features, such as the words it contains or the entities it references, convey information about a document.

A document can be modelled by using a statistical representation of its features.

We use external knowledge bases, Natural Language Processing and semantic analysis to better understand the document features.

The classifier recognises patterns in the documents’ feature sets and uses them for prediction.
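
As a small illustration of a "statistical representation of its features", the sketch below turns two toy documents into tf/idf feature vectors; scikit-learn is an assumed choice here, not necessarily the project's toolkit.

```python
# Sketch: model documents by a statistical (tf/idf) representation of their word features.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "The ambassador discussed trade relations with the delegation.",
    "The committee discussed the annual budget and staffing.",
]
vectoriser = TfidfVectorizer()
vectors = vectoriser.fit_transform(docs)  # one real-valued feature vector per document

# Show the non-zero term weights for the first document.
for term, weight in zip(vectoriser.get_feature_names_out(), vectors.toarray()[0]):
    if weight > 0:
        print(f"{term}: {weight:.3f}")
```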

Types of Features

The features we use can be divided into three main categories.

Structure
  Examples: Lists of Words (tf/idf), Document Length, Number of Recipients
  Comments: Ubiquitous throughout the collection. Can expose patterns in document types. High-value information about the nature of the communication.

Content
  Examples: Subjectivity, Verbs, “D.O.B”, Negation
  Comments: By applying techniques such as Natural Language Processing and dictionary-based term matching, we can identify the tone of the communication.

Entities
  Examples: Countries, People, Organisations
  Comments: Tells us what the document “is about”. Context related to the entity, such as a “high-risk” country or a “significant” person or role, can suggest sensitivity likelihood.
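
As an illustration of how such structure, content and entity features might be computed, the sketch below uses a made-up country list, a crude D.O.B pattern and a small negation dictionary; these are stand-ins, not the project's actual external resources.

```python
# Sketch of hand-crafted structure/content/entity features like those in the table above.
# The country list, D.O.B pattern and negation words are illustrative stand-ins only.
import re

HIGH_RISK_COUNTRIES = {"atlantis", "freedonia"}           # hypothetical entity resource
NEGATION_WORDS = {"not", "no", "never", "denied"}          # hypothetical content dictionary
DOB_PATTERN = re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")   # crude date-of-birth pattern

def extract_features(text: str, recipients: int) -> dict:
    tokens = text.lower().split()
    return {
        # Structure features
        "document_length": len(tokens),
        "number_of_recipients": recipients,
        # Content features
        "dob_score": len(DOB_PATTERN.findall(text)),
        "negation_score": sum(token in NEGATION_WORDS for token in tokens),
        # Entity features
        "country_count": sum(token.strip(".,") in HIGH_RISK_COUNTRIES for token in tokens),
    }

print(extract_features("The visa request was not granted; D.O.B 12/05/1973.", recipients=3))
```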


Research Question: Can we produce a classifier that can predict the presence of sensitive material within unstructured text?

Measure: Balanced Accuracy, the arithmetic mean of the True Positive Rate and True Negative Rate, with random = 0.5000

Test Collection:
Total Documents: 1849
Total Section 27: 208
Total Section 40: 142
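
The Balanced Accuracy measure is simply the arithmetic mean of the true positive rate and the true negative rate, so a random classifier scores 0.5. The sketch below makes the arithmetic explicit; the confusion-matrix counts are invented for illustration.

```python
# Balanced accuracy = mean of the true positive rate and the true negative rate.
# The counts below are invented for illustration.
def balanced_accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    tpr = tp / (tp + fn)   # recall on sensitive documents
    tnr = tn / (tn + fp)   # recall on non-sensitive documents
    return (tpr + tnr) / 2

print(balanced_accuracy(tp=120, fn=88, tn=980, fp=661))  # roughly 0.59
```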

Overview of Classification

[Diagram: the Test Collection is split; the classifier is learned on training data to produce a Learned Model, which is then run on unseen data to produce Predictions.]

Evaluation Methodology

[Diagram: assessor judgements on the Test Collection are compared with the Classifier Predictions; statistical analysis of the comparison produces the Results.]

Results

By adding features to a tf/idf text classification baseline, we see noticeable improvement in both Section 27 and Section 40 predictions.

But there is still much work to be done!

Balanced Accuracy
Features               s27     s40
Text Classification    0.6327  0.6344
+ Source Count         0.6369  0.6303
+ Country Count        0.6453  0.6406
+ Country Risk Score   0.6417  0.6368
+ DOB Score            0.6327  0.6391
+ Negation Score       0.6378  0.6382


Test Collection - Aims
● To provide sensitivity judgements and training data to develop and measure tools
● To measure and understand assessors’ behaviour


Test Collection - Measurements
● Time
● Agreement of sensitivity
  – Not previously studied (a sketch of one agreement measure follows below)
● Hard Judgements
● Identify borderline cases
● Sensitivities sub-categories
  – Good indicator for features
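
One way to quantify the "agreement of sensitivity" measurement above is an inter-assessor agreement statistic such as Cohen's kappa; the judgements below are invented for illustration, and scikit-learn is an assumed toolkit.

```python
# Sketch: agreement between two assessors' sensitivity judgements via Cohen's kappa.
# The judgements are invented; 1 = judged sensitive, 0 = not sensitive.
from sklearn.metrics import cohen_kappa_score

assessor_a = [1, 0, 0, 1, 0, 1, 0, 0]
assessor_b = [1, 0, 1, 1, 0, 0, 0, 0]

# kappa of 1.0 means perfect agreement; 0.0 means agreement no better than chance.
print(cohen_kappa_score(assessor_a, assessor_b))
```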

The Abaca Project - Where Next?

● Understanding the real digital environment
  – Changes in working practice
● Testing our proof-of-concept system against real data
● More, wider and deeper
  – More exemptions, more data, more features
  – BIS, HO, MOJ, FCO, ... and more to come!
  – Funding

Questions and Feedback


Break-Out Groups

Aims:
Discuss sensitivity review in the Welsh Government and language context.
Share your understanding and develop some ideas.

Break-Out Groups - Questions:

1. What digital records does The Welsh Government create?

2. What sort of sensitivities are expected within these digital records?

3. What aspects of the sensitivity review process could be technically supported by a software tool or system?

4. What document features could be used to identify the expected sensitivities?


Contact

http://projectabaca.wordpress.com/

graham.mcdonald@glasgow.ac.uk

