Data Mining with AURA
Jim Austin
University of York
&
Cybula Ltd
22 Oct 2001 2
Overview
• AURA
– Background to AURA
– Brief overview of its components
– Its implementation
• AURA within UK e-Science
– What is e-Science
– The DAME pilot project
– Use of AURA in DAME
• GRID issues in DM
The AURA Technology
• Neural network based associative storage
• Set of tools to build fast pattern recognition systems
• Aimed at unstructured data
• Aimed at large datasets
• Scalable technology
AURA as a basis for search
• The game is to remove the chaff using AURA.
• Later processes find the exact match.
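The two-stage idea above (a cheap filter first, an exact matcher afterwards) can be sketched as follows. This is only an illustration: the word-overlap filter stands in for the AURA stage, the substring test stands in for the later exact process, and the function names and records are hypothetical.

```python
def coarse_filter(query, records):
    """Cheap stage: keep any record sharing at least one word with the query.

    A stand-in for the AURA filtering stage; it may return false positives.
    """
    query_words = set(query.split())
    return [rec for rec in records if query_words & set(rec.split())]


def exact_match(query, candidates):
    """Slower, precise stage: run only on the surviving candidates."""
    return [rec for rec in candidates if query in rec]


# Hypothetical records, invented for this example.
records = [
    "engine vibration high",
    "fuel pump fault",
    "engine oil pressure low",
    "cabin light fault",
]
query = "engine oil"

candidates = coarse_filter(query, records)   # chaff removed
final = exact_match(query, candidates)       # exact match found
```

The filter is allowed to be approximate because anything it lets through is checked again by the exact stage.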
The storage system
• Correlation Matrix Memory based
• Exploits threshold logic methods
• Uses distributed encoding of information
• Implemented using binary ‘weights’ for efficient software and hardware implementation
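A minimal sketch of a binary correlation matrix memory, assuming the usual textbook formulation (weights set by OR-ing the outer product of input and output patterns, recall by summing and thresholding). The class and variable names are hypothetical and this is not the AURA library API.

```python
class BinaryCMM:
    """Toy binary correlation matrix memory (illustrative only, not the AURA API)."""

    def __init__(self, n_in, n_out):
        # One binary weight per (input bit, output bit) pair, all initially 0.
        self.weights = [[0] * n_out for _ in range(n_in)]

    def store(self, p, r):
        # Set weight (i, j) to 1 wherever input bit i and output bit j are both 1.
        for i, p_bit in enumerate(p):
            if p_bit:
                for j, r_bit in enumerate(r):
                    self.weights[i][j] |= r_bit

    def recall(self, p, threshold):
        # Sum the weight rows selected by the set input bits, then threshold.
        sums = [0] * len(self.weights[0])
        for i, p_bit in enumerate(p):
            if p_bit:
                for j, w in enumerate(self.weights[i]):
                    sums[j] += w
        return [1 if s >= threshold else 0 for s in sums]


cmm = BinaryCMM(n_in=8, n_out=6)
key_a, val_a = [1, 0, 0, 1, 0, 0, 0, 0], [1, 0, 0, 0, 1, 0]
key_b, val_b = [0, 1, 0, 0, 0, 0, 1, 0], [0, 0, 1, 0, 0, 1]
cmm.store(key_a, val_a)
cmm.store(key_b, val_b)

# Threshold set to the number of 1s in the query, so every active row must agree.
recalled_a = cmm.recall(key_a, threshold=sum(key_a))
recalled_b = cmm.recall(key_b, threshold=sum(key_b))
```

Both stored patterns share one weight matrix, which is the distributed encoding; the threshold decides which output bits count as a match.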
[Figure: correlation matrix memory diagram; labels: Inputs, weights, Threshold T, M, P, R]
Why is it fast?
• Access only rows that are activated by inputs.
• Inputs are made as sparse as possible and of fixed weight (a fixed number of set bits).
• Only need to sum over active rows (bit vectors) – ideal for most processors
• Great for bit vector machines (DAP!).
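The points above can be illustrated with bit-packed rows and a sparse input given as a list of set-bit positions. This is a sketch under those assumptions; the function name and the example weights are hypothetical.

```python
def recall_sparse(rows, active, n_out, threshold):
    """Recall from bit-packed weight rows, reading only the activated rows.

    rows:   list of ints, each int holding one weight row as a bit vector
    active: positions of the set bits in the (sparse) input
    Returns the output positions whose column count reaches the threshold.
    """
    counts = [0] * n_out
    for i in active:                      # rows for inactive inputs are never touched
        row = rows[i]
        for j in range(n_out):
            counts[j] += (row >> j) & 1   # add bit j of this row to column j
    return [j for j in range(n_out) if counts[j] >= threshold]


# Five weight rows of six bits each (hypothetical contents).
rows = [0b000011, 0b000000, 0b001010, 0b000011, 0b110000]
active = [0, 3]                           # input with fixed weight 2

matches = recall_sparse(rows, active, n_out=6, threshold=len(active))
rows_read = len(active)                   # 2 of the 5 rows were accessed
```

The work grows with the number of set input bits rather than with the input width, which is why sparse, fixed-weight inputs matter.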
Use of the CMM
[Figure: Query and Data → CMM system → Data subset → Slow algorithm → Final data]
CMM system
[Figure: pipeline with stages Pre-process → CMM system → Post process; labels: Operations, Prepare data]
Pre-processing
• Implements a number of pre-processors
– N-grams for text strings
– CMAC for numeric data
– Graphs for images and graphics
– Tokens for logical data
– Quantisation for time series
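As an illustration of the n-gram case, the sketch below turns a text string into a fixed-length binary vector by hashing each n-gram to a bit position. The hashing step, the vector length and the function name are assumptions made for this example; the slides do not say how AURA maps n-grams to bits.

```python
import zlib


def ngram_vector(text, n=3, n_bits=64):
    """Encode text as a binary vector with one bit set per hashed n-gram."""
    bits = [0] * n_bits
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        # crc32 is used because it is stable between runs, unlike built-in hash().
        bits[zlib.crc32(gram.encode("utf-8")) % n_bits] = 1
    return bits


v_short = ngram_vector("turbine")
v_long = ngram_vector("turbines")

# Every trigram of "turbine" also occurs in "turbines",
# so every bit set in v_short is also set in v_long.
shared = sum(a & b for a, b in zip(v_short, v_long))
```

Strings that share n-grams share set bits, so a binary memory can match them approximately.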
Post processing
• Data selected by the CMM must be accessed quickly.
• Uses ‘best bit index’ method to match output data and recover stored data.
Implementation
• The AURA C++ library
• Implemented on PC or workstation
• Beowulf parallel cluster
• Origin 2000 supercomputer
• Bespoke hardware
AURA parallel implementation
• 28 dedicated PCI based processors
• Beowulf configuration
• 3.5Gb memory size
• Cortex-1
UK e-Science
• Aims to build on the concept of Grids
– To make computing and data provision as direct and simple as electrical power delivery
• £110M initiative started 18 months ago
• DAME is a £3.5M pilot project to demonstrate its application in the engineering field.
DAME Objectives
• DAME: Distributed Aircraft Maintenance Environment.
• Demonstrate diagnostic capability on the GRID
• Examine timeliness properties of the GRID
• Demonstrate on the Rolls-Royce aero-engine diagnostic problem
Rolls-Royce
University of Oxford, Lionel Tarassenko.
University of Leeds, Peter Dew, Alison McKay.
University of York, J Austin, J McDermid, A Wellings.
University of Sheffield, P Fleming.
Rolls-Royce, Derby.
Data Systems & Solutions.
Cybula Ltd.
Engine flight data
[Figure: Airline office, Maintenance Centre, European data center, London Airport, New York Airport, American data center and Diagnostics centre, linked by the Grid]
Diagnostic issues
• The system must analyse and report
– Novel engine operation
– Identify any cause of events
– Do this quickly
• Data
– Large (many Tb)
Data – Zmod plots
How does AURA contribute?
• Search technology for multi-media data
• Parallel pattern match engine based on neural networks.
• Built on Correlation Matrix Memories.
• High performance Beowulf and dedicated hardware implementations.
• Commercially sold by Cybula Ltd.
[Figure: system diagram; labels: Engine data, Quote, Novelty indication, Data used to identify novelty, Data reduction processes, Features, Data stores/data warehouse, Diagnostic station, Data to be searched for, Pattern match results, Match requests, AURA-G, GRID, Diagnosis]
Simple example of processing chain
[Figure: Data sample → DM coding → CMM → Matching previous events]
Typical pre-processing
• DM coding: 01101111011110111 (1 up and 0 down)
• Fast
• Preserves information
• Produces a binary vector
[Figure: plot with axes Time and Frequency]
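The "1 up and 0 down" coding can be sketched as below. This is a simplified reading: it compares each sample with the previous one, and coding a flat step as 0 is an assumption, since the slide does not say how ties are handled. The function name and sample values are hypothetical.

```python
def dm_code(samples):
    """Code a series as bits: 1 if a sample rises above the previous one, else 0.

    Assumption: an unchanged value is coded as 0 (the slide only covers up and down).
    """
    return [1 if curr > prev else 0 for prev, curr in zip(samples, samples[1:])]


samples = [3, 5, 4, 6, 7, 7, 2]      # hypothetical readings
code = dm_code(samples)              # one bit per step between samples
bit_string = "".join(str(b) for b in code)
```

The result is a binary vector one element shorter than the input, ready to be presented to a binary memory.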
AURA-G
• This is a Globus enabled AURA implementation.
• Developed under DAME
• Will be available end of 2002 for use in other problems.
AURA-G
• Support of scalable pattern matching
• Supports distributed search, across multiple CMM engines at different sites
• OGSA compliant
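A search spread over several engines can be pictured as a fan-out followed by a merge. The sketch below runs everything locally with threads and makes no Globus or OGSA calls; the site names, event names and functions are hypothetical stand-ins, not the AURA-G interface.

```python
from concurrent.futures import ThreadPoolExecutor


def search_shard(shard, query_bits, threshold):
    """Score every stored pattern at one site by its bit overlap with the query."""
    site, patterns = shard
    hits = []
    for name, bits in patterns.items():
        score = bin(bits & query_bits).count("1")   # number of shared set bits
        if score >= threshold:
            hits.append((score, site, name))
    return hits


def distributed_search(shards, query_bits, threshold):
    """Send the same query to every site in parallel and merge the hits."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partial = pool.map(lambda s: search_shard(s, query_bits, threshold), shards)
        merged = [hit for hits in partial for hit in hits]
    # Best score first; ties broken by site and pattern name for a stable order.
    return sorted(merged, key=lambda h: (-h[0], h[1], h[2]))


# Two hypothetical sites, each holding bit-vector patterns.
shards = [
    ("site-a", {"event-1": 0b101100, "event-2": 0b000011}),
    ("site-b", {"event-3": 0b101110, "event-4": 0b010000}),
]
results = distributed_search(shards, query_bits=0b101100, threshold=3)
```

Each site only returns its own hits, so the data stays where it is stored and only match results travel.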
Grid Issues in Data Mining
• Data provenance
• Standards:
– Data transparency independent of location
– Managing DB/Data mining link in distributed system
– OGSA DAI
Conclusions
• AURA is a mature component for data search and retrieval
• Robust software and hardware implementation available
• Grid applications in e-Science are underway
Contacts
Jim Austin
Dept Computer Science, University of York, York, YO10 5DD.
www.cs.york.ac.uk/arch
01904 432734
01904 432767
Cybula Ltd.
www.cybula.com
01377 236382
DAME: www.cs.york.ac.uk/dame