Post on 19-Jan-2016

Page 1: Karen Loughran

The Queen’s University of Belfast

GEDDM: Comparisons of OGSA-DAI and GridFTP for access to and conversion of remote unstructured data in legal data mining

Karen Loughran

Page 2: Karen Loughran

Introduction

Grid Enabled Distributed Data Mining (GEDDM)

Industrial partner

Overview of GEDDM

GEDDM Common Semantic Model (CSM) objectives

Grid enabled solution

Page 3: Karen Loughran

Industrial Partner - Datactics

Northern Ireland based (formed 1999)

Specialising in grid enabled “data-centric” matching across multiple sectors

Datactics technology is fully parallelised

Computationally intensive - need to compare every record with every other record

Improve data quality by applying fuzzy matching techniques

Data mining software being used in the real world
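The "every record with every other record" comparison can be illustrated with a minimal sketch. It assumes a simple string-similarity score from Python's standard difflib module as a stand-in for fuzzy matching; it is not the Datactics technology, and the function names and the 0.85 threshold are hypothetical. For n records the loop performs n(n-1)/2 comparisons, which is why the workload is computationally intensive and worth parallelising.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Case-insensitive similarity score between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_pairs(records, threshold=0.85):
    """Compare every record with every other record (n*(n-1)/2 pairs)
    and return (i, j, score) for the pairs scoring at or above threshold."""
    matches = []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        score = similarity(a, b)
        if score >= threshold:
            matches.append((i, j, round(score, 3)))
    return matches


# Hypothetical sample: the second name is a misspelling of the first.
records = ["Annie Saunders", "Anie Saunders", "Dede H Smith"]
pairs = fuzzy_pairs(records)
print(pairs)
```

Each pair is independent of the others, so in a sketch like this the pair list could be split across worker processes or nodes without changing the result.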

Page 4: Karen Loughran

GEDDM Business Driver

Data sources: numerous structures, formats, locations, administrative domains…

Client: US County Court, insider trading litigation case, 45Tb

Variety of formats: Email, pdf, weblogs, DBMS, report text dumps …

How to interface to large volumes of data in common structured parallel approach

Page 5: Karen Loughran

Common Semantic Model (CSM) Objectives

Representation of unstructured data such as email, weblog, report dumps.

Conversion to structured format.

Evaluation of Grid technologies for access and conversion.

Secure, reliable and scalable.

Exploit high bandwidth.

Page 6: Karen Loughran

CSM Grid Enabled Solution

Two Stages:

1. Represent and convert unstructured Flat File Formats (FFF) to structured Common Output Format File (COFF).

2. Investigate Grid technologies for the remote access and conversion of unstructured data.

Page 7: Karen Loughran

CSM Representation & Conversion

Data Description Language (DDL) - XSD

Data Description File (DDF)

Parser

Page 8: Karen Loughran

Sample FFF data source & DDF

App      Account       Address                  Balance
IMP      343818        Dede H Smith             8600.76
                       181 Glen Rd
                       Earls Court, London
IMP      565777        Annie Saunders           9905.50
                       60 Newhaven St
                       Edinburgh, Scotland

___________________________________________________________________

<datasource>
  <database>
    <header><headertext>App Account Address Balance </headertext></header>
    <rectype eorecord="\n">
      <pfield name="App" pos="1" length="3"/>
      <pfield name="Account" pos="10" length="6"/>
      <pfield name="Address" pos="24" length="23" multiline="yes"/>
      <pfield name="Balance" pos="49" length="8"/>
    </rectype>
  </database>
</datasource>
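To make the relationship between the DDF and the flat file concrete, the following is a minimal sketch of applying positional field descriptions to fixed-width lines. It is not the GEDDM parser. It assumes that pos is 1-based, that attribute values are quoted so the DDF is well-formed XML, and that a continuation line of a multiline field leaves the first field blank; the function names and the ", " join are hypothetical choices.

```python
import xml.etree.ElementTree as ET

# Trimmed DDF, following the slide's element and attribute names.
DDF = """<datasource><database><rectype>
  <pfield name="App" pos="1" length="3"/>
  <pfield name="Account" pos="10" length="6"/>
  <pfield name="Address" pos="24" length="23" multiline="yes"/>
  <pfield name="Balance" pos="49" length="8"/>
</rectype></database></datasource>"""


def row(app, account, address, balance):
    """Build a fixed-width line with columns starting at positions 1, 10, 24, 49."""
    return (app.ljust(9) + account.ljust(14) + address.ljust(25) + balance).rstrip()


SAMPLE = [
    row("IMP", "343818", "Dede H Smith", "8600.76"),
    row("", "", "181 Glen Rd", ""),
    row("", "", "Earls Court, London", ""),
    row("IMP", "565777", "Annie Saunders", "9905.50"),
    row("", "", "60 Newhaven St", ""),
    row("", "", "Edinburgh, Scotland", ""),
]


def load_fields(ddf_text):
    """Return (name, zero-based start, length, multiline) for each pfield."""
    root = ET.fromstring(ddf_text)
    return [(f.get("name"), int(f.get("pos")) - 1, int(f.get("length")),
             f.get("multiline") == "yes") for f in root.iter("pfield")]


def parse_records(lines, fields):
    records = []
    first_name = fields[0][0]
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        values = {name: line[start:start + length].strip()
                  for name, start, length, _ in fields}
        # Assumption: a blank first field marks a continuation line.
        if records and not values[first_name]:
            for name, _, _, multiline in fields:
                if multiline and values[name]:
                    records[-1][name] += ", " + values[name]
        else:
            records.append(values)
    return records


records = parse_records(SAMPLE, load_fields(DDF))
for rec in records:
    print(rec)
```

Driving the slicing from the DDF rather than hard-coding offsets is what lets one parser handle many flat file layouts.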

Page 9: Karen Loughran

Parser Design

Object oriented component hierarchy

Each object represents an XML element

Encapsulates data relating to the flat file component it describes

Encapsulates the all-important “parse” method

SAX parse performed on DDF to build up internal OO representation of FFF

Parse called on top level object.
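The design described on this slide can be sketched roughly as follows, using Python's standard xml.sax module. This is an illustration under assumptions, not the GEDDM code: the class names Component, PField and DDFHandler are hypothetical, and only single-line positional fields are handled. A SAX pass over the DDF builds one object per XML element, and calling parse on the top-level object delegates down the hierarchy.

```python
import xml.sax


class Component:
    """One object per DDF element; holds the element's attributes and children."""

    def __init__(self, tag, attrs):
        self.tag = tag
        self.attrs = dict(attrs.items())
        self.children = []

    def parse(self, line):
        """Default behaviour: delegate to the children and merge their results."""
        result = {}
        for child in self.children:
            result.update(child.parse(line))
        return result


class PField(Component):
    """Positional field: slices its own columns out of the line (pos is 1-based)."""

    def parse(self, line):
        start = int(self.attrs["pos"]) - 1
        end = start + int(self.attrs["length"])
        return {self.attrs["name"]: line[start:end].strip()}


class DDFHandler(xml.sax.ContentHandler):
    """SAX handler that builds the object hierarchy as elements open and close."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.root = None

    def startElement(self, name, attrs):
        cls = PField if name == "pfield" else Component
        node = cls(name, attrs)
        if self.stack:
            self.stack[-1].children.append(node)
        else:
            self.root = node
        self.stack.append(node)

    def endElement(self, name):
        self.stack.pop()


def build(ddf_text):
    handler = DDFHandler()
    xml.sax.parseString(ddf_text.encode("utf-8"), handler)
    return handler.root


DDF = ('<datasource><database><rectype>'
       '<pfield name="App" pos="1" length="3"/>'
       '<pfield name="Account" pos="10" length="6"/>'
       '</rectype></database></datasource>')

root = build(DDF)
parsed = root.parse("IMP      343818")
print(parsed)
```

Because each class owns its own parse behaviour, supporting a new kind of DDF element in a sketch like this means adding one subclass rather than changing the traversal.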

Page 10: Karen Loughran

CSM Grid technologies

Transfer & conversion tools: OGSA-DAI (Version 4), GridFTP (GT4.0.0)

GUI interfacing to both of these technologies.

Page 11: Karen Loughran

GUI interface – access & conversion

[Flow diagram. Labels: Unstructured FFF Data; View Sample; Describe (DDF); Convert; Data Conversion Services; Conversion Module; Structured Data (COFF); View Results (COFF); OK ? (Yes / No); Complete.]

GUI Interface to sample remote FFF, DDF creation and conversion.

Page 12: Karen Loughran

Implementation under OGSA-DAI

OGSA-DAI 4.0.0

Globus Toolkit 3.2.1

New conversion activity designed & implemented

Calls out to Python scripts to perform conversion

Page 13: Karen Loughran

Implementation under GridFTP

Globus Toolkit 4.0.0

Data Storage Interface (DSI) creation to perform conversion processing at server

Instead of original unstructured FFF, send the COFF file back to client

Set up striped server architecture – multiple nodes working together in parallel.
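The idea behind striping, several nodes each serving part of one file, can be shown with a small sketch. It assumes a round-robin assignment of fixed-size blocks to nodes purely for illustration; it does not reproduce GridFTP's actual layout logic or any Globus API, and the function name and parameters are hypothetical.

```python
def stripe_layout(file_size, n_nodes, block_size):
    """Assign consecutive blocks of a file to nodes in round-robin order.

    Returns {node_index: [(offset, length), ...]} covering the whole file.
    """
    if n_nodes < 1 or block_size < 1:
        raise ValueError("n_nodes and block_size must be positive")
    layout = {node: [] for node in range(n_nodes)}
    offset = 0
    block = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        layout[block % n_nodes].append((offset, length))
        offset += length
        block += 1
    return layout


# Hypothetical example: a 10-byte file, 2 nodes, 4-byte blocks.
layout = stripe_layout(10, 2, 4)
print(layout)
```

With a layout like this, each node reads and sends only its own byte ranges, so disk and network load is spread across the nodes instead of concentrated on one server.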

Page 14: Karen Loughran

GridFTP Striped Architecture

[Diagram: GridFTP striped architecture. Labels: Host A, Host B, Host C; Host X, Host Y, Host Z; sites London and Belfast; six RAID stores.]

Page 15: Karen Loughran

GridFTP Machine Specifications

BELFAST: AMD4400 Dual Processor; 4 GB RAM; 1 Terabyte hard disk, serial ATA2; 1 Gigabit ethernet

LONDON: Dual Opteron Processor; 4 GB RAM; 1 Terabyte hard disk; 1 Gigabit ethernet

Page 16: Karen Loughran

GridFTP Evaluation Tests

Attempted conversion and access to large files across the network.

File sizes: 13Mb, 26Mb, 52Mb, 103Mb, 205Mb, 409Mb, 817Mb, 1634Mb

Buffer sizes: Default, 4915, 409150, 785408

MTU 1400 - 8000
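A common starting point when choosing TCP buffer sizes for tests like these is the bandwidth-delay product: the amount of data that must be in flight to keep a link full. The sketch below shows the arithmetic only. The round-trip times used are hypothetical (the slides do not report one), the function names are illustrative, and treating the slide's buffer sizes as bytes is an assumption.

```python
def bdp_bytes(bandwidth_bits_per_s, rtt_ms):
    """Bandwidth-delay product in bytes, using integer arithmetic.

    bandwidth_bits_per_s: link speed in bits per second
    rtt_ms: round-trip time in milliseconds
    """
    return bandwidth_bits_per_s * rtt_ms // (8 * 1000)


def fills_link(buffer_bytes, bandwidth_bits_per_s, rtt_ms):
    """True if a single stream's buffer is at least the bandwidth-delay product."""
    return buffer_bytes >= bdp_bytes(bandwidth_bits_per_s, rtt_ms)


GIGABIT = 1_000_000_000

# Hypothetical 10 ms round trip on a 1 Gbit/s path.
print(bdp_bytes(GIGABIT, 10))
print(fills_link(785408, GIGABIT, 10))
```

Under these assumed numbers a single stream would need a buffer of about 1.25 MB to fill the link, which is one reason buffer size is usually tuned alongside parallel streams and striping.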

Page 17: Karen Loughran

OGSA-DAI Benchmark Results

Currently no results available:

Socket Timeout Error and Engine receives a terminate signal when Activity takes longer than approximately 10 minutes to run.

DeliverToGridFTP activity would not work in version 4. Patches required. So far, unable to get working with these patches.

Security setup issues.

Page 18: Karen Loughran

GridFTP Network Topology

[Network diagram. Nodes: Queens, Queens BESC Router, BBC NI, BBC Router, BBC London, Janet Bar. Link speeds shown: four 1GBit links and one 100MBit link.]

Page 19: Karen Loughran

Results – GridFTP transfer

Throughput hindered by: Physical Infrastructure/Service Provider - 80 Mb/s; Router/switches/NIC

808 Mb/s CPU to CPU (London to Belfast)

688 Mb/s Disk to Disk (BBC NI)

Striping with 2 BE servers - 60% improvement

Local 100 Mb/s switch: Disk to disk – 82 Mb/s

Page 20: Karen Loughran

OGSA-DAI Evaluation ….

DeliverToGridFTP not working in 4.0.0

Configuring GridFTP not possible (buffer sizes, no. of streams, striped transfer etc.)

Some way to go in efficient transfer of large files.

Installation/runtime overheads

Design/code conversion activity & design perform documents for access/conversion

Timeouts converting large files. Threads may be solution.

Clear documentation

Page 21: Karen Loughran

GridFTP Evaluation

Secure, reliable, fast and scalable

Lightweight installation

Optimum use of high bandwidth networks

Extra ERET/ESTO processing allows tighter integration of conversion operations through the definition of a DSI

Striping for much improved efficiency

Page 22: Karen Loughran

GridFTP Evaluation

Extensive tuning required

No clear documentation for writing a DSI. [email protected] useful source of info

Poor performance on NFS. PVFS-like filesystem recommended for striping.

1Gbit bandwidth in practice difficult to achieve due to problems with: Router, NIC, Physical Infrastructure

Page 23: Karen Loughran

Conclusions

Investigated grid technologies for remote access & conversion

OGSA-DAI disappointing due to lack of support for large file transfer

GridFTP involved extensive configuration and, due to network infrastructure problems, it was difficult to get optimum performance in remote transfer

Page 24: Karen Loughran

Future work

Tighter integration of conversion services within GridFTP DSI server module.

Extend the services under GridFTP to cope with Distributed Query Processing.

COFF produced as XML, ready for XPATH queries.
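As an illustration of why an XML COFF is convenient for querying, the sketch below writes converted records as XML and runs a path query over them with Python's standard xml.etree.ElementTree, which supports a limited XPath subset. The element names coff, record and field are hypothetical; the slides do not specify the COFF schema.

```python
import xml.etree.ElementTree as ET


def to_coff(records):
    """Wrap a list of {field name: value} dicts in a hypothetical COFF XML tree."""
    root = ET.Element("coff")
    for rec in records:
        node = ET.SubElement(root, "record")
        for name, value in rec.items():
            ET.SubElement(node, "field", name=name).text = value
    return root


# Values taken from the sample FFF data source shown earlier in the slides.
records = [
    {"App": "IMP", "Account": "343818", "Balance": "8600.76"},
    {"App": "IMP", "Account": "565777", "Balance": "9905.50"},
]

root = to_coff(records)

# XPath-style query: every Account field of every record.
accounts = [f.text for f in root.findall("./record/field[@name='Account']")]
print(accounts)
print(ET.tostring(root, encoding="unicode"))
```

A full XPath engine would allow richer predicates than ElementTree's subset, but the same structure applies: once the data is in COFF, queries no longer depend on the original flat file layout.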

Page 25: Karen Loughran

Questions ?