Post on 01-Jan-2016

symplectic.co.uk

VIVO ISF: Investigating Speed Factors

Graham Triggs, Head of Repository Systems

graham@symplectic.co.uk

@grahamtriggs

About the title...

This is not VIVO-ISF versus pre-ISF

This is..

Practical use of VIVO 1.8

Challenges encountered

Solutions and suggestions

Loading Data

                      Demo       Client #1   Client #2
Users                 136        27,489      5,544
External Co-authors   ~46,000    ~120,000    ~140,000
Articles              ~36,000    ~110,000    ~150,000

Events                ~8,000

Asserted Triples      6,683,071    12,372,999
Inferred Triples      6,848,955    12,236,798
Total Triples         13,532,026   24,609,797

Datasets

r3.large

• optimized for memory-intensive applications
• 2 vCPU (Intel Xeon E5-2670 v2 Ivy Bridge)
• 15.25 GiB memory
• 32 GB SSD instance storage
• added 50 GB SSD general purpose (gp2) storage

Demo Server

24 hours – data still not loaded

Unreserved SSD = IO limited by volume size

Small disks = low IO

(AWS gp2 = max 128 MiB/s, rising to 160 MiB/s; 3 IOPS per GiB)

4000 IOPS provisioning max, at $0.065 per IOPS-month ($260)

http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSVolumeTypes.html

IO Problems
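
The figures on this slide reduce to simple arithmetic. The sketch below uses only the numbers quoted above (3 IOPS per GiB for gp2, $0.065 per provisioned IOPS-month); the function names are illustrative, and treating the added 50 GB volume as 50 GiB is an assumption.

```python
# Figures quoted on the slide.
GP2_IOPS_PER_GIB = 3            # gp2 baseline: 3 IOPS per GiB
PRICE_PER_IOPS_MONTH = 0.065    # USD per provisioned IOPS-month


def gp2_baseline_iops(size_gib):
    """Baseline IOPS a gp2 volume of the given size is entitled to."""
    return size_gib * GP2_IOPS_PER_GIB


def provisioned_iops_cost(iops):
    """Monthly cost in USD of provisioning the given number of IOPS."""
    return iops * PRICE_PER_IOPS_MONTH


# Assumption: the added 50 GB gp2 volume is treated as 50 GiB.
print(gp2_baseline_iops(50))          # baseline IOPS for the demo volume
print(provisioned_iops_cost(4000))    # monthly cost of the 4000 IOPS maximum
```

On these numbers the demo volume's baseline is 150 IOPS, and provisioning the 4000 IOPS maximum costs about $260 per month before storage charges.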

• Amazon EBS Provisioned IOPS (SSD) volumes
  • $0.125 per GB-month of provisioned storage
  • $0.065 per provisioned IOPS-month
• EX40-SSD
  • 32 GB RAM, 2x240 SSD, i7-4770
  • ~60 euros
  • Load time: ~3 hours (plus inferencing / indexing)

New Server

fio               AWS VM      Dedicated
Read IOPS         155         91937
Read Bandwidth    636 KB/s    367.7 MB/s
Write IOPS        23          11345
Write Bandwidth   96 KB/s     45.3 MB/s

IO Comparison
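
To put the gap in proportion, the sketch below divides the dedicated server's IOPS by the AWS VM's, using only the fio figures from the table. The dictionary keys and the `speedup` helper are illustrative names.

```python
# fio results from the table above.
aws_vm = {"read_iops": 155, "write_iops": 23}
dedicated = {"read_iops": 91937, "write_iops": 11345}


def speedup(metric):
    """How many times faster the dedicated server is for one metric."""
    return dedicated[metric] / aws_vm[metric]


for metric in aws_vm:
    print(f"{metric}: about {speedup(metric):.0f}x")
```

Both ratios come out in the hundreds: roughly 593x for reads and 493x for writes.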

2.0 GB RDF/XML

3.2 GB MySQL database (pre inference)

6.1 GB MySQL database (post inference)

Transfer slows dramatically after ~1 GB written

Regains speed after ~2 GB

MySQL – Demo Dataset

Processing Data

Fast (~8-12ms per individual)

However…

2 million individuals = 6-7 hours

Large datasets still slow down (up to 60ms per individual)

Memory problems

Suspect IndexListener

Inferencing
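
A per-individual cost that looks fast still adds up at scale. The sketch below converts the per-individual timings quoted above into total hours for 2 million individuals; `total_hours` is an illustrative helper, and it assumes the cost per individual stays constant across the run.

```python
def total_hours(individuals, ms_per_individual):
    """Total inferencing time in hours at a constant per-individual cost."""
    return individuals * ms_per_individual / 1000 / 3600


for ms in (8, 12, 60):
    print(f"{ms} ms per individual: {total_hours(2_000_000, ms):.1f} hours")
```

At 12 ms per individual this gives about 6.7 hours, in line with the 6-7 hours on the slide. At the 60 ms seen on large datasets, the same 2 million individuals would take about 33 hours.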

Query for graphs
• Co-authorship

Client #1
• SDB – 10 secs
• TDB – 1 sec

Triple store performance

Using YourKit profiler to show SQL executed

No evidence of complex queries

Combined predicates, functions appear to be processed in Java

Is performance of TDB down to in-memory vs SQL parsing?

Simple SQL Queries

select g, count(*) from Quads where g IN (-364693509095697557, 786347385076487474) GROUP BY g;

24 seconds

select count(*) from Quads;

14.72 seconds

select count(g) from Quads where g = 786347385076487474;

4.16 seconds

MySQL Performance

Total rows: 24,647,663
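
Timings like these can be collected with a small harness. The sketch below is not the setup used for the slide: it runs the same three query shapes against an in-memory SQLite database as a stand-in for MySQL, with a tiny made-up `Quads` table. The column names `g, s, p, o` and the row counts are assumptions for illustration, so the timings it prints say nothing about the 24.6 million row table.

```python
import sqlite3
import time


def timed(conn, sql, params=()):
    """Run a query and return (rows, elapsed_seconds)."""
    start = time.perf_counter()
    rows = conn.execute(sql, params).fetchall()
    return rows, time.perf_counter() - start


# Stand-in database: SQLite in memory, not the MySQL server from the slide.
conn = sqlite3.connect(":memory:")
# Assumed layout: one row per quad, each column holding a 64-bit hash.
conn.execute("CREATE TABLE Quads (g INTEGER, s INTEGER, p INTEGER, o INTEGER)")

G1, G2 = -364693509095697557, 786347385076487474  # graph ids from the slide
sample = [(G1, i, 1, i) for i in range(1000)]     # made-up rows
sample += [(G2, i, 1, i) for i in range(3000)]    # made-up rows
conn.executemany("INSERT INTO Quads VALUES (?, ?, ?, ?)", sample)

per_graph, t_group = timed(
    conn, "SELECT g, COUNT(*) FROM Quads WHERE g IN (?, ?) GROUP BY g", (G1, G2)
)
total, t_total = timed(conn, "SELECT COUNT(*) FROM Quads")
one_graph, t_one = timed(conn, "SELECT COUNT(g) FROM Quads WHERE g = ?", (G2,))

print(f"grouped count: {t_group:.4f} s")
print(f"full count:    {t_total:.4f} s")
print(f"single graph:  {t_one:.4f} s")
```

Pointing the same `timed` helper at a real MySQL connection would need a MySQL driver and its parameter placeholder style in place of `sqlite3`.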

Co-author graph query executed
• On page access
• On GraphML retrieval

Two queries = twice the effort

When each takes 10 secs rather than 1…

Redundant Effort
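
One possible way to avoid running the same query twice is to memoize its result so that the page view and the GraphML download share it. This is a sketch under assumptions, not something the slides prescribe and not how VIVO is implemented: `coauthor_graph`, `render_page` and `render_graphml` are hypothetical names, and the query body is a placeholder.

```python
from functools import lru_cache

query_count = 0  # counts how often the expensive query actually runs


@lru_cache(maxsize=128)
def coauthor_graph(person_uri):
    """Hypothetical stand-in for the slow co-author SPARQL query."""
    global query_count
    query_count += 1
    # Placeholder result: (co-author, shared publication count) pairs.
    return (("coauthor-a", 3), ("coauthor-b", 1))


def render_page(person_uri):
    """Page access: needs the graph to draw the visualisation."""
    return len(coauthor_graph(person_uri))


def render_graphml(person_uri):
    """GraphML retrieval: needs the same graph again."""
    nodes = coauthor_graph(person_uri)
    return "\n".join(f'<node id="{name}"/>' for name, _ in nodes)


render_page("ex:person1")
render_graphml("ex:person1")
print(query_count)  # the query body ran once for both requests
```

A cache like this trades freshness for speed, so it would need invalidating whenever the underlying data changes.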

Number of triples not necessarily relevant

Small queries still execute quickly

Amount of data matched by SPARQL important
• This may include parts of the query
• 1 author may have
  • 90 publications
  • 10 investigator roles (grants)

Result sets vs Triples

Would subproperties give simpler queries with fewer results?

e.g. vivo:hasAuthorship, vivo:hasInvestigatorRole

As subproperties of vivo:relates

Parent property can be inferred and available

Should subproperties be used to ease understanding?

vivo:bearerOf vs obo:RO_0000053

(UI hides ontologies with labels, but not from developers)

So, More Triples?
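
The trade-off can be sketched with plain Python tuples standing in for a triple store. The data is made up to match the earlier example of one author with 90 publications and 10 investigator roles; the two subproperties are the hypothetical ones proposed above, and `infer_parent_triples` and `match` are illustrative helpers rather than any real triple store API.

```python
AUTHOR = "ex:author1"  # made-up individual

# Asserted triples using the proposed (hypothetical) subproperties.
asserted = set()
for i in range(90):
    asserted.add((AUTHOR, "vivo:hasAuthorship", f"ex:authorship{i}"))
for i in range(10):
    asserted.add((AUTHOR, "vivo:hasInvestigatorRole", f"ex:role{i}"))

# Both proposed properties declared as subproperties of vivo:relates.
SUBPROPERTY_OF = {
    "vivo:hasAuthorship": "vivo:relates",
    "vivo:hasInvestigatorRole": "vivo:relates",
}


def infer_parent_triples(triples, subproperty_of):
    """For each triple on a subproperty, derive the parent-property triple."""
    return {(s, subproperty_of[p], o) for s, p, o in triples if p in subproperty_of}


def match(triples, subject, predicate):
    """Objects matched by a single triple pattern (subject, predicate, ?o)."""
    return [o for s, p, o in triples if s == subject and p == predicate]


store = asserted | infer_parent_triples(asserted, SUBPROPERTY_OF)

print(len(match(store, AUTHOR, "vivo:relates")))        # generic pattern
print(len(match(store, AUTHOR, "vivo:hasAuthorship")))  # specific pattern
print(len(asserted), len(store))                        # triples before and after inference
```

In this toy store the generic pattern matches all 100 related nodes, which a publication query would then have to filter, while the specific pattern matches only the 90 authorships. The cost is that the store doubles from 100 to 200 triples once the parent property is inferred.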

Thank you!
