symplectic.co.uk
VIVO ISF: Investigating Speed Factors
Graham Triggs
Head of Repository Systems
@grahamtriggs
About the title...
This is not VIVO-ISF versus pre-ISF
This is..
Practical use of VIVO 1.8
Challenges encountered
Solutions and suggestions
Loading Data
Datasets

                      Demo       Client #1   Client #2
Users                 136        27,489      5,544
External Co-authors   ~46,000    ~120,000    ~140,000
Articles              ~36,000    ~110,000    ~150,000

Events: ~8,000
Asserted Triples: 6,683,071; 12,372,999
Inferred Triples: 6,848,955; 12,236,798
Total Triples: 13,532,026; 24,609,797
Demo Server
r3.large - optimized for memory-intensive applications
• 2 vCPU (Intel Xeon E5-2670 v2 Ivy Bridge)
• 15.25 GiB memory
• 32 GB SSD instance storage
• added 50 GB SSD general purpose (gp2) storage
IO Problems
24 hours – data still not loaded
Unreserved SSD = IO limited by size
Small disks = low IO
(AWS gp2 = max 128 MiB/s, rising to 160 MiB/s; 3 IOPS per GiB)
Provisioned IOPS: 4,000 max, at $0.065 per IOPS-month ($260 per month)
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSVolumeTypes.html
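The figures on this slide can be checked with a little arithmetic. The sketch below assumes the gp2 baseline of 3 IOPS per GiB and the $0.065 per provisioned IOPS-month quoted above, treats the 50 GB demo volume as 50 GiB, and ignores any burst allowance; the function names are illustrative only.

```python
# Back-of-the-envelope EBS figures, using only the numbers quoted on the slides.
GP2_IOPS_PER_GIB = 3              # gp2 baseline rate from the slide
PIOPS_USD_PER_IOPS_MONTH = 0.065  # provisioned IOPS price from the slide


def gp2_baseline_iops(size_gib: int) -> int:
    """Baseline IOPS of a gp2 volume, ignoring bursting (assumption)."""
    return GP2_IOPS_PER_GIB * size_gib


def provisioned_iops_monthly_cost(iops: int) -> float:
    """Monthly charge in USD for the provisioned IOPS alone (storage excluded)."""
    return round(iops * PIOPS_USD_PER_IOPS_MONTH, 2)


demo_volume_iops = gp2_baseline_iops(50)              # the added 50 GB gp2 volume
max_piops_cost = provisioned_iops_monthly_cost(4000)  # the 4,000 IOPS maximum

print(f"50 GiB gp2 baseline: {demo_volume_iops} IOPS")
print(f"4000 provisioned IOPS: ${max_piops_cost:.2f} per month")
```

Under these assumptions a small gp2 volume is limited to a few hundred IOPS at most, which is why the disk size, not the instance type, set the pace of the load.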
New Server
• Amazon EBS Provisioned IOPS (SSD) volumes
  • $0.125 per GB-month of provisioned storage
  • $0.065 per provisioned IOPS-month
• EX40-SSD
  • 32 GB RAM, 2x240 SSD, i7-4770
  • ~60 euros
• Load time: ~3 hours (plus inferencing / indexing)
IO Comparison

fio               AWS VM      Dedicated
Read IOPS         155         91937
Read Bandwidth    636 KB/s    367.7 MB/s
Write IOPS        23          11345
Write Bandwidth   96 KB/s     45.3 MB/s
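To put the gap in proportion, the ratios between the two machines can be computed directly from the fio numbers on this slide. This is only a sketch: it assumes binary units (1 MB = 1024 KB) when comparing bandwidths, and the variable names are illustrative.

```python
# fio results copied from the slide as (AWS VM, dedicated server) pairs.
KB_PER_MB = 1024  # assumption: binary units for the bandwidth figures

results = {
    "Read IOPS": (155, 91937),
    "Read Bandwidth (KB/s)": (636, 367.7 * KB_PER_MB),
    "Write IOPS": (23, 11345),
    "Write Bandwidth (KB/s)": (96, 45.3 * KB_PER_MB),
}

# Ratio of dedicated-server throughput to AWS VM throughput for each metric.
speedups = {name: dedicated / aws for name, (aws, dedicated) in results.items()}

for name, factor in speedups.items():
    print(f"{name}: dedicated is ~{factor:.0f}x the AWS VM")
```

On every metric the dedicated server comes out several hundred times faster than the AWS VM.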
MySQL – Demo Dataset
2.0 GB RDF/XML
3.2 GB MySQL database (pre-inference)
6.1 GB MySQL database (post-inference)
Transfer slows dramatically after ~1 GB written
Regains speed after ~2 GB
Processing Data
Inferencing
Fast (~8-12 ms per individual)
However…
2 million individuals = 6-7 hours
Large datasets still slow down (up to 60 ms per individual)
Memory problems
Suspect IndexListener
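The step from milliseconds per individual to hours of inferencing is plain multiplication. A minimal sketch using the per-individual timings and the 2 million individuals quoted on this slide; the function name is illustrative, and a constant rate across the whole run is an assumption.

```python
def inference_hours(individuals: int, ms_per_individual: float) -> float:
    """Total inferencing time in hours at a constant per-individual cost."""
    return individuals * ms_per_individual / 1000 / 3600


INDIVIDUALS = 2_000_000

# 8 and 12 ms are the "fast" range; 60 ms is the slowdown seen on large datasets.
estimates = {ms: inference_hours(INDIVIDUALS, ms) for ms in (8, 12, 60)}

for ms, hours in estimates.items():
    print(f"{ms} ms per individual: {hours:.1f} hours")
```

At 12 ms per individual the estimate lands in the quoted 6-7 hour range. A run that degraded to 60 ms throughout would take roughly five times as long.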
Triple store performance
Query for graphs
• Co-authorship
Client #1
• SDB – 10 secs
• TDB – 1 sec
Simple SQL Queries
Using YourKit profiler to show SQL executed
No evidence of complex queries
Combined predicates, functions appear to be processed in Java
Is performance of TDB down to in-memory vs SQL parsing?
MySQL Performance
Total rows: 24,647,663

select g, count(*) from Quads where g IN (-364693509095697557, 786347385076487474) GROUP BY g;
24 seconds

select count(*) from Quads;
14.72 seconds

select count(g) from Quads where g = 786347385076487474
4.16 seconds
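Timings like these can be collected with a small harness. The sketch below is an illustration only: it uses an in-memory SQLite table as a stand-in for the MySQL database, assumes a `Quads` table with columns `g, s, p, o`, and fills it with 30,000 made-up rows, so its timings say nothing about the real system. Only the query shapes and the two graph hashes come from the slide.

```python
import sqlite3
import time

# Graph hashes taken from the slide; everything else below is toy data.
G1, G2 = -364693509095697557, 786347385076487474

conn = sqlite3.connect(":memory:")
# Assumed layout: one row per quad, each column holding a 64-bit node hash.
conn.execute("CREATE TABLE Quads (g INTEGER, s INTEGER, p INTEGER, o INTEGER)")
rows = [(G2 if i % 3 == 0 else G1, i, i % 7, i + 1) for i in range(30_000)]
conn.executemany("INSERT INTO Quads VALUES (?, ?, ?, ?)", rows)

QUERIES = {
    "group_by_graph": f"SELECT g, COUNT(*) FROM Quads WHERE g IN ({G1}, {G2}) GROUP BY g",
    "count_all": "SELECT COUNT(*) FROM Quads",
    "count_one_graph": f"SELECT COUNT(g) FROM Quads WHERE g = {G2}",
}


def timed(sql: str):
    """Run a query and return (rows, elapsed seconds)."""
    start = time.perf_counter()
    result = conn.execute(sql).fetchall()
    return result, time.perf_counter() - start


results = {name: timed(sql) for name, sql in QUERIES.items()}

for name, (result, seconds) in results.items():
    print(f"{name}: {result} in {seconds:.4f}s")
```

Running the same three statements against the real database, before and after any tuning, gives a like-for-like comparison with the figures on the slide.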
Redundant Effort
Co-author graph query executed
• On page access
• On GraphML retrieval
Two queries = twice the effort
When each takes 10 secs rather than 1…
Result sets vs Triples
Number of triples not necessarily relevant
Small queries still execute quickly
Amount of data matched by SPARQL important
• This may include parts of the query
• 1 author may have
  • 90 publications
  • 10 investigator roles (grants)
So, More Triples?
Would subproperties give simpler queries with fewer results? e.g.
vivo:hasAuthorship
vivo:hasInvestigatorRole
As subproperties of vivo:relates
Parent property can be inferred and available
Should subproperties be used to ease understanding?
vivo:bearerOf vs obo:RO_0000053
(UI hides ontologies with labels, but not from developers)
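The trade-off can be illustrated with a toy triple set for the example author with 90 publications and 10 investigator roles. This is a sketch in plain Python rather than a real triple store: `vivo:hasAuthorship` and `vivo:hasInvestigatorRole` are the subproperties proposed on this slide, not existing ontology terms, and the `ex:` names, the helper functions and the subject/object direction are simplifying assumptions.

```python
AUTHOR = "ex:author1"
triples = set()


def add_relationship(node: str, rdf_type: str, subproperty: str) -> None:
    """Store both the generic parent property and the proposed subproperty."""
    triples.add((AUTHOR, "vivo:relates", node))  # generic parent property
    triples.add((AUTHOR, subproperty, node))     # proposed specific subproperty
    triples.add((node, "rdf:type", rdf_type))


for i in range(90):
    add_relationship(f"ex:authorship{i}", "vivo:Authorship", "vivo:hasAuthorship")
for i in range(10):
    add_relationship(f"ex:role{i}", "vivo:InvestigatorRole", "vivo:hasInvestigatorRole")


def match(subject: str, predicate: str) -> list:
    """Objects of all triples matching a (subject, predicate, ?o) pattern."""
    return [o for s, p, o in triples if s == subject and p == predicate]


# Generic pattern: match everything via vivo:relates, then filter by type.
candidates = match(AUTHOR, "vivo:relates")
via_parent = [n for n in candidates if (n, "rdf:type", "vivo:Authorship") in triples]

# Specific pattern: the subproperty matches only the authorships.
via_subproperty = match(AUTHOR, "vivo:hasAuthorship")

print(len(candidates), len(via_parent), len(via_subproperty), len(triples))
```

Both routes return the same 90 authorships, but the generic pattern first matches all 100 related nodes and needs a type check on each, while the subproperty pattern matches 90 directly. The price is the extra subproperty triple stored for every relationship.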
Thank you!