Digital Library Collection Management using HBase
DESCRIPTION
Speaker: Ron Buckley (OCLC)
OCLC has been working over the last year to move its massive repository to HBase. This talk will focus on the impetus behind the move, implementation details and technology choices we've made (key design, shredding PDFs and other digital objects into HBase, scaling), and the value-add that HBase brings to digital collection management.
TRANSCRIPT
The world’s libraries. Connected.
Digital Library Collection Management using HBase
“AKA: A Success Story”
Case Studies
Ron Buckley
HBaseCon
May 5, 2014
About OCLC
Worldwide, member-owned library cooperative
• Based in Dublin, Ohio
• Founded in 1967
• Not-for-profit
WorldCat
• Union catalog of library items from 72,000 libraries in 170 countries
• Over 2 billion records, 2.5 billion location listings
Hosting
• Melvyl, University of California Digital Library (and many others) are hosted directly out of WorldCat
Center of our world
• 15-month project to rebuild data infrastructure with Hadoop at the center.
• Leveraged HBase to build multiple new products.
• Replaced and decommissioned multiple Oracle RAC environments.
Old Meets New
• Dewey Decimal System – OCLC owns and maintains the Dewey Decimal System. The Dewey Decimal System is stored and maintained in HBase.
HBase @ OCLC
Why
• Data set was too big a long time ago – Not long after we built our Oracle database we removed almost all joins and views.
• Too expensive – Making a dataset available for free open-access was going to cost us almost $1 Million, just for storage
• Slow – Couldn’t analyze data set because it took a week just to walk it.
How
• Text index and our own secondary indexing for HBase
• Transition period of about 12 months with both - Multiple tools built and run to find and fix discrepancies.
Moving from Relational to HBase
http://www.worldcat.org/title/HBase-the-definitive-guide/oclc/761693417
HBase Book – from HBase
HBase - Hub of Linked Data
It is imperative that library data be available in new data formats that are native to the web.
• Databases are walked and analyzed frequently
• Many hundreds of millions, soon billions, of interrelated endpoints are stored back to HBase.
• Endpoints are made available through multiple standard formats (RDF, JSON, Turtle, N-Triples) for machine use.
- Tim Berners-Lee
HBase - Hub of Linked Data
http://experiment.worldcat.org/entity/work/data/1151002411.html
“Libraries aren’t just about books”
• OCLC CONTENTdm is used by thousands of libraries to manage local digital content preservation.
• We’re moving over 40 million digital objects (many TBs) into a centrally hosted HBase repository.
HBase as Content Store
• Key – The internal key is MD5-hashed to form the HBase row key.
• PDFs – Compression (Snappy) doesn’t reduce the size of PDF documents.
• 10 MB cell size – Objects over 10 MB are not stored in HBase. We store them in HDFS. (We do store metadata rows for these objects in HBase.)
Digital storage in HBase
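The key and size rules on this slide can be sketched roughly as follows. This is only an illustration, not OCLC's code: the class and method names are hypothetical, the hex encoding of the hash is an assumption (the slide does not say whether raw bytes or hex are used), and "10 MB" is taken here to mean 10 * 1024 * 1024 bytes.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper; names and encoding choices are assumptions for illustration. */
final class ContentStoreKeys {
    /** Assumed reading of the slide's 10 MB limit, in bytes. */
    static final long MAX_CELL_BYTES = 10L * 1024 * 1024;

    /** Where an object's bytes would be written. */
    enum Target { HBASE_CELL, HDFS_FILE }

    private ContentStoreKeys() {}

    /** MD5-hashes the internal key and returns 32 lowercase hex characters. */
    static String rowKey(String internalKey) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(internalKey.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Mask to 0..255 so negative bytes format as two hex digits.
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on the Java platform.
            throw new IllegalStateException(e);
        }
    }

    /** Objects strictly larger than the limit go to HDFS; the rest fit in an HBase cell. */
    static Target targetFor(long objectSizeBytes) {
        return objectSizeBytes > MAX_CELL_BYTES ? Target.HDFS_FILE : Target.HBASE_CELL;
    }
}
```

Hashing spreads sequential internal keys evenly across regions, at the cost of losing range scans in internal-key order.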
Academy of Motion Picture Arts and Sciences. Margaret Herrick Library.
http://collections.oscars.org/prodart/
Illinois Digital Archives (via Illinois State Library)
http://oc.lc/lrzLFr
http://cdm15937.contentdm.oclc.org/cdm/ref/collection/DSDL01/id/46
U.S. Department of State
Stability - Almost 7 months uptime
• CDH 4.3 – April 26, 2014 - 37 Region Servers up for 7 months
Performance – Fast
Performance – Cache Hits Help
• We run hundreds of M/R jobs a day on our user-facing cluster.
• Our cluster is oversized for HBase
• M/R jobs run with limited tasks, niced,…
• Still faster than “the old way”
• Looking forward to multi-tenant features in upcoming releases
M/R and HBase?
- We needed a way to upgrade HBase, without downtime.
- Rolling installs on a 50-Node cluster sounded cumbersome
Upgrading HBase
• HBase Master-Master replication is used to maintain an always-available disaster site.
• We have a middle-tier service layer (like the Thrift server) that knows about both our main cluster and our DR cluster.
• When we shut down the main cluster, the middle tier automatically switches to the disaster site.
• Each cluster runs a web server that exposes its Hadoop config.
• Example: http://HBase-config-perf.ent.oclc.org:9007/HBaseconf/HBase-site.xml
Replication for 0 downtime install
• Instead of relying on hbase-site.xml in the classpath, we load the hbase-site.xml via addResource.
public HBaseManagedConnection(String HBaseSiteUrl, int maxPoolSize) {
    tableCounter = new BlockingCounter(maxPoolSize);
    // Start from the default configuration, then layer the remote site file on top.
    Configuration config = HBaseConfiguration.create();
    try {
        config.addResource(new URL(HBaseSiteUrl));
    } catch (MalformedURLException mue) {
        LOG.error("**** URL to HBase Site is invalid, Unable to connect to HBase: {} *****", HBaseSiteUrl);
    }
    // (remainder of the constructor is not shown on the slide)
}
Replication for 0 downtime install
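How a middle tier might pick which cluster's site-file URL to hand to a constructor like the one above can be sketched as below. This is a minimal illustration under stated assumptions, not OCLC's service layer: the class name, the priority-ordered URL list, and the pluggable reachability check are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

/** Hypothetical selector: returns the first cluster whose config endpoint is reachable. */
final class ClusterFailover {
    private final List<String> siteUrls;          // priority order: main cluster first, then DR
    private final Predicate<String> isReachable;  // health check supplied by the caller

    ClusterFailover(List<String> siteUrlsInPriorityOrder, Predicate<String> isReachable) {
        this.siteUrls = new ArrayList<>(siteUrlsInPriorityOrder);
        this.isReachable = isReachable;
    }

    /** URL of the highest-priority reachable cluster, or empty if none respond. */
    Optional<String> activeSiteUrl() {
        return siteUrls.stream().filter(isReachable).findFirst();
    }
}
```

With the main cluster listed first and the DR cluster second, shutting down the main cluster makes the check fail for it, so the next call returns the DR URL. The check itself (for example, an HTTP GET against the config web server) is left to the caller.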
Summary
• HBase is the center of our world and, by association, of a lot of libraries.
• You can move from relational to HBase.
• We’ve been successful running user-facing traffic alongside Map/Reduce.
• EASY to support. We have two converted Oracle DBAs as our front-line admins. Mostly, they’re lent to MySQL support for other internal systems.
Questions?
Come to Ohio - Our snowballs roll themselves!