Digital Library Collection Management using HBase

The world’s libraries. Connected. Digital Library Collection Management using HBase “AKA: A Success Story” Case Studies Ron Buckley HBaseCon May 5, 2014

Upload: hbasecon

Post on 10-May-2015

DESCRIPTION

Speaker: Ron Buckley (OCLC) OCLC has been working over the last year to move its massive repository to HBase. This talk will focus on the impetus behind the move, implementation details and technology choices we've made (key design, shredding PDFs and other digital objects into HBase, scaling), and the value-add that HBase brings to digital collection management.

TRANSCRIPT

Page 1: Digital Library Collection Management using HBase

The world’s libraries. Connected.

Digital Library Collection Management using HBase “AKA: A Success Story”

Case Studies

Ron Buckley

HBaseCon

May 5, 2014

Page 2: Digital Library Collection Management using HBase

About OCLC

Worldwide, member-owned library cooperative
• Based in Dublin, Ohio
• Founded in 1967
• Not-for-profit

WorldCat
• Union catalog of library items from 72,000 libraries in 170 countries
• Over 2 billion records, 2.5 billion location listings

Hosting
• Melvyl, University of California Digital Library (and many others) are hosted directly out of WorldCat

Page 3: Digital Library Collection Management using HBase

Center of our world
• 15-month project to rebuild data infrastructure with Hadoop at the center.
• Leveraged HBase to build multiple new products.
• Replaced and decommissioned multiple Oracle RAC environments.

Old Meets New
• Dewey Decimal System – OCLC owns and maintains the Dewey Decimal System. The Dewey Decimal System is stored and maintained in HBase.

HBase @ OCLC

Page 4: Digital Library Collection Management using HBase

Why
• Data set was too big a long time ago – Not long after we built our Oracle database we removed almost all joins and views.
• Too expensive – Making a dataset available for free open access was going to cost us almost $1 million, just for storage.
• Slow – Couldn’t analyze the data set because it took a week just to walk it.

How
• Text index and our own secondary indexing for HBase
• Transition period of about 12 months with both – Multiple tools built and run to find and fix discrepancies.

Moving from Relational to HBase
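
The deck does not show the tools used to find discrepancies during the transition period. As a minimal sketch of what such a check could look like, assume each store can be exported as a map from record key to a content checksum. The class name, the enum, and the map-based input are all hypothetical and are not OCLC's actual tooling.

```java
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;

// Hypothetical sketch: compares key -> checksum exports from two stores.
final class DiscrepancySketch {
    enum Kind { MISSING_IN_HBASE, MISSING_IN_RELATIONAL, VALUE_MISMATCH }

    /** Returns one entry per key that differs between the two exports, sorted by key. */
    static Map<String, Kind> compare(Map<String, String> relational, Map<String, String> hbase) {
        Map<String, Kind> report = new TreeMap<>();
        for (Map.Entry<String, String> e : relational.entrySet()) {
            String key = e.getKey();
            if (!hbase.containsKey(key)) {
                report.put(key, Kind.MISSING_IN_HBASE);
            } else if (!Objects.equals(e.getValue(), hbase.get(key))) {
                report.put(key, Kind.VALUE_MISMATCH);
            }
        }
        for (String key : hbase.keySet()) {
            if (!relational.containsKey(key)) {
                report.put(key, Kind.MISSING_IN_RELATIONAL);
            }
        }
        return report;
    }

    public static void main(String[] args) {
        // Made-up keys and checksums, for illustration only.
        Map<String, String> relational = Map.of("a", "1", "b", "2", "c", "3");
        Map<String, String> hbase = Map.of("a", "1", "b", "9", "d", "4");
        System.out.println(compare(relational, hbase));
    }
}
```

A real tool would stream both sides rather than hold them in memory; the in-memory maps only keep the sketch short.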

Page 5: Digital Library Collection Management using HBase

http://www.worldcat.org/title/HBase-the-definitive-guide/oclc/761693417

HBase Book – from HBase

Page 6: Digital Library Collection Management using HBase

HBase - Hub of Linked Data

It is imperative that library data be available in new data formats that are native to the web. - Tim Berners-Lee

• Databases are walked and analyzed frequently

• Many hundreds of millions, soon billions, of interrelated endpoints are stored back to HBase.

• Endpoints are made available in multiple standard formats (RDF, JSON, Turtle, N-Triples) for machine use.
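
To make one of the machine-readable formats concrete, here is a minimal sketch of serializing a single statement as an N-Triples line. The subject IRI and the title are made up, the class and method names are hypothetical, and the deck does not say how OCLC generates its output. The sketch escapes only backslash, double quote, newline and carriage return, and does not validate the IRIs.

```java
// Hypothetical sketch: writes one subject/predicate/literal statement as an N-Triples line.
final class NTriplesSketch {
    /** Escapes the characters that must not appear raw inside an N-Triples string literal. */
    static String escape(String s) {
        return s.replace("\\", "\\\\")   // backslash first, so later escapes are not doubled
                .replace("\"", "\\\"")
                .replace("\n", "\\n")
                .replace("\r", "\\r");
    }

    /** Formats a triple whose object is a plain string literal. IRIs are assumed to be valid. */
    static String literalTriple(String subjectIri, String predicateIri, String literal) {
        return "<" + subjectIri + "> <" + predicateIri + "> \"" + escape(literal) + "\" .";
    }

    public static void main(String[] args) {
        // example.org subject and the title are placeholders, not real WorldCat data.
        System.out.println(literalTriple(
                "http://example.org/work/1",
                "http://schema.org/name",
                "A \"quoted\" title"));
    }
}
```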

Page 7: Digital Library Collection Management using HBase

HBase - Hub of Linked Data

http://experiment.worldcat.org/entity/work/data/1151002411.html

Page 8: Digital Library Collection Management using HBase

“Libraries aren’t just about books”

• OCLC CONTENTdm is used by thousands of libraries to manage local digital content preservation.

• We’re moving over 40 million digital objects (many TBs) into a centrally hosted HBase repository.

HBase as Content Store

Page 9: Digital Library Collection Management using HBase

• Key – Internal key is MD5-hashed into the HBase key.

• PDFs – Compression (Snappy) doesn’t reduce the size of PDF documents.

• 10 MB cell size – Objects over 10 MB are not being stored in HBase. We’re storing them in HDFS. (We do store metadata rows for these objects in HBase.)

Digital storage in HBase
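
The key design and the size cutoff on this slide can be sketched as follows. This is an illustration under stated assumptions, not OCLC's code: the slide does not say whether the row key is the raw digest or its hex form (hex is assumed here), nor whether 10 MB means 10 × 1024 × 1024 bytes (assumed here). The class, enum, and sample key are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch of an MD5-hashed row key and a 10 MB storage cutoff.
final class ContentStoreSketch {
    // Assumption: "10 MB" is taken as 10 * 1024 * 1024 bytes.
    static final long MAX_CELL_BYTES = 10L * 1024 * 1024;

    enum Target { HBASE_CELL, HDFS_WITH_METADATA_ROW }

    /** Hashes the internal key with MD5 and returns a 32-character lowercase hex row key. */
    static String rowKey(String internalKey) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(internalKey.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    /** Objects over the cutoff go to HDFS, with only a metadata row kept in HBase. */
    static Target targetFor(long objectSizeBytes) {
        return objectSizeBytes > MAX_CELL_BYTES ? Target.HDFS_WITH_METADATA_ROW : Target.HBASE_CELL;
    }

    public static void main(String[] args) {
        // "collection-42/object-7" is a made-up internal key.
        System.out.println(rowKey("collection-42/object-7"));
        System.out.println(targetFor(2_000_000L));
        System.out.println(targetFor(50_000_000L));
    }
}
```

Hashing spreads sequential internal keys evenly across regions, at the cost of losing range scans in the original key order.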

Page 10: Digital Library Collection Management using HBase

University of the Pacific

http://oc.lc/bDo9l0

Page 11: Digital Library Collection Management using HBase

Academy of Motion Picture Arts and Sciences. Margaret Herrick Library.

http://collections.oscars.org/prodart/

Page 12: Digital Library Collection Management using HBase

Illinois Digital Archives (via Illinois State Library)

http://oc.lc/lrzLFr

Page 13: Digital Library Collection Management using HBase

http://cdm15937.contentdm.oclc.org/cdm/ref/collection/DSDL01/id/46

U.S. Department of State

Page 14: Digital Library Collection Management using HBase

Stability - Almost 7 months uptime

• CDH 4.3 – April 26, 2014 - 37 Region Servers up for 7 months

Page 15: Digital Library Collection Management using HBase

Performance – Fast

Page 16: Digital Library Collection Management using HBase

Performance – Cache Hits Help

Page 17: Digital Library Collection Management using HBase

• We run hundreds of M/R jobs a day on our user-facing cluster.

• Our cluster is oversized for HBase

• M/R jobs run with limited tasks, niced,…

• Still faster than “the old way”

• Looking forward to multi-tenant features in upcoming releases

M/R and HBase?

Page 18: Digital Library Collection Management using HBase

- We needed a way to upgrade HBase, without downtime.

- Rolling installs on a 50-Node cluster sounded cumbersome

Upgrading HBase

Page 19: Digital Library Collection Management using HBase

• HBase Master-Master replication is used to maintain an always available disaster site.

• We have a middle tier service layer (like the Thrift server) that knows about both our main cluster and our DR cluster.

• When we shut down the main cluster, the middle tier automatically switches to the disaster site.

• Each cluster runs a web server that exposes its Hadoop config.

• Example: http://HBase-config-perf.ent.oclc.org:9007/HBaseconf/HBase-site.xml

Replication for 0 downtime install
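
The switch-over described on this slide can be sketched as a small selector that returns the config URL of whichever cluster should be used. The deck does not say how the middle tier detects that the main cluster is down, so the reachability check is passed in as an assumption. The class name and the example.org URLs are hypothetical.

```java
import java.util.function.Predicate;

// Hypothetical sketch: picks the site-config URL of the main cluster, or the DR cluster as fallback.
final class ClusterSelectorSketch {
    private final String mainSiteUrl;
    private final String drSiteUrl;
    private final Predicate<String> isReachable; // assumption: caller supplies the health check

    ClusterSelectorSketch(String mainSiteUrl, String drSiteUrl, Predicate<String> isReachable) {
        this.mainSiteUrl = mainSiteUrl;
        this.drSiteUrl = drSiteUrl;
        this.isReachable = isReachable;
    }

    /** Returns the main cluster's config URL when it passes the health check, else the DR URL. */
    String activeSiteUrl() {
        return isReachable.test(mainSiteUrl) ? mainSiteUrl : drSiteUrl;
    }

    public static void main(String[] args) {
        // Simulates the main cluster being shut down for an upgrade.
        ClusterSelectorSketch selector = new ClusterSelectorSketch(
                "http://main.example.org/hbase-site.xml",
                "http://dr.example.org/hbase-site.xml",
                url -> false);
        System.out.println(selector.activeSiteUrl());
    }
}
```

With master-master replication keeping both clusters current, the middle tier only has to decide which site file to load; the returned URL would then be handed to the connection code.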

Page 20: Digital Library Collection Management using HBase

• Instead of relying on hbase-site.xml in the classpath, we load the hbase-site.xml from the cluster's config URL via addResource.

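import java.net.MalformedURLException;
import java.net.URL;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public HBaseManagedConnection(String HBaseSiteUrl, int maxPoolSize) {
    tableCounter = new BlockingCounter(maxPoolSize);
    Configuration config = HBaseConfiguration.create();
    try {
        // Layer the site file published by the cluster's web server over the defaults.
        config.addResource(new URL(HBaseSiteUrl));
    } catch (MalformedURLException mue) {
        LOG.error("**** URL to HBase Site is invalid, Unable to connect to HBase: {} *****", HBaseSiteUrl);
    }
    // The slide ends here; the rest of the constructor is not shown.
}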

Replication for 0 downtime install

Page 21: Digital Library Collection Management using HBase

Summary

• HBase is the center of our world and, by association, of a lot of libraries.

• You can move from relational to HBase.

• We’ve been successful running user-facing traffic alongside Map/Reduce.

• EASY to support. We have two converted Oracle DBAs as our front-line admins. Mostly, they’re lent to MySQL support for other internal systems.

Page 22: Digital Library Collection Management using HBase

Questions?

Page 23: Digital Library Collection Management using HBase

Come to Ohio - Our snowballs roll themselves!