The Return of the Hierarchical Model
Jukka Zitting @ Day Software
/Agenda
Part 1: Hierarchy
- Concepts
- Benefits
- Drawbacks
- Examples
Part 2: Case Study
- JCR
- Jackrabbit
- Sling
- Lessons Learned
Questions and comments allowed
/Hierarchy/Concepts
- Every record has a parent record
  - Except the root
  - No cyclical parent relations allowed
  - Referential integrity, but often no other reference types supported
- A name identifies a record within its parent
  - The name is not necessarily unique (XML, DNS, etc.)
  - Path as an identifier: /path/to/record
- Record hierarchy is distinct from type hierarchy
  - Structural flexibility, optionally limited by type constraints
[Figure: example tree of records labeled A through F]
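
The concepts on this slide can be sketched as a small in-memory tree. This is an illustration only: the `Record` class, `move_to` and `resolve` are hypothetical names, and the sketch assumes names are unique within a parent, which the slide notes is not required in general (XML, DNS).

```python
class Record:
    """A record with exactly one parent (None only for the root)."""

    def __init__(self, name="", parent=None):
        self.name = name
        self.parent = parent
        self.children = {}  # child name -> Record (assumes unique names)
        if parent is not None:
            parent.children[name] = self

    def path(self):
        """Build the path identifier by walking up to the root."""
        parts = []
        node = self
        while node.parent is not None:
            parts.append(node.name)
            node = node.parent
        return "/" + "/".join(reversed(parts))

    def move_to(self, new_parent):
        """Re-parent this record, refusing moves that would create a cycle."""
        node = new_parent
        while node is not None:
            if node is self:
                raise ValueError("a record cannot be moved under itself")
            node = node.parent
        del self.parent.children[self.name]
        self.parent = new_parent
        new_parent.children[self.name] = self


def resolve(root, path):
    """Follow a path such as /path/to/record down from the root."""
    node = root
    for name in filter(None, path.split("/")):
        node = node.children[name]
    return node


root = Record()
a = Record("a", root)
c = Record("c", a)
print(c.path())                     # /a/c
print(resolve(root, "/a/c") is c)   # True
```

Moving `a` under `c` raises `ValueError`, which is the "no cyclical parent relations" rule in action.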
/Hierarchy/Benefits
- Natural
  - Data in many domains is inherently hierarchical
  - Easy to understand
- Self-similar
  - Recursive algorithms
  - Incremental map-reduce!
- Scalable
  - Partitioning
  - Parallel processing
- Efficient
  - Highly optimized path-based access and “joins” on the parent-child and subtree relationships
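
One possible reading of "incremental map-reduce" is that each subtree caches its reduced result, so a change only forces recomputation along the path from the changed record to the root. The sketch below assumes that reading; the `Node` class and the choice of a sum as the reduce step are illustrative, not taken from the talk.

```python
class Node:
    """Tree node that caches the reduced value of its whole subtree."""

    def __init__(self, value=0, parent=None):
        self.value = value
        self.parent = parent
        self.children = []
        self._cached = None  # None means "needs recomputing"
        if parent is not None:
            parent.children.append(self)
            parent._invalidate()

    def _invalidate(self):
        # Only this node and its ancestors lose their cached result.
        node = self
        while node is not None:
            node._cached = None
            node = node.parent

    def set_value(self, value):
        self.value = value
        self._invalidate()

    def total(self):
        """Reduce (here: sum) the subtree, reusing cached results of children."""
        if self._cached is None:
            self._cached = self.value + sum(c.total() for c in self.children)
        return self._cached


root = Node(1)
left = Node(2, root)
right = Node(3, root)
leaf = Node(4, left)
print(root.total())   # 10
leaf.set_value(10)    # invalidates leaf, left and root, but not right
print(root.total())   # 16
```

After the update, the cached result for `right` is reused as-is; only the changed branch is recomputed.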
/Hierarchy/Drawbacks
- Limited support for references
  - Graph databases solve this problem, at a cost
  - DAG a partial solution
- Handling of flat structures
  - Chronological: blogs, tweets, email, log entries, etc.
  - Sets: wiki pages, user accounts, etc.
  - Often requires an artificial hierarchy, e.g. /blog/2010/06/entry-for-today
- Standards are domain-specific or limited in scope
  - POSIX, DNS, XPath/XQuery, JCR, etc.
- Difficulty of organizing things
  - Coming up with good names for records is hard
  - Hierarchy requires maintenance
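
As a small illustration of the artificial hierarchy mentioned above, chronological entries can be bucketed by year and month. The `blog_path` helper is a hypothetical name; the path layout follows the /blog/2010/06/entry-for-today example from the slide.

```python
from datetime import date


def blog_path(published, slug):
    """Derive a year/month path for an otherwise flat list of entries."""
    return "/blog/" + published.strftime("%Y/%m") + "/" + slug


print(blog_path(date(2010, 6, 15), "entry-for-today"))
# /blog/2010/06/entry-for-today
```

The year and month levels carry no meaning of their own here; they exist only to keep any single parent from holding every entry.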
/Hierarchy/Examples
- File system
- DNS
- LDAP
- XML
- WebDAV
- RDBMS
/Hierarchy/Examples/File System
- Universally available
- Two main types: files and folders
  - Notable extensions: /dev/* and /proc/*
  - Unix philosophy: Everything is a file!
- Heavily optimized for specific use cases
- Limited support for fine-grained data
  - Some systems support things like extended attributes
- Built-in access controls, but usually no query support
- Major limitations in distributed solutions
  - SAN and NAS solutions reasonably efficient but limited in scope
  - Truly distributed systems like HDFS applicable only for limited use cases
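
The contrast between path-based access and the missing query support can be sketched with the Python standard library. The directory layout and file contents below are made up for the example.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "docs" / "2010").mkdir(parents=True)
    (root / "docs" / "2010" / "notes.txt").write_text("hierarchical model")
    (root / "docs" / "readme.txt").write_text("flat file")

    # Path-based access: the file system resolves the path directly.
    direct = (root / "docs" / "2010" / "notes.txt").read_text()

    # Content-based "query": without an index, every file has to be visited.
    matches = sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.is_file() and "hierarchical" in p.read_text()
    )

print(direct)    # hierarchical model
print(matches)   # ['docs/2010/notes.txt']
```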
/Hierarchy/Examples/DNS
- Globally distributed, heterogeneous, eventually consistent
- In production since 1983!
- Standardized query and update protocols
- Domain-specific, highly optimized for scalability
- Multiple records can have the same name
- Fine-grained record types: A, NS, MX, TXT, AAAA, etc.
- Security issues, both in design and implementations
  - Not much impact in practice
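
To make the hierarchy in DNS explicit, a domain name can be read right to left as a path from the root. The sketch below is a toy model with made-up zone data (documentation-reserved names and addresses); `domain_to_path` and `lookup` are hypothetical helpers, not part of any DNS library.

```python
def domain_to_path(name):
    """Map a DNS name to a root-first path; labels are read right to left."""
    labels = [label for label in name.rstrip(".").split(".") if label]
    return "/" + "/".join(reversed(labels))


# Hypothetical zone data: one name can hold several records, even of one type.
zone = {
    "/com/example": [
        ("NS", "ns1.example.com."),
        ("NS", "ns2.example.com."),
        ("MX", "10 mail.example.com."),
    ],
    "/com/example/www": [("A", "192.0.2.10")],
}


def lookup(name, record_type):
    """Return all values of the given type stored under a name."""
    records = zone.get(domain_to_path(name), [])
    return [value for rtype, value in records if rtype == record_type]


print(domain_to_path("www.example.com."))   # /com/example/www
print(lookup("example.com", "NS"))          # ['ns1.example.com.', 'ns2.example.com.']
```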
/Hierarchy/Examples/LDAP
- Protocol for accessing X.500-style directories
- Record names are constructed from selected properties
  - dn: cn=John Doe, dc=example, dc=com
- Record types defined by extensible schemas
- Limited form of record references
- Fairly powerful search
  - Though no aggregate queries or arbitrary joins
- Optimized for fine-grained data that is mostly read
- Replication and distributed use widely supported
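
A distinguished name lists its components from the record up to the root, so reversing it gives a root-first path. The `dn_to_path` helper below is a hypothetical, simplified sketch: it assumes no escaped commas inside values, which real DNs may contain.

```python
def dn_to_path(dn):
    """Turn a distinguished name into a root-first path of name components."""
    # Simplifying assumption: no escaped commas inside attribute values.
    rdns = [part.strip() for part in dn.split(",")]
    return "/" + "/".join(reversed(rdns))


print(dn_to_path("cn=John Doe, dc=example, dc=com"))
# /dc=com/dc=example/cn=John Doe
```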
/Hierarchy/Examples/XML
- Data storage based on the XML DOM
- Various levels of conformance
- Highly buzzword compliant in the early 2000s
  - Few of the XML database products are still in active use
- Inefficient handling of binary data (at all granularities)
- Powerful query and transformation tooling
  - XPath, XQuery, XSLT, etc.
  - Many implementations not optimized for performance
- Optional type constraints with XML Schema, etc.
- The result? XML extensions in SQL
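
A path-based query over an XML tree can be sketched with Python's standard `xml.etree.ElementTree`, which supports a limited XPath subset. The document content is made up; note that sibling elements share the name `entry`, so the name alone does not identify a record.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<blog>"
    "<entry year='2010'><title>First</title></entry>"
    "<entry year='2010'><title>Second</title></entry>"
    "<entry year='2009'><title>Older</title></entry>"
    "</blog>"
)

# Select child entries by attribute value, then read a child of each match.
titles = [e.findtext("title") for e in doc.findall("./entry[@year='2010']")]
print(titles)   # ['First', 'Second']
```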
/Hierarchy/Examples/WebDAV
- Extends HTTP with concepts of collections and properties
- Also: locking, versioning, search, etc.
- Often used (only) for HTTP-based access to a file system
  - Also leveraged by fs-like systems like Subversion
- Limited XML-based query with PROPFIND
  - More query power with DASL
- Somewhat heavy-weight for fine-grained access
- Fragmented and often incompatible implementations
  - File system backend as the lowest common denominator
  - cf. AtomPub
/Hierarchy/Examples/RDBMS
- Various ways of representing hierarchies in RDBM systems
  - Adjacency model: Each row has a reference to the parent
  - Nested sets: Rows numbered in depth-first traversal order
  - etc.
- Little structural flexibility
- Expensive parent-child or subtree joins
  - Vendor-specific extensions to address this problem
- Two words: Impedance mismatch
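
The two representations named above can be compared in a small sketch using Python's built-in `sqlite3` module. The table layout, column names and sample tree are assumptions made for the example; the recursive query requires an SQLite version with `WITH RECURSIVE` support.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE node (id INTEGER PRIMARY KEY, parent INTEGER, name TEXT,"
    " lft INTEGER, rgt INTEGER)"
)
db.executemany(
    "INSERT INTO node (id, parent, name) VALUES (?, ?, ?)",
    [(1, None, ""), (2, 1, "a"), (3, 1, "b"), (4, 2, "c"), (5, 2, "d"), (6, 4, "f")],
)

# Adjacency model: each row points at its parent, so a subtree needs recursion.
adjacency = [row[0] for row in db.execute(
    """
    WITH RECURSIVE sub(id, name) AS (
        SELECT id, name FROM node WHERE id = ?
        UNION ALL
        SELECT node.id, node.name FROM node JOIN sub ON node.parent = sub.id
    )
    SELECT name FROM sub ORDER BY name
    """,
    (2,),
)]


def number(node_id, counter=1):
    """Nested sets: assign left/right numbers in depth-first traversal order."""
    left = counter
    counter += 1
    children = db.execute(
        "SELECT id FROM node WHERE parent = ? ORDER BY id", (node_id,)
    ).fetchall()
    for (child_id,) in children:
        counter = number(child_id, counter)
    db.execute("UPDATE node SET lft = ?, rgt = ? WHERE id = ?", (left, counter, node_id))
    return counter + 1


number(1)
lft, rgt = db.execute("SELECT lft, rgt FROM node WHERE id = 2").fetchone()

# With nested sets a subtree is a plain range scan, at the cost of renumbering on writes.
nested = [row[0] for row in db.execute(
    "SELECT name FROM node WHERE lft BETWEEN ? AND ? ORDER BY lft", (lft, rgt)
)]

print(adjacency)   # ['a', 'c', 'd', 'f']
print(nested)      # ['a', 'c', 'f', 'd']
```

Both queries return the same subtree; they differ in ordering and in where the cost lands (read-time recursion versus write-time renumbering).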
/Hierarchy/Summary
- Data storage/management using an explicit tree hierarchy
- Natural mapping, nice non-functional characteristics
- Limited functionality, lack of generic standards
- Widely used, but in domain-specific ways
  - Extremely efficient/scalable for certain data models
- How about a generic, feature-rich hierarchical database?
/Case/JCR
- Content Repository for Java Technology API (JCR)
  - JCR 1.0 out in 2005, specified in JSR 170
  - JCR 2.0 out in 2009, specified in JSR 283
  - Work on JCR 2.1 starting
- A content repository is a hierarchical content store with full text search, observation, versioning, transactions, etc.
  - JCR 2.0 adds retention, type management, join queries, etc.
- Designed for both structured and unstructured content
  - Handling of both finely and coarsely grained data
- Application platform more than an integration API
/Case/Jackrabbit
- Reference implementation of both JCR 1.0 and 2.0
- Primary focus on feature-completeness
- Apache incubator since 2004, TLP since 2006
- Internal storage through an abstracted key-value API
  - Tree model implemented on top of that
  - Lucene search index maintained separately
  - Separate journal for cluster deployments
- Advanced WebDAV support
- Jackrabbit 3: Focus on scalability, modularity
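
The idea of a tree model layered on a key-value API can be sketched as follows. This is not Jackrabbit's actual storage format or API: the dictionary stands in for the key-value layer, and `create_node`, `resolve` and the record layout are hypothetical.

```python
import uuid

store = {}  # stand-in for the abstract key-value layer: node id -> node record


def create_node(parent_id, name):
    """Store a node under a generated key and link it from its parent."""
    node_id = str(uuid.uuid4())
    store[node_id] = {"parent": parent_id, "children": {}, "properties": {}}
    if parent_id is not None:
        store[parent_id]["children"][name] = node_id
    return node_id


def resolve(root_id, path):
    """Resolve a path with one key-value lookup per path segment."""
    node_id = root_id
    for name in filter(None, path.split("/")):
        node_id = store[node_id]["children"][name]
    return node_id


root_id = create_node(None, "")
content_id = create_node(root_id, "content")
page_id = create_node(content_id, "page")
store[page_id]["properties"]["title"] = "Hello"

print(store[resolve(root_id, "/content/page")]["properties"]["title"])   # Hello
```

In this layout the key-value layer knows nothing about paths; the hierarchy exists only in the child links stored inside each value.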
/Case/Sling
- Web framework based on the JCR content model
- Apache incubator since 2007, TLP since 2009
- Intuitive URL mapping
  - Path selects the underlying content resource
  - Optional selectors and extensions determine representation
- JSON and POST servlets with JavaScript support
- OSGi for server-side modularity
- Everything is content
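
The URL mapping described above can be approximated with a short sketch. It is a simplification, not Sling's implementation: the `decompose` function and the set of known resource paths are assumptions, and anything after the extension (such as a trailing suffix) is not handled.

```python
def decompose(url_path, resources):
    """Split a request path into (resource path, selectors, extension)."""
    candidate = url_path
    while candidate:
        rest = url_path[len(candidate):]
        # The longest known resource path wins; the remainder must start with a dot.
        if candidate in resources and (rest == "" or rest.startswith(".")):
            parts = rest.split(".")[1:] if rest else []
            extension = parts[-1] if parts else None
            return candidate, parts[:-1], extension
        # Shorten the candidate at the last "." or "/" and try again.
        cut = max(candidate.rfind("."), candidate.rfind("/"))
        candidate = candidate[:cut]
    return None, [], None


resources = {"/content/blog/entry"}   # hypothetical repository content
print(decompose("/content/blog/entry.print.a4.html", resources))
# ('/content/blog/entry', ['print', 'a4'], 'html')
```

The path picks the content resource, while the selectors and extension are left over for choosing how to render it.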
/Case/Lessons Learned
- Content-driven development
- Data first, structure later
- Distribute for redundancy
  - Modern hardware goes a long way for scalability/performance
  - For small/medium deployments, distribution is more important for fault-tolerance, especially in cloud environments
- Relationships are important
  - JCR 2.0 is a DAG, plus references for expressing full graphs
  - Referential integrity not so important
- Notable data sets are flat
- Don’t forget tool support for ad-hoc tasks!
/Questions?
http://jackrabbit.apache.org/
http://sling.apache.org/
http://www.day.com/jsr283