MarkLogic as a Real-time Data Hub · 2018-09-29 · Use case: central location to share enterprise...
TRANSCRIPT
About the Author
Michael Bowers
• Principal Engineer, Data Architect
• Using NoSQL professionally for 10 years
• Author
– Pro CSS and HTML Design Patterns
– Pro HTML5 and CSS3 Design Patterns
Church of Jesus Christ of Latter-day Saints
• Hundreds of NoSQL servers running 190+ applications and websites with billions of page views annually
• 15.9 million members (30,304 congregations worldwide)
• 10,238 Humanitarian Service Missionaries in 189 countries
• Thousands of documents in 188 published languages
https://www.mormonnewsroom.org/facts-and-statistics
What is a Data Hub?
“A data hub is…populated with data
from one or more sources and from which data is taken to one or more destinations.”
Why use a Data Hub?
“The more that data is…an enterprise resource that needs to be shared…
the more likely…data hubs will appear…
Data can indeed be shared…via point-to-point interfaces between pairs of applications,
and this is a valid architectural alternative…[and] is much simpler to implement than any hub,
and it seems that all enterprises now have gigantic spider webs of such interfaces that have grown up over the years.”
Malcolm Chisholm Ph.D.
Why use a Real-time Data Hub?
To replace the gigantic spider webs of real-time APIs
• SOAP
• REST
• Database Links
Why is a Data Hub better?
“In my experience (Malcolm Chisholm Ph.D.), point-to-points have the following issues (amongst others):
• They often needlessly replicate movement of the same data.
• They typically have poor controls around them, and minimal governance.
• They are typically difficult to modify.
• They are typically poorly documented and understood.
• They tend to promote coupling and fusion of applications into a giant monolithic enterprise silo that is very difficult to evolve in line with business changes.
• They rely on the application pair involved to do things like data integration and transaction integration.”
A data hub untangles the spider web that comes with point-to-point solutions
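A rough count shows how quickly the spider web grows. The sketch below assumes the worst case, in which every pair of applications that shares data needs its own interface; that assumption and the function names are illustrative and do not come from the talk.

```python
def point_to_point_interfaces(n_apps: int) -> int:
    """Worst case: one interface per unordered pair of applications."""
    return n_apps * (n_apps - 1) // 2


def hub_interfaces(n_apps: int) -> int:
    """With a hub, each application needs one connection, to the hub."""
    return n_apps


if __name__ == "__main__":
    # Print: number of apps, worst-case point-to-point count, hub count.
    for n in (6, 20, 190):
        print(n, point_to_point_interfaces(n), hub_interfaces(n))
```

Point-to-point interfaces grow quadratically with the number of applications, while hub connections grow linearly.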
Dr. Chisholm's Six Flavors of Data Hub
1. Publish Subscribe Hub (Real-time or Batch) [diagram: Cache; Apps 1-4]
2. Operational Data Hub for Integrated Reporting [diagram: ODS; Apps 1-3; reports]
3. Operational Data Hub for Data Warehouse (Batch) [diagram: ODS; Apps 1-3; DW]
4. Master Data Management Hub (Real-time or Batch) [diagram: MDM; Apps 1-6]
5. Message Hub (Real-time) [diagram: ESB; Apps 1-6]
6. Integration Hub (Real-time or Batch) [diagram: DB; Apps 1-6; DW]
Where is MarkLogic often used as a Data Hub?
[Diagram: the six flavors of data hub from the previous slide, repeated]
A Real-time Data Hub is an Advanced Cache
• Engineers who love REST, Messaging, and Service Buses
  – Can be strongly opposed to the idea of a "Data Hub"
    • They believe it is an anti-pattern
      – to replicate data
      – to use database technologies for integration, because databases are below the API layer that contains business rules
  – Like the idea of intelligently caching data delivered by REST endpoints
    • It is a best practice
      – to cache data for performance, scale, and availability
      – to use REST APIs to share data after applying business rules
How can we use MarkLogic as a Real-time Data Hub?
[Diagram: three cache placements between Sources and Consumers]
• Source Cache to Scale/Filter REST API Reads
• Central Cache to Share Enterprise Data
• Consuming Cache to Combine REST Data
Use Case: Source caches its shared data to offload delivery
MarkLogic Advantages
1. Sources write REST data as is
2. Consumers read REST data as desired
3. Built-in, standard, powerful REST API
4. Scale out performance & availability
5. Fully decouple consuming applications
• from non-standard REST
• from sources
• from data centers
6. Shard data for data sovereignty
7. Low cost and quick to deploy, maintain
8. Queries
9. Joins
10. Security Filters
MarkLogic Considerations
1. Developer centric
2. Needs more integrations with message queues and relational databases
Publish-Subscribe Hub
[Diagram: One Source App, Publish-Subscribe Hub, Many Consuming Apps]
Source app uses Pub-Sub Data Hub Pattern to publish data to many consuming apps across many channels
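As a minimal sketch of "sources write REST data as is" and "consumers read REST data as desired", the code below builds requests against the `/v1/documents` endpoint of MarkLogic's REST Client API. The host, port, and document URI are hypothetical, authentication is omitted, and the requests are only constructed, not sent.

```python
import json
import urllib.parse
import urllib.request

# Assumed for illustration: the host and port of a MarkLogic REST instance.
BASE_URL = "http://localhost:8000"


def document_url(uri: str) -> str:
    """Build the /v1/documents URL for one document URI."""
    return f"{BASE_URL}/v1/documents?" + urllib.parse.urlencode({"uri": uri})


def publish_request(uri: str, payload: dict) -> urllib.request.Request:
    """Source side: PUT the REST payload into the hub unchanged."""
    return urllib.request.Request(
        document_url(uri),
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def read_request(uri: str) -> urllib.request.Request:
    """Consumer side: GET the document from the hub, not from the source app."""
    return urllib.request.Request(
        document_url(uri),
        headers={"Accept": "application/json"},
        method="GET",
    )


if __name__ == "__main__":
    req = publish_request("/orders/42.json", {"id": 42, "status": "shipped"})
    print(req.get_method(), req.full_url)
```

In this arrangement the consumer's reads never reach the source application, which is how the cache offloads delivery.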
Use Case: System integrates data from many sources
MarkLogic Advantages
1. Load data as is
2. Data transforms
3. Data quality
4. Data validation
5. Data harmonization
6. Data lineage and provenance
7. Data discovery: queries, joins, search
8. Get data from remote data centers as if it were local
9. Get data from common integration channels: REST, Queue, Database, File
10. Temporal tracking
11. Semantic views
12. Ontologies
13. Real-time and batch
14. Integrated with Java and Node.js
15. Scale out
Consuming app uses Integration Data Hub Pattern to collect and harmonize data from many sources across many channels
[Diagram: Many Source Apps, Integration Hub, One Consuming App]
MarkLogic Considerations
1. Developer centric
2. Needs more integrations with message queues and relational databases
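The "load data as is", "data harmonization", and "data lineage" items can be illustrated with a small sketch. It assumes an envelope-style layout (headers, instance, attachments) and two hypothetical sources with made-up field names; none of these details come from the slides.

```python
from typing import Any

# Hypothetical per-source mappings: source field name -> canonical field name.
FIELD_MAPS: dict[str, dict[str, str]] = {
    "crm": {"first": "givenName", "last": "familyName"},
    "billing": {"fname": "givenName", "lname": "familyName"},
}


def harmonize(source: str, raw: dict[str, Any]) -> dict[str, Any]:
    """Wrap a record loaded as is with a harmonized view and its provenance."""
    mapping = FIELD_MAPS[source]
    # Keep only the fields that have a canonical name, renamed accordingly.
    canonical = {mapping[key]: value for key, value in raw.items() if key in mapping}
    return {
        "headers": {"source": source},  # lineage: where the record came from
        "instance": canonical,          # harmonized fields, queryable together
        "attachments": raw,             # original record, kept unchanged
    }


if __name__ == "__main__":
    print(harmonize("crm", {"first": "Ada", "last": "Lovelace", "tier": "gold"}))
    print(harmonize("billing", {"fname": "Ada", "lname": "Lovelace"}))
```

Because the original record is kept alongside the harmonized fields, a consuming app can query across sources by canonical name and still trace each value back to its source.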
Use Case: Central location to share enterprise data
Combines the strengths of the previous two use cases for publishing and integrating data
Allows data from many sources to be integrated and queried together
Centralizes the work of integrating data
[Diagram: Many Source Apps, Central Integration Hub, Many Consuming Apps]
Enterprise creates Integration Data Hub to collect and harmonize data from many sources for many consuming systems
Global Real-time Data Hub
[Diagram: data hubs in the US, Europe, and Africa with synced data]
Data Hubs replicate some or all of their data around the world for local, highly available, near real-time access
Data Hub Conceptual Architecture
1. Real-time and Batch
2. REST first
3. Doc/Graph Queries
4. Doc/Graph Joins
5. Security Filters
6. Data quality
7. Data transforms
8. Low cost to maintain
9. Quick to provision
10. Build once and reuse
11. Multiple data centers
12. Data Sovereignty
13. Integrate IDs
14. Be a data source
15. Backwards Compatibility
16. Support REST with JSON/XML
17. Support Message Queues
18. Support Relational DBs
19. Support File transfers
20. Real-time & scheduled data
[Diagram]
Sources (publishing shared data): Files, Relational DB, Message Q, REST, 3rd Party Tools
Hub: Queryable REST Cache
Consumers (getting read-only filtered data): Files, Relational DB, Message Q, REST / Apps, 3rd Party Tools
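The idea of consumers "getting read-only filtered data" can be sketched conceptually as follows. The role names and document layout are hypothetical, and in a real hub this filtering would be enforced by the database's security model rather than by application code like this.

```python
from types import MappingProxyType
from typing import Any, Iterable, Iterator, Mapping


def filtered_view(
    docs: Iterable[dict[str, Any]], consumer_roles: set[str]
) -> Iterator[Mapping[str, Any]]:
    """Yield read-only views of the documents the consumer is allowed to read."""
    for doc in docs:
        # Security filter: the consumer needs at least one of the read roles.
        if doc["read_roles"] & consumer_roles:
            # MappingProxyType exposes the content without allowing writes.
            yield MappingProxyType(doc["content"])


if __name__ == "__main__":
    published = [
        {"read_roles": {"hr"}, "content": {"id": 1}},
        {"read_roles": {"finance"}, "content": {"id": 2}},
    ]
    for view in filtered_view(published, {"hr"}):
        print(dict(view))
```

Each consumer sees only the documents its roles permit, and any attempt to assign into a returned view raises a `TypeError`.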
MarkLogic Data Hub Physical Architecture
[Diagram]
Center: MarkLogic Data Hub with Work Area, Published Area, and DSA Filter
SOURCES and CONSUMERS sides, each showing: Databases, REST APIs & Apps, Bulk Files, EMX: API Gateway, EMX ESB, EMX: Sterling File Gateway, MuleSoft, Data Sync
Connection labels: REST (push or pull), REST (push), ODBC (mostly push), ODBC (mostly pull), JMS Q (push), JMS Q (mostly push), File Import (pull), File Export (push), File Transfer (push)
Legend: Currently Building, Currently Enhancing, MarkLogic