MarkLogic as a Real-Time Data Hub Prepared by Mike Bowers 5/9/2018 version 1.3



About the Author

Michael Bowers
• Principal Engineer, Data Architect
• Using NoSQL professionally for 10 years
• Author
  – Pro CSS and HTML Design Patterns
  – Pro HTML5 and CSS3 Design Patterns


Church of Jesus Christ of Latter-day Saints

• Hundreds of NoSQL servers running 190+ applications and websites with billions of page views annually
• 15.9 million members (30,304 congregations worldwide)
• 10,238 Humanitarian Service Missionaries in 189 countries
• Thousands of documents in 188 published languages

https://www.mormonnewsroom.org/facts-and-statistics

What is a Data Hub?

“A data hub is…populated with data from one or more sources and from which data is taken to one or more destinations.”

Why use a Data Hub?

“The more that data is…an enterprise resource that needs to be shared…the more likely…data hubs will appear…

Data can indeed be shared…via point-to-point interfaces between pairs of applications, and this is a valid architectural alternative…[and] is much simpler to implement than any hub, and it seems that all enterprises now have gigantic spider webs of such interfaces that have grown up over the years.”

Malcolm Chisholm Ph.D.

Why use a Real-time Data Hub?

To replace the gigantic spider webs of real-time APIs

• SOAP
• REST
• Database Links

Why is a Data Hub better?

“In my experience (Malcolm Chisholm Ph.D.), point-to-points have the following issues (amongst others):

• They often needlessly replicate movement of the same data.

• They typically have poor controls around them, and minimal governance.

• They are typically difficult to modify.

• They are typically poorly documented and understood.

• They tend to promote coupling and fusion of applications into a giant monolithic enterprise silo that is very difficult to evolve in line with business changes.

• They rely on the application pair involved to do things like data integration and transaction integration.”

A data hub untangles the spider web that comes with point-to-point solutions
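The "spider web" can be put in numbers with a little arithmetic. The sketch below assumes the worst case, where every pair of applications ends up with its own interface, and compares that with one interface per application to a hub. The function names are illustrative only, and real estates rarely connect every pair.

```python
def point_to_point_links(n_apps: int) -> int:
    """Worst case: every pair of applications has its own interface."""
    return n_apps * (n_apps - 1) // 2


def hub_links(n_apps: int) -> int:
    """With a hub, each application keeps a single interface to the hub."""
    return n_apps


if __name__ == "__main__":
    for n in (6, 20, 190):
        print(n, point_to_point_links(n), hub_links(n))
    # 6 apps:   15 pairwise interfaces vs 6 hub interfaces
    # 20 apps:  190 vs 20
    # 190 apps: 17955 vs 190
```

The pairwise count grows with the square of the number of applications, while the hub count grows linearly.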

Dr. Chisholm's Six Flavors of Data Hub

1. Publish-Subscribe Hub (Real-time or Batch). Diagram components: Cache, App 1 to App 4
2. Operational Data Hub for Integrated Reporting. Diagram components: ODS, reports, App 1 to App 3
3. Operational Data Hub for Data Warehouse (Batch). Diagram components: ODS, DW, App 1 to App 3
4. Master Data Management Hub (Real-time or Batch). Diagram components: MDM, App 1 to App 6
5. Message Hub (Real-time). Diagram components: ESB, App 1 to App 6
6. Integration Hub (Real-time or Batch). Diagram components: DB, DW, App 1 to App 6

Where is MarkLogic often used as a Data Hub?

[Diagram: repeats the six data hub flavors: 1. Publish-Subscribe Hub, 2. Operational Data Hub for Integrated Reporting, 3. Operational Data Hub for Data Warehouse, 4. Master Data Management Hub, 5. Message Hub, 6. Integration Hub]

A Real-time Data Hub is an Advanced Cache

• Engineers who love REST, Messaging, and Service Buses
  – Can be strongly opposed to the idea of a "Data Hub"
    • They believe it is an anti-pattern
      – to replicate data
      – to use database technologies for integration, because databases are below the API layer that contains business rules
  – Like the idea of intelligently caching data delivered by REST endpoints
    • It is a best practice
      – to cache data for performance, scale, and availability
      – to use REST APIs to share data after applying business rules
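The plain caching half of this argument can be sketched as a read-through cache in front of a REST endpoint. This is only a minimal illustration: `ReadThroughCache`, `fetch`, and the TTL policy are assumptions made for the example, not anything from the slides, and a data hub adds queries, joins, and security filters on top of this idea.

```python
import time
from typing import Callable, Dict, Tuple


class ReadThroughCache:
    """Serve repeated reads locally; call the source only on a miss or expiry."""

    def __init__(self, fetch: Callable[[str], dict], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch          # hypothetical call to the source's REST endpoint
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, dict]] = {}

    def get(self, key: str) -> dict:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]          # fresh copy: the source is not contacted
        value = self._fetch(key)     # miss or stale: go back to the source
        self._entries[key] = (now, value)
        return value


if __name__ == "__main__":
    calls = []

    def fake_source(key: str) -> dict:
        # Stand-in for a REST call; records how often the source is hit.
        calls.append(key)
        return {"id": key}

    cache = ReadThroughCache(fake_source, ttl_seconds=60)
    cache.get("unit-1")
    cache.get("unit-1")
    print(len(calls))  # 1: the second read was served from the cache
```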

How can we use MarkLogic as a Real-time Data Hub?

[Diagram: three patterns, each drawn between Sources and Consumers]

• Source Cache to Scale/Filter REST API's Reads
• Central Cache to Share Enterprise Data
• Consuming Cache to Combine REST Data

Use Case: Source caches its shared data to offload delivery

MarkLogic Advantages
1. Sources write REST data as is
2. Consumers read REST data as desired
3. Built-in, standard, powerful REST API
4. Scale out performance & availability
5. Fully decouple consuming applications
  • from non-standard REST
  • from sources
  • from data centers
6. Shard data for data sovereignty
7. Low cost and quick to deploy, maintain
8. Queries
9. Joins
10. Security Filters

MarkLogic Considerations
1. Developer centric
2. Needs more integrations with message queues and relational databases

Publish-Subscribe Hub

[Diagram: One Source App, Publish-Subscribe Hub, Many Consuming Apps]

Source app uses Pub-Sub Data Hub Pattern to publish data to many consuming apps across many channels
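The publish-subscribe pattern on this slide can be sketched in memory: one source writes documents as is, and each consumer reads only what its role may see and what its query selects. This is an illustration of the pattern only, not MarkLogic code; `PublishSubscribeHub`, the `readable_by` field, and the role names are all hypothetical.

```python
from typing import Callable, Dict, List

Document = Dict[str, object]


class PublishSubscribeHub:
    """In-memory stand-in for a hub that one source fills and many apps read."""

    def __init__(self) -> None:
        self._docs: Dict[str, Document] = {}

    def publish(self, uri: str, doc: Document) -> None:
        """Source side: store the document as is, keyed by its URI."""
        self._docs[uri] = doc

    def read(self, role: str, query: Callable[[Document], bool]) -> List[Document]:
        """Consumer side: apply the security filter first, then the query."""
        return [
            doc for doc in self._docs.values()
            if role in doc.get("readable_by", []) and query(doc)
        ]


if __name__ == "__main__":
    hub = PublishSubscribeHub()
    hub.publish("/units/1.json",
                {"unit": "A", "country": "US",
                 "readable_by": ["directory-app", "reporting-app"]})
    hub.publish("/units/2.json",
                {"unit": "B", "country": "GH",
                 "readable_by": ["reporting-app"]})

    # Each consumer asks its own question of the same published data.
    print(hub.read("directory-app", lambda d: True))
    print(hub.read("reporting-app", lambda d: d["country"] == "GH"))
```

The source never learns who reads the data or how, which is the decoupling the advantages list describes.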

Use Case: System integrates data from many sources

MarkLogic Advantages
1. Load data as is
2. Data transforms
3. Data quality
4. Data validation
5. Data harmonization
6. Data lineage and provenance
7. Data discovery: queries, joins, search
8. Get data from remote data centers as if it were local
9. Get data from common integration channels: REST, Queue, Database, File
10. Temporal tracking
11. Semantic views
12. Ontologies
13. Real-time and batch
14. Integrated with Java and Node.js
15. Scale out

MarkLogic Considerations
1. Developer centric
2. Needs more integrations with message queues and relational databases

Integration Hub

[Diagram: Many Source Apps, Integration Hub, One Consuming App]

Consuming app uses Integration Data Hub Pattern to collect and harmonize data from many sources across many channels
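"Load data as is", "data harmonization", and "data lineage" can be illustrated together with a small sketch: each source record is kept unchanged, a per-source field map produces a canonical view, and the source name is recorded alongside. The function `harmonize`, the field names, and the source names below are assumptions for the example and are not taken from the slides or from any MarkLogic API.

```python
from datetime import datetime, timezone
from typing import Dict


def harmonize(source: str, raw: Dict[str, str],
              field_map: Dict[str, str]) -> Dict[str, object]:
    """Wrap a raw record with a canonical view and basic lineage."""
    canonical = {
        target: raw[src].strip()
        for src, target in field_map.items()
        if src in raw
    }
    return {
        "canonical": canonical,   # harmonized field names and trimmed values
        "raw": raw,               # the record exactly as it was loaded
        "lineage": {
            "source": source,
            "harmonized_at": datetime.now(timezone.utc).isoformat(),
        },
    }


if __name__ == "__main__":
    # Two hypothetical sources describing the same person with different field names.
    hr = harmonize("hr-system", {"FNAME": " Ada ", "LNAME": "Lovelace"},
                   {"FNAME": "given_name", "LNAME": "family_name"})
    crm = harmonize("crm", {"first": "Ada", "last": "Lovelace"},
                    {"first": "given_name", "last": "family_name"})
    print(hr["canonical"] == crm["canonical"])  # True
```

Keeping the raw record next to the canonical one means a consumer can always trace a harmonized value back to what the source actually sent.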

Use Case: Central location to share enterprise data

Combines the strengths of the previous two use cases for publishing and integrating data

Allows data from many sources to be integrated and queried together

Centralizes the work of integrating data

[Diagram: Many Source Apps, Central Integration Hub, Many Consuming Apps]

Enterprise creates Integration Data Hub to collect and harmonize data from many sources for many consuming systems

Global Real-time Data Hub

[Diagram: US, Europe, Africa; Synced Data]

Data Hubs replicate some or all of their data around the world for local, highly available, near real-time access
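Replicating "some or all" of the data implies a per-document decision about which regions may hold a copy. The sketch below shows one way such a selection could look, assuming each document carries a hypothetical `allowed_regions` list and that a document without one stays in its home region. None of these names come from the slides.

```python
from typing import Dict, List

Document = Dict[str, object]


def documents_to_replicate(docs: List[Document], target_region: str) -> List[Document]:
    """Select the documents whose policy allows a copy in target_region.

    Documents with no 'allowed_regions' entry are not replicated (default deny).
    """
    return [
        doc for doc in docs
        if target_region in doc.get("allowed_regions", ())
    ]


if __name__ == "__main__":
    docs = [
        {"uri": "/news/1.json", "allowed_regions": ["US", "Europe", "Africa"]},
        {"uri": "/members/eu-7.json", "allowed_regions": ["Europe"]},
        {"uri": "/drafts/3.json"},  # no policy: stays where it was written
    ]
    for region in ("US", "Europe", "Africa"):
        print(region, [d["uri"] for d in documents_to_replicate(docs, region)])
```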

Data Hub Conceptual Architecture

1. Real-time and Batch
2. REST first
3. Doc/Graph Queries
4. Doc/Graph Joins
5. Security Filters
6. Data quality
7. Data transforms
8. Low cost to maintain
9. Quick to provision
10. Build once and reuse
11. Multiple data centers
12. Data Sovereignty
13. Integrate IDs
14. Be a data source
15. Backwards Compatibility
16. Support REST with JSON/XML
17. Support Message Queues
18. Support Relational DBs
19. Support File transfers
20. Real-time & scheduled data

[Diagram labels: Sources (Publishing Shared Data), Queryable REST Cache, Consumers (Getting Read-only Filtered Data). Channels shown: Files, Relational DB, Message Q, REST, REST / Apps, 3rd Party Tools]
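"REST first" can be made concrete by looking at what the requests might look like. The sketch below only builds URLs and sends nothing. It assumes the MarkLogic REST Client API endpoints `/v1/documents` (with `uri` and `collection` parameters) and `/v1/search` (with `q` and `format` parameters); the host, port, document URI, and collection name are placeholders to adapt and verify against your own installation.

```python
from typing import Optional
from urllib.parse import urlencode

BASE_URL = "http://localhost:8000"  # placeholder host and port, not from the slides


def document_url(uri: str, collection: Optional[str] = None) -> str:
    """URL a source could PUT a document to, or a consumer could GET it from."""
    params = [("uri", uri)]
    if collection is not None:
        params.append(("collection", collection))
    return f"{BASE_URL}/v1/documents?{urlencode(params)}"


def search_url(query: str, fmt: str = "json") -> str:
    """URL a consumer could GET to run a string search over the hub."""
    return f"{BASE_URL}/v1/search?{urlencode({'q': query, 'format': fmt})}"


if __name__ == "__main__":
    print(document_url("/units/1.json", collection="published"))
    # http://localhost:8000/v1/documents?uri=%2Funits%2F1.json&collection=published
    print(search_url("unit A"))
    # http://localhost:8000/v1/search?q=unit+A&format=json
```

A source would send a PUT with a JSON or XML body to the first URL, and a consumer would send a GET to either URL, typically with authentication configured on the server.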

MarkLogic Data Hub Physical Architecture

[Diagram: SOURCES on one side, CONSUMERS on the other, Data Hub in the middle]

Data Hub components: Work Area, Published Area, DSA Filter, Data Sync

Channels labeled in the diagram:
• REST (push or pull), via EMX: API Gateway, to and from REST APIs & Apps
• JMS Q (push, mostly push), via EMX ESB
• File Transfer (push), File Import (pull), File Export (push), and Bulk Files, via EMX: Sterling File Gateway
• ODBC (mostly push on one side, mostly pull on the other), to and from Databases

Legend: Currently Building, Currently Enhancing, MarkLogic, MuleSoft