Copyright © 2013 Accenture All Rights Reserved. Accenture, its logo, and Accenture High Performance Delivered are trademarks of Accenture.
MarkLogic – A NoSQL Database
Presented by: Chandan Abhishek
About me…
Copyright © 2015 Accenture All Rights Reserved. Confidential — For Company Internal Use Only.
Chandan Abhishek
About 6 years of comprehensive IT experience (of which some months were
onsite in the UK).
With Accenture for 1.2 years, as part of the Digital Data & Analytics capability.
Currently working for the client Warner Bros., which comes under CMT.
Technical expertise in Big Data, MarkLogic, PL/SQL and the .NET Framework.
Professional experience with major clients such as Springer, Macmillan and
Warner Bros.
Recently won the CMT Apex Award.
MarkLogic Server
• XML Server
• Special-purpose DBMS for XML
– Semi-structured
– Hierarchical
• Designed for 100s of TB of XML
How Did We Get Here?
• Founder: Christopher Lindblad
– MIT
– Architect of Ultraseek Server
• Intranet search engine product
• Met people that wanted to use a search engine like a database
– Rich query language
– Guaranteed correctness
– Transactions
Consider an Application
• Documents + metadata
• Documents: rich, variable structure
• Want: complex full-text search
• Want: combined text, metadata, structure-aware search
• Want: granular ad hoc access
• Want: real-time query
• How do you build it?
Two-headed Monster
I’m an RDBMS
Answers are right or wrong
I like to combine small pieces
I allow granular access
Linguistic complexity hurts my brain
I guarantee ACID properties
Updates are visible right away
I’m a search engine
Some answers are better than others
Most pieces of information are large
I can give you the whole document
Structure hurts my brain
I’m optimized for sparse data
Updates are visible… oh, whenever
A Different Approach
• Soul of Search Engine: Data Model And Queries
• Database: On-disk Organization And Transactions
Data Model
[Figure: a document page layout with labeled parts: Title, Author, Abstract, Sections, Footer, Sections (cont’d), Metadata]
Data Model
• A database for XML uses the XML Data Model
• XML is a tree
[Figure: document tree. Root: Document. Nodes: Title, Author (First, Last), nested Sections, Metadata]
Example Document
<article>
<title>MarkLogic Server: The Best Place for XML</title>
<author><first-name>John</first-name><last-name>Kreisa</last-name></author>
<abstract>
Where should one put their XML? <company>Mark Logic</company> has the best
answer to this question: MarkLogic Server. . . .
</abstract>
<body>
<section>
<section> This high performance engine can . . . </section>
</section>
<section> Using an inverted index technique . . . </section>
</body>
<copyright>Copyright© 2008 Mark Logic Corporation. All rights Reserved.</copyright>
</article>
What Queries Is It Good At?
1) Full-Text Search
Find all documents that contain the phrase “high performance”.
2) XML Structure
Find all articles that have an abstract.
3) XML Semantics
Find all documents that mention the company “Mark Logic”.
4) All of the above . . . at the same time
Find all articles that contain the phrase “high performance” and
mention the company “Mark Logic” in the abstract.
1) Full-text Search
Find all documents that contain the phrase “high performance”
1) Full-text Search
Find all documents that contain the phrase “high performance”
Document   very   high   performance   index
122        0      1      0             0
123        1      0      1             1
124        0      0      0             0
125        0      1      0             0
126        0      1      1             0
127        1      0      0             0
129        1      1      0             0
130        0      1      1             1
1) Full-text Search
Find all documents that contain the phrase “high performance”
UNIVERSAL INDEX
Term                   Document References
“very”                 123, 127, 129, 152, 344, 791 . . .
“high”                 122, 125, 126, 129, 130, 167 . . .
“performance”          123, 126, 130, 142, 143, 167 . . .
“index”                123, 130, 131, 135, 162, 177 . . .
“high performance”     126, 130, 167, 212, 219, 377 . . .
“very high”            . . .
“performance index”    . . .
Matching Document References: 126, 130, 167, 212, 219, 377 . . .
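A minimal Python sketch of the idea behind the term lists above, assuming a toy in-memory index. The sample documents and the `build_index` helper are hypothetical and only illustrate the technique; this is not MarkLogic's implementation. Each word and each two-word phrase is treated as a term that maps to the documents containing it, so a phrase search becomes a single lookup.

```python
from collections import defaultdict


def build_index(docs):
    """Map every word and every two-word phrase to the sorted ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        words = text.lower().split()
        for word in words:
            index[word].add(doc_id)
        # Adjacent word pairs are indexed as phrase terms.
        for first, second in zip(words, words[1:]):
            index[f"{first} {second}"].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical sample documents, keyed by document id.
docs = {
    122: "high availability",
    123: "very fast performance index",
    126: "a high performance engine",
    129: "very high throughput",
    130: "high performance index",
}

index = build_index(docs)
print(index["high"])              # [122, 126, 129, 130]
print(index["high performance"])  # [126, 130]
```

Indexing phrases as terms trades index size for lookup speed: the phrase query needs no position checks at query time.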
2) XML Structure
Find all articles that have an abstract
2) XML Structure
Find all articles that have an abstract
UNIVERSAL INDEX
Term                    Document References
“very”                  123, 127, 129, 152, 344, 791 . . .
“high”                  122, 125, 126, 129, 130, 167 . . .
“performance”           123, 126, 130, 142, 143, 167 . . .
“index”                 123, 130, 131, 135, 162, 177 . . .
“high performance”      126, 130, 167, 212, 219, 377 . . .
<article>               . . .
<article>/<abstract>    . . .
Matching Document References: 126, 130, 167, 212, 219, 377 . . .
3) XML Semantics
Find all documents that mention the company “Mark Logic”
3) XML Semantics
Find all documents that mention the company “Mark Logic”
UNIVERSAL INDEX
Term                     Document References
“very”                   123, 127, 129, 152, 344, 791 . . .
“high”                   122, 125, 126, 129, 130, 167 . . .
“performance”            123, 126, 130, 142, 143, 167 . . .
“index”                  123, 130, 131, 135, 162, 177 . . .
“high performance”       126, 130, 167, 212, 219, 377 . . .
<article>                . . .
<article>/<abstract>     . . .
<company>Mark Logic</    . . .
Matching Document References: 126, 130, 167, 212, 219, 377 . . .
4) All Of The Above
Find all articles that contain the phrase “high performance” and
mention the company “Mark Logic” in the abstract
4) All Of The Above
Find all articles that contain the phrase “high performance” and
mention the company “Mark Logic” in the abstract
UNIVERSAL INDEX
Term                     Document References
“very”                   123, 127, 129, 152, 344, 791 . . .
“high”                   122, 125, 126, 129, 130, 167 . . .
“performance”            123, 126, 130, 142, 143, 167 . . .
“index”                  123, 130, 131, 135, 162, 177 . . .
“high performance”       126, 130, 167, 212, 219, 377 . . .
<article>                . . .
<article>/<abstract>     . . .
<abstract>/<company>     . . .
<company>Mark Logic</    . . .
Matching Document References: 126, 130, 167, 212, 219, 377 . . .
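One way to picture the combined query is as an intersection of term lists: a word term, a structure term, and a value term are all resolved from the same index and the candidate documents are those present in every list. The sketch below is a simplified Python illustration under that assumption; the term keys and document ids are hypothetical and the real resolution logic in the server may differ.

```python
def intersect(*postings):
    """Return the sorted document ids that appear in every posting list."""
    result = set(postings[0])
    for posting in postings[1:]:
        result &= set(posting)
    return sorted(result)


# Hypothetical term lists: text, structure and value terms share one index.
universal_index = {
    'word:"high performance"': [126, 130, 167, 212],
    "element:article": [122, 126, 130, 167, 300],
    "path:abstract/company": [126, 167, 212, 300],
    'value:company="Mark Logic"': [126, 130, 167, 300],
}

matches = intersect(*universal_index.values())
print(matches)  # [126, 167]
```

Because every constraint is just another term list, adding a new kind of constraint does not change the query machinery.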
Scalar Indexes
Identify a set of documents based on criteria and then characterize the
set with scalar indexes (float, dateTime, string etc.)
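A rough Python sketch of a scalar (range) index, assuming it is kept as value and document-id pairs sorted by value. The sample dates, ids, and helper names (`docs_in_range`, `values_for`) are hypothetical. It shows both directions: finding documents by a value range, and characterizing an already identified set of documents by its values.

```python
from bisect import bisect_left, bisect_right

# (value, document id) pairs kept sorted by value; ISO dates sort correctly as strings.
range_index = sorted([
    ("2001-03-15", 126),
    ("2002-10-10", 130),
    ("2003-01-02", 167),
    ("2002-05-20", 212),
])


def docs_in_range(index, low, high):
    """Document ids whose value lies in [low, high], found by binary search."""
    values = [value for value, _ in index]
    start = bisect_left(values, low)
    stop = bisect_right(values, high)
    return sorted(doc_id for _, doc_id in index[start:stop])


def values_for(index, doc_ids):
    """Values belonging to a given set of documents, in sorted order."""
    wanted = set(doc_ids)
    return [value for value, doc_id in index if doc_id in wanted]


print(docs_in_range(range_index, "2002-01-01", "2002-12-31"))  # [130, 212]
print(values_for(range_index, [126, 130]))  # ['2001-03-15', '2002-10-10']
```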
Geospatial, too
Just a special kind of scalar index, except values are points and scan
operators know about Earth geometry
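As an illustration of "values are points", the following Python sketch stores one latitude/longitude pair per document and filters by great-circle distance using the haversine formula. The document ids, coordinates, and function names are hypothetical, and a linear scan stands in for whatever index structure the server actually uses.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(p, q):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    lat1, lon1 = math.radians(p[0]), math.radians(p[1])
    lat2, lon2 = math.radians(q[0]), math.radians(q[1])
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Hypothetical point values: document id -> (latitude, longitude).
point_index = {
    126: (51.5074, -0.1278),   # London
    130: (48.8566, 2.3522),    # Paris
    167: (40.7128, -74.0060),  # New York
}


def docs_within(index, center, radius_km):
    """Document ids whose point lies within radius_km of center."""
    return sorted(doc_id for doc_id, point in index.items()
                  if haversine_km(center, point) <= radius_km)


nearby = docs_within(point_index, (51.5074, -0.1278), 500)
print(nearby)  # [126, 130]
```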
Universal Index Is Our Hammer
We turn queries into nails
Examples Of Nails
• Directories
– Exclusive, hierarchical, analogous to a file system, map to URIs
• Collections
– Set-based, N:N relationship
• Security
– Invisible to your app
Many Shapes And Sizes
• News Article
• Book
• Research Report
• Slide Presentation
• Product Sheet
• Operations Manual
Load As Is
XML is self-describing
<article>
<title>MarkLogic Server: . . .</title>
<author>
<first-name>John</first-name>
<last-name>Kreisa</last-name>
</author>
<abstract>
. . . . <company>Mark Logic</company>
</abstract>
<body>
<section>
<section> . . .</section>
</section>
<section> . . . index . . . </section>
</body>
<copyright>Copyright© . . . </copyright>
</article>
Load As Is
XML is self-describing
<article>
  <title>            "MarkLogic Server: . . ."
  <author>
    <first-name>     "John"
    <last-name>      "Kreisa"
  <abstract>         " . . . "
    <company>        "MarkLogic"
  <body>
    <section>
      <section>      " . . . "
    <section>        " . . . index. . . "
  <copyright>        " . . . "
Load As Is
XML is self-describing. No Schema Needed!
Degrees Of Flexibility
[Figure: chart with two axes, Structure (predefined to ad hoc) and Queries (predefined to ad hoc), positioning IMS, IDMS, Relational Databases, Search Engines, and MarkLogic Server (XML Server)]
The Query Language
XML
Universal Index
XQuery
Full-Text Search
XML Structure
XML Semantics
Application Logic
Manipulate XML
Render Results
Load As Is
The Programming Language
XML
Universal Index
XQuery
Full-Text Search
XML Structure
XML Semantics
Application Logic
Manipulate XML
Render Results
Load As Is
A Different Approach
• Soul of a Search Engine: Data Model And Queries
• Database: On-disk Organization And Transactions
What’s In A Database?
• No tables
• No rows
• Forests . . . of trees
[Figure: a Database containing Forest1, Forest2, Forest3]
The Cluster
[Figure: cluster of hosts e1, e2, … ek and hosts d1, d2, d3, … dl, with forests Forest1, Forest2, Forest3, Forest4, … Forestm]
What About Updates?
• Typical XML document:
– 10KB – 1MB
– Referenced by 1,000s to 10,000s of term lists
• Search engines are bad at updates
– Many indexes to update
– Option: Index and Information out of sync
– Option: Slow
• We want
– High throughput
– Transactions (ACID)
• So how do we avoid updates?
Solution: Temporal Database
• No update! No delete!
• Only insert and read-at-a-time
• Every document has two timestamps
– “created”, “expired”
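The insert-only model can be sketched in a few lines of Python, assuming a simple logical clock: an update adds a new version and stamps the old one as expired, a delete only stamps, and a read picks the version that was live at the requested timestamp. The `TemporalStore` class and its method names are hypothetical and are meant only to illustrate the two-timestamp idea.

```python
import math


class TemporalStore:
    """Toy insert-only store: every version carries created/expired timestamps."""

    def __init__(self):
        self.clock = 0
        self.versions = []  # dicts with uri, content, created, expired

    def _tick(self):
        self.clock += 1
        return self.clock

    def _expire_current(self, uri, timestamp):
        # Stamp the live version (if any); its content is never modified.
        for version in self.versions:
            if version["uri"] == uri and version["expired"] == math.inf:
                version["expired"] = timestamp

    def insert(self, uri, content):
        """Create or update: expire the old version and add a new one."""
        timestamp = self._tick()
        self._expire_current(uri, timestamp)
        self.versions.append({"uri": uri, "content": content,
                              "created": timestamp, "expired": math.inf})
        return timestamp

    def delete(self, uri):
        """Delete is just an expiry stamp."""
        timestamp = self._tick()
        self._expire_current(uri, timestamp)
        return timestamp

    def read(self, uri, at):
        """Return the content that was live at timestamp `at`, or None."""
        for version in self.versions:
            if version["uri"] == uri and version["created"] <= at < version["expired"]:
                return version["content"]
        return None


store = TemporalStore()
t1 = store.insert("a.xml", "a v1")
t2 = store.insert("b.xml", "b v1")
t3 = store.insert("a.xml", "a v2")  # update = expire old version + insert new one
t4 = store.delete("b.xml")

print(store.read("a.xml", at=t2))  # a v1
print(store.read("a.xml", at=t3))  # a v2
print(store.read("b.xml", at=t4))  # None
```

A query that fixes its timestamp at the start sees a consistent snapshot without taking locks, because later writes never change the versions it can see.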
Temporal Database
[Figure: timeline from timestamp 520 to 528 showing Create a.xml, Create b.xml, Update a.xml, Update a.xml, Delete b.xml, ... with two queries each running at a point in time]
A Single Forest
[Figure: a host holding Forestk, with stands Stand1, Stand2, … Standn and two buffers]
1. Create A New Tree
2. Expire Trees
3. Save A Buffer To Disk
The Four Forest Operations
1. Create a new document
• Into a buffer
2. Mark a document as expired
• Memory-mapped document timestamps per stand
3. Write buffer out to disk
• Our buffers are 100s of megabytes
• For performance, double buffer
4. Merge
• Background process
• Optimization: reduces number of stands in forest
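The four operations can be sketched as a toy Python class, assuming a forest is one buffer plus a list of immutable stands. The `Forest` class, its method names, and the tiny buffer limit are hypothetical simplifications: the real server tracks timestamps per stand, double-buffers, and merges in the background.

```python
class Forest:
    """Toy forest: one in-memory buffer plus a list of saved stands."""

    def __init__(self, buffer_limit=2):
        self.buffer_limit = buffer_limit
        self.buffer = []      # documents not yet written out
        self.stands = []      # each saved stand is an immutable tuple
        self.expired = set()  # ids of documents marked as expired

    def create(self, doc_id, content):
        # 1. New documents only ever go into the buffer.
        self.buffer.append((doc_id, content))
        if len(self.buffer) >= self.buffer_limit:
            self.save_buffer()

    def expire(self, doc_id):
        # 2. Nothing is rewritten; the document is only flagged.
        self.expired.add(doc_id)

    def save_buffer(self):
        # 3. A full buffer becomes a new stand.
        if self.buffer:
            self.stands.append(tuple(self.buffer))
            self.buffer = []

    def merge(self):
        # 4. Combine all stands into one, dropping expired documents.
        live = [doc for stand in self.stands for doc in stand
                if doc[0] not in self.expired]
        self.stands = [tuple(live)] if live else []

    def live_documents(self):
        saved = [doc for stand in self.stands for doc in stand]
        return [doc for doc in saved + self.buffer if doc[0] not in self.expired]


forest = Forest(buffer_limit=2)
for i in range(1, 6):
    forest.create(i, f"doc {i}")
stands_before = len(forest.stands)  # two saved stands, one document still buffered
forest.expire(2)
forest.merge()
live_ids = [doc_id for doc_id, _ in forest.live_documents()]
print(stands_before, len(forest.stands), live_ids)  # 2 1 [1, 3, 4, 5]
```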
Consistency And Throughput
• 2-phase commit
– Transactions span forests
• Recovery
– Forest Journals
• Lock-free queries
– Use the search engine at a point-in-time
– Increased throughput
– Time travel?
A Different Approach
• Soul of a Search Engine: Data Model And Queries
• Database: On-disk Organization And Transactions
Native XQuery Support
• MarkLogic supports XQuery as its native interface
– Query language designed for querying XML data and content
– An open, W3C standard
• Example content query: quality assurance
• Find all medical procedures that have incision before anesthesia
for $proc in /book/section[title = "Procedure"]
where not (some $a in $proc//anesthesia
           satisfies $a << ($proc//incision)[1])
return $proc
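The XQuery `<<` operator compares nodes by document order. For readers without an XQuery engine at hand, the same check can be approximated in Python with the standard library, relying on `Element.iter()` walking descendants in document order. The sample `<book>` and the `flagged` helper are hypothetical; this sketch mirrors the query's logic, including the case where a section has no incision at all.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: p1 is safe, p2 has the incision first.
BOOK = """
<book>
  <section id="p1"><title>Procedure</title>
    <anesthesia>local</anesthesia><incision>first cut</incision>
  </section>
  <section id="p2"><title>Procedure</title>
    <incision>first cut</incision><anesthesia>local</anesthesia>
  </section>
</book>
"""


def flagged(section):
    """True when no anesthesia element precedes the first incision in document order."""
    tags = [el.tag for el in section.iter() if el.tag in ("anesthesia", "incision")]
    if "incision" not in tags:
        # Mirrors the XQuery: comparing against an empty sequence is never true.
        return True
    first_incision = tags.index("incision")
    return "anesthesia" not in tags[:first_incision]


root = ET.fromstring(BOOK)
procedures = [s for s in root.findall("section") if s.findtext("title") == "Procedure"]
flagged_ids = [s.get("id") for s in procedures if flagged(s)]
print(flagged_ids)  # ['p2']
```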
Manipulate Content
• Navigate within content
– Walk through the tree structure of the document – e.g.,
– Create breadcrumb trail to top of document
– Move to adjacent paragraphs, illustrations, tables, or captions
• Modify content programmatically
– Translate content to different languages
– Alphabetize index terms and produce new index sheet
– Summarize by returning lead paragraphs or topic sentences
• Combine content from multiple sources
– Nested queries across content sources
– Create common index across content from multiple sources
Render Content
• Flexibly output content for multi-channel delivery
– XHTML for web browsers
– XSL-FO for PDF generation, custom publishing
– WML for mobile devices
– Office XML for Microsoft Office documents
• High-performance, server-based transformations
– Performed close to the content
– Faster than XSLT
Search Processing Model
(1) Specify a search using composable constructors
cts:and-query(("wrist", "injury"))
cts:or-query((cts:and-query(("cat", "scratch")),
              cts:and-query(("dog", "bite"))))
cts:and-not-query("United States", "Texas")
cts:element-query(xs:QName("Year"),
                  cts:or-query(("1980", "1981")))
(2) Define a searchable set of nodes
//MedlineCitation[
  Journal/JournalIssue/PubDate/Year = "1980"]
(3) Apply the search query to the nodes
cts:search(//MedlineCitation,
           cts:and-query(("wrist", "injury")))
(4) Return the results in relevance order
Free Text Search
cts:word-query( $text as xs:string,
[$options as xs:string*],
[$weight as xs:double] )
as cts:word-query
• Options include:
"case-sensitive"            Specifies a case-sensitive query
"case-insensitive"          Specifies a case-insensitive query
"punctuation-sensitive"     Specifies a punctuation-sensitive query
"punctuation-insensitive"   Specifies a punctuation-insensitive query
"stemmed"                   Specifies a stemmed query
"unstemmed"                 Specifies an unstemmed query
"wildcarded"                Specifies a wildcarded query
"unwildcarded"              Specifies an unwildcarded query
"lang=en"                   Specifies (e.g.) that the query is in English
Boolean Queries
• cts:and-query()
conjunction of an arbitrary list of sub-queries
• cts:or-query()
disjunction of an arbitrary list of sub-queries
• cts:and-not-query()
relative complement of two sub-queries
• cts:not-query()
complement of a single sub-query
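In set terms, the four Boolean constructors correspond to intersection, union, difference, and complement over sets of matching documents. The Python functions below are hypothetical stand-ins named after the cts constructors; they operate on plain sets of document ids and are not the MarkLogic API.

```python
# Hypothetical universe of documents and per-word matches.
ALL_DOCS = {1, 2, 3, 4, 5}
matches = {
    "cat": {1, 2},
    "scratch": {2, 3},
    "dog": {3, 4},
    "bite": {4, 5},
}


def and_query(*subqueries):
    """Conjunction: documents matching every sub-query."""
    return set.intersection(*subqueries)


def or_query(*subqueries):
    """Disjunction: documents matching at least one sub-query."""
    return set.union(*subqueries)


def and_not_query(positive, negative):
    """Relative complement: matches of the first sub-query minus the second."""
    return positive - negative


def not_query(subquery):
    """Complement: every document that does not match."""
    return ALL_DOCS - subquery


result = or_query(and_query(matches["cat"], matches["scratch"]),
                  and_query(matches["dog"], matches["bite"]))
print(sorted(result))                                             # [2, 4]
print(sorted(and_not_query(matches["cat"], matches["scratch"])))  # [1]
print(sorted(not_query(matches["dog"])))                          # [1, 2, 5]
```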
Linguistic Controls
• “case-sensitive”, “case-insensitive” options
– Configuration option to add case-sensitive index terms
cts:word-query("Genetic Engineering", "case-insensitive")
• “punctuation-sensitive”, “punctuation-insensitive” options
– Configuration option to add punctuation-sensitive index terms
cts:word-query("Genetic-Engineering", "punctuation-insensitive")
• Stemming - “stemmed”, “unstemmed” query options
– Stemming does not cross different parts of speech
• Thesaurus – XML Schema, query expansion
Spelling
• Double-metaphone
• Spelling suggestions
spell:suggest("/mySpell/spell.xml", "alfabet")
• Spell checking
spell:is-correct("/mySpell/spell.xml", "alfabet")
• Dictionary load and management
spell:load("c:\dictionaries\spell.xml",
           "/mySpell/spell.xml")
spell:add-word("/mySpell/spell.xml", "uxorious")
spell:remove-word("/mySpell/spell.xml", "atomise")
Wildcards
• A*, *B, A*B, A?, ?B, A?B, A*B*C, A*B?C*, etc.
• Regular expression optimization
• For example:
cts:search(input(), cts:word-query("he*"))
will result in a wildcard search
• Character indexing provides optimization for fn:contains(), fn:matches(),
fn:starts-with(), fn:ends-with()
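One common way to resolve a wildcard term is to expand the pattern against the list of known words and then search for the expanded words. The Python sketch below assumes that approach, using the standard library's `fnmatchcase` for `*` and `?`; the lexicon and the `expand` helper are hypothetical, and the server's own wildcard indexes may work differently.

```python
from fnmatch import fnmatchcase

# Hypothetical lexicon of words known to the index.
lexicon = ["he", "heart", "help", "hello", "the", "shell"]


def expand(pattern, words):
    """Return the words matching a pattern where * is any run of characters and ? is one character."""
    return [word for word in words if fnmatchcase(word, pattern)]


print(expand("he*", lexicon))   # ['he', 'heart', 'help', 'hello']
print(expand("he?p", lexicon))  # ['help']
```

The expanded words can then be combined as an OR over their term lists.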
Proximity Queries
cts:near-query($queries, $distance, $ordered, $weight)
• The results match if two queries match and the distance between the two matches is equal to or less than the specified distance. A distance of 0 matches only when there is overlapping text. The default value is 100.
• For example,
cts:search(//p,
  cts:near-query(
    (cts:word-query("James"),
     cts:word-query("Maxwell")), 2))
Proximity Queries
For example,
cts:search(//p,
  cts:near-query((
    cts:near-query(("James", "Maxwell"), 2),
    cts:near-query(("Albert", "Einstein"), 2),
    cts:near-query(("Lorentz", "Contraction"), 2)
  ), 50, "unordered"))
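A proximity check can be sketched in Python over word positions, assuming distance is measured as the difference between word offsets. That counting rule, the sample sentence, and the helpers `word_positions` and `near` are assumptions for illustration; the exact distance semantics of cts:near-query may differ.

```python
def word_positions(text):
    """Map each lower-cased word to the list of offsets where it occurs."""
    positions = {}
    for offset, word in enumerate(text.lower().split()):
        positions.setdefault(word, []).append(offset)
    return positions


def near(positions_a, positions_b, distance, ordered=False):
    """True if some occurrence of A and of B lie within `distance` words of each other."""
    for a in positions_a:
        for b in positions_b:
            gap = b - a
            if ordered and gap < 0:
                continue  # in ordered mode, B must not come before A
            if abs(gap) <= distance:
                return True
    return False


# Hypothetical sample paragraph.
positions = word_positions("James Clerk Maxwell formulated the equations")

print(near(positions["james"], positions["maxwell"], 2))                # True
print(near(positions["maxwell"], positions["james"], 2, ordered=True))  # False
print(near(positions["james"], positions["equations"], 2))              # False
```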
Beyond Free Text Search
• XML Query / search integration
– XML granular search
– XPath constraints
– Rich interaction between text and structural constraints
– Free access to all fields and combinations of constraints
• XML searchable database
– Integrate data, metadata, search and update
Range Queries
• Numeric range queries are optimized with range indexes
//article[date <= xs:dateTime("2002-10-10T17:00:00Z")]
• Lexicographic range queries, likewise
//article[("A" <= name) and (name < "B")]
• Sort optimization uses range indexes to eliminate post-sort
for $x in //article
order by $x/last/name, $x/first/name
return <li>{ $x/date }</li>
Structured Highlighting
• Embed hyperlink to commercial drug equivalent for each instance of a generic drug:
define function lookup-drug-name($name as xs:string)
{
  <xhtml:selection>
  {
    doc("drug-list.xml")/name[.=$name]/variants
  }
  </xhtml:selection>
}

for $a in cts:search(//articles, "ibuprofen")
return
  cts:highlight($a, cts:word-query("ibuprofen"),
                lookup-drug-name("ibuprofen"))
Summary
• XML as data model
– Ad hoc schema
• A search engine core
– Universal Index
• Temporal transaction model
– High throughput while keeping . . .
• Performance and scalability of a search engine
Questions?
Thank You.
For further information, please contact [email protected]