Solr Graph Query: Presented by Kevin Watters, KMW Technology
TRANSCRIPT
OCTOBER 11-14, 2016 • BOSTON, MA
Solr Graph Query Kevin Watters
Founder, KMW Technology
Solr 6.0 Graph Query Overview
Kevin Watters KMW Technology [email protected] www.kmwllc.com
October 14, 2016
KMW Technology Overview
• Boston-based software consulting and professional services organization.
• Founded in 2010.
• Developers and consultants with deep industry experience.
• Boutique firm specializing in Open Source, Search, Big Data, Machine Learning, and AI.
• Custom Connectors, Pipelines, Classifiers, Search, and UI/UX development.
• Data and Information Architecture.
What is a Graph?
“One data model to rule them all!”
A generic representation of all linked data models. G = <V,E> ?!?!
A graph is made up of nodes and edges:
• Nodes/Vertices (node_id) have metadata and links to other nodes.
• Edges/Links (edge_ids) are associated with a node and point to other nodes.
Nodes can be modeled as documents in the index with a multi-valued field containing the edges. For other use cases, edges can also be modeled as documents.
Graph Traversal
There are many graph traversal and exploration algorithms: DFS, BFS, A*, Alpha–beta, etc.
Solr Graph Query implements breadth-first search (BFS). Each hop expands the “frontier” of the graph, and all edges leaving the current frontier are explored in a single step/query!
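The frontier expansion described above can be sketched in plain Python. This is only an illustration of breadth-first traversal over documents that carry a node id and a multi-valued edge field; the toy documents and the function name graph_traverse are assumptions made for the example, not Solr's implementation.

```python
# Hypothetical toy index: each document has a node id and a multi-valued edge field.
DOCS = [
    {"node_id": "A", "edge_ids": ["B", "C"]},
    {"node_id": "B", "edge_ids": ["D"]},
    {"node_id": "C", "edge_ids": ["D", "A"]},  # C -> A closes a cycle
    {"node_id": "D", "edge_ids": []},
]


def graph_traverse(docs, roots, max_depth=-1, return_root=True):
    """Breadth-first traversal; max_depth=-1 means traverse until exhausted."""
    by_id = {d["node_id"]: d for d in docs}
    roots = set(roots)
    visited = set(roots)   # plays the role of the result set
    frontier = set(roots)
    depth = 0
    while frontier and (max_depth < 0 or depth < max_depth):
        # Collect every edge leaving the current frontier in one step.
        edges = {e for n in frontier for e in by_id[n]["edge_ids"]}
        # Only nodes not seen before form the next frontier, so cycles terminate.
        frontier = {e for e in edges if e in by_id} - visited
        visited |= frontier
        depth += 1
    return visited if return_root else visited - roots


everything = graph_traverse(DOCS, {"A"})
one_hop = graph_traverse(DOCS, {"A"}, max_depth=1, return_root=False)
```

Here one_hop holds only the direct neighbours of A, which mirrors the JOIN-like behaviour of a single hop.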
Graph Query Parser Syntax
Parameter        Default  Description
from                      Field containing the node id.
to                        Field containing the edge id(s).
maxDepth         -1       The number of hops to traverse from the root of the graph. -1 means traverse until all edges and documents have been collected. maxDepth=1 behaves similarly to a JOIN.
traversalFilter  null     Arbitrary query string to apply at each hop of the traversal.
returnRoot       true     true or false. Indicates whether the documents matching the root query should be returned.
returnOnlyLeaf   false    true or false. Indicates whether to return only documents in the result set that do not have a value in the “to” field.
useAutn          false    Whether to use an Automaton query or a TermsQuery for edge traversal.
Uses Solr’s query parser plugin and “local params” syntax:
{!graph from="node_id" to="edge_ids"}query
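As a rough sketch of how the local params syntax above fits together, the helper below assembles a graph query string from the parameters in the table. The function name graph_query is hypothetical, and the sketch assumes that parameter values contain no spaces or quotes; a traversalFilter containing spaces would need quoting that is not handled here.

```python
def graph_query(root_query, from_field="node_id", to_field="edge_ids", **options):
    """Build a {!graph ...} local-params query string (illustrative helper only)."""
    parts = [f"from={from_field}", f"to={to_field}"]
    for name, value in options.items():
        if isinstance(value, bool):
            # Render Python booleans the way the query syntax spells them.
            value = "true" if value else "false"
        parts.append(f"{name}={value}")
    return "{!graph " + " ".join(parts) + "}" + root_query


leaf_query = graph_query("id:user_1", returnOnlyLeaf=True)
two_hops = graph_query("id:doc_1", maxDepth=2, returnRoot=False)
```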
Key Features and Design Goals
“Graph is a Filter on top of your data” -someone
• Designed for large scale, a large number of edges, and very deep traversals.
• Limited memory usage for traversal.
• Cycle detection for “free” (based on the current bit set!).
• Highly cacheable via the FilterCache!
• Supports multiValued fields for nodes and/or edges.
• Supports arbitrary query filters during the exploration with the “Traversal Filter”.
• Follow every edge! No edge left behind! Traversal is complete!
• Works with Facets, Facet Queries, and other search components seamlessly.
Memory Usage
• One bit set to rule them all (for the result set).
• The BitSet provides cycle detection for free. (Have I been here before?)
• The BitSet has one bit per document in the index.
• A 100 million document index uses only about 12 MB of RAM per query! (Same size as 1 filter cache entry!)
• An additional root nodes BitSet is needed only if returnRoot=false.
• The leaf nodes BitSet is the same for all graph queries.
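The 12 MB figure follows from one bit per document. A quick check of the arithmetic, assuming a plain bit set with no additional per-object overhead:

```python
def bitset_bytes(num_docs):
    """Bytes needed for one bit per document, rounded up to whole bytes."""
    return (num_docs + 7) // 8


size_bytes = bitset_bytes(100_000_000)
size_mb = size_bytes / 1_000_000  # decimal megabytes
print(f"{size_bytes} bytes is about {size_mb} MB")
```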
Performance Considerations
• Use DocValues, they’re SO MUCH FASTER!
• Don’t tokenize your node/edge ids! (unless that’s what you want)
• Performance is a function of the number of unique edges that are traversed, not the number of nodes.
• Limit depth if you know how far to go in the traversal.
Graph Query For Security
• Graph queries are elegant and simple to use for traversing security hierarchies such as LDAP and AD.
• They also fit custom security models that are hierarchical or folder-based in nature.
• Supports Users being members of Groups that can be members of other Groups.
• Adding or removing a user/group means updating just 1 document, not re-indexing large portions of your index!
Example Company with Security Model
Document Security Model within the Solr Index
Graph Traversal for User 1
Graph Traversal for User 2
Graph Based Security Query
• Single security query to traverse the graph: {!graph from=node_id to=edge_ids returnOnlyLeaf=true}id:user_1
• Security query is applied as a filter to the query request to ensure the security filter is cached!
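A minimal sketch of attaching the security query as a filter, using only the Python standard library to build request parameters. The user query title:budget and the helper name secured_params are made up for illustration, and the sketch assumes the user id needs no query escaping.

```python
from urllib.parse import parse_qs, urlencode


def secured_params(user_query, user_id):
    """Pair the user's query (q) with the graph security filter (fq)."""
    security_filter = (
        "{!graph from=node_id to=edge_ids returnOnlyLeaf=true}id:" + user_id
    )
    # fq keeps the security traversal separate from the main query,
    # so it can be cached independently of what the user searches for.
    return {"q": user_query, "fq": security_filter}


params = secured_params("title:budget", "user_1")
query_string = urlencode(params)   # URL-encoded form of the two parameters
decoded = parse_qs(query_string)   # round trip to confirm nothing was lost
```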
Distributed & Solr Cloud
• You can distribute the user/group records to all shards in the index with smart routing!
• Distribute only the documents across the shards.
• A fixed set of permission records on each shard, combined with distributed documents, keeps graph traversals local for the best performance!
Users, Actions and Items
• Model your browsing/purchase history as:
– Users (have an ID)
– Items (have an ID, metadata, category, etc.)
– Actions (link between users and items, such as rating, purchase, like/dislike)
Find similar users
• Graph traversal from a user (or set of users) through their actions to the items they like, on to similar users, and out to the items those users like.
• Now, exclude the original starting set with “returnRoot=false”.
[Figure: traversal from Item 1 (root) through Action/Buy records (depth=1) to User 1 and User 2 (depth=2), through their Action/Buy records (depth=3), to Item 2, Item 3, and Item 4 (depth=4).]
4 hops in the graph from an item gets you to related items. Omit the starting point and return only records that are “items”. Text after the local params is parsed as the root query, so the type restriction is applied as a separate filter:
q={!graph from=node_id to=edge_id maxDepth=4 returnRoot=false}id:Item_1
fq=type:item
Users who buy X also buy Y
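The four-hop item-to-item traversal can be illustrated on a toy data set. The document layout below, where each action record links to both its user and its item, is an assumption made for this sketch, as are all ids and the function name also_bought.

```python
# Hypothetical records keyed by node id; "edge_id" lists the linked node ids.
DOCS = {
    "item_1": {"type": "item", "edge_id": ["buy_1", "buy_2"]},
    "item_2": {"type": "item", "edge_id": ["buy_3"]},
    "item_3": {"type": "item", "edge_id": ["buy_4"]},
    "item_9": {"type": "item", "edge_id": ["buy_9"]},  # bought by an unrelated user
    "user_1": {"type": "user", "edge_id": ["buy_1", "buy_3"]},
    "user_2": {"type": "user", "edge_id": ["buy_2", "buy_4"]},
    "user_9": {"type": "user", "edge_id": ["buy_9"]},
    "buy_1": {"type": "action", "edge_id": ["user_1", "item_1"]},
    "buy_2": {"type": "action", "edge_id": ["user_2", "item_1"]},
    "buy_3": {"type": "action", "edge_id": ["user_1", "item_2"]},
    "buy_4": {"type": "action", "edge_id": ["user_2", "item_3"]},
    "buy_9": {"type": "action", "edge_id": ["user_9", "item_9"]},
}


def also_bought(docs, root, max_depth=4):
    """Items reachable within max_depth hops of root, excluding the root itself."""
    visited = {root}
    frontier = {root}
    for _ in range(max_depth):
        # Follow every edge out of the frontier, skipping nodes already seen.
        frontier = {e for n in frontier for e in docs[n]["edge_id"]} - visited
        visited |= frontier
    # Equivalent of returnRoot=false plus a type:item restriction.
    return sorted(n for n in visited if n != root and docs[n]["type"] == "item")


related = also_bought(DOCS, "item_1")
```

In this toy data, item_9 is never reached because no buyer of item_1 bought it.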
WordNet as a Knowledge Graph
WordNet, maintained by Princeton University, provides a hierarchical model of the English language. Words have relationships to each other, such as:
• Hypernym: a more general case of another word
• Hyponym: a more specific case of another word
For example:
• Jaguar is a type of Cat
• Cat is a type of Animal
Cat is a hypernym of Jaguar. Jaguar is a hyponym of Cat.
Index WordNet entries with fields containing the links to the hypernyms and hyponyms!
WordNet Hypernym Traversal
+{!graph from="synset_id" to="hypernym_id" maxDepth=8}sense_lemma:jaguar
WordNet Graph Intersections
Is a jaguar a type of animal? If a graph intersection exists, the answer is yes! Intersections of knowledge graph traversals can be used to answer questions!
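One way to picture the intersection idea is to traverse upward from the specific word along hypernym links, traverse downward from the general word along hyponym links, and check whether the two result sets overlap. The tiny hypernym table below is invented for the example and is not WordNet data.

```python
# Hypothetical hypernym links: word -> more general words.
HYPERNYMS = {
    "jaguar": ["cat"],
    "cat": ["animal"],
    "animal": [],
    "oak": ["tree"],
    "tree": ["plant"],
    "plant": [],
}


def traverse(edges, root):
    """Collect every node reachable from root by following edges breadth-first."""
    seen = {root}
    frontier = {root}
    while frontier:
        frontier = {t for n in frontier for t in edges.get(n, [])} - seen
        seen |= frontier
    return seen


def invert(edges):
    """Turn hypernym links into hyponym links (general -> more specific)."""
    inverted = {}
    for source, targets in edges.items():
        for target in targets:
            inverted.setdefault(target, []).append(source)
    return inverted


def is_a(specific, general):
    up = traverse(HYPERNYMS, specific)           # hypernym traversal
    down = traverse(invert(HYPERNYMS), general)  # hyponym traversal
    return bool(up & down)                       # non-empty intersection means yes


jaguar_is_animal = is_a("jaguar", "animal")
oak_is_animal = is_a("oak", "animal")
```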
Wikipedia
• Pages have links! Lots of links…
• Pages have Infoboxes that contain great metadata.
• Infobox types include person, scientist, writer, artist, etc.
• What if you’re looking for all Wikipedia pages about people?
Infobox facets
• The infobox tags are more specific than the user’s search request.
• Searching for People should include Scientists, Authors, and Artists!
• Wikipedia doesn’t know a Scientist is a person, but WordNet does!
WordNet knows a scientist is a person!
Wikipedia pages linked to Graph Theory
Information Overload! It’s difficult to see the people in this sea of information!
Combine WordNet and Wikipedia With Graph Queries to find people!
Using WordNet we’re able to disambiguate that the entity_types of “scientist”, “person” and “philosopher” are all types of people! Normal Faceting is not enough!
Nested and Filtered Graph Queries!
The Graph query can be nested. This allows you to traverse one set of fields, then change the fields you are traversing. This example first traverses all WordNet documents that are a type of person, then, based on that result set, does a 1-hop traversal to the Wikipedia data on the entity_type field to restrict the results.
{!graph from="entity_type" to="sense_lemma" maxDepth=1}{!graph from="sense_lemma" to="sense_hyponym_lemma" maxDepth=2}sense_lemma:person
Intersect that with pages that are linked to from the Wikipedia query of node_id:"Graph theory":
{!graph from=node_id to=edge_ids maxDepth=1}node_id:"Graph theory"
Additionally, use returnRoot=false if you want to omit the WordNet docs from the result set!
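The nested traversal can be simulated on a handful of toy documents to show how the inner result set becomes the root of the outer hop. All documents, ids, and the function name graph are hypothetical; the field names follow the queries above, and the sketch supports only a positive max_depth.

```python
# Hypothetical mixed index: WordNet sense docs and Wikipedia page docs.
DOCS = [
    {"id": "wn1", "sense_lemma": ["person"], "sense_hyponym_lemma": ["scientist", "writer"]},
    {"id": "wn2", "sense_lemma": ["scientist"], "sense_hyponym_lemma": ["mathematician"]},
    {"id": "wn3", "sense_lemma": ["writer"], "sense_hyponym_lemma": []},
    {"id": "wn4", "sense_lemma": ["mathematician"], "sense_hyponym_lemma": []},
    {"id": "wiki1", "entity_type": ["scientist"]},
    {"id": "wiki2", "entity_type": ["software"]},
]


def graph(docs, root_ids, from_field, to_field, max_depth):
    """Follow values of to_field to documents whose from_field matches them."""
    by_id = {d["id"]: d for d in docs}
    visited = set(root_ids)
    frontier = set(root_ids)
    for _ in range(max_depth):
        edge_values = {v for i in frontier for v in by_id[i].get(to_field, [])}
        frontier = {
            d["id"] for d in docs if edge_values & set(d.get(from_field, []))
        } - visited
        visited |= frontier
    return visited


# Inner query: WordNet senses that are a type of person (2 hops of hyponyms).
roots = {d["id"] for d in DOCS if "person" in d.get("sense_lemma", [])}
person_senses = graph(DOCS, roots, "sense_lemma", "sense_hyponym_lemma", 2)

# Outer query: 1 hop from those senses to Wikipedia pages via entity_type.
result = graph(DOCS, person_senses, "entity_type", "sense_lemma", 1)

# Dropping the outer roots corresponds to returnRoot=false.
people_pages = result - person_senses
```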
Gather Nodes?
• If you’re interested in doing some distributed Graph traversal in Solr, there are a few options.
• You can use the Gather Nodes functionality in Streaming Aggregations. Not super fast, but it gets the job done!
Distributed Graph Traversal
• Do you think you need to scale up? We have an implementation based on Kafka & Solr Cloud that uses Kafka to distribute the frontier query.
What next?
• Edge weights, Relevancy, and Scoring
– Based on tf/idf or BM25
– Based on numerical field values (min/max/sum/avg weight application)?
– Skip high frequency edges?
• Min distance computation
• Driving directions?
• Better support for visualization libraries like D3.js!
• Distributed Traversal via Kafka frontier query broker
Additional Detail
Related Solr Tickets
https://issues.apache.org/jira/browse/SOLR-7543
https://issues.apache.org/jira/browse/SOLR-8632
https://issues.apache.org/jira/browse/SOLR-8176
Questions? Kevin Watters, KMW Technology [email protected]
Actions occur over time
• These events can’t easily be aggregated or flattened onto a record.
• Model this as a “person” record, with a set of “action” records.
• Each action record has the id of the “previous” action.
• Search for an action, graph traverse based on person id to another action, then finally to the person record.
OpenCV, Video Recognition
• Imagine indexing each frame of video from security cameras. Pass each frame of video through OpenCV for object recognition & face recognition.
• Each frame record stores its own frame number and the frame number of the previous frame.
• Search for object/face “A” detected, followed by object/face “B” detected, across all of your video streams.