

  • THE UNIVERSITY OF EDINBURGH

    Adding Regular Expressions to Graph Reachability and Pattern Queries

    Wenfei Fan 1,2, Jianzhong Li 2, Shuai Ma 1, Nan Tang 1, Yinghui Wu 1

    1 University of Edinburgh, 2 Harbin Institute of Technology

    {wenfei@inf.,shuai.ma@,ntang@inf.,y.wu-18@sms.}ed.ac.uk,[email protected]

    Introduction

    • Real-life graphs bear different edge types, indicating a variety of relationships.
    • Various graph querying semantics:
      – subgraph isomorphism: function-based, edge-edge matching.
      – graph simulation: relation-based, edge-edge matching.
    • Existing methods cannot meet the demands of emerging applications, e.g., social networks.
    • We revise graph simulation by adding regular expressions, to find more sensible information than its traditional counterparts.

    Querying Essembly Network

    Graph reachability and pattern queries

    Data Graph. A directed graph with node properties (e.g., labels, keywords, blogs, comments, ratings) and edge types (e.g., marriage, friendship, work, co-membership).

    Reachability queries. A reachability query (RQ) is in the form of an edge (u1, u2), where
    • u1 and u2 are two nodes,
    • each node carries a predicate as a conjunction of atomic formulas (e.g., job=‘doctor’),
    • an RQ bears a regular expression drawn from the subclass:

      F ::= c | c^{≤k} | c^+ | FF

      (c represents a colour, k is a positive integer, and c^+ denotes one or more occurrences of c),
    • the query result is a set of node pairs, each pair connected by a path satisfying the length and colour constraints of the RQ.
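The RQ definition above can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes that c^{≤k} means between 1 and k consecutive edges of colour c, that the data graph is given as a list of (source, target, colour) triples, and that predicates are plain Python callables. The names `reach` and `evaluate_rq` and the `(colour, bound)` encoding of F are hypothetical, introduced only for this example.

```python
from collections import defaultdict

def reach(adj, sources, colour, bound):
    """Nodes reachable from `sources` via 1..bound edges of `colour`.

    bound=None stands for c^+ (one or more edges, no upper limit).
    """
    result = set()
    frontier = set(sources)
    steps = 0
    while frontier and (bound is None or steps < bound):
        nxt = {v for u in frontier for v in adj[colour].get(u, ())}
        frontier = nxt - result      # expand only newly discovered nodes
        result |= frontier
        steps += 1
    return result

def evaluate_rq(edges, attrs, pred_u1, pred_u2, atoms):
    """Return the node pairs (v1, v2) matching the RQ.

    atoms encodes F as a list of (colour, bound) items, assumed encoding:
    c -> (c, 1), c^{<=k} -> (c, k), c^+ -> (c, None); FF -> list concatenation.
    """
    adj = defaultdict(lambda: defaultdict(set))
    for u, v, colour in edges:
        adj[colour][u].add(v)
    pairs = set()
    for s, s_attrs in attrs.items():
        if not pred_u1(s_attrs):
            continue
        frontier = {s}
        for colour, bound in atoms:  # chain the atoms left to right
            frontier = reach(adj, frontier, colour, bound)
        pairs |= {(s, t) for t in frontier if pred_u2(attrs[t])}
    return pairs

# Toy data: "fa" and "fn" are made-up colours for this example.
edges = [("a", "b", "fa"), ("b", "c", "fa"), ("c", "d", "fn")]
attrs = {
    "a": {"job": "biologist"},
    "b": {"job": "engineer"},
    "c": {"job": "engineer"},
    "d": {"job": "doctor"},
}
result = evaluate_rq(
    edges, attrs,
    lambda n: n["job"] == "biologist",
    lambda n: n["job"] == "doctor",
    [("fa", 2), ("fn", None)],       # fa^{<=2} fn^+
)
```

On this toy graph, `result` holds the single pair `("a", "d")`: node a reaches c through two `fa` edges, and c reaches d through one `fn` edge.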

    A reachability query and its result graph

    Graph pattern queries. A graph pattern query (PQ) is a directed graph where each edge is an RQ. The result of a PQ is the recursively defined union of the results for its edges. The result for an edge in a PQ is the result of the RQ that the edge represents.

    A graph pattern query and its result graph

    Fundamental problems

    Containment. Given two PQs Q1 and Q2, Q1 is contained in Q2 if, for every data graph, the result of each edge in Q1 is contained in the result of an edge in Q2.

    Equivalence. Two PQs Q1 and Q2 are equivalent iff they are contained in each other.

    Theorem. Given two PQs Q1 and Q2, it can be determined in cubic time whether Q1 is contained in, or equivalent to, Q2.

    Query Containment and Equivalence

    Query minimization. The minimization problem is to find, for a given PQ Q, another PQ Qm that is equivalent to Q and has a minimum size (the sum of nodes and edges).

    Theorem. Given any PQ Q, a minimum equivalent PQ Qm of Q can be computed in cubic time.

    Query Minimization

    Algorithms

    Reachability queries

    An RQ can be evaluated in quadratic time by capitalizing on a matrix of shortest distances.
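One way a distance matrix can answer a single-colour constraint is sketched below. How the authors build or use their matrix is not stated here, so this is only an illustration under stated assumptions: one matrix per colour, distances counted over non-empty paths (so the distance from a node to itself is the length of its shortest cycle, not 0), and c^{≤k} read as 1 to k edges. The names `colour_distances` and `match_atom` are hypothetical.

```python
from collections import deque

def colour_distances(nodes, edges, colour):
    """dist[u][v] = length of the shortest non-empty path u -> v
    that uses only edges of the given colour (missing key: no such path)."""
    succ = {u: [] for u in nodes}
    for u, v, c in edges:
        if c == colour:
            succ[u].append(v)
    dist = {u: {} for u in nodes}
    for s in nodes:
        # Breadth-first search; the first time a node is popped, d is minimal.
        queue = deque((v, 1) for v in succ[s])
        while queue:
            v, d = queue.popleft()
            if v in dist[s]:
                continue
            dist[s][v] = d
            queue.extend((w, d + 1) for w in succ[v])
    return dist

def match_atom(dist, sources, targets, bound=None):
    """Pairs satisfying c^{<=bound}, or c^+ when bound is None.

    With the matrix precomputed, this is one lookup per candidate pair,
    i.e. quadratic in the number of nodes.
    """
    return {
        (s, t)
        for s in sources
        for t in targets
        if t in dist[s] and (bound is None or dist[s][t] <= bound)
    }

# Toy data with made-up colours "fa" and "fn".
nodes = ["a", "b", "c", "d"]
edges = [("a", "b", "fa"), ("b", "c", "fa"), ("c", "a", "fa"), ("c", "d", "fn")]
dist_fa = colour_distances(nodes, edges, "fa")
within_two = match_atom(dist_fa, ["a"], nodes, bound=2)
one_or_more = match_atom(dist_fa, ["a"], nodes)
```

Here `within_two` contains `("a", "b")` and `("a", "c")`, while `one_or_more` additionally contains `("a", "a")`, because a lies on an `fa` cycle of length 3.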

    Graph pattern queries

    Given a PQ Q and a data graph G, Q can be evaluated in cubic time.

    Two cubic-time algorithms.

    • Join-based algorithm:
      – Initialize candidates for query nodes.
      – Join operation for query edges till fixpoint.
    • Split-based algorithm:
      – Initialize over-estimated partition-relation pair for query nodes.
      – Split blocks and filter candidates till fixpoint.
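The two steps of the join-based approach can be sketched as a simulation-style fixpoint. This is a simplified illustration, not the authors' cubic-time algorithm: it assumes the node pairs satisfying each query edge's RQ have already been computed, and it returns an empty result when some query node ends up with no candidates. The function name `join_match` and its input layout are hypothetical.

```python
def join_match(candidates, edge_pairs):
    """Refine candidate sets until no query edge removes anything.

    candidates: query node -> set of data nodes satisfying its predicate.
    edge_pairs: (u, u2) -> set of data-node pairs (v, v2) that satisfy
                the RQ on query edge (u, u2); assumed precomputed.
    Returns: query edge -> set of matching data-node pairs ({} if no match).
    """
    mat = {u: set(vs) for u, vs in candidates.items()}
    changed = True
    while changed:
        changed = False
        for (u, u2), pairs in edge_pairs.items():
            # Keep v only if it still has a partner among u2's candidates.
            keep = {v for v in mat[u] if any((v, w) in pairs for w in mat[u2])}
            if keep != mat[u]:
                mat[u] = keep
                changed = True
    if any(not vs for vs in mat.values()):
        return {}
    return {
        (u, u2): {(v, w) for (v, w) in pairs if v in mat[u] and w in mat[u2]}
        for (u, u2), pairs in edge_pairs.items()
    }

# Toy query A -> B -> C over data nodes 1..5 (all values are made up).
candidates = {"A": {1, 2}, "B": {3, 4}, "C": {5}}
edge_pairs = {
    ("A", "B"): {(1, 3), (2, 4)},
    ("B", "C"): {(3, 5)},
}
matches = join_match(candidates, edge_pairs)
```

In this example data node 4 is dropped from B because it has no partner in C, which in turn removes 2 from A, so `matches` keeps only `(1, 3)` for edge A -> B and `(3, 5)` for edge B -> C.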

    Experimental results

    Querying Youtube Network

    Querying Terrorist Network

    Summary:

    • PQs are able to identify far more sensible matches in emerging applications than the conventional approaches.
    • PQs can be efficiently evaluated, and scale well with large graphs and complex patterns.

    Conclusion

    • Extensions of reachability queries (RQs) and graph pattern queries (PQs) by incorporating a subclass of regular expressions to capture edge relationships.
    • Fundamental problems (containment, equivalence, minimization) for these queries are all in low PTIME.
    • Two cubic-time algorithms for evaluating PQs.

    Future work

    • Extend RQs and PQs by supporting general regular expressions.
    • Identify application domains in which simulation-based PQs are most effective.
    • Find incremental evaluation algorithms that are guaranteed to minimize unnecessary re-computation.