ROCHESTER INSTITUTE OF TECHNOLOGY

SQL Tool Kit

Submitted by: Chandni Pakalapati

Advisor: Dr. Carlos Rivero

Table of Contents

1. Abstract

2. Introduction

3. System Overview

4. Implementation

4.1. Schema-free SQL Parser

4.2. Relation Tree Mapper

4.3. Similarity evaluation

4.4. Network Builder

4.5. SQL Composer

4.6. Testing

5. Conclusion

6. Future Work

7. References

1. Abstract

Structured Query Language (SQL) is the most common way to fetch information from relational databases. Database schemas usually have complex structures, with many constraints and foreign key references among the tables and many table and attribute names to remember. To write proper queries, the user must know SQL syntax and have a complete understanding of the relational schema. The paper [1] proposed a new schema-free query language to help naïve users write SQL queries. This paper studies the proposed approach, tests the existing implementation and identifies its limitations.

2. Introduction

Schema-free SQL is a new kind of query language that offers the user maximum liberty to write queries using whatever partial knowledge of the schema they have. The system tolerates unknown or incorrectly specified relation and attribute names, and it does not require information about which tables to use or how they need to be joined. The paper [1] discusses techniques to implement the language and the steps that convert a partial query into a complete query.

The motive behind this language is to reduce the user's burden of understanding the database schema, and to design a system that does not have to guess every element. Since SQL has a broad range of users, it benefits them to spend more time designing the query logic rather than understanding the schema in detail. The system reduces the user's burden by offering two flexibilities:

1. Schema Relaxation: Users need not write the correct and complete names of schema elements, including relations and attributes.

2. Join Path Relaxation: Users need not specify the join paths, that is, how intermediate tables should be joined.

The input is evaluated at the database level only after each vague schema element is mapped to similar schema elements in the database and the join path is completed using optimal join networks that connect all the specified schema elements. Uncertainty in the input is expressed by placing '?' after a schema element. For example, que? tells the system that the user is guessing the name of the entity. A lone '?' indicates an element whose name the user does not know. Also, owing to the second relaxation, the user may omit relations from the FROM clause and join paths from the WHERE clause. Considering the movie database shown in Fig1, a user with partial knowledge can write a schema-free query to fetch the number of male actors who have acted in movies directed by James Cameron and produced by 20th Century Fox between the years 1995 and 2005:

Eg1: SELECT count(actor?.name?)

and year? > 1995 and year? < 2005;

The system resolves the schema relaxation by mapping vague schema elements to similar schema elements, and resolves the join path relaxation by completing the unspecified join path with a strong network that connects all tables, including the mapped schema elements.
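
The '?' convention described above can be illustrated with a minimal sketch. It is not taken from the existing implementation (which is in Java); the names SchemaElement, classify and classify_column are hypothetical and only show how an identifier could be labelled as exact, guessed or unknown before mapping.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SchemaElement:
    name: str  # the text the user typed, without the trailing '?'
    kind: str  # "exact", "guess" or "unknown"


def classify(token: str) -> SchemaElement:
    """Label one identifier of a schema-free query (hypothetical helper)."""
    if token == "?":
        # A lone '?' stands for an element whose name the user does not know.
        return SchemaElement("", "unknown")
    if token.endswith("?"):
        # A trailing '?' marks a name the user is guessing.
        return SchemaElement(token[:-1], "guess")
    return SchemaElement(token, "exact")


def classify_column(reference: str) -> List[SchemaElement]:
    """Split a dotted reference such as 'actor?.name?' and label each part."""
    return [classify(part) for part in reference.split(".")]


print(classify_column("actor?.name?"))
```

Only elements labelled "guess" or "unknown" would need to go through similarity mapping; elements labelled "exact" could be looked up directly.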

This paper explains the architecture of the system and its current implementation. To find its limitations, the system was tested on several cases, and some points of failure have been identified.

Fig1 Database Schema

3. System Overview

The main components in the system (refer to Fig2) include:

Schema-free SQL Parser – The parser separates the schema elements and join paths from the rest of the input query. The schema elements are stored in the form of relation trees.

Relation Tree Mapper – A relation tree has the table as its root node and the attributes, their values and condition constraints as its children. The trees are merged at different levels and are mapped to relations in the database based on their similarity.

Network Builder – The tables are connected based on the specified join paths and foreign key references, and are stored as view graphs. The tables are nodes, and an edge exists between two tables if either kind of connection exists between them. Join networks are generated from these views and ranked so that the top-k paths can be considered. These join networks are more intuitively correct than ones formed from scratch.

SQL Composer – The composer replaces the user's guesses with the exact schema elements, generates the FROM and WHERE clauses from the generated join paths, and returns k different SQL queries, one for each join network formed.

Fig2 System Architecture

4. Implementation

In the user interface, the user provides the input query, the number of output queries and the c value (a value specific to the database).

Fig3 User Interface

4.1. Schema-free SQL Parser

The input query is parsed to get the schema elements from the SELECT, FROM and WHERE clauses and to identify the table names, attributes and values. ANTLR4 [2] is a powerful parser tool that can build and walk parse trees. The ANTLR4 library is used because the existing implementation is in Java. The input to the parser generator is a grammar; the SQL grammar is obtained from GitHub [2]. The parser has rules that match all strings in the input, and the Listener is notified when any of the parser rules is triggered. The code needed to extract the elements is written in the Listener class. In this way, the grammar can parse schema-free SQL and extract elements such as table names, attribute names and some clauses for further processing.

When Example 1 is given as the input query, the parser generates the following tree:

(select_statement (query_expression (query_specification SELECT (select_list (select_list_elem

(expression (full_column_name (table_name (id (simple_id actor?))) . (id (simple_id name?))))))

WHERE (search_condition (search_condition_and (search_condition_not (predicate (expression

(full_column_name (id (simple_id year?)))) (comparison_operator >) (expression (constant

1995)))) and (search_condition_not (predicate (expression (full_column_name (id (simple_id

year?)))) (comparison_operator <) (expression (constant 2005)))))))) ;)

As the Listener walks the tree, it invokes the method that corresponds to each parse rule. We use these methods to capture data when the required parse rules are triggered. For instance, when the walker comes across the rule full_column_name, the generated methods enterFull_column_name and exitFull_column_name are triggered. The logic to extract the schema elements is therefore written in the appropriate methods.
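
The extraction step can be illustrated without the ANTLR runtime. The sketch below is not the project's Listener; it is a plain Python stand-in that reads the LISP-style tree printed above and collects the text under every full_column_name node, which is roughly what the listener callbacks are used for. The function names are hypothetical, and the tiny reader assumes that no terminal token in the tree is itself a parenthesis.

```python
TREE = (
    "(select_statement (query_expression (query_specification SELECT "
    "(select_list (select_list_elem (expression (full_column_name "
    "(table_name (id (simple_id actor?))) . (id (simple_id name?)))))) "
    "WHERE (search_condition (search_condition_and (search_condition_not "
    "(predicate (expression (full_column_name (id (simple_id year?)))) "
    "(comparison_operator >) (expression (constant 1995)))) and "
    "(search_condition_not (predicate (expression (full_column_name "
    "(id (simple_id year?)))) (comparison_operator <) "
    "(expression (constant 2005)))))))) ;)"
)


def parse_tree(text):
    """Turn the printed tree into nested lists: [rule_name, child, ...]."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token != "(":
            return token  # terminal such as SELECT, year? or 1995
        node = [tokens[pos]]  # first token after '(' is the rule name
        pos += 1
        while tokens[pos] != ")":
            node.append(read())
        pos += 1  # skip the closing ')'
        return node

    return read()


def leaves(node):
    """Return the terminal tokens below a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node[1:] for leaf in leaves(child)]


def collect_columns(node, found=None):
    """Depth-first walk that records the text of each full_column_name."""
    if found is None:
        found = []
    if isinstance(node, str):
        return found
    if node[0] == "full_column_name":
        found.append("".join(leaves(node)))
    for child in node[1:]:
        collect_columns(child, found)
    return found


print(collect_columns(parse_tree(TREE)))
```

In the Java implementation the same effect comes from overriding the enter and exit methods of the generated listener, so no manual tree reader is needed there.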

Improvement made: The existing implementation did not parse the FROM clause. To store the table names given in the FROM clause, the method enterTable_name in the tsqlBaseListener class was modified.

4.2. Relation Tree Mapper

Schema-relevant expressions occur as 1) relation names in the FROM clause, 2) attribute names (sometimes along with table names) in other clauses, and 3) value constraint conditions specified in the WHERE clause. This information is represented as expression triples with three entries: relation name, attribute name and condition constraint. Missing information is stored as *. The expression triples are stored as relation trees, with the relation name as the root, the attributes fetched from the various clauses as its children, and the values taken from the WHERE clause as the children of the attributes.

In the given example, the expression triple formed from actor?.name? is (actor, name, *). Actor, being the table name, is the root; name, being the attribute, is its child; and since the value of the attribute is not specified, its child is stored as *. The other triples formed from the query are (*, year, >1995) and (*, year, <2005). These are stored as relation trees:

Fig4 Relation Trees

Since multiple trees may be formed with the same table name or the same attributes, we merge them so that one relation tree includes all the information pertaining to one relation. Trees are merged at the relation level if they have identical relation names. Attribute-level merging happens when the trees have the same attribute names and their relation names are either the same or absent.

Fig5 Merging at attribute level

The relation trees obtained for the above example are:

1. Relation tree

actor = {name=[ name [] [] ]}

2. Relation tree

* = {year=[ year [>, <] [1995, 2005] ]}
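
A minimal sketch of the triple-to-tree step, under stated assumptions: triples are (relation, attribute, condition) tuples, a missing relation is written "*", and a missing condition is None. The function build_relation_trees is hypothetical and only merges trees that share the same relation key (including "*"); merging an unnamed tree into a named one is not attempted here.

```python
from collections import defaultdict


def build_relation_trees(triples):
    """Group expression triples into relation trees.

    Returns {relation: {attribute: [(operator, value), ...]}}.
    Triples with the same relation name are merged at relation level, and
    triples that also share the attribute name are merged at attribute level.
    """
    trees = defaultdict(lambda: defaultdict(list))
    for relation, attribute, condition in triples:
        conditions = trees[relation][attribute]  # creates the nodes if needed
        if condition is not None:
            conditions.append(condition)
    # Convert to plain dicts so the result prints and compares cleanly.
    return {relation: dict(attrs) for relation, attrs in trees.items()}


triples = [
    ("actor", "name", None),         # from actor?.name?
    ("*", "year", (">", "1995")),    # from year? > 1995
    ("*", "year", ("<", "2005")),    # from year? < 2005
]
print(build_relation_trees(triples))
```

For the example triples this yields one tree rooted at actor with the child name, and one unnamed tree whose year attribute carries both conditions, matching the two trees listed above.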

4.3. Similarity evaluation

The relation trees are mapped to the relations in the database based on their similarity, which is computed using an evaluation function. Each relation tree can be mapped to a set of relations. For a relation tree rt with name n(rt) and attributes (at_1, at_2, …, at_n), and a relation R in the database, rt is similar to R if R contains similar information at the root level (n(rt)) and at the attribute level (each at_i). The function is given by:

$$\mathrm{Sim}(rt, R) = \mathrm{Sim}(n(rt), R) \cdot \prod_{i=1}^{n} \mathrm{Sim}(at_i, R)$$

The similarity at the root level is the maximum of the similarities between the relation tree's name and the names of the relation and of its neighbors. It is given by Sim(n(rt), R) = MAX(Sim(n(rt), n(R)), {Sim'(n(rt), n(Ri))}), where R is the relation and the Ri are the relations that are neighbors of R.

The similarity at the attribute level is the maximum of the similarities between the attribute tree and the attributes of the relation. It is given by Sim(at, R) = MAX({Sim(at, Ai) | 1 ≤ i ≤ m}), where R is a relation with attributes (A1, A2, …, Am).
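
The three formulas above can be combined in a short sketch. Everything here is illustrative: toy_sim is a stand-in word similarity, and the neighbor similarity Sim' is passed in as a parameter because its definition is not given in this section. The 0.8 damping factor used in the example call is purely an assumption.

```python
from math import prod


def toy_sim(a, b):
    """Stand-in word similarity (assumption): 1 for equal, 0.5 for containment."""
    a, b = a.lower(), b.lower()
    if a == b:
        return 1.0
    return 0.5 if a in b or b in a else 0.0


def root_similarity(tree_name, relation, neighbors, sim, neighbor_sim):
    """Sim(n(rt), R): best of the relation itself and its neighbors."""
    scores = [sim(tree_name, relation)]
    scores += [neighbor_sim(tree_name, neighbor) for neighbor in neighbors]
    return max(scores)


def attribute_similarity(attribute, relation_attributes, sim):
    """Sim(at, R): best match among the attributes A1..Am of R."""
    return max(sim(attribute, candidate) for candidate in relation_attributes)


def tree_similarity(tree_name, tree_attributes, relation, relation_attributes,
                    neighbors, sim, neighbor_sim):
    """Sim(rt, R): root-level score times the product of attribute scores."""
    root = root_similarity(tree_name, relation, neighbors, sim, neighbor_sim)
    return root * prod(
        attribute_similarity(attribute, relation_attributes, sim)
        for attribute in tree_attributes
    )


score = tree_similarity(
    "actor", ["name"],
    "Person", ["person_id", "name", "gender"],
    neighbors=["Actor", "Director"],
    sim=toy_sim,
    neighbor_sim=lambda a, b: 0.8 * toy_sim(a, b),  # assumed form of Sim'
)
print(score)
```

With these assumed inputs, the tree named actor scores against Person mainly through the neighboring Actor table, which is the effect the root-level maximum is meant to capture.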

Improvement made: The similarity Sim(a, b) between two words a and b is computed using the Wu-Palmer (WUP) algorithm [3]. WUPSimilarity.java has the code for computing the score.

The Wu-Palmer metric calculates similarity using the depths of the two words in the WordNet taxonomy, along with the depth of their least common subsumer (lcs) [3]. The formula for the score is 2*depth(lcs) / (depth(w1) + depth(w2)) [3]. It measures the semantic similarity or relatedness between words. The result is always a value between 0 and 1: 0 for totally different words and 1 for the same word.

Since this algorithm checks only the closeness of word meanings, word pairs such as name and firstname yield a score of 0. To cover such cases, we use the longest common subsequence algorithm along with WUP. The longest common subsequence compares the characters of the two words and gives a score based on the length of the matched subsequence and the lengths of the words: Score = length(common subsequence) / max(length(word1), length(word2)). The final value is max(score(WUP), score(longest common subsequence)).
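
A minimal sketch of the combined score, assuming the WUP score is supplied by a separate function (here a parameter named wup, since computing it requires a WordNet library). The lower-casing of both words is an assumption of this sketch, not something stated for the implementation.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    previous = [0] * (len(b) + 1)
    for char_a in a:
        current = [0]
        for j, char_b in enumerate(b, start=1):
            if char_a == char_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def lcs_score(word1, word2):
    """length(common subsequence) / max(length(word1), length(word2))."""
    if not word1 or not word2:
        return 0.0
    common = lcs_length(word1.lower(), word2.lower())
    return common / max(len(word1), len(word2))


def combined_similarity(word1, word2, wup):
    """max(score(WUP), score(longest common subsequence))."""
    return max(wup(word1, word2), lcs_score(word1, word2))


# Stub WUP that returns 0, as happens for pairs such as name and firstname.
print(combined_similarity("name", "firstname", wup=lambda a, b: 0.0))
```

For name and firstname the common subsequence is the whole word name, so the character-based score is 4/9 even though the semantic score is 0.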

The final similarity is the product of the similarity at the attribute level and the similarity at the root level. The existing implementation summed these values; it has been changed to multiply them, which ensures that the final outcome is always a value between 0 and 1.

The results of the mapped relation trees for the above example are:

Mapping: actor=Person, year=Movie: release year

Improvement needed:

The relation trees should be mapped to the set of relations Ri for which Sim(rt, Ri) is greater than σ * Max{Sim(rt, Rj) | 1 ≤ j ≤ n}. This ensures that all likely relations are considered. Instead of checking this condition, the implementation takes the relations for which the table similarity is less than 0.4 and the neighbor similarity is greater than 0.7.
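
The intended selection rule can be sketched as follows. The function select_mappings is hypothetical, the scores are made-up illustration values, and the value of σ is left as a parameter because it is not fixed in this section.

```python
def select_mappings(scores, sigma):
    """Keep every relation whose score exceeds sigma times the best score.

    scores maps relation name -> Sim(rt, R); sigma is expected in (0, 1).
    The result is ordered from most to least similar.
    """
    if not scores:
        return []
    cutoff = sigma * max(scores.values())
    kept = [relation for relation, score in scores.items() if score > cutoff]
    return sorted(kept, key=lambda relation: scores[relation], reverse=True)


# Illustrative scores only; they are not results from the implementation.
print(select_mappings({"Person": 0.8, "Actor": 0.7, "Movie": 0.2}, sigma=0.8))
```

Because the cutoff is relative to the best score, a tree with one strong candidate maps to few relations, while a tree with several close candidates keeps all of them.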

The paper does not clearly state how the similarity is computed for relation trees that have no name. It says that the similarity should be computed between the attribute tree and the relation names. However, it is not clear whether the similarity at the root level is the maximum of the values calculated between all attribute trees and the relation names, or the maximum computed between each attribute and the other relations. In the implementation, the computation is done in the method SimilarityEvaluation.calSimgAttributeLevel. This method does not update the kdef value, and the attribute tree is not compared with all relations, so the code needs to be changed.

Also, consider two attributes that have the same name in different tables but are mentioned in the schema-free query without their table names. Because relation trees are merged at the attribute level, the two trees are merged into a single tree, which is mapped to a single relation instead of two relations.

4.4. Network Builder

After mapping the relation trees, the next step is to create a join path that contains the mapped relations. Users may partially specify the join paths in their queries. We create graphs with tables as nodes and connect them with weighted edges. This is a representation of the schema-free query.

The existing database is represented as a graph with the relations as nodes and an edge between two tables if a foreign key reference exists between them. Any join paths mentioned in the input are also included in the graph, creating a connection if it does not already exist. These connections form the basic views. As mentioned in the previous section, each relation tree is mapped to several relations, and this information also needs to be incorporated into the views. So an Extended View Graph is formed from the views, in which the nodes of the existing views are replaced with the corresponding mapped relation trees. Relations that have not been mapped to any relation tree are also included.
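
A minimal sketch of the basic view graph as an undirected adjacency map. The foreign key pairs below are assumptions inferred from the join conditions that appear in the test queries, not a transcription of Fig1, and build_view_graph is a hypothetical name.

```python
def build_view_graph(foreign_keys, user_joins=()):
    """Build an undirected graph: relation -> set of connected relations.

    foreign_keys and user_joins are iterables of (relation, relation) pairs.
    A user-specified join adds an edge only if it is not already present,
    which the set-based adjacency handles automatically.
    """
    graph = {}
    for left, right in list(foreign_keys) + list(user_joins):
        graph.setdefault(left, set()).add(right)
        graph.setdefault(right, set()).add(left)
    return graph


# Assumed foreign key references for the movie schema (illustration only).
foreign_keys = [
    ("Actor", "Person"),
    ("Actor", "Movie"),
    ("Director", "Person"),
    ("Director", "Movie"),
]
graph = build_view_graph(foreign_keys, user_joins=[("Actor", "Director")])
print(sorted(graph["Actor"]))
```

Edge weights are left out here; they are attached once the nodes have been replaced by mapped relation trees.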

For example, suppose a view exists as Person – Actor – Movie – Director – Person, two relation trees are mapped to Person, and two other trees are mapped to movie producer and company respectively. To generate an extended view graph, the first Person can be replaced with Person1 and the last Person with Person2, or vice versa. This creates two graphs:

Fig6 Extended View Graph

Person1 – Actor – Movie – Director – Person2 or Person2 – Actor – Movie – Director – Person1.
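
The substitution in this example can be enumerated mechanically. The sketch below is an illustration rather than the project's algorithm: extended_views is a hypothetical function that assigns a distinct mapped relation tree to each occurrence of a relation in a view path and returns every resulting path.

```python
from itertools import permutations, product


def extended_views(view, mapped_trees):
    """Replace relations in a view path by their mapped relation trees.

    view is a list of relation names; mapped_trees maps a relation name to
    the relation trees mapped to it. Each occurrence of a relation receives
    a different tree. If a relation has fewer trees than occurrences, no
    path is produced for it in this sketch.
    """
    positions = {}
    for index, relation in enumerate(view):
        if relation in mapped_trees:
            positions.setdefault(relation, []).append(index)

    per_relation = []
    for relation, slots in positions.items():
        choices = permutations(mapped_trees[relation], len(slots))
        per_relation.append([(slots, choice) for choice in choices])

    paths = []
    for combination in product(*per_relation):
        path = list(view)
        for slots, choice in combination:
            for slot, tree in zip(slots, choice):
                path[slot] = tree
        paths.append(path)
    return paths


view = ["Person", "Actor", "Movie", "Director", "Person"]
for path in extended_views(view, {"Person": ["Person1", "Person2"]}):
    print(" - ".join(path))
```

For the view in the example this produces exactly the two paths listed above, one for each assignment of Person1 and Person2 to the two ends.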

Every edge is assigned a weight based on the similarity function between the relation tree of one node and the relation of the connected node. The formula used for computing the value is:

$$\mathrm{Weight}(e) = 1 - (1 - c) \cdot \bigl(1 - \mathrm{MAX}(\mathrm{Sim}(n(relTree_1), n(R_2)),\ \mathrm{Sim}(n(relTree_2), n(R_1)))\bigr)$$

We have assumed that the value of c is 0.7, and relTree1 is the relation tree containing the information about relation1 (R1) of the database.

All the nodes in the view graph are replaced with the mapped relation trees, so that the resulting join network contains the relations used in the schema-free query. As the connections are made, the weight of the resulting network is calculated by multiplying the edge weights and taking the square root of the product. A network containing all mapped relation trees is called a total join network.
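
The edge weight formula and the network weight described above can be worked through in a short sketch. The function names and the sample similarity values are hypothetical; c defaults to the 0.7 assumed in this section, and the square root follows the description given here.

```python
from math import prod, sqrt


def edge_weight(sim_tree1_rel2, sim_tree2_rel1, c=0.7):
    """Weight(e) = 1 - (1 - c) * (1 - max of the two name similarities)."""
    return 1 - (1 - c) * (1 - max(sim_tree1_rel2, sim_tree2_rel1))


def network_weight(edge_weights):
    """Square root of the product of the edge weights of a join network."""
    return sqrt(prod(edge_weights))


def top_k(networks, k):
    """Return the names of the k networks with the highest weight.

    networks maps a network name to the list of its edge weights.
    """
    ranked = sorted(
        networks.items(),
        key=lambda item: network_weight(item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


print(edge_weight(0.5, 0.2))  # similarities are illustration values
print(top_k({"a": [0.7, 0.7], "b": [1.0, 0.85], "c": [0.85, 0.85]}, k=2))
```

Since the similarity lies between 0 and 1, an edge weight always lies between c and 1, so c acts as the weight of an edge whose endpoints have no name similarity at all.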

Improvement made: The WUP similarity function [3] is used to calculate the weights of the edges. A parameter k was introduced to return the top-k queries, instead of the five queries that were hardcoded.

Improvements needed: The partial join paths mentioned in the input query are not considered while forming view graphs. While expanding the join path to generate the k minimal total join networks, a relation may be joined with itself or with the wrong tables, leading to inappropriate joins in the resulting query. When a relation tree is not mapped to any relation, a null pointer exception arises when trying to create a connection to those relations.

4.5. SQL Composer

After the top-k join networks are generated, the relation trees are replaced with the actual relations. The nodes in the join path are included in the FROM clause, and the edges in the join path are translated into join conditions on foreign key references to form the WHERE clause.
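
A minimal sketch of this composition step, assuming a join network is given as a list of (alias, relation) nodes and a list of column pairs for the edges. The function compose_sql and its argument layout are hypothetical and are not the composer used in the implementation.

```python
def compose_sql(select_list, nodes, edges, conditions=()):
    """Assemble one SQL query from a join network.

    select_list: resolved column references for the SELECT clause.
    nodes: (alias, relation) pairs; they become the FROM clause.
    edges: (left_column, right_column) pairs; they become join conditions.
    conditions: value constraints carried over from the user's WHERE clause.
    """
    from_items = [
        relation if alias == relation else f"{relation} as {alias}"
        for alias, relation in nodes
    ]
    predicates = list(conditions)
    predicates += [f"{left} = {right}" for left, right in edges]

    sql = f"SELECT {', '.join(select_list)} FROM {', '.join(from_items)}"
    if predicates:
        sql += " WHERE " + " and ".join(predicates)
    return sql + ";"


query = compose_sql(
    select_list=["Person1.name"],
    nodes=[("Person1", "Person"), ("Actor", "Actor")],
    edges=[("Actor.person_id", "Person1.person_id")],
)
print(query)
```

Calling the function once per ranked join network would give the k output queries.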

4.6. Testing

The existing implementation was tested in order to identify its shortcomings. We test the system's capacity to handle various errors that users frequently make. Two kinds of errors are possible in queries: syntactic and semantic. A syntactic error involves invalid characters, whereas a query with a semantic error is legal but does not produce the intended results. Some queries were run and their output was analyzed to understand the shortcomings. Since we generate k queries in the result, some of them are repetitive, and only the unique answers are reported here:

Query1 – To find the names of directors that Jack has co-operated with.

SELECT (Person_2.name) From Person as Person_1, Actor, Movie, Director, Person as Person_2 WHERE Person_1.name = 'Jack' and Person_1.person_id = Actor.person_id and Actor.movie_id = Movie.movie_id;

Output:

SELECT (Person2.name) From Person as Person_1, Actor, Movie, Director, Person as Person_2 from Movie as Movie1, Actor as Actor1, Person as Person1, Person as Person2, Director WHERE Person1.name = 'Jack' and Person1.person_id = Actor1.person_id and Actor1.movie_id = Movie1.movie_id and Actor1.person_id=Person2.person_id and Person2.person_id=Director.person_id and Actor1.person_id=Person1.person_id and Actor1.movie_id=Movie1.movie_id;

SELECT (Person2.name) From Person as Person_1, Actor, Movie, Director, Person as Person_2 from Movie as Movie1, Actor as Actor1, Person as Person1, Person as Person2 WHERE Person1.name = 'Jack' and Person1.person_id = Actor1.person_id and Actor1.movie_id = Movie1.movie_id and Actor1.person_id=Person1.person_id and Actor1.movie_id=Movie1.movie_id and Actor1.person_id=Person2.person_id;

The repetitive joins indicate that the join path generation ignores the join conditions already present in the query and creates duplicates.

Query2 - Return all the directors that “Jack” has cooperated with.

SELECT director?.name? Where actor?.name? = 'Jack';

Actual Query:

SELECT (Person_2.name)

From Person as Person_1, Actor, Movie, Director, Person as Person_2

WHERE Person_1.name = 'Jack'

and Person_1.person_id = Actor.person_id

and Actor.movie_id = Movie.movie_id

and Movie.movie_id = Director.movie_id

and Director.person_id = Person_2.person_id;

Output:

SELECT Person2.name from Person as Person1, Person as Person2, Director, Actor, Where Person1.name = 'Jack' and Director.person_id=Person1.person_id and Person1.person_id=Actor.person_id and Actor.person_id=Person2.person_id;

SELECT Person2.name from Person as Person1, Person as Person2, Actor, Director, Where Person1.name = 'Jack' and Actor.person_id=Person1.person_id and Person1.person_id=Director.person_id and Director.person_id=Person2.person_id;

SELECT Person2.name from Person as Person1, Person as Person2, Actor, Where Person1.name = 'Jack' and Actor.person_id=Person1.person_id and Actor.person_id=Person2.person_id;

SELECT Person2.name from Person as Person1, Person as Person2, Director, Where Person1.name = 'Jack' and Director.person_id=Person1.person_id and Director.person_id=Person2.person_id;

The important joins between Actor and Movie (Actor.movie_id = Movie.movie_id) and between Movie and Director (Movie.movie_id = Director.movie_id) are missing. Also, the existing joins between Director, Actor and Person are wrong.

Query 3 – To get the actor names.

select actor?.name?

from actor;

Actual Query:

select Person1.name from Person as Person1, Actor where Actor.person_id=Person1.person_id;

Output:

select Person1.name from Person as Person1, Director where Director.person_id=Person1.person_id;

select Person1.name from Person as Person1, Actor, where Actor.person_id=Person1.person_id;

Only 2 queries were generated, and one of them is correct. The join condition in the first query is wrong.

Query4 – To get the ids of actors.

select id?

From actor;

Actual Query:

select Actor.person_id from Person as Person1, Actor where Actor.person_id=Person1.person_id;

Output:

select Movie1.movie_id

from Movie as Movie1, Movie_Producer where Movie_Producer.movie_id=Movie1.movie_id;


from Movie as Movie1, Director where Director.movie_id=Movie1.movie_id;


from Movie as Movie1, Actor where Actor.movie_id=Movie1.movie_id;

The id is wrongly mapped to movie_id. This indicates that the schema-free system does not consider the user's intent. Also, the wrong joins indicate that join path generation should be improved.

Query5 – To select everything from the table actor.

Select ?

From actor;

A null pointer exception is thrown because the actor table is not mapped: the similarity mapping failed.

Query6 – To name the directors with whom actor jack has co-operated.

SELECT director?.name?

Where actor?.name? = 'jack' and director?.movie? = actor?.movie?;

Actual query:

Select person2.name From Person as Person1, Person as Person2, Director, Actor Where person2.person_id = director.person_id And person1.name = 'Jack' And person1.person_id = actor.person_id And actor.movie_id = director.movie_id

Output:

SELECT Person2.name from Person as Person1, Person as Person2, Director, Actor, Where Person1.name = 'jack' and Person2.name = Person1.name and Director.person_id=Person1.person_id and Person1.person_id=Actor.person_id and Actor.person_id=Person2.person_id;

SELECT Person2.name from Person as Person1, Person as Person2, Actor, Director, Where Person1.name = 'jack' and Person2.name = Person1.name and Actor.person_id=Person1.person_id and Person1.person_id=Director.person_id and Actor.person_id=Person2.person_id;

SELECT Person2.name from Person as Person1, Person as Person2, Actor, Where Person1.name = 'jack' and Person2.name = Person1.name and Actor.person_id=Person1.person_id and Actor.person_id=Person2.person_id;

SELECT Person2.name from Person as Person1, Person as Person2, Director, Where Person1.name = 'jack' and Person2.name = Person1.name and Director.person_id=Person1.person_id and Director.person_id=Person2.person_id;

The join specified in the input query is missing from the output, and wrong joins are formed.

5. Conclusion

The system can build a complete SQL query from the schema-free input using all the steps described above. The results contain at least one correct query. The input query is successfully parsed, vague schema names are correctly mapped thanks to accurate similarity evaluation, and the join networks developed generate an equivalent SQL query. This ensures that the user need not have a complete idea of the database schema and its names, thereby reducing the burden of understanding it.

6. Future Work

This project can be extended to make it interactive, giving suggestions to the user. The existing algorithm should be improved to support nested queries, to generate better join networks and to increase the accuracy of the resulting queries. The system should also be tested on many more complex databases.

References

[1] Fei Li, Tianyin Pan, and Hosagrahar V. Jagadish. Schema-free SQL. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, pages 1051–1062. ACM, 2014.

[2] ANTLR Grammar, 2016. https://github.com/antlr/grammars-v4.

[3] Wu-Palmer, 2016. http://search.cpan.org/dist/WordNet-Similarity/lib/WordNet/Similarity/wup.pm
