ROCHESTER INSTITUTE OF TECHNOLOGY
SQL Tool Kit
Submitted by: Chandni Pakalapati
Advisor: Dr. Carlos Rivero
Table of Contents
1. Abstract
2. Introduction
3. System Overview
4. Implementation
4.1. Schema-free SQL Parser
4.2. Relation Tree Mapper
4.3. Similarity evaluation
4.4. Network Builder
4.5. SQL Composer
4.6. Testing
5. Conclusion
6. Future Work
7. References
1. Abstract
Structured Query Language (SQL) is the most common way to fetch information from relational
databases. Database schemas usually have complex structures, with many constraints and foreign
key references among the tables and many table and attribute names to remember. To write
correct queries, the user must know SQL syntax and have a complete understanding of the
relational schema. The paper [1] proposed a new Schema-free query language to help naïve users
write SQL queries. This report studies the proposed approach, tests the existing implementation
and identifies its limitations.
2. Introduction
Schema-free SQL is a new kind of query language that gives users maximum freedom to write
queries using whatever partial knowledge of the schema they have. The system tolerates unknown
or incorrectly specified relation names and attributes, and does not require the user to state
which tables are to be used or how they need to be joined. The paper [1] discusses techniques for
implementing the language and the steps that convert a partial query into a complete query.
The motivation behind this language is to reduce the user's burden of understanding the database
schema, while designing a system that does not have to guess every element. Since SQL has a
broad range of users, it is beneficial if they can spend more time designing the query logic
rather than understanding the schema in detail. The system reduces the user's burden by offering
two kinds of flexibility:
1. Schema Relaxation: Users need not write correct and complete names for the schema elements,
including relations and attributes.
2. Join Path Relaxation: Users need not specify the join paths, that is, how intermediate tables
should be joined.
The input is evaluated against the database only after each vague schema element has been mapped
to similar schema elements in the database and the join path has been completed using optimal
join networks that connect all the specified schema elements. Uncertainty in the input is
expressed by placing ‘?’ after the schema element. For example, que? tells the system that the
user is guessing the name of the entity. A lone ‘?’ denotes an element whose name the user does
not know. Also, owing to the second relaxation, the user may omit relations from the FROM clause
and join paths from the WHERE clause. Considering the movie database shown in Fig1, a user with
partial knowledge can write a schema-free query to fetch the number of male actors who have acted
in movies directed by James Cameron and produced by 20th Century Fox between the years 1995 and
2005:
Eg1: SELECT count(actor?.name?)
and year? > 1995 and year? < 2005;
The system resolves the schema relaxation by mapping vague schema elements to similar schema
elements, and completes the unspecified join path using a join network that connects all tables
containing the mapped schema elements.
This report explains the architecture of the system and its current implementation. To find its
limitations, the system was tested on several cases, and some points of failure were identified.
Fig1 Database Schema
3. System Overview
The main components of the system (see Fig2) are:
Schema-free SQL Parser – The parser separates the schema elements and join paths from the rest
of the input query. The schema elements are stored in the form of relation trees.
Relation Tree Mapper – Each relation tree has the table as its root and the attributes, their
values and condition constraints as descendants. The trees are merged at different levels and are
mapped to relations in the database based on their similarity.
Network Builder – The tables are connected based on the specified join paths and foreign key
references, and the result is stored as view graphs. Tables are nodes, and an edge exists between
two tables if any of these connections exists between them. Join networks are generated from
these views and ranked so that only the top-k paths are considered. These join networks are more
intuitively correct than ones formed from scratch.
SQL Composer – The composer replaces the user’s guesses with the exact schema elements, generates
the From and Where clauses from the generated join paths, and returns k different SQL queries,
one for each join network formed.
Fig2 System Architecture
4. Implementation
In the user interface, the user provides the input query, the number of output queries and the
c value (a value specific to the database).
Fig3 User Interface
4.1. Schema-free SQL Parser
The input query is parsed to extract the schema elements from the Select, From and Where clauses
and to identify the table names, attributes and values. ANTLR4 [2] is a powerful parser generator
whose generated parsers can build and walk parse trees. The ANTLR4 library is used because the
existing implementation is in Java. The input to ANTLR4 is a grammar; the SQL grammar is obtained
from GitHub [2]. The parser rules match the strings in the input, and the Listener is notified
whenever a parser rule is triggered. The code that extracts the elements is written in the
Listener class. In this way the grammar can parse Schema-free SQL and extract elements such as
tables, attribute names and some clauses for further processing.
When Eg1 is given as the input query, the parser generates the following tree:
(select_statement (query_expression (query_specification SELECT (select_list (select_list_elem
(expression (full_column_name (table_name (id (simple_id actor?))) . (id (simple_id name?))))))
WHERE (search_condition (search_condition_and (search_condition_not (predicate (expression
(full_column_name (id (simple_id year?)))) (comparison_operator >) (expression (constant
1995)))) and (search_condition_not (predicate (expression (full_column_name (id (simple_id
year?)))) (comparison_operator <) (expression (constant 2005)))))))) ;)
As the Listener walks the tree, it invokes the method corresponding to each parse rule. We use
these methods to capture data when the required parse rules are triggered. For instance, when the
walker comes across “full_column_name”, the methods “enter_full_column_name” and
“exit_full_column_name” are triggered. The logic to extract the schema elements is therefore
written in the appropriate methods.
Improvement made: The existing implementation did not parse the From clause. To store the table
names given in the From clause, the method “enter_Table_name” in the tsqlBaseListener class was
modified.
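The extraction step can be illustrated without ANTLR. The following is a minimal Python sketch, not the project's Java listener: it replaces the grammar-driven parse with a regular expression that picks out only the elements marked with '?'. The function name extract_elements and the pattern ELEMENT are assumptions made for illustration.

```python
import re

# Hypothetical stand-in for the ANTLR listener: matches "table?.attr?",
# "table.attr?", "attr?" or a lone "?". Only '?'-marked elements are found.
ELEMENT = re.compile(r"(?:(\w+\??)\.)?(\w+\?|\?)")

def extract_elements(query):
    """Return (table, attribute) pairs; '*' marks an unknown name."""
    pairs = []
    for match in ELEMENT.finditer(query):
        table, attribute = match.group(1), match.group(2)
        # Drop the trailing '?' marker; an empty remainder means "unknown".
        table = (table or "").rstrip("?") or "*"
        attribute = attribute.rstrip("?") or "*"
        pairs.append((table, attribute))
    return pairs

query = "SELECT actor?.name? WHERE year? > 1995 and year? < 2005;"
print(extract_elements(query))
```

A real grammar also recovers operators, constants and unmarked names, which this pattern deliberately ignores.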
4.2. Relation Tree Mapper
Schema-relevant expressions occur as 1) relation names in the From clause, 2) attribute names
(sometimes along with table names) in other clauses and 3) value constraint conditions specified
in the Where clause. This information is represented as expression triples with three entries:
relation name, attribute name and condition constraint. Missing information is stored as *. The
expression triples are stored as relation trees, with the relation name as the root, the
attributes fetched from the various clauses as its children and, below each attribute, the values
taken from the Where clause.
In the given example, the expression triple formed from actor?.name? is (actor, name, *). Actor,
being the table name, is the root; name, being the attribute, is its child; and since the value
of the attribute is not specified, the attribute's child is stored as *. The other triples formed
from the query are (*, year, >1995) and (*, year, <2005). These are stored as relation trees:
Fig4 Relation Trees
Since multiple trees may be formed with the same table name or the same attributes, we merge
them so that a relation tree includes all the information pertaining to one relation. Trees are
merged at relation level if they have identical relation names. Attribute-level merging happens
when the trees have the same attribute names and their relation names are either the same or
absent.
Fig5 Merging at attribute level
The relation trees thus obtained for the above example are:
1. Relation tree
actor = {name=[ name [] [] ]}
2. Relation tree
* = {year=[ year [>, <] [1995, 2005] ]}
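The merging of expression triples into relation trees can be sketched as follows. This is an illustrative Python sketch rather than the project's Java code; the function name build_relation_trees and the dictionary layout are assumptions. It also assumes that unnamed trees ('*') merge only with other unnamed trees that share the same attribute.

```python
def build_relation_trees(triples):
    """Merge (relation, attribute, condition) triples into relation trees.

    Each tree maps an attribute name to the list of (operator, value)
    conditions collected for it. '*' stands for missing information.
    """
    trees = {}
    for relation, attribute, condition in triples:
        # Named relations merge on the relation name (relation level);
        # unnamed ones merge on the shared attribute name (attribute level).
        key = relation if relation != "*" else ("*", attribute)
        attributes = trees.setdefault(key, {})
        conditions = attributes.setdefault(attribute, [])
        if condition != "*":
            conditions.append(condition)
    return trees

triples = [
    ("actor", "name", "*"),
    ("*", "year", (">", 1995)),
    ("*", "year", ("<", 2005)),
]
trees = build_relation_trees(triples)
for key, attributes in trees.items():
    print(key, attributes)
```

The two year triples end up in one tree with both conditions, mirroring the merged result listed above.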
4.3. Similarity evaluation
The relation trees are mapped to relations in the database based on their similarity, which is
computed using an evaluation function. Each relation tree can be mapped to a set of relations.
Let rt be a relation tree with name n(rt) and attributes (at1, at2, …, atn), and let R be a
relation in the database. rt is similar to R if R contains similar information at root level
(n(rt)) and at attribute level (each ati). The function is given by:
$$\mathrm{Sim}(rt, R) = \mathrm{Sim}(n(rt), R) \cdot \prod_{i=1}^{n} \mathrm{Sim}(at_i, R)$$
The similarity at root level is the maximum of the similarity between the relation tree name and
the name of relation R, and the similarities between the tree name and the names of R's
neighbors. It is given by Sim(n(rt), R) = MAX(Sim(n(rt), n(R)), {Sim’(n(rt), n(Ri))}), where R is
the relation and the Ri are the relations that are neighbors of R.
The similarity at attribute level is the maximum similarity between the attribute tree and the
attributes of the relation. It is given by Sim(at, R) = MAX({Sim(at, Ai) | 1 ≤ i ≤ m}), where R
is a relation with attributes (A1, A2, …, Am).
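The three formulas above can be combined in a few lines. The sketch below is illustrative Python, not the project's Java code. All function names are assumptions, the same word-level similarity is used for neighbors in place of Sim’, and toy_sim is a placeholder scorer. The schema fragments in the usage lines are loosely modeled on the movie database and partly hypothetical.

```python
from math import prod

def root_similarity(tree_name, relation, neighbours, word_sim):
    """Sim(n(rt), R): best match against R's name or a neighbour's name."""
    scores = [word_sim(tree_name, relation)]
    scores += [word_sim(tree_name, neighbour) for neighbour in neighbours]
    return max(scores)

def attribute_similarity(attribute, relation_attributes, word_sim):
    """Sim(at, R): best match against any attribute of R."""
    return max(word_sim(attribute, a) for a in relation_attributes)

def tree_similarity(tree_name, tree_attributes, relation,
                    relation_attributes, neighbours, word_sim):
    """Sim(rt, R): root-level score times the product of attribute scores."""
    root = root_similarity(tree_name, relation, neighbours, word_sim)
    return root * prod(
        attribute_similarity(a, relation_attributes, word_sim)
        for a in tree_attributes
    )

def toy_sim(a, b):
    # Placeholder word similarity: 1.0 for equal names, 0.5 otherwise.
    return 1.0 if a.lower() == b.lower() else 0.5

person = tree_similarity("actor", ["name"], "Person",
                         ["person_id", "name"], ["Actor", "Director"], toy_sim)
movie = tree_similarity("actor", ["name"], "Movie",
                        ["movie_id", "title"], ["Director"], toy_sim)
print(person, movie)
```

Because every factor lies between 0 and 1, the product also stays in that range.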
Improvement made: The similarity Sim(a, b) between two words a and b is computed using the
Wu-Palmer (WUP) algorithm [3]. WUPSimilarity.java contains the code for computing the score.
The Wu & Palmer metric (wup) calculates similarity using the depths of the two words in the
WordNet taxonomy, along with the depth of their lcs (least common subsumer) [3]. The score is
2*depth(lcs) / (depth(w1) + depth(w2)) [3]. It measures the semantic similarity or relatedness
between words. The result is always a value between 0 and 1: 0 for totally different words and 1
for identical words.
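As a worked illustration of the formula 2*depth(lcs) / (depth(w1) + depth(w2)), the Python sketch below evaluates it on a tiny hand-made taxonomy. The taxonomy, the function names and the convention that the root has depth 1 are all assumptions; the project itself relies on WUPSimilarity.java and a real lexical database.

```python
def ancestors(word, parent):
    """Path from the word up to the root, the word itself included."""
    path = [word]
    while word in parent:
        word = parent[word]
        path.append(word)
    return path

def wup(w1, w2, parent):
    """Wu-Palmer score on a tree taxonomy; the root has depth 1."""
    path1, path2 = ancestors(w1, parent), ancestors(w2, parent)
    # Paths run bottom-up, so the first shared node is the deepest one.
    lcs = next(node for node in path1 if node in path2)
    depth_lcs = len(ancestors(lcs, parent))
    return 2 * depth_lcs / (len(path1) + len(path2))

# Hypothetical toy taxonomy: child -> parent.
parent = {
    "actor": "person",
    "director": "person",
    "person": "entity",
    "movie": "entity",
}

print(wup("actor", "director", parent))  # lcs is "person": 2*2 / (3+3)
print(wup("actor", "movie", parent))     # lcs is "entity": 2*1 / (3+2)
```

With this depth convention the score of a word against itself is 1, and words that meet only at the root get the lowest scores.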
Since this algorithm only checks how close the word meanings are, a pair such as name and
firstname yields a score of 0. To cover such cases, we use the longest common subsequence
algorithm along with WUP. Longest common subsequence compares the characters of the two words and
scores them by the length of the matched subsequence relative to the word lengths:
Score = length(common subsequence) / max(length(word1), length(word2)). The final value is
max(score(WUP), score(longest common subsequence)).
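The longest common subsequence score and its combination with a semantic score can be sketched as below. This is illustrative Python under stated assumptions: the function names are hypothetical, comparison is case-insensitive, and semantic_sim stands for any WUP-style scorer supplied by the caller.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two strings."""
    previous = [0] * (len(b) + 1)
    for char_a in a:
        current = [0]
        for j, char_b in enumerate(b, start=1):
            if char_a == char_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]

def lcs_score(word1, word2):
    """length(common subsequence) / max(length(word1), length(word2))."""
    if not word1 or not word2:
        return 0.0
    common = lcs_length(word1.lower(), word2.lower())
    return common / max(len(word1), len(word2))

def combined_similarity(word1, word2, semantic_sim):
    """Final value: the larger of the semantic and the LCS score."""
    return max(semantic_sim(word1, word2), lcs_score(word1, word2))

# "name" is a subsequence of "firstname": 4 matched characters out of 9.
print(lcs_score("name", "firstname"))
print(combined_similarity("name", "firstname", lambda a, b: 0.0))
```

Even when the semantic scorer returns 0 for name and firstname, the combined value stays at 4/9.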
Also, the final similarity is the product of the similarity at attribute level and the similarity
at root level. The existing implementation summed these values; it has been changed to multiply
them. This ensures that the final outcome is always a value between 0 and 1.
The mapped relation trees for the above example are:
Mapping: actor=Person, year=Movie: release year
Improvement needed:
The relation trees should be mapped to the set of relations Ri for which Sim(rt, Ri) is greater
than σ * Max{Sim(rt, Rj) | 1 ≤ j ≤ n}. This ensures that all likely relations are considered.
Instead of checking this condition, the implementation takes the relations for which the table
similarity is a value less than 0.4 and the neighbor similarity is greater than 0.7.
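The threshold rule Sim(rt, Ri) > σ * Max{Sim(rt, Rj)} can be sketched as below. This is illustrative Python, not the existing implementation; the function name, the default σ = 0.8 and the scores in the usage lines are assumptions.

```python
def candidate_relations(scores, sigma=0.8):
    """Keep relations whose score exceeds sigma times the best score.

    `scores` maps relation names to Sim(rt, R). sigma is a tuning
    parameter; 0.8 is an arbitrary value chosen for this sketch.
    """
    if not scores:
        return []
    threshold = sigma * max(scores.values())
    return [relation for relation, score in scores.items() if score > threshold]

# Hypothetical similarity scores for one relation tree.
scores = {"Person": 0.9, "Actor": 0.8, "Movie": 0.3}
print(candidate_relations(scores))
```

With a relative threshold the number of candidates adapts to how clearly the best match stands out, whereas fixed cut-offs such as 0.4 and 0.7 do not.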
The paper does not clearly state how the similarity is computed for relation trees that have no
name. It says that the similarity should be computed between the attribute tree and the relation
names. However, it is not clear whether the similarity at root level is the maximum of the values
calculated among all attribute trees and the relation names, or the maximum computed between
each attribute and the other relations. In the implementation this computation is done in the
method SimilarityEvaluation.calSimgAttributeLevel. This method does not update the kdef value and
does not compare the attribute tree with all relations, so the code needs to be changed.
Also, two attributes with the same name may exist in different tables while the schema-free query
mentions them without table names. Because relation trees are merged at attribute level, the two
trees are merged into a single tree, which is mapped to a single relation instead of two.
4.4. Network Builder
After mapping the relation trees, the next step is to create a join path that contains the mapped
relations. Users may partially specify the join paths in their queries. We create graphs with
tables as nodes and connect them with weighted edges. This is a representation of the
schema-free query.
The existing database is represented as a graph with the relations as nodes and an edge wherever
a foreign key reference exists between two tables. Any join path mentioned in the input is also
included in the graph, creating a connection if one does not already exist. These connections
form the basic views. As mentioned in the previous section, each relation tree is mapped to
several relations, and this information must also be incorporated into the views. An Extended
View Graph is therefore formed from the views: the nodes in the existing views are replaced with
the corresponding mapped relation trees. Relations that have not been mapped to any relation tree
are also included.
For example, suppose a view exists as Person – Actor – Movie – Director – Person, two relation
trees are mapped to Person, and other trees are mapped to movie producer and company
respectively. To generate an extended view graph, the first Person can be replaced with Person1
and the last Person with Person2, or vice versa. This creates two graphs:
Person1 – Actor – Movie – Director – Person2 or Person2 – Actor – Movie – Director – Person1.
Fig6 Extended View Graph
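Enumerating the alternative graphs amounts to assigning the mapped relation trees to the repeated nodes in every possible order. The Python sketch below illustrates this for a single path-shaped view; the function name extended_views and the list representation of a view are assumptions.

```python
from itertools import permutations

def extended_views(view, relation, trees):
    """Yield one copy of `view` per assignment of `trees` to the
    positions where `relation` occurs."""
    slots = [i for i, node in enumerate(view) if node == relation]
    for assignment in permutations(trees, len(slots)):
        expanded = list(view)
        for slot, tree in zip(slots, assignment):
            expanded[slot] = tree
        yield expanded

view = ["Person", "Actor", "Movie", "Director", "Person"]
for graph in extended_views(view, "Person", ["Person1", "Person2"]):
    print(" - ".join(graph))
```

For two trees and two Person nodes this yields exactly the two orderings described in the text.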
Every edge is assigned a weight based on the similarity between the relation tree of one node
and the relation of the connected node. The weight is computed as:
$$\mathrm{Weight}(e) = 1 - (1 - c) \cdot \bigl(1 - \max\bigl(\mathrm{Sim}(n(\mathit{relTree}_1), n(R_2)),\ \mathrm{Sim}(n(\mathit{relTree}_2), n(R_1))\bigr)\bigr)$$
We have assumed that the value of c is 0.7; relTree1 is the relation tree containing the
information about relation R1 of the database.
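A quick numeric check of the weight formula, written as an illustrative Python sketch. The function and parameter names are assumptions, and the two similarity values are passed in directly rather than computed.

```python
def edge_weight(sim_tree1_r2, sim_tree2_r1, c=0.7):
    """Weight(e) = 1 - (1 - c) * (1 - max of the two cross similarities)."""
    best = max(sim_tree1_r2, sim_tree2_r1)
    return 1 - (1 - c) * (1 - best)

# With c = 0.7 the weight ranges from 0.7 (no similarity) to 1.0 (identical).
for best in (0.0, 0.5, 1.0):
    print(best, round(edge_weight(best, 0.0), 2))
```

The constant c therefore acts as a floor: even an edge between unrelated names keeps a weight of c.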
All the nodes in the view graph are replaced with the mapped relation trees, so that the
resulting join network contains the relations used in the schema-free query. As the connections
are made, the weight of the resulting network is calculated by multiplying the edge weights and
taking the square root of the product. A network containing all mapped relation trees is called
a total join network.
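The network weight and the top-k selection can be sketched as follows. This is illustrative Python; the function names, the dictionary layout of a network and the edge weights in the usage lines are assumptions.

```python
import heapq
from math import prod, sqrt

def network_weight(edge_weights):
    """Square root of the product of the edge weights, as described above."""
    return sqrt(prod(edge_weights))

def top_k(networks, k):
    """Return the k networks with the highest weight, best first."""
    return heapq.nlargest(k, networks, key=lambda n: network_weight(n["edges"]))

# Hypothetical join networks with made-up edge weights.
networks = [
    {"name": "A", "edges": [0.9, 0.8]},
    {"name": "B", "edges": [1.0, 1.0]},
    {"name": "C", "edges": [0.7]},
]
for network in top_k(networks, 2):
    print(network["name"], round(network_weight(network["edges"]), 3))
```

Passing k as a parameter corresponds to the user-supplied number of output queries.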
Improvement made: The WUP similarity function [3] is used to calculate the weights of the edges.
A parameter k was introduced to get the top-k queries, replacing the limit of five queries that
was hardcoded.
Improvements needed: The partial join paths mentioned in the input query are not considered while
forming the View Graphs. While expanding the join path to generate the k minimal total join
networks, relations are joined with themselves and with the wrong tables, leading to
inappropriate joins in the resulting query. When a relation tree is not mapped to any relation, a
null pointer exception arises when trying to create a connection to it.
4.5. SQL Composer
After the top-k join networks are generated, the relation trees are replaced with the actual
relations they map to. The nodes in the join path are included in the From clause, and the edges
in the join path are translated into join conditions on the foreign key references to form the
Where clause.
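The composition step can be illustrated with a short Python sketch. It is not the project's composer: the function name compose_sql and the representation of nodes and edges are assumptions. The usage lines rebuild the expected query of Query 3 from the Testing section.

```python
def compose_sql(select_list, nodes, edges, conditions=()):
    """Build a query from a join network.

    nodes: (relation, alias) pairs that form the From clause.
    edges: (left_column, right_column) pairs that become join conditions.
    conditions: value constraints carried over from the user's query.
    """
    from_clause = ", ".join(
        relation if relation == alias else f"{relation} as {alias}"
        for relation, alias in nodes
    )
    predicates = list(conditions)
    predicates += [f"{left} = {right}" for left, right in edges]
    sql = f"SELECT {', '.join(select_list)} FROM {from_clause}"
    if predicates:
        sql += " WHERE " + " and ".join(predicates)
    return sql + ";"

sql = compose_sql(
    ["Person1.name"],
    [("Person", "Person1"), ("Actor", "Actor")],
    [("Actor.person_id", "Person1.person_id")],
)
print(sql)
```

Building the From list from the nodes alone avoids the stray trailing commas that appear in some of the generated queries.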
4.6. Testing
The existing implementation was tested in order to identify its shortcomings. We test the
system's ability to handle various errors that users frequently make. Two kinds of errors are
possible in queries: syntactic and semantic. A syntactically incorrect query contains invalid
characters, whereas a semantically incorrect query is legal but does not produce the intended
results. Several queries were run and their output was analyzed. Since the system generates k
queries per input, some of the results are repetitive; only the unique answers are reported here:
Query1 – To find the names of directors that Jack has co-operated with.
SELECT (Person_2.name) From Person as Person_1, Actor, Movie, Director, Person as Person_2 WHERE Person_1.name = 'Jack' and Person_1.person_id = Actor.person_id and Actor.movie_id = Movie.movie_id;
Output:
SELECT (Person2.name) From Person as Person_1, Actor, Movie, Director, Person as Person_2 from Movie as Movie1, Actor as Actor1, Person as Person1, Person as Person2, Director WHERE Person1.name = 'Jack' and Person1.person_id = Actor1.person_id and Actor1.movie_id = Movie1.movie_id and Actor1.person_id=Person2.person_id and Person2.person_id=Director.person_id and Actor1.person_id=Person1.person_id and Actor1.movie_id=Movie1.movie_id;
SELECT (Person2.name) From Person as Person_1, Actor, Movie, Director, Person as Person_2 from Movie as Movie1, Actor as Actor1, Person as Person1, Person as Person2 WHERE Person1.name = 'Jack' and Person1.person_id = Actor1.person_id and Actor1.movie_id = Movie1.movie_id and Actor1.person_id=Person1.person_id and Actor1.movie_id=Movie1.movie_id and Actor1.person_id=Person2.person_id;
The repeated joins indicate that the join path generation ignores the join conditions already
present in the query and creates duplicates.
Query2 – Return all the directors that “Jack” has cooperated with.
SELECT director?.name? Where actor?.name? = 'Jack';
Actual Query:
SELECT (Person_2.name)
From Person as Person_1, Actor, Movie, Director, Person as Person_2
WHERE Person_1.name = 'Jack'
and Person_1.person_id = Actor.person_id
and Actor.movie_id = Movie.movie_id
and Movie.movie_id = Director.movie_id
and Director.person_id = Person_2.person_id;
Output:
SELECT Person2.name from Person as Person1, Person as Person2, Director, Actor, Where Person1.name = 'Jack' and Director.person_id=Person1.person_id and Person1.person_id=Actor.person_id and Actor.person_id=Person2.person_id;
SELECT Person2.name from Person as Person1, Person as Person2, Actor, Director, Where Person1.name = 'Jack' and Actor.person_id=Person1.person_id and Person1.person_id=Director.person_id and Director.person_id=Person2.person_id;
SELECT Person2.name from Person as Person1, Person as Person2, Actor, Where Person1.name = 'Jack' and Actor.person_id=Person1.person_id and Actor.person_id=Person2.person_id;
SELECT Person2.name from Person as Person1, Person as Person2, Director, Where Person1.name = 'Jack' and Director.person_id=Person1.person_id and Director.person_id=Person2.person_id;
The important joins between Actor and Movie (Actor.movie_id = Movie.movie_id) and between Movie
and Director (Movie.movie_id = Director.movie_id) are missing. Also, the generated joins between
Director, Actor and Person are wrong.
Query 3 – To get the actor names.
select actor?.name?
from actor;
Actual Query:
select Person1.name from Person as Person1, Actor where Actor.person_id=Person1.person_id;
Output:
select Person1.name from Person as Person1, Director where Director.person_id=Person1.person_id;
select Person1.name from Person as Person1, Actor, where Actor.person_id=Person1.person_id;
The system generated only two queries, one of which is correct. The join condition in the first
query is wrong.
Query4 – To get the ids of actors.
select id?
From actor;
Actual Query:
select Actor.person_id from Person as Person1, Actor where Actor.person_id=Person1.person_id;
Output:
select Movie1.movie_id
from Movie as Movie1, Movie_Producer where Movie_Producer.movie_id=Movie1.movie_id;
select Movie1.movie_id
from Movie as Movie1, Director where Director.movie_id=Movie1.movie_id;
select Movie1.movie_id
from Movie as Movie1, Actor where Actor.movie_id=Movie1.movie_id;
The id is wrongly mapped to movie_id. This indicates that the schema-free mapping does not
consider the user’s intent. The wrong joins also indicate that join path generation should be
improved.
Query5 – To select everything from the table actor.
Select ?
From actor;
A NullPointerException is raised because the actor table is not mapped to any relation; the
similarity mapping failed.
Query6 – To name the directors with whom the actor Jack has cooperated.
SELECT director?.name?
Where actor?.name? = 'jack' and director?.movie? = actor?.movie?;
Actual query:
Select person2.name From Person as Person1, Person as Person2, Director, Actor Where person2.person_id = director.person_id And person1.name = 'Jack' And person1.person_id = actor.person_id And actor.movie_id = director.movie_id;
Output:
SELECT Person2.name from Person as Person1, Person as Person2, Director, Actor, Where Person1.name = 'jack' and Person2.name = Person1.name and Director.person_id=Person1.person_id and Person1.person_id=Actor.person_id and Actor.person_id=Person2.person_id;
SELECT Person2.name from Person as Person1, Person as Person2, Actor, Director, Where Person1.name = 'jack' and Person2.name = Person1.name and Actor.person_id=Person1.person_id and Person1.person_id=Director.person_id and Actor.person_id=Person2.person_id;
SELECT Person2.name from Person as Person1, Person as Person2, Actor, Where Person1.name = 'jack' and Person2.name = Person1.name and Actor.person_id=Person1.person_id and Actor.person_id=Person2.person_id;
SELECT Person2.name from Person as Person1, Person as Person2, Director, Where Person1.name = 'jack' and Person2.name = Person1.name and Director.person_id=Person1.person_id and Director.person_id=Person2.person_id;
The join specified in the input query is missing from the output, and wrong joins are formed.
5. Conclusion
The system can build a complete SQL query from the schema-free input using the steps described
above. The results contain at least one correct query. The input query is successfully parsed,
vague schema names are correctly mapped owing to accurate similarity evaluation, and the join
networks that are developed generate an equivalent SQL query. This ensures that the user need
not have a complete idea of the database schema and its names, thereby reducing the burden of
understanding it.
6. Future Work
This project can be extended to make it interactive, so that it gives suggestions to the user.
The existing algorithm should be improved to handle nested queries, to generate better join
networks and to increase the accuracy of the resulting queries. The system should also be tested
on many more complex databases.
7. References
[1] Fei Li, Tianyin Pan, and Hosagrahar V. Jagadish. Schema-free SQL. In Proceedings of the 2014
ACM SIGMOD International Conference on Management of Data, pages 1051–1062. ACM, 2014.
[2] ANTLR Grammar, 2016. https://github.com/antlr/grammars-v4.
[3] Wu-Palmer, 2016. http://search.cpan.org/dist/WordNet-Similarity/lib/WordNet/Similarity/wup.pm