Data Grouping Process in Extended SQL Language Containing Fuzzy Elements

Bożena Małysiak-Mrozek, Dariusz Mrozek, and Stanisław Kozielski

Institute of Informatics, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland
[email protected], [email protected], [email protected]

Summary. Incorporating fuzziness into database systems makes it possible to extend the analysis of data during the querying process. Approximate processing of data includes not only rows that exactly meet the search criteria, but also rows that are similar within a given range of tolerance. Queries formulated in natural language usually contain imprecise and fuzzy terms. However, implementing imprecision in query languages such as SQL requires additional extensions in the target database system. In this paper, we devote our attention to different methods of data grouping. In the first part of the paper we concentrate on the process of fuzzy grouping of crisp data. In the second part, we focus on the grouping of fuzzy data.

1 Introduction

Queries submitted to databases are often first formulated in a natural language. Afterwards, these queries are translated into database query languages, such as SQL [1], [2]. Imprecise and fuzzy terms typically appear in natural language. This is characteristic of the human perception of things that happen in the surrounding world. However, such terms are very hard to implement in database query languages. Fuzzy logic [3], [4] can be applied not only to the human way of thinking, but also to databases that store data related to many domains and to query languages, e.g. SQL, operating on the data. The fundamental principle of fuzzy logic theory is that everything is true, but with a different degree of compatibility [5], [6]. In general, fuzzy logic can help to improve the results of SQL queries. Fuzzy queries provide a simple way to retrieve the data we want without defining exact search criteria [7], [8], [9]. In huge databases this approach can be very useful, and such an implementation can significantly speed up the retrieval process [8], [9]. This process may consist of two stages. In the first stage, we perform a quick data search, which provides a set of data that meet the search criteria with at least the given compatibility degree. In the second stage, the user can quickly find the most suitable data in the result set created in the first stage.

This is an author's version of the paper.

Original version: K.A. Cyran (Eds.): Man-Machine Interactions, AISC 59, pp. 247–256.

The final publication is available at link.springer.com: DOI:10.1007/978-3-642-00563-3_25


Implementation of fuzzy queries in database systems is very important for the retrieval of various types of data, such as text, image, video, audio, graphics, and animation [10]. Studies in this area have been carried out over the last two decades by several research centers. For example, works [11] and [12] propose querying systems (FQUERY and SQLf, respectively) that support imprecise queries. In [13], Tang and Chen describe fuzzy relational algebraic operators useful for designing fuzzy query languages. The usage of different membership functions in processing fuzzy SQL queries is discussed in the authors' work [14].

SQL is the most popular language for retrieving and modifying data stored in databases [1], [2]. Fuzzy terms can occur in all parts of the SQL SELECT statement [9], [15], [16]. In this paper, we concentrate on the GROUP BY phrase of the SELECT statement. In the following sections, we present different methods of fuzzy grouping and grouping of fuzzy data. Some of them are based on already known methods and some of them are completely novel. We also show our implementation of these grouping methods in the SQL language (PostgreSQL DBMS). All issues considered in this work are illustrated by appropriate examples.

2 Data Grouping

Data grouping is often a required step in the aggregation process. In SQL notation, the GROUP BY phrase is responsible for the grouping process. The process groups rows with identical values in the columns specified in the GROUP BY phrase, so each unique combination of values in the specified columns constitutes a separate group [1], [2]. This is the classical grouping. If we allow similar (not identical) data in grouping columns to be grouped together, we can use methods of fuzzy grouping. If a database can store fuzzy data, these data can be grouped with the use of classical or fuzzy grouping methods. There are different approaches to data grouping that can be used if we incorporate fuzziness into the database system:

• fuzzy grouping of crisp values,
• classical grouping of fuzzy values,
• fuzzy grouping of fuzzy values.

2.1 Fuzzy Grouping of Crisp Values

In the classical grouping of crisp values, the smallest difference between values in the grouping column(s) leads to the separation of groups [1], [2]. If we want to group similar data, it is necessary to apply mechanisms of fuzzy grouping, which join similar data into the same groups [9]. In our work, we have extended the SQL language by implementing the following algorithms of fuzzy grouping:


• grouping with respect to linguistic values determined for the attribute domain,
• fuzzy grouping based on the hierarchical clustering method,
• fuzzy grouping based on the authors' algorithm.

Grouping with Respect to Linguistic Values

This method of grouping can be applied if the domains of the grouped attributes can be described by linguistic values [9], [11], [17]. These values can be defined by membership functions [6], [14]. The grouping process consists of two steps. In the first step, we assign to each existing numerical value the best-matching linguistic value, i.e. the one with the highest compatibility degree. Afterwards, we group data according to the assigned linguistic values. The number of groups is equal to the number of different linguistic values assigned.

Example

Let's assume there is a Measurements table in the database. The table includes the temperature attribute, which stores temperatures for particular days in degrees Celsius (Table 1, left columns). For the temperature attribute we define a set of linguistic values describing possible temperatures: very cold, cold, warm, very warm, etc. Each of these values is determined by an appropriate membership function. In this case, we assign the best-matching linguistic value to each value of the temperature column (Table 1, right column). In the next step, we run the process of classical grouping with respect to the assigned linguistic values. In practice, both steps are executed in one SQL statement.

Table 1. Measurements table and linguistic values of the temperature after the assignment step

Date      Temperature  Linguistic value
02.07.03  17           quite warm
08.07.03  18           warm
11.07.03  14           quite warm
14.07.03  25           very warm
21.07.03  29           very warm
25.07.03  30           very warm
31.07.03  15           quite warm

Let's consider the following query:

Display number of days with similar temperatures.

This query written in the extended SQL language can have the following form:


SELECT temperature_to_ling(temperature) as temperature,
       COUNT(date) as Days_No
FROM Measurements
GROUP BY temperature_to_ling(temperature);

where the temperature_to_ling function converts an original, numerical value of the temperature column into the best-matching linguistic value of the temperature. The result of this query is presented in Table 2.

Table 2. Query results

Temperature  Days_No
quite warm   3
warm         1
very warm    3
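The two-step process above (assignment, then classical grouping) can be illustrated outside the database with a small Python sketch. It is only an illustration under stated assumptions: the triangular membership functions and their parameters are hypothetical, chosen so that the assignment agrees with Table 1, and this `temperature_to_ling` is a stand-in for the SQL function of the same name, not its actual implementation.

```python
from collections import Counter


def triangular(x, a, b, c):
    """Triangular membership function with support (a, c) and peak at b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


# Hypothetical linguistic values; the (a, b, c) parameters are assumptions.
LINGUISTIC_VALUES = {
    "quite warm": (10, 15, 20),
    "warm": (15, 20, 25),
    "very warm": (20, 27, 35),
}


def temperature_to_ling(temperature):
    """Return the linguistic value with the highest compatibility degree."""
    return max(
        LINGUISTIC_VALUES,
        key=lambda name: triangular(temperature, *LINGUISTIC_VALUES[name]),
    )


# Rows of the Measurements table from Table 1: (date, temperature).
measurements = [
    ("02.07.03", 17), ("08.07.03", 18), ("11.07.03", 14), ("14.07.03", 25),
    ("21.07.03", 29), ("25.07.03", 30), ("31.07.03", 15),
]

# Step 1: assign linguistic values; step 2: classical grouping with COUNT.
days_no = Counter(temperature_to_ling(t) for _, t in measurements)
for value, count in days_no.items():
    print(value, count)
```

With these assumed membership functions the counts agree with Table 2; different function shapes would move the borders between groups.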

Modified Hierarchical Clustering Method in Data Grouping

In the previous case, the assignment to a group was strictly related to the division of the attribute domain. However, in many cases it could be interesting to join many similar values into groups when the division of the attribute domain is not predefined. The approach is similar to the problem of cluster separation in clustering algorithms [18], [19].

Let's assume we perform clustering of N data items $x_1, x_2, \ldots, x_N$. The purpose of the process is to obtain M clusters. The classical hierarchical clustering algorithm contains the following steps:

1. In the beginning, we assume each separate data item $x_i$ creates a single cluster $K_i = \{x_i\}$, $i = 1, \ldots, N$; the number of clusters is $l := N$;
2. We find the two nearest clusters $K_k$ and $K_j$ using one of the distance measures defined later in the section;
3. Next, clusters $K_k$ and $K_j$ are joined and create the new cluster $K_k$; cluster $K_j$ is removed; the number of clusters decreases, $l := l - 1$;
4. If $l > M$, go to step 2.
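The four steps above can be sketched in Python as follows. This is a minimal illustration for one-dimensional numeric data; the function names are hypothetical, and the minimum distance between clusters is chosen here only as one possible measure.

```python
def d_min(a, b):
    """Minimum distance between any two items of clusters a and b."""
    return min(abs(x - y) for x in a for y in b)


def hierarchical_clustering(data, m, distance=d_min):
    """Join the two nearest clusters until only m clusters remain."""
    if m < 1:
        raise ValueError("m must be at least 1")
    clusters = [[x] for x in data]  # step 1: one cluster per data item
    while len(clusters) > m:  # step 4: repeat while l > M
        # Step 2: find the pair of nearest clusters.
        k, j = min(
            ((k, j) for k in range(len(clusters))
             for j in range(k + 1, len(clusters))),
            key=lambda kj: distance(clusters[kj[0]], clusters[kj[1]]),
        )
        # Step 3: join K_j into K_k and remove K_j.
        clusters[k] = clusters[k] + clusters[j]
        del clusters[j]
    return clusters


temperatures = [17, 18, 14, 25, 29, 30, 15]
clusters = hierarchical_clustering(temperatures, 3)
print(sorted(sorted(c) for c in clusters))
```

For the temperatures of Table 1 and M = 3, the sketch ends with the clusters {14, 15, 17, 18}, {25} and {29, 30}.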

The presented algorithm can be applied to group data in databases. For this purpose, we need to specify the meaning of the data $x_1, x_2, \ldots, x_N$. In the SQL language, data grouping is implemented in the GROUP BY phrase, which can be presented in the following form:

GROUP BY <A1, A2, ..., Ap>,


where $A_1, A_2, \ldots, A_p$ are attributes (column names) of the table T on which the grouping process is performed. A single data item x is formed by the set of values of the attributes $A_1, A_2, \ldots, A_p$ coming from a single row of the table T. Therefore, the data $x_1, x_2, \ldots, x_N$ represent the sets of attribute values involved in the grouping for successive rows of the table T, and N corresponds to the number of rows being grouped.

In order to implement the presented algorithm, we must assume that it is possible to define a distance function between $x_i$ and $x_j$ (and, more generally, between clusters) in the multidimensional space determined by the data $x_1, x_2, \ldots, x_N$. The distance may be calculated as [20]:

1. the minimum distance between any data of clusters A and B:

$$d_{\min}(A,B) = \min_{x_A \in A,\; x_B \in B} |x_A - x_B|, \tag{1}$$

2. the maximum distance between any data of clusters A and B:

$$d_{\max}(A,B) = \max_{x_A \in A,\; x_B \in B} |x_A - x_B|, \tag{2}$$

3. the arithmetic average of all distances between all data in clusters A and B:

$$d_{avg}(A,B) = \frac{1}{\mathrm{card}(A)\,\mathrm{card}(B)} \sum_{x_A \in A} \sum_{x_B \in B} |x_A - x_B|, \tag{3}$$

4. the distance between the central point (average value) $m_A$ of the cluster A and the central point (average value) $m_B$ of the cluster B:

$$d_{mean}(A,B) = |m_A - m_B|. \tag{4}$$
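For one-dimensional numeric data, the four measures (1)-(4) can be written directly in Python. The sketch below assumes scalar values, so that |x_A - x_B| is an absolute difference; for rows with several grouping attributes a vector norm would take its place. The function names are illustrative.

```python
from statistics import mean


def d_min(a, b):
    """Eq. (1): minimum distance between any items of clusters a and b."""
    return min(abs(x - y) for x in a for y in b)


def d_max(a, b):
    """Eq. (2): maximum distance between any items of clusters a and b."""
    return max(abs(x - y) for x in a for y in b)


def d_avg(a, b):
    """Eq. (3): arithmetic average of all pairwise distances."""
    return sum(abs(x - y) for x in a for y in b) / (len(a) * len(b))


def d_mean(a, b):
    """Eq. (4): distance between the central points (average values)."""
    return abs(mean(a) - mean(b))


cluster_a = [14, 15]
cluster_b = [17, 18]
print(d_min(cluster_a, cluster_b))   # 2
print(d_max(cluster_a, cluster_b))   # 4
print(d_avg(cluster_a, cluster_b))   # 3.0
print(d_mean(cluster_a, cluster_b))  # 3.0
```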

In the presented algorithm, the number of groups must be defined in advance. This may be inconvenient or unacceptable in many cases. For this reason, we decided to modify the algorithm by incorporating a different stop condition. The condition should not be tied to the number of groups. Imposing a restriction on the maximal size of a group seems to be a better solution. This maximal size can be defined as the maximum distance between any two data items in the group. In the modified version of the algorithm we have replaced the cluster concept with the group concept. As in the previous case, we assume we want to group N data items $x_1, x_2, \ldots, x_N$. However, in this case, the purpose of the grouping process is to obtain groups whose size does not exceed the given value maxd. The maximal group size maxd is usually defined by a domain expert. The modified algorithm has the following steps:

The algorithm of hierarchical grouping

1. In the beginning, we assume each data item $x_i$ creates a separate group $G_i = \{x_i\}$, $i = 1, \ldots, N$, $N > 1$. The set of all groups is represented by $G = \{G_1, G_2, \ldots, G_N\}$.
2. In the set $G$, we find the two nearest groups $G_k$ and $G_j$. Next, these two groups are joined together into the group $G'_k$. The size $d$ of the group $G'_k$ is computed (the maximum distance between any two data items in the group). If $d > \mathit{maxd}$, then cancel the group $G'_k$ and go to step 4.
3. Replace the group $G_k$ with the group $G'_k$. Delete the group $G_j$ from the group set $G$. Go to step 2.
4. Stop grouping: the set $G$ consists of the created groups.
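A minimal Python sketch of these steps for one-dimensional numeric data is given below. The function names are hypothetical, and the sketch assumes that the nearest groups are found with the minimum distance (1); the steps above leave the choice of measure open.

```python
def group_size(group):
    """Size of a group: maximum distance between any two of its items."""
    return max(group) - min(group)


def nearest_pair(groups):
    """Indices (k, j) of the two nearest groups, using minimum distance."""
    return min(
        ((k, j) for k in range(len(groups))
         for j in range(k + 1, len(groups))),
        key=lambda kj: min(
            abs(x - y) for x in groups[kj[0]] for y in groups[kj[1]]
        ),
    )


def hierarchical_grouping(data, maxd):
    """Join nearest groups until a join would exceed the size maxd."""
    groups = [[x] for x in data]  # step 1: one group per data item
    while len(groups) > 1:
        k, j = nearest_pair(groups)  # step 2: find and join nearest groups
        candidate = groups[k] + groups[j]
        if group_size(candidate) > maxd:
            break  # cancel the candidate group and stop (step 4)
        groups[k] = candidate  # step 3: replace G_k, delete G_j
        del groups[j]
    return groups


temperatures = [17, 18, 14, 25, 29, 30, 15]
print(sorted(sorted(g) for g in hierarchical_grouping(temperatures, 5)))
```

With maxd = 5 the temperatures of Table 1 fall into two groups, {14, 15, 17, 18} and {25, 29, 30}; with maxd = 4 the second group stays split into {25} and {29, 30}.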

We also considered other stop conditions. For example, as groups are joined, the distance between their central points increases. For this reason, the grouping may be finished when the distance between the central points of the two nearest groups exceeds a given value. We have implemented the presented algorithm of hierarchical grouping in the PostgreSQL DBMS [21]. Since all data are sorted before the grouping process, we could simplify the algorithm described above. In the implemented version, for sorted data, it is enough to evaluate the distance between the first element of the group and each consecutive element. Each calculated distance must not exceed the specified value of maxd. The implemented algorithm of hierarchical grouping is presented in Fig. 1.

Fig. 1. The implemented algorithm of hierarchical grouping
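The simplification for sorted data can be sketched in a few lines of Python. This is an illustration of the idea described in the text (compare each value with the first element of the current group), not a transcription of Fig. 1; the function name is hypothetical.

```python
def group_sorted(values, maxd):
    """Group sorted values so that no item is farther than maxd
    from the first (smallest) item of its group."""
    groups = []
    for value in sorted(values):
        if groups and value - groups[-1][0] <= maxd:
            groups[-1].append(value)  # still within maxd of the group start
        else:
            groups.append([value])  # border reached: open a new group
    return groups


print(group_sorted([17, 18, 14, 25, 29, 30, 15], 5))
# [[14, 15, 17, 18], [25, 29, 30]]
```

Because the data are sorted, the distance between the first and the last element of a group is also the maximum distance between any two of its elements, so a single pass over the data is enough.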

With the implemented algorithm of hierarchical grouping we can solve problems such as the one presented below. Let's assume there is a Students table in a database. The table stores information about the height and the weight of students in particular semesters. The table has the following schema:


Students (StdNo, StdName, semester, height, weight)

We will consider the following query:

Display the number of students with similar weight in particular semesters.

We assume that similar means the weights within one group differ by no more than 5 kg. The query written in our extended SQL language has the following form:

SELECT semester, weight, COUNT(StdNo)
FROM Students
GROUP BY semester, surroundings(weight, 5);

The surroundings function evaluates distances between data in groups and decides where the border of each group lies and which data belong to consecutive groups.
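The effect of this query can be imitated in Python on a handful of invented rows. Everything in the sketch is an assumption made for illustration: the sample students are not from the paper, the Python `surroundings` function only mimics the behaviour described for sorted data, and reporting the lowest and highest weight of each group is one possible way to present the weight column.

```python
from itertools import groupby


def surroundings(values, maxd):
    """Split sorted values into groups no wider than maxd."""
    groups = []
    for value in sorted(values):
        if groups and value - groups[-1][0] <= maxd:
            groups[-1].append(value)
        else:
            groups.append([value])
    return groups


# Hypothetical rows: (StdNo, StdName, semester, height, weight).
students = [
    (1, "Adams", 1, 170, 60), (2, "Brown", 1, 182, 72),
    (3, "Clark", 1, 175, 63), (4, "Davis", 2, 168, 84),
    (5, "Evans", 2, 190, 91), (6, "Ford", 2, 177, 80),
]


def count_similar_weights(rows, maxd):
    """Return (semester, lowest weight, highest weight, count) per group."""
    result = []
    by_semester = sorted(rows, key=lambda row: row[2])
    for semester, sem_rows in groupby(by_semester, key=lambda row: row[2]):
        weights = [row[4] for row in sem_rows]
        for group in surroundings(weights, maxd):
            result.append((semester, group[0], group[-1], len(group)))
    return result


for line in count_similar_weights(students, 5):
    print(line)
```

For the sample rows, semester 1 splits into the weight groups 60-63 kg (two students) and 72 kg (one student), and semester 2 into 80-84 kg (two students) and 91 kg (one student).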

2.2 Grouping of Fuzzy Data

Grouping of fuzzy data is required only in database systems that can store fuzzy data. When analysing the possibilities of grouping by attributes that store fuzzy values, we can first consider the same grouping methods that operate on crisp data. Here we assume the grouping process is performed on LR-type fuzzy numbers [4], [6]. In the most rigorous solution, the grouping takes into account all parameters describing the fuzzy values. Therefore, selected fuzzy values are joined into one group if they have the same set of parameters. More flexible solutions allow fuzzy data to be grouped by their modal values. We can distinguish two approaches:

• grouping of LR-type fuzzy numbers: all fuzzy numbers with the same modal value form a group,
• grouping of LR-type fuzzy intervals: all fuzzy intervals with the same range of modal values form a group.

In our work, we have implemented the first approach. The following example shows how it can be used. We assume there is a Requirement table in a database:

Requirement(DeptNo, Year, Paper, Toner, CDs)

The table stores information about the requirements of particular departments for paper, toner and CDs in particular years. The data type of the Paper, Toner, and CDs attributes is fTrapezium, which is a structure of four parameters describing a trapezoidal membership function. We consider the following query:

Display the number of departments that make requirements for a similar amount of paper in particular years.

The query in the extended SQL language has the following form:

SELECT Year, count(distinct DeptNo)
FROM Requirement
GROUP BY Paper, Year;
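The idea of grouping fuzzy values by their modal values can be illustrated with the Python sketch below. It rests on several assumptions: the four fTrapezium parameters are taken to be (a, b, c, d) with support [a, d] and modal range [b, c], which is a common convention but is not stated in the text; the sample rows are invented; and the Toner and CDs columns are omitted for brevity.

```python
from collections import defaultdict, namedtuple

# Assumed layout of the four parameters: support [a, d], modal range [b, c].
FTrapezium = namedtuple("FTrapezium", "a b c d")


def modal_key(value):
    """Grouping key: the modal range of a trapezoidal fuzzy value."""
    return (value.b, value.c)


# Hypothetical rows: (DeptNo, Year, Paper).
requirement = [
    (1, 2007, FTrapezium(8, 10, 10, 12)),
    (2, 2007, FTrapezium(9, 10, 10, 13)),   # same modal value, other spread
    (3, 2007, FTrapezium(18, 20, 20, 22)),
    (1, 2008, FTrapezium(8, 10, 10, 12)),
]


def count_departments(rows):
    """Count distinct departments per (Year, modal value of Paper)."""
    groups = defaultdict(set)
    for dept_no, year, paper in rows:
        groups[(year, modal_key(paper))].add(dept_no)
    return {key: len(depts) for key, depts in groups.items()}


for key, count in count_departments(requirement).items():
    print(key, count)
```

In this sketch departments 1 and 2 fall into one group for 2007 although their spreads differ, because only the modal value is compared; the most rigorous variant described above would compare all four parameters instead.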


We can also apply all methods of fuzzy grouping described in Section 2.1 to LR-type fuzzy data. In this situation, we deal with the process of fuzzy grouping of fuzzy data. The process can be implemented with the use of an arbitrary division of the fuzzy attribute domain (linguistic values) and the modified algorithm of hierarchical grouping.

Finally, we can consider a specific situation that does not fit any of the given categories. Fuzziness may occur indirectly in the GROUP BY clause, e.g. if we want to group by the compatibility degree of a fuzzy condition. We assume there is a Departments table in a database:

Departments(DeptNo, DeptName, NOEmployees)

The table stores information about the number of employees working in particular departments. Let's consider the following query:

Display the number of departments with the same compatibility degree for the fuzzy condition: the number of employees is about 25.

The query in the extended SQL language has the following form:

SELECT count(DeptNo), NOEmployees is about 25
FROM Departments
GROUP BY NOEmployees is about 25;

The query groups together rows with the same compatibility degree for the specified condition. It differs from the previous cases, where the values of the given attributes themselves were grouped.
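A small Python sketch shows what grouping by a compatibility degree amounts to. The triangular shape of the "about 25" condition and its spread of 5 are assumptions made for this illustration, and the sample departments are invented.

```python
from collections import Counter


def about(x, center, spread):
    """Compatibility degree of x with 'about center' (triangular shape)."""
    return max(0.0, 1.0 - abs(x - center) / spread)


# Hypothetical rows: (DeptNo, DeptName, NOEmployees).
departments = [
    (1, "Sales", 25), (2, "IT", 23), (3, "HR", 27),
    (4, "R&D", 30), (5, "Legal", 40),
]

# Group rows by the degree to which "NOEmployees is about 25" holds.
degree_counts = Counter(
    round(about(no_employees, 25, 5), 2)
    for _, _, no_employees in departments
)
for degree, count in sorted(degree_counts.items(), reverse=True):
    print(count, degree)
```

Departments with 23 and 27 employees land in the same group (degree 0.6) although their attribute values differ, which is exactly what distinguishes this case from grouping by the attribute itself.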

3 Concluding Remarks

Data grouping is one of the most important operations performed in all database systems. We have developed extensions to the SQL language that make it possible to process fuzzy data or to operate on data with the use of fuzzy techniques. The main advantage of the extended GROUP BY phrase presented in the paper is its simplicity. It allows SQL statements to be written with minimal knowledge of fuzzy set theory. However, this simplicity can be deceptive. The power of our SQL extensions, including the grouping process, is hidden behind all the functions that had to be implemented during the development process. In particular, the hierarchical grouping method required overloading of comparison operators, which is not possible in many popular DBMSs.

Our extensions to SQL cover not only data grouping, but also fuzzy filtering in the WHERE clause, fuzzy terms in subqueries, and fuzzy aggregation in the SELECT and HAVING clauses. However, these were the subject of our previous papers, e.g. [22], [23], [24]. The grouping methods presented in this paper can be used for reporting purposes in conjunction with classical or fuzzy aggregation. When we use standard grouping methods, the data may turn out to be too detailed for a reasonable analysis. Fuzzy grouping makes it possible to generalize data in order to draw better conclusions. Therefore, a conscious loss of some information results in better analysis and in the observation of interesting relations between data. We have recently implemented the presented grouping methods in a dating agency web service [25] and in a database system registering missing and unidentified people [26]. In the latter case, the presented grouping methods support analyses such as determining which age group is most at risk of going missing or remaining unidentified (e.g. old people).

References

1. Elmasri R, Navathe SB (2000) Fundamentals of Database Systems. Addison-Wesley Publishing Comp., World Student Series
2. Ullman JD (1988) Database and knowledge-base systems. Computer Science Press
3. Zadeh LA (1965) Fuzzy sets. Information and Control 8 (3): 338–353
4. Yager RR, Filev DP (1994) Essentials of Fuzzy Modeling and Control. John Wiley&Sons, New York
5. Dubois D, Prade H (1988) Fuzzy Sets and Systems. Academic Press, New York
6. Piegat A (1999) Fuzzy Modeling and Control. Exit, Warsaw
7. White DA, Jain R (1997) Algorithms and Strategies for Similarity Retrieval. Visual Computing Lab., University of California
8. Badurek J (1999) Fuzzy Logic in databases. Informatyka
9. Małysiak B (2003) Approximate retrieval methods in database systems. PhD Thesis, Gliwice
10. Swain M, Anderson JA, Swain N, Korrapati R (2005) Study of information retrieval using fuzzy queries. Proc. of the 2005 IEEE SoutheastCon: 527–533
11. Kacprzyk J, Ziółkowski A (1986) Database Queries with Fuzzy Linguistic Quantifiers. IEEE Transactions of Systems, Man, and Cybernetics. Vol. SMC-16, No 3
12. Bosc P, Pivert O (1995) SQLf: A Relational Database Language for Fuzzy Querying. IEEE Transactions on Fuzzy Systems. Vol. 3, No 1
13. Tang X, Chen G (2004) A complete set of fuzzy relational algebraic operators in fuzzy relational databases. Proceedings of the 2004 IEEE International Conference on Fuzzy Systems: 565–569
14. Małysiak B, Mrozek D, Kozielski S (2005) Processing Fuzzy SQL Queries with Flat, Context-Dependent and Multidimensional Membership Functions. Proc. of 4th IASTED International Conference on Computational Intelligence, Calgary, Canada. ACTA Press: 36–41
15. Yu CT, Meng W (1998) Principles of Database Query Processing for Advanced Applications. Morgan Kaufmann Publ. Inc.
16. Chen SM, Chen HH (2002) Fuzzy query processing in the distributed relational databases environment. Database and Data Communication Network Systems, Vol. 1, Elsevier Science
17. Bordogna G, Pasi G (1994) A fuzzy query language with a linguistic hierarchical aggregator. ACM
18. Wojciechowski K (1992) Image processing and pattern recognition. Silesian University of Technology. Script no 1662, Gliwice
19. Kolatch E (2001) Clustering algorithms for spatial databases: a survey. University of Maryland
20. Czogała E, Łęski J (2000) Fuzzy and neuro-fuzzy intelligent systems. Physica-Verlag, Heidelberg
21. PostgreSQL 7.2 Programmer's Guide. The PostgreSQL Global Development Group. (2001)
22. Małysiak B (2004) Interpretation of Filtering Conditions in SQL Queries. Studia Informatica, Vol. 25, No. 2(58): 89–101
23. Małysiak B, Mrozek D (2005) Fuzzy Aggregation in SQL Queries. Databases: Models, Technologies, Tools. v. 2. WKŁ, Warsaw: 77–84
24. Małysiak B (2004) Fuzzy Values in Nested SQL Queries. Studia Informatica, Vol. 25, No. 2(58)
25. Małysiak B, Bieniek S (2005) Internet as a Medium Supporting Establishing Relationships Between People. Proc. of the Computer Networks Conference, WKŁ, Warsaw
26. Małaczek P, Małysiak B, Mrozek D (2007) Searching Missing and Unidentified People Through the Internet. In: Kwiecień A. et al.: Computer Networks: Applications, WKŁ, Warsaw: 237–249