UNIVERSITETI I EVROPËS JUGLINDORE УНИВЕРЗИТЕТ НА ЈУГОИСТОЧНА ЕВРОПА
SOUTH EAST EUROPEAN UNIVERSITY
FAKULTETI I SHKENCAVE DHE TEKNOLOGJIVE BASHKËKOHORE ФАКУЛТЕТ ЗА СОВРЕМЕНИ НАУКИ И ТЕХНОЛОГИИ FACULTY OF CONTEMPORARY SCIENCES AND TECHNOLOGIES
Third Cycle of Academic Studies – Doctoral Studies
Doctoral Dissertation Topic:
HYBRID SOLUTION FOR SCALABLE RESEARCH ARTICLES RECOMMENDATION
CANDIDATE: Nuhi Besimi, MSc
MENTOR: Prof. Dr. Betim Çiço
October, 2020
Abstract
In recent decades, machine learning has become a crucial factor in automating business
operations and assisting in the decision-making process. The massive volume of data generated
at an unprecedented rate has motivated researchers and industry analysts to continually
develop effective and efficient analytical models and machine learning techniques.
In text mining, clustering and classification are essential techniques for extracting information from
textual data. These techniques allow us to identify groups of similar textual
documents or to build classification models based on document similarity. Applying machine
learning techniques to textual data has become a crucial factor in extracting useful and
previously unknown information from textual documents. With the massive volume of unstructured data
generated on the Web, researchers and industry have been motivated to develop efficient
techniques for structuring and processing such data to extract meaningful information.
In this thesis, we present a hybrid model based on clustering and classification techniques to
recommend research articles to researchers. Since the literature review process is time-
consuming, we aim to automate it and recommend the most relevant research
articles based on users’ research field preferences. All information extracted from raw,
unstructured research articles is represented in a graph structure, a form well suited
to the recommendation process.
This research contributes to the machine learning community by evaluating some of the most
significant text mining techniques for unsupervised and supervised learning, with the aim of
easing the literature review process for researchers. Furthermore, it evaluates the accuracy
and execution time of all model phases by comparing multiple techniques. It also
compares the execution of the model in terms of cost and energy consumption in three
different environments, namely, a cloud instance, cloud functions, and distributed Raspberry Pis.
Results showed that our proposed model can have a positive impact on easing the processing
of literature reviews and identifying trend topics for a given field. Overall, we found that
both unsupervised learning and supervised learning showed promising accuracy when
working with textual data. On the other hand, this solution does not perform well in terms of
execution time as the volume of data increases.
Our results also showed that distributed Raspberry Pis can have a highly positive
impact in terms of lowering costs and being energy efficient. Overall, we found that
machine learning algorithms can be adapted to run on distributed Raspberry Pis at low
cost and with low energy consumption compared to cloud alternatives. On the other hand, this
solution does not offer high scalability, and it requires more time for management, deployment,
and configuration.
A graph structure for representing the information extracted by machine learning techniques
is one of the most suitable forms for machine learning tasks and recommendation systems. It
allows us to query the data easily, represent all the relationships well, and achieve
performance and scalability in recommendation systems. Other data structures showed
poorer performance and increased complexity in the process of recommending and storing
the extracted information.
Keywords: recommendation system, supervised learning, unsupervised learning, text mining,
graph databases
Contents

1. INTRODUCTION ..................................................................................................................... 16
1.1 Problem description ............................................................................................................ 19
1.2 Hypothesis ........................................................................................................................... 22
1.3 Research Questions ............................................................................................................. 22
1.4 Methodology ....................................................................................................................... 22
1.5 Thesis Structure ................................................................................................................... 24
1.6 Conclusion ........................................................................................................................... 25
2. LITERATURE REVIEW ................................................................................................................. 27
2.1 Document Clustering ........................................................................................................... 32
2.1.1 K-means Clustering ....................................................................................................... 33
2.1.2 K-Means++ Clustering ................................................................................................... 34
2.1.3 K-Medoids Clustering ................................................................................................... 35
2.1.4 Hierarchical Clustering .................................................................................................. 38
2.1.5 Text representation formats ........................................................................................ 39
2.2 Supervised Learning ............................................................................................................ 41
2.2.1 k-NN Classifier ............................................................................................................... 45
2.2.2 Centroid Classifier ......................................................................................................... 46
2.2.3 Naive Bayes ................................................................................................................... 47
2.2.4 SVM – Support Vector Machine ................................................................................... 49
2.2.5 Convolutional Neural Network ..................................................................................... 50
2.3 Hadoop ................................................................................................................................ 51
2.3.1 Hadoop Architecture .................................................................................................... 53
2.3.2 Map Reduce and Spark ................................................................................................. 55
2.3.3 Real-time Data Stream Processing: Spark Streaming ................................................... 56
2.4 Storage Systems .................................................................................................................. 56
2.4.1 Hadoop vs. Relational Database Management Systems .............................................. 57
2.4.2 Hadoop vs. Data Warehouse ........................................................................................ 59
2.4.3 Cloud Solutions ............................................................................................................. 61
2.4.4 Graph Databases .......................................................................................................... 62
2.5 Conclusion ........................................................................................................................... 64
3. METHODOLOGY ........................................................................................................................ 65
3.1 Proposed Model .................................................................................................................. 68
3.2 Phase 1 – Initial text clustering ........................................................................................... 69
3.2.1 What is the right number of clusters? .......................................................................... 71
3.2.2 Outlier Clusters ............................................................................................................. 73
3.3 Phase 2 – A supervised learning model .............................................................................. 74
3.4 Phase 3 – Graph representation and topic modeling ......................................................... 76
3.5. Proposed Distributed Model .............................................................................................. 78
3.6 Text Pre-Processing ............................................................................................................. 81
3.7 Datasets ............................................................................................................................... 81
3.8 Experimental setup ............................................................................................................. 83
3.9 Proof of concept – Unsupervised Learning ......................................................................... 84
3.10 Proof of concept – Supervised Learning ........................................................................... 91
3.11 Conclusion ......................................................................................................................... 94
4. EXPERIMENTS AND RESULTS .................................................................................................... 96
4.1 Results Phase 1 – Text Clustering ........................................................................................ 96
4.2 Results Phase 2 – Supervised Learning ............................................................................. 105
4.3 Results Phase 3 – Graph representation and topic modeling ........................................... 118
4.4 Experiments in cost and energy consumption .................................................................. 123
4.5 Conclusion ......................................................................................................................... 127
5. DISCUSSION OF FINDINGS....................................................................................................... 129
5.1 Evaluation of Machine Learning Techniques .................................................................... 129
5.2 Cost and Energy Consumption .......................................................................................... 132
5.3 Text Pre-Processing ........................................................................................................... 133
5.4 Findings on Data Storage .................................................................................................. 134
5.5 Limitations ......................................................................................................................... 136
5.6 Conclusion ......................................................................................................................... 136
6. CONCLUSION ........................................................................................................................... 138
PUBLICATIONS AND PRESENTATIONS......................................................................................... 143
ACKNOWLEDGEMENT ................................................................................................................. 144
REFERENCES ................................................................................................................................ 145
APPENDIX A ................................................................................................................................. 160
APPENDIX B ................................................................................................................................. 165
APPENDIX C ................................................................................................................................. 169
APPENDIX D ................................................................................................................................. 225
List of Figures
Figure 1. Overall Architecture ....................................................................................................... 21
Figure 2. Paper distribution by year ............................................................................................. 29
Figure 3. Paper classification ........................................................................................................ 29
Figure 4. K-Means Clustering Process .......................................................................................... 34
Figure 5. K-Means vs. K-Medoids Algorithms ............................................................................... 35
Figure 6. K-Medoids Process ........................................................................................................ 37
Figure 7. Hierarchical Clustering ................................................................................................... 38
Figure 8. Agglomerative hierarchical clustering, a bottom-up approach (left); divisive
hierarchical clustering, a top-down approach (right) ................................................................... 39
Figure 9. Supervised Learning process.......................................................................................... 42
Figure 10. Decision tree (example) [Data mining Concepts and Techniques] .............................. 44
Figure 11. k-NN example ............................................................................................................... 46
Figure 12. Support Vector Machine Margin ................................................................................. 49
Figure 13. Convolutional Neural Network .................................................................................... 51
Figure 14. Hadoop Core Components .......................................................................................... 53
Figure 15. Hadoop Architecture.................................................................................................... 54
Figure 16. Hadoop Cluster ........................................................................................................... 54
Figure 17. Spark Streaming ........................................................................................................... 56
Figure 18. Data Warehouse Architecture ..................................................................................... 60
Figure 19. Neo4j ............................................................................................................................ 63
Figure 20. Proposed Model ........................................................................................................... 63
Figure 21-a. Phase 1 ...................................................................................................................... 68
Figure 21-b. Labeling Clusters ....................................................................................................... 70
Figure 22. Vector representation of textual data ......................................................................... 73
Figure 23. Phase 2 ......................................................................................................................... 75
Figure 24. Graph Structure ........................................................................................................... 76
Figure 25. Phase 3 Identifying trend topics .................................................................................. 78
Figure 26. Typical Master-Slave Architecture ............................................................................... 79
Figure 27. Master-Slave Architecture model with Raspberry PIs ................................................. 80
Figure 28. Phase 1 Clustering Accuracy ........................................................................................ 99
Figure 29. Execution Time in seconds ......................................................................................... 101
Figure 30. Visualization, Top Generated Clusters ....................................................................... 104
Figure 31. Phase 2 Supervised Learning Accuracy ...................................................................... 107
Figure 32. Phase 2 Execution Time in seconds ........................................................................... 109
Figure 33. Natural Language Processing Graph. Group of Papers which all belong to a specific
field in NLP .................................................................................................................................. 119
Figure 34. Medical Graph. ........................................................................................................... 120
Figure 35. Medical and the bridge papers with other fields ...................................................... 121
Figure 36. Natural Language processing and the bridge papers with other fields..................... 121
Figure 37. Computer Vision and the bridge papers with other fields ........................................ 122
Figure 38. Playing Games and the bridge papers with other fields ............................................ 122
Figure 39. Cost Comparison for 1 year ....................................................................................... 123
Figure 40. Cost Comparison for 1 year ....................................................................................... 125
Figure 41. Power consumption (Watt per hour) of Physical servers with near 100% CPU
utilization. Source: https://www.anandtech.com/show/7285/intel-xeon-e5-2600-v2-12-core-
ivy-bridge-ep/11 ......................................................................................................................... 126
List of Tables
Table 1 Research articles by field ................................................................................................. 28
Table 2. RDBMS vs. Hadoop .......................................................................................................... 58
Table 3. RDBMS vs. MapReduce ................................................................................................... 58
Table 4. Data Warehouse vs Hadoop [71] .................................................................................... 60
Table 5. Dataset organization ....................................................................................................... 82
Table 6. k-NN accuracy (3 classes) ................................................................................................ 86
Table 7. k-NN accuracy (5 classes) ................................................................................................ 87
Table 8. k-NN accuracy (3 classes only keywords) ........................................................................ 88
Table 9. k-NN accuracy (5 classes only keywords) ........................................................................ 88
Table 10. News articles – Experiment 1 ........................................................................................ 91
Table 11. News articles – Experiment 2 ........................................................................................ 91
Table 12. Testing the accuracy of classifiers ................................................................................. 91
Table 13. Classify Politics News Articles (Total news articles: 49) ................................................ 92
Table 14. Classify Technology News Articles (Total news articles: 86) ......................................... 92
Table 15. Classify Sports News Articles (Total news articles: 102) ............................................... 92
Table 16. Execution time (in seconds) .......................................................................................... 93
Table 17. Phase 1 Unsupervised Learning Accuracy ..................................................................... 98
Table 18. Efficiency of Silhouette Coefficient (input: 7 clusters) .................................................. 99
Table 19. Phase 1 Unsupervised Learning Execution Time in seconds ...................................... 100
Table 20. Generated clusters from Dataset 1 ............................................................................. 102
Table 21. Top Generated Clusters from Dataset 1 ..................................................................... 102
Table 22. Phase 2 Supervised Learning Accuracy ....................................................................... 105
Table 23. Phase 2 Supervised Learning Average Accuracy ......................................................... 108
Table 24. Naive Bayes - 7 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 - Miscellaneous, 4 – Natural Language Processing, 5 – Playing Games, 6 -
Speech ......................................................................................................................................... 110
Table 25. Naive Bayes (7 classes) - Classification Report 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 - Miscellaneous, 4 – Natural Language Processing, 5 – Playing Games, 6 -
Speech ......................................................................................................................................... 110
Table 26. SVM - 7 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 – Methodology,
3 - Miscellaneous, 4 – Natural Language Processing, 5 – Playing Games, 6 - Speech ................ 111
Table 27. SVM (7 classes) - Classification Report 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 - Miscellaneous, 4 – Natural Language Processing, 5 – Playing Games, 6 -
Speech ......................................................................................................................................... 112
Table 28. Logistic Regression - 7 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 - Miscellaneous, 4 – Natural Language Processing, 5 – Playing Games, 6 -
Speech ......................................................................................................................................... 112
Table 29. Logistic Regression (7 classes) - Classification Report 0 - Computer Vision, 1 – Medical,
2 – Methodology, 3 - Miscellaneous, 4 – Natural Language Processing, 5 – Playing Games, 6 -
Speech ......................................................................................................................................... 113
Table 30. Decision Tree - 7 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 - Miscellaneous, 4 – Natural Language Processing, 5 – Playing Games, 6 -
Speech ......................................................................................................................................... 114
Table 31. Decision Tree (7 classes) - Classification Report 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 - Miscellaneous, 4 – Natural Language Processing, 5 – Playing Games, 6 -
Speech ......................................................................................................................................... 114
Table 32. KNN - 7 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 – Methodology,
3 - Miscellaneous, 4 – Natural Language Processing, 5 – Playing Games, 6 - Speech ................ 115
Table 33.KNN (7 classes) - Classification Report 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 - Miscellaneous, 4 – Natural Language Processing, 5 – Playing Games, 6 -
Speech ......................................................................................................................................... 116
Table 34. Random Forest - 7 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 - Miscellaneous, 4 – Natural Language Processing, 5 – Playing Games, 6 -
Speech ......................................................................................................................................... 117
Table 35. Random Forest (7 classes) - Classification Report 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 - Miscellaneous, 4 – Natural Language Processing, 5 – Playing Games, 6 -
Speech ......................................................................................................................................... 117
Table 36. PRO attributes for different execution platforms ....................................................... 126
Table 37. Comparison of environments ..................................................................................... 133
Table 38. Comparison of OLAP and OLTP ................................................................................... 135
Table 39. Random Forest - 2 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical ...... 169
Table 40. Random Forest (2 classes) - Classification Report 0 - Computer Vision, 1 – Medical 169
Table 41. Random Forest - 3 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 -
Methodology ............................................................................................................................... 169
Table 42. Random Forest (3 classes) - Classification Report 0 - Computer Vision, 1 – Medical, 2 -
Methodology ............................................................................................................................... 170
Table 43. Random Forest - 4 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 – Natural Language Processing ........................................................................ 170
Table 44. Random Forest (4 classes) - Classification Report 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 – Natural Language Processing ........................................................................ 171
Table 45. Random Forest - 5 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 - Miscellaneous, 4 – Natural Language Processing .......................................... 171
Table 46. Random Forest (5 classes) - Classification Report 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 - Miscellaneous, 4 – Natural Language Processing .......................................... 172
Table 47. Random Forest - 6 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 - Miscellaneous, 4 – Natural Language Processing, 5 – Playing Games ........... 172
Table 48. Random Forest (6 classes) - Classification Report 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 - Miscellaneous, 4 – Natural Language Processing, 5 – Playing Games ........... 173
Table 49. KNN - 2 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical ........................ 174
Table 50. KNN (2 classes) - Classification Report 0 - Computer Vision, 1 – Medical .................. 174
Table 51. KNN - 3 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 - Methodology
..................................................................................................................................................... 174
Table 52. KNN (3 classes) - Classification Report 0 - Computer Vision, 1 – Medical, 2 -
Methodology ............................................................................................................................... 175
Table 53. KNN - 4 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 – Methodology,
3 – Natural Language Processing ................................................................................................ 175
Table 54. KNN (4 classes) - Classification Report 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 – Natural Language Processing ........................................................................ 176
Table 55. KNN - 5 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 – Methodology,
3 - Miscellaneous, 4 – Natural Language Processing .................................................................. 176
Table 56. KNN (5 classes) - Classification Report 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 - Miscellaneous, 4 – Natural Language Processing .......................................... 177
Table 57. KNN - 6 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 – Methodology,
3 - Miscellaneous, 4 – Natural Language Processing, 5 – Playing Games .................................. 177
Table 58. KNN (6 classes) - Classification Report 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 - Miscellaneous, 4 – Natural Language Processing, 5 – Playing Games ........... 178
Table 59. Decision Tree - 2 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical ......... 179
Table 60. Decision Tree (2 classes) - Classification Report 0 - Computer Vision, 1 – Medical ... 179
Table 61. Decision Tree - 3 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 -
Methodology ............................................................................................................................... 179
Table 62.Decision Tree (3 classes) - Classification Report 0 - Computer Vision, 1 – Medical, 2 -
Methodology ............................................................................................................................... 180
Table 63. Decision Tree - 4 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 – Natural Language Processing ........................................................................ 180
Table 64. Decision Tree (4 classes) - Classification Report 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 – Natural Language Processing ........................................................................ 181
Table 65. Decision Tree - 5 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 - Miscellaneous, 4 – Natural Language Processing .......................................... 181
Table 66. Decision Tree (5 classes) - Classification Report 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 - Miscellaneous, 4 – Natural Language Processing .......................................... 182
Table 67. Decision Tree - 6 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 - Miscellaneous, 4 – Natural Language Processing, 5 – Playing Games ........... 182
Table 68. Decision Tree (6 classes) - Classification Report 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 - Miscellaneous, 4 – Natural Language Processing, 5 – Playing Games ........... 183
Table 69. Logistic Regression - 2 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical 184
Table 70. Logistic Regression (2 classes) - Classification Report 0 - Computer Vision, 1 – Medical
..................................................................................................................................................... 184
Table 71. Logistic Regression - 3 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 -
Methodology ............................................................................................................................... 184
Table 72. Logistic Regression (3 classes) - Classification Report 0 - Computer Vision, 1 – Medical,
2 - Methodology .......................................................................................................................... 185
Table 73. Logistic Regression - 4 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 – Natural Language Processing ........................................................................ 185
Table 74. Logistic Regression (4 classes) - Classification Report 0 - Computer Vision, 1 – Medical,
2 – Methodology, 3 – Natural Language Processing .................................................................. 186
Table 75. Logistic Regression - 5 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 - Miscellaneous, 4 – Natural Language Processing .......................................... 186
Table 76. Logistic Regression (5 classes) - Classification Report 0 - Computer Vision, 1 – Medical,
2 – Methodology, 3 - Miscellaneous, 4 – Natural Language Processing .................................... 187
Table 77. Logistic Regression - 6 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 - Miscellaneous, 4 – Natural Language Processing, 5 – Playing Games ........... 187
Table 78. Logistic Regression (6 classes) - Classification Report 0 - Computer Vision, 1 – Medical,
2 – Methodology, 3 - Miscellaneous, 4 – Natural Language Processing, 5 – Playing Games ..... 188
Table 79. SVM - 2 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical ....................... 189
Table 80. SVM (2 classes) - Classification Report 0 - Computer Vision, 1 – Medical .................. 189
Table 81. SVM - 3 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 - Methodology
..................................................................................................................................................... 189
Table 82. SVM (3 classes) - Classification Report 0 - Computer Vision, 1 – Medical, 2 -
Methodology ............................................................................................................................... 190
Table 83. SVM - 4 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 – Methodology,
3 – Natural Language Processing ................................................................................................ 190
Table 84. SVM (4 classes) - Classification Report 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 – Natural Language Processing ........................................................................ 191
Table 85. SVM - 5 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 – Methodology,
3 - Miscellaneous, 4 – Natural Language Processing .................................................................. 191
Table 86. SVM (5 classes) - Classification Report 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 - Miscellaneous, 4 – Natural Language Processing .......................................... 192
Table 87. SVM - 6 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 – Methodology,
3 - Miscellaneous, 4 – Natural Language Processing, 5 – Playing Games .................................. 192
Table 88. SVM (6 classes) - Classification Report 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 - Miscellaneous, 4 – Natural Language Processing, 5 – Playing Games ........... 193
Table 89. Naive Bayes - 3 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 -
Methodology ............................................................................................................................... 194
Table 90. Naive Bayes (3 classes) - Classification Report 0 - Computer Vision, 1 – Medical, 2 -
Methodology ............................................................................................................................... 194
Table 91. Naive Bayes - 4 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 – Natural Language Processing ........................................................................ 195
Table 92. Naive Bayes (4 classes) - Classification Report 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 – Natural Language Processing ........................................................................ 195
Table 93. Naive Bayes - 5 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 - Miscellaneous, 4 – Natural Language Processing .......................................... 196
Table 94. Naive Bayes (5 classes) - Classification Report 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 - Miscellaneous, 4 – Natural Language Processing .......................................... 196
Table 95. Naive Bayes - 6 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 - Miscellaneous, 4 – Natural Language Processing, 5 – Playing Games ........... 197
Table 96. Naive Bayes (6 classes) - Classification Report 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 - Miscellaneous, 4 – Natural Language Processing, 5 – Playing Games ........... 197
List of Abbreviations
SVM Support Vector Machine
k-NN k-Nearest Neighbor
TF Term Frequency
TF-IDF Term Frequency Inverse Document Frequency
LDA Latent Dirichlet Allocation
NMF Non-negative Matrix Factorization
LDA Linear Discriminant Analysis
CNN Convolutional Neural Network
NLP Natural Language Processing
1. INTRODUCTION
The growth of technologies and the continuous generation of data have posed unique
challenges, especially to the data mining community. These challenges have motivated
researchers and industry analysts to continually develop new tools and methods for improving
the application of various machine-learning techniques [1] [2] [46]. The main goal is to identify
patterns and build recommendation systems and predictive models that eventually support an
organization's decision-making process. The application of machine-learning techniques is
broad and spans many different research areas [38] [62].
Nowadays, data is considered one of the most valuable assets that organizations and
companies are willing to acquire. Vast volumes of data are being captured to gain better
insight into business processes, operations, products, customers, and more [25].
Because the volume of unstructured data is growing rapidly, many enterprises are also turning
to technological solutions to better manage and store their unstructured data. These can include
hardware or software solutions that enable them to make the most efficient use of their
available storage space [1] [3] [5] [44].
In machine-learning tasks, supervised learning methods are essential because they allow us to
make predictions. Supervised learning is also known as classification [9] [35]. Unsupervised
learning methods are also very commonly used in data mining. Their primary purpose is to
discover groups (clusters) of similar data, where elements in the same group are very similar
to each other and differ from the elements of other groups. Cluster analysis has been widely used in many
applications such as business intelligence, image pattern recognition, web search, biology, and
security. It is also used to improve recommendation systems. Similarly, search engines use
clustering; clustering mechanisms improve both the quality and the speed of a search [6]
[26] [28].
Statistical and analytical algorithms have recently shown favorable results in working with
structured data. However, analyzing semi-structured and unstructured data is not a
straightforward task; most proposed solutions are ad hoc and apply only to specific
problems [10] [41] [54].
Text mining is one of the most challenging areas in machine-learning applications, mainly
because of the nature of the data. Textual data is unstructured and, as such, requires additional
pre-processing steps [117]. Two of the most important measurements when applying machine-
learning techniques, especially to unstructured textual data, are accuracy and performance.
Accuracy issues emerge from the variety of the data, and performance issues from its
enormous volume. To tackle these issues, one must establish a well-defined strategy to store
and process "Big Data" [62].
The two significant challenges in the world of big data are 1) storing the data and 2) processing
the data. Since more and more data are available on the Web, applying statistical and analytical
algorithms to these data is an important topic nowadays. Processing massive amounts of
unstructured data usually requires more in-depth analysis and more pre-processing than
traditional data-mining techniques applied to regular datasets. Traditional algorithms face
performance and accuracy limitations when working with massive amounts of data; therefore,
new programming models and paradigms are needed to overcome these challenges [129].
MapReduce, Storm, and Spark are the most common frameworks used today for processing
big data. The main advantage of these programming models is that they can process data in
parallel in a distributed environment [120].
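The map and reduce steps of this paradigm can be illustrated with a single-process, stdlib-only Python sketch of the classic word count (a toy, not a distributed implementation):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(documents):
    """Map step: emit a (word, 1) pair for every word occurrence."""
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce step: group pairs by key (word) and sum the counts."""
    shuffled = sorted(pairs, key=itemgetter(0))  # the "shuffle" stage
    return {word: sum(count for _, count in group)
            for word, group in groupby(shuffled, key=itemgetter(0))}

docs = ["big data needs parallel processing",
        "parallel processing of big data"]
counts = reduce_phase(map_phase(docs))
print(counts["big"], counts["parallel"])  # 2 2
```

In a real MapReduce framework the map and reduce calls run on different machines, and the shuffle stage moves all pairs with the same key to the same reducer; the sketch only mirrors that data flow.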
Machine-learning algorithms, both supervised and unsupervised, cannot easily be integrated
into distributed environments to process data in parallel. Additional changes to the algorithms
are required to make them suitable for processing big data: first, the structure of the data
differs from traditional datasets, and second, not every sequential algorithm can easily be
adapted and transformed to work in parallel [23].
Various models and techniques have been proposed for overcoming these two challenges,
including distributed environments for parallel processing, cloud platforms, high-performance
computing resources, GPU processing, etc. [1] [3] [122]. However, adding more resources is
not always the best or preferred solution in these scenarios. On the one hand, distributed
environments for parallel processing are widely used and arguably the most suitable for Big
Data [3] [5]. However, any custom distributed environment comes with the cost of
configuration and the knowledge required to prepare it. On the other hand, cloud solutions
have become very attractive, since they allow processing, distributed data stores, and
integration with legacy systems. Nevertheless, cost is the main drawback of the cloud and the
most significant limitation for researchers and for small and medium-sized companies that
cannot afford it [1].
Recommendation systems are becoming increasingly important in helping people with
different tasks. They are widely used in computer science, finance, medicine, sports, and many
other areas to automate or semi-automate various tasks. They are built on top of different
data sets and data structures, ranging from databases and textual data to the Web and
streaming data from various sensors [51] [67] [76]. These systems can
also help researchers find the most relevant research articles for their research fields.
Therefore, in this thesis, we propose and evaluate a hybrid model for recommending research
articles built using both supervised and unsupervised machine-learning techniques.
Extracting information and knowledge from unstructured data is becoming more and more
important, since machine-learning techniques have the potential to organize textual data and
extract details from it [10].
Improving the literature review process is crucial; therefore, we target this problem using
machine-learning techniques. Solving it is critical for helping new researchers produce
higher-quality and more novel research topics. One of the most important aspects is making
sure researchers can explore research fields more easily and check the recent activities and
topics in each field. As a result, they need the ability to examine the dependencies and
relationships between various research fields in recent years.
1.1 Problem description
Researchers spend too much time reading others' work and finding research questions [143].
This process requires considerable effort in reading and classifying relevant papers. The
literature review process is a challenging task for new researchers in different fields of study.
Through the proposal in this thesis, we try to ease the literature review process and shorten
the time needed to define a research problem.
By analyzing possible research gaps, we define the problem we target: can we automate the
process of literature review?
Our study aims to collect/retrieve and analyze research/scientific articles by applying machine-
learning techniques to recommend research articles and/or research gaps to researchers based
on their research fields. By research articles, we mean scientific articles of any field, for which
we take the following attributes into consideration:
• Title – the title of the research article plays an important role in the analysis process,
• Author/s – it is important to know the author/s of the research article,
• Year of publication,
• Abstract,
• Keywords,
• Content,
• Contribution,
• Results,
• Future work,
• Conference/Journal,
• Related articles,
• Bibliography.
We propose a hybrid model, built on a large dataset of research articles, for recommending
research articles to researchers. This approach takes as input parameters an abstract, a list of
keywords, research articles, or research field(s). In Figure 1, we present the overall architecture
of our research experiment.
We aim to automate and/or semi-automate the process of literature review using machine-
learning algorithms. Nowadays, researchers, Master's students, and Ph.D. candidates spend too
much time identifying the research fields they want to target; in addition, they spend too much
time investigating the research gaps and possible future work in a field. Therefore, our main
contribution is to build a hybrid model based on machine-learning algorithms (supervised and
unsupervised learning).
We will also focus on integrating various supervised and unsupervised learning techniques
into our model. We will evaluate these techniques and select the most efficient ones.
Our aim is also to generate a scalable and updatable training set from any textual dataset.
The training set can be used to recommend research articles or to perform other analyses such
as feature extraction, classification of new research articles, and correlation between different
research fields.
Figure 1. Overall Architecture
The first part is integrating digital research articles from various digital libraries. As a result of
this first phase, we will have a centralized database that can be used to construct a
recommendation system. The second phase is to build a model based on unsupervised and
supervised learning algorithms, which, as an output, will generate a training dataset (model)
that will be used to recommend research articles. The third phase will analyze the quality of the
model; the focus will be on the model's ability to update itself and on its scalability. By
scalability, we mean the distribution of our training set as the dataset grows in the future. The
fourth phase will analyze the models for storing our training set so as to have an efficient
recommendation system. Finally, researchers will test and evaluate the model based on their
queries.
1.2 Hypothesis
1. The literature review process can be simplified by using a hybrid model based on text-
mining algorithms.
1.3 Research Questions
1. Which unsupervised learning algorithms provide the most efficient results on large data
sets?
2. Which data pre-processing techniques provide higher accuracy for unsupervised and
supervised learning on textual documents?
3. How to generate an efficient training set by using machine-learning algorithms on
textual documents?
4. How to construct an updatable model based on unsupervised and supervised learning
algorithms? Which is the best data storage for machine-learning tasks?
5. What are the different Supervised Learning Techniques for textual documents? Which
techniques are the most efficient for our research?
1.4 Methodology
Our research addresses one hypothesis and five research questions. We will analyze whether
machine-learning techniques can automate or semi-automate the literature review process. To
this end, we propose an efficient hybrid model on top of which we will run our experiments.
Finally, we plan a case study to evaluate the model, constructed on a specific data set.
The process of our literature review is an example of our qualitative research. For data
gathering, we have analyzed others' work on text mining, data mining, and big data. As a result,
we have extracted possible research gaps in the field and identified the research problem we
aim to solve.
Experiments and the generated empirical results will be the basis of our quantitative research
methodology. We will also compare the efficiency of various supervised and unsupervised
learning techniques.
A suitable plan of action must be established and carried out to resolve the problem. This
section introduces the chosen research strategy and specific scientific techniques for data
collection. Subsequently, the application of the method and research ethics are described.
In terms of strategy, this research follows the experimental approach ("An experiment is an
empirical investigation under controlled conditions designed to examine the properties of,
and relationships between, specific factors" [7]).
Our focus is on supervised and unsupervised learning algorithms because we see the potential
of combining classification and clustering algorithms into a hybrid solution that yields a highly
accurate model for textual data. Our study focuses on highly accurate data-mining techniques
for textual data. We also analyze the need for highly efficient recommendation systems and
the limitations of algorithms on big datasets.
The goal of this thesis is to evaluate the proposed model for recommending research articles. It
will present the overall accuracy and the accuracy of the individual steps. Multiple supervised
and unsupervised techniques have been taken into consideration to evaluate and identify the
most efficient techniques for this type of system. Our primary hypothesis is that we can ease
the literature review process for researchers and identify trend topics for a given field.
Our solution is based on machine-learning techniques, and it contains 3 (three) phases:
1. Generating a group of relevant articles from raw input of research articles. By applying
unsupervised learning techniques, the aim is to generate “n” number of clusters where
we have similar articles on each cluster. The outputs of this phase are the clusters,
distance between clusters, centroid, outlier clusters, cluster labels, and extracted
keywords for each cluster.
2. Using the clusters generated in Phase 1 to build a model based on supervised
learning techniques, which is used to add more articles in the future and to recommend
research articles. Various machine-learning techniques have been considered in this
phase; the results of the experiments are presented in Chapter 5.
3. Using topic modeling to extract trend topics and keywords for specific fields (clusters).
All outcomes, i.e. the information extracted from all three phases, have been presented in a
graph structure, which is used as the basis of our recommendation system and the extraction of
trend topics.
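As a toy illustration of how extracted information can live in such a graph and back a recommendation lookup, consider an adjacency-list graph linking articles to clusters and clusters to keywords. The node names and schema below are purely hypothetical, not the ones used in this thesis:

```python
# A minimal graph as a Python dict: articles point to their cluster,
# clusters point back to their member articles and extracted keywords.
graph = {
    "cluster:nlp": {"keywords": ["parsing", "embedding"], "articles": ["a1", "a2"]},
    "cluster:cv":  {"keywords": ["segmentation"], "articles": ["a3"]},
    "a1": {"cluster": "cluster:nlp"},
    "a2": {"cluster": "cluster:nlp"},
    "a3": {"cluster": "cluster:cv"},
}

def recommend(article_id, graph):
    """Recommend the other articles that share the query article's cluster."""
    cluster = graph[article_id]["cluster"]
    return [a for a in graph[cluster]["articles"] if a != article_id]

print(recommend("a1", graph))  # ['a2']
```

A production system would store such a graph in a dedicated graph database rather than an in-memory dict, but the traversal pattern (article, cluster, sibling articles) stays the same.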
To evaluate existing techniques and propose the best approach, we have tested three
unsupervised learning techniques for Phase 1, and 15 supervised learning techniques for Phase
2 on two different datasets. For this purpose, we conducted five studies:
1. Evaluation of the accuracy and execution time of unsupervised learning techniques,
which represents Phase 1 of our model.
2. Evaluation of supervised learning techniques, which represents Phase 2 of our model.
3. Evaluation of the process of identifying trend topics for specific fields of topic modeling.
4. Comparison of the cost and energy consumption of our model running in three different
environments: 1) a cloud instance, 2) cloud functions, and 3) distributed Raspberry Pis.
5. Evaluation of the different data sources that can be used to organize the extracted
information and build efficient recommendation systems.
1.5 Thesis Structure
Chapter 2 of the thesis presents the background and state of the art in the field. In this chapter,
we first list the references used for the state of the art. It continues with an overview of
various machine-learning techniques. We present a survey of unsupervised learning
techniques, their different types, and their characteristics. It continues with an overview of
supervised learning techniques and their application to textual data. In addition, the state of
the art of various data representation models for textual data is presented. The last part of
Chapter 2 covers the benefits of Hadoop for Big Data and different data storage models,
starting from traditional ones, continuing with distributed data storage, and finally giving an
overview of cloud data platforms and graph structures.
In Chapter 3, the proposed solution is presented: a hybrid model for recommending
research articles, with all three phases and the data representation described in detail.
Section 3.4 presents an enhanced version of the proposed model that can run on
distributed Raspberry Pis. Section 3.5 shows all the pre-defined pre-processing steps applied
to the textual data in our experiments. In Section 3.6, we list the datasets used in the
different phases of the model's evaluation, and finally, in Section 3.7, the experimental setup
and technical details are presented.
All experiments and results are presented in Chapter 4. It contains the detailed results of all
5 (five) studies conducted in this thesis. Initially, it presents proof-of-concept results for
unsupervised and supervised learning techniques. It then continues with the results of
Phase 1 and Phase 2 of the proposed model. In Section 4.4, the results of the cost and energy
consumption comparisons and experiments are presented. In the last part, the outcomes of
topic modeling and the graph representation of the extracted information are presented.
Chapter 5 presents the discussion of the findings derived from the experiments in Chapter 4.
In this chapter, we also compare our results with other similar studies and present the overall
findings for the different phases, environments, and experimental setups.
Chapter 6, Conclusion, is the last section of this study. It presents an overview of all the work
in this Ph.D. dissertation, along with our plans for future activities.
1.6 Conclusion
In this chapter, we stated the research problem, which emerged as a result of the literature
review presented in Chapter 2. We saw that research articles can be considered
unstructured data represented as textual documents. In addition, new researchers and
Master's and Ph.D. students spend too much time defining their research fields and theses.
There is potential to ease the literature review process by applying machine-learning
techniques to textual documents.
In Section 1.2, the central hypothesis was stated, along with five research questions. The
central hypothesis asks whether the literature review process can be simplified using machine-
learning techniques on textual data. In addition, the research questions target various issues
concerning the efficiency of current machine-learning techniques for textual data.
We also discussed the importance and benefits of solving this problem: it will improve the
quality of research work for young researchers. The methodology for targeting this research
problem relies on experiments and empirical results. As a result, a hybrid model will be
proposed based on machine-learning techniques. The goal is to extract information from
unstructured data (in our case, research articles) and store the extracted information in a
graph structure. Finally, multiple supervised and unsupervised learning techniques will be
evaluated.
2. LITERATURE REVIEW
Data on the Web is increasing rapidly, and more and more data in different formats is being
generated. This leads us to a new concept: big data. Every big data task raises two main
concerns:
• Storing a massive amount of data (big data)
• Processing large datasets
All data comes either from external sources, such as mobile devices, social media, IoT, and
media, or from internal sources, such as transactions, log data, e-mails, etc. In total, less than
0.5% of all data is ever analyzed and used; therefore, analyzing data and applying machine
learning and statistical analysis to it is very important nowadays [57].
In machine-learning tasks, supervised learning methods are essential because they allow us to
make predictions. Supervised learning is also known as "classification." Classification algorithms
consist of two parts. The first is the learning part, where the model is constructed from
training data. The second is the classification part, where the model is used for prediction.
Many classification algorithms have been proposed; the most popular are Random Forest,
Logistic Regression, Decision Tree Induction, Bayesian classifiers, Neural Networks, Support
Vector Machines, etc. Every algorithm has its pros and cons, its application field, and the type
of data it is applied to [4] [8] [18] [19].
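This two-part structure (a learning step that builds the model from training data, then a classification step that uses it for prediction) can be sketched with a minimal multinomial Naive Bayes over word counts. This is a stdlib-only toy with invented training data, not the implementation evaluated in this thesis:

```python
import math
from collections import Counter

class TinyNaiveBayes:
    def fit(self, docs, labels):
        """Learning part: estimate class priors and per-class word counts."""
        self.classes = set(labels)
        self.priors = {c: labels.count(c) / len(labels) for c in self.classes}
        self.word_counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, doc):
        """Classification part: pick the class with the highest log posterior."""
        best, best_score = None, -math.inf
        for c in self.classes:
            total = sum(self.word_counts[c].values())
            score = math.log(self.priors[c])
            for word in doc.lower().split():
                # Laplace smoothing keeps unseen words from zeroing the score.
                score += math.log((self.word_counts[c][word] + 1)
                                  / (total + len(self.vocab)))
            if score > best_score:
                best, best_score = c, score
        return best

clf = TinyNaiveBayes().fit(
    ["image pixel segmentation", "convolution image filter", "parse sentence grammar"],
    ["vision", "vision", "nlp"])
print(clf.predict("segmentation of an image"))  # vision
```

The same fit/predict split appears in every classifier discussed in this chapter; only the statistics estimated in the learning part differ.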
Unsupervised learning methods are also very commonly used in data mining. Their primary
purpose is to discover groups (clusters) of similar data, where elements in the same group are
very similar to each other and differ from the elements of other groups. Cluster analysis has
been widely used in many applications such as business intelligence, image pattern
recognition, web search, biology, and security. It is also used to improve recommendation
systems. In addition, search engines use clustering; clustering mechanisms improve both the
quality and the speed of a search. Outlier detection is another application of clustering: by
finding the elements that do not belong to any cluster (group), we detect the outliers [6] [31]
[52] [53].
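Cluster-based outlier detection can be sketched by flagging points that lie too far from every cluster centroid. This is a stdlib-only toy with hand-picked centroids and points; a real pipeline would first learn the centroids, e.g. with k-means:

```python
import math

def nearest_centroid_distance(point, centroids):
    """Distance from a point to the closest cluster centroid."""
    return min(math.dist(point, c) for c in centroids)

def find_outliers(points, centroids, threshold):
    """A point that belongs to no cluster (too far from every centroid) is an outlier."""
    return [p for p in points if nearest_centroid_distance(p, centroids) > threshold]

centroids = [(0.0, 0.0), (10.0, 10.0)]
points = [(0.5, 0.2), (9.8, 10.1), (5.0, 5.0)]
print(find_outliers(points, centroids, threshold=2.0))  # [(5.0, 5.0)]
```

The threshold is the key tuning knob: too small, and ordinary points near cluster edges are flagged; too large, and genuine outliers slip through.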
The process of the literature review is based on six research questions:
1. Which are the most crucial text-mining techniques for text classification and text
clustering?
This question identifies the state of the art in text mining algorithms and techniques and the
most recent algorithms proposed.
2. What data transformation models are used for textual data?
This question is used to determine the impact of the models on the efficiency of the algorithms
considering accuracy.
3. What are the limitations of high-performance text-mining algorithms on big datasets?
This question provides information on the limitations of traditional approaches on big
datasets.
4. Which are the proposed models for high-performance data-mining algorithms?
This question provides the list of proposed solutions for high-performance data-mining
algorithms on big datasets.
5. What contributions have been proposed for research article analysis?
This question shows whether there are proposed solutions or related work.
6. What is the importance of high-performance text-mining algorithms nowadays?
This question identifies the trends related to our research field.
To answer the six research questions, we manually analyzed research articles from the IEEE
Xplore and ACM digital libraries. In total, we analyzed more than 130 relevant research
articles, published between 1994 and 2019. The distribution of papers by primary research
field is as follows:
Table 1. Research articles by field
Text Mining: 50
Big Data: 23
Data Mining: 45
Parallel/Distributed Programming Models: 15
Figure 2. Paper distribution by year
The classification of the research articles is as follows:

Figure 3. Paper classification: Data Type (Text, Log, Relational DB, Web, Social Media);
Algorithms (Classification, Clustering, Pattern Mining, Information Retrieval, Term Frequency,
TF-IDF, Enhanced TF-IDF, Word2Vec); Paradigm (Traditional, Parallel-Distributed
Programming); Contribution (Model, Method, Review, Survey).

In the first and third research questions of the literature review, we identify the state of the art
in text-mining algorithms and techniques and the most recently proposed algorithms. While
data mining is a broad field, our focus was only on the algorithms and techniques applied to
textual data sets. For this part, we have included two groups of techniques: 1) classification and 2)
clustering.
From the papers, we can see that there are various techniques proposed for text clustering and
text classification. In most of the papers, we see that the focus is on improving the algorithms'
accuracy. The proposed classification algorithms are from two groups: 1) lazy learners and 2)
eager learners.
Different versions of probabilistic algorithms, such as the Naïve Bayes algorithm, are widely
used [64] [130]. Naïve Bayes has proven to be an accurate algorithm when the training set is
large enough. SVM is another proposed solution for textual data classification; various
researchers present it as one of the most accurate algorithms for text classification. Its only
drawback is that its accuracy decreases as the number of classes increases [4] [30] [130].
Hybrid solutions, combining two or more classification algorithms, have also been proposed.
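A hybrid solution of this kind can be as simple as majority voting over the base classifiers' predictions. The sketch below uses toy rule-based functions standing in for trained models; all names and rules are illustrative only:

```python
from collections import Counter

def majority_vote(classifiers, doc):
    """Hybrid classification: each base classifier votes; the majority label wins."""
    votes = [clf(doc) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]

# Toy rule-based "classifiers" standing in for trained models.
clf_a = lambda d: "vision" if "image" in d else "nlp"
clf_b = lambda d: "vision" if "pixel" in d else "nlp"
clf_c = lambda d: "nlp"

print(majority_vote([clf_a, clf_b, clf_c], "image pixel data"))  # vision
```

More elaborate hybrids weight the votes by each base classifier's validation accuracy, or feed the base predictions into a meta-classifier (stacking).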
The two algorithms k-NN and Centroid Classifier have been shown to achieve high
classification accuracy on text documents [19]. When the number of documents is not very
high, the algorithms also perform well in execution time. However, as the amount of input
data in the training set increases, the execution time increases too. As data on the Web grows,
traditional algorithms are no longer sufficient; therefore, we are required to process data
using parallel/distributed programming models [130].
Various techniques have been proposed for information retrieval and feature extraction on
textual data [2] [11] [12]. One of the most fundamental techniques is term analysis of text
documents. The Bag-of-Words representation and term frequency are often used when
experimenting with textual documents; both have shown good accuracy. Term Frequency -
Inverse Document Frequency (TF-IDF) scores the importance of a word (term) in a textual
document by how frequently it appears in that document, discounted by how many documents
in the collection contain it. It has been shown to increase the accuracy of supervised and
unsupervised learning algorithms.
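The TF-IDF score can be computed directly from its definition: the term's frequency within the document, scaled by the logarithm of the inverse document frequency. Below is a stdlib-only sketch of the common log-IDF variant; libraries such as scikit-learn apply additional smoothing:

```python
import math

def tf_idf(term, doc, corpus):
    """Score a term: frequent in this document, rare across the corpus."""
    tf = doc.count(term) / len(doc)                   # term frequency
    df = sum(1 for d in corpus if term in d)          # document frequency
    idf = math.log(len(corpus) / df) if df else 0.0   # inverse document frequency
    return tf * idf

corpus = [["deep", "learning", "model"],
          ["graph", "model", "storage"],
          ["deep", "neural", "network"]]
# "model" appears in 2 of 3 documents, "learning" only in the first,
# so "learning" scores higher for the first document.
print(tf_idf("learning", corpus[0], corpus) > tf_idf("model", corpus[0], corpus))  # True
```

A term that appears in every document gets idf = log(1) = 0, which is exactly why TF-IDF suppresses ubiquitous words that carry little discriminative information.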
Word sequences, graph structures, and word embeddings (e.g., word2vec) are the most recent
models proposed by researchers; they have shown promising results in various applications on
various datasets [2] [65] [66]. These recent models open the possibility of more advanced
machine-learning applications on textual datasets.
Word2vec is a model that produces word embeddings. Such models are shallow, two-layer
neural networks trained to reconstruct the linguistic contexts of words. Word2vec takes a text
corpus as input and produces a vector space, usually of several hundred dimensions, in which
each unique word in the corpus is assigned a corresponding vector [12].
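The training data for word2vec's skip-gram variant consists simply of (target, context) word pairs drawn from a sliding window over the text. Generating these pairs takes only a few lines; the sketch below covers the data preparation step, not the neural-network training itself:

```python
def skipgram_pairs(tokens, window=2):
    """Emit (target, context) pairs for every word within the window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

pairs = skipgram_pairs("the model learns word vectors".split(), window=1)
print(pairs[:3])  # [('the', 'model'), ('model', 'the'), ('model', 'learns')]
```

The network is then trained to predict the context word from the target word; the learned weights of its hidden layer become the word vectors.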
Word2vec was created by Tomas Mikolov and his research team at Google. The word2vec
algorithm has many advantages compared to other algorithms such as Latent Semantic
Analysis (LSA). LSA is a natural language processing technique for analyzing relationships
between textual documents and their terms by producing a set of concepts related to the
documents and terms [11]. Word embedding is the collective name for feature learning and
language modeling techniques in natural language processing (NLP) in which terms from the
documents are transformed into vectors of real numbers.
Our literature review in this field of interest shows increasing interest among researchers in applying machine-learning algorithms to big data sets. The focus is on:
• accuracy of algorithms,
• performance,
• optimization of algorithms, and
• information retrieval and data storage.
The quality of our literature review depends on two key factors: 1) the number of papers used in the process, and 2) the construction of the research questions for the literature review. As a result of this process, we have identified some potential research gaps, which are possible topics for future work in our research field:
1. Hybrid solutions for high accuracy text-clustering and text-classification algorithms.
2. Improving text-similarity algorithms by extracting contextual meaning.
3. Adapting traditional machine-learning algorithms to parallel/distributed programming
models.
4. High-performance text-mining algorithms, analysis, and best practices.
5. Fully automated text classification systems based on machine-learning algorithms.
6. Real-time machine-learning algorithms.
2.1 Document Clustering
Managing and organizing data is essential nowadays. Many companies, groups of people, and individuals strive to organize data to gain better and more comprehensive insight. As more and more operations are digitalized, the amount of data generated has become vast. A good starting point for organizing and managing data is grouping (clustering) it. Clustering is an excellent technique for grouping similar or related objects: "Clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters)" [78]. There is an abundance of research on clustering, and many algorithms have been proposed in publications on the topic [6] [78] [81] [82].
Different algorithms and techniques have been proposed for generating clusters of textual
documents based on their content. The most common clustering techniques are:
• Hierarchical clustering,
• Density-based clustering,
• Partitioning clustering algorithms,
• Graph-based algorithms,
• Grid-based algorithms,
• Model-based clustering algorithms,
• Combinational clustering algorithms.
One way to summarize a large amount of data is to use clustering techniques to group it in a meaningful way, so that objects inside a group, or cluster, are as similar as possible, while objects in different groups differ as much as possible. Two types of clustering algorithms are available: nested and partitioned. A nested clustering algorithm creates overlapping clusters, while a partitioned clustering algorithm creates non-overlapping clusters. For research in which clear differences between groups are required, the second type of clustering algorithm is more appropriate.
2.1.1 K-means Clustering
K-means is a center-based clustering algorithm in which a representative point (object) is chosen for each cluster, and the distance between that representative point and all other points (objects) is computed. Since the user chooses the number k of clusters, the algorithm is called K-means: k representative objects (i.e., centroids or medoids) are selected. Each object is then assigned to the closest centroid and, therefore, to the related cluster. In the next step, each center point is updated according to the objects that have been assigned to its cluster, and the cluster memberships are computed again for the new centroids. This process is repeated until the centroids no longer change and the process converges. In the worst case, the algorithm will converge after at most k·n iterations [72] [85].
K-means clustering has been widely applied to large amounts of data as one of the most efficient clustering algorithms. However, its performance varies greatly with the type of data it is applied to.
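The iterative procedure described above can be sketched in a few lines of pure Python; the 2-D toy points and the naive "first k points" initialization are our own illustrative choices (K-Means++, discussed below, improves the initialization):

```python
def kmeans(points, k, iters=100):
    """Plain K-Means on 2-D points: assign every point to its nearest
    centroid, then move each centroid to the mean of its members."""
    centroids = points[:k]  # naive initialization
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: (p[0] - centroids[c][0]) ** 2
                                      + (p[1] - centroids[c][1]) ** 2)
            clusters[nearest].append(p)
        new_centroids = [
            (sum(p[0] for p in m) / len(m), sum(p[1] for p in m) / len(m))
            if m else centroids[i]  # keep a centroid that lost all members
            for i, m in enumerate(clusters)]
        if new_centroids == centroids:  # converged: no centroid moved
            break
        centroids = new_centroids
    return centroids, clusters

# two well-separated blobs
points = [(0.0, 0.1), (9.0, 9.1), (0.2, 0.0), (0.1, 0.2), (9.2, 9.0), (9.1, 9.2)]
centroids, clusters = kmeans(points, k=2)
# converges to one centroid near (0.1, 0.1) and one near (9.1, 9.1)
```

The assignment and update steps correspond directly to steps 2 and 3 of the algorithm as listed below.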
K-Means algorithm steps:
1. Initialize k centroids for the clusters.
2. Assign each data point to its nearest centroid.
3. Set the position of each centroid to the mean value of all data points which belong to its cluster.
4. Repeat steps 2 and 3 until the assignments no longer change.
Figure 4. K-Means Clustering Process
The time complexity of the K-Means algorithm is O(l·k·m·n), where l is the number of iterations, k is the number of clusters, m is the number of dimensions (attributes), and n is the number of objects.
Even though the K-Means algorithm is widely used in practice, it has some disadvantages:
• It is sensitive to initialization,
• It is sensitive to outliers,
• It can only handle clusters with a symmetrical point distribution, and
• The value of k must be defined at the beginning.
2.1.2 K-Means++ Clustering
The K-means++ algorithm follows almost the same approach as K-means, but with a minor modification in the initialization of the centroids. It is considered an improved version of K-means because it is less susceptible to the initialization problem, i.e., less likely to be influenced by a poor initial choice of centroids [26].
K-means++ chooses the first centroid uniformly at random. For each data object (point), it then computes the distance of that point to the nearest centroid chosen so far. The remaining centers, up to the number k set by the user, are selected using a weighted probability distribution in which points farther from the existing centroids have a higher probability of being chosen as centroids. This modification of K-means arguably solves the initialization problem, which in some cases can profoundly affect the accuracy of the algorithm [72] [85].
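The seeding step can be sketched as follows; this is a pure-Python illustration in which the squared-distance weighting is the standard K-means++ choice, while the function name and toy data are ours:

```python
import random

def kmeanspp_init(points, k, rng=None):
    """K-Means++ seeding: the first centroid is uniform at random; each
    further centroid is sampled with probability proportional to the
    squared distance to its nearest already-chosen centroid."""
    rng = rng or random.Random(0)
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # squared distance from every point to its closest chosen centroid
        d2 = [min((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids)
              for p in points]
        r = rng.random() * sum(d2)
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centroids.append(p)
                break
    return centroids

points = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.1, 10.0)]
seeds = kmeanspp_init(points, k=2)
```

Because distant points carry almost all of the probability mass, the two seeds nearly always land in opposite blobs, avoiding the poor starting configurations that plain random initialization can produce.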
2.1.3 K-Medoids Clustering
K-Medoids algorithm is like K-Means in that they are both partitional and tend to break the
data set into groups. However, in contrast to K-means, with K-medoids, the initialized centers,
also called medoids, are data objects (points) from the dataset. K-medoids is more potent to
outliers compared to K-means and can be used with any distance/similarity measure. However,
K-Medoids is also more expensive, resulting in higher efficiency when applied on smaller
datasets than larger datasets [86].
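A compact alternating K-Medoids sketch follows; this is a simplified variant (the full PAM algorithm also evaluates swaps with non-medoid points), and the Manhattan distance and toy points are our illustrative assumptions:

```python
def kmedoids(points, k, dist, iters=20):
    """Alternating K-Medoids: medoids are actual data points; the update
    step picks, inside each cluster, the member with the smallest total
    distance to its fellow members. Any distance function can be used."""
    medoids = points[:k]  # naive initialization
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda c: dist(p, medoids[c]))].append(p)
        new_medoids = [min(m, key=lambda cand: sum(dist(cand, q) for q in m))
                       if m else medoids[i]
                       for i, m in enumerate(clusters)]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, clusters

manhattan = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])
# two small groups plus one extreme outlier at (50, 50)
points = [(0, 0), (0, 1), (1, 0), (8, 8), (8, 9), (9, 8), (50, 50)]
medoids, clusters = kmedoids(points, 2, manhattan)
# the second medoid stays at the real point (8, 8); a K-Means centroid
# would instead be dragged toward the outlier
```

This illustrates the robustness claim above: the outlier joins a cluster, but because the medoid must be an actual data point, it cannot pull the cluster center away from the dense group.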
Figure 5. K-Means vs. K-Medoids Algorithms
Source: Cross Validated – Stack Exchange
Although K-medoids is more robust to outliers than K-means, it does not always produce satisfactory results when dealing with noisy data and outliers. For this reason, new algorithms have been proposed. One of them is the Distributed K-Means clustering algorithm, another enhanced version of K-Means, which relies on normalized datasets to identify the groups of clusters [86].
Figure 6. K-Medoids Process
1. Discover the maximum and minimum values of every feature in every local dataset and send them to a central place.
2. Compute the global maximum and minimum values at the central place.
3. Normalize the real scalar values of the local datasets using the global maximum and minimum values.
4. Cluster every local dataset through K-means clustering and obtain the centroids along with a cluster index for every dataset.
5. Build a single dataset, named centroids, by merging the cluster centroids of the local datasets.
6. Cluster the centroids dataset with K-means to obtain the overall centroids.
7. After calculating the Euclidean distance between each object and the overall centroids, update the local cluster indices by assigning each object to its nearest cluster centroid.
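Steps 1-3, the globally consistent min-max normalization, can be sketched as follows; the site data is invented for illustration:

```python
def global_minmax_normalize(local_datasets):
    """Each site reports only its per-feature min and max (step 1); the
    central place combines them into global extremes (step 2); every
    local dataset is then rescaled into [0, 1] with the global range (step 3)."""
    n = len(local_datasets[0][0])  # number of features
    local_mins = [[min(p[i] for p in ds) for i in range(n)] for ds in local_datasets]
    local_maxs = [[max(p[i] for p in ds) for i in range(n)] for ds in local_datasets]
    gmin = [min(m[i] for m in local_mins) for i in range(n)]
    gmax = [max(m[i] for m in local_maxs) for i in range(n)]
    return [[tuple((p[i] - gmin[i]) / (gmax[i] - gmin[i]) for i in range(n))
             for p in ds] for ds in local_datasets]

site_a = [(0.0, 10.0), (2.0, 20.0)]
site_b = [(4.0, 30.0), (8.0, 50.0)]
norm_a, norm_b = global_minmax_normalize([site_a, site_b])
# both sites now share one scale: norm_a == [(0.0, 0.0), (0.25, 0.25)]
```

Because every site rescales with the same global extremes, the centroids merged at the central place in steps 5-6 are comparable across sites.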
2.1.4 Hierarchical Clustering
Hierarchical clustering produces a set of nested clusters organized as a hierarchical tree [86]. An example of a hierarchical clustering is shown in the next figure; a dendrogram can be used to visualize the sequence of merges or splits.
Figure 7. Hierarchical Clustering Source: http://faculty.juniata.edu/rhodes/ml/hiercluster.htm
Hierarchical clustering algorithms can be beneficial in specific cases when working with data. In our case, we consider this type of clustering in the final, recommendation phase of our proposed model, where we distinguish between different levels of detail for a given cluster. Therefore, we apply this type of clustering technique after the initial clusters have been generated.
Two types of Hierarchical Clustering Algorithms exist:
• Agglomerative Hierarchical clustering
• Divisive Hierarchical clustering
The Agglomerative Hierarchical clustering algorithm takes a bottom-up approach, while the Divisive algorithm works top-down; both are outlined in Figure 8.
Figure 8. The Agglomerative Hierarchical clustering is a bottom-up approach (left);
the Divisive Hierarchical clustering is a top-down approach (right)
In practice, top-down clustering algorithms tend to be more accurate than the bottom-up approach.
The advantages of Hierarchical Clustering are as follows:
• No assumptions on the number of clusters: any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level.
• Hierarchical clusterings may correspond to meaningful taxonomies.
There are two disadvantages of Hierarchical clustering:
• It is sensitive to noisy data and outliers.
• Its time complexity is O(n²), where n is the number of textual documents; if the dataset is big, the algorithm performs too slowly.
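A single-linkage agglomerative sketch in pure Python, stopping once a desired number of clusters remains; the quadratic pair search is the naive form responsible for the complexity noted above, and the toy points are ours:

```python
def agglomerative(points, n_clusters, dist):
    """Bottom-up hierarchical clustering: start with one cluster per
    point and repeatedly merge the two closest clusters
    (single linkage: distance between nearest members)."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        # find the pair of clusters with minimum single-linkage distance
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters[j]  # merge cluster j into cluster i
        del clusters[j]
    return clusters

euclid2 = lambda a, b: (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
points = [(0, 0), (0, 1), (5, 5), (5, 6), (9, 9)]
result = agglomerative(points, 3, euclid2)
# the two tight pairs merge first; (9, 9) is left as a singleton cluster
```

Running the loop to a single cluster and recording each merge would yield the dendrogram discussed above.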
The two procedures outlined in Figure 8 are:
Agglomerative (bottom-up):
1. Compute the distance matrix between the input data points.
2. Let each data point be a cluster.
3. Repeat:
4. Merge the two closest clusters.
5. Update the distance matrix, until only a single cluster remains.
Divisive (top-down):
1. Start at the top with all patterns in one cluster.
2. The cluster is split using a flat clustering algorithm.
3. This procedure is applied recursively.
2.1.5 Text representation formats
When implementing a Bag of Words model, insignificant words are filtered out from the training data, mainly because they hold no meaning and increase the dimensionality of the data. When text is represented as a "bag of words," a high number of dimensions is generated.
Hence, it is desirable to apply techniques that reduce the dimensionality of the data while retaining as much information as possible. Several adequate techniques exist; for instance, some well-known techniques that retain only the most important terms are [2] [11] [12]:
• TF-IDF
• PCA
• LDA
• SVD
It is worth noting that there is no single best technique for dimensionality reduction; all the above techniques are widely used and should be chosen depending on the setup. Additionally, to support the simplifying power of a Bag of Words representation, it is highly recommended to apply Natural Language Processing methods such as tokenization, relation detection, entity detection, and segmentation.
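A minimal illustration of this preprocessing: tokenize, strip punctuation, drop insignificant (stop) words, and count what remains. The tiny stop-word list here is our own illustrative sample:

```python
def bag_of_words(text, stop_words):
    """Tokenize on whitespace, strip punctuation, drop stop words,
    and count the remaining terms into a Bag of Words."""
    counts = {}
    for token in text.lower().split():
        token = token.strip(".,;:!?\"'()")
        if token and token not in stop_words:
            counts[token] = counts.get(token, 0) + 1
    return counts

stop_words = {"the", "a", "an", "of", "and", "is", "to"}
bow = bag_of_words("The clustering of documents and the classification "
                   "of documents.", stop_words)
# bow == {"clustering": 1, "documents": 2, "classification": 1}
```

Filtering the stop words leaves only three dimensions instead of six, which is exactly the dimensionality saving discussed above.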
On the other hand, Word2Vec, proposed by [12], represents words in a multidimensional space by vectorizing them. Word2vec belongs to a group of related models used to produce word embeddings. These models are shallow, two-layer neural networks trained to model the linguistic context of words. The input to a Word2vec model is a large corpus of text, and the output is a vector space, typically of several hundred dimensions, in which each unique word in the corpus is assigned a corresponding vector.
One of the main advantages of Word2Vec is that it exhibits interesting linguistic regularities between words trained on a large text corpus. Simply put, it turns text into numerical form so that distances between vector representations of words can be measured. Recently, Word2Vec has been used in various natural language processing, clustering, and classification applications [2] [12] [65] [66], and the efficiency and effectiveness of many classification algorithms increasingly rely on its power to extract valuable information from text [65].
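A common way to exploit these vectors is cosine similarity between embeddings; the 3-dimensional vectors below are invented stand-ins for real several-hundred-dimensional Word2Vec embeddings:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two word vectors; embeddings that
    point in similar directions get a score close to 1."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# toy "embeddings"; related words point in similar directions
king = [0.9, 0.8, 0.1]
queen = [0.85, 0.82, 0.15]
apple = [0.1, 0.2, 0.95]
```

Here `cosine_similarity(king, queen)` is far higher than `cosine_similarity(king, apple)`, which is the property clustering and classification algorithms exploit when they operate on embeddings instead of raw term counts.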
Word2vec was created by a team of researchers led by Tomas Mikolov at Google, and the algorithm has since been analyzed and explained by other researchers [2] [3]. Embedding vectors created using the Word2vec algorithm have many advantages compared to earlier algorithms [1] such as Latent Semantic Analysis.
Moreover, two additional relevant techniques in natural language processing are Latent Semantic Analysis (LSA) and LSTM networks. LSA is a technique in natural language processing, in particular distributional semantics, for analyzing relationships between the terms occurring in a group of documents [2]; its purpose is to produce a set of relevant concepts related to the documents and terms. Feature-learning and language-modeling techniques in natural language processing are also known as word-embedding techniques: word embedding maps words or phrases from a vocabulary to vectors of real numbers.
LSTM neural networks are considered state-of-the-art approaches, offering very high accuracy in several Natural Language Processing tasks, such as the Bi-directional LSTM-CRF [18] for Part-of-Speech tagging and Tree-LSTMs for sentiment analysis [19]. Also, simpler variants of LSTMs, referred to as Gated Recurrent Units (GRUs) [16], are used as crucial parts of larger systems such as state-of-the-art Dynamic Memory Networks, which address more complicated tasks such as question answering [17].
2.2 Supervised Learning
In machine learning, the classification task is also known as supervised learning because it uses a set of labeled training data to learn (generate) a model and then classifies new instances based on that model. Each classification task consists of two phases: a training step and a testing step [9]. The training data is used to fit the learning (classification) algorithms, while the testing data is used to measure the accuracy of the resulting classifiers.
Figure 9. Supervised Learning process
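The two phases can be illustrated with a tiny Centroid Classifier (the classifier mentioned earlier in this chapter): the training step averages the labeled vectors of each class, and the testing step measures accuracy on held-out instances. The two-feature document vectors and labels below are invented for illustration:

```python
def train(train_set):
    """Training step: average the labeled vectors of each class into a centroid."""
    sums, counts = {}, {}
    for label, vec in train_set:
        sums.setdefault(label, [0.0] * len(vec))
        sums[label] = [s + x for s, x in zip(sums[label], vec)]
        counts[label] = counts.get(label, 0) + 1
    return {lbl: [s / counts[lbl] for s in total] for lbl, total in sums.items()}

def predict(model, vec):
    """Testing step: classify a new instance by the label of its nearest centroid."""
    return min(model, key=lambda lbl: sum((c - x) ** 2
                                          for c, x in zip(model[lbl], vec)))

# toy vectors: (count of sport terms, count of tech terms) per document
train_set = [("sport", (5, 0)), ("sport", (4, 1)), ("tech", (0, 5)), ("tech", (1, 4))]
test_set = [("sport", (6, 1)), ("tech", (0, 6))]
model = train(train_set)
accuracy = sum(predict(model, vec) == label for label, vec in test_set) / len(test_set)
```

Keeping the test instances out of the training step, as here, is what makes the measured accuracy an honest estimate of how the classifier will behave on unseen documents.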
Classifying text is an important topic nowadays. Some application fields of text mining and text
classification are:
Classify Reviews: This is a typical application for e-commerce Web sites. Almost every product page has a section where users write their reviews, and the company needs to read all the reviews and classify them as positive or negative. In most cases, however, it is impossible to go through all the reviews manually. Therefore, a solution is to train a classifier that automatically distinguishes positive from negative reviews.
Spam Detection: A trending topic in the last decade. Everyone receives e-mails containing spam, so for mail servers it is imperative to create systems that can quickly detect spam e-mails. Using classification techniques, models can be generated that classify mails as spam or not spam.
Author Detection: Every author has his/her own style of writing, and this is another area where text classification is applied. However, author detection is a harder task: in most cases, the writing styles of two authors are very similar, so high accuracy cannot be reached on this topic.
Language Detection: A prevalent task nowadays; Google uses it in its translation engine. This type of problem achieves the highest accuracy because languages differ greatly from one another. Because of that, in most cases this task is not even considered a difficult classification task.
Classify News Articles: This is the topic demonstrated in this thesis. An advantage of working on this topic is that many news articles are available on the Web for experimenting with and testing different algorithms. It is a prevalent task in which researchers are trying to improve the accuracy of classification algorithms. A drawback of this task is that new words appear very often in news articles.