UNIVERSITETI I EVROPËS JUGLINDORE УНИВЕРЗИТЕТ НА ЈУГОИСТОЧНА ЕВРОПА
SOUTH EAST EUROPEAN UNIVERSITY
FAKULTETI I SHKENCAVE DHE TEKNOLOGJIVE BASHKËKOHORE ФАКУЛТЕТ ЗА СОВРЕМЕНИ НАУКИ И ТЕХНОЛОГИИ FACULTY OF CONTEMPORARY SCIENCES AND TECHNOLOGIES
Third Cycle of Academic Studies – Doctoral Studies
Doctoral Dissertation Topic:
HYBRID SOLUTION FOR SCALABLE RESEARCH ARTICLES RECOMMENDATION
CANDIDATE: Nuhi Besimi, MSc
MENTOR: Prof. Dr. Betim Çiço
October, 2020
Abstract
In recent decades, machine learning has become a crucial factor in automating business
operations and assisting in the decision-making process. The massive volume of data generated
at an unprecedented rate has motivated researchers and industry analysts to continually
develop effective and efficient analytical models and machine learning techniques.
In text mining, clustering and classification are essential techniques for extracting information from
textual data. These techniques allow us to identify groups of similar textual
documents or to build classification models based on document similarity. Applying machine
learning techniques to textual data has become a crucial factor in extracting useful and
previously unknown information from textual documents. With the massive volume of unstructured data
generated on the Web, researchers and industry have been motivated to develop efficient
techniques for structuring and processing such data to extract meaningful information.
In this thesis, we present a hybrid model based on clustering and classification techniques to
recommend research articles to researchers. Since the literature review process is time-
consuming, we aim to automate it and recommend the most relevant research
articles based on users’ research field preferences. All information extracted from raw,
unstructured research articles is represented in a graph structure, a form well suited
to the recommendation process.
This research contributes to the machine learning community by evaluating some of the most
significant text mining techniques for unsupervised and supervised learning, with the aim of
easing the literature review process for researchers. Furthermore, it evaluates the accuracy
and execution time of all model phases by comparing multiple techniques. It also
compares the execution of the model in terms of cost and energy consumption in three
different environments, namely, a cloud instance, cloud functions, and distributed Raspberry Pis.
Results showed that our proposed model can have a positive impact on easing the processing
of literature reviews and identifying trend topics for a given field. Overall, we found that
both unsupervised learning and supervised learning showed promising accuracy when
working with textual data. On the other hand, this solution does not perform well in terms of
execution time as the volume of data increases.
Our results also showed that distributed Raspberry Pis can have a highly positive
impact in terms of lowering costs and being energy efficient. Overall, we found that
machine learning algorithms can be adapted to run on distributed Raspberry Pis at low
cost and with low energy consumption compared to cloud alternatives. On the other hand, this
solution does not offer high scalability, and it requires more time for management, deployment,
and configuration.
A graph structure for representing the information extracted by machine learning techniques
is one of the most suitable forms for machine learning tasks and recommendation systems. It
allows us to query the data easily, represent all the relationships well, and achieve
performance and scalability in recommendation systems. Other data structures showed
poorer performance and increased complexity in the process of recommending and storing
the extracted information.
Keywords: recommendation system, supervised learning, unsupervised learning, text mining,
graph databases
Contents

1. INTRODUCTION ..................................................................................................................... 16
1.1 Problem description ............................................................................................................ 19
1.2 Hypothesis ........................................................................................................................... 22
1.3 Research Questions ............................................................................................................. 22
1.4 Methodology ....................................................................................................................... 22
1.5 Thesis Structure ................................................................................................................... 24
1.6 Conclusion ........................................................................................................................... 25
2. LITERATURE REVIEW ................................................................................................................. 27
2.1 Document Clustering ........................................................................................................... 32
2.1.1 K-means Clustering ....................................................................................................... 33
2.1.2 K-Means++ Clustering ................................................................................................... 34
2.1.3 K-Medoids Clustering ................................................................................................... 35
2.1.4 Hierarchical Clustering .................................................................................................. 38
2.1.5 Text representation formats ........................................................................................ 39
2.2 Supervised Learning ............................................................................................................ 41
2.2.1 k-NN Classifier ............................................................................................................... 45
2.2.2 Centroid Classifier ......................................................................................................... 46
2.2.3 Naive Bayes ................................................................................................................... 47
2.2.4 SVM – Support Vector Machine ................................................................................... 49
2.2.5 Convolutional Neural Network ..................................................................................... 50
2.3 Hadoop ................................................................................................................................ 51
2.3.1 Hadoop Architecture .................................................................................................... 53
2.3.2 Map Reduce and Spark ................................................................................................. 55
2.3.3 Real-time Data Stream Processing: Spark Streaming ................................................... 56
2.4 Storage Systems .................................................................................................................. 56
2.4.1 Hadoop vs. Relational Database Management Systems .............................................. 57
2.4.2 Hadoop vs. Data Warehouse ........................................................................................ 59
2.4.3 Cloud Solutions ............................................................................................................. 61
2.4.4 Graph Databases .......................................................................................................... 62
2.5 Conclusion ........................................................................................................................... 64
3. METHODOLOGY ........................................................................................................................ 65
3.1 Proposed Model .................................................................................................................. 68
3.2 Phase 1 – Initial text clustering ........................................................................................... 69
3.2.1 What is the right number of clusters? .......................................................................... 71
3.2.2 Outlier Clusters ............................................................................................................. 73
3.3 Phase 2 – A supervised learning model .............................................................................. 74
3.4 Phase 3 – Graph representation and topic modeling ......................................................... 76
3.5. Proposed Distributed Model .............................................................................................. 78
3.6 Text Pre-Processing ............................................................................................................. 81
3.7 Datasets ............................................................................................................................... 81
3.8 Experimental setup ............................................................................................................. 83
3.9 Proof of concept – Unsupervised Learning ......................................................................... 84
3.10 Proof of concept – Supervised Learning ........................................................................... 91
3.11 Conclusion ......................................................................................................................... 94
4. EXPERIMENTS AND RESULTS .................................................................................................... 96
4.1 Results Phase 1 – Text Clustering ........................................................................................ 96
4.2 Results Phase 2 – Supervised Learning ............................................................................. 105
4.3 Results Phase 3 – Graph representation and topic modeling ........................................... 118
4.4 Experiments in cost and energy consumption .................................................................. 123
4.5 Conclusion ......................................................................................................................... 127
5. DISCUSSION OF FINDINGS....................................................................................................... 129
5.1 Evaluation of Machine Learning Techniques .................................................................... 129
5.2 Cost and Energy Consumption .......................................................................................... 132
5.3 Text Pre-Processing ........................................................................................................... 133
5.4 Findings on Data Storage .................................................................................................. 134
5.5 Limitations ......................................................................................................................... 136
5.6 Conclusion ......................................................................................................................... 136
6. CONCLUSION ........................................................................................................................... 138
PUBLICATIONS AND PRESENTATIONS......................................................................................... 143
ACKNOWLEDGEMENT ................................................................................................................. 144
REFERENCES ................................................................................................................................ 145
APPENDIX A ................................................................................................................................. 160
APPENDIX B ................................................................................................................................. 165
APPENDIX C ................................................................................................................................. 169
APPENDIX D ................................................................................................................................. 225
List of Figures
Figure 1. Overall Architecture ....................................................................................................... 21
Figure 2. Paper distribution by year ............................................................................................. 29
Figure 3. Paper classification ........................................................................................................ 29
Figure 4. K-Means Clustering Process .......................................................................................... 34
Figure 5. K-Means vs. K-Medoids Algorithms ............................................................................... 35
Figure 6. K-Medoids Process ........................................................................................................ 37
Figure 7. Hierarchical Clustering ................................................................................................... 38
Figure 8. Agglomerative hierarchical clustering, a bottom-up approach (left); divisive
hierarchical clustering, a top-down approach (right) ................................................................... 39
Figure 9. Supervised Learning process.......................................................................................... 42
Figure 10. Decision tree (example) [Data mining Concepts and Techniques] .............................. 44
Figure 11. k-NN example ............................................................................................................... 46
Figure 12. Support Vector Machine Margin ................................................................................. 49
Figure 13. Convolutional Neural Network .................................................................................... 51
Figure 14. Hadoop Core Components .......................................................................................... 53
Figure 15. Hadoop Architecture.................................................................................................... 54
Figure 16. Hadoop Cluster ........................................................................................................... 54
Figure 17. Spark Streaming ........................................................................................................... 56
Figure 18. Data Warehouse Architecture ..................................................................................... 60
Figure 19. Neo4j ............................................................................................................................ 63
Figure 20. Proposed Model ........................................................................................................... 63
Figure 21-a. Phase 1 ...................................................................................................................... 68
Figure 21-b. Labeling Clusters ....................................................................................................... 70
Figure 22. Vector representation of textual data ......................................................................... 73
Figure 23. Phase 2 ......................................................................................................................... 75
Figure 24. Graph Structure ........................................................................................................... 76
Figure 25. Phase 3 Identifying trend topics .................................................................................. 78
Figure 26. Typical Master-Slave Architecture ............................................................................... 79
Figure 27. Master-Slave Architecture model with Raspberry PIs ................................................. 80
Figure 28. Phase 1 Clustering Accuracy ........................................................................................ 99
Figure 29. Execution Time in seconds ......................................................................................... 101
Figure 30. Visualization, Top Generated Clusters ....................................................................... 104
Figure 31. Phase 2 Supervised Learning Accuracy ...................................................................... 107
Figure 32. Phase 2 Execution Time in seconds ........................................................................... 109
Figure 33. Natural Language Processing Graph. Group of Papers which all belong to a specific
field in NLP .................................................................................................................................. 119
Figure 34. Medical Graph. ........................................................................................................... 120
Figure 35. Medical and the bridge papers with other fields ...................................................... 121
Figure 36. Natural Language processing and the bridge papers with other fields..................... 121
Figure 37. Computer Vision and the bridge papers with other fields ........................................ 122
Figure 38. Playing Games and the bridge papers with other fields ............................................ 122
Figure 39. Cost Comparison for 1 year ....................................................................................... 123
Figure 40. Cost Comparison for 1 year ....................................................................................... 125
Figure 41. Power consumption (Watt per hour) of Physical servers with near 100% CPU
utilization. Source: https://www.anandtech.com/show/7285/intel-xeon-e5-2600-v2-12-core-
ivy-bridge-ep/11 ......................................................................................................................... 126
List of Tables
Table 1 Research articles by field ................................................................................................. 28
Table 2. RDBMS vs. Hadoop .......................................................................................................... 58
Table 3. RDBMS vs. MapReduce ................................................................................................... 58
Table 4. Data Warehouse vs Hadoop [71] .................................................................................... 60
Table 5. Dataset organization ....................................................................................................... 82
Table 6. k-NN accuracy (3 classes) ................................................................................................ 86
Table 7. k-NN accuracy (5 classes) ................................................................................................ 87
Table 8. k-NN accuracy (3 classes only keywords) ........................................................................ 88
Table 9. k-NN accuracy (5 classes only keywords) ........................................................................ 88
Table 10. News articles – Experiment 1 ........................................................................................ 91
Table 11. News articles – Experiment 2 ........................................................................................ 91
Table 12. Testing the accuracy of classifiers ................................................................................. 91
Table 13. Classify Politics News Articles (Total news articles: 49) ................................................ 92
Table 14. Classify Technology News Articles (Total news articles: 86) ......................................... 92
Table 15. Classify Sports News Articles (Total news articles: 102) ............................................... 92
Table 16. Execution time (in seconds) .......................................................................................... 93
Table 17. Phase 1 Unsupervised Learning Accuracy ..................................................................... 98
Table 18. Efficiency of Silhouette Coefficient (input: 7 clusters) .................................................. 99
Table 19. Phase 1 Unsupervised Learning Execution Time in seconds ...................................... 100
Table 20. Generated clusters from Dataset 1 ............................................................................. 102
Table 21. Top Generated Clusters from Dataset 1 ..................................................................... 102
Table 22. Phase 2 Supervised Learning Accuracy ....................................................................... 105
Table 23. Phase 2 Supervised Learning Average Accuracy ......................................................... 108
Table 24. Naive Bayes - 7 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 - Miscellaneous, 4 – Natural Language Processing, 5 – Playing Games, 6 -
Speech ......................................................................................................................................... 110
Table 25. Naive Bayes (7 classes) - Classification Report 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 - Miscellaneous, 4 – Natural Language Processing, 5 – Playing Games, 6 -
Speech ......................................................................................................................................... 110
Table 26. SVM - 7 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 – Methodology,
3 - Miscellaneous, 4 – Natural Language Processing, 5 – Playing Games, 6 - Speech ................ 111
Table 27. SVM (7 classes) - Classification Report 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 - Miscellaneous, 4 – Natural Language Processing, 5 – Playing Games, 6 -
Speech ......................................................................................................................................... 112
Table 28. Logistic Regression - 7 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 - Miscellaneous, 4 – Natural Language Processing, 5 – Playing Games, 6 -
Speech ......................................................................................................................................... 112
Table 29. Logistic Regression (7 classes) - Classification Report 0 - Computer Vision, 1 – Medical,
2 – Methodology, 3 - Miscellaneous, 4 – Natural Language Processing, 5 – Playing Games, 6 -
Speech ......................................................................................................................................... 113
Table 30. Decision Tree - 7 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 - Miscellaneous, 4 – Natural Language Processing, 5 – Playing Games, 6 -
Speech ......................................................................................................................................... 114
Table 31. Decision Tree (7 classes) - Classification Report 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 - Miscellaneous, 4 – Natural Language Processing, 5 – Playing Games, 6 -
Speech ......................................................................................................................................... 114
Table 32. KNN - 7 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 – Methodology,
3 - Miscellaneous, 4 – Natural Language Processing, 5 – Playing Games, 6 - Speech ................ 115
Table 33.KNN (7 classes) - Classification Report 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 - Miscellaneous, 4 – Natural Language Processing, 5 – Playing Games, 6 -
Speech ......................................................................................................................................... 116
Table 34. Random Forest - 7 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 - Miscellaneous, 4 – Natural Language Processing, 5 – Playing Games, 6 -
Speech ......................................................................................................................................... 117
Table 35. Random Forest (7 classes) - Classification Report 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 - Miscellaneous, 4 – Natural Language Processing, 5 – Playing Games, 6 -
Speech ......................................................................................................................................... 117
Table 36. PRO attributes for different execution platforms ....................................................... 126
Table 37. Comparison of environments ..................................................................................... 133
Table 38. Comparison of OLAP and OLTP ................................................................................... 135
Table 39. Random Forest - 2 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical ...... 169
Table 40. Random Forest (2 classes) - Classification Report 0 - Computer Vision, 1 – Medical 169
Table 41. Random Forest - 3 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 -
Methodology ............................................................................................................................... 169
Table 42. Random Forest (3 classes) - Classification Report 0 - Computer Vision, 1 – Medical, 2 -
Methodology ............................................................................................................................... 170
Table 43. Random Forest - 4 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 – Natural Language Processing ........................................................................ 170
Table 44. Random Forest (4 classes) - Classification Report 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 – Natural Language Processing ........................................................................ 171
Table 45. Random Forest - 5 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 - Miscellaneous, 4 – Natural Language Processing .......................................... 171
Table 46. Random Forest (5 classes) - Classification Report 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 - Miscellaneous, 4 – Natural Language Processing .......................................... 172
Table 47. Random Forest - 6 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 - Miscellaneous, 4 – Natural Language Processing, 5 – Playing Games ........... 172
Table 48. Random Forest (6 classes) - Classification Report 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 - Miscellaneous, 4 – Natural Language Processing, 5 – Playing Games ........... 173
Table 49. KNN - 2 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical ........................ 174
Table 50. KNN (2 classes) - Classification Report 0 - Computer Vision, 1 – Medical .................. 174
Table 51. KNN - 3 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 - Methodology
..................................................................................................................................................... 174
Table 52. KNN (3 classes) - Classification Report 0 - Computer Vision, 1 – Medical, 2 -
Methodology ............................................................................................................................... 175
Table 53. KNN - 4 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 – Methodology,
3 – Natural Language Processing ................................................................................................ 175
Table 54. KNN (4 classes) - Classification Report 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 – Natural Language Processing ........................................................................ 176
Table 55. KNN - 5 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 – Methodology,
3 - Miscellaneous, 4 – Natural Language Processing .................................................................. 176
Table 56. KNN (5 classes) - Classification Report 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 - Miscellaneous, 4 – Natural Language Processing .......................................... 177
Table 57. KNN - 6 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 – Methodology,
3 - Miscellaneous, 4 – Natural Language Processing, 5 – Playing Games .................................. 177
Table 58. KNN (6 classes) - Classification Report 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 - Miscellaneous, 4 – Natural Language Processing, 5 – Playing Games ........... 178
Table 59. Decision Tree - 2 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical ......... 179
Table 60. Decision Tree (2 classes) - Classification Report 0 - Computer Vision, 1 – Medical ... 179
Table 61. Decision Tree - 3 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 -
Methodology ............................................................................................................................... 179
Table 62.Decision Tree (3 classes) - Classification Report 0 - Computer Vision, 1 – Medical, 2 -
Methodology ............................................................................................................................... 180
Table 63. Decision Tree - 4 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 – Natural Language Processing ........................................................................ 180
Table 64. Decision Tree (4 classes) - Classification Report 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 – Natural Language Processing ........................................................................ 181
Table 65. Decision Tree - 5 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 - Miscellaneous, 4 – Natural Language Processing .......................................... 181
Table 66. Decision Tree (5 classes) - Classification Report 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 - Miscellaneous, 4 – Natural Language Processing .......................................... 182
Table 67. Decision Tree - 6 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 - Miscellaneous, 4 – Natural Language Processing, 5 – Playing Games ........... 182
Table 68. Decision Tree (6 classes) - Classification Report 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 - Miscellaneous, 4 – Natural Language Processing, 5 – Playing Games ........... 183
Table 69. Logistic Regression - 2 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical 184
Table 70. Logistic Regression (2 classes) - Classification Report 0 - Computer Vision, 1 – Medical
..................................................................................................................................................... 184
Table 71. Logistic Regression - 3 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 -
Methodology ............................................................................................................................... 184
Table 72. Logistic Regression (3 classes) - Classification Report 0 - Computer Vision, 1 – Medical,
2 - Methodology .......................................................................................................................... 185
Table 73. Logistic Regression - 4 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 – Natural Language Processing ........................................................................ 185
Table 74. Logistic Regression (4 classes) - Classification Report 0 - Computer Vision, 1 – Medical,
2 – Methodology, 3 – Natural Language Processing .................................................................. 186
Table 75. Logistic Regression - 5 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 - Miscellaneous, 4 – Natural Language Processing .......................................... 186
Table 76. Logistic Regression (5 classes) - Classification Report 0 - Computer Vision, 1 – Medical,
2 – Methodology, 3 - Miscellaneous, 4 – Natural Language Processing .................................... 187
Table 77. Logistic Regression - 6 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 - Miscellaneous, 4 – Natural Language Processing, 5 – Playing Games ........... 187
Table 78. Logistic Regression (6 classes) - Classification Report 0 - Computer Vision, 1 – Medical,
2 – Methodology, 3 - Miscellaneous, 4 – Natural Language Processing, 5 – Playing Games ..... 188
Table 79. SVM - 2 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical ....................... 189
Table 80. SVM (2 classes) - Classification Report 0 - Computer Vision, 1 – Medical .................. 189
Table 81. SVM - 3 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 - Methodology
..................................................................................................................................................... 189
Table 82. SVM (3 classes) - Classification Report 0 - Computer Vision, 1 – Medical, 2 -
Methodology ............................................................................................................................... 190
Table 83. SVM - 4 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 – Methodology,
3 – Natural Language Processing ................................................................................................ 190
Table 84. SVM (4 classes) - Classification Report 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 – Natural Language Processing ........................................................................ 191
Table 85. SVM - 5 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 – Methodology,
3 - Miscellaneous, 4 – Natural Language Processing .................................................................. 191
Table 86. SVM (5 classes) - Classification Report 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 - Miscellaneous, 4 – Natural Language Processing .......................................... 192
Table 87. SVM - 6 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 – Methodology,
3 - Miscellaneous, 4 – Natural Language Processing, 5 – Playing Games .................................. 192
Table 88. SVM (6 classes) - Classification Report 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 - Miscellaneous, 4 – Natural Language Processing, 5 – Playing Games ........... 193
Table 89. Naive Bayes - 3 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 -
Methodology ............................................................................................................................... 194
Table 90. Naive Bayes (3 classes) - Classification Report 0 - Computer Vision, 1 – Medical, 2 -
Methodology ............................................................................................................................... 194
Table 91. Naive Bayes - 4 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 – Natural Language Processing ........................................................................ 195
Table 92. Naive Bayes (4 classes) - Classification Report 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 – Natural Language Processing ........................................................................ 195
Table 93. Naive Bayes - 5 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 - Miscellaneous, 4 – Natural Language Processing .......................................... 196
Table 94. Naive Bayes (5 classes) - Classification Report 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 - Miscellaneous, 4 – Natural Language Processing .......................................... 196
Table 95. Naive Bayes - 6 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 - Miscellaneous, 4 – Natural Language Processing, 5 – Playing Games ........... 197
Table 96. Naive Bayes (6 classes) - Classification Report 0 - Computer Vision, 1 – Medical, 2 –
Methodology, 3 - Miscellaneous, 4 – Natural Language Processing, 5 – Playing Games ........... 197
List of Abbreviations
SVM Support Vector Machine
k-NN k-Nearest Neighbor
TF Term Frequency
TF-IDF Term Frequency Inverse Document Frequency
LDA Latent Dirichlet Allocation
NMF Non-negative Matrix Factorization
LDA Linear Discriminant Analysis
CNN Convolutional Neural Network
NLP Natural Language Processing
1. INTRODUCTION
The growth of technologies and the continuous generation of data have posed unique
challenges, especially to the data mining community. These challenges have motivated
researchers and industry analysts to continually develop new tools and methods for improving
the application of various machine-learning techniques [1] [2] [46]. The main goal is to identify
patterns and build recommendation systems and predictive models that eventually support an
organization's decision-making process. The application of machine-learning techniques is
broad and spans many different research areas [38] [62].
Nowadays, data is considered one of the most valuable assets that organizations and
companies are willing to acquire. Vast volumes of data are being captured to gain better
insight into business processes, operations, products, customers, and more [25].
Because the volume of unstructured data is growing rapidly, many enterprises are also turning
to technological solutions to better manage and store their unstructured data. These can include
hardware or software solutions that enable them to make the most efficient use of their
available storage space [1] [3] [5] [44].
In machine-learning tasks, supervised learning methods are essential because they allow us to
make predictions. Supervised learning is also known as classification [9] [35]. Unsupervised
learning methods are also very commonly used in data mining. Their primary purpose is to
discover groups (clusters) of similar data, where elements in the same group are very similar
to each other and differ from the elements of other groups. Cluster analysis has been widely used in many
applications such as business intelligence, image pattern recognition, web search, biology, and
security. It is also used to improve recommendation systems. Similarly, search engines use
clustering; clustering mechanisms improve both the quality and the speed of a search [6]
[26] [28].
Statistical and analytical algorithms have recently shown favorable results in working with
structured data. However, analyzing semi-structured and unstructured data is not a
straightforward task; most proposed solutions are ad hoc and apply only to specific
problems [10] [41] [54].
Text mining is one of the most challenging areas in machine-learning applications, mainly
because of the nature of the data. Textual data is unstructured and, as such, requires additional
pre-processing steps [117]. Two of the most important measurements when applying machine-
learning techniques, especially to unstructured textual data, are accuracy and performance.
Accuracy issues emerge from the variety of the data, and performance issues from its
enormous volume. To tackle these issues, one must establish a well-defined strategy to store
and process "Big Data" [62].
The two significant challenges in the world of big data are 1) storing the data and 2) processing
the data. Since more and more data are available on the Web, applying statistical and analytical
algorithms to these data is an important topic nowadays. Processing massive amounts of
unstructured data usually requires more in-depth analysis and more pre-processing than
traditional data-mining techniques applied to regular datasets. Traditional algorithms face
performance and accuracy limitations when working with massive amounts of data; therefore,
new programming models and paradigms are needed to overcome these challenges [129].
MapReduce, Storm, and Spark are the most common frameworks used today for processing
big data. The main advantage of these programming models is that they can process data in
parallel in a distributed environment [120].
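The map and reduce steps of this paradigm can be illustrated with a single-process, stdlib-only Python sketch of the classic word count (a toy, not a distributed implementation):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(documents):
    """Map step: emit a (word, 1) pair for every word occurrence."""
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce step: group pairs by key (word) and sum the counts."""
    shuffled = sorted(pairs, key=itemgetter(0))  # the "shuffle" stage
    return {word: sum(count for _, count in group)
            for word, group in groupby(shuffled, key=itemgetter(0))}

docs = ["big data needs parallel processing",
        "parallel processing of big data"]
counts = reduce_phase(map_phase(docs))
print(counts["big"], counts["parallel"])  # 2 2
```

In a real MapReduce framework the map and reduce calls run on different machines, and the shuffle stage moves all pairs with the same key to the same reducer; the sketch only mirrors that data flow.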
Machine-learning algorithms, both supervised and unsupervised, cannot easily be integrated
into distributed environments to process data in parallel. Additional changes to the algorithms
are required to make them suitable for processing big data: first, the structure of the data
differs from traditional datasets, and second, not every sequential algorithm can easily be
adapted and transformed to work in parallel [23].
Various models and techniques have been proposed for overcoming these two challenges,
including distributed environments for parallel processing, cloud platforms, high-performance
computing resources, GPU processing, etc. [1] [3] [122]. However, adding more resources is
not always the best or preferred solution in these scenarios. On the one hand, distributed
environments for parallel processing are widely used and arguably the most suitable for Big
Data [3] [5]. However, any custom distributed environment comes with the cost of
configuration and the knowledge required to prepare it. On the other hand, cloud solutions
have become very attractive, since they allow processing, distributed data stores, and
integration with legacy systems. Nevertheless, cost is the main drawback of the cloud and the
most significant limitation for researchers and for small and medium-sized companies that
cannot afford it [1].
Recommendation systems are becoming increasingly important in helping people with
different tasks. They are widely used in computer science, finance, medicine, sports, and many
other areas to automate or semi-automate various tasks. They are built on top of different
data sets and data structures, ranging from databases and textual data to the Web and
streaming data from various sensors [51] [67] [76]. These systems can
also help researchers find the most relevant research articles for their research fields.
Therefore, in this thesis, we propose and evaluate a hybrid model for recommending research
articles built using both supervised and unsupervised machine-learning techniques.
Extracting information and knowledge from unstructured data is becoming more and more
important, since machine-learning techniques have the potential to organize textual data and
extract details from it [10].
Improving the literature review process is crucial; therefore, we target this problem using
machine-learning techniques. Solving it is critical for helping new researchers produce
higher-quality and more novel research topics. One of the most important aspects is making
sure researchers can explore research fields more easily and check the recent activities and
topics in each field. As a result, they need the ability to examine the dependencies and
relationships between various research fields in recent years.
1.1 Problem description
Researchers spend too much time reading others' work and finding research questions [143].
This process requires considerable effort in reading and classifying relevant papers. The
literature review process is a challenging task for new researchers in different fields of study.
Through the proposal in this thesis, we try to ease the literature review process and shorten
the time needed to define a research problem.
By analyzing possible research gaps, we define the problem we target: can we automate the
process of literature review?
Our study aims to collect/retrieve and analyze research/scientific articles by applying machine-
learning techniques to recommend research articles and/or research gaps to researchers based
on their research fields. By research articles, we mean scientific articles of any field, for which
we take the following attributes into consideration:
• Title – the title of the research article plays an important role in the analysis process,
• Author/s – it is important to know the author/s of the research article,
• Year of publication,
• Abstract,
• Keywords,
• Content,
• Contribution,
• Results,
• Future work,
• Conference/Journal,
• Related articles,
• Bibliography.
We propose a hybrid model, built on a large dataset of research articles, for recommending
research articles to researchers. This approach takes as input parameters an abstract, a list of
keywords, research articles, or research field(s). In Figure 1, we present the overall architecture
of our research experiment.
We aim to automate and/or semi-automate the process of literature review using machine-
learning algorithms. Nowadays, researchers, Master's students, and Ph.D. candidates spend too
much time identifying the research fields they want to target; in addition, they spend too much
time investigating the research gaps and possible future work in a field. Therefore, our main
contribution is to build a hybrid model based on machine-learning algorithms (supervised and
unsupervised learning).
We will also focus on integrating various supervised and unsupervised learning techniques
into our model. We will evaluate these techniques and select the most efficient ones.
Our aim is also to generate a scalable and updatable training set from any textual dataset.
The training set can be used to recommend research articles or to perform other analyses such
as feature extraction, classification of new research articles, and correlation between different
research fields.
Figure 1. Overall Architecture
The first part is integrating digital research articles from various digital libraries. As a result of
this first phase, we will have a centralized database that can be used to construct a
recommendation system. The second phase is to build a model based on unsupervised and
supervised learning algorithms, which, as an output, will generate a training dataset (model)
that will be used to recommend research articles. The third phase will analyze the quality of the
model; the focus will be on the model's ability to update itself and on its scalability. By
scalability, we mean the distribution of our training set as the dataset grows in the future. The
fourth phase will analyze the models for storing our training set so as to have an efficient
recommendation system. Finally, researchers will test and evaluate the model based on their
queries.
1.2 Hypothesis
1. The literature review process can be simplified by using a hybrid model based on text-
mining algorithms.
1.3 Research Questions
1. Which unsupervised learning algorithms provide the most efficient results on large data
sets?
2. Which data pre-processing techniques provide higher accuracy for unsupervised and
supervised learning on textual documents?
3. How to generate an efficient training set by using machine-learning algorithms on
textual documents?
4. How to construct an updatable model based on unsupervised and supervised learning
algorithms? Which is the best data storage for machine-learning tasks?
5. What are the different Supervised Learning Techniques for textual documents? Which
techniques are the most efficient for our research?
1.4 Methodology
Our research addresses one hypothesis and five research questions. We will analyze whether
machine-learning techniques can automate or semi-automate the literature review process. To
this end, we propose an efficient hybrid model on top of which we will run our experiments.
Finally, we plan a case study to evaluate the model, constructed on a specific data set.
The process of our literature review is an example of our qualitative research. For data
gathering, we have analyzed others' work on text mining, data mining, and big data. As a result,
we have extracted possible research gaps in the field and identified the research problem we
aim to solve.
Experiments and the generated empirical results will be the basis of our quantitative research
methodology. We will also compare the efficiency of various supervised and unsupervised
learning techniques.
A suitable plan of action must be established and carried out to resolve the problem. This
section introduces the chosen research strategy and specific scientific techniques for data
collection. Subsequently, the application of the method and research ethics are described.
In terms of strategy, this research follows the experimental approach ("An experiment is an
empirical investigation under controlled conditions designed to examine the properties of,
and relationships between, specific factors" [7]).
Our focus is on supervised and unsupervised learning algorithms because we see the potential
of combining classification and clustering algorithms into a hybrid solution that yields a highly
accurate model for textual data. Our study focuses on highly accurate data-mining techniques
for textual data. We also analyze the need for highly efficient recommendation systems and
the limitations of algorithms on big datasets.
The goal of this thesis is to evaluate the proposed model for recommending research articles. It
will present the overall accuracy and the accuracy of the individual steps. Multiple supervised
and unsupervised techniques have been taken into consideration to evaluate and identify the
most efficient techniques for this type of system. Our primary hypothesis is that we can ease
the literature review process for researchers and identify trend topics for a given field.
Our solution is based on machine-learning techniques, and it contains 3 (three) phases:
1. Generating a group of relevant articles from raw input of research articles. By applying
unsupervised learning techniques, the aim is to generate “n” number of clusters where
we have similar articles on each cluster. The outputs of this phase are the clusters,
distance between clusters, centroid, outlier clusters, cluster labels, and extracted
keywords for each cluster.
2. Using the clusters generated in Phase 1 to build a model based on supervised
learning techniques, which is used to add more articles in the future and to recommend
research articles. Various machine-learning techniques have been considered in this
phase; the results of the experiments are presented in Chapter 5.
3. Using topic modeling to extract trend topics and keywords for specific fields (clusters).
All outcomes, i.e. the information extracted from all three phases, have been presented in a
graph structure, which is used as the basis of our recommendation system and the extraction of
trend topics.
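As a toy illustration of how extracted information can live in such a graph and back a recommendation lookup, consider an adjacency-list graph linking articles to clusters and clusters to keywords. The node names and schema below are purely hypothetical, not the ones used in this thesis:

```python
# A minimal graph as a Python dict: articles point to their cluster,
# clusters point back to their member articles and extracted keywords.
graph = {
    "cluster:nlp": {"keywords": ["parsing", "embedding"], "articles": ["a1", "a2"]},
    "cluster:cv":  {"keywords": ["segmentation"], "articles": ["a3"]},
    "a1": {"cluster": "cluster:nlp"},
    "a2": {"cluster": "cluster:nlp"},
    "a3": {"cluster": "cluster:cv"},
}

def recommend(article_id, graph):
    """Recommend the other articles that share the query article's cluster."""
    cluster = graph[article_id]["cluster"]
    return [a for a in graph[cluster]["articles"] if a != article_id]

print(recommend("a1", graph))  # ['a2']
```

A production system would store such a graph in a dedicated graph database rather than an in-memory dict, but the traversal pattern (article, cluster, sibling articles) stays the same.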
To evaluate existing techniques and propose the best approach, we have tested three
unsupervised learning techniques for Phase 1, and 15 supervised learning techniques for Phase
2 on two different datasets. For this purpose, we conducted five studies:
1. Evaluation of the accuracy and execution time of unsupervised learning techniques,
which represents Phase 1 of our model.
2. Evaluation of supervised learning techniques, which represents Phase 2 of our model.
3. Evaluation of the process of identifying trend topics for specific fields of topic modeling.
4. Comparison of the cost and energy consumption of our model running in three different
environments: 1) a cloud instance, 2) cloud functions, and 3) distributed Raspberry Pis.
5. Evaluation of the different data sources that can be used to organize the extracted
information and build efficient recommendation systems.
1.5 Thesis Structure
Chapter 2 of the thesis presents the background and state of the art in the field. In this chapter,
we first list the references used for the state of the art. It continues with an overview of
various machine-learning techniques. We present a survey of unsupervised learning
techniques, their different types, and their characteristics. It continues with an overview of
supervised learning techniques and their application to textual data. In addition, the state of
the art of various data representation models for textual data is presented. The last part of
Chapter 2 covers the benefits of Hadoop for Big Data and different data storage models,
starting from traditional ones, continuing with distributed data storage, and finally giving an
overview of cloud data platforms and graph structures.
In Chapter 3, the proposed solution is presented: a hybrid model for recommending
research articles, with all three phases and the data representation described in detail.
Section 3.4 presents an enhanced version of the proposed model that can run on
distributed Raspberry Pis. Section 3.5 shows all the pre-defined pre-processing steps applied
to the textual data in our experiments. In Section 3.6, we list the datasets used in the
different phases of the model's evaluation, and finally, in Section 3.7, the experimental setup
and technical details are presented.
All experiments and results are presented in Chapter 4. It contains the detailed results of all
5 (five) studies conducted in this thesis. Initially, it presents proof-of-concept results for
unsupervised and supervised learning techniques. It then continues with the results of
Phase 1 and Phase 2 of the proposed model. In Section 4.4, the results of the cost and energy
consumption comparisons and experiments are presented. In the last part, the outcomes of
topic modeling and the graph representation of the extracted information are presented.
Chapter 5 presents the discussion of the findings derived from the experiments in Chapter 4.
In this chapter, we also compare our results with other similar studies and present the overall
findings for the different phases, environments, and experimental setups.
Chapter 6, Conclusion, is the last section of this study. It presents an overview of all the work
in this Ph.D. dissertation, along with our plans for future activities.
1.6 Conclusion
In this chapter, we stated the research problem, which emerged as a result of the literature
review presented in Chapter 2. We saw that research articles can be considered
unstructured data represented as textual documents. In addition, new researchers and
Master's and Ph.D. students spend too much time defining their research fields and theses.
There is potential to ease the literature review process by applying machine-learning
techniques to textual documents.
In Section 1.2, the central hypothesis was stated, along with five research questions. The
central hypothesis asks whether the literature review process can be simplified using machine-
learning techniques on textual data. In addition, the research questions target various issues
concerning the efficiency of current machine-learning techniques for textual data.
We also discussed the importance and benefits of solving this problem: it will improve the
quality of research work for young researchers. The methodology for targeting this research
problem relies on experiments and empirical results. As a result, a hybrid model will be
proposed based on machine-learning techniques. The goal is to extract information from
unstructured data (in our case, research articles) and store the extracted information in a
graph structure. Finally, multiple supervised and unsupervised learning techniques will be
evaluated.
2. LITERATURE REVIEW
Data on the Web is increasing rapidly, and more and more data in different formats is being
generated. This leads us to a new concept: big data. Every big data task raises two main
concerns:
• Storing a massive amount of data (big data)
• Processing large datasets
All data comes either from external sources, such as mobile devices, social media, IoT, and
media, or from internal sources, such as transactions, log data, e-mails, etc. In total, less than
0.5% of all data is ever analyzed and used; therefore, analyzing data and applying machine
learning and statistical analysis to it is very important nowadays [57].
In machine-learning tasks, supervised learning methods are essential because they allow us to
make predictions. Supervised learning is also known as "classification." Classification algorithms
consist of two parts. The first is the learning part, where the model is constructed from
training data. The second is the classification part, where the model is used for prediction.
Many classification algorithms have been proposed; the most popular are Random Forest,
Logistic Regression, Decision Tree Induction, Bayesian classifiers, Neural Networks, Support
Vector Machines, etc. Every algorithm has its pros and cons, its application field, and the type
of data it is applied to [4] [8] [18] [19].
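This two-part structure (a learning step that builds the model from training data, then a classification step that uses it for prediction) can be sketched with a minimal multinomial Naive Bayes over word counts. This is a stdlib-only toy with invented training data, not the implementation evaluated in this thesis:

```python
import math
from collections import Counter

class TinyNaiveBayes:
    def fit(self, docs, labels):
        """Learning part: estimate class priors and per-class word counts."""
        self.classes = set(labels)
        self.priors = {c: labels.count(c) / len(labels) for c in self.classes}
        self.word_counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, doc):
        """Classification part: pick the class with the highest log posterior."""
        best, best_score = None, -math.inf
        for c in self.classes:
            total = sum(self.word_counts[c].values())
            score = math.log(self.priors[c])
            for word in doc.lower().split():
                # Laplace smoothing keeps unseen words from zeroing the score.
                score += math.log((self.word_counts[c][word] + 1)
                                  / (total + len(self.vocab)))
            if score > best_score:
                best, best_score = c, score
        return best

clf = TinyNaiveBayes().fit(
    ["image pixel segmentation", "convolution image filter", "parse sentence grammar"],
    ["vision", "vision", "nlp"])
print(clf.predict("segmentation of an image"))  # vision
```

The same fit/predict split appears in every classifier discussed in this chapter; only the statistics estimated in the learning part differ.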
Unsupervised learning methods are also very commonly used in data mining. Their primary
purpose is to discover groups (clusters) of similar data, where elements in the same group are
very similar to each other and differ from the elements of other groups. Cluster analysis has
been widely used in many applications such as business intelligence, image pattern
recognition, web search, biology, and security. It is also used to improve recommendation
systems. In addition, search engines use clustering; clustering mechanisms improve both the
quality and the speed of a search. Outlier detection is another application of clustering: by
finding the elements that do not belong to any cluster (group), we detect the outliers [6] [31]
[52] [53].
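Cluster-based outlier detection can be sketched by flagging points that lie too far from every cluster centroid. This is a stdlib-only toy with hand-picked centroids and points; a real pipeline would first learn the centroids, e.g. with k-means:

```python
import math

def nearest_centroid_distance(point, centroids):
    """Distance from a point to the closest cluster centroid."""
    return min(math.dist(point, c) for c in centroids)

def find_outliers(points, centroids, threshold):
    """A point that belongs to no cluster (too far from every centroid) is an outlier."""
    return [p for p in points if nearest_centroid_distance(p, centroids) > threshold]

centroids = [(0.0, 0.0), (10.0, 10.0)]
points = [(0.5, 0.2), (9.8, 10.1), (5.0, 5.0)]
print(find_outliers(points, centroids, threshold=2.0))  # [(5.0, 5.0)]
```

The threshold is the key tuning knob: too small, and ordinary points near cluster edges are flagged; too large, and genuine outliers slip through.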
The process of the literature review is based on six research questions:
1. Which are the most crucial text-mining techniques for text classification and text
clustering?
This question identifies the state of the art in text mining algorithms and techniques and the
most recent algorithms proposed.
2. What data transformation models are used for textual data?
This question is used to determine the impact of the models on the efficiency of the algorithms
considering accuracy.
3. What are the limitations of high-performance text-mining algorithms on big datasets?
This question provides information on the limitations of traditional approaches on big
datasets.
4. Which are the proposed models for high-performance data-mining algorithms?
This question provides the list of proposed solutions for high-performance data-mining
algorithms on big datasets.
5. What contributions have been proposed for research article analysis?
This question shows whether there are proposed solutions or related work.
6. What is the importance of high-performance text-mining algorithms nowadays?
This question identifies the trends related to our research field.
To answer the six research questions, we manually analyzed research articles from the IEEE
Xplore and ACM digital libraries. In total, we analyzed more than 130 relevant research
articles, published between 1994 and 2019. The distribution of papers by primary research
field is as follows:
Table 1. Research articles by field
Text Mining: 50
Big Data: 23
Data Mining: 45
Parallel/Distributed Programming Models: 15
Figure 2. Paper distribution by year
The classification of the research articles is as follows:

Figure 3. Paper classification: Data Type (Text, Log, Relational DB, Web, Social Media);
Algorithms (Classification, Clustering, Pattern Mining, Information Retrieval, Term Frequency,
TF-IDF, Enhanced TF-IDF, Word2Vec); Paradigm (Traditional, Parallel-Distributed
Programming); Contribution (Model, Method, Review, Survey).

In the first and third research questions of the literature review, we identify the state of the art
in text-mining algorithms and techniques and the most recently proposed algorithms. While
data mining is a broad field, our focus was only on the algorithms and techniques applied to
textual data sets. For this part, we have included two groups of techniques: 1) classification and 2)
clustering.
From the papers, we can see that there are various techniques proposed for text clustering and
text classification. In most of the papers, we see that the focus is on improving the algorithms'
accuracy. The proposed classification algorithms are from two groups: 1) lazy learners and 2)
eager learners.
Different versions of probabilistic algorithms, such as the Naïve Bayes algorithm, are widely
used [64] [130]. Naïve Bayes has proven to be an accurate algorithm when the training set is
large enough. SVM is another proposed solution for textual data classification; various
researchers present it as one of the most accurate algorithms for text classification. Its only
drawback is that its accuracy decreases as the number of classes increases [4] [30] [130].
Hybrid solutions, combining two or more classification algorithms, have also been proposed.
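A hybrid solution of this kind can be as simple as majority voting over the base classifiers' predictions. The sketch below uses toy rule-based functions standing in for trained models; all names and rules are illustrative only:

```python
from collections import Counter

def majority_vote(classifiers, doc):
    """Hybrid classification: each base classifier votes; the majority label wins."""
    votes = [clf(doc) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]

# Toy rule-based "classifiers" standing in for trained models.
clf_a = lambda d: "vision" if "image" in d else "nlp"
clf_b = lambda d: "vision" if "pixel" in d else "nlp"
clf_c = lambda d: "nlp"

print(majority_vote([clf_a, clf_b, clf_c], "image pixel data"))  # vision
```

More elaborate hybrids weight the votes by each base classifier's validation accuracy, or feed the base predictions into a meta-classifier (stacking).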
The two algorithms k-NN and Centroid Classifier have been shown to achieve high
classification accuracy on text documents [19]. When the number of documents is not very
high, the algorithms also perform well in execution time. However, as the amount of input
data in the training set increases, the execution time increases too. As data on the Web grows,
traditional algorithms are no longer sufficient; therefore, we are required to process data
using parallel/distributed programming models [130].
Various techniques have been proposed for information retrieval and feature extraction on
textual data [2] [11] [12]. One of the most fundamental techniques is term analysis of text
documents. The Bag-of-Words representation and term frequency are often used when
experimenting with textual documents; both have shown good accuracy. Term Frequency -
Inverse Document Frequency (TF-IDF) scores the importance of a word (term) in a textual
document by how frequently it appears in that document, discounted by how many documents
in the collection contain it. It has been shown to increase the accuracy of supervised and
unsupervised learning algorithms.
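The TF-IDF score can be computed directly from its definition: the term's frequency within the document, scaled by the logarithm of the inverse document frequency. Below is a stdlib-only sketch of the common log-IDF variant; libraries such as scikit-learn apply additional smoothing:

```python
import math

def tf_idf(term, doc, corpus):
    """Score a term: frequent in this document, rare across the corpus."""
    tf = doc.count(term) / len(doc)                   # term frequency
    df = sum(1 for d in corpus if term in d)          # document frequency
    idf = math.log(len(corpus) / df) if df else 0.0   # inverse document frequency
    return tf * idf

corpus = [["deep", "learning", "model"],
          ["graph", "model", "storage"],
          ["deep", "neural", "network"]]
# "model" appears in 2 of 3 documents, "learning" only in the first,
# so "learning" scores higher for the first document.
print(tf_idf("learning", corpus[0], corpus) > tf_idf("model", corpus[0], corpus))  # True
```

A term that appears in every document gets idf = log(1) = 0, which is exactly why TF-IDF suppresses ubiquitous words that carry little discriminative information.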
Word sequences, graph structures, and word embeddings (e.g., word2vec) are the most recent
models proposed by researchers; they have shown promising results in various applications on
various datasets [2] [65] [66]. These recent models open the possibility of more advanced
machine-learning applications on textual datasets.
Word2vec is a model that produces word embeddings. Such models are shallow, two-layer
neural networks trained to reconstruct the linguistic contexts of words. Word2vec takes a text
corpus as input and produces a vector space, usually of several hundred dimensions, in which
each unique word in the corpus is assigned a corresponding vector [12].
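The training data for word2vec's skip-gram variant consists simply of (target, context) word pairs drawn from a sliding window over the text. Generating these pairs takes only a few lines; the sketch below covers the data preparation step, not the neural-network training itself:

```python
def skipgram_pairs(tokens, window=2):
    """Emit (target, context) pairs for every word within the window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

pairs = skipgram_pairs("the model learns word vectors".split(), window=1)
print(pairs[:3])  # [('the', 'model'), ('model', 'the'), ('model', 'learns')]
```

The network is then trained to predict the context word from the target word; the learned weights of its hidden layer become the word vectors.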
Word2vec was created by Tomas Mikolov and his research team at Google. The word2vec
algorithm has many advantages compared to other algorithms such as Latent Semantic
Analysis (LSA). LSA is a natural language processing technique for analyzing relationships
between textual documents and their terms by producing a set of concepts related to the
documents and terms [11]. Word embedding is the collective name for feature learning and
language modeling techniques in natural language processing (NLP) in which terms from the
documents are transformed into vectors of real numbers.
Our literature review in this field of interest shows increasing interest among researchers in applying machine-learning algorithms to big data sets. The focus is on:
• accuracy of algorithms,
• performance,
• optimization of algorithms, and
• information retrieval and data storage.
The quality of our literature review depends on two key factors: 1) the number of papers used in the process, and 2) the construction of the research questions for the literature review. As a result of this process, we have identified some potential research gaps, which are possible topics for future work in our research field:
1. Hybrid solutions for high accuracy text-clustering and text-classification algorithms.
2. Improving text-similarity algorithms by extracting contextual meaning.
3. Adapting traditional machine-learning algorithms to parallel/distributed programming
models.
4. High-performance text-mining algorithms, analysis, and best practices.
5. Fully automated text classification systems based on machine-learning algorithms.
6. Real-time machine-learning algorithms.
2.1 Document Clustering
Managing and organizing data is essential nowadays. Many companies, groups of people, and individuals strive to organize data to gain better and more comprehensive insight. As more and more operations are digitalized, the amount of data generated has become vast. A good starting point for organizing and managing data is grouping (clustering) it. Clustering is an excellent technique for grouping similar or related objects: "Clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters)" [78]. There is an abundance of research on clustering, and many algorithms have been proposed in publications on the topic [6] [78] [81] [82].
Different algorithms and techniques have been proposed for generating clusters of textual
documents based on their content. The most common clustering techniques are:
• Hierarchical clustering,
• Density-based clustering,
• Partitioning clustering algorithms,
• Graph-based algorithms,
• Grid-based algorithms,
• Model-based clustering algorithms,
• Combinational clustering algorithms.
One way to summarize a large amount of data is to use clustering techniques to group it in a meaningful way, so that objects inside a group, or cluster, are as similar as possible, while objects in different groups differ as much as possible. Two types of clustering algorithms are available: nested and partitioned. A nested clustering algorithm creates overlapping clusters, while a partitioned clustering algorithm creates non-overlapping clusters. For research in which clear differences between groups are required, the second type of clustering algorithm is more appropriate.
2.1.1 K-means Clustering
K-means is a center-based clustering algorithm in which a representative point (object) is chosen for each cluster, and the distance between that representative point and all other points (objects) is computed. Since the user chooses the number k of clusters, the algorithm is called K-means: k representative objects (i.e., centroids or medoids) are selected. Each object is then assigned to the closest centroid and, therefore, to the related cluster. In the next step, each center point is updated according to the objects that have been assigned to its cluster, and the cluster memberships are computed again for the new centroids. This process is repeated until the centroids no longer change and the process converges. In the worst case, the algorithm will converge after at most k·n iterations [72] [85].
K-means clustering has been widely applied to large amounts of data as one of the most efficient clustering algorithms. However, its performance varies greatly with the type of data it is applied to.
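The iterative procedure described above can be sketched in a few lines of pure Python; the 2-D toy points and the naive "first k points" initialization are our own illustrative choices (K-Means++, discussed below, improves the initialization):

```python
def kmeans(points, k, iters=100):
    """Plain K-Means on 2-D points: assign every point to its nearest
    centroid, then move each centroid to the mean of its members."""
    centroids = points[:k]  # naive initialization
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: (p[0] - centroids[c][0]) ** 2
                                      + (p[1] - centroids[c][1]) ** 2)
            clusters[nearest].append(p)
        new_centroids = [
            (sum(p[0] for p in m) / len(m), sum(p[1] for p in m) / len(m))
            if m else centroids[i]  # keep a centroid that lost all members
            for i, m in enumerate(clusters)]
        if new_centroids == centroids:  # converged: no centroid moved
            break
        centroids = new_centroids
    return centroids, clusters

# two well-separated blobs
points = [(0.0, 0.1), (9.0, 9.1), (0.2, 0.0), (0.1, 0.2), (9.2, 9.0), (9.1, 9.2)]
centroids, clusters = kmeans(points, k=2)
# converges to one centroid near (0.1, 0.1) and one near (9.1, 9.1)
```

The assignment and update steps correspond directly to steps 2 and 3 of the algorithm as listed below.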
K-Means algorithm steps:
1. Initialize k centroids for the clusters.
2. Assign each data point to its nearest centroid.
3. Set the position of each centroid to the mean value of all data points which belong to its cluster.
4. Repeat steps 2 and 3 until the assignments no longer change.
Figure 4. K-Means Clustering Process
The time complexity of the K-Means algorithm is O(l·k·m·n), where l is the number of iterations, k is the number of clusters, m is the number of dimensions (attributes), and n is the number of objects.
Even though the K-Means algorithm is widely used in practice, it has some disadvantages:
• It is sensitive to initialization,
• It is sensitive to outliers,
• It can only handle clusters with a symmetrical point distribution, and
• The value of k must be defined at the beginning.
2.1.2 K-Means++ Clustering
The K-means++ algorithm follows almost the same approach as K-means, but with a minor modification in the initialization of the centroids. It is considered an improved version of K-means because it is less susceptible to the initialization problem, i.e., less likely to be influenced by a poor initial choice of centroids [26].
K-means++ chooses the first centroid uniformly at random. For each data object (point), it then computes the distance of that point to the nearest centroid chosen so far. The remaining centers, up to the number k set by the user, are selected using a weighted probability distribution in which points farther from the existing centroids have a higher probability of being chosen as centroids. This modification of K-means arguably solves the initialization problem, which in some cases can profoundly affect the accuracy of the algorithm [72] [85].
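The seeding step can be sketched as follows; this is a pure-Python illustration in which the squared-distance weighting is the standard K-means++ choice, while the function name and toy data are ours:

```python
import random

def kmeanspp_init(points, k, rng=None):
    """K-Means++ seeding: the first centroid is uniform at random; each
    further centroid is sampled with probability proportional to the
    squared distance to its nearest already-chosen centroid."""
    rng = rng or random.Random(0)
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # squared distance from every point to its closest chosen centroid
        d2 = [min((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids)
              for p in points]
        r = rng.random() * sum(d2)
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centroids.append(p)
                break
    return centroids

points = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.1, 10.0)]
seeds = kmeanspp_init(points, k=2)
```

Because distant points carry almost all of the probability mass, the two seeds nearly always land in opposite blobs, avoiding the poor starting configurations that plain random initialization can produce.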
2.1.3 K-Medoids Clustering
K-Medoids algorithm is like K-Means in that they are both partitional and tend to break the
data set into groups. However, in contrast to K-means, with K-medoids, the initialized centers,
also called medoids, are data objects (points) from the dataset. K-medoids is more potent to
outliers compared to K-means and can be used with any distance/similarity measure. However,
K-Medoids is also more expensive, resulting in higher efficiency when applied on smaller
datasets than larger datasets [86].
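A compact alternating K-Medoids sketch follows; this is a simplified variant (the full PAM algorithm also evaluates swaps with non-medoid points), and the Manhattan distance and toy points are our illustrative assumptions:

```python
def kmedoids(points, k, dist, iters=20):
    """Alternating K-Medoids: medoids are actual data points; the update
    step picks, inside each cluster, the member with the smallest total
    distance to its fellow members. Any distance function can be used."""
    medoids = points[:k]  # naive initialization
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda c: dist(p, medoids[c]))].append(p)
        new_medoids = [min(m, key=lambda cand: sum(dist(cand, q) for q in m))
                       if m else medoids[i]
                       for i, m in enumerate(clusters)]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, clusters

manhattan = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])
# two small groups plus one extreme outlier at (50, 50)
points = [(0, 0), (0, 1), (1, 0), (8, 8), (8, 9), (9, 8), (50, 50)]
medoids, clusters = kmedoids(points, 2, manhattan)
# the second medoid stays at the real point (8, 8); a K-Means centroid
# would instead be dragged toward the outlier
```

This illustrates the robustness claim above: the outlier joins a cluster, but because the medoid must be an actual data point, it cannot pull the cluster center away from the dense group.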
Figure 5. K-Means vs. K-Medoids Algorithms
Source: Cross Validated – Stack Exchange
Although K-medoids is more robust to outliers than K-means, it does not always produce satisfactory results when dealing with noisy data and outliers. For this reason, new algorithms have been proposed. One of them is the Distributed K-Means clustering algorithm, another enhanced version of K-Means, which relies on normalized datasets to identify the groups of clusters [86].
Figure 6. K-Medoids Process
1. Discover the maximum and minimum values of every feature in every local dataset and send them to a central place.
2. Compute the global maximum and minimum values at the central place.
3. Normalize the real scalar values of the local datasets using the global maximum and minimum values.
4. Cluster every local dataset through K-means clustering and obtain the centroids along with a cluster index for every dataset.
5. Build a single dataset, named centroids, by merging the cluster centroids of the local datasets.
6. Cluster the centroids dataset with K-means to obtain the overall centroids.
7. After calculating the Euclidean distance between each object and the overall centroids, update the local cluster indices by assigning each object to its nearest cluster centroid.
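Steps 1-3, the globally consistent min-max normalization, can be sketched as follows; the site data is invented for illustration:

```python
def global_minmax_normalize(local_datasets):
    """Each site reports only its per-feature min and max (step 1); the
    central place combines them into global extremes (step 2); every
    local dataset is then rescaled into [0, 1] with the global range (step 3)."""
    n = len(local_datasets[0][0])  # number of features
    local_mins = [[min(p[i] for p in ds) for i in range(n)] for ds in local_datasets]
    local_maxs = [[max(p[i] for p in ds) for i in range(n)] for ds in local_datasets]
    gmin = [min(m[i] for m in local_mins) for i in range(n)]
    gmax = [max(m[i] for m in local_maxs) for i in range(n)]
    return [[tuple((p[i] - gmin[i]) / (gmax[i] - gmin[i]) for i in range(n))
             for p in ds] for ds in local_datasets]

site_a = [(0.0, 10.0), (2.0, 20.0)]
site_b = [(4.0, 30.0), (8.0, 50.0)]
norm_a, norm_b = global_minmax_normalize([site_a, site_b])
# both sites now share one scale: norm_a == [(0.0, 0.0), (0.25, 0.25)]
```

Because every site rescales with the same global extremes, the centroids merged at the central place in steps 5-6 are comparable across sites.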
2.1.4 Hierarchical Clustering
Hierarchical clustering produces a set of nested clusters organized as a hierarchical tree [86]. An example of a hierarchical clustering is shown in the next figure; a dendrogram can be used to visualize the sequence of merges or splits.
Figure 7. Hierarchical Clustering Source: http://faculty.juniata.edu/rhodes/ml/hiercluster.htm
Hierarchical clustering algorithms can be beneficial in specific cases when working with data. In our case, we consider this type of clustering in the final, recommendation phase of our proposed model, where we distinguish between different levels of detail for a given cluster. Therefore, we apply this type of clustering technique after the initial clusters have been generated.
Two types of Hierarchical Clustering Algorithms exist:
• Agglomerative Hierarchical clustering
• Divisive Hierarchical clustering
The Agglomerative Hierarchical clustering algorithm takes a bottom-up approach, while the Divisive algorithm works top-down; both are outlined in Figure 8.
Figure 8. The Agglomerative Hierarchical clustering is a bottom-up approach (left);
the Divisive Hierarchical clustering is a top-down approach (right)
In practice, top-down clustering algorithms tend to be more accurate than the bottom-up approach.
The advantages of Hierarchical Clustering are as follows:
• No assumptions on the number of clusters: any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level.
• Hierarchical clusterings may correspond to meaningful taxonomies.
There are two disadvantages of Hierarchical clustering:
• It is sensitive to noisy data and outliers.
• Its time complexity is O(n²), where n is the number of textual documents; if the dataset is big, the algorithm performs too slowly.
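A single-linkage agglomerative sketch in pure Python, stopping once a desired number of clusters remains; the quadratic pair search is the naive form responsible for the complexity noted above, and the toy points are ours:

```python
def agglomerative(points, n_clusters, dist):
    """Bottom-up hierarchical clustering: start with one cluster per
    point and repeatedly merge the two closest clusters
    (single linkage: distance between nearest members)."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        # find the pair of clusters with minimum single-linkage distance
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters[j]  # merge cluster j into cluster i
        del clusters[j]
    return clusters

euclid2 = lambda a, b: (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
points = [(0, 0), (0, 1), (5, 5), (5, 6), (9, 9)]
result = agglomerative(points, 3, euclid2)
# the two tight pairs merge first; (9, 9) is left as a singleton cluster
```

Running the loop to a single cluster and recording each merge would yield the dendrogram discussed above.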
The two procedures outlined in Figure 8 are:
Agglomerative (bottom-up):
1. Compute the distance matrix between the input data points.
2. Let each data point be a cluster.
3. Repeat:
4. Merge the two closest clusters.
5. Update the distance matrix, until only a single cluster remains.
Divisive (top-down):
1. Start at the top with all patterns in one cluster.
2. The cluster is split using a flat clustering algorithm.
3. This procedure is applied recursively.
2.1.5 Text representation formats
When implementing a Bag of Words model, insignificant words are filtered out from the training data, mainly because they hold no meaning and increase the dimensionality of the data. When text is represented as a "bag of words," a high number of dimensions is generated.
Hence, it is desirable to apply techniques that reduce the dimensionality of the data while retaining as much information as possible. Several adequate techniques exist; for instance, some well-known techniques that retain only the most important terms are [2] [11] [12]:
• TF-IDF
• PCA
• LDA
• SVD
It is worth noting that there is no single best technique for dimensionality reduction; all the above techniques are widely used and should be chosen depending on the setup. Additionally, to support the simplifying power of a Bag of Words representation, it is highly recommended to apply Natural Language Processing methods such as tokenization, relation detection, entity detection, and segmentation.
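A minimal illustration of this preprocessing: tokenize, strip punctuation, drop insignificant (stop) words, and count what remains. The tiny stop-word list here is our own illustrative sample:

```python
def bag_of_words(text, stop_words):
    """Tokenize on whitespace, strip punctuation, drop stop words,
    and count the remaining terms into a Bag of Words."""
    counts = {}
    for token in text.lower().split():
        token = token.strip(".,;:!?\"'()")
        if token and token not in stop_words:
            counts[token] = counts.get(token, 0) + 1
    return counts

stop_words = {"the", "a", "an", "of", "and", "is", "to"}
bow = bag_of_words("The clustering of documents and the classification "
                   "of documents.", stop_words)
# bow == {"clustering": 1, "documents": 2, "classification": 1}
```

Filtering the stop words leaves only three dimensions instead of six, which is exactly the dimensionality saving discussed above.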
On the other hand, Word2Vec, proposed by [12], represents words in a multidimensional space by vectorizing them. Word2vec belongs to a group of related models used to produce word embeddings. These models are shallow, two-layer neural networks trained to model the linguistic context of words. The input to a Word2vec model is a large corpus of text, and the output is a vector space, typically of several hundred dimensions, in which each unique word in the corpus is assigned a corresponding vector.
One of the main advantages of Word2Vec is that it exhibits interesting linguistic regularities between words trained on a large text corpus. Simply put, it turns text into numerical form so that distances between vector representations of words can be measured. Recently, Word2Vec has been used in various natural language processing, clustering, and classification applications [2] [12] [65] [66], and the efficiency and effectiveness of many classification algorithms increasingly rely on its power to extract valuable information from text [65].
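A common way to exploit these vectors is cosine similarity between embeddings; the 3-dimensional vectors below are invented stand-ins for real several-hundred-dimensional Word2Vec embeddings:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two word vectors; embeddings that
    point in similar directions get a score close to 1."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# toy "embeddings"; related words point in similar directions
king = [0.9, 0.8, 0.1]
queen = [0.85, 0.82, 0.15]
apple = [0.1, 0.2, 0.95]
```

Here `cosine_similarity(king, queen)` is far higher than `cosine_similarity(king, apple)`, which is the property clustering and classification algorithms exploit when they operate on embeddings instead of raw term counts.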
Word2vec was created by a team of researchers led by Tomas Mikolov at Google, and the algorithm has since been analyzed and explained by other researchers [2] [3]. Embedding vectors created using the Word2vec algorithm have many advantages compared to earlier algorithms [1] such as Latent Semantic Analysis.
Moreover, two additional relevant techniques in natural language processing are Latent Semantic Analysis (LSA) and LSTM networks. LSA is a technique in natural language processing, in particular distributional semantics, for analyzing relationships between the terms occurring in a group of documents [2]; its purpose is to produce a set of relevant concepts related to the documents and terms. Feature-learning and language-modeling techniques in natural language processing are also known as word-embedding techniques: word embedding maps words or phrases from a vocabulary to vectors of real numbers.
LSTM neural networks are considered state-of-the-art approaches, offering very high accuracy in several Natural Language Processing tasks, such as the Bi-directional LSTM-CRF [18] for Part-of-Speech tagging and Tree-LSTMs for sentiment analysis [19]. Also, simpler variants of LSTMs, referred to as Gated Recurrent Units (GRUs) [16], are used as crucial parts of larger systems such as state-of-the-art Dynamic Memory Networks, which address more complicated tasks such as question answering [17].
2.2 Supervised Learning
In machine learning, the classification task is also known as supervised learning because it uses a set of labeled training data to learn (generate) a model and then classifies new instances based on that model. Each classification task consists of two phases: a training step and a testing step [9]. The training data is used to fit the learning (classification) algorithms, while the testing data is used to measure the accuracy of the resulting classifiers.
Figure 9. Supervised Learning process
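The two phases can be illustrated with a tiny Centroid Classifier (the classifier mentioned earlier in this chapter): the training step averages the labeled vectors of each class, and the testing step measures accuracy on held-out instances. The two-feature document vectors and labels below are invented for illustration:

```python
def train(train_set):
    """Training step: average the labeled vectors of each class into a centroid."""
    sums, counts = {}, {}
    for label, vec in train_set:
        sums.setdefault(label, [0.0] * len(vec))
        sums[label] = [s + x for s, x in zip(sums[label], vec)]
        counts[label] = counts.get(label, 0) + 1
    return {lbl: [s / counts[lbl] for s in total] for lbl, total in sums.items()}

def predict(model, vec):
    """Testing step: classify a new instance by the label of its nearest centroid."""
    return min(model, key=lambda lbl: sum((c - x) ** 2
                                          for c, x in zip(model[lbl], vec)))

# toy vectors: (count of sport terms, count of tech terms) per document
train_set = [("sport", (5, 0)), ("sport", (4, 1)), ("tech", (0, 5)), ("tech", (1, 4))]
test_set = [("sport", (6, 1)), ("tech", (0, 6))]
model = train(train_set)
accuracy = sum(predict(model, vec) == label for label, vec in test_set) / len(test_set)
```

Keeping the test instances out of the training step, as here, is what makes the measured accuracy an honest estimate of how the classifier will behave on unseen documents.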
Classifying text is an important topic nowadays. Some application fields of text mining and text
classification are:
Classify Reviews: This is a typical application for e-commerce Web sites. Almost every product page has a section where users write their reviews, and the company needs to read all the reviews and classify them as positive or negative. In most cases, however, it is impossible to go through all the reviews manually. Therefore, a solution is to train a classifier that automatically distinguishes positive from negative reviews.
Spam Detection: A trending topic in the last decade. Everyone receives e-mails containing spam, so for mail servers it is imperative to create systems that can quickly detect spam e-mails. Using classification techniques, models can be generated that classify mails as spam or not spam.
Author Detection: Every author has his/her own style of writing, and this is another area where text classification is applied. However, author detection is a harder task: in most cases, the writing styles of two authors are very similar, so high accuracy cannot be reached on this topic.
Language Detection: A prevalent task nowadays; Google uses it in its translation engine. This type of problem achieves the highest accuracy because languages differ greatly from one another. Because of that, in most cases this task is not even considered a difficult classification task.
Classify News Articles: This is the topic demonstrated in this thesis. An advantage of working on this topic is that many news articles are available on the Web for experimenting with and testing different algorithms. It is a prevalent task in which researchers are trying to improve the accuracy of classification algorithms. A drawback of this task is that new words appear very often in news articles.